Deprecate CONF.workarounds.enable_numa_live_migration
Once a deployment has been fully upgraded to Train, the
CONF.workarounds.enable_numa_live_migration config option is no longer
necessary. This patch changes the conductor check to apply only if the
cell's minimum service version is old; the check is per-cell because
cross-cell live migration isn't supported.

Implements blueprint numa-aware-live-migration
Change-Id: If649218db86a04db744990ec0139b4f0b1e79ad6
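In essence, the conductor pre-check becomes version-gated. A condensed,
hypothetical sketch of the logic (names and dependencies simplified; the real
check lives on LiveMigrationTask and uses nova.objects and oslo.config, as
shown in the diff below):

def check_numa_live_migration(context, instance, conf, min_version):
    """Raise if a legacy (pre-Train) cell can't safely live migrate."""
    if not instance.numa_topology:
        return
    # Service version 40 is the first with NUMA-aware live migration; if
    # every nova-compute in the cell is at or above it, neither the legacy
    # check nor the workaround option applies.
    if min_version(context, 'nova-compute') >= 40:
        return
    if conf.workarounds.enable_numa_live_migration:
        print('warning: live migration will not be NUMA-aware')
    else:
        raise RuntimeError('refusing non-NUMA-aware live migration')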
parent b335d0c157
commit 083bafc353
@@ -1,10 +1,12 @@
 .. important::
 
-   Unless :oslo.config:option:`specifically enabled
-   <workarounds.enable_numa_live_migration>`, live migration is not currently
-   possible for instances with a NUMA topology when using the libvirt driver.
-   A NUMA topology may be specified explicitly or can be added implicitly due
-   to the use of CPU pinning or huge pages. Refer to `bug #1289064`__ for more
-   information.
+   In deployments older than Train, or in mixed Stein/Train deployments with a
+   rolling upgrade in progress, unless :oslo.config:option:`specifically
+   enabled <workarounds.enable_numa_live_migration>`, live migration is not
+   possible for instances with a NUMA topology when using the libvirt
+   driver. A NUMA topology may be specified explicitly or can be added
+   implicitly due to the use of CPU pinning or huge pages. Refer to `bug
+   #1289064`__ for more information. As of Train, live migration of instances
+   with a NUMA topology when using the libvirt driver is fully supported.
 
    __ https://bugs.launchpad.net/nova/+bug/1289064
@@ -175,7 +175,9 @@ class LiveMigrationTask(base.TaskBase):
                 method='live migrate')
 
     def _check_instance_has_no_numa(self):
-        """Prevent live migrations of instances with NUMA topologies."""
+        """Prevent live migrations of instances with NUMA topologies.
+        TODO(artom) Remove this check in compute RPC 6.0.
+        """
         if not self.instance.numa_topology:
             return
 
@@ -189,17 +191,32 @@ class LiveMigrationTask(base.TaskBase):
         if hypervisor_type.lower() != obj_fields.HVType.QEMU:
             return
 
-        msg = ('Instance has an associated NUMA topology. '
-               'Instance NUMA topologies, including related attributes '
-               'such as CPU pinning, huge page and emulator thread '
-               'pinning information, are not currently recalculated on '
-               'live migration. See bug #1289064 for more information.'
-               )
+        # We're fully upgraded to a version that supports NUMA live
+        # migration, carry on.
+        if objects.Service.get_minimum_version(
+                self.context, 'nova-compute') >= 40:
+            return
 
         if CONF.workarounds.enable_numa_live_migration:
-            LOG.warning(msg, instance=self.instance)
+            LOG.warning(
+                'Instance has an associated NUMA topology, cell contains '
+                'compute nodes older than train, but the '
+                'enable_numa_live_migration workaround is enabled. Live '
+                'migration will not be NUMA-aware. The instance NUMA '
+                'topology, including related attributes such as CPU pinning, '
+                'huge page and emulator thread pinning information, will not '
+                'be recalculated. See bug #1289064 for more information.',
+                instance=self.instance)
         else:
-            raise exception.MigrationPreCheckError(reason=msg)
+            raise exception.MigrationPreCheckError(
+                reason='Instance has an associated NUMA topology, cell '
+                       'contains compute nodes older than train, and the '
+                       'enable_numa_live_migration workaround is disabled. '
+                       'Refusing to perform the live migration, as the '
+                       'instance NUMA topology, including related attributes '
+                       'such as CPU pinning, huge page and emulator thread '
+                       'pinning information, cannot be recalculated. See '
+                       'bug #1289064 for more information.')
 
     def _check_can_migrate_pci(self, src_host, dest_host):
         """Checks that an instance can migrate with PCI requests.
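To summarize the behavior above for an instance with a NUMA topology on a
QEMU host:

min nova-compute version | enable_numa_live_migration | outcome
-------------------------+----------------------------+--------------------------
>= 40 (cell fully Train) | ignored                    | proceed, NUMA-aware
< 40 (mixed or old cell) | True                       | proceed with a warning, not NUMA-aware
< 40 (mixed or old cell) | False                      | MigrationPreCheckError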
@@ -157,14 +157,25 @@ Related options:
     cfg.BoolOpt(
         'enable_numa_live_migration',
         default=False,
+        deprecated_for_removal=True,
+        deprecated_since='20.0.0',
+        deprecated_reason="""This option was added to mitigate known issues
+when live migrating instances with a NUMA topology with the libvirt driver.
+Those issues are resolved in Train. Clouds using the libvirt driver and fully
+upgraded to Train support NUMA-aware live migration. This option will be
+removed in a future release.
+""",
         help="""
 Enable live migration of instances with NUMA topologies.
 
-Live migration of instances with NUMA topologies is disabled by default
-when using the libvirt driver. This includes live migration of instances with
-CPU pinning or hugepages. CPU pinning and huge page information for such
-instances is not currently re-calculated, as noted in `bug #1289064`_. This
-means that if instances were already present on the destination host, the
+Live migration of instances with NUMA topologies when using the libvirt driver
+is only supported in deployments that have been fully upgraded to Train. In
+previous versions, or in mixed Stein/Train deployments with a rolling upgrade
+in progress, live migration of instances with NUMA topologies is disabled by
+default when using the libvirt driver. This includes live migration of
+instances with CPU pinning or hugepages. CPU pinning and huge page information
+for such instances is not currently re-calculated, as noted in `bug #1289064`_.
+This means that if instances were already present on the destination host, the
 migrated instance could be placed on the same dedicated cores as these
 instances or use hugepages allocated for another instance. Alternately, if the
 host platforms were not homogeneous, the instance could be assigned to
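For context, oslo.config honors a deprecated-for-removal option's value but
emits a deprecation warning at runtime when the option is set in a
configuration file. A minimal standalone sketch (assuming plain oslo.config
usage, outside nova's actual option-registration machinery):

from oslo_config import cfg

CONF = cfg.ConfigOpts()
CONF.register_opts([
    cfg.BoolOpt('enable_numa_live_migration',
                default=False,
                deprecated_for_removal=True,
                deprecated_since='20.0.0',
                deprecated_reason='Issues resolved in Train; see bug #1289064.'),
], group='workarounds')

CONF([])  # parse an (empty) command line
print(CONF.workarounds.enable_numa_live_migration)  # False by default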
@@ -187,7 +187,7 @@ class LiveMigrationTaskTestCase(test.NoDBTestCase):
         self.flags(enable_numa_live_migration=False, group='workarounds')
         self.task.instance.numa_topology = None
         mock_get.return_value = objects.ComputeNode(
-            uuid=uuids.cn1, hypervisor_type='kvm')
+            uuid=uuids.cn1, hypervisor_type='qemu')
         self.task._check_instance_has_no_numa()
 
     @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
@@ -201,25 +201,47 @@ class LiveMigrationTaskTestCase(test.NoDBTestCase):
         self.task._check_instance_has_no_numa()
 
     @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
-    def test_check_instance_has_no_numa_passes_workaround(self, mock_get):
+    @mock.patch.object(objects.Service, 'get_minimum_version',
+                       return_value=39)
+    def test_check_instance_has_no_numa_passes_workaround(
+            self, mock_get_min_ver, mock_get):
         self.flags(enable_numa_live_migration=True, group='workarounds')
         self.task.instance.numa_topology = objects.InstanceNUMATopology(
             cells=[objects.InstanceNUMACell(id=0, cpuset=set([0]),
                                             memory=1024)])
         mock_get.return_value = objects.ComputeNode(
-            uuid=uuids.cn1, hypervisor_type='kvm')
+            uuid=uuids.cn1, hypervisor_type='qemu')
         self.task._check_instance_has_no_numa()
+        mock_get_min_ver.assert_called_once_with(self.context, 'nova-compute')
 
     @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
-    def test_check_instance_has_no_numa_fails(self, mock_get):
+    @mock.patch.object(objects.Service, 'get_minimum_version',
+                       return_value=39)
+    def test_check_instance_has_no_numa_fails(self, mock_get_min_ver,
+                                              mock_get):
         self.flags(enable_numa_live_migration=False, group='workarounds')
         mock_get.return_value = objects.ComputeNode(
-            uuid=uuids.cn1, hypervisor_type='QEMU')
+            uuid=uuids.cn1, hypervisor_type='qemu')
         self.task.instance.numa_topology = objects.InstanceNUMATopology(
             cells=[objects.InstanceNUMACell(id=0, cpuset=set([0]),
                                             memory=1024)])
         self.assertRaises(exception.MigrationPreCheckError,
                           self.task._check_instance_has_no_numa)
+        mock_get_min_ver.assert_called_once_with(self.context, 'nova-compute')
+
+    @mock.patch.object(objects.ComputeNode, 'get_by_host_and_nodename')
+    @mock.patch.object(objects.Service, 'get_minimum_version',
+                       return_value=40)
+    def test_check_instance_has_no_numa_new_svc_passes(self, mock_get_min_ver,
+                                                       mock_get):
+        self.flags(enable_numa_live_migration=False, group='workarounds')
+        mock_get.return_value = objects.ComputeNode(
+            uuid=uuids.cn1, hypervisor_type='qemu')
+        self.task.instance.numa_topology = objects.InstanceNUMATopology(
+            cells=[objects.InstanceNUMACell(id=0, cpuset=set([0]),
+                                            memory=1024)])
+        self.task._check_instance_has_no_numa()
+        mock_get_min_ver.assert_called_once_with(self.context, 'nova-compute')
 
     @mock.patch.object(objects.Service, 'get_by_compute_host')
     @mock.patch.object(servicegroup.API, 'service_is_up')
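The mock argument order in these tests follows the standard unittest.mock
rule: stacked mock.patch decorators are applied bottom-up, so the decorator
nearest the function supplies the first mock argument. A hypothetical,
minimal demonstration (stub classes stand in for nova's objects):

from unittest import mock


class Service:
    @staticmethod
    def get_minimum_version(ctxt, binary):
        raise NotImplementedError


class ComputeNode:
    @staticmethod
    def get_by_host_and_nodename(ctxt, host, node):
        raise NotImplementedError


@mock.patch.object(ComputeNode, 'get_by_host_and_nodename')
@mock.patch.object(Service, 'get_minimum_version', return_value=40)
def run_check(mock_get_min_ver, mock_get):
    # mock_get_min_ver comes first because its decorator is closest to
    # the function definition.
    assert Service.get_minimum_version(None, 'nova-compute') == 40
    mock_get_min_ver.assert_called_once_with(None, 'nova-compute')


run_check()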
@ -0,0 +1,46 @@
|
|||||||
|
---
|
||||||
|
features:
|
||||||
|
- |
|
||||||
|
With the libvirt driver, live migration now works correctly for instances
|
||||||
|
that have a NUMA topology. Previously, the instance was naively moved to
|
||||||
|
the destination host, without updating any of the underlying NUMA guest to
|
||||||
|
host mappings or the resource usage. With the new NUMA-aware live migration
|
||||||
|
feature, if the instance cannot fit on the destination the live migration
|
||||||
|
will be attempted on an alternate destination if the request is
|
||||||
|
setup to have alternates. If the instance can fit on the destination, the
|
||||||
|
NUMA guest to host mappings will be re-calculated to reflect its new
|
||||||
|
host, and its resource usage updated.
|
||||||
|
upgrade:
|
||||||
|
- |
|
||||||
|
For the libvirt driver, the NUMA-aware live migration feature requires the
|
||||||
|
conductor, source compute, and destination compute to be upgraded to Train.
|
||||||
|
It also requires the conductor and source compute to be able to send RPC
|
||||||
|
5.3 - that is, their ``[upgrade_levels]/compute`` configuration option must
|
||||||
|
not be set to less than 5.3 or a release older than "train".
|
||||||
|
|
||||||
|
In other words, NUMA-aware live migration with the libvirt driver is not
|
||||||
|
supported until:
|
||||||
|
|
||||||
|
* All compute and conductor services are upgraded to Train code.
|
||||||
|
* The ``[upgrade_levels]/compute`` RPC API pin is removed (or set to
|
||||||
|
"auto") and services are restarted.
|
||||||
|
|
||||||
|
If any of these requirements are not met, live migration of instances with
|
||||||
|
a NUMA topology with the libvirt driver will revert to the legacy naive
|
||||||
|
behavior, in which the instance was simply moved over without updating its
|
||||||
|
NUMA guest to host mappings or its resource usage.
|
||||||
|
|
||||||
|
.. note:: The legacy naive behavior is dependent on the value of the
|
||||||
|
``[workarounds]/enable_numa_live_migration`` option. Refer to the
|
||||||
|
Deprecations sections for more details.
|
||||||
|
deprecations:
|
||||||
|
- |
|
||||||
|
With the introduction of the NUMA-aware live migration feature for the
|
||||||
|
libvirt driver, ``[workarounds]/enable_numa_live_migration`` is
|
||||||
|
deprecated. Once a cell has been fully upgraded to Train, its value is
|
||||||
|
ignored.
|
||||||
|
|
||||||
|
.. note:: Even in a cell fully upgraded to Train, RPC pinning via
|
||||||
|
``[upgrade_levels]/compute`` can cause live migration of
|
||||||
|
instances with a NUMA topology to revert to the legacy naive
|
||||||
|
behavior. For more details refer to the Upgrade section.
|
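For illustration, a pin like the following in nova.conf (a hypothetical
operator configuration, not part of this change) would keep live migration
on the legacy behavior even after the code upgrade:

[upgrade_levels]
# Pinning compute RPC below 5.3 (for example during a rolling upgrade)
# prevents NUMA-aware live migration; remove the pin or set it to "auto"
# once all services run Train.
compute = 5.2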