Disable the heal instance info cache periodic task

The _heal_instance_info_cache periodic task predates the introduction of
the server external events API, which is now the canonical way to refresh
the cache. This change updates the default value of
``[compute]heal_instance_info_cache_interval`` to -1, disabling the task
by default.

The nova-ovs-hybrid-plug job is extended to test the legacy configuration
value, and the config override is removed from nova-next.

Closes-Bug: #1996094
Related-Bug: #2089225
Change-Id: I33ac91bb4f3ead51af2f7005002d5eb5078540d9
2025-01-16 18:02:57 +00:00
commit b3f8815720
parent 26d174b65d
3 changed files with 50 additions and 5 deletions
--- a/.zuul.yaml
+++ b/.zuul.yaml
@@ -210,6 +210,9 @@
          $NEUTRON_CONF:
            nova:
              live_migration_events: True
+          $NOVA_CPU_CONF:
+              compute:
+                heal_instance_info_cache_interval: 60
    group-vars:
      subnode:
        devstack_localrc:
@@ -250,6 +253,9 @@
            $NEUTRON_CONF:
              nova:
                live_migration_events: True
+            $NOVA_CPU_CONF:
+              compute:
+                heal_instance_info_cache_interval: 60
    post-run: playbooks/nova-live-migration/post-run.yaml
 - job:
@@ -422,10 +428,6 @@
              # reduce the number of placement calls in steady state. Added in
              # Stein.
              resource_provider_association_refresh: 0
-              # Neutron networking backends today are expected to work without
-              # the periodic healing of the cache in Nova. Turn it off to gain
-              # additional performance.
-              heal_instance_info_cache_interval: -1
            workarounds:
              # This wa is an improvement on hard reboot that cannot be turned
              # on unconditionally. But we know that ml2/ovs sends plug time
--- a/nova/conf/compute.py
+++ b/nova/conf/compute.py
@@ -1085,7 +1085,7 @@ Related options:
  to be synchronized manually.
 """),
    cfg.IntOpt('heal_instance_info_cache_interval',
-        default=60,
+        default=-1,
        help="""
 Interval between instance network information cache updates.
--- a/releasenotes/notes/disable_heal_instance_info_cache_interval-0d9ae7c12793bf7b.yaml
+++ b/releasenotes/notes/disable_heal_instance_info_cache_interval-0d9ae7c12793bf7b.yaml
@@ -0,0 +1,43 @@
+---
+upgrade:
+  - |
+    ``[compute]heal_instance_info_cache_interval`` now defaults to -1.
+    In the early days of Nova, all networking was internal. Later ``quantum``,
+    now known as ``neutron``, was introduced.
+    While the networking subsystem was being externalized and neutron was
+    still optional, Nova needed to keep track of the ports associated with
+    an instance.
+    To avoid expensive calls to an optional service, the
+    instance info cache was extended to include network information and a
+    periodic task was introduced to update it in
+    ``08fa534a0d28fa1be48aef927584161becb936c7`` as part of the
+    ``Essex`` release.
+    As we have learned over the years, per-compute periodic tasks that call
+    other services do not scale well as the number of compute nodes increases.
+    In ``ce936ea5f3ae0b4d3b816a7fe42d5f0100b20fca`` the os-server-external-events
+    API was introduced as part of the Icehouse release. The server external
+    events API allows external systems such as Neutron to trigger cache
+    refreshes on demand. With the introduction of this API, neutron was
+    modified to send network-changed events on a per-port basis as API actions
+    are performed on neutron ports. When that was introduced, the default value
+    of ``[compute]heal_instance_info_cache_interval`` was not changed
+    to ensure there was no upgrade impact.
+    In ``ba44c155ce1dcefede9741722a0525820d6da2b8``, as part of bug #1751923,
+    the _heal_instance_info_cache periodic task was modified to pass
+    "force_refresh", forcing Nova to look up the current state of all ports for
+    the instance from neutron and fully rebuild the info_cache. This has the
+    side effect of making the already poor scaling of this optional periodic
+    task even worse.
+    In this release, the default behaviour of Nova has been changed to
+    disable the periodic task, optimizing for performance, scale, power
+    consumption and typical deployment topologies, where the instance network
+    information is updated by neutron via the external event API as ports are
+    modified.
+    This should significantly reduce the background neutron API load in
+    medium to large clouds. If you have a neutron backend that does not
+    reliably send network-changed event notifications to Nova, you can
+    re-enable this periodic task by setting
+    ``[compute]heal_instance_info_cache_interval`` to a value greater than 0.
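
The re-enable guidance in the release note can be illustrated with a small sketch. It is not Nova code: Nova reads this option through oslo.config, while the sketch below uses only the Python standard library's configparser so that it runs on its own. The helper names (heal_interval, periodic_enabled) and the LEGACY_CONF sample are hypothetical, and only the "greater than 0" rule stated in the release note is modelled.

```python
import configparser

# Default introduced by this change; the previous default was 60.
NEW_DEFAULT = -1

# Hypothetical nova.conf fragment that restores the legacy behaviour,
# mirroring the value the nova-ovs-hybrid-plug job now sets.
LEGACY_CONF = """
[compute]
heal_instance_info_cache_interval = 60
"""


def heal_interval(nova_conf_text: str) -> int:
    """Return the configured interval, or the new default if it is unset."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(nova_conf_text)
    return parser.getint(
        "compute", "heal_instance_info_cache_interval", fallback=NEW_DEFAULT
    )


def periodic_enabled(interval: int) -> bool:
    """Assumption: follow the release note, where a value > 0 re-enables the task."""
    return interval > 0


if __name__ == "__main__":
    for label, text in (("unset", ""), ("legacy", LEGACY_CONF)):
        interval = heal_interval(text)
        print(label, interval, periodic_enabled(interval))
```

With no override the helper falls back to -1 and reports the task as disabled. A deployment whose neutron backend does not reliably send network-changed events would carry an explicit positive interval, as in LEGACY_CONF.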