nova/doc/source/user/conductor.rst
Matt Riedemann 1308d644bb Remove resize caveat from conductor docs
This document was written back in the liberty release [1]
and says that conductor is not used for orchestrating the
resize/migrate flow, but given the description of how
conductor is used to orchestrate scheduling and reschedules
during a server create, it is unclear why the doc says that
resize is not used the same way since it is used for rescheduling
when prep_resize fails in a selected dest compute. This removes
the caveat to reflect reality.

[1] Ieb9134302d21a11fe9b9ee876bb7b0dd32b437e1

Change-Id: I932a7ac6870a3f9d26556c23c9074115963b3c27
2019-03-15 08:02:52 -04:00


Conductor as a place for orchestrating tasks

In addition to its roles as a database proxy and object backporter, the conductor service also serves as a centralized place to manage the execution of workflows that involve the scheduler. Rebuilding, resizing/migrating, and building an instance are all managed here. This gives a better separation of responsibilities between what compute nodes should handle and what the scheduler should handle, and it cleans up the path of execution.

Conductor was chosen because querying the scheduler synchronously has to happen after the API has returned a response; otherwise API response times would increase. Changing the scheduler call from asynchronous to synchronous also helped to clean up the code.

To illustrate this, the old process for building an instance was:

  • API receives request to build an instance.
  • API sends an RPC cast to the scheduler to pick a compute.
  • Scheduler sends an RPC cast to the compute to build the instance, which means the scheduler needs to be able to communicate with all computes.
    • If the build succeeds, it stops here.
    • If the build fails, the compute checks whether the maximum number of scheduler retries has been reached. If so, the build stops there.
      • If the build should be rescheduled, the compute sends an RPC cast to the scheduler to pick another compute.
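
The old flow above can be sketched as a small toy model. This is only an illustration, not nova code: the class names, the `MAX_RETRIES` value, and the request dictionary are all hypothetical, and RPC casts are modelled as plain function calls. It is meant to show how the retry decision sits on the compute while host selection sits on the scheduler.

```python
# Toy model of the pre-conductor build flow. All names are hypothetical,
# and RPC casts are modelled as plain (synchronous) function calls.

MAX_RETRIES = 3  # assumed retry limit, for illustration only


class Scheduler:
    def __init__(self, computes):
        # The scheduler must be able to reach every compute directly.
        self.computes = computes

    def select_and_build(self, request):
        # Pick the first host not yet tried and "cast" the build to it.
        for host, compute in self.computes.items():
            if host not in request["tried"]:
                request["tried"].append(host)
                compute.build(request, self)
                return
        request["status"] = "error"  # no untried host is left


class Compute:
    def __init__(self, healthy):
        self.healthy = healthy

    def build(self, request, scheduler):
        if self.healthy:
            request["status"] = "active"
            return
        # In the old flow the retry decision is made on the compute.
        if len(request["tried"]) >= MAX_RETRIES:
            request["status"] = "error"
            return
        scheduler.select_and_build(request)  # "cast" back to the scheduler


def api_build(scheduler):
    request = {"tried": [], "status": "building"}
    scheduler.select_and_build(request)  # the API "casts" to the scheduler
    return request
```

With one failing and one healthy compute, `api_build` ends with the request marked `active` after two attempts. Note that both classes take part in rescheduling, which is the spread of logic the next paragraph describes.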

This was overly complicated and meant that the logic for scheduling and rescheduling was distributed throughout the code. The answer was to change the process to the following:

  • API receives request to build an instance.
  • API sends an RPC cast to the conductor to build an instance (or runs the build locally if conductor is configured to use local_mode).
  • Conductor sends an RPC call to the scheduler to pick a compute and waits for the response. If scheduling fails, the build stops at the conductor.
  • Conductor sends an RPC cast to the compute to build the instance.
    • If the build succeeds it stops here.
    • If the build fails, the compute sends an RPC cast to the conductor to build an instance. This is the same RPC message that was sent by the API.

This new process means the scheduler only deals with scheduling, the compute only deals with building an instance, and the conductor manages the workflow. As a result, the code in the scheduler and the computes is cleaner.
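
The conductor-driven flow can be sketched in the same toy style. Again, this is an assumption-laden illustration rather than nova's implementation: `Conductor`, `Scheduler`, `Compute`, `NoValidHost`, and the request dictionary are hypothetical stand-ins, the RPC call to the scheduler is modelled as a function that returns a value, and the RPC casts are modelled as plain function calls.

```python
# Toy model of the conductor-driven build flow. All names are hypothetical.


class NoValidHost(Exception):
    """Raised by the toy scheduler when no untried host remains."""


class Scheduler:
    def __init__(self, hosts):
        self.hosts = hosts

    def select_destination(self, tried):
        # Models the synchronous RPC call: the caller waits for the answer.
        for host in self.hosts:
            if host not in tried:
                return host
        raise NoValidHost()


class Compute:
    def __init__(self, healthy):
        self.healthy = healthy

    def build(self, request, conductor):
        if self.healthy:
            request["status"] = "active"
        else:
            # On failure the compute sends the same build message
            # to the conductor that the API originally sent.
            conductor.build_instance(request)


class Conductor:
    def __init__(self, scheduler, computes):
        self.scheduler = scheduler
        self.computes = computes

    def build_instance(self, request):
        try:
            host = self.scheduler.select_destination(request["tried"])
        except NoValidHost:
            request["status"] = "error"  # the build stops at the conductor
            return
        request["tried"].append(host)
        self.computes[host].build(request, self)  # "cast" to the compute


def api_build(conductor):
    request = {"tried": [], "status": "building"}
    conductor.build_instance(request)  # the API "casts" to the conductor
    return request
```

In this sketch the scheduler only answers placement questions, the compute only builds and reports failure by re-sending the request, and all scheduling and rescheduling decisions live in `Conductor.build_instance`.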