From a338a4da11258d2864db791de32a0138852ff5b5 Mon Sep 17 00:00:00 2001
From: Nikola Dipanov <ndipanov@redhat.com>
Date: Wed, 17 Jun 2015 14:02:39 +0100
Subject: [PATCH] Add documentation for block device mapping

This commit adds some (long overdue) documentation around block device
mapping and how it's used in Nova.

blueprint devref-refresh-liberty

Change-Id: Idca142f3b34ad896ab99f02a3f9eb72a6a3b4778
---
 doc/source/block_device_mapping.rst | 204 ++++++++++++++++++++++++++++
 doc/source/index.rst                |   1 +
 2 files changed, 205 insertions(+)
 create mode 100644 doc/source/block_device_mapping.rst

diff --git a/doc/source/block_device_mapping.rst b/doc/source/block_device_mapping.rst
new file mode 100644
index 000000000000..4b849f8944c1
--- /dev/null
+++ b/doc/source/block_device_mapping.rst
@@ -0,0 +1,204 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+Block Device Mapping in Nova
+============================
+
+Nova has a concept of block devices that can be exposed to cloud instances.
+There are several types of block devices an instance can have (we will go into
+more details about this later in this document), and which ones are available
+depends on a particular deployment and the usage limitations set for tenants
+and users. Block device mapping is a way to organize and keep data about all of
+the block devices an instance has.
+
+When we talk about block device mapping, we usually refer to one of two things:
+
+1. API/CLI structure and syntax for specifying block devices for an instance
+   boot request
+
+2. The data structure internal to Nova that is used for recording and keeping
+   this data, which is ultimately persisted in the block_device_mapping table.
+   However, Nova internally has several slightly different formats for
+   representing the same data. All of them are documented in the code and/or
+   represented by a distinct set of classes, but not knowing that they exist
+   might trip up people reading the code. So in addition to BlockDeviceMapping
+   [1]_ objects that mirror the database schema, we have:
+
+   2.1 The API format - this is the set of raw key-value pairs received from
+   the API client, and is almost immediately transformed into the object;
+   however, some validations are done using this format. We will refer to this
+   format as the 'API BDMs' from now on.
+
+   2.2 The virt driver format - this is the format defined by the classes in
+   :mod:`nova.virt.block_device`. This format is used and expected by the code
+   in the various virt drivers. These classes, in addition to exposing a
+   different format (mimicking the Python dict interface), also provide a place
+   to bundle some functionality common to certain types of block devices (for
+   example, attaching volumes, which has to interact with both Cinder and the
+   virt driver code). We will refer to this format as 'Driver BDMs' from now
+   on.
+
+
+Data format and its history
+----------------------------
+
+In the early days of Nova, the general structure of block device mapping
+closely mirrored that of the EC2 API. During the Havana release of Nova, the
+block device handling code, and in turn the block device mapping structure,
+was reworked to make it more general and useful. These improvements included
+exposing additional details and features in the API. To facilitate this, a new
+extension called `BlockDeviceMappingV2Boot` [2]_ was added to the v2 API. It
+added an additional `block_device_mapping_v2` field to the instance boot API
+request.
+
+Block device mapping v1 (aka legacy)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This was the original format that supported only Cinder volumes (similar to how
+EC2 block devices support only EBS volumes). Every entry was keyed by device
+name (we will discuss why this was problematic in its own section later on
+this page), and would accept only:
+
+* UUID of the Cinder volume or snapshot
+* Type field - used only to distinguish between volumes and Cinder volume
+  snapshots
+* Optional size field
+* Optional `delete_on_termination` flag
+
+While all of Nova's internal code only uses and stores the new data structure,
+we still need to handle API requests that use the legacy format. This is
+handled by the Nova API service on every request. As we will see later, since
+block device mapping information can also be stored in the image metadata in
+Glance, this is another place where we need to handle the v1 format. The code
+to handle legacy conversions is part of the :mod:`nova.block_device` module.
+
+Intermezzo - problem with device names
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Using device names as the primary per-instance identifier, and exposing them in
+the API, is problematic for Nova mostly because several hypervisors Nova
+supports with its drivers can't guarantee that the device names the guest OS
+assigns are the ones the user requested from Nova. Exposing such a detail
+in the public API of Nova is obviously not ideal, but it needed to stay for
+backwards compatibility. It is also required for some (slightly obscure)
+features around overloading a block device in a Glance image when booting an
+instance [3]_.
+
+The plan for fixing this was to allow users to not specify the device name of a
+block device and let Nova determine it (with the help of the virt driver), so
+that it can still be discovered through the API and used when necessary, like
+for the features mentioned above (and preferably only then).
+
+Another use for specifying the device name was to allow the "boot from volume"
+functionality, by specifying a device name that matches the root device name
+for the instance (usually `/dev/vda`).
+
+Currently (mid-Liberty) users are discouraged from specifying device names
+for all calls requiring or allowing block device mapping, except when trying to
+override the image block device mapping on instance boot, and it will likely
+remain like that in the future. The libvirt driver will outright override any
+device names passed with its own values.
+
+Block device mapping v2
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The new format was introduced in an attempt to solve issues with the original
+block device mapping format discussed above, and also to allow for more
+flexibility and addition of features that were not possible with the simple
+format we had.
+
+The new block device mapping is a list of dictionaries containing the following
+fields (in addition to the ones that were already there):
+
+* source_type - this can have one of the following values:
+
+    * `image`
+    * `volume`
+    * `snapshot`
+    * `blank`
+
+* destination_type - this can have one of the following values:
+
+    * `local`
+    * `volume`
+
+The combination of the above two fields defines what kind of block device the
+entry is referring to. We currently support the following combinations:
+
+    * `image` -> `local` - this is currently reserved only for the entry
+      referring to the Glance image that the instance is being booted with (it
+      should also be marked as a boot device). It is also worth noting that an
+      API request that specifies this also has to provide the same Glance UUID
+      as the `image_ref` parameter to the boot request (this is done for
+      backwards compatibility and may be changed in the future). This
+      functionality might be extended to specify additional Glance images
+      to be attached to an instance after boot (similar to kernel/ramdisk
+      images) but this functionality is not supported by any of the current
+      drivers.
+    * `volume` -> `volume` - this is just a Cinder volume to be attached to the
+      instance. It can be marked as a boot device.
+    * `snapshot` -> `volume` - this works exactly as passing `type=snap` does.
+      It would create a volume from a Cinder volume snapshot and attach that
+      volume to the instance. Can be marked bootable.
+    * `image` -> `volume` - As one would imagine, this would download a Glance
+      image to a Cinder volume and attach it to an instance. Can also be marked
+      as bootable. This is really only a shortcut for creating a volume out of
+      an image before booting an instance with the newly created volume.
+    * `blank` -> `volume` - Creates a blank Cinder volume and attaches it. This
+      will also require the volume size to be set.
+    * `blank` -> `local` - Depending on the guest_format field (see below),
+      this will either mean an ephemeral blank disk on hypervisor local
+      storage, or a swap disk (instances can have only one of those).
+
+* guest_format - Tells Nova how/if to format the device prior to attaching;
+  should only be used with blank local images. Denotes a swap disk if the value
+  is `swap`.
+
+* device_name - See the previous section for a more in-depth explanation of
+  this - currently best left empty (not specified, that is), unless the user
+  wants to override the existing device specified in the image metadata.
+  In the case of libvirt, even when passed in with the purpose of overriding
+  the existing image metadata, the final set of device names for the instance
+  may still get changed by the driver.
+
+* disk_bus and device_type - low level details that some hypervisors (currently
+  only libvirt) may support. Some example disk_bus values can be: `ide`, `usb`,
+  `virtio`, `scsi`, while device_type may be `disk`, `cdrom`, `floppy`, `lun`.
+  This is not an exhaustive list as it depends on the virtualization driver,
+  and may change as more support is added. Leaving these empty is the most
+  common thing to do.
+
+* boot_index - Defines the order in which a hypervisor will try devices when
+  attempting to boot the guest from storage. Each device which is capable of
+  being used as a boot device should be given a unique boot index, starting
+  from 0 in ascending order. Some hypervisors may not support booting from
+  multiple devices, so they will only consider the device with boot index 0.
+  Some hypervisors will support booting from multiple devices, but only if
+  they are of different types, e.g. a disk and a CD-ROM. Setting a negative
+  value or None indicates that the device should not be used for booting. The
+  simplest usage is to set it to 0 for the boot device and leave it as None for
+  any other devices.
+
+
+Nova will not allow mixing of the two formats in a single request, and will do
+basic validation to make sure that the requested block device mapping is valid
+before accepting a boot request.
+
+.. [1] In addition to the BlockDeviceMapping Nova object, we also have the
+   BlockDeviceDict class in the :mod:`nova.block_device` module. This class
+   handles transforming and validating the API BDM format.
+.. [2] This work predates API microversions and thus the only way to add it was
+   by means of an API extension.
+.. [3] This is a feature that the EC2 API offers as well and has been in Nova
+   for a long time, although it has been broken in several releases. More info
+   can be found on `this bug <https://launchpad.net/bugs/1370250>`_.
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 4a0e6fe4dcdf..2ac9c8777704 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -141,6 +141,7 @@ Open Development.
    filter_scheduler
    rpc
    hooks
+   block_device_mapping
    addmethod.openstackapi
 
 Architecture Evolution Plans