Gracefull recovery when attaching volume fails

When trying to attach a volume to an already running instance the nova-api
requests the nova-compute service to create a BlockDeviceMapping. If the
nova-api does not receive a response within `rpc_response_timeout` it will
treat the request as failed and raise an exception.

There are multiple cases where nova-compute actually already processed the
request and just the reply did not reach the nova-api in time (see bug report).
After the failed request the database will contain a BlockDeviceMapping entry
for the volume + instance combination that will never be cleaned up again.
This entry also causes the nova-api to reject all future attachments of this
volume to this instance (as it assumes it is already attached).

To work around this we check if a BlockDeviceMapping has already been created
when we see a messaging timeout. If this is the case we can safely delete it
as the compute node has already finished processing and we will no longer pick
it up.
This allows users to try the request again.

A previous fix was abandoned but without a clear reason ([1]).

[1]: https://review.opendev.org/c/openstack/nova/+/731804

Closes-Bug: 1960401
Change-Id: I17f4d7d2cb129c4ec1479cc4e5d723da75d3a527
This commit is contained in:
Felix Huettner 2022-02-09 12:03:15 +01:00
parent f7fa3bf5fc
commit 9eb116b99c
3 changed files with 56 additions and 4 deletions

View File

@ -4803,10 +4803,24 @@ class API:
This method is separated to make it possible for cells version
to override it.
"""
volume_bdm = self._create_volume_bdm(
context, instance, device, volume, disk_bus=disk_bus,
device_type=device_type, tag=tag,
delete_on_termination=delete_on_termination)
try:
volume_bdm = self._create_volume_bdm(
context, instance, device, volume, disk_bus=disk_bus,
device_type=device_type, tag=tag,
delete_on_termination=delete_on_termination)
except oslo_exceptions.MessagingTimeout:
# The compute node might have already created the attachment but
# we never received the answer. In this case it is safe to delete
# the attachment as nobody will ever pick it up again.
with excutils.save_and_reraise_exception():
try:
objects.BlockDeviceMapping.get_by_volume_and_instance(
context, volume['id'], instance.uuid).destroy()
LOG.debug("Delete BDM after compute did not respond to "
f"attachment request for volume {volume['id']}")
except exception.VolumeBDMNotFound:
LOG.debug("BDM not found, ignoring removal. "
f"Error attaching volume {volume['id']}")
try:
self._check_attach_and_reserve_volume(context, volume, instance,
volume_bdm,

View File

@ -520,6 +520,36 @@ class _ComputeAPIUnitTestMixIn(object):
mock_attach.assert_called_once_with(self.context,
instance, fake_bdm)
@mock.patch.object(compute_rpcapi.ComputeAPI, 'reserve_block_device_name')
@mock.patch.object(
objects.BlockDeviceMapping, 'get_by_volume_and_instance')
@mock.patch.object(objects.BlockDeviceMappingList, 'get_by_volume')
def test_attach_volume_reserve_bdm_timeout(
self, mock_get_by_volume, mock_get_by_volume_and_instance,
mock_reserve):
mock_get_by_volume.side_effect = exception.VolumeBDMNotFound(
volume_id='fake-volume-id')
fake_bdm = mock.MagicMock(spec=objects.BlockDeviceMapping)
mock_get_by_volume_and_instance.return_value = fake_bdm
instance = self._create_instance_obj()
volume = fake_volume.fake_volume(1, 'test-vol', 'test-vol',
None, None, None, None, None)
mock_reserve.side_effect = oslo_exceptions.MessagingTimeout()
mock_volume_api = mock.patch.object(self.compute_api, 'volume_api',
mock.MagicMock(spec=cinder.API))
with mock_volume_api as mock_v_api:
mock_v_api.get.return_value = volume
self.assertRaises(oslo_exceptions.MessagingTimeout,
self.compute_api.attach_volume,
self.context, instance, volume['id'])
mock_get_by_volume_and_instance.assert_called_once_with(
self.context, volume['id'], instance.uuid)
fake_bdm.destroy.assert_called_once_with()
@mock.patch.object(compute_rpcapi.ComputeAPI, 'reserve_block_device_name')
@mock.patch.object(objects.BlockDeviceMappingList, 'get_by_volume')
@mock.patch.object(compute_rpcapi.ComputeAPI, 'attach_volume')

View File

@ -0,0 +1,8 @@
---
fixes:
- |
The `bug 1960401`_ is fixed which can cause invalid `BlockDeviceMappings`
to accumulate in the database. This prevented the respective volumes from
being attached again to the instance.
.. _bug 1960401: https://bugs.launchpad.net/nova/+bug/1960401