global request id spec

This proposes a path forward on the global request_id which would be of
use to many operators running OpenStack.

Change-Id: I65de8261746b25d45e105394f4eeb95b9cb3bd42

parent 4b7dd6e512, commit 70a11a4921

specs/pike/global-req-id.rst (new file, 378 lines)

====================
 Global Request IDs
====================

https://blueprints.launchpad.net/oslo?searchtext=global-req-id

Building a complex resource, such as booting an instance, requires not
only touching a number of Nova processes, but also other services such
as Neutron, Glance, and possibly Cinder. When we make those service
jumps we currently generate a new request-id, which makes tracing those
flows quite manual.

Problem description
===================

When a user creates a resource, such as a server, they are given a
request-id back. This is generated very early in the paste pipeline of
most services. It is eventually embedded into the ``context``, which
is then used implicitly for logging all activities related to the
request. This works well for tracing a request inside a single
service as it passes through that service's workers, but breaks down
when an operation spans multiple services. A common example of this is
a server build, which requires Nova to call out multiple times to
Neutron and Glance (and possibly other services) to create a server on
the network.

It is extremely common for clouds to have an ELK (Elasticsearch,
Logstash, Kibana) infrastructure that consumes their logs. The
only way to query these flows is if there is a common identifier
across all relevant messages. A global request-id immediately makes
existing deployed tooling better for managing OpenStack.

Proposed change
===============

The high level solution is as follows (details on specific points
later):

- Accept an inbound X-OpenStack-Request-ID header on requests. Require
  that it looks like a UUID to prevent injection issues. Set
  ``global_request_id`` to this value.
- Keep the existing auto-generated ``request_id``.
- Update oslo.log to also log ``global_request_id`` by default when it
  is in a context logging mode.


Paste pipelines
---------------

The processing of incoming requests happens piecemeal through the set
of paste pipelines. These are mostly common between projects, but
there is enough local variation to highlight what this looks like for
the base IaaS services, which will be the initial targets of this spec.

Neutron [#f1]_
~~~~~~~~~~~~~~

.. code-block:: ini

    [composite:neutronapi_v2_0]
    use = call:neutron.auth:pipeline_factory
    noauth = cors http_proxy_to_wsgi request_id catch_errors extensions neutronapiapp_v2_0
    keystone = cors http_proxy_to_wsgi request_id catch_errors authtoken keystonecontext extensions neutronapiapp_v2_0
    #                                  ^                                 ^
    # request_id generated here -------+                                 |
    # context built here ------------------------------------------------+

Glance [#f2]_
~~~~~~~~~~~~~

.. code-block:: ini

    # Use this pipeline for keystone auth
    [pipeline:glance-api-keystone]
    pipeline = cors healthcheck http_proxy_to_wsgi versionnegotiation osprofiler authtoken context rootapp
    #                                                                                      ^
    # request_id & context built here -----------------------------------------------------+

Cinder [#f3]_
~~~~~~~~~~~~~

.. code-block:: ini

    [composite:openstack_volume_api_v3]
    use = call:cinder.api.middleware.auth:pipeline_factory
    noauth = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler noauth apiv3
    keystone = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler authtoken keystonecontext apiv3
    #                                  ^                                                   ^
    # request_id generated here -------+                                                   |
    # context built here ------------------------------------------------------------------+

Nova [#f4]_
~~~~~~~~~~~

.. code-block:: ini

    [composite:openstack_compute_api_v21]
    use = call:nova.api.auth:pipeline_factory_v21
    noauth2 = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler noauth2 osapi_compute_app_v21
    keystone = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler authtoken keystonecontext osapi_compute_app_v21
    #                                  ^                                                       ^
    # request_id generated here -------+                                                       |
    # context built here ----------------------------------------------------------------------+


oslo.middleware.request_id
--------------------------

In nearly all services the request_id generation happens very early,
well before any local logic. The middleware sets an
X-OpenStack-Request-ID response header, as well as variables in the
environment that are later consumed by oslo.context.

We would accept an inbound X-OpenStack-Request-ID, and validate that
it looks like ``req-$UUID`` before accepting it as the
``global_request_id``.

The returned X-OpenStack-Request-ID would be the existing
``request_id``. This is like the parent process getting the child
process id on a fork() call.
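
The behaviour described above can be sketched as a small WSGI
middleware. This is a minimal illustration using only the standard
library, not the oslo.middleware implementation: the class name
``RequestIdSketch`` and the two ``openstack.*`` environ keys are
assumptions made for the example.

```python
import re
import uuid

# Assumed names for illustration; only the header name comes from the spec.
INBOUND_HEADER = "HTTP_X_OPENSTACK_REQUEST_ID"  # WSGI form of the header
RESPONSE_HEADER = "X-OpenStack-Request-ID"
ENV_REQUEST_ID = "openstack.request_id"
ENV_GLOBAL_REQUEST_ID = "openstack.global_request_id"

# "req-" followed by a canonical lowercase UUID; \Z anchors at the true end.
_ID_FORMAT = re.compile(
    r"req-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\Z")


class RequestIdSketch:
    """Always mint a local id; keep a validated inbound id as the global one."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        local_id = "req-" + str(uuid.uuid4())
        environ[ENV_REQUEST_ID] = local_id

        inbound = environ.get(INBOUND_HEADER, "")
        if _ID_FORMAT.match(inbound):
            environ[ENV_GLOBAL_REQUEST_ID] = inbound
        # An inbound value that fails validation is simply ignored.

        def _start_response(status, headers, exc_info=None):
            # The caller gets the *local* id back, like a pid from fork().
            headers = list(headers) + [(RESPONSE_HEADER, local_id)]
            return start_response(status, headers, exc_info)

        return self.app(environ, _start_response)
```

Because the response header still carries the locally generated id,
existing callers see no change, while the validated inbound value
travels separately through the environ.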

oslo.context from_environ
-------------------------

Fortunately for us most projects now use the oslo.context
``from_environ`` constructor. This means that we can add content to
the context, or adjust the context, without needing to change every
project. For instance in Glance the context constructor looks like
[#f5]_:

.. code-block:: python

    kwargs = {
        'owner_is_tenant': CONF.owner_is_tenant,
        'service_catalog': service_catalog,
        'policy_enforcer': self.policy_enforcer,
        'request_id': request_id,
    }

    ctxt = glance.context.RequestContext.from_environ(req.environ,
                                                      **kwargs)

All logging happens *after* the context is built, so all required
parts of the context will be there before logging starts.
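
To make the mechanism concrete, here is a hypothetical, stripped-down
context class showing how a ``from_environ`` constructor can pick up a
new field without any change in the calling project. ``ContextSketch``
is not oslo.context, and the environ keys are assumed for the example.

```python
class ContextSketch:
    """Hypothetical stand-in for a request context; not oslo.context itself."""

    def __init__(self, request_id=None, global_request_id=None, **extra):
        self.request_id = request_id
        self.global_request_id = global_request_id
        self.extra = extra  # project-specific keyword arguments

    @classmethod
    def from_environ(cls, environ, **kwargs):
        # Explicit keyword arguments win; otherwise fall back to the values
        # that the request_id middleware left in the WSGI environ.
        kwargs.setdefault("request_id",
                          environ.get("openstack.request_id"))
        kwargs.setdefault("global_request_id",
                          environ.get("openstack.global_request_id"))
        return cls(**kwargs)


# A caller written before global_request_id existed keeps working unchanged.
environ = {"openstack.request_id": "req-local",
           "openstack.global_request_id": "req-global"}
ctxt = ContextSketch.from_environ(environ, owner_is_tenant=True)
```

The calling code passes exactly the keyword arguments it passed before,
yet the resulting context carries both ids.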

oslo.log
--------

oslo.log defaults should include ``global_request_id`` during context
logging. This is something which can be done late, as users can always
override their context logging string format.
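
As an illustration of what such a default could look like, the
following sketch uses only the standard ``logging`` module. The format
string and the ``make_context_logger`` helper are assumptions for this
example and are not the oslo.log defaults.

```python
import io
import logging

# Illustrative format only; not the oslo.log context format string.
CONTEXT_FORMAT = ("%(levelname)s %(name)s "
                  "[%(global_request_id)s %(request_id)s] %(message)s")


def make_context_logger(name, stream, request_id, global_request_id=None):
    """Return a logger adapter that stamps both ids on every record."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(CONTEXT_FORMAT))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    extra = {"request_id": request_id,
             # "-" keeps the field present when no global id was supplied.
             "global_request_id": global_request_id or "-"}
    return logging.LoggerAdapter(logger, extra)


stream = io.StringIO()
log = make_context_logger("sketch.compute", stream, "req-local", "req-global")
log.info("allocating network")
print(stream.getvalue(), end="")
```

Keeping the two ids as separate fields means a log search on the global
id finds every service's records, while the local id still isolates a
single service's part of the flow.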

projects and clients
--------------------

With the infrastructure above implemented it will be a small change to
python clients to save and emit the ``global_request_id`` when
created. For instance, when Nova calls Neutron,
``context.request_id`` would be stored in the client during the
``get_client`` call [#f6]_:

.. code-block:: python

    def _get_available_networks(self, context, project_id,
                                net_ids=None, neutron=None,
                                auto_allocate=False):
        """Return a network list available for the tenant.
        The list contains networks owned by the tenant and public networks.
        If net_ids specified, it searches networks with requested IDs only.
        """
        if not neutron:
            neutron = get_client(context)

        if net_ids:
            # If user has specified to attach instance only to specific
            # networks then only add these to **search_opts. This search will
            # also include 'shared' networks.
            search_opts = {'id': net_ids}
            nets = neutron.list_networks(**search_opts).get('networks', [])
        else:
            # (1) Retrieve non-public network list owned by the tenant.
            search_opts = {'tenant_id': project_id, 'shared': False}
            if auto_allocate:
                # The auto-allocated-topology extension may create complex
                # network topologies and it does so in a non-transactional
                # fashion. Therefore API users may be exposed to resources that
                # are transient or partially built. A client should use
                # resources that are meant to be ready and this can be done by
                # checking their admin_state_up flag.
                search_opts['admin_state_up'] = True
            nets = neutron.list_networks(**search_opts).get('networks', [])
            # (2) Retrieve public network list.
            search_opts = {'shared': True}
            nets += neutron.list_networks(**search_opts).get('networks', [])

        _ensure_requested_network_ordering(
            lambda x: x['id'],
            nets,
            net_ids)

        return nets

.. note::

    There are some usage patterns where a client is built and kept for
    long running operations. In these cases we'd want to change the
    model to assume that clients are ephemeral, and should be discarded
    at the end of their flows.

    This will also help tracking non user initiated tasks such as
    periodic jobs that touch other services for information refresh.
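
A minimal sketch of that client-side change follows. ``ClientSketch``
and this ``get_client`` are hypothetical and do not reflect any real
python client API. Preferring an already-present ``global_request_id``
over ``context.request_id`` is an assumption added here, so that the id
of the outermost request survives more than one service hop.

```python
from types import SimpleNamespace


class ClientSketch:
    """Hypothetical service client; real python-*client APIs differ."""

    def __init__(self, endpoint, global_request_id=None):
        self.endpoint = endpoint
        self.global_request_id = global_request_id

    def build_headers(self):
        headers = {"Accept": "application/json"}
        if self.global_request_id:
            # Emitted on every outbound call made through this client.
            headers["X-OpenStack-Request-ID"] = self.global_request_id
        return headers


def get_client(context, endpoint="http://network.example/v2.0"):
    # Assumption: reuse an inbound global id if there is one, otherwise
    # this service's own request_id starts the chain.
    global_id = (getattr(context, "global_request_id", None)
                 or context.request_id)
    return ClientSketch(endpoint, global_request_id=global_id)


# Stand-in for a request context built earlier in the pipeline.
context = SimpleNamespace(request_id="req-aaa", global_request_id=None)
headers = get_client(context).build_headers()
```

Since the id is captured when the client is created, a client that
outlives the request would keep sending a stale id, which is why the
note above treats clients as ephemeral.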


Alternatives
------------

Log in the Caller
~~~~~~~~~~~~~~~~~

There was a previous OpenStack cross project spec to completely handle
this in the caller - https://review.openstack.org/#/c/156508/. That
was merged over 2 years ago, but has yet to gain traction.

It had a number of disadvantages. It turns out the client code is far
less standardized here, so fixing every client was substantial
work.

It also requires some standard convention for writing these things out
to logs on the caller side that is consistent between all services.

It also **does not** allow people to use Elasticsearch to trace their
logs (which all large sites have running). A custom piece of analysis
tooling would need to be built.

Verify trust in callers
~~~~~~~~~~~~~~~~~~~~~~~

A long time ago, in a galaxy far far away, in a summit room I was not
in, I was told there was a concern about clients flooding this
field. There has been no documented attack that seems feasible here if
we strictly validate the inbound data.

There is a way we could use Service roles to validate trust here, but
without a compelling case for why that is needed, we should do the
simpler thing.

For reference Glance already accepts a user provided request-id of 64
characters or less. This has existed there for a long time, with no
reports of abuse so far. We could consider dropping the last
constraint and not doing role validation.


Swift multipart transaction id
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Swift has a related approach: its transaction id is a multipart id
that includes a piece generated by the server on inbound request, a
timestamp piece, a fixed server piece (for tracking multiple
clusters), and a user provided piece. Swift is not currently
using any of the above oslo infrastructure, and targets syslog as
its primary logging mechanism.

While there are interesting bits in this approach, it's a less
straightforward chunk of work to transition to, given the oslo
components. Also, oslo.log has many structured log back ends (like
json stream, fluentd, and systemd journal) where we really would want
the global and local ids as separate fields so there is no heuristic
parsing required.

Impact on Existing APIs
-----------------------

The oslo.middleware request_id contract will change so that it accepts
an inbound header, and sets a second env variable. Both are backwards
compatible.

oslo.context will accept a new local_request_id. This requires
plumbing local_request_id into all calls that take request_id. This
looks fully backwards compatible.

oslo.log will need to be adjusted to support logging both
request_ids. It should probably be enabled to do that by default,
though the log_context string is a user configured variable, so users
can set whatever site local format works for them. An upgrade release
note would be appropriate when this happens.

Security impact
---------------

There previously was a concern about trusting request ids from the
user. It is an inbound piece of user data, so care should be taken.

* Ensure it is not allowed to be so big as to create a DoS vector
  (size validation)
* Ensure that it is not a possible code injection vector (strict
  validation)

Both items can be handled by strictly validating that the content
looks like a valid UUID.
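
Both checks fit in a few lines. The sketch below is one possible way
to do it, assuming the ``req-$UUID`` form described earlier;
``is_safe_request_id`` is a hypothetical helper name.

```python
import re

MAX_LENGTH = len("req-") + 36  # "req-" plus a canonical 36-character UUID

_STRICT = re.compile(
    r"req-[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def is_safe_request_id(value):
    """Return True only for a string that is exactly req-<uuid>."""
    if not isinstance(value, str) or len(value) > MAX_LENGTH:
        return False  # cheap size check before any regex work
    # fullmatch() rejects any trailing data, including a trailing newline
    # that a pattern ending in "$" would let through with match().
    return _STRICT.fullmatch(value) is not None
```

The length check bounds the work done on hostile input, and matching
the whole string means nothing such as a newline followed by a forged
log line can ride along after a valid-looking prefix.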


Performance Impact
------------------

Minimal. This adds a few extra instructions to existing code
paths. No expensive activity is done in this new code.

Configuration Impact
--------------------

The only configuration impact will be on the oslo.log context string.

Developer Impact
----------------

Developers will now have much easier tracing of build requests in
their devstack environments!

Testing Impact
--------------

Unit tests provided with various oslo components.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  sdague

Other contributors:
  None

.. note::

    Could definitely use help to get this through the gauntlet, there
    are lots of little patches here to get right.

Milestones
----------

Target Milestone for completion: TBD

Work Items
----------

TBD

Documentation Impact
====================

TBD - but presumably some updates to operators guide on tracing across
services.

Dependencies
============

None

References
==========

.. [#f1] https://github.com/openstack/neutron/blob/5691f29e8fd1212bb22b1a48d32dbbddf7e0587d/etc/api-paste.ini#L6-L9
.. [#f2] https://github.com/openstack/glance/blob/5caf1c739e190338e87be8bcd880cb88b0920299/etc/glance-api-paste.ini#L13-L15
.. [#f3] https://github.com/openstack/cinder/blob/81ece6a9f2ac9b4ff3efe304bab847006f8b0aef/etc/cinder/api-paste.ini#L24-L28
.. [#f4] https://github.com/openstack/nova/blob/c2c6960e374351b3ce1b43a564b57e14b54c4877/etc/nova/api-paste.ini#L29-L32
.. [#f5] https://github.com/openstack/glance/blob/70d51c7c5c09b070588041a65905eba789ae871b/glance/api/middleware/context.py#L179-L187
.. [#f6] https://github.com/openstack/nova/blob/c2c6960e374351b3ce1b43a564b57e14b54c4877/nova/network/neutronv2/api.py#L317-L354


.. note::

    This work is licensed under a Creative Commons Attribution 3.0
    Unported License.
    http://creativecommons.org/licenses/by/3.0/legalcode