global request id spec
This proposes a path forward on the global request_id which would be of use to many operators running OpenStack. Change-Id: I65de8261746b25d45e105394f4eeb95b9cb3bd42
This commit is contained in:
parent
4b7dd6e512
commit
70a11a4921
378
specs/pike/global-req-id.rst
Normal file
378
specs/pike/global-req-id.rst
Normal file
@ -0,0 +1,378 @@
|
|||||||
|
====================
|
||||||
|
Global Request IDs
|
||||||
|
====================
|
||||||
|
|
||||||
|
https://blueprints.launchpad.net/oslo?searchtext=global-req-id
|
||||||
|
|
||||||
|
Building a complex resource, like a boot instance, requires not only
|
||||||
|
touching a number of Nova processes, but also other services such as
|
||||||
|
Neutron, Glance, and possibly Cinder. When we make those service jumps
|
||||||
|
we currently generate a new request-id, which makes tracing those
|
||||||
|
flows quite manual.
|
||||||
|
|
||||||
|
Problem description
|
||||||
|
===================
|
||||||
|
|
||||||
|
When a user creates a resource, such as a server, they are given a
|
||||||
|
request-id back. This is generated very early in the paste pipeline of
|
||||||
|
most services. It is eventually embedded into the ``context``, which
|
||||||
|
is then used implicitly for logging all activities related to the
|
||||||
|
request. This works well for tracing requests inside of a single
|
||||||
|
service as it passes through its workers, but breaks down when an
|
||||||
|
operation spans multiple services. A common example of this is a
|
||||||
|
server build, which requires Nova to call out multiple times to
|
||||||
|
Neutron and Glance (and possibly other services) to create a server on
|
||||||
|
the network.
|
||||||
|
|
||||||
|
It is extremely common for clouds to have an ELK (Elastic Search,
|
||||||
|
Logstash, Kibana) infrastructure that is consuming their logs. The
|
||||||
|
only way to query these flows is if there is a common identifier
|
||||||
|
across all relevant messages. A global request-id immediately makes
|
||||||
|
existing deployed tooling better for managing OpenStack.
|
||||||
|
|
||||||
|
Proposed change
|
||||||
|
===============
|
||||||
|
|
||||||
|
The high level solution is as follows (details on specific points
|
||||||
|
later):
|
||||||
|
|
||||||
|
- accept an inbound X-OpenStack-Request-ID header on requests. Require
|
||||||
|
that it looks like a uuid to prevent injection issues. Set this to
|
||||||
|
the value of ``global_request_id``
|
||||||
|
- Keep the auto generated existing request_id
|
||||||
|
- update oslo.log to default also log ``global_request_id`` when it is
|
||||||
|
in a context logging mode.
|
||||||
|
|
||||||
|
|
||||||
|
Paste pipelines
|
||||||
|
---------------
|
||||||
|
|
||||||
|
The processing of incoming requests happens piecemeal through the set
|
||||||
|
of paste pipelines. These are mostly common between projects, but
|
||||||
|
there are enough local variation to highlight what this looks like for
|
||||||
|
the base IaaS services, which will be the initial targets of this spec.
|
||||||
|
|
||||||
|
Neutron [#f1]_
|
||||||
|
~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. code-block:: ini
|
||||||
|
|
||||||
|
[composite:neutronapi_v2_0]
|
||||||
|
use = call:neutron.auth:pipeline_factory
|
||||||
|
noauth = cors http_proxy_to_wsgi request_id catch_errors extensions neutronapiapp_v2_0
|
||||||
|
keystone = cors http_proxy_to_wsgi request_id catch_errors authtoken keystonecontext extensions neutronapiapp_v2_0
|
||||||
|
# ^ ^
|
||||||
|
# request_id generated here -------+ |
|
||||||
|
# context built here ------------------------------------------------+
|
||||||
|
|
||||||
|
Glance [#f2]_
|
||||||
|
~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. code-block:: ini
|
||||||
|
|
||||||
|
# Use this pipeline for keystone auth
|
||||||
|
[pipeline:glance-api-keystone]
|
||||||
|
pipeline = cors healthcheck http_proxy_to_wsgi versionnegotiation osprofiler authtoken context rootapp
|
||||||
|
# ^
|
||||||
|
# request_id & context built here -----------------------------------------------------+
|
||||||
|
|
||||||
|
Cinder [#f3]_
|
||||||
|
~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. code-block:: ini
|
||||||
|
|
||||||
|
[composite:openstack_volume_api_v3]
|
||||||
|
use = call:cinder.api.middleware.auth:pipeline_factory
|
||||||
|
noauth = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler noauth apiv3
|
||||||
|
keystone = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler authtoken keystonecontext apiv3
|
||||||
|
# ^ ^
|
||||||
|
# request_id generated here -------+ |
|
||||||
|
# context built here ------------------------------------------------------------------+
|
||||||
|
|
||||||
|
Nova [#f4]_
|
||||||
|
~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. code-block:: ini
|
||||||
|
|
||||||
|
[composite:openstack_compute_api_v21]
|
||||||
|
use = call:nova.api.auth:pipeline_factory_v21
|
||||||
|
noauth2 = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler noauth2 osapi_compute_app_v21
|
||||||
|
keystone = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler authtoken keystonecontext osapi_compute_app_v21
|
||||||
|
# ^ ^
|
||||||
|
# request_id generated here -------+ |
|
||||||
|
# context built here ----------------------------------------------------------------------+
|
||||||
|
|
||||||
|
|
||||||
|
oslo.middleware.request_id
|
||||||
|
--------------------------
|
||||||
|
|
||||||
|
In nearly all services the request_id generation happens very early,
|
||||||
|
well before any local logic. The middleware sets an
|
||||||
|
X-OpenStack-Request-ID response header, as well as variables in the
|
||||||
|
environment that are later consumed by oslo.context.
|
||||||
|
|
||||||
|
We would accept an inbound X-OpenStack-Request-ID, and validate that
|
||||||
|
it looked like ``req-$UUID`` before accepting it as the
|
||||||
|
``global_request_id``.
|
||||||
|
|
||||||
|
The returned X-OpenStack-Request-ID would be the existing
|
||||||
|
``request_id``. This is like the parent process getting the child
|
||||||
|
process id on a fork() call.
|
||||||
|
|
||||||
|
oslo.context from_environ
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
Fortunately for us most projects now use the oslo.context
|
||||||
|
``from_environ`` constructor. This means that we can add content to
|
||||||
|
the context, or adjust the context, without needing to change every
|
||||||
|
project. For instance in Glance the context constructor looks like
|
||||||
|
[#f5]_:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
kwargs = {
|
||||||
|
'owner_is_tenant': CONF.owner_is_tenant,
|
||||||
|
'service_catalog': service_catalog,
|
||||||
|
'policy_enforcer': self.policy_enforcer,
|
||||||
|
'request_id': request_id,
|
||||||
|
}
|
||||||
|
|
||||||
|
ctxt = glance.context.RequestContext.from_environ(req.environ,
|
||||||
|
**kwargs)
|
||||||
|
|
||||||
|
As all logging happens *after* the context is built. All required
|
||||||
|
parts of the context will be there before logging starts.
|
||||||
|
|
||||||
|
oslo.log
|
||||||
|
--------
|
||||||
|
|
||||||
|
oslo.log defaults should include ``global_request_id`` during context
|
||||||
|
logging. This is something which can be done late, as users can always
|
||||||
|
override there context logging string format.
|
||||||
|
|
||||||
|
projects and clients
|
||||||
|
--------------------
|
||||||
|
|
||||||
|
With the infrastructure above implemented it will be a small change to
|
||||||
|
python clients to save and emit the ``global_request_id`` when
|
||||||
|
created. For instance, Nova calling Neutron, during the get_client
|
||||||
|
call ``context.request_id`` would be stored in the client. [#f6]_:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
|
||||||
|
def _get_available_networks(self, context, project_id,
|
||||||
|
net_ids=None, neutron=None,
|
||||||
|
auto_allocate=False):
|
||||||
|
"""Return a network list available for the tenant.
|
||||||
|
The list contains networks owned by the tenant and public networks.
|
||||||
|
If net_ids specified, it searches networks with requested IDs only.
|
||||||
|
"""
|
||||||
|
if not neutron:
|
||||||
|
neutron = get_client(context)
|
||||||
|
|
||||||
|
if net_ids:
|
||||||
|
# If user has specified to attach instance only to specific
|
||||||
|
# networks then only add these to **search_opts. This search will
|
||||||
|
# also include 'shared' networks.
|
||||||
|
search_opts = {'id': net_ids}
|
||||||
|
nets = neutron.list_networks(**search_opts).get('networks', [])
|
||||||
|
else:
|
||||||
|
# (1) Retrieve non-public network list owned by the tenant.
|
||||||
|
search_opts = {'tenant_id': project_id, 'shared': False}
|
||||||
|
if auto_allocate:
|
||||||
|
# The auto-allocated-topology extension may create complex
|
||||||
|
# network topologies and it does so in a non-transactional
|
||||||
|
# fashion. Therefore API users may be exposed to resources that
|
||||||
|
# are transient or partially built. A client should use
|
||||||
|
# resources that are meant to be ready and this can be done by
|
||||||
|
# checking their admin_state_up flag.
|
||||||
|
search_opts['admin_state_up'] = True
|
||||||
|
nets = neutron.list_networks(**search_opts).get('networks', [])
|
||||||
|
# (2) Retrieve public network list.
|
||||||
|
search_opts = {'shared': True}
|
||||||
|
nets += neutron.list_networks(**search_opts).get('networks', [])
|
||||||
|
|
||||||
|
_ensure_requested_network_ordering(
|
||||||
|
lambda x: x['id'],
|
||||||
|
nets,
|
||||||
|
net_ids)
|
||||||
|
|
||||||
|
return nets
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
There are some usage patterns where a client is built and kept for
|
||||||
|
long running operations. In these cases we'd want to change the
|
||||||
|
model to assume that clients are ephemeral, and should be discarded
|
||||||
|
at the end of their flows.
|
||||||
|
|
||||||
|
This will also help tracking non user initiated tasks such as
|
||||||
|
periodic jobs that touch other services for information refresh.
|
||||||
|
|
||||||
|
|
||||||
|
Alternatives
|
||||||
|
------------
|
||||||
|
|
||||||
|
Log in the Caller
|
||||||
|
~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
There was a previous OpenStack cross project spec to completely handle
|
||||||
|
this in the caller - https://review.openstack.org/#/c/156508/. That
|
||||||
|
was merged over 2 years ago, but has yet to gain traction.
|
||||||
|
|
||||||
|
It had a number of disadvantages. It turns out the client code is far
|
||||||
|
less standardized here, so fixing every client was substantial
|
||||||
|
work.
|
||||||
|
|
||||||
|
It also requires some standard convention for writing these things out
|
||||||
|
to logs on the caller side that is consistent between all services.
|
||||||
|
|
||||||
|
It also **does not** allow people to use Elastic Search to trace their
|
||||||
|
logs (which all large sites have running). A custom piece of analysis
|
||||||
|
tooling would need to be built.
|
||||||
|
|
||||||
|
Verify trust in callers
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
A long time ago, in a galaxy far far away, in a summit room I was not
|
||||||
|
in, I was told there was a concern about clients flooding this
|
||||||
|
field. There has been no documented attack that seems feasable here if
|
||||||
|
we strictly validate the inbound data.
|
||||||
|
|
||||||
|
There is a way we could use Service roles to validate trust here, but
|
||||||
|
without a compelling case for why that is needed, we should do the
|
||||||
|
simpler thing.
|
||||||
|
|
||||||
|
For reference Glance already accepts a user provided request-id of 64
|
||||||
|
characters or less. This has existed there for a long time, with no
|
||||||
|
reports as to yet for abuse. We could consider dropping the last
|
||||||
|
constraint and not doing role validation.
|
||||||
|
|
||||||
|
|
||||||
|
Swift multipart transaction id
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Swift has a related approach where their transaction id, which is a
|
||||||
|
multipart id that includes a piece generated by the server on inbound
|
||||||
|
request, a timestamp piece, a fixed server piece (for tracking
|
||||||
|
multiple clusters), and a user provided piece. Swift is not currently
|
||||||
|
using any of the above oslo infrastructure, and targets syslog as
|
||||||
|
their primary logging mechanism.
|
||||||
|
|
||||||
|
While there are interesting bits in this approach, it's a less
|
||||||
|
straight forward chunk of work to transition to, given the oslo
|
||||||
|
components. Also, oslo.log has many structured log back ends (like
|
||||||
|
json stream, fluentd, and systemd journal) where we really would want
|
||||||
|
the global and local as separate fields so there is no heuristic
|
||||||
|
parsing required.
|
||||||
|
|
||||||
|
Impact on Existing APIs
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
oslo.middleware request_id contract will change so that it accepts an
|
||||||
|
inbound header, and sets a second env variable. Both are backwards
|
||||||
|
compatible.
|
||||||
|
|
||||||
|
oslo.context will accept a new local_request_id. This requires
|
||||||
|
plumbing local_request_id into all calls that take request_id. This
|
||||||
|
looks fully backwards compatible.
|
||||||
|
|
||||||
|
oslo.log will need to be adjusted to support logging both
|
||||||
|
request_ids. It should probably be enabled to do that by default,
|
||||||
|
though log_context string is a user configured variable, so they can
|
||||||
|
set whatever site local format works for them. An upgrade release note
|
||||||
|
would be appropriate when this happens.
|
||||||
|
|
||||||
|
Security impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
There previously was a concern about trusting request ids from the
|
||||||
|
user. It is an inbound piece of user data, so care should be taken.
|
||||||
|
|
||||||
|
* Ensure it is not allowed to be so big as to create a DOS vector
|
||||||
|
(size validation)
|
||||||
|
* Ensure that it is not a possible code injection vector (strict
|
||||||
|
validation)
|
||||||
|
|
||||||
|
These items can be handled with strict validation of the content that
|
||||||
|
it looks like a valid uuid.
|
||||||
|
|
||||||
|
|
||||||
|
Performance Impact
|
||||||
|
------------------
|
||||||
|
|
||||||
|
Minimal. This is a few extra lines of instruction in existing through
|
||||||
|
paths. No expensive activity is done in this new code.
|
||||||
|
|
||||||
|
Configuration Impact
|
||||||
|
--------------------
|
||||||
|
|
||||||
|
The only configuration impact will be on the oslo.log context string.
|
||||||
|
|
||||||
|
Developer Impact
|
||||||
|
----------------
|
||||||
|
|
||||||
|
Developers will now have much easier tracing of build requests in
|
||||||
|
their devstack environments!
|
||||||
|
|
||||||
|
Testing Impact
|
||||||
|
--------------
|
||||||
|
|
||||||
|
Unit tests provided with various oslo components.
|
||||||
|
|
||||||
|
Implementation
|
||||||
|
==============
|
||||||
|
|
||||||
|
Assignee(s)
|
||||||
|
-----------
|
||||||
|
|
||||||
|
Primary assignee:
|
||||||
|
sdague
|
||||||
|
|
||||||
|
Other contributors:
|
||||||
|
None
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Could definitely use help to get this through the gauntlet, there
|
||||||
|
are lots of little patches here to get right.
|
||||||
|
|
||||||
|
Milestones
|
||||||
|
----------
|
||||||
|
|
||||||
|
Target Milestone for completion: TBD
|
||||||
|
|
||||||
|
Work Items
|
||||||
|
----------
|
||||||
|
|
||||||
|
TBD
|
||||||
|
|
||||||
|
Documentation Impact
|
||||||
|
====================
|
||||||
|
|
||||||
|
TBD - but presumably some updates to operators guide on tracing across
|
||||||
|
services.
|
||||||
|
|
||||||
|
Dependencies
|
||||||
|
============
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
References
|
||||||
|
==========
|
||||||
|
|
||||||
|
.. [#f1] https://github.com/openstack/neutron/blob/5691f29e8fd1212bb22b1a48d32dbbddf7e0587d/etc/api-paste.ini#L6-L9
|
||||||
|
.. [#f2] https://github.com/openstack/glance/blob/5caf1c739e190338e87be8bcd880cb88b0920299/etc/glance-api-paste.ini#L13-L15
|
||||||
|
.. [#f3] https://github.com/openstack/cinder/blob/81ece6a9f2ac9b4ff3efe304bab847006f8b0aef/etc/cinder/api-paste.ini#L24-L28
|
||||||
|
.. [#f4] https://github.com/openstack/nova/blob/c2c6960e374351b3ce1b43a564b57e14b54c4877/etc/nova/api-paste.ini#L29-L32
|
||||||
|
.. [#f5]
|
||||||
|
https://github.com/openstack/glance/blob/70d51c7c5c09b070588041a65905eba789ae871b/glance/api/middleware/context.py#L179-L187
|
||||||
|
.. [#f6] https://github.com/openstack/nova/blob/c2c6960e374351b3ce1b43a564b57e14b54c4877/nova/network/neutronv2/api.py#L317-L354
|
||||||
|
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
This work is licensed under a Creative Commons Attribution 3.0
|
||||||
|
Unported License.
|
||||||
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
Loading…
x
Reference in New Issue
Block a user