global request id spec

This proposes a path forward on the global request_id which would be of use to many operators running OpenStack. Change-Id: I65de8261746b25d45e105394f4eeb95b9cb3bd42
2017-05-15 16:40:06 -04:00 · 2017-05-15 16:40:06 -04:00 · 70a11a4921
commit 70a11a4921
parent 4b7dd6e512
1 changed files with 378 additions and 0 deletions
--- a/specs/pike/global-req-id.rst
+++ b/specs/pike/global-req-id.rst
@ -0,0 +1,378 @@
+====================
+ Global Request IDs
+====================
+
+https://blueprints.launchpad.net/oslo?searchtext=global-req-id
+
+Building a complex resource, like a boot instance, requires not only
+touching a number of Nova processes, but also other services such as
+Neutron, Glance, and possibly Cinder. When we make those service jumps
+we currently generate a new request-id, which makes tracing those
+flows quite manual.
+
+Problem description
+===================
+
+When a user creates a resource, such as a server, they are given a
+request-id back. This is generated very early in the paste pipeline of
+most services. It is eventually embedded into the ``context``, which
+is then used implicitly for logging all activities related to the
+request. This works well for tracing requests inside of a single
+service as it passes through its workers, but breaks down when an
+operation spans multiple services. A common example of this is a
+server build, which requires Nova to call out multiple times to
+Neutron and Glance (and possibly other services) to create a server on
+the network.
+
+It is extremely common for clouds to have an ELK (Elastic Search,
+Logstash, Kibana) infrastructure that is consuming their logs. The
+only way to query these flows is if there is a common identifier
+across all relevant messages. A global request-id immediately makes
+existing deployed tooling better for managing OpenStack.
+
+Proposed change
+===============
+
+The high level solution is as follows (details on specific points
+later):
+
+- accept an inbound X-OpenStack-Request-ID header on requests. Require
+  that it looks like a uuid to prevent injection issues. Set this to
+  the value of ``global_request_id``
+- Keep the auto generated existing request_id
+- update oslo.log to default also log ``global_request_id`` when it is
+  in a context logging mode.
+
+
+Paste pipelines
+---------------
+
+The processing of incoming requests happens piecemeal through the set
+of paste pipelines. These are mostly common between projects, but
+there are enough local variation to highlight what this looks like for
+the base IaaS services, which will be the initial targets of this spec.
+
+Neutron [#f1]_
+~~~~~~~~~~~~~~
+
+.. code-block:: ini
+
+   [composite:neutronapi_v2_0]
+   use = call:neutron.auth:pipeline_factory
+   noauth = cors http_proxy_to_wsgi request_id catch_errors extensions neutronapiapp_v2_0
+   keystone = cors http_proxy_to_wsgi request_id catch_errors authtoken keystonecontext extensions neutronapiapp_v2_0
+   #                                  ^                                 ^
+   # request_id generated here -------+                                 |
+   # context built here ------------------------------------------------+
+
+Glance [#f2]_
+~~~~~~~~~~~~~
+
+.. code-block:: ini
+
+   # Use this pipeline for keystone auth
+   [pipeline:glance-api-keystone]
+   pipeline = cors healthcheck http_proxy_to_wsgi versionnegotiation osprofiler authtoken context  rootapp
+   #                                                                                      ^
+   # request_id & context built here -----------------------------------------------------+
+
+Cinder [#f3]_
+~~~~~~~~~~~~~
+
+.. code-block:: ini
+
+   [composite:openstack_volume_api_v3]
+   use = call:cinder.api.middleware.auth:pipeline_factory
+   noauth = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler noauth apiv3
+   keystone = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler authtoken keystonecontext apiv3
+   #                                  ^                                                   ^
+   # request_id generated here -------+                                                   |
+   # context built here ------------------------------------------------------------------+
+
+Nova [#f4]_
+~~~~~~~~~~~
+
+.. code-block:: ini
+
+   [composite:openstack_compute_api_v21]
+   use = call:nova.api.auth:pipeline_factory_v21
+   noauth2 = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler noauth2 osapi_compute_app_v21
+   keystone = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler authtoken keystonecontext osapi_compute_app_v21
+   #                                  ^                                                       ^
+   # request_id generated here -------+                                                       |
+   # context built here ----------------------------------------------------------------------+
+
+
+oslo.middleware.request_id
+--------------------------
+
+In nearly all services the request_id generation happens very early,
+well before any local logic. The middleware sets an
+X-OpenStack-Request-ID response header, as well as variables in the
+environment that are later consumed by oslo.context.
+
+We would accept an inbound X-OpenStack-Request-ID, and validate that
+it looked like ``req-$UUID`` before accepting it as the
+``global_request_id``.
+
+The returned X-OpenStack-Request-ID would be the existing
+``request_id``. This is like the parent process getting the child
+process id on a fork() call.
+
+oslo.context from_environ
+-------------------------
+
+Fortunately for us most projects now use the oslo.context
+``from_environ`` constructor. This means that we can add content to
+the context, or adjust the context, without needing to change every
+project. For instance in Glance the context constructor looks like
+[#f5]_:
+
+.. code-block:: python
+
+   kwargs = {
+      'owner_is_tenant': CONF.owner_is_tenant,
+      'service_catalog': service_catalog,
+      'policy_enforcer': self.policy_enforcer,
+      'request_id': request_id,
+   }
+
+   ctxt = glance.context.RequestContext.from_environ(req.environ,
+                                                     **kwargs)
+
+As all logging happens *after* the context is built. All required
+parts of the context will be there before logging starts.
+
+oslo.log
+--------
+
+oslo.log defaults should include ``global_request_id`` during context
+logging. This is something which can be done late, as users can always
+override there context logging string format.
+
+projects and clients
+--------------------
+
+With the infrastructure above implemented it will be a small change to
+python clients to save and emit the ``global_request_id`` when
+created. For instance, Nova calling Neutron, during the get_client
+call ``context.request_id`` would be stored in the client. [#f6]_:
+
+.. code-block:: python
+
+
+    def _get_available_networks(self, context, project_id,
+                                net_ids=None, neutron=None,
+                                auto_allocate=False):
+        """Return a network list available for the tenant.
+        The list contains networks owned by the tenant and public networks.
+        If net_ids specified, it searches networks with requested IDs only.
+        """
+        if not neutron:
+            neutron = get_client(context)
+
+        if net_ids:
+            # If user has specified to attach instance only to specific
+            # networks then only add these to **search_opts. This search will
+            # also include 'shared' networks.
+            search_opts = {'id': net_ids}
+            nets = neutron.list_networks(**search_opts).get('networks', [])
+        else:
+            # (1) Retrieve non-public network list owned by the tenant.
+            search_opts = {'tenant_id': project_id, 'shared': False}
+            if auto_allocate:
+                # The auto-allocated-topology extension may create complex
+                # network topologies and it does so in a non-transactional
+                # fashion. Therefore API users may be exposed to resources that
+                # are transient or partially built. A client should use
+                # resources that are meant to be ready and this can be done by
+                # checking their admin_state_up flag.
+                search_opts['admin_state_up'] = True
+            nets = neutron.list_networks(**search_opts).get('networks', [])
+            # (2) Retrieve public network list.
+            search_opts = {'shared': True}
+            nets += neutron.list_networks(**search_opts).get('networks', [])
+
+        _ensure_requested_network_ordering(
+            lambda x: x['id'],
+            nets,
+            net_ids)
+
+        return nets
+
+.. note::
+
+   There are some usage patterns where a client is built and kept for
+   long running operations. In these cases we'd want to change the
+   model to assume that clients are ephemeral, and should be discarded
+   at the end of their flows.
+
+   This will also help tracking non user initiated tasks such as
+   periodic jobs that touch other services for information refresh.
+
+
+Alternatives
+------------
+
+Log in the Caller
+~~~~~~~~~~~~~~~~~
+
+There was a previous OpenStack cross project spec to completely handle
+this in the caller - https://review.openstack.org/#/c/156508/. That
+was merged over 2 years ago, but has yet to gain traction.
+
+It had a number of disadvantages. It turns out the client code is far
+less standardized here, so fixing every client was substantial
+work.
+
+It also requires some standard convention for writing these things out
+to logs on the caller side that is consistent between all services.
+
+It also **does not** allow people to use Elastic Search to trace their
+logs (which all large sites have running). A custom piece of analysis
+tooling would need to be built.
+
+Verify trust in callers
+~~~~~~~~~~~~~~~~~~~~~~~
+
+A long time ago, in a galaxy far far away, in a summit room I was not
+in, I was told there was a concern about clients flooding this
+field. There has been no documented attack that seems feasable here if
+we strictly validate the inbound data.
+
+There is a way we could use Service roles to validate trust here, but
+without a compelling case for why that is needed, we should do the
+simpler thing.
+
+For reference Glance already accepts a user provided request-id of 64
+characters or less. This has existed there for a long time, with no
+reports as to yet for abuse. We could consider dropping the last
+constraint and not doing role validation.
+
+
+Swift multipart transaction id
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Swift has a related approach where their transaction id, which is a
+multipart id that includes a piece generated by the server on inbound
+request, a timestamp piece, a fixed server piece (for tracking
+multiple clusters), and a user provided piece. Swift is not currently
+using any of the above oslo infrastructure, and targets syslog as
+their primary logging mechanism.
+
+While there are interesting bits in this approach, it's a less
+straight forward chunk of work to transition to, given the oslo
+components. Also, oslo.log has many structured log back ends (like
+json stream, fluentd, and systemd journal) where we really would want
+the global and local as separate fields so there is no heuristic
+parsing required.
+
+Impact on Existing APIs
+-----------------------
+
+oslo.middleware request_id contract will change so that it accepts an
+inbound header, and sets a second env variable. Both are backwards
+compatible.
+
+oslo.context will accept a new local_request_id. This requires
+plumbing local_request_id into all calls that take request_id. This
+looks fully backwards compatible.
+
+oslo.log will need to be adjusted to support logging both
+request_ids. It should probably be enabled to do that by default,
+though log_context string is a user configured variable, so they can
+set whatever site local format works for them. An upgrade release note
+would be appropriate when this happens.
+
+Security impact
+---------------
+
+There previously was a concern about trusting request ids from the
+user. It is an inbound piece of user data, so care should be taken.
+
+* Ensure it is not allowed to be so big as to create a DOS vector
+  (size validation)
+* Ensure that it is not a possible code injection vector (strict
+  validation)
+
+These items can be handled with strict validation of the content that
+it looks like a valid uuid.
+
+
+Performance Impact
+------------------
+
+Minimal. This is a few extra lines of instruction in existing through
+paths. No expensive activity is done in this new code.
+
+Configuration Impact
+--------------------
+
+The only configuration impact will be on the oslo.log context string.
+
+Developer Impact
+----------------
+
+Developers will now have much easier tracing of build requests in
+their devstack environments!
+
+Testing Impact
+--------------
+
+Unit tests provided with various oslo components.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  sdague
+
+Other contributors:
+  None
+
+.. note::
+
+   Could definitely use help to get this through the gauntlet, there
+   are lots of little patches here to get right.
+
+Milestones
+----------
+
+Target Milestone for completion: TBD
+
+Work Items
+----------
+
+TBD
+
+Documentation Impact
+====================
+
+TBD - but presumably some updates to operators guide on tracing across
+services.
+
+Dependencies
+============
+
+None
+
+References
+==========
+
+.. [#f1] https://github.com/openstack/neutron/blob/5691f29e8fd1212bb22b1a48d32dbbddf7e0587d/etc/api-paste.ini#L6-L9
+.. [#f2] https://github.com/openstack/glance/blob/5caf1c739e190338e87be8bcd880cb88b0920299/etc/glance-api-paste.ini#L13-L15
+.. [#f3] https://github.com/openstack/cinder/blob/81ece6a9f2ac9b4ff3efe304bab847006f8b0aef/etc/cinder/api-paste.ini#L24-L28
+.. [#f4] https://github.com/openstack/nova/blob/c2c6960e374351b3ce1b43a564b57e14b54c4877/etc/nova/api-paste.ini#L29-L32
+.. [#f5]
+   https://github.com/openstack/glance/blob/70d51c7c5c09b070588041a65905eba789ae871b/glance/api/middleware/context.py#L179-L187
+.. [#f6] https://github.com/openstack/nova/blob/c2c6960e374351b3ce1b43a564b57e14b54c4877/nova/network/neutronv2/api.py#L317-L354
+
+
+.. note::
+
+  This work is licensed under a Creative Commons Attribution 3.0
+  Unported License.
+  http://creativecommons.org/licenses/by/3.0/legalcode