global request id spec

This proposes a path forward on the global request_id which would be of use to many operators running OpenStack. Change-Id: I65de8261746b25d45e105394f4eeb95b9cb3bd42
2017-05-15 16:40:06 -04:00 · 2017-05-15 16:40:06 -04:00 · 70a11a4921
commit 70a11a4921
parent 4b7dd6e512
1 changed files with 378 additions and 0 deletions
--- a/specs/pike/global-req-id.rst
+++ b/specs/pike/global-req-id.rst
@@ -0,0 +1,378 @@
 ====================
 Global Request IDs
 ====================
 https://blueprints.launchpad.net/oslo?searchtext=global-req-id
 Building a complex resource, such as a server instance, requires not
 only touching a number of Nova processes, but also other services such
 as Neutron, Glance, and possibly Cinder. Each time a request crosses a
 service boundary we currently generate a new request-id, which makes
 tracing those flows a largely manual task.
 Problem description
 ===================
 When a user creates a resource, such as a server, they are given a
 request-id back. This is generated very early in the paste pipeline of
 most services. It is eventually embedded into the ``context``, which
 is then used implicitly for logging all activities related to the
 request. This works well for tracing requests inside of a single
 service as it passes through its workers, but breaks down when an
 operation spans multiple services. A common example of this is a
 server build, which requires Nova to call out multiple times to
 Neutron and Glance (and possibly other services) to create a server on
 the network.
 It is extremely common for clouds to have an ELK (Elasticsearch,
 Logstash, Kibana) infrastructure that is consuming their logs. The
 only way to query these flows is if there is a common identifier
 across all relevant messages. A global request-id immediately makes
 existing deployed tooling better for managing OpenStack.
 Proposed change
 ===============
 The high level solution is as follows (details on specific points
 later):
 - Accept an inbound X-OpenStack-Request-ID header on requests. Require
  that it looks like a uuid to prevent injection issues. Store its
  value as ``global_request_id``.
 - Keep the existing auto-generated ``request_id``.
 - Update oslo.log to also log ``global_request_id`` by default when it
  is in a context logging mode.
 Paste pipelines
 ---------------
 The processing of incoming requests happens piecemeal through the set
 of paste pipelines. These are mostly common between projects, but
 there is enough local variation that it is worth highlighting what
 this looks like for the base IaaS services, which will be the initial
 targets of this spec.
 Neutron [#f1]_
 ~~~~~~~~~~~~~~
 .. code-block:: ini
   [composite:neutronapi_v2_0]
   use = call:neutron.auth:pipeline_factory
   noauth = cors http_proxy_to_wsgi request_id catch_errors extensions neutronapiapp_v2_0
   keystone = cors http_proxy_to_wsgi request_id catch_errors authtoken keystonecontext extensions neutronapiapp_v2_0
   #                                  ^                                 ^
   # request_id generated here -------+                                 |
   # context built here ------------------------------------------------+
 Glance [#f2]_
 ~~~~~~~~~~~~~
 .. code-block:: ini
   # Use this pipeline for keystone auth
   [pipeline:glance-api-keystone]
   pipeline = cors healthcheck http_proxy_to_wsgi versionnegotiation osprofiler authtoken context  rootapp
   #                                                                                      ^
   # request_id & context built here -----------------------------------------------------+
 Cinder [#f3]_
 ~~~~~~~~~~~~~
 .. code-block:: ini
   [composite:openstack_volume_api_v3]
   use = call:cinder.api.middleware.auth:pipeline_factory
   noauth = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler noauth apiv3
   keystone = cors http_proxy_to_wsgi request_id faultwrap sizelimit osprofiler authtoken keystonecontext apiv3
   #                                  ^                                                   ^
   # request_id generated here -------+                                                   |
   # context built here ------------------------------------------------------------------+
 Nova [#f4]_
 ~~~~~~~~~~~
 .. code-block:: ini
   [composite:openstack_compute_api_v21]
   use = call:nova.api.auth:pipeline_factory_v21
   noauth2 = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler noauth2 osapi_compute_app_v21
   keystone = cors http_proxy_to_wsgi compute_req_id faultwrap sizelimit osprofiler authtoken keystonecontext osapi_compute_app_v21
   #                                  ^                                                       ^
   # request_id generated here -------+                                                       |
   # context built here ----------------------------------------------------------------------+
 oslo.middleware.request_id
 --------------------------
 In nearly all services the request_id generation happens very early,
 well before any local logic. The middleware sets an
 X-OpenStack-Request-ID response header, as well as variables in the
 environment that are later consumed by oslo.context.
 We would accept an inbound X-OpenStack-Request-ID, and validate that
 it looked like ``req-$UUID`` before accepting it as the
 ``global_request_id``.
 The returned X-OpenStack-Request-ID would be the existing
 ``request_id``. This is like the parent process getting the child
 process id on a fork() call.
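The header handling described above can be sketched as a small WSGI middleware. This is an illustration only, not the oslo.middleware implementation: the class name, the two environ keys, and the choice to accept lowercase UUIDs only are assumptions made for the sketch.

```python
import re
import uuid

# Assumed shape of an acceptable inbound id: "req-" plus a lowercase UUID.
_REQ_ID_RE = re.compile(
    r"req-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

# Hypothetical environ keys, chosen for this sketch only.
ENV_REQUEST_ID = "sketch.request_id"
ENV_GLOBAL_REQUEST_ID = "sketch.global_request_id"


class RequestIdSketch:
    """Keep a local request id and accept a validated global one."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # A fresh local id is always generated, as services do today.
        local_id = "req-" + str(uuid.uuid4())
        environ[ENV_REQUEST_ID] = local_id

        # The inbound header is only trusted if it looks like req-$UUID.
        inbound = environ.get("HTTP_X_OPENSTACK_REQUEST_ID", "")
        if _REQ_ID_RE.fullmatch(inbound):
            environ[ENV_GLOBAL_REQUEST_ID] = inbound

        def _start_response(status, headers, exc_info=None):
            # The caller gets the local id back, like a parent receiving
            # the child process id from fork().
            headers = list(headers) + [("X-OpenStack-Request-ID", local_id)]
            return start_response(status, headers, exc_info)

        return self.app(environ, _start_response)
```

An invalid or missing inbound header is simply ignored here, so the request proceeds with only its local id; whether a real implementation should reject such requests instead is left open.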
 oslo.context from_environ
 -------------------------
 Fortunately for us most projects now use the oslo.context
 ``from_environ`` constructor. This means that we can add content to
 the context, or adjust the context, without needing to change every
 project. For instance in Glance the context constructor looks like
 [#f5]_:
 .. code-block:: python
   kwargs = {
      'owner_is_tenant': CONF.owner_is_tenant,
      'service_catalog': service_catalog,
      'policy_enforcer': self.policy_enforcer,
      'request_id': request_id,
   }
   ctxt = glance.context.RequestContext.from_environ(req.environ,
                                                     **kwargs)
 As all logging happens *after* the context is built, all required
 parts of the context will be there before logging starts.
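To illustrate why a ``from_environ`` style constructor lets the new field arrive without per-project changes, here is a minimal stand-in. It is not the oslo.context class: ``ContextSketch`` and the ``sketch.*`` environ keys are hypothetical names used only for this example.

```python
class ContextSketch:
    """Stand-in for a request context; not the oslo.context class."""

    def __init__(self, request_id=None, global_request_id=None, **extra):
        self.request_id = request_id
        self.global_request_id = global_request_id
        # Project-specific keyword arguments are kept untouched.
        self.extra = extra

    @classmethod
    def from_environ(cls, environ, **kwargs):
        # Values placed in the environment by middleware act as defaults;
        # anything the project passes explicitly still wins.
        kwargs.setdefault("request_id", environ.get("sketch.request_id"))
        kwargs.setdefault("global_request_id",
                          environ.get("sketch.global_request_id"))
        return cls(**kwargs)


# A project that only passes its own kwargs still picks up the new field.
environ = {"sketch.request_id": "req-local",
           "sketch.global_request_id": "req-global"}
ctxt = ContextSketch.from_environ(environ, owner_is_tenant=True)
```

The project call site is unchanged in this sketch; only the shared constructor learns about the extra environ key.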
 oslo.log
 --------
 oslo.log defaults should include ``global_request_id`` during context
 logging. This is something which can be done late, as users can always
 override their context logging string format.
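As a rough picture of what a context log line carrying both ids could look like, the sketch below uses only the Python standard library ``logging`` module rather than oslo.log. The format string, the ``ContextFilter`` class, and the ``-`` placeholder for a missing global id are assumptions of this example, not oslo.log defaults.

```python
import io
import logging

# Assumed format: global id first, then the local id, then the message.
FORMAT = "%(levelname)s [%(global_request_id)s %(request_id)s] %(message)s"


class ContextFilter(logging.Filter):
    """Attach both request ids to every record passing through a handler."""

    def __init__(self, request_id, global_request_id=None):
        super().__init__()
        self.request_id = request_id
        self.global_request_id = global_request_id

    def filter(self, record):
        record.request_id = self.request_id
        # Requests that did not arrive with a global id log a placeholder.
        record.global_request_id = self.global_request_id or "-"
        return True


def make_logger(stream, request_id, global_request_id=None):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(FORMAT))
    handler.addFilter(ContextFilter(request_id, global_request_id))
    # A standalone Logger instance keeps the sketch free of global state.
    logger = logging.Logger("sketch")
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
make_logger(stream, "req-local", "req-global").info("building server")
```

Keeping the two ids as separate record attributes means a structured back end could emit them as separate fields instead of one combined string.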
 projects and clients
 --------------------
 With the infrastructure above implemented, it will be a small change to
 the python clients to save the ``global_request_id`` when they are
 created and emit it on outbound requests. For instance, when Nova
 calls Neutron, ``context.request_id`` would be stored in the client
 during the ``get_client`` call [#f6]_:
 .. code-block:: python
    def _get_available_networks(self, context, project_id,
                                net_ids=None, neutron=None,
                                auto_allocate=False):
        """Return a network list available for the tenant.
        The list contains networks owned by the tenant and public networks.
        If net_ids specified, it searches networks with requested IDs only.
        """
        if not neutron:
            neutron = get_client(context)
        if net_ids:
            # If user has specified to attach instance only to specific
            # networks then only add these to **search_opts. This search will
            # also include 'shared' networks.
            search_opts = {'id': net_ids}
            nets = neutron.list_networks(**search_opts).get('networks', [])
        else:
            # (1) Retrieve non-public network list owned by the tenant.
            search_opts = {'tenant_id': project_id, 'shared': False}
            if auto_allocate:
                # The auto-allocated-topology extension may create complex
                # network topologies and it does so in a non-transactional
                # fashion. Therefore API users may be exposed to resources that
                # are transient or partially built. A client should use
                # resources that are meant to be ready and this can be done by
                # checking their admin_state_up flag.
                search_opts['admin_state_up'] = True
            nets = neutron.list_networks(**search_opts).get('networks', [])
            # (2) Retrieve public network list.
            search_opts = {'shared': True}
            nets += neutron.list_networks(**search_opts).get('networks', [])
        _ensure_requested_network_ordering(
            lambda x: x['id'],
            nets,
            net_ids)
        return nets
 .. note::
   There are some usage patterns where a client is built and kept for
   long running operations. In these cases we'd want to change the
   model to assume that clients are ephemeral, and should be discarded
   at the end of their flows.
   This will also help tracking non user initiated tasks such as
   periodic jobs that touch other services for information refresh.
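A minimal sketch of the client side, assuming a client object that is built per flow and remembers the id to send. ``ClientSketch`` and ``get_client_sketch`` are hypothetical names, not the Nova or neutronclient API. Forwarding an already received global id in preference to the local one is also an assumption of this sketch; the text above only states that ``context.request_id`` would be stored.

```python
from types import SimpleNamespace


class ClientSketch:
    """Hypothetical service client, created per flow and then discarded."""

    def __init__(self, global_request_id):
        self.global_request_id = global_request_id

    def headers(self):
        # Every outbound request made by this client carries the same id.
        return {"X-OpenStack-Request-ID": self.global_request_id}


def get_client_sketch(context):
    # Assumption: if this service was itself called with a global id,
    # keep forwarding it; otherwise this service's own id starts the chain.
    gid = getattr(context, "global_request_id", None) or context.request_id
    return ClientSketch(gid)


# A user-facing request: the local id becomes the global id downstream.
ctx = SimpleNamespace(request_id="req-1111", global_request_id=None)
client = get_client_sketch(ctx)
```

Because the id is captured when the client is created, a client kept alive across unrelated flows would keep sending a stale id, which is the reason for treating clients as ephemeral.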
 Alternatives
 ------------
 Log in the Caller
 ~~~~~~~~~~~~~~~~~
 There was a previous OpenStack cross project spec to completely handle
 this in the caller - https://review.openstack.org/#/c/156508/. That
 was merged over 2 years ago, but has yet to gain traction.
 It had a number of disadvantages. Client code turns out to be far
 less standardized than the server side, so fixing every client would
 be substantial work.
 It also requires some standard convention for writing these things out
 to logs on the caller side that is consistent between all services.
 It also **does not** allow people to use Elasticsearch to trace their
 logs (which all large sites have running). A custom piece of analysis
 tooling would need to be built.
 Verify trust in callers
 ~~~~~~~~~~~~~~~~~~~~~~~
 A long time ago, in a galaxy far far away, in a summit room I was not
 in, I was told there was a concern about clients flooding this
 field. There has been no documented attack that seems feasible here if
 we strictly validate the inbound data.
 There is a way we could use Service roles to validate trust here, but
 without a compelling case for why that is needed, we should do the
 simpler thing.
 For reference, Glance already accepts a user-provided request-id of 64
 characters or less. This has existed there for a long time, with no
 reports of abuse to date. We could consider dropping the last
 constraint and not doing role validation.
 Swift multipart transaction id
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Swift has a related approach with its transaction id, a multipart id
 that includes a piece generated by the server on the inbound request,
 a timestamp piece, a fixed server piece (for tracking multiple
 clusters), and a user-provided piece. Swift is not currently using
 any of the above oslo infrastructure, and targets syslog as its
 primary logging mechanism.
 While there are interesting bits in this approach, it is a less
 straightforward chunk of work to transition to, given the oslo
 components. Also, oslo.log has many structured log back ends (like
 json stream, fluentd, and systemd journal) where we really would want
 the global and local ids as separate fields so that no heuristic
 parsing is required.
 Impact on Existing APIs
 -----------------------
 oslo.middleware request_id contract will change so that it accepts an
 inbound header, and sets a second env variable. Both are backwards
 compatible.
 oslo.context will accept a new ``global_request_id``. This requires
 plumbing ``global_request_id`` into all calls that take ``request_id``.
 This looks fully backwards compatible.
 oslo.log will need to be adjusted to support logging both
 request_ids. It should probably be enabled to do that by default,
 though log_context string is a user configured variable, so they can
 set whatever site local format works for them. An upgrade release note
 would be appropriate when this happens.
 Security impact
 ---------------
 There previously was a concern about trusting request ids from the
 user. It is an inbound piece of user data, so care should be taken.
 * Ensure it is not allowed to be so big as to create a DoS vector
  (size validation)
 * Ensure that it is not a possible code injection vector (strict
  validation)
 Both items can be handled by strictly validating that the content
 looks like a valid uuid.
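One way to see how a single strict check covers both concerns is the sketch below. The exact pattern (``req-`` plus a UUID in either letter case) and the function name are assumptions for illustration. ``fullmatch`` is used on purpose: a pattern ending in ``$`` with ``match`` would still accept a trailing newline.

```python
import re

# Fixed shape: "req-" plus 36 characters of hex digits and hyphens.
_GLOBAL_ID_RE = re.compile(
    r"req-[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def is_valid_global_request_id(value):
    """Return True only for strings shaped exactly like req-$UUID."""
    # fullmatch bounds the length (always 40 characters) and restricts
    # the alphabet, so oversized and injected content are both rejected.
    return isinstance(value, str) and _GLOBAL_ID_RE.fullmatch(value) is not None


good = "req-65de8261-746b-25d4-5e10-5394f4eeb95b"
hostile = [
    good + "\n",                 # trailing newline, a log injection attempt
    good + " forged log line",   # extra content after a valid id
    "req-" + "a" * 100000,       # oversized value
    "65de8261-746b-25d4-5e10-5394f4eeb95b",  # missing req- prefix
    "",
]
results = [is_valid_global_request_id(v) for v in hostile]
```

Every entry in ``hostile`` is rejected, while ``good`` is accepted.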
 Performance Impact
 ------------------
 Minimal. This adds a few extra instructions to existing code paths.
 No expensive activity is done in this new code.
 Configuration Impact
 --------------------
 The only configuration impact will be on the oslo.log context string.
 Developer Impact
 ----------------
 Developers will now have much easier tracing of build requests in
 their devstack environments!
 Testing Impact
 --------------
 Unit tests provided with various oslo components.
 Implementation
 ==============
 Assignee(s)
 -----------
 Primary assignee:
  sdague
 Other contributors:
  None
 .. note::
   Could definitely use help to get this through the gauntlet, there
   are lots of little patches here to get right.
 Milestones
 ----------
 Target Milestone for completion: TBD
 Work Items
 ----------
 TBD
 Documentation Impact
 ====================
 TBD - but presumably some updates to operators guide on tracing across
 services.
 Dependencies
 ============
 None
 References
 ==========
 .. [#f1] https://github.com/openstack/neutron/blob/5691f29e8fd1212bb22b1a48d32dbbddf7e0587d/etc/api-paste.ini#L6-L9
 .. [#f2] https://github.com/openstack/glance/blob/5caf1c739e190338e87be8bcd880cb88b0920299/etc/glance-api-paste.ini#L13-L15
 .. [#f3] https://github.com/openstack/cinder/blob/81ece6a9f2ac9b4ff3efe304bab847006f8b0aef/etc/cinder/api-paste.ini#L24-L28
 .. [#f4] https://github.com/openstack/nova/blob/c2c6960e374351b3ce1b43a564b57e14b54c4877/etc/nova/api-paste.ini#L29-L32
 .. [#f5]
   https://github.com/openstack/glance/blob/70d51c7c5c09b070588041a65905eba789ae871b/glance/api/middleware/context.py#L179-L187
 .. [#f6] https://github.com/openstack/nova/blob/c2c6960e374351b3ce1b43a564b57e14b54c4877/nova/network/neutronv2/api.py#L317-L354
 .. note::
  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode