Matt Riedemann b3d86f4033 Update query for volume delete hang bug 1373513
Now that we have better logging in tempest via
commit d2b9651cb82910b2b86b0f94af1721f717e74024,
we can update the query to include the volume
delete timeout message from the tempest logs.

This adds another 27 hits in the last 24 hours
to the failure count, which should get our
gate categorization percentage up from the measly
58% we have now.

As for the actual signatures, we're basically
treating bug 1373513 as the home for all of the
current cinder blocking lvm/lvs/vgs issues right
now. When we see the volume delete timeout message
in tempest we don't always see the lvm message
in syslog, so it could be problems with vgs taking
too long also (we know that's an intermittent
problem), but we suspect the root issues are the
same in Cinder and need to be investigated together.

Change-Id: Ia100f7c5174a9629a90df7b13030b88195baac12
Related-Bug: #1373513

elastic-recheck

"Use ElasticSearch to classify OpenStack gate failures"

  • Open Source Software: Apache license

Idea

Identifying the specific bug that is causing a transient error in the gate is difficult. Just identifying which tempest test failed is not enough because a single tempest test can fail due to any number of underlying bugs. If we can find a fingerprint for a specific bug using logs, then we can use ElasticSearch to automatically detect any occurrences of the bug.

Using these fingerprints elastic-recheck can:

  • Search ElasticSearch for all occurrences of a bug.
  • Identify bug trends, such as when the bug started, whether it is fixed, and whether it is getting worse.
  • Classify bug failures in real time and report back to gerrit if we find a match, so a patch author knows why the test failed.
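The fingerprint idea can be illustrated with a small in-memory sketch. This is not how elastic-recheck itself works (it sends queries to ElasticSearch); the function names, field names, and bug numbers below are hypothetical and only show what "a fingerprint matches a log event" means.

```python
def matches(fingerprint, event):
    """True if every field named in the fingerprint appears in the log event."""
    for field, needle in fingerprint.items():
        if needle not in event.get(field, ""):
            return False
    return True


def classify(fingerprints, events):
    """Return the sorted bug numbers whose fingerprint matches any event."""
    return sorted(
        bug
        for bug, fingerprint in fingerprints.items()
        if any(matches(fingerprint, event) for event in events)
    )


# Hypothetical fingerprints keyed by made-up bug numbers.
FINGERPRINTS = {
    1000001: {"message": "delete timed out", "tags": "console"},
    1000002: {"message": "connection refused", "tags": "screen-n-net.txt"},
}

# Hypothetical log events from one failed job.
EVENTS = [
    {"message": "volume delete timed out after 196s", "tags": "console"},
    {"message": "all tests finished", "tags": "console"},
]

matched_bugs = classify(FINGERPRINTS, EVENTS)
```

A single failed job can match zero, one, or several fingerprints, which is why a fingerprint that is too loose is a problem: it attributes failures to the wrong bug.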

queries/

All queries are stored in separate yaml files in a queries directory at the top of the elastic-recheck code base. Each file is named ######.yaml, where ###### is the launchpad bug number. The yaml should have a query keyword whose value is the query text for ElasticSearch.
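A minimal sketch of that naming convention, assuming bug numbers are plain decimal digits and the query fits on one line. The helper names are hypothetical and are not part of elastic-recheck.

```python
import re
from pathlib import PurePosixPath

# Assumption: a query file is named <digits>.yaml, e.g. 1373513.yaml.
QUERY_FILE_RE = re.compile(r"^(\d+)\.yaml$")


def bug_number_from_path(path):
    """Return the bug number encoded in a query file name, or None."""
    match = QUERY_FILE_RE.match(PurePosixPath(path).name)
    return int(match.group(1)) if match else None


def query_file_text(query):
    """Render a one-key file body: the 'query' keyword and its text.

    Assumes a single-line query. YAML single-quoted scalars escape a
    single quote by doubling it, so double quotes pass through as-is.
    """
    return "query: '" + query.replace("'", "''") + "'\n"
```

For example, bug_number_from_path("queries/1373513.yaml") gives 1373513, while a file such as queries/README.rst gives None and can be skipped.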

Guidelines for good queries:

  • Queries should get as close as possible to fingerprinting the root cause. A screen log query (e.g. tags:"screen-n-net.txt") is typically better than a console one (tags:"console"), as that's matching a deep failure versus a surface symptom.

  • Queries should not return any hits for successful jobs; hits on successful jobs are a sign the query isn't specific enough. As a rule of thumb, if more than 10% of the hits come from successful jobs, the query probably isn't good enough.

  • If it's impossible to build a query to target a bug, consider patching the upstream program to be explicit when it fails in a particular way.

  • Use the 'tags' field rather than the 'filename' field for filtering. This is primarily because of grenade jobs where the same log file shows up in the 'old' and 'new' side of the grenade job. For example, tags:"screen-n-cpu.txt" will query in logs/old/screen-n-cpu.txt and logs/new/screen-n-cpu.txt. The tags:"console" filter is also used to query in console.html as well as tempest and devstack logs.

  • Avoid the use of wildcards in queries since they can put an undue burden on the query engine. A common case where wildcards are used and shouldn't be is querying against a specific set of build_name fields, e.g. gate-nova-python26 and gate-nova-python27. Rather than use build_name:gate-nova-python*, list the jobs with an OR. For example:

    (build_name:"gate-nova-python26" OR build_name:"gate-nova-python27")
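Two of the guidelines above lend themselves to a short sketch: spelling out job names with OR instead of a wildcard, and the 10% rule of thumb for hits on successful jobs. The function names and the way hit counts are passed in are assumptions for illustration only.

```python
def build_name_filter(job_names):
    """Join explicit job names with OR rather than using a wildcard."""
    clauses = ['build_name:"%s"' % name for name in job_names]
    return "(" + " OR ".join(clauses) + ")"


def too_many_success_hits(success_hits, total_hits, threshold=0.10):
    """Apply the rule of thumb: more than 10% of hits on successful jobs."""
    if total_hits == 0:
        # No hits at all, so there is nothing to judge.
        return False
    return success_hits / total_hits > threshold
```

Calling build_name_filter(["gate-nova-python26", "gate-nova-python27"]) reproduces the filter shown above. A query with 2 success hits out of 10 total would be flagged as too loose, while 1 out of 10 sits exactly at the threshold and would not.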

To support adding queries quickly, it's considered socially acceptable to approve changes that only add 1 new bug query, and even for core reviewers to self-approve those changes.

Note that old queries which are no longer hitting in logstash and are associated with fixed or incomplete bugs are routinely deleted. This is to keep the load on the ElasticSearch engine as low as possible when checking a job failure. If a bug marked as Incomplete does show up again, the bug should be re-opened with a link to the failure and the e-r query should be restored.

Adding Bug Signatures

Most transient bugs seen in gate are not bugs in tempest associated with a specific tempest test failure, but rather some sort of issue further down the stack that can cause many tempest tests to fail.

  1. Given a transient bug that is seen during the gate, go through the logs and try to find a log that is associated with the failure. The closer to the root cause the better.

    Note that queries can only be written against INFO level and higher log messages. This is by design to not overwhelm the search cluster.

  2. Go to logstash.openstack.org and create an ElasticSearch query to find the log message from step 1. To see the possible fields to search on, click on an entry. Lucene query syntax is available at lucene.apache.org.

  3. Tag your commit with a Related-Bug tag in the footer, or add a comment to the bug with the query you identified and a link to the logstash URL for that query search.

    Putting the logstash query link in the bug report is also valuable in the case of rare failures that fall outside the window of how far back log results are stored. In such cases the bug might be marked as Incomplete and the e-r query could be removed, only for the failure to re-surface later. If a link to the query is in the bug report someone can easily track when it started showing up again.

  4. Add the query to elastic-recheck/queries/BUGNUMBER.yaml (All queries can be found on git.openstack.org) and push the patch up for review.
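The steps above can be sketched as two small helpers: one that turns a log message and a tag into a Lucene-style query string, and one that builds the Related-Bug footer for the commit message. The message field name, the example message text, and the helper names are assumptions; only the tags field and the footer format appear in this document.

```python
def lucene_phrase(text):
    """Quote text as a Lucene phrase, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def fingerprint_query(message, tag):
    """Combine a log message and a tag filter into one query string.

    Assumption: the log text is searchable under a field named 'message'.
    """
    return "message:%s AND tags:%s" % (lucene_phrase(message), lucene_phrase(tag))


def related_bug_footer(bug_number):
    """Build the commit message footer that links the change to the bug."""
    return "Related-Bug: #%d" % bug_number


# Hypothetical message text; a real query uses the log line found in step 1.
example_query = fingerprint_query("hypothetical failure text", "screen-n-cpu.txt")
example_footer = related_bug_footer(1373513)
```

The resulting query string is what would go under the query keyword in queries/BUGNUMBER.yaml, after checking in logstash that it hits the failures and not successful jobs.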

You can also help classify Unclassified failed jobs, which is an aggregation of all failed gate jobs that don't currently have elastic-recheck fingerprints.

Future Work

  • Move config files into a separate directory
  • Make unit tests robust
  • Add debug mode flag
  • Expand gating testing
  • Clean up and document code better
  • Add ability to check if any resolved bugs return
  • Move away from polling ElasticSearch to discover if it's ready or not
  • Add nightly job to propose a patch to remove bug queries that return no hits (the bug must not have been seen in 2 weeks and must be closed)