Sean Dague b4591df9e9 have realtime engine only search recent indexes

Elastic Recheck is really 2 things: real time searching and bulk
offline categorization. While the bulk categorization needs to look
over the entire dataset, the real time portion is deadline oriented,
so it only cares about the last hour's worth of data. As such we
don't need to search *all* the indexes in ES, only the most recent
one (and possibly the one before that if we are near rotation).

Implement this via a recent= parameter for our search feature. If set
to true then we specify the most recent logstash index. If it turns
out that we're within an hour of rotation, also search the one before
that.
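
The index selection described above can be sketched roughly as follows. This is a minimal illustration, not the code from this change: it assumes daily indexes named like logstash-YYYY.MM.DD that rotate at midnight UTC, and it reads "within an hour of rotation" as "the current index holds less than an hour of data". The function name recent_indexes is hypothetical.

```python
import datetime

# Assumption: one index per day, named "logstash-YYYY.MM.DD", rotating at
# midnight UTC. The naming scheme and rotation time are not stated here.
INDEX_FORMAT = "logstash-%Y.%m.%d"


def recent_indexes(now=None):
    """Return the index names a recent=True search would target (sketch)."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    current = now.strftime(INDEX_FORMAT)
    # If one hour ago falls in a different index, the current index does not
    # yet cover the last hour of data, so include the previous one as well.
    an_hour_ago = (now - datetime.timedelta(hours=1)).strftime(INDEX_FORMAT)
    indexes = [current]
    if an_hour_ago != current:
        indexes.append(an_hour_ago)
    return indexes
```

Comparing the formatted names rather than doing date arithmetic keeps the check independent of the rotation period: a different INDEX_FORMAT (for example an hourly one) would work without further changes.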

Adjust all the queries the bot uses to be recent=True. This will
hopefully reduce the load generated by the bot on the ES cluster.

Change-Id: I0dfc295dd9b381acb67f192174edd6fdde06f24c

2014-06-12 17:53:26 -04:00

doc/source

Document adding bug signatures to e-r.

2013-12-12 12:34:54 -08:00

elastic_recheck

have realtime engine only search recent indexes

2014-06-12 17:53:26 -04:00

queries

Merge "Add query for ironic bug 1316773"

2014-05-06 21:32:28 +00:00

web

Adds time-view filter to uncategorized page

2014-03-01 16:58:39 -08:00

.coveragerc

Apply Cookiecutter to the repo.

2013-09-23 15:27:39 -07:00

.gitignore

Apply Cookiecutter to the repo.

2013-09-23 15:27:39 -07:00

.gitreview

Apply Cookiecutter to the repo.

2013-09-23 15:27:39 -07:00

.testr.conf

Apply Cookiecutter to the repo.

2013-09-23 15:27:39 -07:00

babel.cfg

Apply Cookiecutter to the repo.

2013-09-23 15:27:39 -07:00

CONTRIBUTING.rst

Apply Cookiecutter to the repo.

2013-09-23 15:27:39 -07:00

elasticRecheck.conf.sample

move queries.yaml into a queries subdir

2013-12-02 11:43:00 -05:00

LICENSE

Apply Cookiecutter to the repo.

2013-09-23 15:27:39 -07:00

MANIFEST.in

Apply Cookiecutter to the repo.

2013-09-23 15:27:39 -07:00

README.rst

Remove inaccurate docs about wildcards

2014-03-05 16:08:33 -08:00

recheckwatchbot.yaml

Add multi-project irc support to the bot

2014-01-24 12:21:47 -05:00

requirements.txt

python-dateutil requires six, be explicit about it

2014-03-02 06:48:41 -05:00

setup.cfg

add uncategorized failure generation code

2014-01-17 09:35:40 -05:00

setup.py

Apply Cookiecutter to the repo.

2013-09-23 15:27:39 -07:00

test-requirements.txt

Add fingerprint for bug 1274056

2014-01-29 15:48:29 +04:00

tox.ini

Remove tox locale overrides

2014-02-10 03:01:36 +00:00

README.rst

elastic-recheck

"Use ElasticSearch to classify OpenStack gate failures"

Open Source Software: Apache license

Idea

Identifying the specific bug that is causing a transient error in the gate is very hard. Just identifying which tempest test failed is not enough because a single bug can potentially cause multiple tempest tests to fail. If we can find a fingerprint for a specific bug using logs, then we can use ElasticSearch to automatically detect any occurrences of the bug.

Using these fingerprints elastic-recheck can:

Search ElasticSearch for all occurrences of a bug.
Identify bug trends, such as when the bug started, whether it is fixed, and whether it is getting worse.
Classify bug failures in real time and report back to gerrit if we find a match, so a patch author knows why the test failed.

queries/

All queries are stored in separate yaml files in a queries directory at the top of the elastic-recheck code base. Each file is named ######.yaml, where ###### is the launchpad bug number. The yaml should have a query keyword whose value is the query text for ElasticSearch.
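
As an illustration of the naming convention, the sketch below recovers the bug number from a query file path. The helper name bug_number is hypothetical and is not part of elastic-recheck; it only assumes that file names are digits followed by .yaml.

```python
import os
import re

# Assumption: query files are named "<launchpad bug number>.yaml".
QUERY_FILE_RE = re.compile(r"^(\d+)\.yaml$")


def bug_number(path):
    """Return the bug number encoded in a query file name, or None."""
    match = QUERY_FILE_RE.match(os.path.basename(path))
    return int(match.group(1)) if match else None
```

For example, bug_number("queries/1316773.yaml") yields 1316773, while a file that does not follow the convention yields None and can be skipped.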

Guidelines for good queries

Queries should get as close as possible to fingerprinting the root cause
Queries should not return any hits for successful jobs; a hit on a successful job is a sign the query isn't specific enough
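
The second guideline can be checked mechanically once search results are in hand. The sketch below assumes each hit is a dict carrying a build_status field with the value "SUCCESS" for passing jobs; both the field name and the value are assumptions about the indexed data, and the helper name is hypothetical.

```python
def hits_on_successful_jobs(hits, status_field="build_status"):
    """Return the hits that came from jobs reported as successful.

    A non-empty result suggests the query is not specific enough.
    """
    return [hit for hit in hits if hit.get(status_field) == "SUCCESS"]
```

A reviewer could run a candidate query, pass the hits through this filter, and tighten the query until the filter returns an empty list.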

To support adding queries rapidly, it is considered socially acceptable to +A changes that only add one new bug query, and even for core reviewers to self-approve those changes.

Adding Bug Signatures

Most transient bugs seen in the gate are not bugs in tempest associated with a specific tempest test failure, but rather some sort of issue further down the stack that can cause many tempest tests to fail.

1. Given a transient bug that is seen during the gate, go through the logs (logs.openstack.org) and try to find a log that is associated with the failure. The closer to the root cause the better.

   Note that queries can only be written against INFO level and higher log messages. This is by design, to avoid overwhelming the search cluster.
2. Go to logstash.openstack.org and create an ElasticSearch query to find the log message from step 1. To see the possible fields to search on, click on an entry. Lucene query syntax is documented at http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
3. Add a comment to the bug with the query you identified and a link to the logstash url for that query search.
4. Add the query to elastic-recheck/queries/BUGNUMBER.yaml and push the patch up for review. See https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries
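
When the log message contains quotes or backslashes, the query text needs escaping before it goes into a Lucene phrase. The sketch below builds a field:"phrase" clause and joins several clauses with AND. The helper names are hypothetical, and the field names in the usage note (message, tags) are only examples; check the actual fields by clicking on an entry in logstash.

```python
def phrase_clause(field, text):
    """Build a Lucene phrase clause, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '{}:"{}"'.format(field, escaped)


def all_of(*clauses):
    """Require every clause to match by joining them with AND."""
    return " AND ".join(clauses)
```

For instance, all_of(phrase_clause("message", "some log line"), phrase_clause("tags", "some-file.txt")) produces a single query string that can be pasted into logstash and then into the yaml file.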

Future Work

Move config files into a separate directory
Make unit tests robust
Add debug mode flag
Expand gating testing
Clean up and document code better
Add ability to check if any resolved bugs return
Move away from polling ElasticSearch to discover if it's ready or not
Add nightly job to propose a patch to remove bug queries that return no hits (the bug must not have been seen in 2 weeks and must be closed)
