Elasticsearch Aggregations Overview

One of the most exciting features of the upcoming Elasticsearch v1.0 release is the new Aggregations framework. Elasticsearch Aggregations provide a massive jump in functionality over the existing Facets API, so I’ve spent a bit of time playing with the latest beta release and have prepared this post so you know what to expect from this great new feature.

What are Elasticsearch Aggregations?

If you’ve ever used Elasticsearch Facets, you will understand how useful they can be. Elasticsearch Aggregations are even better.

If you're not familiar with Facets, they enable you to calculate and summarise data about the current query on the fly. They can be used for all sorts of tasks, such as dynamic counting of result values or even distribution histograms:

[Figure: Elasticsearch Histogram Facet. Elasticsearch can return metadata such as the distribution of the results of your current query context. Here, I used a Range Facet.]

Although hugely powerful, Facets had some frustrating limitations due to the way they were implemented in the Elasticsearch core. Facets only perform their calculations one level deep, and they can't be easily combined. The Aggregations API, which is available from version 1.0.0.beta2, solves these problems and provides an easy way of sculpting very precise multi-level calculations, performed at query time, in a single request. Simply put: Elasticsearch Aggregations are Facets on steroids.

It's always been possible to use facets to return "Popular Posts" and "Total pageviews per hour", but never things like "Popular Posts by hour". The only way to use Facets for these types of queries is to execute a separate query for each hour time-period - obviously, this can be slow and memory intensive, and in extreme cases it could lock your entire cluster up.

In my recent talk at the London Elasticsearch meetup, I mentioned a method of using facet filters and the bulk search API to try and get around this issue by performing all the queries in parallel, but it was far from an ideal solution.

Thankfully, with the introduction of aggregations in ES v1.0, it’s as simple as:

$ curl -XGET http://localhost:9200/logs/today/_search -d '
{
    "aggregations": {
        "popular_posts_overall": {
            "terms": { "field": "post_title" }
        },
        "by_hour": {
            "date_histogram": {
                "field":    "date",
                "interval": "hour"
            },
            "aggregations": {
                "popular_posts": {
                    "terms": { "field": "post_title" }
                }
            }
        }
    }
}'

Let’s break down the example above:

It shows a query against my logs index, in which I defined two aggregations. The first is a Terms aggregation called popular_posts_overall, which looks and behaves much like the traditional Terms Facet. The second is the by_hour date_histogram aggregation, which splits the results into buckets using the date field. Now this is where it gets cool: we then nested the popular_posts aggregation inside by_hour, which means it is executed over the contents of each bucket returned by the by_hour date_histogram. What we've done here is perform a multi-level aggregation, using the results of one aggregation as the input for the next. That's powerful!
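To make the nesting concrete, here is a minimal pure-Python sketch of what the request above computes. It does not talk to Elasticsearch at all: the sample documents and the helper names (terms, by_hour) are made up for illustration, and it assumes post_title is indexed as a single, unanalyzed value and ignores everything a real cluster does (sharding, default bucket sizes).

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical log documents; the field names match the request above.
docs = [
    {"date": "2014-01-10T09:05:00", "post_title": "Intro to Facets"},
    {"date": "2014-01-10T09:40:00", "post_title": "Intro to Facets"},
    {"date": "2014-01-10T09:55:00", "post_title": "Snapshot and Restore"},
    {"date": "2014-01-10T10:15:00", "post_title": "Snapshot and Restore"},
    {"date": "2014-01-10T10:30:00", "post_title": "Snapshot and Restore"},
]


def terms(documents, field):
    """Count documents per distinct field value, most frequent first."""
    return Counter(d[field] for d in documents).most_common()


def by_hour(documents, field):
    """Group documents into buckets keyed by the hour they fall in."""
    buckets = defaultdict(list)
    for d in documents:
        hour = datetime.fromisoformat(d[field]).replace(minute=0, second=0, microsecond=0)
        buckets[hour.isoformat()].append(d)
    return dict(sorted(buckets.items()))


# Top-level terms aggregation over every document.
popular_posts_overall = terms(docs, "post_title")

# Nested aggregation: run terms() again inside each hourly bucket.
popular_posts_by_hour = {
    hour: terms(bucket, "post_title") for hour, bucket in by_hour(docs, "date").items()
}
```

The point to notice is that the inner terms() call receives only the documents of one bucket, which is exactly what nesting popular_posts under by_hour asks Elasticsearch to do.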

Although this is a rather simple example, it starts to show the amazing power and potential of Elasticsearch aggregations.

I’m listening, tell me more…

The fact that aggregations can be nested means that the JSON API feels very familiar to an Elasticsearch user. Dynamic and hugely informative aggregations can be created in much the same way as complex search queries are built using the Query DSL.
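Since a request is just nested JSON, it can be assembled with ordinary data-structure code. The sketch below rebuilds the earlier by_hour request body in Python. The agg helper is hypothetical (it is not part of any Elasticsearch client); it only shows how sub-aggregations slot in under an "aggregations" key at each level.

```python
import json


def agg(kind, params, **sub_aggs):
    """Build one aggregation; keyword arguments become nested sub-aggregations."""
    body = {kind: params}
    if sub_aggs:
        body["aggregations"] = sub_aggs
    return body


request_body = {
    "aggregations": {
        "popular_posts_overall": agg("terms", {"field": "post_title"}),
        "by_hour": agg(
            "date_histogram",
            {"field": "date", "interval": "hour"},
            # Nested under by_hour, so it runs once per hourly bucket.
            popular_posts=agg("terms", {"field": "post_title"}),
        ),
    }
}

# Print the JSON that would be sent as the body of the search request.
print(json.dumps(request_body, indent=4))
```

Adding a third level is just another keyword argument on the inner agg call, which is what makes deep nesting cheap to express.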

Imagine combining geo aggregations with stats aggregations to produce something similar to this:
[Figure: inspired by Mark Harwood's recent tweet]

Here's another example, taken from a PR on GitHub. It uses a script to first define some age_group buckets, and then calculates the average height of the documents that have 'fallen' into each of the buckets:

{
    "aggs" : {
        "age_group" : {
            "range" : { 
                "script" : "DateTime.now().year - doc['date_of_birth'].date.year",
                "ranges" : [
                    { "to" : 5 },
                    { "from" : 5, "to" : 10 },
                    { "from" : 10, "to" : 15 },
                    { "from" : 15}
                ]
            },
            "aggs" : {
                "avg_height" : { "avg" : { "field" : "height" } }
            }
        }
    }
}

Which yields the following output:

"aggregations" : {
    "age_group" : [
        {
            "to" : 5.0,
            "doc_count" : 10,
            "avg_height" : 95
        },
        {
            "from" : 5.0,
            "to" : 10.0,
            "doc_count" : 5,
            "avg_height" : 130
        },
        {
            "from" : 10.0,
            "to" : 15.0,
            "doc_count" : 4,
            "avg_height" : 160
        },
        {
            "from" : 15.0,
            "doc_count" : 10,
            "avg_height" : 175.5
        }
    ]
}
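The shape of that response can be reproduced with a short pure-Python sketch of a range bucket with an avg sub-aggregation. The sample people are invented, and the age is stored directly rather than derived from date_of_birth by a script, so this only illustrates the bucketing logic. It assumes the range bounds behave as "from" inclusive and "to" exclusive.

```python
# Hypothetical documents with a precomputed age (the request above derives it by script).
people = [
    {"age": 3, "height": 95},
    {"age": 7, "height": 125},
    {"age": 9, "height": 135},
    {"age": 12, "height": 160},
    {"age": 30, "height": 175},
]

# Same bucket boundaries as the request: "from" is inclusive, "to" is exclusive.
ranges = [
    {"to": 5},
    {"from": 5, "to": 10},
    {"from": 10, "to": 15},
    {"from": 15},
]


def in_range(value, r):
    """True when value falls in [from, to); a missing bound means unbounded."""
    return r.get("from", float("-inf")) <= value < r.get("to", float("inf"))


def range_with_avg(documents, ranges, bucket_field, metric_field):
    """Bucket documents by range, then average metric_field inside each bucket."""
    result = []
    for r in ranges:
        bucket = [d for d in documents if in_range(d[bucket_field], r)]
        values = [d[metric_field] for d in bucket]
        entry = dict(r, doc_count=len(bucket))
        # An empty bucket has no average, so report None rather than divide by zero.
        entry["avg_height"] = sum(values) / len(values) if values else None
        result.append(entry)
    return result


age_group = range_with_avg(people, ranges, "age", "height")
```

Each entry in age_group carries its range bounds, a doc_count and the nested avg_height, mirroring the structure of the output shown above.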

With an API this versatile, the possibilities are limitless, especially when you consider how many aggregation types are available, and that they can be combined and nested to multiple levels.

Aggregation Types

Elasticsearch v1.0 will ship with most of the aggregation types you will ever need, but the framework makes it easy to add new aggregation types in the future.

Before you start combining and nesting aggregation types, it’s important to mention that each aggregation can be categorized as either a Metrics Aggregation or a Bucket Aggregation. Understanding the difference will help you to use various aggregation types in combination with each other.

Metrics Aggregations

These aggregations return value(s) derived from the documents returned by your query. Most of the time this will be the values contained in your documents, but they can also be the results of scripts. The metrics aggregations available at launch will be: min, max, avg, stats, extended_stats, and value_count.
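As a rough illustration of what a metrics aggregation boils down to, here is a small Python sketch of a stats-style metric: it takes the values of one field and returns a handful of numbers. The documents are invented and the function is only a sketch of the idea, not Elasticsearch's implementation.

```python
def stats(values):
    """Return the summary numbers a stats-style metric reports for a list of values."""
    values = list(values)
    count = len(values)
    total = sum(values)
    return {
        "count": count,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "avg": total / count if count else None,
        "sum": total,
    }


# Hypothetical documents; the last one lacks the field and is skipped.
docs = [{"height": 95}, {"height": 130}, {"height": 160}, {}]
height_stats = stats(d["height"] for d in docs if "height" in d)
```

A metric like this produces plain numbers, with no document set left over to aggregate further.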

Bucket Aggregations

Bucket aggregations define criteria for ‘buckets’ (think of them as ‘groups’) and documents 'fall' into relevant buckets. A bucket therefore contains a set of documents, which means bucket aggregations can contain sub-aggregations (which are applied to the contents of each bucket). The bucket aggregations are: global, filter, missing, nested, terms, range, date_range, ipv4_range, histogram, date_histogram and geo_distance.
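Here is a minimal Python sketch of that idea using a histogram-style bucket: each document falls into the bucket whose key is its value rounded down to a multiple of the interval, and any function of a document list can then be applied per bucket. The products, field name and helper names are hypothetical.

```python
import math
from collections import defaultdict


def histogram(documents, field, interval):
    """Group documents into fixed-width buckets keyed by the bucket's lower bound."""
    buckets = defaultdict(list)
    for d in documents:
        key = math.floor(d[field] / interval) * interval
        buckets[key].append(d)
    return dict(sorted(buckets.items()))


def with_sub_aggregation(buckets, metric):
    """Apply a metric function to the document set held by each bucket."""
    return {key: metric(bucket) for key, bucket in buckets.items()}


# Hypothetical products, bucketed by price in steps of 50.
products = [{"price": 10}, {"price": 45}, {"price": 60}, {"price": 120}]
price_buckets = histogram(products, "price", 50)

# Two different sub-aggregations over the same buckets.
counts = with_sub_aggregation(price_buckets, len)
max_price = with_sub_aggregation(price_buckets, lambda b: max(d["price"] for d in b))
```

Because each bucket is still a list of documents, the metric passed in could just as well be another bucketing function, which is how multi-level nesting falls out.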

Excited? I am.

Hopefully this post has given you an idea of what to expect from Elasticsearch Aggregations. I'm really excited about what this feature brings, and can't wait for version 1.0 to be released. There are some more great features in version 1.0, such as the Snapshot/Restore API (check out my overview of Snapshot and Restore post) and the new cat API. If you've enjoyed this post or think I've missed something out (sorry), feel free to let me know in the comments below or by tweeting me @simplechris.
