Mastering Elastic Stack

Aggregations

The aggregations framework is a very important part of Elasticsearch. As the name suggests, it helps us aggregate data and generate analytic information over the results of a search query. Aggregations give us better insight into the data. For example, if we take our library index into account, we can answer questions such as: how many books were published in a specific year, how many books cover a given technology, what is the average number of books per year, and many more.

These aggregations show their power when it comes to gaining insight into system data on a dashboard. Most often, system dashboards present aggregated data in the form of charts. We will also be using aggregations in later chapters, where they will help Kibana to generate useful visualizations.

There are two types of core aggregations: metrics and buckets. We will learn about these in this section.

Bucket aggregations

These aggregations create buckets of documents based on a criterion. Bucket aggregations can also hold sub-aggregations, which we will learn about in this section as well.

To understand bucket aggregations, let's add another index named stones with a type named diamond. The dataset is available at https://vincentarelbundock.github.io/Rdatasets/datasets.html.

For your convenience, the dataset used is also bundled with this chapter and is available with the code files. If you want to try out the unmodified dataset, you can get it at https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Diamond.csv. A few samples from the dataset are as follows:

A row in this table shows data about one diamond. We have carat, cut, color, clarity, and price. There are other fields as well, but we have omitted those here for the sake of simplicity.

The Logstash config file to store this data into Elasticsearch is as follows:

Note

Logstash will be covered in Chapter 3, Exploring Logstash and Its Plugins.

input {
  file {
    path => "/opt/elk/datasets/diamonds.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    # Name the CSV columns in the order they appear in the file
    columns => ["carat", "cut", "colour", "clarity", "depth", "table", "price", "x", "y", "z"]
    separator => ","
  }
  mutate {
    # Convert numeric fields; without this, Elasticsearch maps them as strings
    convert => ["carat", "float"]
    convert => ["depth", "float"]
    convert => ["table", "integer"]
    convert => ["price", "integer"]
    # x, y, and z are decimal dimensions, so convert them to floats
    convert => ["x", "float"]
    convert => ["y", "float"]
    convert => ["z", "float"]
  }
}
output {
  elasticsearch {
    index => "stones"
    document_type => "diamond"
    hosts => "localhost"
  }
}

In the preceding configuration, we have added a mutate section to convert the fields to proper types. If we do not convert them, Elasticsearch will set the string type for all the fields. Also, notice that we have set the index name to stones (a new index with this name will be created automatically if it does not exist already) and the document type to diamond. This will create an index named stones in Elasticsearch with a type diamond. If the index and type are already present, the data will be added to them. We have not provided a port for Elasticsearch here, so the default port 9200 will be used. To run Logstash using this configuration, run the following (assuming that the configuration file is named logstash.diamonds.conf and placed inside the conf directory):

$ ./bin/logstash -f conf/logstash.diamonds.conf

This command will index the data, creating the index upon which we will try out the aggregations.
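Once Logstash has finished, we can quickly verify that the documents were indexed; this sanity check is our own addition, and the returned count should match the number of rows in the CSV:

$ curl -XGET 'http://localhost:9200/stones/_count?pretty'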

If we add an aggregation to our search query, it will look like this:

"aggs" : {
 "<aggregation-name>" : {
 "<aggregation-type>" : {
 "field" : "<field-name>"
 }
 }
}

This is how we define an aggregation: using the aggs parameter (if we put aggregations instead of aggs, that will work too). Apart from the aggs parameter, we need to provide a name for the aggregation, the type of aggregation we want to use, and finally the field against which the data should be aggregated.
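Note that a single request can also define multiple sibling aggregations, each with its own name; in that case, the skeleton looks like this:

"aggs" : {
  "<first-aggregation-name>" : {
    "<aggregation-type>" : { "field" : "<field-name>" }
  },
  "<second-aggregation-name>" : {
    "<aggregation-type>" : { "field" : "<field-name>" }
  }
}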

Let's create our first bucket aggregation; in this one, we want to put diamonds into buckets by clarity:

$ curl -XGET 'http://localhost:9200/stones/diamond/_search?pretty' -d '{
  "aggs" : {
    "diamonds_by_clarity" : {
      "terms" : {
        "field" : "clarity"
      }
    }
  }
}'

You might get an illegal_argument_exception saying: "Fielddata is disabled on text fields by default. Set fielddata=true on [clarity] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."

To enable fielddata on a field, for example, on clarity, use the following:

$ curl -XPUT "http://localhost:9200/stones/_mapping/diamond" -d'
{
 "properties": {
 "clarity": { 
 "type": "text",
 "fielddata": true
 }
 }
}'
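Alternatively, the default dynamic mapping in Elasticsearch 5.x indexes string fields as text along with a keyword sub-field, so if you would rather not enable fielddata, you should be able to aggregate on clarity.keyword instead (assuming the stones index was created through dynamic mapping, as in our setup):

$ curl -XGET 'http://localhost:9200/stones/diamond/_search?pretty' -d '{
  "aggs" : {
    "diamonds_by_clarity" : {
      "terms" : {
        "field" : "clarity.keyword"
      }
    }
  }
}'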

Either way, the query will now create buckets based on clarity, and the output will be as follows:

"aggregations" : {
 "diamonds_by_clarity" : {
 "doc_count_error_upper_bound" : 0,
 "sum_other_doc_count" : 0,
 "buckets" : [ {
 "key" : "si1",
 "doc_count" : 13065
 }, {
 "key" : "vs2",
 "doc_count" : 12258
 }, {
 "key" : "si2",
 "doc_count" : 9193
 }, {
 "key" : "vs1",
 "doc_count" : 8171
 }, {
 "key" : "vvs2",
 "doc_count" : 5066
 }, {
 "key" : "vvs1",
 "doc_count" : 3655
 }, {
 "key" : "if",
 "doc_count" : 1790
 }, {
 "key" : "i1",
 "doc_count" : 741
 } ]
 }
 }

As we can see, there is a bucket for each unique value of clarity; the value becomes the key of the bucket, and the number of documents in it appears as doc_count.
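Keep in mind that the terms aggregation returns only the top 10 buckets by default; our clarity field has just eight distinct values, so all of them fit, but for fields with more unique values you can raise the limit using the size parameter:

$ curl -XGET 'http://localhost:9200/stones/diamond/_search?pretty' -d '{
  "aggs" : {
    "diamonds_by_clarity" : {
      "terms" : {
        "field" : "clarity",
        "size" : 20
      }
    }
  }
}'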

Let's make it more complex by adding a metric aggregation as a child aggregation (a sub-aggregation) of this one:

$ curl -XGET 'http://localhost:9200/stones/diamond/_search?pretty' -d '{
  "aggs" : {
    "diamonds_by_clarity" : {
      "terms" : {
        "field" : "clarity"
      },
      "aggs" : {
        "max_price" : {
          "max" : {
            "field" : "price"
          }
        }
      }
    }
  }
}'

To add a child aggregation, we just add one more aggs parameter inside the aggregation we created. max is a metric aggregation that we will learn about in the next section. Let's analyze the output of this one:

"buckets" : [ {
 "key" : "si1",
 "doc_count" : 13065,
 "max_price" : {
 "value" : 18818.0
 }
 }, {
 "key" : "vs2",
 "doc_count" : 12258,
 "max_price" : {
 "value" : 18823.0
 }
 },
 . . .
]

Each bucket now contains the max_price of the documents inside that bucket.
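The terms aggregation can also order its buckets by a sub-aggregation instead of by doc_count; for example, to sort the clarity buckets by their maximum price in descending order:

$ curl -XGET 'http://localhost:9200/stones/diamond/_search?pretty' -d '{
  "aggs" : {
    "diamonds_by_clarity" : {
      "terms" : {
        "field" : "clarity",
        "order" : { "max_price" : "desc" }
      },
      "aggs" : {
        "max_price" : {
          "max" : { "field" : "price" }
        }
      }
    }
  }
}'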

Let's take another example, where we will use price ranges to analyze the data. As we saw in the previous output, the maximum price stays below 19k; based on this, let's define our price ranges as 0-5k, 5-10k, 10-14k, 14-16k, and 16-19k. The aggregation that we will use this time is the range aggregation:

$ curl -XGET 'http://localhost:9200/stones/diamond/_search?pretty' -d '{
  "aggs" : {
    "price_ranges" : {
      "range" : {
        "field" : "price",
        "ranges" : [
          { "to" : 5000 },
          { "from" : 5000, "to" : 10000 },
          { "from" : 10000, "to" : 14000 },
          { "from" : 14000, "to" : 16000 },
          { "from" : 16000, "to" : 19000 }
        ]
      }
    }
  }
}'

For this aggregation, we also need to define the ranges. The output will now contain buckets according to the defined ranges:

"aggregations" : {
 "price_ranges" : {
 "buckets" : [ {
 "key" : "*-5000.0",
 "to" : 5000.0,
 "doc_count" : 39212
 }, {
 "key" : "5000.0-10000.0",
 "from" : 5000.0,
 "to" : 10000.0,
 "doc_count" : 9504
 }, {
 "key" : "10000.0-14000.0",
 "from" : 10000.0,
 "to" : 14000.0,
 "doc_count" : 3064
 }, {
 "key" : "14000.0-16000.0",
 "from" : 14000.0,
 "to" : 16000.0,
 "doc_count" : 1017
 }, {
 "key" : "16000.0-19000.0",
 "from" : 16000.0,
 "to" : 19000.0,
 "doc_count" : 1142
 } ]
 }
 }

We can see that the documents are now divided as per the defined ranges.
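The range aggregation also accepts a keyed flag that returns the buckets as a hash instead of an array, and each range can carry a custom key; the key names in this sketch are our own illustration:

$ curl -XGET 'http://localhost:9200/stones/diamond/_search?pretty' -d '{
  "aggs" : {
    "price_ranges" : {
      "range" : {
        "field" : "price",
        "keyed" : true,
        "ranges" : [
          { "key" : "affordable", "to" : 5000 },
          { "key" : "premium", "from" : 5000 }
        ]
      }
    }
  }
}'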

Before we do more with aggregations, let's put some more data into Elasticsearch. We are going to add a new type named movies to our library index. We will use the IMDB dataset, which can be downloaded from this URL: https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/movies.csv.

We have removed the first column, which is S.No. in the CSV, and loaded the rest of the data using Logstash. The configuration file used is as follows:

input {
  file {
    path => "/opt/elk/datasets/movies.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    columns => ["title", "year", "length", "budget", "rating", "votes", "r1votes", "r2votes", "r3votes", "r4votes", "r5votes", "r6votes", "r7votes", "r8votes", "r9votes", "r10votes", "mpaaRating", "action", "animation", "comedy", "drama", "documentary", "romance", "short"]
    separator => ","
  }
  mutate {
    # Convert numeric fields so that Elasticsearch does not map them as strings
    convert => ["year", "integer"]
    convert => ["budget", "integer"]
    convert => ["votes", "integer"]
    convert => ["rating", "integer"]
    convert => ["action", "integer"]
    convert => ["animation", "integer"]
    convert => ["comedy", "integer"]
    convert => ["drama", "integer"]
    convert => ["documentary", "integer"]
    convert => ["romance", "integer"]
    convert => ["short", "integer"]
  }
}
output {
  elasticsearch {
    index => "library"
    document_type => "movies"
    hosts => "localhost"
  }
}

We can run this configuration just like we did for stones. The only change is that we have kept the index name as library and the document_type as movies:

$ ./bin/logstash -f conf/logstash.movies.conf
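As a quick check of our own, we can confirm that the mutate conversions took effect by inspecting the mapping that was generated for the movies type:

$ curl -XGET 'http://localhost:9200/library/_mapping/movies?pretty'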

Metrics aggregations

Metrics aggregations compute metrics from the numbers or values extracted from the aggregated documents. These metrics can be single-valued or multi-valued. There are a number of aggregations available in this class, and in this section, we will learn about the important metrics aggregations.

Avg aggregation

This single-value aggregation takes numeric values into account to calculate the average over the aggregated documents. The values can be taken from a specific field or can be the result of a script. For example, if we want to see the average number of votes, we can use the following:

$ curl -XGET 'http://localhost:9200/library/movies/_search?pretty' -d '{
  "aggs" : {
    "avg_votes" : { "avg" : { "field" : "votes" } }
  }
}'

This will return the matching records along with the aggregation result:

{
  ...
  "aggregations" : {
    "avg_votes" : {
      "value" : 632.1034394774443
    }
  }
}
Min aggregation

This single-value aggregation returns the minimum value extracted from aggregated documents. For example, to find the minimum rating, use the following:

{
  "aggs" : {
    "min_rating" : { "min" : { "field" : "rating" } }
  }
}
Max aggregation

This single-value aggregation returns the maximum value extracted from aggregated documents. For example, to find the maximum rating, use the following:

{
  "aggs" : {
    "max_rating" : { "max" : { "field" : "rating" } }
  }
}
Percentiles aggregation

This multi-value aggregation calculates percentiles over a numeric field. For example, to get the percentiles of the rating field, use the following:

{
  "aggs" : {
    "rating_percentiles" : {
      "percentiles" : { "field" : "rating" }
    }
  }
}

This will result in the following:

{
  ...
  "aggregations" : {
    "rating_percentiles" : {
      "values" : {
        "1.0" : 1.0,
        "5.0" : 3.0,
        "25.0" : 5.0,
        "50.0" : 6.0,
        "75.0" : 7.0,
        "95.0" : 8.0,
        "99.0" : 9.0
      }
    }
  }
}
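By default, the percentiles aggregation returns the 1st, 5th, 25th, 50th, 75th, 95th, and 99th percentiles, as seen above. If we only care about specific ones, we can pass the percents parameter:

{
  "aggs" : {
    "rating_percentiles" : {
      "percentiles" : {
        "field" : "rating",
        "percents" : [50, 90, 99]
      }
    }
  }
}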
Sum aggregation

This single-value aggregation returns the sum of the values extracted from a field. For our dataset, we want to find the number of comedy movies; since the comedy field is 1 for comedy movies and 0 otherwise, summing it gives us that count:

$ curl -XGET 'http://localhost:9200/library/movies/_search?pretty' -d '{
  "aggs" : {
    "total_comedy_movies" : { "sum" : { "field" : "comedy" } }
  }
}'

The preceding code will result in the following:

"aggregations" : {
 "total_comedy_movies" : {
 "value" : 17271.0
 }
 }
Value count aggregation

This single-value aggregation returns the count of values extracted from the documents matching the search, that is, how many matching documents have the given field. For example, to get the count of rated movies released in 2000, run the following:

$ curl -XGET 'http://localhost:9200/library/movies/_search?pretty&q=year:2000' -d '{
  "aggs" : {
    "total_rated_movies" : { "value_count" : { "field" : "rating" } }
  }
}'
Cardinality aggregation

This is also a single-value metrics aggregation. It can be compared to a distinct query on a relational database. Let's say we want to get the number of unique years in the movies type; we can use this aggregation:

$ curl 'http://localhost:9200/library/movies/_search?pretty' -d '{
  "aggs" : {
    "years" : {
      "cardinality" : {
        "field" : "year"
      }
    }
  }
}'

This will result in the count of unique years in the index:

"aggregations" : {
 "years" : {
 "value" : 68
 }
 }
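Note that cardinality is an approximate count based on the HyperLogLog++ algorithm. Counts are effectively exact only up to the precision_threshold setting (3000 by default), which trades memory for accuracy and can be tuned per request:

$ curl 'http://localhost:9200/library/movies/_search?pretty' -d '{
  "aggs" : {
    "years" : {
      "cardinality" : {
        "field" : "year",
        "precision_threshold" : 1000
      }
    }
  }
}'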
Stats aggregation

This multi-valued aggregation returns the min, max, sum, count, and avg calculated over a field, all in one go.

To calculate stats on votes, run the following:

$ curl -XGET 'http://localhost:9200/library/movies/_search?pretty' -d '{
  "aggs" : {
    "stats_votes" : { "stats" : { "field" : "votes" } }
  }
}'

This will result in the following output:

"aggregations" : {
 "stats_votes" : {
 "count" : 58787,
 "min" : 5.0,
 "max" : 149494.0,
 "avg" : 629.4601357442972,
 "sum" : 3.7004073E7
 }
 }
Extended stats aggregation

This multi-valued aggregation extends the stats aggregation. Apart from min, max, count, sum, and avg, it adds sum_of_squares, variance, std_deviation, and std_deviation_bounds. For example, let's calculate extended stats on votes:

$ curl -XGET 'http://localhost:9200/library/movies/_search?pretty' -d '{
  "aggs" : {
    "extended_stats_votes" : { "extended_stats" : { "field" : "votes" } }
  }
}'

This will result in the following output:

"aggregations" : {
 "extended_stats_votes" : {
 "count" : 58787,
 "min" : 5.0,
 "max" : 149494.0,
 "avg" : 629.4601357442972,
 "sum" : 3.7004073E7,
 "sum_of_squares" : 8.60820897863E11,
 "variance" : 1.4246828534358414E7,
 "std_deviation" : 3774.4971233739752,
 "std_deviation_bounds" : {
 "upper" : 8178.4543824922475,
 "lower" : -6919.534111003653
 }
 }
 }
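The std_deviation_bounds shown above are computed at two standard deviations from the mean by default; this can be changed with the sigma parameter, for example, to get bounds at three standard deviations:

$ curl -XGET 'http://localhost:9200/library/movies/_search?pretty' -d '{
  "aggs" : {
    "extended_stats_votes" : { "extended_stats" : { "field" : "votes", "sigma" : 3 } }
  }
}'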

We will learn more about these aggregations with examples in the following chapters; for instance, in Chapter 12, Case Study - Meetup, we will use geolocation-related aggregations along with the terms and range aggregations, and much more.