Mappings configuration
If you are used to SQL databases, you know that before you can start inserting data into the database, you need to create a schema that describes what your data looks like. Although Elasticsearch is a schema-less search engine (we prefer to say its schema is data driven) and can figure out the data structure on the fly, we think that controlling the structure, and thus defining it ourselves, is a better approach. The field type determining mechanism is not going to guess the future. For example, if you first send an integer value, such as 60, and then send a float value, such as 70.23, for the same field, Elasticsearch will just cut off the decimal part of the float value. This is because Elasticsearch will first set the field type to integer and will then try to index the float value into that integer field, which causes the decimal part of the floating point number to be truncated. In the next few pages you'll see how to create mappings that suit your needs and match your data structure.
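To see this behavior for yourself, you can run a short experiment. The following sketch uses a hypothetical mixed index and count field (both names are our own, chosen purely for illustration):
curl -XPOST 'localhost:9200/mixed/doc/1' -d '{"count": 60}'
curl -XPOST 'localhost:9200/mixed/doc/2' -d '{"count": 70.23}'
curl -XGET 'localhost:9200/mixed/_mapping?pretty'
The returned mapping should show that the count field kept the long type; the indexed value of the second document is truncated to 70, although the original 70.23 remains visible in the _source field.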
Note
Note that we didn't include all the information about the available types in this chapter; some features of Elasticsearch, such as the nested type, parent-child handling, and the storage and search of geographical points, are described in the following chapters of this book.
Type determining mechanism
Before we start describing how to create mappings manually, we want to get back to the automatic type determining algorithm used in Elasticsearch. As we already said, Elasticsearch can try guessing the schema for our documents by looking at the JSON that the document is built from. Because JSON is structured, that seems easy to do. For example, strings are surrounded by quotation marks, Booleans are defined using specific words, and numbers are just a few digits. This is a simple trick, but it usually works. For example, let's look at the following document:
{ "field1": 10, "field2": "10" }
The preceding document has two fields. The field1 field will be given a number type (to be precise, the long type). The second field, called field2, will be given a string type, because its value is surrounded by quotation marks. Of course, for some use cases this can be the desired behavior. However, if we somehow surrounded all the data with quotation marks (which is not the best idea anyway), our index structure would contain only string type fields.
Note
Don't worry if you are not yet familiar with the numeric types, the string types, and so on. We will describe them after we show you what you can do to tune the automatic type determining mechanism in Elasticsearch.
Disabling the type determining mechanism
The first solution is to completely disable the schema-less behavior in Elasticsearch. We can do that by adding the index.mapper.dynamic property to our index properties and setting it to false. To create the index this way, we run the following command:
curl -XPUT 'localhost:9200/sites' -d '{ "index.mapper.dynamic": false }'
By doing that, we told Elasticsearch that we don't want it to guess the types of our documents in the sites index and that we will provide the mappings ourselves. If we try to index an example document into the sites index, we will get the following error:
{ "error" : { "root_cause" : [ { "type" : "type_missing_exception", "reason" : "type[[doc, trying to auto create mapping, but dynamic mapping is disabled]] missing", "index" : "sites" } ], "type" : "type_missing_exception", "reason" : "type[[doc, trying to auto create mapping, but dynamic mapping is disabled]] missing", "index" : "sites" }, "status" : 404 }
This is because we didn't create any mappings – no schema for the documents was created. Elasticsearch couldn't create one for us because we didn't allow it, and the indexing command failed.
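Disabling dynamic mapping doesn't prevent us from adding mappings explicitly. A minimal sketch using the put mapping API (the doc type and title field are our own illustrative choices) could look as follows:
curl -XPUT 'localhost:9200/sites/_mapping/doc' -d '{ "doc" : { "properties" : { "title" : { "type" : "string" } } } }'
After that, documents of the doc type containing the title field can be indexed as usual.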
Of course, this is not the only thing we can do when it comes to configuring how the type determining mechanism works. We can also tune it, or disable it for a given type at the object level. We will talk about the latter case in Chapter 5, Extending Your Index Structure. For now, let's look at the possibilities of tuning the type determining mechanism in Elasticsearch.
Tuning the type determining mechanism for numeric types
One of the problems with JSON documents and type guessing is that we are not always in control of the data. The documents that we index can come from multiple places, and some systems in our environment may include quotation marks around all the fields in the document. This can lead to problems and bad guesses. Because of that, Elasticsearch allows us to enable more aggressive checking of field values for numeric fields by setting the numeric_detection property to true in the mappings definition. For example, let's assume that we want to create an index called users with a user type for which we want more aggressive numeric field parsing. To do that, we will use the following command:
curl -XPUT http://localhost:9200/users/?pretty -d '{ "mappings" : { "user": { "numeric_detection" : true } } }'
Now let's run the following command to index a single document to the users index:
curl -XPOST http://localhost:9200/users/user/1 -d '{"name": "User 1", "age": "20"}'
Earlier, with the default settings, the age field would be given the string type. With the numeric_detection property set to true, the type of the age field will be set to long. We can check that by running the following command (it will retrieve the mappings for all the types in the users index):
curl -XGET 'localhost:9200/users/_mapping?pretty'
The preceding command should result in the following response returned by Elasticsearch:
{ "users" : { "mappings" : { "user" : { "numeric_detection" : true, "properties" : { "age" : { "type" : "long" }, "name" : { "type" : "string" } } } } } }
As we can see, the age field was really set to be of the long type.
Tuning the type determining mechanism for dates
Another type of data that causes trouble is fields with dates. Dates can come in different flavors; for example, 2015-10-01 11:22:33 is a proper date and so is 2015-10-01T11:22:33+00. Because of that, Elasticsearch tries to match field values to timestamps or strings that match some given date format. If that matching operation succeeds, the field is treated as a date based one. If we know how our date fields look, we can help Elasticsearch by providing a list of recognized date formats using the dynamic_date_formats property, which allows us to specify an array of formats. Let's look at the following command for creating an index:
curl -XPUT 'http://localhost:9200/blog/' -d '{ "mappings" : { "article" : { "dynamic_date_formats" : ["yyyy-MM-dd hh:mm"] } } }'
The preceding command will result in the creation of an index called blog with a single type called article. We've also used the dynamic_date_formats property with a single date format, which will result in Elasticsearch using the date core type (refer to the Core types section in this chapter for more information about field types) for fields matching the defined format. Elasticsearch uses the joda-time library to define the date formats, so visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html if you are interested in learning more about them.
Note
Remember that the dynamic_date_formats property accepts an array of values. That means that we can handle several date formats simultaneously.
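For example, a sketch of such a multi-format value (the two additional formats are our own illustrative choices) could look as follows:
"dynamic_date_formats" : ["yyyy-MM-dd hh:mm", "yyyy-MM-dd", "dd/MM/yyyy"]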
With the blog index we created earlier, we can now try indexing a new document using the following command:
curl -XPUT localhost:9200/blog/article/1 -d '{"name": "Test", "test_field":"2015-10-01 11:22"}'
Elasticsearch will of course index that document, but let's look at the mappings created for our index:
curl -XGET 'localhost:9200/blog/_mapping?pretty'
The response for the preceding command will be as follows:
{ "blog" : { "mappings" : { "article" : { "dynamic_date_formats" : [ "yyyy-MM-dd hh:mm" ], "properties" : { "name" : { "type" : "string" }, "test_field" : { "type" : "date", "format" : "yyyy-MM-dd hh:mm" } } } } } }
As we can see, the test_field field was given the date type, so our tuning works.
Unfortunately, the problem still exists if we want the Boolean type to be guessed. There is no option to force the guessing of Boolean types from text. In such cases, when a change of the source format is impossible, we can only define the field directly in the mappings definition.
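For example, a minimal sketch of such an explicit definition (the active field name is our own illustrative choice) could look as follows:
"active" : { "type" : "boolean" }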
Index structure mapping
All data has its own structure – some is very simple, and some includes complicated object relations, child documents, and nested properties. In each case, we need a schema in Elasticsearch, called mappings, that defines how the data looks. Of course, we can rely on the schema-less nature of Elasticsearch, but we can, and usually want to, prepare the mappings upfront, so we know how the data is handled.
For the purposes of this chapter, we will use a single type in the index. Of course, Elasticsearch as a multitenant system allows us to have multiple types in a single index, but we want to keep the example simple and easy to understand. So, for the purpose of the next few pages, we will create an index called posts that will hold data for documents in a post type. We also assume that the index will hold the following information:
- Unique identifier of the blog post
- Name of the blog post
- Publication date
- Contents – text of the post itself
In Elasticsearch, mappings, as with almost all communication, are sent as JSON objects in the request body. So, if we want to create the simplest mappings that match our needs, it will look as follows (we stored the mappings in the posts.json file, so we can easily send it):
{ "mappings": { "post": { "properties": { "id": { "type":"long" }, "name": { "type":"string" }, "published": { "type":"date" }, "contents": { "type":"string" } } } } }
To create our posts index with the preceding mappings file, we will just run the following command:
curl -XPOST 'http://localhost:9200/posts' -d @posts.json
Note
Note that you can store your mappings in a file with whatever name you like. The curl command will just take its contents.
And again, if everything goes well, we see the following response:
{"acknowledged":true}
Elasticsearch reported that our index has been created. If we look at the logs of the Elasticsearch node that is the current master, we will see something like the following:
[2015-10-14 15:02:12,840][INFO ][cluster.metadata ] [Shalla-Bal] [posts] creating index, cause [api], templates [], shards [5]/[1], mappings [post]
We can see that the posts index has been created with 5 shards and 1 replica (shards [5]/[1]) and with mappings for a single post type (mappings [post]). Let's now discuss the contents of the posts.json file and the possibilities when it comes to mappings.
Type and types definition
The mappings definition in Elasticsearch is just another JSON object, so it needs to be properly started and ended with curly brackets. All the mappings definitions are nested inside a single mappings object. In our example, we had a single post type, but we can have multiple types. For example, if we would like to have more than a single type in our mappings, we just need to separate them with a comma character. Let's assume that we would like to have an additional user type in our posts index. The mappings definition in such a case will look as follows (we stored it in the posts_with_user.json file):
{ "mappings": { "post": { "properties": { "id": { "type":"long" }, "name": { "type":"string" }, "published": { "type":"date" }, "contents": { "type":"string" } } }, "user": { "properties": { "id": { "type":"long" }, "name": { "type":"string" } } } } }
As you can see, we can name the types however we want. Under each type, we have the properties object, in which we store the actual names of the fields and their definitions.
Fields
Each field in the mappings definition is just a name and an object describing the properties of the field. For example, we can have a field defined as the following:
"body": { "type":"string", "store":"yes", "index":"analyzed" }
The preceding field definition starts with a name – body. After that, we have an object with three properties: the type of the field (the type property), whether the original field value should be stored (the store property), and whether and how the field should be indexed (the index property). And, of course, multiple field definitions are separated from each other using the comma character, just like other JSON objects.
Core types
Each field type in Elasticsearch can be given one of the provided core types. The core types in Elasticsearch are as follows:
- String
- Number (integer, long, float, double)
- Date
- Boolean
- Binary
In addition to the core types, Elasticsearch provides additional types that can handle more complicated data, such as nested documents, objects, and so on. We will talk about them in Chapter 5, Extending Your Index Structure.
Common attributes
Before continuing with all the core type descriptions, we would like to discuss some common attributes that you can use to describe all the types (except for the binary one):
- index_name: This attribute defines the name of the field that will be stored in the index. If it is not defined, the name will be set to the name of the object that the field is defined with. Usually, you don't need to set this property, but it may be useful in some cases; for example, when you don't have control over the names of the fields in the JSON documents that are sent to Elasticsearch.
- index: This attribute can take the values analyzed and no and, for string-based fields, it can also be set to the additional not_analyzed value. If set to analyzed, the field will be indexed and thus searchable. If set to no, you won't be able to search on such a field. The default value is analyzed. In the case of string-based fields, there is the additional option, not_analyzed. This, when set, means that the field will be indexed but not analyzed. So the field is written to the index as it was sent to Elasticsearch, and only a perfect match will be counted during a search – the query will have to include exactly the same value as the value in the index. If we compare it to the SQL databases world, setting the index property of a field to not_analyzed would work just like using the where field = value clause. Also remember that setting the index property to no will disable the inclusion of that field in include_in_all (the include_in_all property is discussed as the last property in this list).
- store: This attribute can take the values yes and no and specifies whether the original value of the field should be written into the index. The default value is no, which means that Elasticsearch won't store the original value of the field and will try to use the _source field (the JSON representing the original document that has been sent to Elasticsearch) when you want to retrieve the field value. Stored fields are not used for searching; however, they can be used for highlighting if enabled (which may be more efficient than loading the _source field in case it is big).
- doc_values: This attribute can take the values true and false. When set to true, Elasticsearch will create a special on-disk structure during indexing for fields that are not tokenized (such as not analyzed string fields, number based fields, Boolean fields, and date fields). This structure is highly efficient and is used by Elasticsearch for operations that require un-inverted data, such as aggregations, sorting, or scripting. Starting with Elasticsearch 2.0, the default value of this attribute is true for not tokenized fields. Setting it to false will result in Elasticsearch using the field data cache instead of doc values, which has higher memory demands, but may be faster in some rare situations.
- boost: This attribute defines how important the field is inside the document; the higher the boost, the more important the values in the field are. The default value of this attribute is 1, which means a neutral value – anything above 1 will make the field more important, anything below 1 will make it less important.
- null_value: This attribute specifies a value that should be written into the index in case that field is not a part of an indexed document. The default behavior will just omit that field.
- copy_to: This attribute specifies an array of fields to which the original value will be copied. This allows for different kinds of analysis of the same data. For example, you could imagine having two fields – one called title and one called title_sort – each having the same value but processed differently. We could use copy_to to copy the title field value to title_sort.
- include_in_all: This attribute specifies whether the field should be included in the _all field. The _all field is a special field used by Elasticsearch to allow easy searching in the contents of the whole indexed document. Elasticsearch creates the content of the _all field by copying all the document fields there. By default, if the _all field is used, all the fields will be included in it.
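To illustrate, here is a sketch of a field definition combining several of the common attributes (the title and title_sort field names are our own illustrative choices):
"title" : { "type" : "string", "store" : "yes", "boost" : 2, "null_value" : "unknown", "copy_to" : [ "title_sort" ], "include_in_all" : true }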
String
String is the basic text type which allows us to store one or more characters inside it. A sample definition of such a field is as follows:
"body" : { "type" : "string", "store" : "yes", "index" : "analyzed" }
In addition to the common attributes, the following attributes can also be set for the string-based fields:
- term_vector: This attribute can take the values no (the default one), yes, with_offsets, with_positions, and with_positions_offsets. It defines whether or not to calculate the Lucene term vectors for the field. If you are using highlighting (marking which terms were matched in a document during the query), you will need the term vectors calculated for the so called fast vector highlighting – a more efficient highlighting version.
- analyzer: This attribute defines the name of the analyzer used for indexing and searching. It defaults to the globally-defined analyzer name.
- search_analyzer: This attribute defines the name of the analyzer used for processing the part of the query string that is sent to a particular field.
- norms.enabled: This attribute specifies whether the norms should be loaded for a field. By default, it is set to true for analyzed fields (which means that the norms will be loaded for such fields) and to false for non-analyzed fields. Norms are values inside the Lucene index that are used when calculating a score for a document – they are usually not needed for not analyzed fields and are used only during query time. An example index creation command that disables norms for a single field would look as follows:
curl -XPOST 'localhost:9200/essb' -d '{ "mappings" : { "book" : { "properties" : { "name" : { "type" : "string", "norms" : { "enabled" : false } } } } } }'
- norms.loading: This attribute takes the values eager and lazy and defines how Elasticsearch will load the norms. The first value means that the norms for such fields are always loaded. The second value means that the norms will be loaded only when needed. Norms are useful for scoring, but may require a vast amount of memory for large data sets. Having norms loaded eagerly (the property set to eager) means less work during query time, but will lead to more memory consumption. An example index creation command that eagerly loads norms for a single field would look as follows:
curl -XPOST 'localhost:9200/essb_eager' -d '{ "mappings" : { "book" : { "properties" : { "name" : { "type" : "string", "norms" : { "loading" : "eager" } } } } } }'
- position_offset_gap: This attribute defaults to 0 and specifies the gap in the index between instances of the given field with the same name. Setting this to a higher value may be useful if you want position-based queries (such as phrase queries) to match only inside a single instance of the field.
- index_options: This attribute defines the indexing options for the postings list – the structure holding the terms (we talk more about it in the Postings format section of this chapter). The possible values are docs (only document numbers are indexed), freqs (document numbers and term frequencies are indexed), positions (document numbers, term frequencies, and their positions are indexed), and offsets (document numbers, term frequencies, their positions, and their offsets are indexed). The default value for this property is positions for analyzed fields and docs for fields that are indexed but not analyzed.
- ignore_above: This attribute defines the maximum size of the field in characters. A field whose size is above the specified value will be ignored by the analyzer.
Note
In one of the upcoming Elasticsearch versions, the string type may be deprecated and replaced by two new types, text and keyword, to better indicate what a string based field represents. The text type will be used for analyzed text fields and the keyword type will be used for not analyzed text fields. If you are interested in the incoming changes, refer to the following GitHub issue: https://github.com/elastic/elasticsearch/issues/12394.
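To tie the string attributes together, here is a hedged sketch of a string field set up for fast vector highlighting (the body field name is just illustrative):
"body" : { "type" : "string", "store" : "yes", "term_vector" : "with_positions_offsets" }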
Number
This is the common name for a few core types that gather all the available numeric field types. The following types are available in Elasticsearch (we specify them by using the type property):
- byte: This type defines a byte value; for example, 1. It allows for values between -128 and 127 inclusive.
- short: This type defines a short value; for example, 12. It allows for values between -32768 and 32767 inclusive.
- integer: This type defines an integer value; for example, 134. It allows for values between -2^31 and 2^31-1 inclusive (Java 8 additionally allows treating the type as unsigned, with values between 0 and 2^32-1).
- long: This type defines a long value; for example, 123456789. It allows for values between -2^63 and 2^63-1 inclusive (Java 8 additionally allows treating the type as unsigned, with values between 0 and 2^64-1).
- float: This type defines a float value; for example, 12.23. For information about the possible values, refer to https://docs.oracle.com/javase/specs/jls/se8/html/jls-4.html#jls-4.2.3.
- double: This type defines a double value; for example, 123.45. For information about the possible values, refer to https://docs.oracle.com/javase/specs/jls/se8/html/jls-4.html#jls-4.2.3.
Note
You can learn more about the mentioned Java types at http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html.
A sample definition of a field based on one of the numeric types is as follows:
"price" : { "type" : "float", "precision_step" : "4" }
In addition to the common attributes, the following ones can also be set for the numeric fields:
- precision_step: This attribute defines the number of terms generated for each value in the numeric field. The lower the value, the higher the number of terms generated. For fields with a higher number of terms per value, range queries will be faster at the cost of a slightly larger index. The default value is 16 for long and double, 8 for integer, short, and float, and 2147483647 for byte.
- coerce: This attribute defaults to true and can take the value of true or false. It defines whether Elasticsearch should try to convert string values to numbers for a given field and whether the decimal parts of float values should be truncated for the integer based fields.
- ignore_malformed: This attribute can take the value true or false (which is the default). It should be set to true in order to omit badly formatted values.
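As an illustration, a sketch of an index creation command that disables coercion and skips malformed values (the prices index, offer type, and price field are our own illustrative names) could look as follows:
curl -XPUT 'localhost:9200/prices' -d '{ "mappings" : { "offer" : { "properties" : { "price" : { "type" : "float", "coerce" : false, "ignore_malformed" : true } } } } }'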
Boolean
The boolean core type is designed for indexing Boolean values (true or false). A sample definition of a field based on the boolean type is as follows:
"allowed" : { "type" : "boolean", "store": "yes" }
Binary
The binary field is a Base64 representation of the binary data stored in the index. You can use it to store data that is normally written in binary form, such as images. Fields based on this type are by default stored and not indexed, so you can only retrieve them and not perform search operations on them. The binary type only supports the index_name, type, store, and doc_values properties. A sample field definition based on the binary type may look like the following:
"image" : { "type" : "binary" }
Date
The date core type is designed to be used for date indexing. The date in the field allows us to specify a format that will be recognized by Elasticsearch. It is worth noting that all the dates are indexed in UTC and are internally indexed as long values. In addition to that, for the date based fields, Elasticsearch accepts long values representing UTC milliseconds since epoch regardless of the format specified for the date field.
The default date format recognized by Elasticsearch is quite universal and allows us to provide the date and optionally the time; for example, 2012-12-24T12:10:22. A sample definition of a field based on the date type is as follows:
"published" : { "type" : "date", "format" : "YYYY-mm-dd" }
A sample document that uses the above date field with the specified format is as follows:
{ "name" : "Sample document", "published" : "2012-12-22" }
In addition to the common attributes, the following ones can also be set for the fields based on the date type:
- format: This attribute specifies the format of the date. The default value is dateOptionalTime. For a full list of formats, visit https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html.
- precision_step: This attribute defines the number of terms generated for each value in the numeric field. Refer to the numeric core type description for more information about this parameter.
- numeric_resolution: This attribute defines the unit of time that Elasticsearch will use when a numeric value, instead of a date matching the format, is passed to the date based field. By default, Elasticsearch uses the milliseconds value, which means that the numeric value will be treated as milliseconds since epoch. Another possible value is seconds.
- ignore_malformed: This attribute can take the value true or false. The default value is false. It should be set to true in order to omit badly formatted values.
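For example, a hedged sketch of a date field that treats incoming numeric values as seconds since epoch (the created field name is our own illustrative choice) could look as follows:
"created" : { "type" : "date", "format" : "yyyy-MM-dd", "numeric_resolution" : "seconds" }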
Multi fields
There are situations where we need to have the same field analyzed differently; for example, one version for sorting, one for searching, and one for analysis with aggregations, all using the same field value, just indexed differently. We could of course use the previously described field value copying, but we can also use so called multi fields. To be able to use this feature of Elasticsearch, we need to define an additional property in our field definition called fields. The fields property is an object that can contain one or more additional fields that will be present in our index and will have the value of the field they are assigned to. For example, if we would like to run aggregations on the name field and, in addition to that, search on it, we would define it as follows:
"name": { "type": "string", "fields": { "agg": { "type" : "string", "index": "not_analyzed" } } }
The preceding definition will create two fields – one called name and the second called name.agg. Of course, you don't have to specify two separate fields in the data you are sending to Elasticsearch – a single one named name is enough. Elasticsearch will do the rest, which means copying the value of the field to all the fields from the preceding definition.
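To illustrate how the two variants are used together, a hedged sketch of a request that searches on the analyzed name field while aggregating on the not analyzed name.agg field could look as follows:
curl -XGET 'localhost:9200/posts/_search?pretty' -d '{ "query" : { "match" : { "name" : "elasticsearch" } }, "aggs" : { "names" : { "terms" : { "field" : "name.agg" } } } }'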
The IP address type
The ip field type was added to Elasticsearch to simplify the use of IPv4 addresses in a numeric form. This field type allows us to search data that is indexed as an IP address, sort on such data, and use range queries with IP values.
A sample definition of a field based on the ip type is as follows:
"address" : { "type" : "ip" }
In addition to the common attributes, the precision_step attribute can also be set for the ip type based fields. Refer to the numeric type description for more information about that property.
A sample document that uses the ip based field looks as follows:
{ "name" : "Tom PC", "address" : "192.168.2.123" }
Token count type
The token_count field type allows us to store and index information about how many tokens the given field has, instead of storing and indexing the text provided to the field. It accepts the same configuration options as the number type, but in addition to that, we need to specify the analyzer that will be used to divide the field value into tokens. We do that by using the analyzer property.
A sample definition of a field based on the token_count field type looks as follows:
"title_count" : { "type" : "token_count", "analyzer" : "standard" }
Using analyzers
The great thing about Elasticsearch is that it leverages the analysis capabilities of Apache Lucene. This means that for fields based on the string type, we can specify which analyzer Elasticsearch should use. As you remember from the Full text searching section of Chapter 1, Getting Started with Elasticsearch Cluster, an analyzer is a functionality that is used to analyze data or queries in the way we want. For example, when we divide words on the basis of whitespace and lowercase them, we don't have to worry about users sending words that are lowercased or uppercased. This means that Elasticsearch, elasticsearch, and ElAstIcSeaRCh will be treated as the same word. What's more, Elasticsearch allows us not only to use the analyzers provided out of the box, but also to create our own configurations. We can also use different analyzers at the time of indexing and different analyzers at the time of querying – we can choose how we want our data to be processed at each stage of the search process. Let's now have a look at the analyzers provided by Elasticsearch and at the Elasticsearch analysis functionality in general.
Out-of-the-box analyzers
Elasticsearch allows us to use one of the many analyzers defined by default. The following analyzers are available out of the box:
- standard: This analyzer is convenient for most European languages (refer to https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html for the full list of parameters).
- simple: This analyzer splits the provided value on non-letter characters and converts the tokens to lowercase.
- whitespace: This analyzer splits the provided value on the basis of whitespace characters.
- stop: This is similar to the simple analyzer, but in addition to the functionality of the simple analyzer, it filters the data on the basis of the provided set of stop words (refer to https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html for the full list of parameters).
- keyword: This is a very simple analyzer that just passes the provided value through untouched. You'll achieve the same effect by specifying a particular field as not_analyzed.
- pattern: This analyzer allows flexible text separation by the use of regular expressions (refer to https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html for the full list of parameters). The key point to remember when it comes to the pattern analyzer is that the provided pattern should match the separators of the words, not the words themselves.
- language: This analyzer is designed to work with a specific language. The full list of languages supported by this analyzer can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html.
- snowball: This is an analyzer that is similar to standard, but additionally provides a stemming algorithm (refer to https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html for the full list of parameters).
Note
Stemming is the process of reducing inflected and derived words to their stem or base form. Such a process allows one to reduce words; consider, for example, cars and car. For these words, a stemmer (an implementation of the stemming algorithm) will produce the single stem car. After indexing, documents containing such words will be matched while using any of them. Without stemming, documents with the word "cars" will only be matched by a query containing the same word. You can find more information about stemming on Wikipedia at https://en.wikipedia.org/wiki/Stemming.
Defining your own analyzers
In addition to the analyzers mentioned previously, Elasticsearch allows us to define new ones without the need for writing a single line of Java code. In order to do that, we need to add an additional section to our mappings file; that is, the settings section, which holds additional information used by Elasticsearch during index creation. The following code snippet shows how we can define our custom settings section:
"settings" : { "index" : { "analysis": { "analyzer": { "en": { "tokenizer": "standard", "filter": [ "asciifolding", "lowercase", "ourEnglishFilter" ] } }, "filter": { "ourEnglishFilter": { "type": "kstem" } } } } }
We specified that we want a new analyzer named en to be present. Each analyzer is built from a single tokenizer and multiple filters. A complete list of the default filters and tokenizers can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html. Our en analyzer includes the standard tokenizer and three filters: asciifolding and lowercase, which are available by default, and the custom ourEnglishFilter, which is a filter we have defined.
To define a filter, we need to provide its name, its type (the type property), and any number of additional parameters required by that filter type. The full list of filter types available in Elasticsearch can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html. Please be aware that we won't be discussing each filter, as the list of filters is constantly changing. If you are interested in the full list of filters, please refer to the mentioned page in the documentation.
So, the final mappings file with our custom analyzer defined will be as follows:
{ "settings" : { "index" : { "analysis": { "analyzer": { "en": { "tokenizer": "standard", "filter": [ "asciifolding", "lowercase", "ourEnglishFilter" ] } }, "filter": { "ourEnglishFilter": { "type": "kstem" } } } } }, "mappings" : { "post" : { "properties" : { "id": { "type" : "long" }, "name": { "type" : "string", "analyzer": "en" } } } } }
If we save the preceding mappings to a file called posts_mappings.json, we can run the following command to create the posts index:
curl -XPOST 'http://localhost:9200/posts' -d @posts_mappings.json
We can see how our analyzer works by using the Analyze API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html). For example, let's look at the following command:
curl -XGET 'localhost:9200/posts/_analyze?pretty&field=name' -d 'robots cars'
The command asks Elasticsearch to show the content of the analysis of the given phrase (robots cars) with the use of the analyzer defined for the post type and its name field. The response that we will get from Elasticsearch is as follows:
{ "tokens" : [ { "token" : "robot", "start_offset" : 0, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "car", "start_offset" : 7, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 1 } ] }
As you can see, the robots cars phrase was divided into two tokens. In addition to that, the robots word was changed to robot and the cars word was changed to car.
Default analyzers
There is one more thing to say about analyzers. Elasticsearch allows us to specify the analyzer that should be used by default if no analyzer is defined. This is done in the same way as we configured a custom analyzer in the settings section of the mappings file, but instead of specifying a custom name for the analyzer, the default keyword should be used. So, to make our previously defined analyzer the default, we can change the en analyzer to the following:
{ "settings" : { "index" : { "analysis": { "analyzer": { "default": { "tokenizer": "standard", "filter": [ "asciifolding", "lowercase", "ourEnglishFilter" ] } }, "filter": { "ourEnglishFilter": { "type": "kstem" } } } } } }
We can also choose a different default analyzer for searching and a different one for indexing. If we would like to do that, instead of using the default keyword for the analyzer name, we should use default_search and default_index, respectively.
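A hedged sketch of such a configuration (we reuse our custom filter for indexing and, purely for illustration, a simpler setup for searching) could look as follows:
{ "settings" : { "index" : { "analysis": { "analyzer": { "default_index": { "tokenizer": "standard", "filter": [ "asciifolding", "lowercase", "ourEnglishFilter" ] }, "default_search": { "tokenizer": "standard", "filter": [ "lowercase" ] } }, "filter": { "ourEnglishFilter": { "type": "kstem" } } } } } }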
Different similarity models
With the release of Apache Lucene 4.0 in 2012, all the users of this great full text search library were given the opportunity to alter the default TF/IDF-based algorithm and use a different one (we've mentioned it in the Full text searching section of Chapter 1, Getting Started with Elasticsearch Cluster). Because of that we are able to choose a similarity model in Elasticsearch, which basically allows us to use different scoring formulas for our documents.
Note
Note that the similarity models topic ranges from intermediate to advanced and in most cases the TF/IDF based algorithm will be sufficient for your use case. However, we decided to have it described in the book, so you know that you have the possibility of changing the scoring algorithm behavior if needed.
Setting per-field similarity
Since Elasticsearch 0.90, we are allowed to set a different similarity for each of the fields that we have in our mappings file. For example, let's assume that we have the following simple mappings that we use in order to index the blog posts:
{ "mappings" : { "post" : { "properties" : { "id" : { "type" : "long" }, "name" : { "type" : "string" }, "contents" : { "type" : "string" } } } } }
Let's now use the BM25 similarity model for the name field and the contents field. In order to do that, we need to extend our field definitions and add the similarity property with the value of the chosen similarity name. Our changed mappings will look like the following:
{ "mappings" : { "post" : { "properties" : { "id" : { "type" : "long" }, "name" : { "type" : "string", "similarity" : "BM25" }, "contents" : { "type" : "string", "similarity" : "BM25" } } } } }
And that's all, nothing more is needed. After the above change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and the contents fields.
Available similarity models
There are at least five new similarity models available. For most of the use cases, apart from the default one, you may find the following models useful:
- Okapi BM25 model: This similarity model is based on a probabilistic model that estimates the probability of finding a document for a given query. In order to use it, you need to set the similarity property for a field to BM25. Okapi BM25 similarity is said to perform best when dealing with short text documents where term repetitions are especially hurtful to the overall document score. This similarity is defined out of the box and doesn't need additional properties to be set.
- Divergence from randomness model: This similarity model is based on the probabilistic model of the same name. In order to use this similarity in Elasticsearch, you need to use the DFR name. It is said that the divergence from randomness similarity model performs well on text that is similar to natural language.
- Information-based model: This is the last of the newly introduced similarity models and is very similar to the divergence from randomness model. In order to use this similarity in Elasticsearch, you need to use the IB name. Similar to the DFR similarity, it is said that the information-based model performs well on data similar to natural language text.
The two other similarity models currently available are LM Dirichlet similarity (to use it, set the type property to LMDirichlet) and LM Jelinek Mercer similarity (to use it, set the type property to LMJelinekMercer). You can find more about these similarity models in the Apache Lucene Javadocs, in Mastering Elasticsearch Second Edition, published by Packt Publishing, or in the official Elasticsearch documentation available at https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html.
Configuring default similarity
The default similarity allows us to provide an additional discount_overlaps property. It allows us to control whether tokens on the same position in the token stream (with a position increment of 0) are omitted during score calculation. By default, it is set to true, which means that tokens on the same positions are omitted; if you want them to be counted, you can set the property to false. For example, the following command shows how to create an index with the discount_overlaps property changed for the default similarity:
curl -XPUT 'localhost:9200/test_similarity' -d '{ "settings" : { "similarity" : { "altered_default": { "type" : "default", "discount_overlaps" : false } } }, "mappings": { "doc": { "properties": { "name": { "type" : "string", "similarity": "altered_default" } } } } }'
Configuring BM25 similarity
Even though we don't need to configure the BM25 similarity, we can provide some additional options to tune its behavior. The BM25 similarity allows us to provide the discount_overlaps property, similar to the default similarity, and two additional properties: k1 and b. The k1 property specifies the term frequency normalization factor, and the b property value determines to what degree the document length will normalize the term frequency values.
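A hedged sketch of an index that registers a tuned BM25 similarity (the test_bm25 index name, the custom_bm25 similarity name, and the k1 and b values are our own illustrative choices) could look as follows:
curl -XPUT 'localhost:9200/test_bm25' -d '{ "settings" : { "similarity" : { "custom_bm25": { "type" : "BM25", "k1" : 1.2, "b" : 0.75 } } }, "mappings": { "doc": { "properties": { "name": { "type" : "string", "similarity": "custom_bm25" } } } } }'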
Configuring DFR similarity
In the case of the DFR similarity, we can configure the basic_model property (which can take the value be, d, g, if, in, p, or ine), the after_effect property (with values of no, b, or l), and the normalization property (which can be no, h1, h2, h3, or z). If we choose a normalization value other than no, we need to set the normalization factor.
Depending on the chosen normalization value, we should use normalization.h1.c (a float value) for the h1 normalization, normalization.h2.c (a float value) for the h2 normalization, normalization.h3.c (a float value) for the h3 normalization, and normalization.z.z (a float value) for the z normalization. For example, the following is how the example similarity configuration will look (we put this into the settings section of our mappings file):
"similarity" : { "esserverbook_dfr_similarity" : { "type" : "DFR", "basic_model" : "g", "after_effect" : "l", "normalization" : "h2", "normalization.h2.c" : "2.0" } }
Configuring IB similarity
In the case of the IB similarity, we can configure the distribution property (which can take the value of ll or spl) and the lambda property (which can take the value of df or ttf). In addition to that, we can choose the normalization factor, which is the same as for the DFR similarity, so we'll omit describing it a second time. The following is how the example IB similarity configuration will look (we put this into the settings section of our mappings file):
"similarity" : { "esserverbook_ib_similarity" : { "type" : "IB", "distribution" : "ll", "lambda" : "df", "normalization" : "z", "normalization.z.z" : "0.25" } }