Elasticsearch Server (Third Edition)

Mappings configuration

If you are used to SQL databases, you know that before you can start inserting data, you need to create a schema that describes what your data looks like. Although Elasticsearch is a schema-less search engine (we prefer to call it a data-driven schema) and can figure out the data structure on the fly, we think that controlling the structure, and thus defining it ourselves, is the better way. The field type determining mechanism cannot guess the future. For example, if you first send an integer value, such as 60, and then send a float value, such as 70.23, for the same field, Elasticsearch will not raise an error; it will cut off the decimal part of the float value. This happens because Elasticsearch sets the field type to an integer-based one when it sees the first value and then tries to index the float value into that field, which truncates the fractional part of the number. In the next few pages, you'll see how to create mappings that suit your needs and match your data structure.

Note

Note that we didn't include all the information about the available types in this chapter and some features of Elasticsearch, such as nested type, parent-child handling, storing geographical points, and search, are described in the following chapters of this book.

Type determining mechanism

Before we start describing how to create mappings manually, we want to get back to the automatic type determining algorithm used in Elasticsearch. As we already said, Elasticsearch can try guessing the schema for our documents by looking at the JSON that the document is built from. Because JSON is structured, that seems easy to do. For example, strings are surrounded by quotation marks, Booleans are defined using specific words, and numbers are just a few digits. This is a simple trick, but it usually works. For example, let's look at the following document:

{
  "field1": 10,
  "field2": "10"
}

The preceding document has two fields. The field1 field will be given a numeric type (to be precise, the long type). The second field, called field2, will be given the string type, because its value is surrounded by quotation marks. Of course, for some use cases this can be the desired behavior. However, if we surrounded all the data with quotation marks (which is not the best idea anyway), our index structure would contain only string type fields.
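
The guessing rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Elasticsearch runs; the guess_field_types function and the type names it returns are assumptions made for this example.

```python
import json


def guess_field_types(document_json):
    """Guess an Elasticsearch-like type name for each top-level JSON field."""
    guessed = {}
    for name, value in json.loads(document_json).items():
        # bool must be tested before int: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            guessed[name] = "boolean"
        elif isinstance(value, int):
            guessed[name] = "long"
        elif isinstance(value, float):
            guessed[name] = "double"
        elif isinstance(value, str):
            guessed[name] = "string"
        else:
            # Nested objects, arrays, and nulls are out of scope for this sketch.
            guessed[name] = "other"
    return guessed


print(guess_field_types('{"field1": 10, "field2": "10"}'))
```

The value 10 is reported as long and the quoted "10" as string, which mirrors the behavior described for field1 and field2.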

Note

Don't worry if you are not yet familiar with the numeric types, the string types, and so on. We will describe them after we show you what you can do to tune the automatic type determining mechanism in Elasticsearch.

Disabling the type determining mechanism

The first solution is to completely disable the schema-less behavior in Elasticsearch. We can do that by adding the index.mapper.dynamic property to our index properties and setting it to false, for example, by running the following command to create the index:

curl -XPUT 'localhost:9200/sites' -d '{
 "index.mapper.dynamic": false
}'

By doing that, we told Elasticsearch that we don't want it to guess the types of the documents in the sites index and that we will provide the mappings ourselves. If we try to index an example document into the sites index, we will get the following error:

{
  "error" : {
    "root_cause" : [ {
      "type" : "type_missing_exception",
      "reason" : "type[[doc, trying to auto create mapping, but dynamic mapping is disabled]] missing",
      "index" : "sites"
    } ],
    "type" : "type_missing_exception",
    "reason" : "type[[doc, trying to auto create mapping, but dynamic mapping is disabled]] missing",
    "index" : "sites"
  },
  "status" : 404
}

This is because we didn't create any mappings – no schema for documents was created. Elasticsearch couldn't create one for us because we didn't allow it, so the indexing command failed.

Of course this is not the only thing we can do when it comes to configuring how the type determining mechanism works. We can also tune it or disable it for a given type on the object level. We will talk about the second case in Chapter 5, Extending Your Index Structure. For now, let's look at the possibilities of tuning type determining mechanism in Elasticsearch.

Tuning the type determining mechanism for numeric types

One of the problems with JSON documents and type guessing is that we are not always in control of the data. The documents that we are indexing can come from multiple places, and some systems in our environment may include quotation marks for all the fields in the document. This can lead to problems and bad guesses. Because of that, Elasticsearch allows us to enable more aggressive field value checking for numeric fields by setting the numeric_detection property to true in the mappings definition. For example, let's assume that we want to create an index called users with the user type, for which we want more aggressive numeric field parsing. To do that, we will use the following command:

curl -XPUT 'http://localhost:9200/users/?pretty' -d '{
  "mappings" : {
    "user": {
      "numeric_detection" : true
    }
  }
}'

Now let's run the following command to index a single document to the users index:

curl -XPOST http://localhost:9200/users/user/1 -d '{"name": "User 1", "age": "20"}'

Earlier, with the default settings, the age field would be set to string type. With the numeric_detection property set to true, the type of the age field will be set to long. We can check that by running the following command (it will retrieve the mappings for all the types in the users index):

curl -XGET 'localhost:9200/users/_mapping?pretty'

The preceding command should result in the following response returned by Elasticsearch:

{
  "users" : {
    "mappings" : {
      "user" : {
        "numeric_detection" : true,
        "properties" : {
          "age" : {
            "type" : "long"
          },
          "name" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

As we can see, the age field was really set to be of type long.
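
To make the effect of numeric_detection more tangible, here is a minimal Python sketch of the decision for a quoted value. It assumes the check simply tries an integer parse first and a floating point parse second; the detect_string_value function is hypothetical and Python's parsing rules are not identical to the ones used by Elasticsearch.

```python
def detect_string_value(value, numeric_detection=False):
    """Return the type name a quoted JSON value would get in this sketch."""
    if numeric_detection:
        try:
            int(value)  # whole numbers first
            return "long"
        except ValueError:
            pass
        try:
            float(value)  # then numbers with a fractional part
            return "double"
        except ValueError:
            pass
    # Without numeric detection, a quoted value always stays a string.
    return "string"


print(detect_string_value("20"))
print(detect_string_value("20", numeric_detection=True))
print(detect_string_value("User 1", numeric_detection=True))
```

With the flag off, "20" stays a string; with the flag on it becomes long, while "User 1" still ends up as a string because it cannot be parsed as a number.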

Tuning the type determining mechanism for dates

Another type of data that causes trouble is dates. Dates can come in different flavors; for example, 2015-10-01 11:22:33 is a proper date and so is 2015-10-01T11:22:33+00. Because of that, Elasticsearch tries to match the fields to timestamps or strings that match some given date format. If that matching operation is successful, the field is treated as a date-based one. If we know how our date fields look, we can help Elasticsearch by providing a list of recognized date formats using the dynamic_date_formats property, which takes an array of formats. Let's look at the following command for creating an index:

curl -XPUT 'http://localhost:9200/blog/' -d '{
  "mappings" : {
    "article" : {
      "dynamic_date_formats" : ["yyyy-MM-dd hh:mm"]
    }
  }
}'

The preceding command will result in the creation of an index called blog with the single type called article. We've also used the dynamic_date_formats property with a single date format that will result in Elasticsearch using the date core type (refer to the Core types section in this chapter for more information about field types) for fields matching the defined format. Elasticsearch uses the joda-time library to define the date formats, so visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html if you are interested in knowing about them.

Note

Remember that the dynamic_date_formats property accepts an array of values. That means that we can handle several date formats simultaneously.

With the preceding index, we can now try indexing a new document using the following command:

curl -XPUT localhost:9200/blog/article/1 -d '{"name": "Test", "test_field":"2015-10-01 11:22"}'

Elasticsearch will of course index that document, but let's look at the mappings created for our index:

curl -XGET 'localhost:9200/blog/_mapping?pretty'

The response for the preceding command will be as follows:

{
  "blog" : {
    "mappings" : {
      "article" : {
        "dynamic_date_formats" : [ "yyyy-MM-dd hh:mm" ],
        "properties" : {
          "name" : {
            "type" : "string"
          },
          "test_field" : {
            "type" : "date",
            "format" : "yyyy-MM-dd hh:mm"
          }
        }
      }
    }
  }
}

As we can see, the test_field field was given a date type, so our tuning works.
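
The matching step can be sketched in Python as trying each configured format in turn. This is an illustration under stated assumptions: Python's strptime directives are not Joda-Time patterns, so %Y-%m-%d %I:%M is used here only as a rough counterpart of yyyy-MM-dd hh:mm, and the detect_date function is hypothetical.

```python
from datetime import datetime

# Rough strptime counterpart of the Joda-Time pattern "yyyy-MM-dd hh:mm".
# Both hh and %I denote a 12-hour clock, not the hour of the day.
DYNAMIC_DATE_FORMATS = ["%Y-%m-%d %I:%M"]


def detect_date(value, formats):
    """Return "date" if the text matches any of the formats, else "string"."""
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            continue  # try the next format in the list
    return "string"


print(detect_date("2015-10-01 11:22", DYNAMIC_DATE_FORMATS))
print(detect_date("Test", DYNAMIC_DATE_FORMATS))
```

In this sketch, a value such as 2015-10-01 13:22 is not recognized, because the 12-hour directive rejects the hour 13. If your data uses a 24-hour clock, a pattern based on the hour of the day (HH in Joda-Time) is probably what you want.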

Unfortunately, the problem still exists if we want the Boolean type to be guessed. There is no option to force the guessing of Boolean types from the text. In such cases, when a change of source format is impossible, we can only define the field directly in the mappings definition.

Index structure mapping

Every data set has its own structure – some are very simple, and some include complicated object relations, child documents, and nested properties. In each case, we need a schema in Elasticsearch, called mappings, that defines how the data looks. Of course, we can use the schema-less nature of Elasticsearch, but we can, and usually want to, prepare the mappings upfront, so we know how the data is handled.

For the purposes of this chapter, we will use a single type in the index. Of course, Elasticsearch as a multitenant system allows us to have multiple types in a single index, but we want to simplify the example, to make it easier to understand. So, for the purpose of the next few pages, we will create an index called posts that will hold data for documents in a post type. We also assume that the index will hold the following information:

  • Unique identifier of the blog post
  • Name of the blog post
  • Publication date
  • Contents – text of the post itself

In Elasticsearch, mappings, as with almost all communication, are sent as JSON objects in the request body. So, if we want to create the simplest mappings that match our needs, they will look as follows (we stored the mappings in the posts.json file, so we can easily send it):

{
  "mappings": {
    "post": {
      "properties": {
        "id": { "type":"long" },
        "name": { "type":"string" },
        "published": { "type":"date" },
        "contents": { "type":"string" }
      }
    }
  }
}

To create our posts index with the preceding mappings file, we will just run the following command:

curl -XPOST 'http://localhost:9200/posts' -d @posts.json

Note

Note that you can store your mappings in a file with any name you like. The curl command will just send its contents.

And again, if everything goes well, we see the following response:

{"acknowledged":true}

Elasticsearch reported that our index has been created. If we look at the logs of the Elasticsearch node that is the current master, we will see something like the following:

[2015-10-14 15:02:12,840][INFO ][cluster.metadata ] [Shalla-Bal] [posts] creating index, cause [api], templates [], shards [5]/[1], mappings [post]

We can see that the posts index has been created, with 5 shards and 1 replica (shards [5]/[1]) and with mappings for a single post type (mappings [post]). Let's now discuss the contents of the posts.json file and the possibilities when it comes to mappings.
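
If you prefer to build the request in code rather than keep a mappings file, the same index creation call can be sketched in Python with only the standard library. The build_create_index_request helper and the Content-Type header are assumptions made for this example; the request is only constructed here, because sending it requires a running node on localhost:9200.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, mappings):
    """Build (but do not send) an index creation request with mappings."""
    body = json.dumps({"mappings": mappings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


post_mapping = {
    "post": {
        "properties": {
            "id": {"type": "long"},
            "name": {"type": "string"},
            "published": {"type": "date"},
            "contents": {"type": "string"},
        }
    }
}

request = build_create_index_request("http://localhost:9200", "posts", post_mapping)
print(request.get_method(), request.full_url)
# To actually create the index: urllib.request.urlopen(request)
```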

Type and types definition

The mappings definition in Elasticsearch is just another JSON object, so it needs to be properly started and ended with curly brackets. All the mappings definitions are nested inside a single mappings object. In our example, we had a single post type, but we can have multiple of them. For example, if we would like to have more than a single type in our mappings, we just need to separate them with a comma character. Let's assume that we would like to have an additional user type in our posts index. The mappings definition in such case will look as follows (we stored it in the posts_with_user.json file):

{
  "mappings": {
    "post": {
      "properties": {
        "id": { "type":"long" },
        "name": { "type":"string" },
        "published": { "type":"date" },
        "contents": { "type":"string" }
      }
    },
    "user": {
      "properties": {
        "id": { "type":"long" },
        "name": { "type":"string" }
      }
    }
  }
}

As you can see, we can name the types with the names we want. Under each type we have the properties object in which we store the actual name of the fields and their definition.

Fields

Each field in the mappings definition is just a name and an object describing the properties of the field. For example, we can have a field defined as the following:

"body": { "type":"string", "store":"yes", "index":"analyzed" }

The preceding field definition starts with a name – body. After that we have an object with three properties – the type of the field (the type property), if the original field value should be stored (the store property), and if the field should be indexed and how (the index property). And, of course, multiple field definitions are separated from each other using the comma character, just like other JSON objects.

Core types

Each field type in Elasticsearch can be given one of the provided core types. The core types in Elasticsearch are as follows:

  • String
  • Number (integer, long, float, double)
  • Date
  • Boolean
  • Binary

In addition to the core types, Elasticsearch provides additional types that can handle more complicated data – such as nested documents, object, and so on. We will talk about them in Chapter 5, Extending Your Index Structure.

Common attributes

Before continuing with all the core type descriptions, we would like to discuss some common attributes that you can use to describe all the types (except for the binary one):

  • index_name: This attribute defines the name of the field that will be stored in the index. If this is not defined, the name will be set to the name of the object that the field is defined with. Usually, you don't need to set this property, but it may be useful in some cases; for example, when you don't have control over the name of the fields in the JSON documents that are sent to Elasticsearch.
  • index: This attribute can take the values analyzed and no; for string-based fields, it can also be set to not_analyzed. If set to analyzed, the field will be indexed and thus searchable. If set to no, you won't be able to search on such a field. The default value is analyzed. For string-based fields, the additional not_analyzed value means that the field will be indexed but not analyzed. The field is then written to the index as it was sent to Elasticsearch, and only a perfect match will be counted during a search – the query will have to include exactly the same value as the value in the index. Compared to the SQL databases world, setting the index property of a field to not_analyzed works just like using where field = value. Also remember that setting the index property to no excludes that field from the _all field (the include_in_all property is discussed as the last property in this list).
  • store: This attribute can take the values yes and no and specifies if the original value of the field should be written into the index. The default value is no, which means that Elasticsearch won't store the original value of the field and will try to use the _source field (the JSON representing the original document that has been sent to Elasticsearch) when you want to retrieve the field value. Stored fields are not used for searching; however, they can be used for highlighting if enabled (which may be more efficient than loading the _source field when it is big).
  • doc_values: This attribute can take the values of true and false. When set to true, Elasticsearch will create a special on disk structure during indexation for not tokenized fields (like not analyzed string fields, number based fields, Boolean fields, and date fields). This structure is highly efficient and is used by Elasticsearch for operations that require un-inverted data, such as aggregations, sorting, or scripting. Starting with Elasticsearch 2.0 the default value of this is true for not tokenized fields. Setting this value to false will result in Elasticsearch using field data cache instead of doc values, which has higher memory demand, but may be faster in some rare situations.
  • boost: This attribute defines how important the field is inside the document; the higher the boost, the more important the values in the field are. The default value of this attribute is 1, which means a neutral value – anything above 1 will make the field more important, anything less than 1 will make it less important.
  • null_value: This attribute specifies a value that should be written into the index in case that field is not a part of an indexed document. The default behavior will just omit that field.
  • copy_to: This attribute specifies an array of fields to which the original value will be copied. This allows for different kinds of analysis of the same data. For example, you could imagine having two fields – one called title and one called title_sort, each having the same value but processed differently. We could use copy_to to copy the title field value to title_sort.
  • include_in_all: This attribute specifies if the field should be included in the _all field. The _all field is a special field used by Elasticsearch to allow easy searching in the contents of the whole indexed document. Elasticsearch creates the content of the _all field by copying all the document fields there. By default, if the _all field is used, all the fields will be included in it.

String

String is the basic text type which allows us to store one or more characters inside it. A sample definition of such a field is as follows:

"body" : { "type" : "string", "store" : "yes", "index" : "analyzed" }

In addition to the common attributes, the following attributes can also be set for the string-based fields:

  • term_vector: This attribute can take the values no (the default one), yes, with_offsets, with_positions, and with_positions_offsets. It defines whether or not to calculate the Lucene term vectors for that field. If you are using highlighting (marking which terms were matched in a document during the query), you will need to calculate the term vectors to use the so-called fast vector highlighting – a more efficient highlighting version.
  • analyzer: This attribute defines the name of the analyzer used for indexing and searching. It defaults to the globally-defined analyzer name.
  • search_analyzer: This attribute defines the name of the analyzer used for processing the part of the query string that is sent to a particular field.
  • norms.enabled: This attribute specifies whether the norms should be loaded for a field. By default, it is set to true for analyzed fields (which means that the norms will be loaded for such fields) and to false for non-analyzed fields. Norms are values inside the Lucene index that are used when calculating the score of a document; they are used only at query time and are usually not needed for non-analyzed fields. An example index creation command that disables norms for a single field looks as follows:
    curl -XPOST 'localhost:9200/essb' -d '{
      "mappings" : {
        "book" : {
          "properties" : {
            "name" : {
              "type" : "string",
              "norms" : {
                "enabled" : false
              }
            }
          }
        }
      }
    }'
  • norms.loading: This attribute takes the values eager and lazy and defines how Elasticsearch will load the norms. The first value means that the norms for such fields are always loaded. The second value means that the norms will be loaded only when needed. Norms are useful for scoring, but may require a vast amount of memory for large data sets. Having norms loaded eagerly (property set to eager) means less work during query time, but will lead to more memory consumption. An example index creation command that eagerly loads norms for a single field looks as follows:
    curl -XPOST 'localhost:9200/essb_eager' -d '{
      "mappings" : {
        "book" : {
          "properties" : {
            "name" : {
              "type" : "string",
              "norms" : {
                "loading" : "eager"
              }
            }
          }
        }
      }
    }'
  • position_offset_gap: This attribute defaults to 0 and specifies the gap in the index between instances of the given field with the same name. Setting this to a higher value may be useful if you want position-based queries (such as phrase queries) to match only inside a single instance of the field.
  • index_options: This attribute defines the indexing options for the postings list – the structure holding the terms (we talk more about it in the Postings format section of this chapter). The possible values are docs (only document numbers are indexed), freqs (document numbers and term frequencies are indexed), positions (document numbers, term frequencies, and their positions are indexed), and offsets (document numbers, term frequencies, their positions, and offsets are indexed). The default value for this property is positions for analyzed fields and docs for fields that are indexed but not analyzed.
  • ignore_above: This attribute defines the maximum size of the field in characters. A field whose size is above the specified value will be ignored by the analyzer.
    Note

    In one of the upcoming Elasticsearch versions, the string type may be deprecated and may be replaced by two new types, text and keyword, to better indicate what the string based field is representing. The text type will be used for analyzed text fields and the keyword type will be used for not analyzed text fields. If you are interested in the incoming changes, refer to the following GitHub issue: https://github.com/elastic/elasticsearch/issues/12394.

Number

This is the common name for a few core types that gather all the numeric field types that are available and waiting to be used. The following types are available in Elasticsearch (we specify them by using the type property):

  • byte: This type defines a byte value; for example, 1. It allows for values between -128 and 127 inclusive.
  • short: This type defines a short value; for example, 12. It allows for values between -32768 and 32767 inclusive.
  • integer: This type defines an integer value; for example, 134. It allows for values between -2^31 and 2^31-1 inclusive.
  • long: This type defines a long value; for example, 123456789. It allows for values between -2^63 and 2^63-1 inclusive.
  • float: This type defines a float value; for example, 12.23. For information about the possible values, refer to https://docs.oracle.com/javase/specs/jls/se8/html/jls-4.html#jls-4.2.3.
  • double: This type defines a double value; for example, 123.45. For information about the possible values, refer to https://docs.oracle.com/javase/specs/jls/se8/html/jls-4.html#jls-4.2.3.
    Note

    You can learn more about the mentioned Java types at http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html.

A sample definition of a field based on one of the numeric types is as follows:

"price" : { "type" : "float", "precision_step" : "4" }

In addition to the common attributes, the following ones can also be set for the numeric fields:

  • precision_step: This attribute defines the number of terms generated for each value in the numeric field. The lower the value, the higher the number of terms generated. For fields with a higher number of terms per value, range queries will be faster at the cost of a slightly larger index. The default value is 16 for long and double, 8 for integer, short, and float, and 2147483647 for byte.
  • coerce: This attribute defaults to true and can take the value of true or false. It defines if Elasticsearch should try to convert the string values to numbers for a given field and if the decimal parts of the float value should be truncated for the integer based fields.
  • ignore_malformed: This attribute can take the value true or false (which is the default). It should be set to true in order to omit the badly formatted values.
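
The ranges of the whole-number types follow directly from their bit widths, which the following Python sketch makes explicit. The value_range, fits, and coerce_to_integer helpers are hypothetical names used only for illustration; the last one mimics the truncation described for the coerce attribute.

```python
# Bit widths of the signed whole-number types described above.
BIT_WIDTHS = {"byte": 8, "short": 16, "integer": 32, "long": 64}


def value_range(type_name):
    """Return the (lowest, highest) value of a signed type with that width."""
    bits = BIT_WIDTHS[type_name]
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def fits(type_name, value):
    """Check whether a whole number fits into the given type."""
    low, high = value_range(type_name)
    return low <= value <= high


def coerce_to_integer(value):
    """Turn a string or float into a whole number, dropping the fraction."""
    return int(float(value))


print(value_range("byte"))
print(fits("integer", 2 ** 31))
print(coerce_to_integer("70.23"))
```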

Boolean

The boolean core type is designed for indexing the Boolean values (true or false). A sample definition of a field based on the boolean type is as follows:

"allowed" : { "type" : "boolean", "store": "yes" }

Binary

The binary field is a Base64 representation of the binary data stored in the index. You can use it to store data that is normally written in binary form, such as images. Fields based on this type are by default stored and not indexed, so you can only retrieve them and not perform search operations on them. The binary type only supports the index_name, type, store, and doc_values properties. A sample field definition based on the binary type may look like the following:

"image" : { "type" : "binary" }

Date

The date core type is designed to be used for date indexing. The date in the field allows us to specify a format that will be recognized by Elasticsearch. It is worth noting that all the dates are indexed in UTC and are internally indexed as long values. In addition to that, for the date based fields, Elasticsearch accepts long values representing UTC milliseconds since epoch regardless of the format specified for the date field.

The default date format recognized by Elasticsearch is quite universal and allows us to provide the date and optionally the time; for example, 2012-12-24T12:10:22. A sample definition of a field based on the date type is as follows:

"published" : { "type" : "date", "format" : "yyyy-MM-dd" }

A sample document that uses the above date field with the specified format is as follows:

{
  "name" : "Sample document",
  "published" : "2012-12-22"
}

In addition to the common attributes, the following ones can also be set for the fields based on the date type:

  • format: This attribute specifies the format of the date. The default value is dateOptionalTime. For a full list of formats, visit https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html.
  • precision_step: This attribute defines the number of terms generated for each value in the numeric field. Refer to the numeric core type description for more information about this parameter.
  • numeric_resolution: This attribute defines the unit of time that Elasticsearch will use when a numeric value is passed to the date based field instead of the date following a format. By default, Elasticsearch uses the milliseconds value, which means that the numeric value will be treated as milliseconds since epoch. Another value is seconds.
  • ignore_malformed: This attribute can take the value true or false. The default value is false. It should be set to true in order to omit badly formatted values.
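
Because dates are indexed internally as long values in UTC, it can help to see the conversion for the sample document. The following Python sketch assumes the date text carries no time zone and is therefore interpreted as UTC; to_epoch_millis is a hypothetical helper, and the strptime format %Y-%m-%d stands in for a year-month-day date pattern.

```python
from datetime import datetime, timezone


def to_epoch_millis(date_text, fmt="%Y-%m-%d"):
    """Convert a formatted date to UTC milliseconds since the epoch."""
    parsed = datetime.strptime(date_text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


millis = to_epoch_millis("2012-12-22")
print(millis)

# The reverse direction: the numeric value converted back to a readable date.
print(datetime.fromtimestamp(millis / 1000, tz=timezone.utc).strftime("%Y-%m-%d"))
```

The number printed first is the kind of value that can be sent to a date field directly, regardless of the configured format, when numeric_resolution is left at milliseconds.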

Multi fields

There are situations where we need to have the same field analyzed differently. For example, one for sorting, one for searching, and one for analysis with aggregations, but all using the same field value, just indexed differently. We could of course use the previously described field value copying, but we can also use so called multi fields. To be able to use that feature of Elasticsearch, we need to define an additional property in our field definition called fields. The fields is an object that can contain one or more additional fields that will be present in our index and will have the value of the field that they are assigned to. For example, if we would like to have aggregations done on the name field and in addition to that search on that field, we would define it as follows:

"name": {
 "type": "string",
 "fields": {
 "agg": { "type" : "string", "index": "not_analyzed" }
 }
}

The preceding definition will create two fields – one called name and the second called name.agg. Of course, you don't have to specify two separate fields in the data you are sending to Elasticsearch – a single one named name is enough. Elasticsearch will do the rest, which means copying the value of the field to all the fields from the preceding definition.
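
The difference between the two resulting fields can be sketched in Python. This is a simplification under stated assumptions: the analyzed variant is approximated by lowercasing and splitting on non-word characters, which is not the exact behavior of any Elasticsearch analyzer, and index_multi_field is a hypothetical function.

```python
import re


def index_multi_field(value):
    """Show the terms that each variant of the name field would hold."""
    return {
        # Analyzed: split into lowercased tokens, good for full text search.
        "name": re.findall(r"\w+", value.lower()),
        # Not analyzed: one unchanged term, good for aggregations and sorting.
        "name.agg": [value],
    }


print(index_multi_field("Elasticsearch Server"))
```

A single value sent in the name field ends up as two tokens in name and as one untouched term in name.agg.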

The IP address type

The ip field type was added to Elasticsearch to simplify the use of IPv4 addresses in a numeric form. This field type allows us to search data that is indexed as an IP address, sort on such data, and use range queries using IP values.

A sample definition of a field based on the ip type is as follows:

"address" : { "type" : "ip" }

In addition to the common attributes, the precision_step attribute can also be set for the ip type based fields. Refer to the numeric type description for more information about that property.

A sample document that uses the ip based field looks as follows:

{
  "name" : "Tom PC",
  "address" : "192.168.2.123"
}
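
The numeric form of an IPv4 address, and why it makes range queries possible, can be illustrated with Python's standard ipaddress module. The ip_to_number and in_range helpers are hypothetical and only sketch the idea; they say nothing about how Elasticsearch stores the value internally.

```python
import ipaddress


def ip_to_number(address):
    """Convert a dotted IPv4 address to its 32-bit numeric form."""
    return int(ipaddress.IPv4Address(address))


def in_range(address, low, high):
    """A range check on addresses becomes a plain numeric comparison."""
    return ip_to_number(low) <= ip_to_number(address) <= ip_to_number(high)


print(ip_to_number("192.168.2.123"))
print(in_range("192.168.2.123", "192.168.2.0", "192.168.2.255"))
```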

Token count type

The token_count field type allows us to store and index information about how many tokens the given field has instead of storing and indexing the text provided to the field. It accepts the same configuration options as the number type, but in addition to that, we need to specify the analyzer that will be used to divide the field value into tokens. We do that by using the analyzer property.

A sample definition of a field based on the token_count field type looks as follows:

"title_count" : { "type" : "token_count", "analyzer" : "standard" }

Using analyzers

The great thing about Elasticsearch is that it leverages the analysis capabilities of Apache Lucene. This means that for fields that are based on the string type, we can specify which analyzer Elasticsearch should use. As you remember from the Full text searching section of Chapter 1, Getting Started with Elasticsearch Cluster, the analyzer is a functionality that is used to analyze data or queries in the way we want. For example, when we divide words on the basis of whitespace and lowercase all the characters, we don't have to worry about whether users send words in lowercase or uppercase. This means that Elasticsearch, elasticsearch, and ElAstIcSeaRCh will be treated as the same word. What's more, Elasticsearch allows us not only to use the analyzers provided out of the box, but also to create our own configurations. We can also use different analyzers at the time of indexing and at the time of querying – we can choose how we want our data to be processed at each stage of the search process. Let's now have a look at the analyzers provided by Elasticsearch and at Elasticsearch analysis functionality in general.

Out-of-the-box analyzers

Elasticsearch allows us to use one of the many analyzers defined by default. The analyzers available out of the box include standard, simple, whitespace, stop, keyword, pattern, snowball, and the language-specific analyzers.

Defining your own analyzers

In addition to the analyzers mentioned previously, Elasticsearch allows us to define new ones without the need for writing a single line of Java code. In order to do that, we need to add an additional section to our mappings file; that is, the settings section, which holds additional information used by Elasticsearch during index creation. The following code snippet shows how we can define our custom settings section:

"settings" : {
  "index" : {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "ourEnglishFilter"
          ]
        }
      },
      "filter": {
        "ourEnglishFilter": {
          "type": "kstem"
        }
      }
    }
  }
}

We specified that we want a new analyzer named en to be present. Each analyzer is built from a single tokenizer and multiple filters. A complete list of the default filters and tokenizers can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html. Our en analyzer includes the standard tokenizer and three filters: asciifolding and lowercase, which are the ones available by default, and a custom ourEnglishFilter, which is a filter we have defined.

To define a filter, we need to provide its name, its type (the type property), and any additional parameters required by that filter type. The full list of filter types available in Elasticsearch can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html. Please be aware that we won't discuss each filter, because the list of filters is constantly changing. If you are interested in the full list, please refer to the mentioned page in the documentation.
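The tokenizer-plus-filters structure can be sketched in a few lines of Python. This is an illustration of the chaining idea only: the regular expression is a crude approximation of the standard tokenizer, and naive_plural_stemmer is a hypothetical placeholder that is far simpler than the real kstem filter.

```python
import re
import unicodedata


def standard_like_tokenizer(text):
    # Crude approximation: every run of word characters becomes one token.
    return re.findall(r"\w+", text)


def ascii_folding(tokens):
    # Decompose accented characters and drop the combining marks.
    return [
        "".join(ch for ch in unicodedata.normalize("NFKD", token)
                if not unicodedata.combining(ch))
        for token in tokens
    ]


def lowercase(tokens):
    return [token.lower() for token in tokens]


def naive_plural_stemmer(tokens):
    # Hypothetical stand-in for kstem: strip one trailing "s" from longer words.
    return [
        token[:-1]
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss")
        else token
        for token in tokens
    ]


def build_analyzer(tokenizer, filters):
    """Combine one tokenizer with an ordered list of token filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


# Same order as in the "en" analyzer: asciifolding, lowercase, then stemming.
en = build_analyzer(standard_like_tokenizer,
                    [ascii_folding, lowercase, naive_plural_stemmer])
print(en("Caf\u00e9 Robots cars"))
```

The order of the filters matters, because each filter receives the output of the previous one.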

So, the final mappings file with our custom analyzer defined will be as follows:

{
  "settings" : {
    "index" : {
      "analysis": {
        "analyzer": {
          "en": {
            "tokenizer": "standard",
            "filter": [
             "asciifolding",
             "lowercase",
             "ourEnglishFilter"
            ]
          }
        },
        "filter": {
          "ourEnglishFilter": {
            "type": "kstem"
          }
        }
      }
    }
  },
  "mappings" : {
    "post" : {
      "properties" : { 
        "id": { "type" : "long" },
        "name": { "type" : "string", "analyzer": "en" } 
      }
    }
  }
}

If we save the preceding mappings to a file called posts_mappings.json, we can run the following command to create the posts index:

curl -XPOST 'http://localhost:9200/posts' -d @posts_mappings.json
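Before sending such a file, it can be useful to check that every analyzer referenced in the mappings is actually defined in the settings section. The following Python sketch does that for the structure used above; the undefined_analyzers helper and the BUILT_IN_ANALYZERS set (a deliberately partial list) are assumptions made for this example and are not part of Elasticsearch.

```python
# Partial, illustrative list of analyzer names that need no definition.
BUILT_IN_ANALYZERS = {"standard", "simple", "whitespace", "stop", "keyword"}

body = {
    "settings": {"index": {"analysis": {
        "analyzer": {"en": {
            "tokenizer": "standard",
            "filter": ["asciifolding", "lowercase", "ourEnglishFilter"],
        }},
        "filter": {"ourEnglishFilter": {"type": "kstem"}},
    }}},
    "mappings": {"post": {"properties": {
        "id": {"type": "long"},
        "name": {"type": "string", "analyzer": "en"},
    }}},
}


def undefined_analyzers(request_body):
    """Return (type, field, analyzer) triples whose analyzer is not defined."""
    analysis = request_body.get("settings", {}).get("index", {}).get("analysis", {})
    defined = set(analysis.get("analyzer", {}))
    missing = []
    for type_name, type_mapping in request_body.get("mappings", {}).items():
        for field, definition in type_mapping.get("properties", {}).items():
            analyzer = definition.get("analyzer")
            if analyzer and analyzer not in defined | BUILT_IN_ANALYZERS:
                missing.append((type_name, field, analyzer))
    return missing


print(undefined_analyzers(body))
```

An empty list means that every field points at an analyzer that the index will know about.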

We can see how our analyzer works by using the Analyze API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html). For example, let's look at the following command:

curl -XGET 'localhost:9200/posts/_analyze?pretty&field=name' -d 'robots cars'

The command asks Elasticsearch to show the content of the analysis of the given phrase (robots cars) with the use of the analyzer defined for the post type and its name field. The response that we will get from Elasticsearch is as follows:

{
  "tokens" : [ {
    "token" : "robot",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "car",
    "start_offset" : 7,
    "end_offset" : 11,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

As you can see, the robots cars phrase was divided into two tokens. In addition to that, the word robots was changed to robot and the word cars was changed to car.
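Note that the start_offset and end_offset values in the response refer to the original text and not to the stemmed tokens. A small Python sketch shows where those numbers come from; it uses a regular expression as a rough stand-in for the standard tokenizer and does no stemming, so only the offsets and positions are comparable to the real response.

```python
import re


def tokens_with_offsets(text):
    """Return each word with its character offsets and token position."""
    return [
        {
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        }
        for position, match in enumerate(re.finditer(r"\w+", text))
    ]


for token in tokens_with_offsets("robots cars"):
    print(token)
```

The word robots spans characters 0 to 6 and cars spans 7 to 11, which matches the offsets reported for the robot and car tokens.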

Default analyzers

There is one more thing to say about analyzers. Elasticsearch allows us to specify the analyzer that should be used by default if no analyzer is defined. This is done in the same way as we configured a custom analyzer in the settings section of the mappings file, but instead of specifying a custom name for the analyzer, a default keyword should be used. So to make our previously defined analyzer the default, we can change the en analyzer to the following:

{
  "settings" : {
    "index" : {
      "analysis": {
        "analyzer": {
          "default": {
            "tokenizer": "standard",
            "filter": [
             "asciifolding",
             "lowercase",
             "ourEnglishFilter"
            ]
          }
        },
        "filter": {
          "ourEnglishFilter": {
            "type": "kstem"
          }
        }
      }
    }
  }
}

We can also choose one default analyzer for indexing and a different one for searching. To do that, instead of using the default keyword as the analyzer name, we should use default_index and default_search respectively.
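As a sketch of what such a settings section could look like, the following Python snippet builds it as a dictionary and prints it as JSON. The analyzer names come from the text; the two filter chains are arbitrary choices made for illustration. They differ only to make the separation visible, and in practice stemming on one side only would usually hurt matching.

```python
import json

# Index-time chain: the custom chain defined earlier in this section.
index_time = {
    "tokenizer": "standard",
    "filter": ["asciifolding", "lowercase", "ourEnglishFilter"],
}

# Query-time chain: a hypothetical variation, chosen only for illustration.
search_time = {
    "tokenizer": "standard",
    "filter": ["asciifolding", "lowercase"],
}

settings = {
    "settings": {"index": {"analysis": {
        "analyzer": {
            "default_index": index_time,
            "default_search": search_time,
        },
        "filter": {"ourEnglishFilter": {"type": "kstem"}},
    }}}
}

print(json.dumps(settings, indent=2))
```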

Different similarity models

With the release of Apache Lucene 4.0 in 2012, all the users of this great full text search library were given the opportunity to alter the default TF/IDF-based algorithm and use a different one (we've mentioned it in the Full text searching section of Chapter 1, Getting Started with Elasticsearch Cluster). Because of that we are able to choose a similarity model in Elasticsearch, which basically allows us to use different scoring formulas for our documents.

Note

Note that the similarity models topic ranges from intermediate to advanced and in most cases the TF/IDF based algorithm will be sufficient for your use case. However, we decided to have it described in the book, so you know that you have the possibility of changing the scoring algorithm behavior if needed.

Setting per-field similarity

Since Elasticsearch 0.90, we are allowed to set a different similarity for each of the fields that we have in our mappings file. For example, let's assume that we have the following simple mappings that we use in order to index the blog posts:

{
  "mappings" : {
    "post" : {
      "properties" : {
        "id" : { "type" : "long" },
        "name" : { "type" : "string" },
        "contents" : { "type" : "string" }
      }
    }
  }
}

Let's assume that we want to use the BM25 similarity model for the name field and the contents field. In order to do that, we need to extend our field definitions and add the similarity property with the value of the chosen similarity name. Our changed mappings will look like the following:

{
  "mappings" : {
    "post" : {
      "properties" : {
        "id" : { "type" : "long" },
        "name" : { "type" : "string", "similarity" : "BM25" },
        "contents" : { "type" : "string", "similarity" : "BM25" }
      }
    }
  }
}

And that's all, nothing more is needed. After the above change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and the contents fields.
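If the mappings are generated by a script, the same change can be applied programmatically. The helper below is a minimal sketch; the with_similarity name is made up for this example, and it only handles the flat type and properties layout shown above.

```python
import copy

mappings = {
    "mappings": {"post": {"properties": {
        "id": {"type": "long"},
        "name": {"type": "string"},
        "contents": {"type": "string"},
    }}}
}


def with_similarity(source, similarity, fields):
    """Return a copy of the mappings with similarity set on the given fields."""
    updated = copy.deepcopy(source)
    for type_mapping in updated["mappings"].values():
        for name, definition in type_mapping["properties"].items():
            if name in fields:
                definition["similarity"] = similarity
    return updated


changed = with_similarity(mappings, "BM25", {"name", "contents"})
print(changed["mappings"]["post"]["properties"])
```

Working on a copy keeps the original mappings untouched, so both variants can be compared or reused.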

Available similarity models

There are at least five new similarity models available. For most of the use cases, apart from the default one, you may find the following models useful:

  • Okapi BM25 model: This similarity model is based on a probabilistic model that estimates the probability of finding a document for a given query. In order to use this similarity in Elasticsearch, you need to set the similarity property for a field to BM25. The Okapi BM25 similarity is said to perform best when dealing with short text documents where term repetitions are especially hurtful to the overall document score. This similarity is defined out of the box and doesn't need additional properties to be set.
  • Divergence from randomness model: This similarity model is based on the probabilistic model of the same name. In order to use this similarity in Elasticsearch, you need to use the DFR name. It is said that the divergence from randomness similarity model performs well on text that is similar to natural language.
  • Information-based model: This is the last of the newly introduced similarity models and is very similar to the divergence from randomness model. In order to use this similarity in Elasticsearch, you need to use the IB name. Similar to the DFR similarity, it is said that the information-based model performs well on data similar to natural language text.

The two other similarity models currently available are the LM Dirichlet similarity (to use it, set the type property to LMDirichlet) and the LM Jelinek Mercer similarity (to use it, set the type property to LMJelinekMercer). You can find more about these similarity models in the Apache Lucene Javadocs, in Mastering Elasticsearch Second Edition, published by Packt Publishing, or in the official Elasticsearch documentation available at https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html.

Configuring default similarity

The default similarity allows us to provide an additional discount_overlaps property. It allows us to control if the tokens on the same positions in the token stream (with position increment of 0) are omitted during score calculation. By default, it is set to true, which means that the tokens on the same positions are omitted; if you want them to be counted, you can set that property to false. For example, the following command shows how to create an index with the discount_overlaps property changed for the default similarity:

curl -XPUT 'localhost:9200/test_similarity' -d '{
 "settings" : {
  "similarity" : {
   "altered_default": {
    "type" : "default",
    "discount_overlaps" : false
   }
  }
 },
 "mappings": {
  "doc": {
   "properties": {
    "name": { "type" : "string", "similarity": "altered_default" }
   }
  }
 }
}'

Configuring BM25 similarity

Even though we don't need to configure the BM25 similarity, we can provide some additional options to tune its behavior. The BM25 similarity allows us to provide the discount_overlaps property similar to the default similarity and two additional properties: k1 and b. The k1 property specifies the term frequency normalization factor and the b property value determines to what degree the document length will normalize the term frequency values.
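To get a feeling for what k1 and b do, the following Python sketch implements the textbook form of the BM25 score for a single term. It is an approximation for illustration and may differ in details from what Lucene computes internally; the function name and its arguments are made up for this example, and the defaults of 1.2 for k1 and 0.75 for b are the commonly used values.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    # Rare terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly the document length scales the term frequency.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding to the score.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


stats = {"avg_doc_len": 100, "doc_count": 1000, "doc_freq": 10}

short_doc = bm25_term_score(tf=2, doc_len=50, **stats)
long_doc = bm25_term_score(tf=2, doc_len=400, **stats)
print("short document scores higher:", short_doc > long_doc)

# With b=0 the document length no longer influences the score.
no_norm_short = bm25_term_score(tf=2, doc_len=50, b=0.0, **stats)
no_norm_long = bm25_term_score(tf=2, doc_len=400, b=0.0, **stats)
print("b=0 ignores length:", no_norm_short == no_norm_long)
```

Raising k1 lets additional occurrences of a term keep increasing the score for longer, while a b value closer to 1 penalizes long documents more strongly.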

Configuring DFR similarity

In case of the DFR similarity, we can configure the basic_model property (which can take the value be, d, g, if, in, p, or ine), the after_effect property (with values of no, b, or l), and the normalization property (which can be no, h1, h2, h3, or z). If we choose a normalization value other than no, we need to set the normalization factor.

Depending on the chosen normalization value, we should use normalization.h1.c (the float value) for h1 normalization, normalization.h2.c (the float value) for h2 normalization, normalization.h3.c (the float value) for h3 normalization, and normalization.z.z (the float value) for z normalization. For example, the following is how the example similarity configuration will look (we put this into the settings section of our mappings file):

      "similarity" : {
        "esserverbook_dfr_similarity" : {
          "type" : "DFR",
          "basic_model" : "g",
          "after_effect" : "l",
          "normalization" : "h2",
          "normalization.h2.c" : "2.0"
        }
      }

Configuring IB similarity

In the case of the IB similarity, we can configure the distribution property (which can take the value of ll or spl) and the lambda property (which can take the value of df or ttf). In addition to that, we can choose the normalization factor, which is the same as for the DFR similarity, so we'll omit describing it a second time. The following is how the example IB similarity configuration will look (we put this into the settings section of our mappings file):

      "similarity" : {
        "esserverbook_ib_similarity" : {
          "type" : "IB",
          "distribution" : "ll",
          "lambda" : "df",
          "normalization" : "z",
          "normalization.z.z" : "0.25"
        }
      }
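
Because the DFR and IB configurations share the same normalization rules, a small script can catch a missing normalization parameter before the settings are sent to Elasticsearch. The following Python sketch encodes only the rules described in this section; the similarity_problems helper is hypothetical, and treating an absent normalization property as no is an assumption made for the example.

```python
# Parameter that must accompany each normalization value, as described above.
NORMALIZATION_KEYS = {
    "h1": "normalization.h1.c",
    "h2": "normalization.h2.c",
    "h3": "normalization.h3.c",
    "z": "normalization.z.z",
}
DFR_BASIC_MODELS = {"be", "d", "g", "if", "in", "p", "ine"}
DFR_AFTER_EFFECTS = {"no", "b", "l"}


def similarity_problems(config):
    """Return a list of problems found in one DFR or IB similarity config."""
    problems = []
    if config.get("type") == "DFR":
        if config.get("basic_model") not in DFR_BASIC_MODELS:
            problems.append("unknown basic_model")
        if config.get("after_effect") not in DFR_AFTER_EFFECTS:
            problems.append("unknown after_effect")
    # Assumption: a config without the property behaves like "no".
    normalization = config.get("normalization", "no")
    if normalization != "no":
        required_key = NORMALIZATION_KEYS.get(normalization)
        if required_key is None:
            problems.append("unknown normalization")
        elif required_key not in config:
            problems.append("missing " + required_key)
    return problems


dfr = {"type": "DFR", "basic_model": "g", "after_effect": "l",
       "normalization": "h2", "normalization.h2.c": "2.0"}
ib = {"type": "IB", "distribution": "ll", "lambda": "df",
      "normalization": "z", "normalization.z.z": "0.25"}

print(similarity_problems(dfr))
print(similarity_problems(ib))
```

Both example configurations from this section pass the check; removing normalization.h2.c from the DFR one would be reported as a missing parameter.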