Mastering Elastic Stack

The beginning of Elasticsearch

It all started with Lucene, a brilliant project supported by the Apache Software Foundation. There is a long list of Lucene-based projects: Apache Solr, Elasticsearch, Apache Nutch, Lucene.Net, and DocFetcher, to name a few. If you ever look for a search engine solution, you will surely come across Lucene. It is available not only for Java; implementations also exist for Delphi, Perl, C#, C++, Python, Ruby, and PHP. A complete list of Lucene implementations is available at http://wiki.apache.org/lucene-java/LuceneImplementations.

Lucene is a full-text search engine that builds indices over documents. In a paragraph or blob of text, each string is called a term, a sequence of terms is called a field, and a sequence of fields is called a document. An index contains a sequence of documents, so data is indexed as documents.

Books usually have an index that lists keywords along with the pages where they appear, which helps us find the actual content. This type of index is called an inverted index: terms (strings) are used to look up the documents that contain them.
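The idea of an inverted index can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `build_inverted_index` and the sample documents are hypothetical, and terms are produced by lowercasing and splitting on whitespace, which is far simpler than the text analysis Lucene actually performs.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenization (assumption): lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene is a search library",
    2: "Elasticsearch is built on Lucene",
}

index = build_inverted_index(docs)
print(index["lucene"])  # [1, 2]
print(index["search"])  # [1]
```

Looking up a term is then a single dictionary access, which is why this structure suits search much better than scanning every document for every query.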

Lucene is a wonderful project for implementing text-based search engines. It first appeared in 1999, and a huge number of Lucene-based implementations have been built since then. Interestingly, there are even complete search engines that use Lucene at their core. These projects extend Lucene by wrapping it, creating an interface for it, adding more features, and so on, which provides a variety of options for different solutions.

For Java-based projects, Apache Solr and Elasticsearch are both good choices. You can find many threads on the Internet debating which search engine is superior.

Before Elasticsearch, Shay Banon created Compass, which was also built on top of Lucene. Compass made life easier for Java developers with its seamless integration, XML and JSON support, and ability to integrate with ORM libraries such as Hibernate and JPA. While working on the upgrade to Compass 3.0, Shay felt that it would require major changes to address the scalability issue and to upgrade Lucene to version 2.9. He then thought of a better solution that would address all of these issues, and so Elasticsearch arrived instead of Compass 3.0. In July 2010, in a blog post titled The Future of Compass and Elasticsearch (http://thedudeabides.com/articles/the_future_of_compass), he wrote:

So, I started out building elasticsearch. It's basically a solution built from the ground up to be distributed. I also wanted to create a search solution that can be used by any other programming language easily, which basically means JSON over HTTP, without sacrificing the ease of use within the Java programming language (or more specially, the JVM).

Elasticsearch was born at that time and started attracting attention from developers in the open source community. As a result, it now has a long list of users. GitHub, Quora, Stack Exchange, Mozilla, StumbleUpon, Cisco, and Netflix are among the most renowned. A more comprehensive list can be found on the product site at https://www.elastic.co/use-cases.

Key features

Elasticsearch can be considered one of the most advanced search engines: it offers whatever Lucene offers and much more. Let's look at a few of its features:

  • Just give JSON: Elasticsearch takes structured JSON documents as input to create indices. By default, the properties of every field are detected automatically and the field is indexed. Elasticsearch creates the mappings on its own (using defaults that we can change). You don't need to define a schema up front (such as schema.xml in Solr). Since Elasticsearch builds on the best of Lucene, it offers full-text search on the indexed data.
  • RESTful API: Most of the necessary actions can be performed through the RESTful API using JSON data. You can send a JSON document to add it to an index, delete an entry, update an entry, and much more. We will learn about the APIs later in this book.
  • Real-time data availability and analytics: Almost as soon as data is indexed, it becomes available for search and analytics. It is all near real-time.
  • Distributed: Elasticsearch allows us to set up as many nodes as we need. The cluster manages everything, and it can grow horizontally to a large number of nodes (thousands, as they say). To grow the cluster, just start another node on the network with the same cluster name and it will be added to the cluster.
  • Highly available: The cluster is smart enough to detect a new or failed node and to add it to or remove it from the cluster. As soon as a node is added or removed, data is rebalanced so that it remains available.
  • Safety of your data comes first: Any change in data is recorded in transaction logs, and not just on a single node but on multiple nodes, so that the data remains available if a node fails. This way, Elasticsearch tries to minimize data loss.
  • Multitenancy: A cluster usually contains multiple indices. In Elasticsearch, aliases can be created for an index, and an alias can expose a filtered view of that index, which is how multitenancy is achieved.
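To make the "Just give JSON", RESTful API, and multitenancy points more concrete, here is a minimal sketch in Python that builds (but does not send) two HTTP requests: one that indexes a JSON document and one that creates a filtered alias. Several things here are assumptions rather than anything prescribed by the text: the node address `http://localhost:9200`, the index name `books`, the `tenant` field, the alias name `books_tenant_a`, and the helper `json_request` are all hypothetical. The document path uses the `_doc` endpoint of recent Elasticsearch releases; older releases expect a type name in that position, so adjust the path to your version.

```python
import json
import urllib.request

# Assumption: a single local node listening on the default HTTP port.
BASE_URL = "http://localhost:9200"


def json_request(method, path, body):
    """Build an HTTP request that carries a JSON body (nothing is sent here)."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Index a document with id 1 into the hypothetical "books" index.
# No schema is declared first; the mapping is derived from the JSON itself.
index_req = json_request(
    "PUT",
    "/books/_doc/1",
    {"title": "Mastering Elastic Stack", "tenant": "a"},
)

# Create an alias that exposes only the documents whose "tenant" field is "a".
alias_req = json_request(
    "POST",
    "/_aliases",
    {
        "actions": [
            {
                "add": {
                    "index": "books",
                    "alias": "books_tenant_a",
                    "filter": {"term": {"tenant": "a"}},
                }
            }
        ]
    },
)

for req in (index_req, alias_req):
    print(req.get_method(), req.full_url)
```

With a cluster actually running at that address, each request could be sent with `urllib.request.urlopen(req)`; searches issued against `books_tenant_a` would then see only that tenant's documents.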