Elasticsearch

Elasticsearch is a highly scalable open-source full-text search and analytics engine based on Lucene. It allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine that powers applications with complex search features and requirements. Sample use cases include online store catalogs, log collection and analysis for data mining, monitoring and alerting systems, and business intelligence. The Elastic Stack is used by many technology companies, including LinkedIn and Uber; its commercial counterpart is Splunk. Elasticsearch is the search engine part of the Elastic Stack. In most cases you will also need Logstash, the data ingestion and processing pipeline, and Kibana, the data visualization system.

History

Compass is the precursor to Elasticsearch, created by Shay Banon in 2004. While working on its third version, Banon rewrote large parts of Compass to "create a scalable search solution": one built from the ground up to be distributed and to use a common interface, JSON over HTTP. Shay Banon released the first version of Elasticsearch in February 2010. Elasticsearch BV was founded in 2012 to provide commercial services and products around Elasticsearch and related software. In March 2015, the company Elasticsearch changed its name to Elastic.

Concurrency Control

Two-Phase Locking (Deadlock Detection)

Elasticsearch does not support ACID transactions for changes involving multiple documents, although changes to individual documents are ACID. If your main data store is a relational database, and Elasticsearch is simply being used as a search engine or as a way to improve performance, then ACID transactions are handled in the relational database. If you are not using a relational store, these concurrency issues need to be dealt with at the Elasticsearch level. The three practical solutions used with Elasticsearch are global locking, document locking, and tree locking, listed in order of increasingly fine-grained locking. Each of them is a form of two-phase locking. A global lock blocks the entire storage system so that only one writer runs at a time. Document locking locks all of the files involved. Tree locking locks only a directory.
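The document-locking idea above can be pictured with a minimal sketch. It assumes a hypothetical in-memory LockStore standing in for wherever the lock records live; it is not the Elasticsearch API, and the names LockStore, acquire_all, and release_all are invented for illustration. It shows the two phases: acquire every lock first (rolling back on failure), then release only after the work is done.

```python
class LockStore:
    """Hypothetical in-memory stand-in for a store that holds lock records."""

    def __init__(self):
        self._locks = {}  # lock id -> owner

    def create(self, lock_id, owner):
        # Create-only write: fails if the lock record already exists.
        if lock_id in self._locks:
            return False
        self._locks[lock_id] = owner
        return True

    def delete(self, lock_id, owner):
        # Only the owner may release a lock.
        if self._locks.get(lock_id) == owner:
            del self._locks[lock_id]


def release_all(store, lock_ids, owner):
    """Shrinking phase: release locks once the work is finished."""
    for lock_id in lock_ids:
        store.delete(lock_id, owner)


def acquire_all(store, lock_ids, owner):
    """Growing phase: take every lock, or roll back and report failure."""
    taken = []
    for lock_id in sorted(lock_ids):  # fixed order avoids lock-order deadlocks
        if store.create(lock_id, owner):
            taken.append(lock_id)
        else:
            release_all(store, taken, owner)
            return False
    return True


store = LockStore()
files = ["file-1", "file-2"]
first = acquire_all(store, files, "writer-a")
second = acquire_all(store, ["file-2", "file-3"], "writer-b")  # file-2 is taken
release_all(store, files, "writer-a")
third = acquire_all(store, ["file-2", "file-3"], "writer-b")
```

A global lock is the degenerate case of the same sketch with a single lock id shared by all writers.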

Indexes

Hash Table

Elasticsearch targets text search, so its indexes differ from most relational database index implementations. Elasticsearch uses an inverted index as its basic index structure. An index term is the unit of search, and the inverted index makes most lookups look like a string prefix problem. To favor search speed, Elasticsearch keeps the index compact: when searching over a smaller index, less data needs to be processed and more of it fits in memory. The trade-off is that a compact index cannot be updated efficiently. An Elasticsearch index is made up of one or more shards, each of which can have zero or more replicas. Each of these is an individual Lucene index, which in turn is made up of index segments.
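To make the inverted index concrete, here is a minimal sketch in plain Python. It is a toy model, not how Lucene stores its segments: the whitespace tokenizer and the function names build_inverted_index, search, and prefix_search are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analyzer (an assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, term):
    """Exact term lookup: one dictionary access, no document scan."""
    return index.get(term.lower(), [])


def prefix_search(index, prefix):
    """Union of the postings of every term that starts with the prefix."""
    prefix = prefix.lower()
    hits = set()
    for term, ids in index.items():
        if term.startswith(prefix):
            hits.update(ids)
    return sorted(hits)


docs = {1: "Quick brown fox", 2: "Quick search engine", 3: "Brown bear"}
index = build_inverted_index(docs)
quick_hits = search(index, "Quick")
b_hits = prefix_search(index, "b")
```

The sketch rebuilds the whole mapping from scratch, which mirrors the trade-off described above: the structure is cheap to query but awkward to update in place.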

Isolation Levels

Snapshot Isolation

Elasticsearch does not support transactions and is better suited to read-intensive workloads. When document versioning is used, it can ensure one-session semantics. Without versioning, all modifications are applied to the same document, so a later write can silently overwrite an earlier one.
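The effect of versioning can be illustrated with a small sketch of optimistic concurrency control. It assumes a hypothetical in-memory VersionedStore rather than a real Elasticsearch client; the class, the expected_version parameter, and the VersionConflict exception are invented names for illustration.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the document."""


class VersionedStore:
    """Hypothetical in-memory store that tracks a version per document."""

    def __init__(self):
        self._docs = {}  # doc id -> (version, body)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, body, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # With versioning, reject writes based on an outdated read.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, body)
        return new_version


store = VersionedStore()
store.put("item-1", {"stock": 10})

# Two clients read the same version of the document.
version_a, body_a = store.get("item-1")
version_b, body_b = store.get("item-1")

# Client A writes first and succeeds.
store.put("item-1", {"stock": body_a["stock"] - 1}, expected_version=version_a)

# Client B's write is based on a stale read and is rejected.
try:
    store.put("item-1", {"stock": body_b["stock"] - 1}, expected_version=version_b)
    conflict = False
except VersionConflict:
    conflict = True

final_version, final_body = store.get("item-1")
```

Omitting expected_version in this sketch corresponds to the unversioned case: both writes would be accepted and the second would overwrite the first.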

Joins

Nested Loop Join Hash Join

Performing full SQL-style joins in a distributed system like Elasticsearch is prohibitively expensive. Instead, Elasticsearch offers two forms of join that are designed to scale horizontally: the nested query, and the has_child and has_parent queries. The nested query uses an idea similar to a nested loop join. Documents may contain fields of type nested; these fields are used to index arrays of objects, where each object can be queried (with the nested query) as an independent document. The has_child and has_parent queries use a hash join to return parent documents that have matching children, or child documents that have matching parents, within a single index.
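As a rough illustration, the request bodies for these queries can be written as Python dictionaries and serialized to JSON. The field and type names used here (comments, comments.author, comment, post, title) are hypothetical, and the sketch only builds the bodies; it does not send them to a cluster.

```python
import json

# Nested query: match objects inside a field mapped as type "nested".
nested_query = {
    "query": {
        "nested": {
            "path": "comments",  # hypothetical nested field
            "query": {"match": {"comments.author": "alice"}},
        }
    }
}

# has_child: return parent documents whose children match the inner query.
has_child_query = {
    "query": {
        "has_child": {
            "type": "comment",  # hypothetical child relation name
            "query": {"match": {"author": "alice"}},
        }
    }
}

# has_parent: return child documents whose parent matches the inner query.
has_parent_query = {
    "query": {
        "has_parent": {
            "parent_type": "post",  # hypothetical parent relation name
            "query": {"match": {"title": "elasticsearch"}},
        }
    }
}

# JSON over HTTP is the common interface, so each body serializes directly.
bodies = [json.dumps(q) for q in (nested_query, has_child_query, has_parent_query)]
```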

Checkpoints

Blocking Fuzzy

By default, Logstash uses in-memory bounded queues to buffer events. Persistent queues instead buffer events on disk, which absorbs bursts of events and provides durability of data within Logstash for Elastic systems. When persistent queues are enabled, Logstash stores events on disk and commits them using checkpointing. The persistent queue has two kinds of pages: head pages and tail pages. There is only one head page; when it reaches a certain size, it becomes a tail page. Tail pages are immutable and the head page is append-only. When recording a checkpoint, Logstash calls fsync on the head page and atomically writes the current state of the queue to disk. Checkpointing is atomic, which means that any update to the checkpoint file is either saved completely or not at all. If Logstash is terminated or there is a hardware-level failure, any data that is buffered in the persistent queue but not yet checkpointed is lost.
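The checkpoint step described above can be sketched with standard file operations. This is an illustration of the general technique (fsync the append-only page, then atomically replace a small state file), not Logstash's actual implementation; the file names, the JSON layout, and the functions append_event and checkpoint are assumptions.

```python
import json
import os
import tempfile


def append_event(page_path, event):
    """Append one event to the head page (append-only)."""
    with open(page_path, "a", encoding="utf-8") as page:
        page.write(json.dumps(event) + "\n")


def checkpoint(page_path, checkpoint_path, state):
    """Flush the head page, then atomically replace the checkpoint file."""
    with open(page_path, "a", encoding="utf-8") as page:
        page.flush()
        os.fsync(page.fileno())  # force the head page contents to disk

    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # Readers see either the old checkpoint or the new one, never a partial file.
    os.replace(tmp_path, checkpoint_path)


workdir = tempfile.mkdtemp()
head_page = os.path.join(workdir, "page.0")          # illustrative file name
checkpoint_file = os.path.join(workdir, "checkpoint.head")  # illustrative file name

append_event(head_page, {"message": "a"})
append_event(head_page, {"message": "b"})
checkpoint(head_page, checkpoint_file, {"page": 0, "events_written": 2})

with open(checkpoint_file, encoding="utf-8") as f:
    saved_state = json.load(f)
```

Events appended after the last call to checkpoint are exactly the ones that would be at risk in a crash, which matches the loss window described above.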

Data Model

Document / XML

Elasticsearch is a document-oriented distributed database. The entire object graph you want to search needs to be indexed, so documents must be denormalized before indexing. Elasticsearch uses mappings to define how a document is stored, in a way that is optimized for search and retrieval. It is an excellent fit for write-once-read-many workloads. Like many other document-oriented databases, Elasticsearch does not enforce constraints on data.
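Denormalizing before indexing can be sketched as follows. The relational-style inputs (orders, customers, products) and the function name denormalize are hypothetical; the point is only that the resulting document carries the whole object graph that a search needs.

```python
def denormalize(order, customers, products):
    """Fold the referenced rows into one self-contained document."""
    customer = customers[order["customer_id"]]
    return {
        "order_id": order["id"],
        "customer": {"name": customer["name"], "city": customer["city"]},
        "items": [
            {"name": products[pid]["name"], "price": products[pid]["price"]}
            for pid in order["product_ids"]
        ],
    }


# Hypothetical normalized source data.
customers = {7: {"name": "Ada", "city": "London"}}
products = {
    1: {"name": "keyboard", "price": 40},
    2: {"name": "mouse", "price": 15},
}
order = {"id": 100, "customer_id": 7, "product_ids": [1, 2]}

document = denormalize(order, customers, products)
```

Because the customer and product details are copied into the document, a later change to a customer's city would require reindexing every affected order, which is one reason this model suits write-once-read-many workloads.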

Website

http://www.elasticsearch.org/

Source Code

https://github.com/elastic/elasticsearch

Tech Docs

https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

Developer

Shay Banon

Country of Origin

US

Start Year

2004

Former Name

Compass

Project Type

Open Source

Written in

Java

Supported languages

Java, Python

Derived From

Elasticsearch

Operating Systems

All OS with Java VM

Licenses

Apache v2