Espresso is an internal distributed document-oriented database management system written by LinkedIn. It serves as the source-of-truth primary store for many downstream systems and applications such as Company Pages, Unified Social Content Platform, InMail, etc. Because of its positioning, the design principles of Espresso mainly focus on satisfying the requirements from production environments, which include but are not limited to the guarantee of operability, availability, scalability, and elasticity. For operability, Espresso is designed to support a hierarchical data model with various operations and be compatible with the whole data ecosystem at LinkedIn. For availability, it has multiple heavyweight fault tolerance mechanisms, e.g., there is always a warm standby of the Espresso cluster at a geographically remote disaster recovery data center. For scalability, most of the methods it adopts, including cluster management and data management, avoid centralized processing and synchronized operations. For elasticity, it supports online cluster expansion with little downtime. Besides all those design principles, the overall objective of Espresso is to find a sweet-spot between SQL database systems and NoSQL systems, which is clearly reflected by its data model design.
As a distributed database system, Espresso chooses consistency and partition tolerance among CAP. It is derived from MySQL (with InnoDB as the storage engine). It has two system-level internal building blocks: a cluster management system and a change capture system. Apache Helix, with Apache Zookeeper integrated, is used to carry out the first function. LinkedIn Databus was used as the first generation change capture system but was later replaced by Kafka. The change capture system also plays a role in data replication. Besides, a library-level building block of Espresso is Apache Lucene, which provides basic supports fo full-text inverted index.
The Espresso project was first planned and designed in early 2011. Its mission at that time was to fill the vacancy that there exists no well-designed highly consistent database systems with both scalability and agility in LinkedIn's data infrastructure. Based on the experience of developing early relational database systems like Voldemort, the engineering team at LinkedIn spent one year writing Espresso and deployed it in production in June 2012.
Although LinkedIn once planed to open source Espresso, the plan was later shelved. Therefore, currently Espresso is still a closed-source system.
Multi-version Concurrency Control (MVCC) Two-Phase Locking (Deadlock Detection)
Espresso relies on MySQL to provide transaction support. So its concurrency control schemes are identical to MySQL's. Since Espresso only allows transactional actions within a partition (a.k.a. document group) and all the documents belonging to the same partition will be stored on the same node, it does not support distributed transactions. Because the indexes are also mostly built within a partition, updates can be transactionally applied to both documents and indexes.
The consistency model of Espresso between the primary partition on master node and its replicas on slave nodes is timeline consistency. It requires the commit order on slave nodes to be the same as what it is on master node. The order is defined by a global timestamp called system change number (SCN).
Espresso adopts a hierarchical document-oriented data model, where documents belong to both tables and document groups, tables and document groups then belong to databases. Different levels of this hierarchy have different schema-define format, e.g., database and table schemas are defined in JSON but document schemas are defined in Avro who supports online schema evolution. Since document group is a logical concept and has no explicit representation, it does not have schemas.
Database is the largest unit of the data model. Table and document group are the two lower parallel levels in the hierarchy, but they are quite different from each other. A table explicitly contains some documents that have the same schema, which defines the complete key structure of tables. But a document group does not actively own any document; on the contrary as long as two documents have the same partitioning key, they belong to the same document group. A table can therefore span multiple document groups and a document group can also span multiple tables. The most basic unit of the data model is document. A document contains abundant schema-ed contents with various data structures. It's conceptually similar to a row in relational database systems.
There is another level called collection in the hierarchy of Espresso. Similar to document groups, collections are not explicitly represented by schemas; on the contrary, any documents that have the same partial keys are in the same collection. To make everything above clear, below is an example of this data model:
MailboxDB is a database,
Messages is a table. The schema of the
Messages table defines that all its documents have key structure
<MailboxID> is partition key. Then the following statements are true in Espresso:
/MailboxDB/Messages/100/1uniquely identifies a document.
/MailboxDB/Messages/100/2are in the same collection
MailboxStatsis another table that has the same schema as
Messages, then Document
/MailboxDB/MailboxStats/100/1are in the same partition.
Although document group is a passive component in the data model, it plays an important role in data management. For example, secondary indexes are mainly built inside a document group and the documents of the same document group are usually stored in the same node. Espresso tries to obtain the benefits from both relational database systems and NoSQL database systems by leveraging the hierarchical document data model. For instance, transactional operations are supported for at most all the documents in a document group, unlike usual cases where document-oriented database systems only support transactional operations on a single document. On the other hand, the relationships between documents are loosened by such a data model.
Hash Table Inverted Index (Full Text)
Espresso supports 1) hash index on partition key and 2) a high-performance implementation of inverted index as secondary indexes.
The inverted index of Espresso is adapted from Apache Lucene. There are two types of secondary indexes in Espresso based on their scope: 1) local secondary indexes, which are built within a document group; 2) global secondary indexes, which are built across multiple document groups. Local secondary indexes are optimized by prefix the collection key in front of the indexed terms. This is called Prefix Index. An example of this is as below:
If the original inverted list has a term
Andy, which is a long posting list, then after applying Prefix Index, it will be split into several terms with small posting lists, e.g.,
This optimization is used to reduce the latency and memory footprint, as well as bound the overhead by the size of working set, although it increases the number of lists in the index. The original inverted index of Lucene is implemented in a log-structure manner, which makes the index immutable. Espresso modifies it to make it updatable.
The posting lists of Prefix Index are stored in MySQL, which provides free transactional support. Terms in an inverted index are organized by a B+-Tree, and bitmap indexes are also used to speed up the searching process. But these two indexes are not used as data indexes.
Read Uncommitted Read Committed Serializable Snapshot Isolation Repeatable Read
Espresso relies on MySQL to provide transaction support. So its isolation levels are identical to MySQL's.
Espresso's logging scheme is adapted from MySQL's binary logging. It's a kind of physical logging and is responsible for data replication. The original version of MySQL's binary logging does not support replication of partial logs. Espresso adds a partition-specific sequence number, a.k.a. system change number (SCN), to each binary log record so that logs can be partially replicated and sent to different data shards.
Espresso adopts REST-style APIs for the purpose of ease-to-use and ease-to-integrate. It supports 1) read via primary keys or secondary indexes; 2) partially update, fully update, and insert; 3) filter predicates based on timestamp and document CRC; 4) register for monitoring change stream; 5) bulk load and export for offline analysis purpose.
All the operations above can be applied transactionally within the range of a partition, namely a document group.
Espresso's storage engine is MySQL's InnoDB. Therefore Espresso directly inherits the disk-oriented storage architecture from InnoDB, and relies on its functions to do buffer pool management, etc.
N-ary Storage Model (Row/Record)
Espresso is a document store implemented on MySQL, which means its underline storage model is still a row store.
Espresso uses shared-nothing architecture and relies on the change capture system to ship logs and implement automatic failover. The first generation of the change capture system is LinkedIn Databus. Its successor is Kafka, which is also the current system in use.
Espresso uses master-slave model for replication. The change capture system listens on master node and slave nodes subscribe on the change capture system. As long as there is new logs for any update, the slave nodes consume them to update the replicas. Downstream applications and offline systems can also consume the change streams from change capture system. The replication can be configured to be both asynchronous and semi-synchronous. For load balance and fast failover/expansion reasons, the granularity of replication is partition-level and master partitions and slave partitions are allowed to mix in the same node.
Besides the major components mentioned above, there are several other functional components in Espresso. Helix, together with its Zookeeper, keeps all the system metadata. Router nodes are responsible for routing queries to the master nodes of different partitions based on the hash value of their partition keys. A consistency checker will periodically compare the checksum of master partitions and slave partitions. If errors are detected, recovery process will be invoked.