etcd

etcd is a distributed key-value store that is highly available, strongly consistent, and watchable for changes. The name "etcd" combines "etc", the Unix configuration directory, with "d" for distributed systems. etcd has been adopted by cloud-native systems such as Kubernetes, Cloud Foundry Diego, and Project Calico. Major use cases include metadata storage and distributed coordination.

History

etcd was originally started for two use cases: reboot coordination and application configuration. CoreOS used etcd to coordinate reboots of a CoreOS cluster so that the nodes in the cluster did not all reboot at the same time. etcd was also used to store application configuration, so that a server receives its application configuration from etcd whenever it starts or restarts, or whenever the configuration is updated.

Stored Procedures

Not Supported

Data Model

Key/Value

etcd stores data as multiversion key-value pairs. Each mutating operation (e.g., PUT) creates a new version and leaves older versions unchanged. Previous versions remain accessible until they are compacted.
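
The versioning and compaction behavior described above can be sketched with a toy in-memory store. This is only an illustration under simplifying assumptions, not etcd's implementation: the class name MultiVersionStore, its methods, and the choice to raise ValueError for a compacted revision are all hypothetical.

```python
class MultiVersionStore:
    """Toy multiversion store; an illustration, not etcd's implementation."""

    def __init__(self):
        self.revision = 0    # global counter, bumped by every put
        self.compacted = 0   # reads below this revision are rejected
        self.history = {}    # key -> [(revision, value), ...], oldest first

    def put(self, key, value):
        # A put never overwrites: it appends a new version.
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, revision=None):
        # Return the value visible at `revision` (the latest if None).
        target = self.revision if revision is None else revision
        if target < self.compacted:
            raise ValueError("requested revision has been compacted")
        visible = [v for r, v in self.history.get(key, []) if r <= target]
        if not visible:
            raise KeyError(key)
        return visible[-1]

    def compact(self, revision):
        # Keep only the newest version at or before `revision`, plus
        # everything written after it.
        self.compacted = revision
        for key, versions in self.history.items():
            old = [(r, v) for r, v in versions if r <= revision]
            new = [(r, v) for r, v in versions if r > revision]
            self.history[key] = old[-1:] + new


store = MultiVersionStore()
r1 = store.put("color", "red")
r2 = store.put("color", "green")
r3 = store.put("color", "blue")
```

Before compaction, store.get("color", r1) still returns the first value. After store.compact(r2), reads at r2 and later keep working, while a read at r1 is rejected.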

System Architecture

Shared-Nothing

An etcd cluster is composed of shared-nothing nodes. The cluster has one leader node, and the other nodes work as followers; these roles are determined at run time by the Raft algorithm. When the leader receives a request, it replicates the request to all followers. Once a majority of nodes have accepted the request, the leader commits it and asks the followers to commit it as well. An etcd client does not need to know which node is the leader. Instead, the client can send a request to any node in the cluster, and a follower that receives the request forwards it to the leader.
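
The majority rule and the follower-to-leader forwarding can be sketched as follows. This is a heavily simplified model and not the Raft algorithm itself: the leader is fixed rather than elected, there are no terms or log catch-up, and the names Node, make_cluster, majority, and handle_request are hypothetical.

```python
def majority(cluster_size):
    # Smallest number of nodes that forms a majority of the cluster.
    return cluster_size // 2 + 1


class Node:
    def __init__(self, name):
        self.name = name
        self.up = True       # whether the node is reachable
        self.log = []        # entries received so far
        self.committed = 0   # number of log entries known to be committed
        self.peers = []      # every other node in the cluster
        self.leader = None   # the leader as known to this node

    def handle_request(self, entry):
        # Clients may contact any node; a follower forwards to the leader.
        if self.leader is not self:
            return self.leader.handle_request(entry)
        self.log.append(entry)
        acked = []
        for peer in self.peers:
            if peer.up:
                peer.log.append(entry)
                acked.append(peer)
        # The leader counts itself toward the majority.
        if len(acked) + 1 < majority(len(self.peers) + 1):
            return False     # the entry stays uncommitted
        self.committed = len(self.log)
        for peer in acked:
            peer.committed = len(peer.log)
        return True


def make_cluster(size):
    nodes = [Node(f"n{i}") for i in range(size)]
    for node in nodes:
        node.peers = [p for p in nodes if p is not node]
        node.leader = nodes[0]   # assumption: fixed leader, no election
    return nodes
```

In a three-node cluster built with make_cluster(3), a request sent to a follower is forwarded and commits as long as at least two nodes are up. With only the leader reachable, handle_request returns False.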

Checkpoints

Blocking

etcd takes snapshots to speed up recovery and to keep the log from growing without bound. etcd automatically creates a snapshot once a configurable number of transactions have been committed since the last snapshot, and a user can also create a snapshot at any time with the etcdctl command. etcd acquires a global latch to produce a snapshot, so taking snapshots too frequently degrades the performance of database operations.
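
A count-based snapshot trigger of this kind might look like the sketch below. It is a toy model under stated assumptions: the class SnapshottingStore and its fields are hypothetical, the log is discarded entirely at each snapshot, and no latch is modeled.

```python
class SnapshottingStore:
    def __init__(self, snapshot_count):
        self.snapshot_count = snapshot_count  # entries between snapshots
        self.state = {}                       # live key-value state
        self.applied_index = 0                # index of last applied entry
        self.snapshot_state = {}              # state captured by last snapshot
        self.snapshot_index = 0               # last entry the snapshot covers
        self.log = []                         # entries newer than the snapshot

    def apply(self, key, value):
        self.applied_index += 1
        self.log.append((self.applied_index, key, value))
        self.state[key] = value
        # Automatic trigger: enough entries since the last snapshot.
        if self.applied_index - self.snapshot_index >= self.snapshot_count:
            self.take_snapshot()

    def take_snapshot(self):
        # Can also be called manually at any time.
        self.snapshot_state = dict(self.state)
        self.snapshot_index = self.applied_index
        self.log.clear()   # entries covered by the snapshot are dropped

    def recover(self):
        # Start from the snapshot, then replay only the newer entries.
        state = dict(self.snapshot_state)
        for _, key, value in self.log:
            state[key] = value
        return state


store = SnapshottingStore(snapshot_count=3)
for i in range(4):
    store.apply(f"k{i}", str(i))
```

With a threshold of 3 and four applied entries, one snapshot has been taken at index 3 and only the fourth entry remains in the log, so recovery replays a single entry instead of four.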

Logging

Command Logging

etcd appends the commands committed through the Raft algorithm to its log. Because etcd uses gRPC as its query interface, the log records the gRPC commands.
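
Command logging records the operations themselves rather than the resulting data pages, and state is rebuilt by replaying them in order. The sketch below illustrates the idea; JSON lines stand in for etcd's own encoding, and the function names append_command and replay are hypothetical.

```python
import json


def append_command(log, op, key, value=None):
    # Each log record is the command itself, serialized as one line.
    log.append(json.dumps({"op": op, "key": key, "value": value}))


def replay(log):
    # Re-executing the commands in order reconstructs the state.
    state = {}
    for line in log:
        cmd = json.loads(line)
        if cmd["op"] == "put":
            state[cmd["key"]] = cmd["value"]
        elif cmd["op"] == "delete":
            state.pop(cmd["key"], None)
    return state


log = []
append_command(log, "put", "a", "1")
append_command(log, "put", "b", "2")
append_command(log, "delete", "a")
state = replay(log)
```

Replay is deterministic as long as every node applies the same commands in the same order, which is the property Raft provides.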

Storage Model

N-ary Storage Model (Row/Record)

etcd physically stores data as key-value pairs. Each key consists of a 3-tuple: major, sub, and type. Major holds the revision, a counter that is incremented whenever a data modification is requested. Sub distinguishes the keys within a revision, because a single transaction may modify multiple keys under one revision. Type is optional; one use is to mark a tombstone. The value contains a delta from the previous version.
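
One possible byte encoding of such a 3-tuple is sketched below. The layout (two big-endian 64-bit integers separated by an underscore, with an optional trailing marker byte) is an assumption made for illustration, and encode_key and decode_key are hypothetical names.

```python
import struct

TOMBSTONE = b"t"   # assumed marker byte for the optional type field


def encode_key(major, sub, tombstone=False):
    # Big-endian integers make byte-wise order match numeric order.
    raw = struct.pack(">Q", major) + b"_" + struct.pack(">Q", sub)
    return raw + TOMBSTONE if tombstone else raw


def decode_key(raw):
    major = struct.unpack(">Q", raw[:8])[0]
    sub = struct.unpack(">Q", raw[9:17])[0]
    return major, sub, raw[17:] == TOMBSTONE


# A transaction that writes two keys shares one major revision and
# uses sub to tell the two writes apart.
first = encode_key(5, 0)
second = encode_key(5, 1)
deleted = encode_key(6, 0, tombstone=True)
```

Because the integers are fixed-width and big-endian, sorting the encoded keys as bytes yields revision order, which suits a storage engine that keeps keys sorted.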

Indexes

B+Tree

etcd maintains an in-memory B-tree index over keys, which supports range operations.
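
The effect of an ordered index on range operations can be illustrated with a sorted list and binary search, used here as a simple stand-in for a B-tree. The class KeyIndex is hypothetical, and mapping each key to its latest revision is an assumption of this sketch.

```python
import bisect


class KeyIndex:
    """Ordered key index; a sorted list stands in for a B-tree."""

    def __init__(self):
        self.keys = []        # user keys, kept sorted
        self.revisions = {}   # key -> latest revision (assumed payload)

    def put(self, key, revision):
        if key not in self.revisions:
            bisect.insort(self.keys, key)
        self.revisions[key] = revision

    def range(self, start, end):
        # Return entries with start <= key < end, in key order.
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, end)
        return [(k, self.revisions[k]) for k in self.keys[lo:hi]]


index = KeyIndex()
index.put("foo/a", 1)
index.put("fop", 2)
index.put("bar", 3)
index.put("foo/b", 4)
prefix_hits = index.range("foo/", "foo0")
```

A prefix scan is a range scan whose end key is the prefix with its last byte incremented: "foo0" follows every key that starts with "foo/", since "0" is the character after "/".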

Storage Architecture

Disk-oriented

etcd persists key-value pairs on disk in a B+ tree sorted by key.

Concurrency Control

Multi-version Concurrency Control (MVCC)

etcd uses MVCC for concurrency control. An etcd revision corresponds to a version in MVCC, and each key-value pair carries two revisions: one recording when the pair was created and one recording when it was last updated. The etcd cluster maintains the current revision. When a mutating operation (e.g., Put, Delete, Txn) arrives, etcd assigns a new revision to the data affected by the operation and updates the current revision accordingly.
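
The bookkeeping of the two per-key revisions can be sketched as below. This is a minimal model that omits deletes and history; the names RevisionedStore, KeyValue, create_revision, and mod_revision are chosen for this sketch, and starting the counter at zero is an assumption.

```python
from dataclasses import dataclass


@dataclass
class KeyValue:
    value: str
    create_revision: int   # revision at which the key was created
    mod_revision: int      # revision of the most recent update


class RevisionedStore:
    def __init__(self):
        self.current_revision = 0
        self.data = {}

    def txn(self, puts):
        # All writes in one transaction share a single new revision.
        self.current_revision += 1
        rev = self.current_revision
        for key, value in puts.items():
            existing = self.data.get(key)
            created = existing.create_revision if existing else rev
            self.data[key] = KeyValue(value, created, rev)
        return rev

    def put(self, key, value):
        # A single put is a one-key transaction.
        return self.txn({key: value})


store = RevisionedStore()
rev_put = store.put("a", "1")
rev_txn = store.txn({"a": "2", "b": "1"})
```

After these two operations, key "a" keeps the revision of its first write as create_revision while its mod_revision moves to the transaction's revision, and key "b" has both revisions equal to the transaction's revision.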

Query Compilation

Not Supported