WiredTiger is an open-source key/value storage engine for NoSQL databases and is currently the default storage engine for MongoDB.
WiredTiger supports both B-trees and log-structured merge (LSM) trees for data storage. It also supports both row store and column store. WiredTiger employs multi-version concurrency control for better scalability on multi-core architectures and supports several standard isolation levels. Though WiredTiger is often considered a storage engine, it also has basic schema support, with schemas provided by users at runtime.
WiredTiger is developed in C and provides Java and Python language bindings. It also offers a set of APIs compatible with LevelDB's.
Development of the WiredTiger storage engine started in November 2008. The company behind it, WiredTiger Software, was founded in 2012 by Keith Bostic (also a founder of BerkeleyDB) and Michael Cahill, and was later acquired by MongoDB, which replaced its original mmap-based storage engine with WiredTiger as the default.
Checkpoints in WiredTiger differ from typical checkpoints in traditional relational database systems. Essentially, a WiredTiger table consists of multiple read-only checkpoints and a writable live tree. Checkpoints are taken periodically or can be triggered manually. During a checkpoint, dirty pages in a table are flushed to disk and a new checkpoint is created. Checkpoints are ordered by timestamp and can therefore support time-travel queries. A checkpoint is deleted only when no running transaction references it. Copy-on-write techniques are used to reduce the overhead of creating new checkpoints.
A successful checkpoint in WiredTiger involves four phases; we take B-trees as an example in the following discussion. First, all dirty leaf pages are flushed to disk. Second, all dirty internal pages are written to disk. Third, fsync is called on the checkpoint file so that the checkpoint is persisted to disk. Finally, the root of the B-tree is atomically switched to the new root. In this way, the old root becomes a checkpoint and the new root becomes the new "live tree"; this is similar to shadow paging. When a checkpoint is deleted, WiredTiger checks whether the two adjacent checkpoints share pages and, if so, merges them.
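The four phases can be sketched in a few lines of Python. This is a conceptual illustration of the flush-then-swap-root (shadow paging) idea, not WiredTiger's actual implementation; the class and field names are invented, and the "tree" is just a root with one level of leaves.

```python
# Conceptual sketch of the checkpoint phases described above:
# flush dirty leaves, flush dirty internal pages, (fsync), then
# atomically install a copy-on-write clone as the new live root.

class Page:
    def __init__(self, keys, children=None):
        self.keys = keys            # records for leaves, separators for internals
        self.children = children    # None for leaf pages
        self.dirty = True           # newly created pages start dirty

class Checkpointer:
    def __init__(self):
        self.disk = []              # "on-disk" blocks, in flush order
        self.checkpoint_roots = []  # read-only roots of past checkpoints
        self.live_root = Page([10, 20], [Page([1, 2]), Page([11, 12])])

    def _flush(self, page):
        if page.dirty:
            self.disk.append(page)
            page.dirty = False

    def checkpoint(self):
        root = self.live_root
        # Phase 1: flush all dirty leaf pages.
        for leaf in root.children:
            self._flush(leaf)
        # Phase 2: flush dirty internal pages (here, only the root).
        self._flush(root)
        # Phase 3: fsync would be called here to persist the blocks.
        # Phase 4: the old root becomes a read-only checkpoint; a
        # copy-on-write clone of it becomes the new live root. Clean
        # child pages are shared, not copied.
        self.checkpoint_roots.append(root)
        self.live_root = Page(list(root.keys), list(root.children))
```

Note how a second checkpoint with no intervening writes flushes only the new root: the clean leaf pages are shared between the checkpoint and the live tree, which is the copy-on-write saving mentioned above.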
Multiple compression techniques are used in WiredTiger:
Block-level: Existing compression libraries can be directly plugged into WiredTiger. It supports LZ4, snappy, zlib, and Zstd.
Record-level: WiredTiger supports dictionary encoding, run-length encoding, prefix encoding, and Huffman encoding.
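Two of the record-level encodings can be illustrated with a short conceptual sketch. Real WiredTiger applies these inside on-disk pages; the functions below just show the idea on a sorted list of keys (prefix encoding) and a column of repeated values (run-length encoding), with invented function names.

```python
# Conceptual sketch of prefix encoding and run-length encoding,
# two of the record-level compression techniques mentioned above.

def prefix_encode(sorted_keys):
    """Store each key as (shared-prefix length, remaining suffix)."""
    out, prev = [], ""
    for key in sorted_keys:
        common = 0
        while common < min(len(prev), len(key)) and prev[common] == key[common]:
            common += 1
        out.append((common, key[common:]))
        prev = key
    return out

def prefix_decode(encoded):
    """Rebuild the original keys from (prefix length, suffix) pairs."""
    out, prev = [], ""
    for common, suffix in encoded:
        key = prev[:common] + suffix
        out.append(key)
        prev = key
    return out

def rle_encode(values):
    """Collapse runs of equal values into (value, run length) pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out
```

Prefix encoding pays off on sorted keys with long shared prefixes, which is exactly the situation inside a B-tree leaf page.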
WiredTiger employs multi-version concurrency control (MVCC), in which readers never block writers. In the case of a write-write conflict, one transaction is rolled back and retried. In addition, a transaction's working set is required to fit in main memory in WiredTiger.
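The two behaviors described above, snapshot reads that never block writers and aborts on write-write conflicts, can be sketched with a minimal version-chain store. This is an illustration under simplifying assumptions (a single global commit timestamp, no locking), not WiredTiger's implementation; all names are invented.

```python
# Minimal MVCC sketch: each key maps to a chain of committed versions
# tagged with a commit timestamp; readers see the latest version at or
# before their snapshot, and a second writer on the same key aborts.

class WriteConflict(Exception):
    pass

class MVCCStore:
    def __init__(self):
        self.versions = {}     # key -> list of (commit_ts, value)
        self.uncommitted = {}  # key -> id of the transaction writing it
        self.ts = 0            # global commit timestamp

    def begin(self):
        return {"reads_at": self.ts, "writes": {}, "id": object()}

    def read(self, txn, key):
        # Latest version committed at or before the snapshot timestamp.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= txn["reads_at"]:
                return value
        return None

    def write(self, txn, key, value):
        holder = self.uncommitted.get(key)
        if holder is not None and holder is not txn["id"]:
            raise WriteConflict(key)  # write-write conflict: roll back
        self.uncommitted[key] = txn["id"]
        txn["writes"][key] = value

    def commit(self, txn):
        self.ts += 1
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((self.ts, value))
            del self.uncommitted[key]
```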
In WiredTiger, transactions can pin particular versions of the database, called "named snapshots". Queries within such a transaction execute as if they were operating on the named snapshot.
WiredTiger typically serves as a NoSQL storage engine similar to LevelDB and RocksDB, but it also supports a thin schema layer above the storage layer.
B-tree: WiredTiger uses B-trees for in-memory indexes, and hazard pointers are used in the B-tree implementation to protect in-memory pages from concurrent eviction. On disk, child pages are referenced by file offsets; the checksum of a child page is stored alongside the child's pointer in the parent page to guard against corruption such as bit flips.
Log-structured merge tree: LSM trees are used in WiredTiger to improve write throughput. Each LSM tree has a corresponding Bloom filter so that unnecessary disk accesses can be avoided.
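A Bloom filter is a compact bit array with no false negatives: a negative probe proves the key is absent, so the engine can skip reading that part of the LSM tree from disk. A minimal sketch (parameters and class name invented; WiredTiger's filters are tuned per LSM chunk):

```python
import hashlib

# Conceptual Bloom filter: k hash functions each set one bit per key.
# A lookup that finds any bit clear means the key was never inserted.

class BloomFilter:
    def __init__(self, nbits=1024, nhashes=4):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0  # Python ints make a convenient bit array

    def _positions(self, key):
        # Derive k independent positions from a salted cryptographic
        # hash (deterministic across runs, unlike built-in hash()).
        for i in range(self.nhashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

False positives are possible (the filter then triggers one unnecessary disk read), but false negatives are not, which is what makes it safe to skip I/O on a miss.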
Three isolation levels are provided: READ_UNCOMMITTED, READ_COMMITTED, and SNAPSHOT. READ_COMMITTED is the default isolation level.
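In the C API, the isolation level is selected through configuration strings, either per transaction or as a session default. A hedged fragment (assumes an already-opened `WT_CONNECTION *conn` and `WT_SESSION *session`; error handling omitted):

```
/* Per-transaction isolation level. */
session->begin_transaction(session, "isolation=snapshot");

/* Per-session default, set when the session is opened. */
conn->open_session(conn, NULL, "isolation=read-committed", &session);
```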
Nested-loop joins in WiredTiger are supported by join cursors. When a join cursor is created, it creates corresponding iterators on the tables to be joined. Indexes can be used to narrow down the range of these iterators for WHERE clauses in joins.
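The index-narrowing idea can be sketched as a nested-loop join that uses binary search on a sorted "index" to jump directly to matching inner rows instead of scanning the whole inner table. This is a conceptual illustration, not the join-cursor API; all names are invented.

```python
import bisect

# Conceptual index nested-loop join: for each outer row, use the
# sorted inner "index" to locate only the matching range of rows.

def index_nested_loop_join(outer, inner_sorted_by_key):
    """outer: list of (key, payload); inner_sorted_by_key: list of
    (key, payload) sorted by key, standing in for an index."""
    inner_keys = [k for k, _ in inner_sorted_by_key]
    results = []
    for okey, opayload in outer:
        # Binary search narrows the iterator's range, as an index
        # narrows a join cursor's iterators for a WHERE clause.
        i = bisect.bisect_left(inner_keys, okey)
        while i < len(inner_keys) and inner_keys[i] == okey:
            results.append((okey, opayload, inner_sorted_by_key[i][1]))
            i += 1
    return results
```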
Durability in WiredTiger is primarily achieved by checkpointing. However, additional logging (journaling) is also supported. In WiredTiger, log records for writes are flushed to disk when a transaction commits, and group commits are supported for faster logging.
WiredTiger uses physical logging, where each log record represents a single write operation. Each log record contains a transaction id, an operation type, a file id, an operation key, and an operation value.
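The record layout and its use in recovery can be sketched as follows. Field names follow the text above; the actual WiredTiger binary log format differs, and the function names are invented.

```python
# Sketch of the log record fields listed above, plus a replay loop:
# applying committed write records in log order reconstructs the
# table state after a crash.

def make_log_record(txn_id, op, file_id, key, value=None):
    return {"txn": txn_id, "op": op, "file": file_id,
            "key": key, "value": value}

def replay(log, table):
    """Apply write records, in log order, to a (possibly empty) table."""
    for rec in log:
        if rec["op"] == "put":
            table[rec["key"]] = rec["value"]
        elif rec["op"] == "remove":
            table.pop(rec["key"], None)
    return table
```

Because each record describes one write, replaying the log is a simple in-order scan; group commit only changes when records reach disk, not their contents.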
WiredTiger has a command line tool that supports basic write functionalities. WiredTiger provides language bindings for C, Python, and Java, and databases are accessed via cursors. It also provides optional asynchronous query APIs with a callback mechanism when high throughput is desired.
WiredTiger's storage architecture is optimized for multi-core CPUs and large memories. Its hybrid architecture contains two key components: an in-memory cache and a disk block manager. The in-memory cache is implemented as a B-tree with hazard pointers, where B-tree nodes are organized at page granularity and the cache eviction policy is LRU. The block manager is responsible for raw block I/O operations such as reads, writes, syncs, compression, and checkpoints. Blocks are managed in a skip list for efficient indexing and allocation.
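The interplay between the cache and the block manager under an LRU policy can be sketched in a few lines. This is a toy illustration (WiredTiger's real eviction is considerably more elaborate, with eviction servers and dirty-page targets); the class name and `read_block` callback are invented.

```python
from collections import OrderedDict

# Toy LRU page cache: pages are fetched from the block manager on a
# miss, and the least recently used page is evicted when full.

class PageCache:
    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block  # stand-in for the block manager
        self.pages = OrderedDict()    # insertion order tracks recency

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)     # mark most recently used
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict the LRU page
            self.pages[page_id] = self.read_block(page_id)
        return self.pages[page_id]
```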
WiredTiger supports both row store and column store. Row store has better write performance, whereas column store provides better read performance and lower memory usage. WiredTiger supports a mixed storage model for a particular table, e.g., a column-store table with a row-store index.
WiredTiger supports both heap files and log-structured storage; the storage organization for a table is specified when the table is created. For heap-file storage, a B-tree is used as the index data structure and on-disk blocks are maintained by WiredTiger's block manager. For log-structured storage, a specialized LSM tree manager manages the data in the LSM tree.
WiredTiger typically serves as a pluggable key-value storage engine for NoSQL databases, but it also supports schemas and multi-version concurrency control. There is a thin schema layer in WiredTiger, in which the structure of a table is provided at runtime. Right below the schema layer is an in-memory caching layer, which heavily uses lock-free data structures (such as lock-free skip lists and B-trees) and is optimized for modern multi-core architectures. Below the caching layer is a block storage manager, where log-structured merge trees can be used to improve write throughput and reduce write amplification.