HerdDB is a distributed SQL DBMS implemented in Java, optimized for primary key read/update access patterns. It has been designed to be embeddable in any Java Virtual Machine.
HerdDB supports periodic checkpoints, every 15 minutes by default. At checkpoints, active page ids are written to disk with their current log sequence numbers. At the same time all dirty pages are discarded and their records used to build new pages, among other new or updated records. Checkpoints can be tuned to be either as fast as possible or as clean as possible. Fast checkpoints block write operations for less time, whereas clean checkpoints optimize memory usage (fewer dirty pages left in memory) and speed up searches (fewer dirty pages left on disk).
Deterministic Concurrency Control
Before accessing records, clients acquire read or write locks. Every transaction that modifies a record holds the new data in a local buffer copy, and this new version of the record is not visible to other transactions until that one is committed (Pessimistic Row Level Locking).
B+Tree Block Range Index (BRIN)
WAL replication based on Apache BookKeeper.
N-ary Storage Model (Row/Record)
HerdDB’s internal architecture stores a table as a set of key-value entries. This is implemented in Java by a very large map of binary data. Each row is translated from column-oriented to key-value format by tearing apart the “primary key” part (one or multiple columns) from the “value” part (other columns).
At any given time, some part of the data is stored in a memory buffer and some other on disk. Transaction logs are the source-of-truth and the whole database can be recovered from them plus a checkpoint to ensure that no data can be lost on JVM crashes. When a row is stored on disk it is assigned to a "data page"; on its first mutation, it is detached from its data page and that page is marked as “dirty”. At checkpoints, all dirty pages are dismissed and their records used to build new pages, among other new or updated records. Records modified/inserted/deleted in the scope of a transaction are never written to disk and they are not present in the main buffer until that transaction is committed, so that there is always a consistent and committed data snapshot. Every transaction uses its own local buffer to store temporary data.
WAL replication and distributed configuration based on Apache Zookeeper and Apache BookKeeper.