CrateDB is an open-source, distributed, shared-nothing SQL database system developed by Crate.io, built on top of a variety of open source projects. Some of these projects include Facebook’s Presto SQL parser and the Apache Lucene search library.
The system's target workload is machine-generated IoT data, particularly for operational analytics applications.
CradeDB started as a standalone project by Jodok Batlogg (who previously contributed to Open Source Initiative Vorarlberg), Christian Lutz, and Bern Dorn. The group, who ran a consulting business before that helped companies use tools for their data needs, turned that knowledge into a product. The team won Judge’s Choice at GigaOm Structure Launchpad competition in June 2014 and TechCrunch Discord Europe in October 2014.
The first version was released in September 2016 The second and enterprise versions released in May 2017.
SQL HTTP / REST Command-line / Shell
CrateDB can be queried with regular SQL. One way to query a table is through the CrateDB admin UI, in which the user can construct and view the results of a query. Another way is through CrateDB Shell (called Crash), which is their custom command-line shell. An HTTP endpoint is also provided for submitting queries; using HTTPie users can query using some endpoint, to which CrateDB will respond with JSON. There are also numerous third party client tools that will work with CrateDB. Since the system supports PostgreSQL wire protocol, most third party tools for PostgreSQL will also be compatible with CrateDB.
CrateDB supports creating, querying, and dropping views. The view is not materialized, so the query associated with this view is rerun every time the view is used. The enterprise version allows different users to have privileges, so to query a view the user must have DQL privileges on the view. The user who created the view automatically has DQL privileges on all the relations in the view.
Optimistic Concurrency Control (OCC)
Optimistic concurrency control is implemented using rows’ sequence number (
_seq_no) and primary term (
_primary_term). Initially each row’s sequence number is 0, and is incremented with every
DELETE to its shard (partition of the table). The primary term is incremented when a shard becomes primary. When updating or deleting, the query must be done with the correct sequence number and primary term, otherwise no effect will take place. CrateDB does not support transactions.
Each shard has its own write ahead log (called the translog), which gets flushed to the index storage of Lucene when the log is full. The translog stores records of the operations on nodes. Operations store metadata about the query that was executed (such as
used_bytes). The job that corresponds to the given job id stores the actual query statement that was run.
Additionally, CrateDB supports application logging (with Log4j) and JVM garbage collection logging. Apache’s Log4J provides logging functionalities for Java. Crate DB uses JVM garbage collection logging to maintain garbage collection times.
CrateDB supports two join algorithms: nested loop joins and block hash joins. By default the system uses nested loop joins. Block hash joins can only be applied on inner joins where the condition meets the following criteria: it contains at least one equal operator and no or operators and every argument of an equal operator can only reference fields from one relation. The hash join algorithm can be enabled or disabled explicitly.
The system supports cross joins, inner joins, and left/right/full outer joins, and its performance is limited when joining two or more tables (resulting in poor execution plans).
CrateDB stores given data both in row and column store formats. Column store is enabled by default for primitive types (such as integers or booleans) and cannot be turned off. It is supported for a limited length of text data as well, but can be disabled. It is not supported for other types, specifically compound or geographic types.
Rows of a table are stored as a semi-structured document that can be nested in objects, and operation on these documents are atomic.
Read Uncommitted Read Committed Serializable Repeatable Read
CrateDB supports four isolation levels:
REPEATABLE READ, and
SERIALIZABLE. The isolation level is specified in the begin statement as a
Given an operation tree, CrateDB’s Job Execution Service will execute the given plan. Each operation will emit the entire (intermediate) result to other nodes, which will continue the query execution with the given result.
CrateDB does not support stored procedures.
CrateDB stores data on disk, which is retrieved and stored depending on the execution plan. The data is partitioned and stored across many nodes.
CrateDB does not support foreign keys.
CrateDB is a shared nothing distributed system. Every Crate Node has the same four components, making them all equal in terms of functionality. They can all process SQL statements, execute queries, interact with the cluster, and store data. Therefore, any node can receive, process, and execute queries.
A Crate cluster is defined as two different nodes representing the same database instance but running on different hosts. The cluster state includes metadata including global settings, discovered nodes, schemas, and status/location of shards. A single node in the cluster is elected the “Metadata Primary”, and is the only node allowed to change the state at runtime. Discovering a node refers to finding, adding, and removing nodes. A node will ping potential hosts, and after receiving a response including the Metadata Primary, the new node can send a join request to that cluster specifically.
All nodes in a Crate cluster can communicate with any other node in the cluster, via byte serialized POJOs (plain old Java objects). This full mesh topology improves reliability and messages to be sent along the shortest possible path, but is limited in growth.
Crate does not support query compilation.
In CrateDB, the tables are sharded (partitioned) and divided amongst the nodes, and the shard is stored as a Lucene index, which is further broken down to stored as files under a directory of a node. Data is appended to files, never removed, making replication and recovery easier. When writing to a file, the primary node is looked up and the new data is added to the file. The operation is repeated on replicas.