ClickHouse is an open-source, column-oriented OLAP DBMS that is reported to outperform other DBMSs (Vertica, Hive, MySQL) on similar OLAP workloads. It is known for its linear scalability, hardware efficiency, fault tolerance, rich features, simplicity, and high reliability.
ClickHouse is developed by Yandex, a Russian company, and was designed for use in multiple projects within Yandex. Yandex needed a fast DBMS for analyzing large amounts of data, a need its existing solutions could not meet. It therefore began to develop its own column-oriented DBMS that can handle analytical data at internet scale. The prototype of ClickHouse appeared in 2009, and it was released as open source in 2016.
ClickHouse provides two types of parsers: a full SQL parser and a data format parser. It uses the SQL parser for all types of queries and the data format parser only for INSERT queries. Beyond the query language, it provides multiple user interfaces, including an HTTP interface, a JDBC driver, a TCP interface, and a command-line client.
Virtual Views, Materialized Views
ClickHouse supports both virtual views and materialized views. A materialized view stores the data transformed by its corresponding SELECT query. The SELECT query can contain DISTINCT, GROUP BY, ORDER BY, LIMIT, etc.
ClickHouse doesn't support transactions.
ClickHouse replicates its data on multiple nodes and monitors whether the replicas are in sync. It recovers after failures by syncing data from other replica nodes.
ClickHouse only supports hash join, which is performed by placing the right-hand part of the data in a hash table in memory. Hash join is fast, but it requires enough memory to hold that hash table.
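The build-and-probe idea behind a hash join can be sketched in a few lines of Python. This is only an illustration of the general technique, not ClickHouse's implementation; the function name `hash_join` and the sample `visits` and `users` tables are hypothetical.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key` using an in-memory hash table."""
    # Build phase: the whole right-hand side is loaded into a hash table,
    # which is why this strategy needs enough memory to hold it.
    table = defaultdict(list)
    for row in right_rows:
        table[row[key]].append(row)

    # Probe phase: stream the left-hand side and look up each key.
    joined = []
    for left in left_rows:
        for right in table.get(left[key], []):
            joined.append({**left, **right})
    return joined


# Hypothetical sample data.
visits = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/search"},
    {"user_id": 3, "url": "/about"},
]
users = [
    {"user_id": 1, "country": "RU"},
    {"user_id": 2, "country": "DE"},
]

result = hash_join(visits, users, "user_id")
```

Only the right-hand table is materialized in memory; the left-hand table is read once, row by row. The row for `user_id` 3 has no match and is dropped, as expected for an inner join.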
Decomposition Storage Model (Columnar)
ClickHouse is a column-oriented DBMS and it stores data by columns.
Currently, stored procedures and user-defined functions (UDFs) are listed as open issues in ClickHouse.
ClickHouse supports primary key indexes. The index mechanism is called a sparse index. In the MergeTree engine, the data in each part is sorted lexicographically by primary key. ClickHouse then records a mark every index_granularity rows. These marks serve as a sparse index, which allows efficient range queries.
ClickHouse has multiple types of table engines. The type of table engine determines where the data is stored, the level of concurrency, whether indexes are supported, and some other properties. The table engines that store data on disk include TinyLog and Log. The Memory engine stores data in memory and is mainly used for temporary tables with external query data. The data in a Memory engine table disappears when the server is restarted.
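A minimal Python sketch of how a sparse index narrows a range query is shown here. It assumes a single integer key column and a very small granularity for readability; the helper names `build_marks` and `candidate_rows` are hypothetical and do not correspond to ClickHouse internals.

```python
from bisect import bisect_left, bisect_right

INDEX_GRANULARITY = 4  # assumed small value, chosen only for illustration


def build_marks(sorted_keys, granularity=INDEX_GRANULARITY):
    """Keep the key of every `granularity`-th row as a mark."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granularity)]


def candidate_rows(marks, lo, hi, n_rows, granularity=INDEX_GRANULARITY):
    """Return the half-open row range [start, stop) that may hold keys in [lo, hi]."""
    # The granule just before the first mark >= lo can still contain lo.
    first_granule = max(bisect_left(marks, lo) - 1, 0)
    # Granules whose mark is greater than hi cannot contain matching keys.
    end_granule = bisect_right(marks, hi)
    return first_granule * granularity, min(end_granule * granularity, n_rows)


# Keys are already sorted, as they are inside a part.
keys = list(range(0, 40, 2))
marks = build_marks(keys)

# Range query: 10 <= key <= 18. Only the candidate rows are scanned.
start, stop = candidate_rows(marks, 10, 18, len(keys))
matches = [k for k in keys[start:stop] if 10 <= k <= 18]
```

The index holds one entry per granule instead of one per row, so it stays small enough to keep in memory. The cost is that a query reads whole granules and filters out the few rows at the edges that fall outside the range.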
ClickHouse not only stores data by columns but also processes data by columns.
A ClickHouse system is a cluster of shards. It uses asynchronous multi-master replication, and there is no single point of contention across the system.
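The difference between the two layouts can be illustrated with a small Python sketch. The table contents and the function names are hypothetical, and the example shows only the general idea of columnar storage and processing.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    (1, "/home", 120),
    (2, "/search", 340),
    (3, "/home", 95),
]

# Column-oriented layout: each column is stored as its own array.
columns = {
    "user_id": array("q", [1, 2, 3]),
    "url": ["/home", "/search", "/home"],
    "duration_ms": array("q", [120, 340, 95]),
}


def total_duration_rows(rows):
    # Every row is visited in full even though one field is needed.
    return sum(row[2] for row in rows)


def total_duration_columns(columns):
    # Only the duration_ms column is read, and it is processed as a whole.
    return sum(columns["duration_ms"])


row_total = total_duration_rows(rows)
column_total = total_duration_columns(columns)
```

Both functions return the same total, but the columnar version touches a single contiguous array. An analytical query that aggregates a few columns of a wide table benefits from this in the same way.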
ClickHouse supports runtime code generation. The code is generated for every kind of query on the fly, removing all indirection and dynamic dispatch. Runtime code generation pays off most when it fuses many operations together and keeps the CPU's execution units fully utilized.
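As a rough analogy in Python, the sketch below contrasts an interpreter that dispatches on every node of an expression tree with a function generated at runtime that fuses the whole expression. The expression format and the names `interpret`, `to_source`, and `compile_expr` are assumptions made for this illustration; ClickHouse generates native code, which this example does not attempt to reproduce.

```python
import operator

# Supported operators: source-level symbol and interpreter callable.
OPS = {"add": ("+", operator.add), "mul": ("*", operator.mul)}


def interpret(expr, row):
    """Evaluate by walking the tree: one dynamic dispatch per node, per row."""
    if isinstance(expr, str):
        return row[expr]
    op, left, right = expr
    return OPS[op][1](interpret(left, row), interpret(right, row))


def to_source(expr):
    """Turn the expression tree into a single Python expression string."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    return f"({to_source(left)} {OPS[op][0]} {to_source(right)})"


def compile_expr(expr, params):
    """Generate one fused function for the whole expression at runtime."""
    source = f"lambda {', '.join(params)}: {to_source(expr)}"
    return eval(source), source


# Hypothetical query expression: a * b + c
expr = ("add", ("mul", "a", "b"), "c")
fused, source = compile_expr(expr, ["a", "b", "c"])

rows = [{"a": 2, "b": 3, "c": 4}, {"a": 5, "b": 6, "c": 7}]
interpreted = [interpret(expr, r) for r in rows]
generated = [fused(r["a"], r["b"], r["c"]) for r in rows]
```

The tree is traversed once, at compile time, and the generated function then evaluates each row without any per-node lookup. Both paths produce the same results, so the generated version can replace the interpreter transparently.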