Cosmos DB is a globally distributed, consistent, schema-less, multi-model document database that provides high throughput and availability across geographical regions. It is used to solve the data storage problems of large, Internet-scale distributed applications. Most of Microsoft's internal services, such as Bing, Office 365, and Ads, and many external services use Cosmos DB for their storage needs. It provides 99.99% availability regardless of the number of regions associated with the data. It provides turn-key distribution, which can be used to replicate data to specific replica instances to provide low-latency data access to users across the globe.
It started as an internal project called 'Florence' at Microsoft for storing large-scale unstructured data generated by several of its internal services. It was later named DocumentDB in 2014. It was released to the public as an Azure service under the name 'Azure Cosmos DB' in 2017.
It performs checkpoints on its document Index (Bw-Tree) periodically to reduce the recovery time if a node fails.
Optimistic Concurrency Control (OCC)
Cosmos DB supports OCC for executing SQL transactions. It uses the 'ETag' HTTP header to validate user requests against the stored data and decide whether to commit or abort a transaction.
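The ETag check described above can be sketched with a toy in-memory store. This is only an illustration of the general technique, not Cosmos DB's implementation: the class `EtagStore`, the exception `PreconditionFailed`, and the `if_match` parameter are hypothetical names chosen for this example.

```python
import uuid


class PreconditionFailed(Exception):
    """Raised when the caller's ETag no longer matches the stored one."""


class EtagStore:
    """Toy in-memory document store (hypothetical, for illustration only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (etag, body)

    def read(self, doc_id):
        # Returns the current (etag, body) pair for the document.
        return self._docs[doc_id]

    def write(self, doc_id, body, if_match=None):
        current = self._docs.get(doc_id)
        # Abort when the caller supplied an ETag that is not the current one.
        if if_match is not None and (current is None or current[0] != if_match):
            raise PreconditionFailed(doc_id)
        etag = uuid.uuid4().hex  # every committed write gets a fresh ETag
        self._docs[doc_id] = (etag, dict(body))
        return etag


store = EtagStore()
store.write("cart-1", {"items": 1})
etag, body = store.read("cart-1")
store.write("cart-1", {"items": 2}, if_match=etag)      # commits
try:
    store.write("cart-1", {"items": 3}, if_match=etag)  # stale ETag
except PreconditionFailed:
    print("conflict: re-read the document and retry")
```

A writer that loses the race re-reads the document, reapplies its change to the new version, and retries with the new ETag, so no locks are held between the read and the write.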
It is a multi-model service and supports document, key-value, graph, and column-family data models.
Cosmos DB is a NoSQL document database that indexes document contents directly. The index is the union of all words in all documents, so a query can match on any word of any document in the database. It is represented as a schema-agnostic tree in which the nodes are all possible words of the document set and the values are the documents that contain each word. A Bw-Tree is used to represent this schema-less index. To support fast random writes on SSDs and disks, Cosmos DB also employs log-structured merge trees to store Bw-Tree modifications. It applies delta-record updates instead of in-place updates to the tree, which avoids cache invalidation and write amplification on SSDs. Cosmos DB also supports blind incremental updates to its Bw-Tree, which perform partial writes to a record without first reading it into memory.
As the database is distributed, index modifications have to be replicated to all replicas of a data shard. Cosmos DB replicates delta records asynchronously to bring the secondaries in line with the primary replica. When a new document is created on the primary, it is fully analyzed to extract all of its words; these words are inserted into the index, and the word stream is also transferred to the secondaries.
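The idea of delta-record and blind updates can be sketched with a small word index. This is a simplified illustration and not a Bw-Tree: it uses plain dictionaries, and the class `DeltaIndex` and its methods are hypothetical names. Writes only append a delta record for each word, without reading the existing postings; reads replay the deltas on top of the consolidated base.

```python
from collections import defaultdict


class DeltaIndex:
    """Toy word index with append-only delta records (not a real Bw-Tree)."""

    def __init__(self):
        self._base = defaultdict(set)     # word -> consolidated document ids
        self._deltas = defaultdict(list)  # word -> [(op, doc_id), ...]

    def add_document(self, doc_id, text):
        for word in set(text.lower().split()):
            # Blind update: append a delta without reading current postings.
            self._deltas[word].append(("add", doc_id))

    def remove_document(self, doc_id, text):
        for word in set(text.lower().split()):
            self._deltas[word].append(("remove", doc_id))

    def lookup(self, word):
        # Replay the delta chain over the base postings at read time.
        postings = set(self._base.get(word, ()))
        for op, doc_id in self._deltas.get(word, ()):
            if op == "add":
                postings.add(doc_id)
            else:
                postings.discard(doc_id)
        return postings

    def consolidate(self):
        # Fold accumulated deltas into the base and drop the chains.
        for word in list(self._deltas):
            self._base[word] = self.lookup(word)
            del self._deltas[word]


index = DeltaIndex()
index.add_document("d1", "hello world")
index.add_document("d2", "hello cosmos")
print(sorted(index.lookup("hello")))  # ['d1', 'd2']
```

Calling `consolidate()` periodically keeps delta chains short, trading a little background work for cheaper reads.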
Only self-joins are supported.
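A self-join here pairs a document with the elements of one of its own nested arrays, rather than with other documents. The sketch below imitates that behaviour over plain Python dictionaries; the function `self_join` and the field names `id` and `tags` are hypothetical and exist only for this example.

```python
def self_join(documents, array_field):
    """Yield (document, element) pairs for each element of a nested array."""
    for doc in documents:
        # A document with a missing or empty array contributes no rows.
        for element in doc.get(array_field, []):
            yield doc, element


docs = [
    {"id": "a", "tags": ["x", "y"]},
    {"id": "b", "tags": []},
]
rows = [(doc["id"], tag) for doc, tag in self_join(docs, "tags")]
print(rows)  # [('a', 'x'), ('a', 'y')]
```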
It supports the SQL, MongoDB, Cassandra, Gremlin, and Table APIs.
It uses both in-memory and disk-based log-structured merge trees to store documents.
The Cosmos DB service is deployed on several replicated, shared-nothing nodes across geographical regions for high availability, low latency, and high throughput. Some or all of these distributed nodes form a replica set that serves requests for a data shard containing documents. One of the replicas is elected as the master to perform totally ordered writes on the data shard. Writes go to a write quorum (W), a subset of the replica nodes, to ensure that the data is durable. Reads go to a read quorum (R), a subset of the replica nodes, to provide the consistency level configured by the user (Strong, Bounded Staleness, Session, Consistent Prefix, or Eventual).
Data is partitioned at the logical level and replicated at the storage layer as physical partitions to achieve the desired availability and throughput.
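The interaction between the two quorums can be illustrated with the classical overlap rule: with N replicas, a read is guaranteed to see the latest acknowledged write when R + W > N. The sketch below is a generic model under that assumption and does not reflect the quorum sizes Cosmos DB actually uses for each consistency level; `Replica`, `quorum_write`, `quorum_read`, and `quorums_overlap` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Replica:
    version: int = 0  # version of the last write this replica has applied
    value: Any = None


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Classical condition: every read quorum intersects every write quorum."""
    return r + w > n


def quorum_write(replicas: List[Replica], w: int, value: Any, version: int) -> None:
    # Toy choice: the first w replicas acknowledge the write.
    for replica in replicas[:w]:
        replica.version = version
        replica.value = value


def quorum_read(replicas: List[Replica], r: int) -> Any:
    # Toy choice: contact the last r replicas, the worst case for overlap,
    # and return the value carried by the highest version among them.
    contacted = replicas[-r:]
    return max(contacted, key=lambda rep: rep.version).value


replicas = [Replica() for _ in range(5)]
quorum_write(replicas, w=3, value="v1", version=1)
print(quorum_read(replicas, r=3))  # v1   (3 + 3 > 5, quorums overlap)
print(quorum_read(replicas, r=2))  # None (2 + 3 = 5, read can be stale)
```

Smaller read quorums reduce latency at the cost of possibly stale reads, which is the trade-off that weaker consistency levels expose.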
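One common way to map logical partitions onto physical ones is to hash a partition key. The sketch below assumes a simple hash-modulo placement, which is a simplification and not Cosmos DB's actual scheme; `physical_partition`, `place`, and the `tenant` field are hypothetical names used only for illustration.

```python
import hashlib
from collections import defaultdict


def physical_partition(partition_key: str, partition_count: int) -> int:
    """Map a logical partition key to a physical partition number."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer hash.
    return int.from_bytes(digest[:8], "big") % partition_count


def place(documents, key_field, partition_count):
    """Group document ids by the physical partition their key hashes to."""
    placement = defaultdict(list)
    for doc in documents:
        target = physical_partition(doc[key_field], partition_count)
        placement[target].append(doc["id"])
    return dict(placement)


docs = [
    {"id": "1", "tenant": "contoso"},
    {"id": "2", "tenant": "fabrikam"},
    {"id": "3", "tenant": "contoso"},
]
print(place(docs, "tenant", partition_count=4))
```

Documents that share a partition key always land on the same physical partition, so a query scoped to one key touches a single partition.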
Cosmos DB uses its Bw-Tree index to support fast, real-time queries. Views are not used.