Accumulo

Revision #9 from 12/10/2018 11:29 p.m.

Apache Accumulo is a sorted, distributed key-value store based on Google's Bigtable design and built on top of HDFS and Apache ZooKeeper. First designed and developed by a team at the NSA, Accumulo aims to support storing and processing big data while enforcing fine-grained access control. In particular, the NSA team extended Bigtable's design so that Accumulo can control access to individual data elements. Accumulo is currently an open source project released under the Apache License 2.0.

History

2006

Google publishes "Bigtable: A Distributed Storage System for Structured Data." In the same year, Apache Hadoop, an open source project backed by Yahoo!, has its first release.

January 2008

To solve the problem of storing and processing large amounts of data with different sensitivity levels, a team of computer scientists and mathematicians at the NSA begins evaluating various big data technologies.

July 2008

The NSA team decides to begin a new Bigtable implementation.

September 2011

Accumulo becomes a public open source incubator project hosted by the Apache Software Foundation.

March 2012

Version 1.3.5 is released. This is the first publicly available version.

April 2012

Version 1.4 is released.

May 2013

Version 1.5 is released. This version adds a Thrift proxy and table import/export to Accumulo.

May 2014

Version 1.6 is released.

Views

Not Supported

Accumulo doesn't support views since Bigtable doesn't.

Data Model

Column Family / Wide-Column

Based on Google's Bigtable, Accumulo is a column-oriented DBMS. It stores key-value pairs on disk and always keeps the keys sorted. Values are stored as byte arrays, with no restriction on their size or type. Keys consist of three components: a row ID, a column and a time stamp. Keys are sorted first by row ID, then by column, and finally by time stamp (newest version first). This implies that values in the same row are stored together, and that different rows don't have to contain the same number of columns. Time stamps are used to support multiple versions of the same key. The column component of the key is further divided into three fields: column family, column qualifier and column visibility. Column families are defined by the application designer to group columns with similar functions, so that Accumulo stores them close together on disk for faster access. Note that unlike Bigtable and HBase, Accumulo column families need not be declared before use. Column visibility is the feature that distinguishes Accumulo: it allows data with different sensitivity levels to be stored in the same physical tables.
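The key layout and sort order described above can be illustrated with a small Python sketch. This is a toy model, not Accumulo's client API: the `Key` tuple, `sort_order` and `scan` are hypothetical names, the sample data is made up, and visibility is simplified to a set of labels that must all be held by the reader (real visibility labels are boolean expressions). Time stamps are sorted in descending order so that the newest version of a cell comes first.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: FrozenSet[str]  # simplification: every label is required
    timestamp: int


def sort_order(key: Key) -> Tuple:
    # Row ID first, then the column fields, then time stamp descending
    # so that the newest version of a cell sorts first.
    return (key.row, key.family, key.qualifier,
            tuple(sorted(key.visibility)), -key.timestamp)


def scan(entries: List[Tuple[Key, bytes]],
         authorizations: FrozenSet[str]) -> List[Tuple[Key, bytes]]:
    """Return the entries the reader may see, in key order."""
    visible = [(k, v) for k, v in entries if k.visibility <= authorizations]
    return sorted(visible, key=lambda kv: sort_order(kv[0]))


# Made-up sample data: rows do not need the same set of columns.
entries = [
    (Key("row2", "contact", "email", frozenset(), 5), b"bob@example.com"),
    (Key("row1", "contact", "phone", frozenset({"private"}), 7), b"555-0100"),
    (Key("row1", "contact", "email", frozenset(), 3), b"old@example.com"),
    (Key("row1", "contact", "email", frozenset(), 9), b"new@example.com"),
]

public_view = scan(entries, frozenset())
private_view = scan(entries, frozenset({"private"}))
```

A reader without the `private` label sees only the three e-mail cells, with both versions of `row1`'s e-mail adjacent and the newer one first. A reader holding `private` also sees the phone cell, even though all four entries live in the same table.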

System Architecture

Shared-Nothing

Accumulo uses a Shared-Nothing architecture and relies on HDFS to manage its files.

Storage Model

Decomposition Storage Model (Columnar)

Concurrency Control

Multi-version Concurrency Control (MVCC)

Accumulo guarantees ACID properties per row.
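To make the per-row guarantee concrete, here is a minimal Python sketch of a store in which all updates to one row are published as a single unit and older versions are retained. It is a toy in-memory model under stated assumptions, not Accumulo's implementation: `RowStore`, `apply` and `read` are hypothetical names, and a single lock stands in for whatever coordination the real system uses.

```python
import threading
from typing import Dict, List, Tuple


class RowStore:
    """Toy in-memory store with row-level atomic updates (illustrative only)."""

    def __init__(self) -> None:
        # row -> column -> list of (timestamp, value), newest first
        self._rows: Dict[str, Dict[str, List[Tuple[int, bytes]]]] = {}
        self._lock = threading.Lock()
        self._clock = 0

    def apply(self, row: str, updates: Dict[str, bytes]) -> int:
        """Apply every column update for one row as a single unit."""
        with self._lock:
            self._clock += 1
            ts = self._clock
            # Build the new row state on the side, then publish it in one
            # assignment so readers never observe a half-applied update.
            old = self._rows.get(row, {})
            new = {col: list(versions) for col, versions in old.items()}
            for col, value in updates.items():
                new.setdefault(col, []).insert(0, (ts, value))
            self._rows[row] = new
            return ts

    def read(self, row: str) -> Dict[str, bytes]:
        """Return the newest version of each column from one row snapshot."""
        snapshot = self._rows.get(row, {})
        return {col: versions[0][1] for col, versions in snapshot.items()}


store = RowStore()
store.apply("acct1", {"balance": b"100", "status": b"open"})
store.apply("acct1", {"balance": b"80"})
latest = store.read("acct1")
```

Because a reader works from one snapshot of the row, it sees either all of an update or none of it. Nothing in this sketch coordinates updates across different rows, which matches the per-row scope of the guarantee.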

Query Interface

Custom API, Command-line / Shell

Accumulo provides the user with two ways to interact with the system. The first is a client API, available for C++, Python, Java and Ruby (the non-Java languages go through the Thrift proxy). The second is a simple shell that allows the user to examine table contents, update configuration settings, insert, update and delete values, and so on.

Compression

Prefix Compression

Accumulo employs two compression techniques. The first compresses the blocks of data stored on disk with GZip or LZO. The second is relative-key encoding, which stores a prefix shared by consecutive keys only once, so that each following key records only its difference from the previous key.
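The idea behind relative-key encoding can be sketched in a few lines of Python. This is a simplified illustration on whole strings, assuming the keys are already sorted. Accumulo's actual on-disk format is more involved, and the function names and sample keys here are hypothetical. Each key is stored as the length of the prefix it shares with the previous key plus the remaining suffix.

```python
from typing import List, Tuple


def encode(sorted_keys: List[str]) -> List[Tuple[int, str]]:
    """Store each key as (shared prefix length, suffix) relative to the previous key."""
    encoded = []
    previous = ""
    for key in sorted_keys:
        shared = 0
        limit = min(len(previous), len(key))
        while shared < limit and previous[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the full keys by reusing the prefix of the previous key."""
    keys = []
    previous = ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


keys = ["user0001:name", "user0001:phone", "user0002:name"]
packed = encode(keys)
# packed == [(0, "user0001:name"), (9, "phone"), (7, "2:name")]
restored = decode(packed)
```

Because the keys are kept sorted, neighbouring keys tend to share long prefixes, which is why this kind of encoding pays off in a store like Accumulo.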