BlinkDB

BlinkDB is an approximate query engine built on top of Hive and Shark (Hive on Spark, the predecessor of Spark SQL). It lets users trade off query accuracy for response time, enabling interactive queries on big data. BlinkDB builds a set of stratified samples from the original data and runs queries against those samples rather than the full data to reduce execution time. It has two major parts: a sample-building engine that decides which stratified samples to build based on historical workloads and the characteristics of the table, and a dynamic sample selection module that picks the appropriate samples at runtime according to the query's time or accuracy requirements.
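To illustrate the idea of a stratified sample, here is a minimal Python sketch, not BlinkDB's actual implementation. It assumes rows are plain dictionaries and that each stratum (each distinct value of a chosen column) is capped at a fixed number of rows; the names `build_stratified_sample`, `estimate_count`, `cap`, and `_weight` are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def build_stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` rows for each distinct value of `key`.

    Every kept row carries a weight (stratum size / rows kept) so that
    aggregates computed on the sample can be scaled back to the full data.
    Illustrative only; this is an assumption, not BlinkDB's code.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    sample = []
    for members in strata.values():
        # Rare groups are kept whole; frequent groups are down-sampled.
        kept = members if len(members) <= cap else rng.sample(members, cap)
        weight = len(members) / len(kept)
        sample.extend({**row, "_weight": weight} for row in kept)
    return sample


def estimate_count(sample, predicate):
    """Estimate COUNT(*) over the original data for rows matching `predicate`."""
    return sum(row["_weight"] for row in sample if predicate(row))
```

Capping each stratum rather than sampling uniformly keeps rare groups (for example, a city with only a handful of sessions) represented in the sample, so queries that filter or group on that column still have rows to work with.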

History

BlinkDB was proposed in the paper "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data", which won the Best Paper Award at EuroSys 2013.

BlinkDB is no longer maintained. It is integrated into VerdictDB.

Data Model

Relational

Concurrency Control

Not Supported

BlinkDB leaves concurrency control to the base database system.

Indexes

Not Supported

Storage Architecture

Hybrid

BlinkDB maintains its samples both on disk and in memory.

System Architecture

Shared-Nothing

Query Interface

SQL

BlinkDB accepts SQL-based aggregation queries annotated with either a response time constraint or an error bound constraint. For example:

SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS

SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%
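The error-bounded form of such a query can be sketched in Python as follows. This is a rough illustration under stated assumptions, not BlinkDB's algorithm: it treats the ERROR value as a relative error on the average, uses a normal-approximation confidence interval (z = 1.96 for roughly 95% confidence), and tries candidate samples from smallest to largest. The function names `approx_avg` and `smallest_sample_meeting_error` are hypothetical.

```python
import math
import statistics


def approx_avg(values, z=1.96):
    """Return (mean, half_width) of an approximate confidence interval.

    Uses the normal approximation: half_width = z * stdev / sqrt(n).
    Requires at least two values.
    """
    n = len(values)
    mean = statistics.fmean(values)
    stderr = statistics.stdev(values) / math.sqrt(n)
    return mean, z * stderr


def smallest_sample_meeting_error(samples, max_rel_error, z=1.96):
    """Pick the smallest sample whose relative error is within the bound.

    `samples` is a list of value lists of different sizes (an assumption
    standing in for pre-built samples). Returns (sample, mean, half_width),
    or None if no sample satisfies the bound.
    """
    for sample in sorted(samples, key=len):
        mean, half_width = approx_avg(sample, z)
        if mean != 0 and half_width / abs(mean) <= max_rel_error:
            return sample, mean, half_width
    return None
```

Trying the smallest sample first reflects the trade-off described above: a smaller sample answers faster, and a larger one is used only when the requested accuracy cannot otherwise be met.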

Website

http://blinkdb.org/

Source Code

https://github.com/sameeragarwal/blinkdb

Developer

University of California-Berkeley, Massachusetts Institute of Technology

Country of Origin

US

Start Year

2012

End Year

2014

Project Type

Academic, Open Source

Derived From

Spark SQL

Operating Systems

All OS with Java VM

Licenses

Apache v2