Blink and it's done: Interactive queries on very large data
Article 2012 en
Authors
Sameer Agarwal
Anand Iyer
Aurojit Panda
Abstract
In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggregate) queries as supported by these systems. BlinkDB provides real-time answers along with statistical error guarantees, and can scale to petabytes of data and thousands of machines in a fault-tolerant manner. Our experiments using the TPC-H benchmark and on an anonymized real-world video content distribution workload from Conviva Inc. show that BlinkDB can execute a wide range of queries up to 150× faster than Hive on MapReduce and 10–150× faster than Shark (Hive on Spark) over tens of terabytes of data stored across 100 machines, all with an error of 2–10%.
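To make the idea of answering an aggregate query from a sample, with an attached error bound, more concrete, here is a minimal sketch in Python. It is not BlinkDB's implementation: it assumes a plain uniform random sample and a normal-approximation confidence interval, and the function name `approximate_mean`, the synthetic `session_times` data, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where estimate +/- margin is an
    approximate 95% confidence interval when z = 1.96.
    This is an illustrative assumption, not BlinkDB's actual method.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean (normal approximation).
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Hypothetical data: synthetic session durations in seconds.
data_rng = random.Random(42)
session_times = [data_rng.expovariate(1 / 300.0) for _ in range(200_000)]

# Scan only 5% of the rows instead of the full table.
estimate, margin = approximate_mean(session_times, sample_size=10_000)
exact = statistics.fmean(session_times)

print(f"approximate mean: {estimate:.1f} +/- {margin:.1f}")
print(f"exact mean:       {exact:.1f}")
print(f"relative error bound: {margin / estimate:.2%}")
```

The sketch reads a small fraction of the rows and reports the answer together with a margin of error, which is the trade-off the abstract describes: less data scanned in exchange for a bounded, quantified loss of accuracy.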