Perform Data Queries Faster With Drill
Written by Kay Ewbank   
Friday, 24 August 2012

Drill, a new distributed system for interactive analysis of large-scale datasets inspired by Google's Dremel, has been accepted into the Apache Incubator.

The main attraction of Dremel, the query system used for Google’s BigQuery analytics, is the ability to store and search trillion-row datasets without the need to use Hadoop.

While Hadoop is very efficient when using the MapReduce framework to perform batch analysis, the batch nature of the work makes Hadoop unsuitable for analysing transactional data.

Drill, by comparison, can perform data queries at a much faster rate. The team behind Drill say that it is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to process nested data efficiently, and its design goals are to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds.

 

 

According to its Apache Incubator proposal, which is being championed by Ted Dunning, Drill, like Dremel, supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers.

The proposal points out that in many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data.
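To make the point about normalization concrete, here is a minimal Python sketch. It is not Drill code: the record, its field names and the helper functions are all hypothetical, chosen only to contrast reading a nested record directly with first flattening it into relational-style tables.

```python
import json

# Hypothetical nested record, as it might arrive in JSON.
raw = """
{"order_id": 1,
 "customer": {"name": "Ada", "city": "London"},
 "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}
"""
order = json.loads(raw)


def total_quantity(record):
    """Sum item quantities by walking the nested structure directly."""
    return sum(item["qty"] for item in record["items"])


def normalize(record):
    """Flatten the same record into two relational-style tables."""
    orders = [{
        "order_id": record["order_id"],
        "customer_name": record["customer"]["name"],
        "customer_city": record["customer"]["city"],
    }]
    items = [{"order_id": record["order_id"], **item}
             for item in record["items"]]
    return orders, items


nested_total = total_quantity(order)

# The normalized form needs a join on order_id to answer the same question.
orders_table, items_table = normalize(order)
joined_total = sum(
    row["qty"]
    for row in items_table
    if row["order_id"] == orders_table[0]["order_id"]
)

print(nested_total, joined_total)
```

Both totals agree, but the nested version answers the question without the extra flattening step or the join.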

The Drill architecture consists of four key components or layers:

The query languages layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support DrQL, the SQL-like language used by Dremel and Google BigQuery. Drill is also designed to support other languages and programming models, such as the Mongo Query Language, Cascading and Plume.

Drill has a low-latency distributed execution engine that is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines such as Dremel, Dryad, Hyracks, CIEL and Stratosphere, alongside columnar storage.

The nested data formats layer is responsible for supporting various data formats, with the initial goal of supporting the column-based format used by Dremel.

A particular distinction with Drill is that the execution engine is flexible enough to support column-based processing as well as row-based processing. This is important because column-based processing can be much more efficient when the data is stored in a column-based format, but many large data assets are stored in a row-based format that would require conversion before use.

The scalable data sources layer is responsible for supporting various data sources, starting with Hadoop.
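The difference between row-based and column-based processing mentioned above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not Drill's execution engine: the records, field names and functions are hypothetical, and real columnar engines add encoding and compression that are not shown here.

```python
# Hypothetical records in a row-based layout: one dict per record.
rows = [
    {"user": "a", "country": "UK", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "UK", "bytes": 80},
]


def to_columns(records):
    """Convert row-based records into a column-based layout: one list per field."""
    fields = records[0].keys()
    return {name: [rec[name] for rec in records] for name in fields}


def sum_rows(records, field):
    """Row-based scan: every whole record is visited to read one field."""
    return sum(rec[field] for rec in records)


def sum_column(columns, field):
    """Column-based scan: only the values of the requested field are read."""
    return sum(columns[field])


columns = to_columns(rows)
row_total = sum_rows(rows, "bytes")
column_total = sum_column(columns, "bytes")

print(row_total, column_total)
```

The call to `to_columns` stands in for the conversion step the article refers to: an engine that can only process columns must pay that cost before querying row-based data, whereas one that handles both layouts can scan the rows as they are.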

 

More Information

Drill Proposal

Related Articles

Real-time Hadoop Analysis

New MinuteSort Record Set by Microsoft Research

SQL Server 2012 and Second Preview for Hadoop for Azure

 


 

Last Updated ( Friday, 24 August 2012 )