Drill, a new distributed system for interactive analysis of large-scale datasets inspired by Google's Dremel, has been accepted into the Apache Incubator.
The main attraction of Dremel, the query system used for Google’s BigQuery analytics, is the ability to store and search trillion-row datasets without the need to use Hadoop.
While Hadoop is very efficient when using the MapReduce framework to perform batch analysis, the batch nature of the work makes Hadoop unsuitable for analysing transactional data.
Drill, by comparison, can perform data queries at a much faster rate. The team behind Drill say that it is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to process nested data efficiently, with a design goal of scaling to 10,000 servers or more and processing petabytes of data and trillions of records in seconds.
According to its Apache Incubator proposal, which is being championed by Ted Dunning, Drill, like Dremel, supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers.
It points out that in many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data.
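To make the point about normalization concrete, here is a minimal Python sketch. It is not Drill code: the order records, field names and helper functions are all hypothetical and exist only for illustration. It shows the same question answered directly against nested records, and the extra flattening step a purely relational model would need first.

```python
# Hypothetical nested order records; the field names are illustrative only
# and do not come from Drill or Dremel.
orders = [
    {"order_id": 1,
     "customer": {"name": "Ann", "city": "Oslo"},
     "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2,
     "customer": {"name": "Bo", "city": "Rome"},
     "items": [{"sku": "A", "qty": 5}]},
]


def total_qty_nested(records, sku):
    """Answer the query directly on the nested records."""
    return sum(item["qty"]
               for rec in records
               for item in rec["items"]
               if item["sku"] == sku)


def normalize(records):
    """Flatten the nested records into two relational-style tables."""
    order_rows, item_rows = [], []
    for rec in records:
        customer = rec["customer"]
        order_rows.append((rec["order_id"], customer["name"], customer["city"]))
        for item in rec["items"]:
            # Each repeated item becomes its own row, keyed by order_id.
            item_rows.append((rec["order_id"], item["sku"], item["qty"]))
    return order_rows, item_rows


order_rows, item_rows = normalize(orders)
```

In the normalized form, the repeated `items` field has to be split into a second table and joined back through `order_id`; a system with a nested data model can skip that step and query the records as they are stored.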
The Drill architecture consists of four key components or layers:
The query languages layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support DrQL, the SQL-like language used by Dremel and Google BigQuery. Drill is also designed to support other languages and programming models, such as the Mongo Query Language, Cascading and Plume.
Drill has a low-latency distributed execution engine that is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines such as Dremel, Dryad, Hyracks, CIEL and Stratosphere, alongside columnar storage.
The nested data formats layer is responsible for supporting various data formats, with the initial goal of supporting the column-based format used by Dremel.
A particular distinction with Drill is that the execution engine is flexible enough to support column-based processing as well as row-based processing. This is important because column-based processing can be much more efficient when the data is stored in a column-based format, but many large data assets are stored in a row-based format that would require conversion before use.
The scalable data sources layer is responsible for supporting various data sources, starting with Hadoop.
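The distinction drawn above between row-based and column-based processing can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the records are flat and share the same keys, the field names are hypothetical, and it does not reproduce Dremel's actual nested columnar encoding or anything from Drill's engine.

```python
# Hypothetical flat log records stored row by row; names are illustrative.
rows = [
    {"user": "ann", "country": "NO", "bytes": 120},
    {"user": "bo", "country": "IT", "bytes": 300},
    {"user": "cy", "country": "NO", "bytes": 80},
]


def to_columns(records):
    """Convert row-based records into one list of values per field."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def sum_bytes_row_based(records):
    """Row-based scan: every whole record is visited to read one field."""
    return sum(record["bytes"] for record in records)


def sum_bytes_column_based(columns):
    """Column-based scan: only the 'bytes' column is read."""
    return sum(columns["bytes"])


columns = to_columns(rows)
```

Both functions return the same total, but the column-based version reads only the values it needs. The `to_columns` call stands in for the conversion cost that row-stored data would incur before it could be processed this way, which is why an engine that handles both layouts avoids that step.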