Apache Druid Improves Compaction
Written by Kay Ewbank   
Tuesday, 04 February 2020

Apache Druid, a high-performance real-time analytics database designed for workflows where fast queries and ingestion really matter, has been updated with improvements including better compaction and batch ingestion.

Currently an incubator project at Apache, Druid is:

designed to excel at instant data visibility, ad-hoc queries, operational analytics, and handling high concurrency, and provides an open source alternative to data warehouses.

It was originally developed at a startup called Metamarkets to power an all-in-one analytics solution for programmatic digital advertising. Ad-tech is an area that generates hundreds of billions or even trillions of new records per day, and Druid was developed to cope with this level of data. It has since been extended for situations that aren’t adequately addressed by classic analytics stacks. Druid is now applied in areas including network flow analytics, product analytics, and user behavior. It is used by major companies including NTT, WalkMe, Pinterest, Netflix, Airbnb, Lyft, and Walmart.


Druid can natively stream data from message buses such as Kafka and Amazon Kinesis, and batch load files from data lakes such as HDFS and Amazon S3. Along with support for column-oriented storage, Druid also incorporates designs from search systems and timeseries databases.

The developers say Druid is better than traditional data warehouses because it has much lower latency for OLAP-style queries and for data ingest (both streaming and batch). Its support for time-based partitioning means time-based queries can be run efficiently, and its fast search and filtering allow quick slicing and dicing of data. This makes it well suited to real-time analytics and to situations where end users, technical or not, want to run numerous queries in rapid succession to explore or better understand data trends.
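To see why time-based partitioning helps time-based queries, here is a minimal Python sketch of the general idea. It is not Druid's code: the daily chunk size and the names `build_chunks` and `query_sum` are assumptions made for illustration. Rows are grouped into one chunk per day, and a query over a time interval only opens the chunks that overlap it.

```python
from collections import defaultdict
from datetime import datetime, time, timedelta


def build_chunks(rows):
    """Group (timestamp, value) rows into one chunk per calendar day."""
    chunks = defaultdict(list)
    for ts, value in rows:
        chunks[ts.date()].append((ts, value))
    return chunks


def query_sum(chunks, start, end):
    """Sum values with start <= ts < end, opening only overlapping day chunks.

    Returns (total, number_of_chunks_opened).
    """
    total = 0
    opened = 0
    day = start.date()
    # Walk day by day; stop once the day begins at or after the interval end.
    while datetime.combine(day, time.min) < end:
        if day in chunks:
            opened += 1
            for ts, value in chunks[day]:
                if start <= ts < end:
                    total += value
        day += timedelta(days=1)
    return total, opened


rows = [
    (datetime(2020, 1, 1, 9), 5),
    (datetime(2020, 1, 2, 10), 7),
    (datetime(2020, 1, 2, 23), 3),
    (datetime(2020, 1, 3, 1), 11),
]
chunks = build_chunks(rows)
result = query_sum(chunks, datetime(2020, 1, 2), datetime(2020, 1, 3))
```

In this toy data set, a query for 2 January opens a single chunk and never touches rows from the other days, which is where the efficiency comes from.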

The latest release includes an update to the native batch ingestion system. The internal framework now supports non-text binary formats, with initial support for ORC and Parquet. Single-dimension range partitioning has also been added for parallel native batch ingestion, so rows can now be divided into partitions according to value ranges of one dimension.
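The idea behind range partitioning on a single dimension can be sketched in a few lines of Python. This is an illustration of the general technique, not Druid's implementation: the function names, the way boundaries are picked, and the sample `country` dimension are all assumptions.

```python
from bisect import bisect_right


def range_boundaries(values, num_partitions):
    """Pick up to num_partitions - 1 split points from the sorted distinct values."""
    ordered = sorted(set(values))
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_rows(rows, dimension, num_partitions):
    """Assign each row to a partition based on the range its dimension value falls in."""
    boundaries = range_boundaries([row[dimension] for row in rows], num_partitions)
    partitions = {}
    for row in rows:
        # Values below the first boundary go to partition 0, and so on.
        index = bisect_right(boundaries, row[dimension])
        partitions.setdefault(index, []).append(row)
    return boundaries, partitions


rows = [
    {"country": c, "clicks": n}
    for c, n in [("us", 4), ("au", 1), ("fr", 2), ("ca", 7),
                 ("uk", 3), ("br", 5), ("jp", 9), ("de", 6)]
]
boundaries, partitions = partition_rows(rows, "country", 4)
```

Because each partition holds a contiguous range of values, a filter on that dimension only needs to look at the partitions whose range can contain the value.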

Compaction improvements start with support for split hints in parallel index tasks, meaning operators can provide hints to control the amount of data that each first-phase subtask reads. Parallel and stateful auto-compaction support has been added, and the Druid broker can now opportunistically merge query results in parallel using multiple threads.
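As a rough picture of what a split hint controls, the sketch below groups inputs into subtasks so that each subtask reads at most a given number of bytes where possible. It is a simplified illustration rather than Druid's algorithm, and `plan_splits` and `max_bytes_per_subtask` are hypothetical names, not Druid settings.

```python
def plan_splits(inputs, max_bytes_per_subtask):
    """Greedily group (name, size_in_bytes) inputs into per-subtask splits.

    An input larger than the limit still gets a split of its own.
    """
    splits = []
    current, current_bytes = [], 0
    for name, size in inputs:
        # Start a new split if adding this input would exceed the hint.
        if current and current_bytes + size > max_bytes_per_subtask:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


inputs = [("a", 40), ("b", 30), ("c", 50), ("d", 10), ("e", 120)]
splits = plan_splits(inputs, max_bytes_per_subtask=80)
```

A smaller limit yields more, lighter subtasks and so more parallelism; a larger limit yields fewer subtasks that each read more data.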



