Apache Druid Improves Compaction
Written by Kay Ewbank
Tuesday, 04 February 2020
Apache Druid, a high-performance real-time analytics database designed for workflows where fast queries and fast ingestion really matter, has been updated with improvements including better compaction and batch ingestion.
Currently an incubator project at Apache, Druid is:
designed to excel at instant data visibility, ad-hoc queries, operational analytics, and handling high concurrency, and provides an open source alternative to data warehouses.
It was originally developed at a startup called Metamarkets to power an all-in-one analytics solution for programmatic digital advertising. Ad-tech generates hundreds of billions or even trillions of new records per day, and Druid was developed to cope with this volume of data. It has since been extended to situations that aren’t adequately addressed by classic analytics stacks, and is now applied to network flow analytics, product analytics, and user behavior. It is used by major companies including NTT, WalkMe, Pinterest, Netflix, Airbnb, Lyft, and Walmart.
Druid can natively stream data from message buses such as Kafka and Amazon Kinesis, and batch load files from data lakes such as HDFS and Amazon S3. Along with support for column-oriented storage, Druid also incorporates designs from search systems and timeseries databases.
The developers say Druid is better than traditional data warehouses because it has much lower latency for OLAP-style queries and for data ingest (both streaming and batch). Its support for time-based partitioning means time-based queries can be run efficiently, and its fast search and filtering make slicing and dicing data quick. This makes it a good fit for real-time analytics and for cases where the end-user (technical or not) wants to run numerous queries in rapid succession to explore or better understand data trends.
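To illustrate why time-based partitioning helps time-based queries, here is a minimal Python sketch. It is not Druid code: the `TimeChunk` class and the `chunks_for_query` function are hypothetical names invented for this example. The idea shown is simply that when data is stored in chunks that each cover a known time interval, a query over a time range only needs to touch the chunks that overlap it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimeChunk:
    """Hypothetical stand-in for a block of data covering [start, end)."""
    start: datetime  # inclusive
    end: datetime    # exclusive
    rows: int


def chunks_for_query(chunks, query_start, query_end):
    """Keep only chunks whose interval overlaps [query_start, query_end)."""
    return [c for c in chunks if c.start < query_end and query_start < c.end]


# One week of data, stored as one chunk per day.
chunks = [
    TimeChunk(datetime(2020, 2, d), datetime(2020, 2, d + 1), rows=1_000_000)
    for d in range(1, 8)
]

# A query over 3 and 4 February only has to read two of the seven chunks.
selected = chunks_for_query(chunks, datetime(2020, 2, 3), datetime(2020, 2, 5))
rows_scanned = sum(c.rows for c in selected)
```

In this toy case the query scans two million rows instead of seven million; the saving grows with the amount of history stored.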
The latest release includes an update to the native batch ingestion system. The internal framework now supports non-text binary formats, with initial support for ORC and Parquet. Single-dimension range partitioning has also been added for parallel native batch ingestion, so rows can now be split into partitions according to ranges of values of one chosen dimension.
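As a rough illustration of what range partitioning on a single dimension means, the following Python sketch assigns rows to partitions by comparing one field against a sorted list of boundary values. This shows the general technique only, not Druid's implementation or its configuration syntax; the function name `range_partition`, the `country` dimension, and the boundary values are all assumptions made up for the example.

```python
from bisect import bisect_right


def range_partition(rows, dimension, boundaries):
    """Split rows into len(boundaries) + 1 partitions by one dimension.

    `boundaries` must be sorted. Partition i holds the rows whose value
    for `dimension` falls between boundaries[i - 1] and boundaries[i].
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # bisect_right finds which range the row's value belongs to.
        idx = bisect_right(boundaries, row[dimension])
        partitions[idx].append(row)
    return partitions


# Hypothetical input rows, partitioned on the "country" dimension.
rows = [
    {"country": "AU", "clicks": 3},
    {"country": "DE", "clicks": 7},
    {"country": "US", "clicks": 5},
    {"country": "FR", "clicks": 2},
]
parts = range_partition(rows, "country", boundaries=["D", "M"])
```

Here rows with a country below "D" land in the first partition, those from "D" up to "M" in the second, and the rest in the third. Because rows with similar values for the chosen dimension end up together, a query that filters on that dimension can skip partitions whose range cannot match.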
Compaction improvements start with support for parallel index task split hints, meaning operators can provide hints to control the amount of data that each first-phase subtask reads. Parallel and stateful auto-compaction support has been added, and the Druid broker can now opportunistically merge query results in parallel using multiple threads.