Apache Druid Improves Compaction
Written by Kay Ewbank   
Tuesday, 04 February 2020

Apache Druid, a high-performance real-time analytics database designed for workflows where fast queries and fast ingestion matter, has been updated with improvements including better compaction and batch ingestion.

Currently an incubator project at Apache, Druid is:

designed to excel at instant data visibility, ad-hoc queries, operational analytics, and handling high concurrency, and provides an open source alternative to data warehouses.

It was originally developed at a startup called Metamarkets to power an all-in-one analytics solution for programmatic digital advertising. Ad-tech is an area that generates hundreds of billions or even trillions of new records per day, and Druid was developed to cope with this level of data. It has since been extended for situations that aren’t adequately addressed by classic analytics stacks. Druid is now used in areas including network flow analytics, product analytics, and user behavior. It is used by major companies including NTT, WalkMe, Pinterest, Netflix, Airbnb, Lyft, and Walmart.


Druid can natively stream data from message buses such as Kafka and Amazon Kinesis, and batch load files from data lakes such as HDFS and Amazon S3. Along with support for column-oriented storage, Druid also incorporates designs from search systems and timeseries databases.

The developers say Druid is better than traditional data warehouses because it has much lower latency for OLAP-style queries and for data ingest (both streaming and batch). Its support for time-based partitioning means time-based queries can be run efficiently, and its fast search and filtering support quick slice-and-dice analysis. This makes it a good fit for real-time analytics and for cases where the end user (technical or not) wants to run numerous queries in rapid succession to explore or better understand data trends.
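To see why time-based partitioning helps time-based queries, here is a minimal Python sketch. It is an illustration of the general idea only, not Druid's implementation: the hourly chunk size and the function names `hour_bucket`, `build_chunks` and `query_sum` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour (illustrative chunk size)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def build_chunks(rows):
    """Group (timestamp, value) rows into hourly time chunks."""
    chunks = defaultdict(list)
    for ts, value in rows:
        chunks[hour_bucket(ts)].append((ts, value))
    return chunks


def query_sum(chunks, start: datetime, end: datetime):
    """Sum values in [start, end), scanning only chunks that overlap it.

    Returns (total, number_of_chunks_scanned).
    """
    total = 0
    scanned = 0
    for chunk_start, chunk_rows in chunks.items():
        chunk_end = chunk_start + timedelta(hours=1)
        if chunk_end <= start or chunk_start >= end:
            continue  # chunk lies entirely outside the interval: skip it
        scanned += 1
        total += sum(v for ts, v in chunk_rows if start <= ts < end)
    return total, scanned
```

A query over a narrow time window only touches the chunks that overlap that window, so the work done depends on the size of the interval rather than on the size of the whole data set.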

The latest release includes an update to the native batch ingestion system. The internal framework now supports non-text binary formats, with initial support for ORC and Parquet. Single-dimension range partitioning has also been added for parallel native batch ingestion, so data can now be split into partitions that each cover a range of values of one chosen dimension.
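The idea behind range partitioning on a single dimension can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Druid's code: the helper names `range_boundaries` and `partition_rows`, the `target_rows_per_partition` parameter and the sample `country` dimension are all hypothetical.

```python
import bisect


def range_boundaries(values, target_rows_per_partition):
    """Pick split points on one dimension so that each range holds
    roughly the target number of rows."""
    ordered = sorted(values)
    boundaries = []
    for i in range(target_rows_per_partition, len(ordered), target_rows_per_partition):
        candidate = ordered[i]
        # Skip duplicate split points so equal values share a partition.
        if not boundaries or candidate != boundaries[-1]:
            boundaries.append(candidate)
    return boundaries


def partition_rows(rows, dimension, target_rows_per_partition):
    """Assign each row (a dict) to the range its dimension value falls in."""
    boundaries = range_boundaries(
        [row[dimension] for row in rows], target_rows_per_partition
    )
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # Partition k holds values v with boundaries[k-1] <= v < boundaries[k].
        index = bisect.bisect_right(boundaries, row[dimension])
        partitions[index].append(row)
    return boundaries, partitions
```

Because each partition covers a contiguous range of values, a query that filters on that dimension only needs to look at the partitions whose range can contain the requested value.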

Compaction improvements start with support for parallel index task split hints, meaning operators can provide hints to control the amount of data that each first-phase subtask reads. Parallel and stateful auto-compaction support has been added, and the Druid broker can now opportunistically merge query results in parallel using multiple threads.
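As a rough picture of what a split hint controls, the following Python sketch groups inputs into splits so that each subtask reads no more than a given number of bytes where possible. It is an illustration only: the function `plan_splits`, the `max_bytes_per_subtask` parameter and the greedy grouping strategy are assumptions for the example, not Druid's actual algorithm or configuration names.

```python
def plan_splits(input_sizes, max_bytes_per_subtask):
    """Greedily group (name, size_in_bytes) inputs into splits.

    Each split stays within the byte limit where possible; an input
    larger than the limit is placed in a split of its own.
    """
    splits = []
    current = []
    current_bytes = 0
    for name, size in input_sizes:
        # Close the current split if adding this input would exceed the limit.
        if current and current_bytes + size > max_bytes_per_subtask:
            splits.append(current)
            current = []
            current_bytes = 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits
```

A smaller limit produces more, lighter subtasks that can run in parallel, while a larger limit produces fewer subtasks that each read more data.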


More Information

Druid Home Page

Related Articles

Kafka 2 Adds Support For ACLs

Kafka Graphs Framework Extends Kafka Streams

Amazon Introduces Kinesis Analytics

Cloudera Extends Apache HBase To Use Amazon S3

Hadoop 3 Adds HDFS Erasure Coding

Amazon Redshift Updates

Last Updated ( Tuesday, 04 February 2020 )