Apache Hudi Achieves Top Level Status
Written by Kay Ewbank   
Monday, 29 June 2020

Apache Hudi has been adopted as a top-level project. The open source data lake technology for stream processing on top of Apache Hadoop is already being used at organizations including Alibaba, Tencent, and Uber, and is supported as part of Amazon EMR by Amazon Web Services.

The name Hudi stands for Hadoop Upserts Deletes and Incrementals, describing what the data lake technology can do. Upserts are operations that insert rows into a database table if they do not already exist, or update them if they do. Hudi enables stream processing on top of Apache Hadoop compatible cloud stores & distributed file systems. The project was originally developed at Uber in 2016 and was made open source then submitted to the Apache Incubator in January 2019.


Apache Hudi can be used used to manage petabyte-scale data lakes.  Hudi data lakes provide fresh data while being an order of magnitude efficient over traditional batch processing.

Hudi provides upsert and delete support with fast, pluggable indexing, along with transactionally compliant commit and rollback. It supports Apache Hive, Apache Spark, Apache Impala and Presto query engines, and has a built-in data ingestion tool that supports Apache Kafka, Apache Sqoop and other common data sources. Users can optimize query performance by managing file sizes and storage layout.

Hudi supports three types of queries - snapshot, incremental and read optimized. Hudi snapshot queries give a view of real-time data using a combination of columnar and row-based storage such as Parquet and Avro. It's incremental queries provide a change stream with records inserted or updated after a point in time, while the read optimized queries are essentially snapshot queries offering faster performance on purely columnar storage such as Parquet.

According to Uber, Hudi is conceptually divided into three main components: the raw data that needs to be stored, the data indexes that are used to provide upsert capability, and the metadata used to manage the dataset. Hudi maintains a timeline of all actions performed on the table at different points in time, referred to as instants in Hudi. This means users can get an instantaneous views of the table, while also efficiently supporting retrieval of data in the order of arrival. Hudi guarantees that the actions performed on the timeline are atomic and consistent based on the time at which the change was made in the database. With this information, Hudi provides different views of the same Hudi table, including a read-optimized view for fast columnar performance, a real-time view for fast data ingestion, and an incremental view to read Hudi tables as a stream of changelogs.



More Information

Hudi Website

Related Articles

SQL Server 2019 Includes Hadoop And Spark

Hortonworks Plans To Take Hadoop Cloud Native

Hadoop 3 Adds HDFS Erasure Coding

Hadoop 2.9 Adds Resource Estimator

Hadoop Adds In-Memory Caching

Hadoop SQL Query Engine Launched

Hadoop 2 Introduces YARN 

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.


Quantum Computing Prize Awarded

John Preskill, Professor of Theoretical Physics at the California Institute of Technology, is the eighth recipient of the John Stewart Bell Prize for Research on Fundamental Issues in Quantu [ ... ]

Falco On Track To Version 1.0.0

Falco is a cloud native runtime security tool for the Linux operating system, designed to detect abnormal behavior and warn of potential security threats in real-time. Now it's about to release its fi [ ... ]

More News

raspberry pi books



or email your comment to: comments@i-programmer.info