Apache Hudi Achieves Top Level Status

Written by Kay Ewbank

Monday, 29 June 2020

Apache Hudi has been adopted as a top-level project. The open source data lake technology for stream processing on top of Apache Hadoop is already being used at organizations including Alibaba, Tencent, and Uber, and is supported as part of Amazon EMR by Amazon Web Services.

The name Hudi stands for Hadoop Upserts Deletes and Incrementals, describing what the data lake technology can do. Upserts are operations that insert rows into a database table if they do not already exist, or update them if they do. Hudi enables stream processing on top of Apache Hadoop compatible cloud stores & distributed file systems. The project was originally developed at Uber in 2016 and was made open source then submitted to the Apache Incubator in January 2019.

hudi

Apache Hudi can be used used to manage petabyte-scale data lakes. Hudi data lakes provide fresh data while being an order of magnitude efficient over traditional batch processing.

Hudi provides upsert and delete support with fast, pluggable indexing, along with transactionally compliant commit and rollback. It supports Apache Hive, Apache Spark, Apache Impala and Presto query engines, and has a built-in data ingestion tool that supports Apache Kafka, Apache Sqoop and other common data sources. Users can optimize query performance by managing file sizes and storage layout.

Hudi supports three types of queries - snapshot, incremental and read optimized. Hudi snapshot queries give a view of real-time data using a combination of columnar and row-based storage such as Parquet and Avro. It's incremental queries provide a change stream with records inserted or updated after a point in time, while the read optimized queries are essentially snapshot queries offering faster performance on purely columnar storage such as Parquet.

According to Uber, Hudi is conceptually divided into three main components: the raw data that needs to be stored, the data indexes that are used to provide upsert capability, and the metadata used to manage the dataset. Hudi maintains a timeline of all actions performed on the table at different points in time, referred to as instants in Hudi. This means users can get an instantaneous views of the table, while also efficiently supporting retrieval of data in the order of arrival. Hudi guarantees that the actions performed on the timeline are atomic and consistent based on the time at which the change was made in the database. With this information, Hudi provides different views of the same Hudi table, including a read-optimized view for fast columnar performance, a real-time view for fast data ingestion, and an incremental view to read Hudi tables as a stream of changelogs.

hudi

More Information

Hudi Website

SQL Server 2019 Includes Hadoop And Spark

Hortonworks Plans To Take Hadoop Cloud Native

Hadoop 3 Adds HDFS Erasure Coding

Hadoop 2.9 Adds Resource Estimator

Hadoop Adds In-Memory Caching

Hadoop SQL Query Engine Launched

Hadoop 2 Introduces YARN

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

Amazon Q Developer Adds Faster Agentic Coding
28/04/2025

Amazon has improved the CLI agent within the Amazon Q command line interface (CLI) to provide a faster more interactive coding experience. Amazon Q Developer can now use the information in its CLI env [ ... ]

+ Full Story

Harvard RoboBee Gets New Knees
25/04/2025

The Harvard RoboBee can now make better landings thanks to new legs based on those of a crane-fly. The researchers who developed the robot say it now no longer needs to crash land, and can inste [ ... ]

+ Full Story

More News

Comments

or email your comment to: comments@i-programmer.info

More Information

Related Articles

Comments