Apache Hudi Achieves Top Level Status
Written by Kay Ewbank
Monday, 29 June 2020
Apache Hudi has been made a top-level project of the Apache Software Foundation. The open source data lake technology for stream processing on top of Apache Hadoop is already being used at organizations including Alibaba, Tencent, and Uber, and is supported as part of Amazon EMR by Amazon Web Services.
The name Hudi stands for Hadoop Upserts Deletes and Incrementals, describing what the data lake technology can do. Upserts are operations that insert rows into a database table if they do not already exist, or update them if they do. Hudi enables stream processing on top of Apache Hadoop-compatible cloud stores and distributed file systems. The project was originally developed at Uber in 2016, was later made open source, and was submitted to the Apache Incubator in January 2019.
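The upsert behavior described above can be illustrated with a minimal Python sketch. This is a toy model of the semantics only, not Hudi's API: the `upsert` function, the in-memory `table` dictionary, and the `id` and `fare` fields are all hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]


def upsert(table: Dict[object, Row], rows: Iterable[Row], key: str = "id") -> Tuple[int, int]:
    """Insert rows whose key is new; update rows whose key already exists.

    Returns a tuple of (inserted, updated) counts.
    """
    inserted = updated = 0
    for row in rows:
        k = row[key]
        if k in table:
            # Key already present: overlay the new values on the old row.
            table[k] = {**table[k], **row}
            updated += 1
        else:
            # Key not present: store a copy of the row as a new record.
            table[k] = dict(row)
            inserted += 1
    return inserted, updated


# Hypothetical example data: the second batch updates "b" and inserts "c".
table: Dict[object, Row] = {}
first = upsert(table, [{"id": "a", "fare": 10}, {"id": "b", "fare": 20}])
second = upsert(table, [{"id": "b", "fare": 25}, {"id": "c", "fare": 30}])
```

In this sketch the first batch counts as two inserts, while the second batch counts as one update and one insert, because the key `b` is already in the table when the second batch arrives.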
Apache Hudi can be used to manage petabyte-scale data lakes. Hudi data lakes provide fresh data while being an order of magnitude more efficient than traditional batch processing.
Hudi provides upsert and delete support with fast, pluggable indexing, along with transactionally compliant commit and rollback. It supports Apache Hive, Apache Spark, Apache Impala and Presto query engines, and has a built-in data ingestion tool that supports Apache Kafka, Apache Sqoop and other common data sources. Users can optimize query performance by managing file sizes and storage layout.
Hudi supports three types of queries: snapshot, incremental and read optimized. Hudi snapshot queries give a view of real-time data using a combination of columnar and row-based storage such as Parquet and Avro. Its incremental queries provide a change stream with records inserted or updated after a point in time, while the read optimized queries are essentially snapshot queries offering faster performance on purely columnar storage such as Parquet.
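To make the difference between the three query types concrete, here is a small Python sketch. It is a conceptual model under stated assumptions, not Hudi's implementation or API: the `Record` class, the `base_files` and `log_files` lists (standing in for columnar and row-based storage), and the three query functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    key: str
    value: int
    commit_time: int  # point in time at which this version was written


# Assumed layout: columnar base data plus a row-based log of newer writes.
base_files: List[Record] = [Record("a", 1, 100), Record("b", 2, 100)]
log_files: List[Record] = [Record("b", 20, 200), Record("c", 3, 300)]


def read_optimized_query() -> Dict[str, int]:
    """Read only the columnar base data; in this sketch it skips newer log entries."""
    return {r.key: r.value for r in base_files}


def snapshot_query() -> Dict[str, int]:
    """Merge base data with the log so the latest version of each key is returned."""
    latest: Dict[str, Record] = {}
    for r in base_files + log_files:
        if r.key not in latest or r.commit_time >= latest[r.key].commit_time:
            latest[r.key] = r
    return {k: r.value for k, r in latest.items()}


def incremental_query(begin_time: int) -> List[Record]:
    """Return records inserted or updated after begin_time, ordered by commit time."""
    changed = [r for r in base_files + log_files if r.commit_time > begin_time]
    return sorted(changed, key=lambda r: r.commit_time)
```

In this toy model the read optimized query sees only the two base records, the snapshot query also sees the updated `b` and the new `c`, and `incremental_query(100)` returns just those two later changes.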
According to Uber, Hudi is conceptually divided into three main components: the raw data that needs to be stored, the data indexes that are used to provide upsert capability, and the metadata used to manage the dataset. Hudi maintains a timeline of all actions performed on the table at different points in time, referred to as instants in Hudi. This means users can get an instantaneous view of the table, and Hudi can also efficiently support retrieval of data in the order of arrival. Hudi guarantees that the actions performed on the timeline are atomic and consistent based on the time at which the change was made in the database. With this information, Hudi provides different views of the same Hudi table, including a read-optimized view for fast columnar performance, a real-time view for fast data ingestion, and an incremental view to read Hudi tables as a stream of changelogs.
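The timeline idea can be sketched in a few lines of Python. This is a simplified illustration, not Hudi's actual metadata format: the `Instant` and `Timeline` classes, the `inflight` and `completed` state labels, and the method names are assumptions made for the example. The point it shows is that readers only ever see actions that have fully completed, which is one way to model atomic commits and rollback.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instant:
    time: int                 # when the action was started
    writes: Dict[str, int]    # key -> value written by this action
    state: str = "inflight"   # visible to readers only once "completed"


@dataclass
class Timeline:
    instants: List[Instant] = field(default_factory=list)

    def begin(self, time: int, writes: Dict[str, int]) -> Instant:
        """Record a new action on the timeline; it is not yet visible to readers."""
        instant = Instant(time, dict(writes))
        self.instants.append(instant)
        return instant

    def complete(self, instant: Instant) -> None:
        """Publish the action: all of its writes become visible together."""
        instant.state = "completed"

    def rollback(self, instant: Instant) -> None:
        """Discard an unfinished action so none of its writes are ever seen."""
        self.instants.remove(instant)

    def view_as_of(self, time: int) -> Dict[str, int]:
        """Build the table as it looked at the given time from completed instants."""
        table: Dict[str, int] = {}
        for inst in sorted(self.instants, key=lambda i: i.time):
            if inst.state == "completed" and inst.time <= time:
                table.update(inst.writes)
        return table
```

A caller would `begin` an action, then either `complete` or `rollback` it; `view_as_of` replays only completed instants in time order, so a half-finished write never leaks into a reader's view of the table.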