Apache Iceberg Improves Spark Support

Written by Kay Ewbank

Thursday, 25 August 2022

Apache Iceberg 0.14 has been released with improvements to support for Spark and a common REST catalog client that uses change-based commits to resolve commit conflicts on the server side.

Iceberg is a high-performance format for huge analytic tables of titanic proportions that was originally developed by Netflix. Iceberg became an open source Apache Incubator project in 2018, and graduated to a top-level project in 2020.

Netflix created Iceberg to provide a way of working with very large datasets and ensure the format would work as reliably and predictably as SQL. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

SQL commands can be used in Iceberg to merge new data, update existing rows, and perform targeted deletes. Iceberg can eagerly rewrite data files for read performance, or it can use delete deltas for faster updates.
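The trade-off between eager rewrites and delete deltas can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyDeleteTable` and its methods are hypothetical names invented for this example, a Python list stands in for a data file, and none of it is Iceberg's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeleteTable:
    """Toy model only: one in-memory 'data file' plus a set of deleted positions."""
    data_file: list
    deleted_positions: set = field(default_factory=set)  # stands in for a delete delta

    def read(self):
        # Readers skip any row position recorded in the delete delta.
        return [row for pos, row in enumerate(self.data_file)
                if pos not in self.deleted_positions]

    def delete_eager_rewrite(self, predicate):
        # Eager strategy: rewrite the data file without the matching rows.
        # Slower to write, but later reads need no extra filtering.
        self.data_file = [row for row in self.read() if not predicate(row)]
        self.deleted_positions.clear()

    def delete_with_delta(self, predicate):
        # Delta strategy: leave the data file untouched and record positions.
        # Faster to write, but every read pays for the filtering.
        for pos, row in enumerate(self.data_file):
            if predicate(row):
                self.deleted_positions.add(pos)


rows = [{"id": 1}, {"id": 2}, {"id": 3}]
eager = ToyDeleteTable(list(rows))
delta = ToyDeleteTable(list(rows))
eager.delete_eager_rewrite(lambda r: r["id"] == 2)
delta.delete_with_delta(lambda r: r["id"] == 2)
```

Both tables return the same rows when read; the difference is that the eager table now holds a smaller rewritten file, while the delta table still holds all three rows plus a record of which position to skip.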

Schema evolution just works: columns can be renamed and reordered, and actions such as adding a column won't bring back "zombie" data. Iceberg also supports "hidden partitioning". In other words, it handles the task of producing partition values for rows in a table and skips unnecessary partitions and files automatically. No extra filters are needed for fast queries, and table layout can be updated as data or queries change.
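As a rough illustration of the idea behind hidden partitioning, the sketch below derives a partition value from each row's timestamp and uses a plain timestamp filter to skip partitions. It is a toy model, not Iceberg code: the function names, the `event_ts` column, and the dictionary of "files" are all assumptions made for this example.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def day_transform(ts: datetime) -> int:
    """Derive a partition value from a timestamp: whole days since 1970-01-01."""
    return (ts.date() - EPOCH).days


def write_rows(rows):
    """Group rows into toy 'files' keyed by the derived partition value.
    The writer never has to supply a separate partition column."""
    files = {}
    for row in rows:
        files.setdefault(day_transform(row["event_ts"]), []).append(row)
    return files


def scan(files, start: datetime, end: datetime):
    """Return rows with start <= event_ts < end, plus the partitions actually read.
    The filter is on event_ts only; partitions outside the range are skipped."""
    lo, hi = day_transform(start), day_transform(end)
    scanned = sorted(day for day in files if lo <= day <= hi)
    rows = [r for day in scanned for r in files[day]
            if start <= r["event_ts"] < end]
    return rows, scanned


events = [
    {"id": 1, "event_ts": datetime(2022, 8, 24, 9, 0)},
    {"id": 2, "event_ts": datetime(2022, 8, 25, 12, 0)},
    {"id": 3, "event_ts": datetime(2022, 8, 26, 18, 30)},
]
files = write_rows(events)
matched, scanned = scan(files, datetime(2022, 8, 25, 0, 0), datetime(2022, 8, 25, 23, 59))
```

Here the query mentions only `event_ts`, yet just one of the three toy partitions is read, which is the behavior the paragraph above describes.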

Time-travel enables reproducible queries that use exactly the same table snapshot, or lets users easily examine changes. Version rollback allows users to quickly correct problems by resetting tables to a good state.
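Time travel and rollback both follow from keeping immutable snapshots and a pointer to the current one. The following is a minimal sketch of that idea, assuming a hypothetical `ToySnapshotTable` class that stores whole table contents per snapshot; a real table format tracks files and metadata rather than rows in memory.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    rows: tuple  # immutable contents as of this snapshot (toy simplification)


@dataclass
class ToySnapshotTable:
    history: list = field(default_factory=list)
    current_id: Optional[int] = None

    def commit(self, rows):
        # Every write adds a new snapshot; older snapshots are never modified.
        snap = Snapshot(len(self.history) + 1, tuple(rows))
        self.history.append(snap)
        self.current_id = snap.snapshot_id
        return snap.snapshot_id

    def read(self, snapshot_id=None):
        # Time travel: read the current snapshot or any earlier one by id.
        wanted = self.current_id if snapshot_id is None else snapshot_id
        for snap in self.history:
            if snap.snapshot_id == wanted:
                return list(snap.rows)
        raise KeyError(f"unknown snapshot {wanted}")

    def rollback_to(self, snapshot_id):
        # Rollback only moves the current pointer; history is kept.
        self.read(snapshot_id)  # raises KeyError if the snapshot does not exist
        self.current_id = snapshot_id


table = ToySnapshotTable()
good = table.commit(["a", "b"])
bad = table.commit(["a", "b", "corrupt"])
before_rollback = table.read()
as_of_good = table.read(good)
table.rollback_to(good)
after_rollback = table.read()
```

Reading with an explicit snapshot id always returns the same rows, which is what makes a query reproducible, and the rollback restores the good state without deleting the bad snapshot from history.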

The latest version brings several performance improvements for scan planning and Spark queries. Specifically, Parquet vectorized reads are enabled by default, and ScanBuilder now has SupportsReportStatistics. Spark tables have also been updated to avoid expensive (and inaccurate) size estimations. Support has been added for Spark 3.3, including AS OF syntax for SQL time travel queries. There's also merge-on-read support for MERGE and UPDATE queries in Spark 3.2 or later.
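To give a feel for what these two Spark features look like in practice, the sketch below builds the SQL text as plain Python strings so it can be read without a Spark session. The table name `demo.db.events`, the snapshot id, and the helper function names are hypothetical; the `write.update.mode` and `write.merge.mode` property names are taken from Iceberg's table write properties as best understood here, and should be checked against the documentation for the version in use.

```python
def time_travel_sql(table, *, version=None, timestamp=None):
    """Build a Spark 3.3 style time travel query using AS OF syntax."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        # VERSION AS OF takes a snapshot id.
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    # TIMESTAMP AS OF takes a timestamp literal.
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def merge_on_read_sql(table):
    """Build a statement that asks for merge-on-read for UPDATE and MERGE.
    Property names are an assumption to verify against the Iceberg docs."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'write.update.mode'='merge-on-read', "
        "'write.merge.mode'='merge-on-read')"
    )


by_version = time_travel_sql("demo.db.events", version=1)
by_time = time_travel_sql("demo.db.events", timestamp="2022-08-25 00:00:00")
props = merge_on_read_sql("demo.db.events")
```

With a Spark session configured for Iceberg, each string could be passed to `spark.sql(...)`; building them separately just keeps the example runnable on its own.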

Other improvements include a new common REST catalog client that uses change-based commits to resolve commit conflicts on the service side, and new interfaces for consuming data incrementally (both append and changelog scans). The developers have also added a spec and implementation for Puffin, which is a format for large stats and index blobs, like Theta sketches or bloom filters.

Iceberg 0.14 is available now.

More Information

Apache Iceberg

Last Updated ( Thursday, 25 August 2022 )
