Apache Arrow 5 Improves Asynchronous Scanner

Written by Kay Ewbank

Monday, 16 August 2021

Apache Arrow 5 has been released, alongside Apache Arrow Rust 5. Both versions have a number of improvements, including a better asynchronous scanner for the Dataset layer. This is the first release where the Rust projects have moved to separate repositories outside the main Arrow monorepo.

Apache Arrow is a development platform for in-memory analytics. It has technologies that enable big data systems to process and move data fast. It is language independent, can be used for flat and hierarchical data, and the data store is organized for efficient analytic operations. It also provides computational libraries. Languages currently supported are C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

The improvements to the Dataset layer start with the asynchronous scanner introduced in Arrow 4. This has been improved with truly asynchronous readers implemented for CSV, Parquet, and IPC file formats and file-level parallelism added.
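To give a feel for what file-level parallelism with asynchronous readers means, here is a minimal sketch in plain Python using asyncio. It is not Arrow's implementation or API: the FILES dictionary, read_csv_file, scan and scan_dataset are hypothetical names invented for illustration, and the "files" are in-memory strings with a simulated I/O delay.

```python
import asyncio
import csv
import io

# Hypothetical in-memory "files"; a real scanner would read from
# local disk or remote storage instead.
FILES = {
    "part-0.csv": "id,value\n1,10\n2,20\n",
    "part-1.csv": "id,value\n3,30\n",
    "part-2.csv": "id,value\n4,40\n5,50\n",
}


async def read_csv_file(name):
    """Read one file without blocking the other readers."""
    # Simulated I/O latency: while this reader waits, the event loop
    # is free to make progress on the other files.
    await asyncio.sleep(0.01)
    return list(csv.DictReader(io.StringIO(FILES[name])))


async def scan(names):
    """Start one reader per file and wait for all of them."""
    # gather() runs the readers concurrently and returns the results
    # in the same order as the inputs.
    batches = await asyncio.gather(*(read_csv_file(n) for n in names))
    return [row for batch in batches for row in batch]


def scan_dataset(names):
    """Synchronous entry point that drives the asynchronous scan."""
    return asyncio.run(scan(names))


rows = scan_dataset(sorted(FILES))
print(len(rows), "rows read from", len(FILES), "files")
```

The point of the sketch is that the waits overlap: the three readers are all in flight at once, so the total time is close to that of the slowest file rather than the sum of all three.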

The compute layer has lots of new scalar functions, including 30 new scalar arithmetic and math functions, a collection of scalar bitwise functions, 21 scalar string functions, 16 scalar temporal functions and a group of 'other' scalar functions such as case_when, coalesce, if_else, and make_struct.
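The 'other' functions are easier to picture with a small example. The sketch below models their row-by-row behavior on plain Python lists, with None standing in for a null value. It is an illustration of the semantics only, not the Arrow compute API: the signatures are simplified assumptions, and the real functions operate on Arrow arrays and scalars.

```python
def coalesce(*columns):
    """Per row, return the first value that is not None (None if all are)."""
    return [next((v for v in row if v is not None), None)
            for row in zip(*columns)]


def if_else(cond, left, right):
    """Per row, take left where cond is True and right where it is False.

    A None (null) condition produces a None result.
    """
    return [None if c is None else (l if c else r)
            for c, l, r in zip(cond, left, right)]


def case_when(conds, cases, default=None):
    """Per row, take the value from the first case whose condition is True.

    conds and cases are parallel lists of columns. Rows where no
    condition is True get the default (an assumption of this sketch).
    """
    out = []
    for i in range(len(cases[0])):
        value = default
        for cond, case in zip(conds, cases):
            if cond[i] is True:
                value = case[i]
                break
        out.append(value)
    return out


def make_struct(field_names, *columns):
    """Combine parallel columns into one column of records."""
    return [dict(zip(field_names, row)) for row in zip(*columns)]


scores = [95, 72, 40]
filled = coalesce([1, None, None], [10, 20, None])
labels = if_else([True, False, None], ["y"] * 3, ["n"] * 3)
grades = case_when(
    [[s >= 90 for s in scores], [s >= 60 for s in scores]],
    [["A"] * 3, ["B"] * 3],
    default="F",
)
records = make_struct(["score", "grade"], scores, grades)
print(filled, labels, grades, records, sep="\n")
```

All four are scalar functions in the sense that each output row depends only on the matching input rows, which is what lets a compute engine evaluate them batch by batch.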

The Flight support has been improved in Arrow's Go implementation, and now supports custom metadata and middleware.

Java improvements include improved support for extension types using a complex storage type, e.g. struct, map or union.

Python support has been extended, with the ability to scan files asynchronously in Datasets. The developers say this should provide better performance in environments where I/O can be slow, such as with remote sources.

The developers working on the R support in Arrow say they've more than doubled the number of functions you can call on Arrow Datasets inside dplyr::filter(), mutate(), and arrange(), including many more string, datetime, and math functions. The support for the Arrow C interface has been deepened. This allows integration with other projects, such as DuckDB.

Apache Arrow 5 is available now.

More Information

Apache Arrow Website

Arrow On GitHub

Apache Arrow 4 Adds New C++ Compute Functions

Apache Arrow Improves C++ Support

Apache Arrow 2 Improves C++ and Rust Support

Apache Arrow Reaches 1.0

Apache Arrow Flight Released

Apache Arrow Adds DataFusion Rust-Native Engine

Apache Arrow Adds Streaming Binary Format

Last Updated ( Monday, 16 August 2021 )
