Apache Arrow 4 Adds New C++ Compute Functions

Written by Kay Ewbank

Tuesday, 18 May 2021

Apache Arrow has been updated to version 4.0. It has extra C++ compute functions for numeric and string data, and improves the performance of Arrow Datasets.

Apache Arrow is a development platform for in-memory analytics. It has technologies that enable big data systems to process and move data fast. It is language independent, can be used for flat and hierarchical data, and the data store is organized for efficient analytic operations. It also provides computational libraries. Languages currently supported are C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

The C++ support has been improved in this release, with support for automatic implicit casting in compute kernels, and new compute functions for numeric data including quantile and power. Several new functions for string processing have been added, providing ways to trim characters, extract substrings captured by a regex pattern, and match strings against regex patterns. There are also new functions for computing UTF8 string lengths, and for replacing non-overlapping substrings that match a literal pattern or regular expression.
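
To make the new string functions more concrete, here is a plain-Python sketch of the element-wise behavior described above. It uses only the standard library rather than Arrow itself. The function names (`trim_chars`, `match_regex`, `extract_regex`, `utf8_length`, `replace_substring`) are hypothetical and chosen for illustration, and nulls are modeled as `None`; the actual Arrow kernels may differ in naming and details.

```python
import re
from typing import Dict, List, Optional

# A string column; None stands in for a null value (an assumption of this sketch).
Column = List[Optional[str]]


def trim_chars(col: Column, chars: str) -> Column:
    """Strip any of the given characters from both ends of each value."""
    return [None if v is None else v.strip(chars) for v in col]


def match_regex(col: Column, pattern: str) -> List[Optional[bool]]:
    """True where the pattern matches anywhere in the value."""
    rx = re.compile(pattern)
    return [None if v is None else rx.search(v) is not None for v in col]


def extract_regex(col: Column, pattern: str) -> List[Optional[Dict[str, str]]]:
    """Return the named capture groups of the first match, or None if no match."""
    rx = re.compile(pattern)
    out = []
    for v in col:
        m = rx.search(v) if v is not None else None
        out.append(m.groupdict() if m else None)
    return out


def utf8_length(col: Column) -> List[Optional[int]]:
    """Length in Unicode code points, not in encoded bytes."""
    return [None if v is None else len(v) for v in col]


def replace_substring(col: Column, pattern: str, replacement: str,
                      regex: bool = False) -> Column:
    """Replace non-overlapping matches of a literal or regex pattern."""
    if regex:
        rx = re.compile(pattern)
        return [None if v is None else rx.sub(replacement, v) for v in col]
    return [None if v is None else v.replace(pattern, replacement) for v in col]
```

For example, `extract_regex(["id=42"], r"(?P<key>\w+)=(?P<val>\d+)")` yields one dictionary per row holding the captured `key` and `val` substrings, and a row that does not match yields `None`.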

Improvements to the Python support include the ability to create a dataset from a Python iterator of record batches. The Dataset interface has also been improved for Python, and can now use custom projections using expressions when scanning.
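
The idea of scanning a stream of record batches with a custom projection can be sketched in plain Python. This is not the Arrow Dataset API: `Batch`, `generate_batches` and `scan` are hypothetical names, a batch is modeled as a dictionary of column lists, and each projection expression is modeled as a function of a batch and a row index.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Column name -> values; a stand-in for a record batch (assumption of this sketch).
Batch = Dict[str, List]


def generate_batches(n_batches: int, rows_per_batch: int) -> Iterator[Batch]:
    """Yield batches lazily, as a streaming source might."""
    for b in range(n_batches):
        start = b * rows_per_batch
        ids = list(range(start, start + rows_per_batch))
        yield {
            "id": ids,
            "price": [10.0 + i for i in ids],
            "qty": [i % 3 for i in ids],
        }


def scan(batches: Iterable[Batch],
         projection: Dict[str, Callable[[Batch, int], object]]) -> Iterator[Batch]:
    """Evaluate the projection expressions on each batch as it streams past."""
    for batch in batches:
        n_rows = len(next(iter(batch.values()))) if batch else 0
        yield {
            name: [expr(batch, i) for i in range(n_rows)]
            for name, expr in projection.items()
        }


# Keep "id" as-is and derive a new "total" column from two source columns.
projection = {
    "id": lambda b, i: b["id"][i],
    "total": lambda b, i: b["price"][i] * b["qty"][i],
}
result = list(scan(generate_batches(2, 2), projection))
```

Because both the source and the scan are generators, only one batch needs to be held in memory at a time, which is the main appeal of building a dataset from an iterator.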

Rust support has seen the most changes in this release, with new features and performance improvements. The developers say that they have concentrated largely on the necessary details to make it possible to release the Rust versions to cargo at a more regular rate. In addition, the Ballista distributed compute project has been officially included.

Rust support for Arrow includes JSON reader improvements and a new JSON writer, as well as improved schema inference for nested list and struct types. Rust support for Arrow DataFusion has better SQL support, including the ability to use Union, Having, Extract, Show Tables and Interval. You can use Group By with more data types, and user-defined functions can now provide specialized implementations for scalar values. There are also several new SQL metrics.

Performance improvements include constant folding, a partitioned hash join, and improved parallelism using a repartitioning pass. Hash aggregate performance is also better with large numbers of grouping values, and there's predicate pushdown support for table scans.
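
As a rough illustration of what a partitioned hash join does, the sketch below splits both inputs by a hash of the join key and then runs an ordinary build-and-probe hash join on each pair of partitions. It is a minimal plain-Python sketch, not DataFusion's implementation; the function names, the tuple row layout and the default of four partitions are all assumptions made for the example.

```python
from collections import defaultdict
from typing import List, Tuple

Row = Tuple  # a row is a plain tuple; the join key sits at a given index


def partition(rows: List[Row], key_idx: int, n_parts: int) -> List[List[Row]]:
    """Split rows into n_parts buckets by hashing the join key."""
    parts: List[List[Row]] = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key_idx]) % n_parts].append(row)
    return parts


def hash_join_partition(left: List[Row], right: List[Row],
                        left_key: int, right_key: int) -> List[Row]:
    """Build/probe inner hash join on one pair of partitions."""
    table = defaultdict(list)
    for row in left:                      # build phase
        table[row[left_key]].append(row)
    out = []
    for row in right:                     # probe phase
        for match in table.get(row[right_key], []):
            out.append(match + row)
    return out


def partitioned_hash_join(left: List[Row], right: List[Row],
                          left_key: int = 0, right_key: int = 0,
                          n_parts: int = 4) -> List[Row]:
    """Inner join: rows with equal keys always land in the same partition pair."""
    left_parts = partition(left, left_key, n_parts)
    right_parts = partition(right, right_key, n_parts)
    out: List[Row] = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(hash_join_partition(lp, rp, left_key, right_key))
    return out


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (3, "pen"), (1, "ink"), (4, "cup")]
joined = partitioned_hash_join(users, orders)
```

Each pair of partitions is joined independently here in a simple loop; in an engine, that independence is what would allow the pairs to be processed in parallel. The output order depends on how the keys hash, so callers should not rely on it.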

More Information

Apache Arrow Website

Arrow On GitHub

Related Articles

Apache Arrow Reaches 1.0

Apache Arrow Flight Released

Apache Arrow Adds DataFusion Rust-Native Engine

Apache Arrow Adds Streaming Binary Format

Databricks Delta Adds Faster Parquet Import

Apache Kudu 1.9 Adds Location Awareness

Apache Kudu Improves Web Interface




Last Updated ( Tuesday, 18 May 2021 )
