Apache Arrow 6 Improves Support For R and Rust
Monday, 22 November 2021

Apache Arrow 6 has been released with improvements to support for R and Rust as well as Arrow Flight. There's also new support for DataFusion.

Apache Arrow is a development platform for in-memory analytics. It has technologies that enable big data systems to process and move data fast. It is language independent, can be used for flat and hierarchical data, and organizes its data store for efficient analytic operations. It also provides computational libraries. Languages currently supported are C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.


The improvements in the new release start with the addition of bindings for Flight in GLib and Ruby. The team says that while SQL support for Flight hasn't made it into this release, work is ongoing. Arrow Flight SQL defines a protocol for clients to communicate with SQL databases using Arrow Flight.

In Arrow's compute layer, a basic in-memory query engine has been implemented and is accessible from the R bindings. The query engine supports operations including filter, project, sort, equality joins, and various aggregations. A wide range of functions have also been added in this version, and type support has been improved for most of the compute functions.
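To make the list of operations concrete, here is a minimal sketch in plain Python of a filter, project, aggregate and sort pipeline. It does not use Arrow or its API; the sample rows, column names and the `run_query` function are all hypothetical and only illustrate what each operation does to the data.

```python
from collections import defaultdict

# Hypothetical sample data; in Arrow this would be held as columns of a table.
rows = [
    {"city": "Oslo", "year": 2020, "sales": 10},
    {"city": "Oslo", "year": 2021, "sales": 15},
    {"city": "Rome", "year": 2021, "sales": 7},
    {"city": "Rome", "year": 2021, "sales": 5},
]

def run_query(rows):
    # Filter: keep only the rows for 2021.
    filtered = [r for r in rows if r["year"] == 2021]
    # Project: keep only the columns the later steps need.
    projected = [{"city": r["city"], "sales": r["sales"]} for r in filtered]
    # Aggregate: sum sales per city.
    totals = defaultdict(int)
    for r in projected:
        totals[r["city"]] += r["sales"]
    # Sort: order the groups by total, largest first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

result = run_query(rows)
print(result)
```

A real query engine works on columnar batches rather than row dictionaries, but the order of the steps is the same idea.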

The support for R has been enhanced with a number of major new features in this version, some of which the team has been building up to for several years. In practical terms, there's more dplyr support, including the ability to carry out grouped aggregation. You can now summarise() on Arrow data, with or without group_by(). This is supported both on in-memory Arrow tables and across partitioned datasets. Most common aggregation functions are supported. In addition to aggregation, Arrow now also supports all of dplyr’s mutating joins (inner, left, right, and full) and filtering joins (semi and anti).
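The difference between mutating and filtering joins can be sketched in plain Python. This is not the dplyr or Arrow implementation; the function names and sample data are hypothetical, and only inner, left, semi and anti joins are shown. A mutating join adds columns from the right-hand table, while a filtering join only decides which left-hand rows to keep.

```python
def inner_join(left, right, key):
    """Mutating join: matching rows only, with columns from both sides."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def left_join(left, right, key):
    """Mutating join: every left row; right-hand columns are None if unmatched."""
    right_cols = {c for r in right for c in r if c != key}
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out

def semi_join(left, right, key):
    """Filtering join: left rows that have a match; no columns are added."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] in keys]

def anti_join(left, right, key):
    """Filtering join: left rows that have no match."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] not in keys]

# Hypothetical sample tables.
orders = [{"id": 1, "cust": "a"}, {"id": 2, "cust": "b"}, {"id": 3, "cust": "c"}]
customers = [{"cust": "a", "name": "Ann"}, {"cust": "b", "name": "Bob"}]

print(inner_join(orders, customers, "cust"))
print(anti_join(orders, customers, "cust"))
```

Here the order for customer "c" appears in the left and anti join results but not in the inner or semi join results.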

The R team has also added support for DuckDB as a way to query Arrow Datasets. This means you can use duckdb’s dbplyr methods, as well as its SQL interface, to aggregate data.

Alongside the R improvements, there's new support for DataFusion. This is an embedded query engine that uses Rust and Apache Arrow to provide a system that the developers say is high performance, easy to connect, easy to embed, and high quality. This release includes a runtime operator metrics collection framework, and an object store abstraction for unified access to local or remote storage. The framework includes Hive-style table partitioning support for Parquet, CSV, Avro and JSON files, and DataFrame API support for except, intersect, show, limit and window functions. It also has extensive SQL support, and now passes TPC-H queries 8, 13 and 21.
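Hive-style partitioning encodes column values in directory names of the form key=value, so a reader can recover them from the file path without opening the file. The sketch below shows the idea in plain Python. It is not DataFusion's code; the `parse_hive_partitions` helper and the example path are hypothetical.

```python
from pathlib import PurePosixPath

def parse_hive_partitions(path):
    """Return the key=value directory segments of a path as a dict."""
    partitions = {}
    # Skip the last component, which is the file name.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions

# Hypothetical file layout for a partitioned table.
example = parse_hive_partitions("sales/year=2021/month=11/part-0.parquet")
print(example)
```

Because the values live in the path, a query that filters on a partition column can skip whole directories that cannot match.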

Apache Arrow 6 is available for download.


More Information

Apache Arrow Website

Arrow On GitHub


