Apache Arrow 6 Improves Support For R and Rust
Monday, 22 November 2021

Apache Arrow 6 has been released with improvements to support for R and Rust as well as Arrow Flight. There's also new support for DataFusion.

Apache Arrow is a development platform for in-memory analytics. It has technologies that enable big data systems to process and move data fast. It is language independent, can be used for flat and hierarchical data, and its data store is organized for efficient analytic operations. It also provides computational libraries. Languages currently supported are C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.


The improvements in the new release start with the addition of bindings for Flight in GLib and Ruby. The team says that while SQL support for Flight hasn't made it into this release, work is ongoing. Arrow Flight SQL defines a protocol for clients to communicate with SQL databases using Arrow Flight.

In Arrow's compute layer, a basic in-memory query engine has been implemented and is accessible from the R bindings. The query engine supports operations including filter, project, sort, equality joins, and various aggregations. A wide range of functions have also been added in this version, and type support has been improved for most of the compute functions.
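To make the listed operations concrete, here is a minimal pure-Python sketch of filter, project, sort, and a grouped aggregation over column-oriented data. It does not use Arrow's API: the table contents and the helper names (`filter_rows`, `project`, `sort_by`, `group_sum`) are invented for illustration, and a real engine would operate on typed, contiguous buffers rather than Python lists.

```python
from collections import defaultdict

# A tiny column-oriented "table": each key maps to one column of equal length.
# Data and helper names are hypothetical; this is not Arrow's API.
table = {
    "city": ["Oslo", "Rome", "Oslo", "Rome", "Lima"],
    "year": [2020, 2020, 2021, 2021, 2021],
    "sales": [10, 20, 30, 40, 50],
}

def filter_rows(tbl, predicate):
    """Filter: keep the rows for which predicate(row_dict) is true."""
    names = list(tbl)
    n = len(tbl[names[0]]) if names else 0
    keep = [i for i in range(n)
            if predicate({name: tbl[name][i] for name in names})]
    return {name: [tbl[name][i] for i in keep] for name in names}

def project(tbl, columns):
    """Project: keep only the named columns."""
    return {name: tbl[name] for name in columns}

def sort_by(tbl, column, descending=False):
    """Sort: reorder every column by the values of one column."""
    order = sorted(range(len(tbl[column])),
                   key=lambda i: tbl[column][i], reverse=descending)
    return {name: [col[i] for i in order] for name, col in tbl.items()}

def group_sum(tbl, key, value):
    """Aggregate: sum `value` within each distinct `key`."""
    totals = defaultdict(int)
    for k, v in zip(tbl[key], tbl[value]):
        totals[k] += v
    return {key: list(totals), value: list(totals.values())}

# Filter then project: 2021 rows, city and sales only.
slim = project(filter_rows(table, lambda row: row["year"] == 2021),
               ["city", "sales"])

# Aggregate then sort: total sales per city, largest first.
ranked = sort_by(group_sum(table, "city", "sales"), "sales", descending=True)
```

Each helper returns a new table of the same shape, which is why the operations compose into a pipeline.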

The support for R has been enhanced with a number of major new features in this version, some of which the team has been building up to for several years. In practical terms, there's more dplyr support, including the ability to carry out grouped aggregation. You can now call summarise() on Arrow data, with or without group_by(). This works both on in-memory Arrow tables and across partitioned datasets. Most common aggregation functions are supported. In addition to aggregation, Arrow now also supports all of dplyr’s mutating joins (inner, left, right, and full) and filtering joins (semi and anti).
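The difference between the two join families is easiest to see in a small example. In dplyr's terminology, a mutating join adds columns from the second table to the first, while a filtering join only keeps or drops rows of the first table. The following is a pure-Python sketch of that distinction, not the R or Arrow implementation; the sample data and function names are hypothetical, and right and full joins are left out for brevity.

```python
def inner_join(left, right, key):
    """Mutating join: matched rows only, with columns from both sides."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def left_join(left, right, key):
    """Mutating join: every left row; unmatched rows get None for right columns."""
    right_cols = {c for r in right for c in r if c != key}
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, **{c: None for c in right_cols}})
    return out

def semi_join(left, right, key):
    """Filtering join: left rows that have a match; no columns are added."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] in keys]

def anti_join(left, right, key):
    """Filtering join: left rows that have no match; no columns are added."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] not in keys]

# Hypothetical sample data.
orders = [{"id": 1, "item": "pen"}, {"id": 2, "item": "ink"}, {"id": 3, "item": "pad"}]
prices = [{"id": 1, "price": 2.5}, {"id": 3, "price": 4.0}, {"id": 4, "price": 9.0}]

joined = inner_join(orders, prices, "id")    # rows 1 and 3, with a price column
with_gaps = left_join(orders, prices, "id")  # all three orders, price None for id 2
priced = semi_join(orders, prices, "id")     # rows 1 and 3, original columns only
unpriced = anti_join(orders, prices, "id")   # row 2 only
```

Note that `semi_join` and `inner_join` select the same rows here, but only the inner join carries the `price` column across.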

The R team has also added support for DuckDB as a way to query Arrow Datasets. This means you can use duckdb’s dbplyr methods, as well as its SQL interface, to aggregate data.

Alongside the R improvements, there's new support for DataFusion. This is an embedded query engine that uses Rust and Apache Arrow to provide a system that the developers say is high performance, easy to connect, easy to embed, and high quality. This release includes a runtime operator metrics collection framework and an object store abstraction for unified access to local or remote storage. The framework includes Hive-style table partitioning support for Parquet, CSV, Avro, and JSON files, and DataFrame API support for except, intersect, show, limit, and window functions. It also has extensive SQL support, and now passes TPC-H queries 8, 13, and 21.
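Hive-style partitioning encodes column values in directory names of the form `key=value`, so an engine can skip whole directories that cannot match a query. As a rough illustration of the idea, and not of DataFusion's actual code, the sketch below parses partition values out of file paths and prunes files by them. The paths and the function names `parse_hive_partitions` and `prune` are hypothetical.

```python
from pathlib import PurePosixPath

def parse_hive_partitions(path):
    """Return {key: value} for every `key=value` directory segment in path."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name
    partitions = {}
    for segment in directories:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value  # values stay strings in this sketch
    return partitions

def prune(paths, **wanted):
    """Keep only files whose partition values match every requested key."""
    return [p for p in paths
            if all(parse_hive_partitions(p).get(k) == v
                   for k, v in wanted.items())]

# Hypothetical dataset layout.
paths = [
    "sales/year=2020/month=12/part-0.parquet",
    "sales/year=2021/month=01/part-0.parquet",
    "sales/year=2021/month=11/part-0.parquet",
]

first = parse_hive_partitions(paths[0])
only_2021 = prune(paths, year="2021")
november = prune(paths, year="2021", month="11")
```

The pruning step never opens a file: the decision is made from the path alone, which is the main attraction of this layout.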

Apache Arrow 6 is available for download.


More Information

Apache Arrow Website

Arrow On GitHub

Related Articles

Apache Arrow 5 Improves Asynchronous Scanner

Apache Arrow 4 Adds New C++ Compute Functions

Apache Arrow Improves C++ Support

Apache Arrow 2 Improves C++ and Rust Support

Apache Arrow Reaches 1.0

Apache Arrow Flight Released

Apache Arrow Adds DataFusion Rust-Native Engine

Apache Arrow Adds Streaming Binary Format

 


Last Updated ( Monday, 22 November 2021 )