Apache Arrow Adds Streaming Binary Format
Written by Kay Ewbank   
Monday, 06 March 2017

There's a new version of Apache Arrow that is being described as a major milestone for the project. Apache Arrow is a columnar in-memory analytics layer that permits random access.

Arrow isn’t a standalone piece of software. It is used as a component within systems to accelerate analytics and to allow Arrow-enabled systems to exchange data with low overhead. It is sufficiently flexible to support most complex data models.

Apache Arrow provides a set of canonical in-memory representations of flat and hierarchical data, along with bindings in multiple languages for manipulating those structures. It also provides low-overhead streaming and batch messaging, zero-copy interprocess communication (IPC), and common algorithm implementations.

Todd Lipcon, original Apache Kudu creator and member of the Apache Arrow Project Management Committee, said Apache Arrow is important because:

"A columnar in-memory data layer enables systems and applications to process data at full hardware speeds. Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."

In many workloads, 70-80% of CPU cycles are spent serializing and deserializing data. Arrow solves this problem by enabling data to be shared between systems and processes with no serialization, deserialization or memory copies.
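The difference between copying data through a serializer and sharing it in place can be illustrated with the Python standard library alone. This is only a conceptual sketch: it uses `pickle` as a stand-in serializer and `memoryview` as a stand-in for a shared buffer, and none of it is Arrow's own API.

```python
import pickle
from array import array

# A producer holds one column of 64-bit floats in a contiguous buffer.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Serialization path: the consumer receives an independent copy of the data.
copied = pickle.loads(pickle.dumps(column))

# Zero-copy path: a memoryview exposes the same underlying buffer.
view = memoryview(column)

# Mutate the producer's buffer to show which consumer shares its memory.
column[0] = 99.0

copy_sees_change = copied[0] == 99.0
view_sees_change = view[0] == 99.0
```

The copy is unaffected by the later write, while the view reflects it because no bytes were duplicated. Arrow applies the same idea across processes and languages by agreeing on a common memory layout.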

The component can be particularly useful for Python and R developers, as Arrow provides an option for data interoperability, which has been one of the biggest roadblocks to tighter integration with big data systems.

The benefits of Apache Arrow start with its columnar memory layout, which permits random access. The layout is highly cache-efficient in analytics workloads and supports SIMD optimizations on modern processors. This lets developers create very fast algorithms that process Arrow data structures.
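As a rough illustration of what a columnar layout means, the following sketch turns row-oriented records into fixed-width columns using Python's standard `array` module. The `rows` data and the `to_columns` helper are hypothetical names chosen for this example; the code does not use Arrow itself.

```python
from array import array

# Row-oriented: each record is a separate Python tuple of (id, price).
rows = [(1, 10.5), (2, 20.0), (3, 7.25), (4, 12.25)]

def to_columns(rows):
    """Split (id, price) rows into two contiguous, fixed-width columns."""
    ids = array("q", (r[0] for r in rows))      # 64-bit signed integers
    prices = array("d", (r[1] for r in rows))   # 64-bit floats
    return ids, prices

ids, prices = to_columns(rows)

# Random access: element i sits at byte offset i * itemsize in the buffer.
third_price = prices[2]

# A column scan reads one contiguous buffer and never touches the ids.
total = sum(prices)
```

Because every value in a column has the same width and sits next to its neighbours, a scan over one column is friendly to CPU caches and to vectorized instructions.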

Another benefit is efficient, fast data interchange between systems, without the serialization costs associated with formats such as Thrift, Avro, and Protocol Buffers.

The final benefit of Arrow is the flexibility of its structured data model that supports complex types. It handles flat tables as well as real-world JSON-like data engineering workloads.
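Nested, JSON-like data can still be stored in flat columns. Arrow's layout for variable-length lists pairs an offsets buffer with a flat child array of values. The sketch below imitates that idea with the standard library. It is a simplification that leaves out null handling, and `encode_lists` and `get_list` are hypothetical helper names.

```python
from array import array

def encode_lists(lists):
    """Flatten a list of integer lists into (offsets, values) columns."""
    offsets = array("i", [0])   # offsets[i] marks where list i starts
    values = array("q")         # all elements stored back to back
    for item in lists:
        values.extend(item)
        offsets.append(len(values))
    return offsets, values

def get_list(offsets, values, i):
    """Return the i-th list by slicing the flat values column."""
    return list(values[offsets[i]:offsets[i + 1]])

offsets, values = encode_lists([[1, 2], [], [3, 4, 5]])
last = get_list(offsets, values, 2)
```

Any list can be found with two offset lookups, so random access is preserved even though the lists differ in length.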

This release is a major milestone for the project, as it adds integration tests validating binary compatibility between the Java and C++ (and Python) implementations.

Another improvement in the new version is a streaming binary format, with Java and C++/Python implementations.
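The general idea of a streaming format is that a reader can process one batch at a time without knowing in advance how many will follow. The sketch below shows that idea with simple length-prefixed messages. It is a toy framing invented for this example, not the Arrow wire format, which also carries a schema and per-batch metadata.

```python
import io
import struct
from array import array

def write_batches(stream, batches):
    """Write each batch of float64 values as a length-prefixed message."""
    for batch in batches:
        payload = array("d", batch).tobytes()          # native byte order
        stream.write(struct.pack("<i", len(payload)))  # 4-byte length prefix
        stream.write(payload)

def read_batches(stream):
    """Yield batches one at a time until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if not header:
            return  # end of stream
        (size,) = struct.unpack("<i", header)
        batch = array("d")
        batch.frombytes(stream.read(size))
        yield batch

buffer = io.BytesIO()
write_batches(buffer, [[1.0, 2.0], [3.0]])
buffer.seek(0)
decoded = [list(b) for b in read_batches(buffer)]
```

Here an in-memory `BytesIO` stands in for a socket or pipe. The reader is a generator, so a consumer can begin work on the first batch before the producer has written the last one.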

The Python functionality has been significantly expanded, particularly for pandas and Apache Parquet interoperability. A JSON file "format" for specifying integration tests has been added, and there is expanded zero-copy or low-overhead thread-safe I/O for C++.

 

More Information

Apache Arrow Page

Related Articles

Apache Kafka Adds New Streams API

Apache Beam Moves To Top Level

HBase Adds MultiWAL Support

Spark BI Gets Fine Grain Security

 

Last Updated ( Monday, 06 March 2017 )
 
 

   