Apache Beam Moves To Java 8
Written by Kay Ewbank   
Wednesday, 28 February 2018

Apache Beam, the open source programming SDK for defining batch and streaming data-parallel processing pipelines, is now available in a new version that moves to Java 8 and Spark 2.x.

 

Apache Beam has a number of SDKs that you can use to build a program that defines a pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Beam began life at Google, where it is used in the Google Cloud Dataflow (GCD) service. Beam uses the same API as GCD.

The latest version now uses Java 8 as its supported Java version, and the code and examples in Beam have been reworked to take advantage of the improvements in Java 8 such as lambdas, streams, and improved type inference.

Beam's Spark runner has also been updated to the Spark 2.x development line, both to improve performance and to allow for future compatibility with the Structured Streaming APIs. The Beam Pipeline Runners translate the data processing pipeline you define with your Beam program into an API compatible with the distributed processing back-end of your choice.
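To make the runner idea concrete, here is a minimal sketch in plain Python. It does not use the Beam SDK: the `Pipeline` type and the `eager_runner` and `lazy_runner` functions are hypothetical names invented for illustration. It only shows the general principle that one pipeline definition can be handed to different execution strategies and give the same result.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical pipeline description: an ordered list of per-element functions.
# This is NOT the Beam API, just a stand-in for "a pipeline you define once".
Pipeline = List[Callable[[int], int]]


def eager_runner(pipeline: Pipeline, data: Iterable[int]) -> List[int]:
    """Run the pipeline stage by stage, materializing a list after each stage."""
    current = list(data)
    for step in pipeline:
        current = [step(x) for x in current]
    return current


def lazy_runner(pipeline: Pipeline, data: Iterable[int]) -> Iterator[int]:
    """Run the same pipeline one element at a time, as a stream."""
    for x in data:
        for step in pipeline:
            x = step(x)
        yield x


# The pipeline is defined once, independently of how it will be executed.
pipeline: Pipeline = [lambda x: x + 1, lambda x: x * 10]

eager_result = eager_runner(pipeline, [1, 2, 3])
lazy_result = list(lazy_runner(pipeline, [1, 2, 3]))
```

Both runners produce the same output for the same pipeline. A real Beam runner does far more, translating the pipeline into the native operations of a back-end such as Spark or Flink.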

Support for AWS S3 has also been improved. In previous versions, AWS S3 was supported via the HadoopFileSystem, but the new release adds native support for S3, which improves performance.

The final improvement of note is the addition of the Splittable DoFn API to the Python SDK, along with Splittable DoFn support in the Python streaming DirectRunner.

Splittable DoFn Example

 

DoFn is a Beam SDK class that defines a distributed processing function. The DoFn object contains the processing logic that gets applied to the elements in the input collection. It processes one element at a time. Splittable DoFn is a generalization of DoFn that can be used to develop more powerful IO connectors than before, with shorter, simpler, more reusable code.
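The difference between the two can be sketched in plain Python. This is an illustration of the concept only and does not use the Beam SDK: `SplitWordsFn`, `par_do`, `OffsetRange` and `ReadLinesFn` are hypothetical names, and the sketch assumes that the work for one element can be described by a range of offsets that can be divided into smaller ranges.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


class SplitWordsFn:
    """Hypothetical stand-in for a DoFn: called once per input element."""

    def process(self, element: str) -> Iterator[str]:
        # One input element may yield zero, one, or many output elements.
        for word in element.split():
            yield word


def par_do(elements: Iterable[str], fn: SplitWordsFn) -> List[str]:
    """Apply fn.process to each element independently and flatten the results."""
    return [out for element in elements for out in fn.process(element)]


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical restriction: the half-open range [start, stop) of work."""
    start: int
    stop: int

    def split(self, parts: int) -> List["OffsetRange"]:
        """Divide the range into at most `parts` contiguous, non-overlapping pieces."""
        size = self.stop - self.start
        step = max(1, -(-size // parts))  # ceiling division
        return [OffsetRange(lo, min(lo + step, self.stop))
                for lo in range(self.start, self.stop, step)]


class ReadLinesFn:
    """Hypothetical splittable function: processes an (element, restriction) pair."""

    def process(self, element: List[str], restriction: OffsetRange) -> Iterator[str]:
        # Only the part of the element covered by the restriction is read.
        for index in range(restriction.start, restriction.stop):
            yield element[index]


lines = ["to be or", "not to be", "that is", "the question"]

# Ordinary element-wise processing: each line is handled on its own.
words = par_do(lines, SplitWordsFn())

# Splittable processing: the work for ONE element is divided into pieces
# that could, in principle, be handed to different workers.
whole = OffsetRange(0, len(lines))
pieces = whole.split(2)
reader = ReadLinesFn()
recombined = [line for piece in pieces for line in reader.process(lines, piece)]
```

In this sketch the pieces are processed one after another and recombined in order. The point is only that processing a single large element, such as a file, no longer has to be one indivisible unit of work.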


More Information

Beam Website

Related Articles

Apache Beam Moves To Top Level

Apache Spark 2.0 Released

Flink Gets Event-time Streaming

Google Announces Big Data the Cloud Way

 


Last Updated ( Wednesday, 28 February 2018 )