Apache Samza Adds SQL
Written by Kay Ewbank
Monday, 08 January 2018
There's a new version of Apache Samza that adds Samza SQL along with support for both Azure EventHubs and AWS Kinesis. Samza is a stream processing framework originally developed alongside Kafka by LinkedIn before being made open source and taken over by the Apache Software Foundation.
The idea behind Samza is to provide a simple way to develop and run stream processing jobs that can be used by non-programmers as well as developers. Samza uses Apache Kafka for messaging, and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. It supports local state via a RocksDB store, which allows a stateful application to scale up to 1.1 million events per second on a single machine with an SSD.
Samza has a simple callback-based “process message” API comparable to MapReduce. It supports managed state via snapshotting and restoration of a stream processor’s state. When a processor is restarted, Samza restores its state to a consistent snapshot. It also provides fault tolerance by working with YARN to transparently migrate tasks to another machine if the active machine in the cluster fails. Samza relies on Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.
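To make the callback and snapshot ideas concrete, here is a minimal sketch in Python. It is a toy model only, not Samza's actual API: the class `PageViewCounterTask`, its `process`, `snapshot` and `restore` methods, and the `run` helper are all hypothetical names chosen for illustration. The point it shows is that state and input position are saved together, so a restarted task can resume from a consistent point.

```python
from collections import defaultdict


class PageViewCounterTask:
    """Toy stream task that counts messages per user (hypothetical, not Samza's API)."""

    def __init__(self):
        self.counts = defaultdict(int)  # local, per-task state
        self.offset = -1                # offset of the last processed message

    def process(self, offset, message):
        # Callback invoked once per incoming message, in partition order.
        self.counts[message["user"]] += 1
        self.offset = offset

    def snapshot(self):
        # Capture state and input position together so they stay consistent.
        return {"counts": dict(self.counts), "offset": self.offset}

    def restore(self, snap):
        # Rebuild the task from a previously taken snapshot.
        self.counts = defaultdict(int, snap["counts"])
        self.offset = snap["offset"]


def run(task, partition, start):
    # Feed messages to the callback, beginning at the given offset.
    for offset in range(start, len(partition)):
        task.process(offset, partition[offset])


# An ordered partition of messages (assumed sample data).
partition = [{"user": "a"}, {"user": "b"}, {"user": "a"}, {"user": "a"}]

task = PageViewCounterTask()
run(task, partition[:2], 0)   # process offsets 0 and 1
snap = task.snapshot()        # take a checkpoint

# Simulate a failure: a fresh task restores the snapshot and resumes
# from the message after the last one covered by the snapshot.
recovered = PageViewCounterTask()
recovered.restore(snap)
run(recovered, partition, recovered.offset + 1)
final_counts = dict(recovered.counts)
```

Because the recovered task resumes at the offset stored in the snapshot, each message is counted exactly once in this sketch, even though the original task "failed" halfway through the partition.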
Samza is partitioned and distributed at every level. It has a pluggable API that means it can be run with other messaging systems and in other execution environments, though it is designed to work out of the box with Kafka and YARN. Samza is written in Scala and Java.
The new release of Samza adds three main new features. The first is Samza SQL. This is a high-level API designed to widen the audience for stream processing by making it accessible to anyone who can write SQL. The developers say Samza SQL can be used to obtain quick real-time insights, and to create stream processing applications quickly.
Samza SQL is based on Apache Calcite, an open source SQL language framework used by several Apache projects. The way Samza SQL works is that you write a normal SQL query, and the API deals with creating, configuring, and managing the pipeline.
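As a rough illustration of the kind of query involved, the sketch below runs an ordinary SQL statement over a small batch of events using Python's built-in `sqlite3` module. This is a stand-in under stated assumptions, not Samza SQL itself: the `page_views` table, its columns and the sample rows are invented for the example, and SQLite evaluates the query once over a fixed table, whereas a streaming engine would evaluate it continuously as events arrive.

```python
import sqlite3

# In-memory database standing in for a stream of events (assumption:
# Samza SQL would read these from a stream rather than a table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views (user_id TEXT, page TEXT, latency_ms INTEGER)"
)

# Hypothetical sample events.
events = [
    ("alice", "/home", 120),
    ("bob", "/search", 480),
    ("alice", "/search", 510),
    ("carol", "/home", 95),
]
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", events)

# A normal SQL query: count slow page views per page.
query = """
    SELECT page, COUNT(*) AS slow_views
    FROM page_views
    WHERE latency_ms > 400
    GROUP BY page
    ORDER BY page
"""
slow_pages = conn.execute(query).fetchall()
conn.close()
```

The author of the query only states what result is wanted; filtering, grouping and counting are planned by the engine, which is the division of labour the article describes for Samza SQL.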
The second improvement is an Azure EventHubs producer, consumer and checkpoint provider. The third is an AWS Kinesis consumer. Other improvements include durable state in the high-level API, ZooKeeper-based deployment stability, and multi-stage batch processing.