Apache Samza Adds SQL
Written by Kay Ewbank   
Monday, 08 January 2018

A new version of Apache Samza adds Samza SQL along with support for Azure EventHubs and AWS Kinesis. Samza is a stream processing framework originally developed alongside Kafka at LinkedIn before being made open source and taken over by the Apache Software Foundation.

The idea behind Samza is to provide a simple way to develop and run stream processing jobs, one that can be used by non-programmers as well as developers. Samza uses Apache Kafka for messaging, and Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. It supports local state via a RocksDB store, which allows a stateful application to scale up to 1.1 million events/sec on a single machine with an SSD.

Samza has a simple callback-based “process message” API comparable to MapReduce. It supports managed state via snapshotting and restoration of a stream processor’s state. When a processor is restarted, Samza restores its state to a consistent snapshot. It also provides fault tolerance by working with YARN to transparently migrate tasks to another machine if the active machine in the cluster fails. Samza relies on Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.
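To make the callback and snapshot ideas concrete, here is a minimal sketch in plain Python. It does not use the Samza API: the `CountingTask` class, its `process`, `snapshot` and `restore` methods, and the `run` helper are all hypothetical names chosen for illustration, and an in-memory dictionary stands in for the RocksDB store.

```python
from copy import deepcopy


class CountingTask:
    """Hypothetical task that counts messages per key via a process() callback."""

    def __init__(self):
        self.state = {}    # local state, standing in for a RocksDB store
        self.offset = -1   # offset of the last message processed

    def process(self, offset, key):
        # Invoked once per incoming message, in partition order.
        self.state[key] = self.state.get(key, 0) + 1
        self.offset = offset

    def snapshot(self):
        # Capture state and offset together so the two stay consistent.
        return {"state": deepcopy(self.state), "offset": self.offset}

    def restore(self, snap):
        self.state = deepcopy(snap["state"])
        self.offset = snap["offset"]


def run(task, partition, start=0):
    # Feed messages to the callback one at a time, starting at a given offset.
    for offset in range(start, len(partition)):
        task.process(offset, partition[offset])


partition = ["a", "b", "a", "c", "a"]

task = CountingTask()
run(task, partition[:3])       # process offsets 0 to 2
checkpoint = task.snapshot()

# Simulate a restart: a fresh task restores the snapshot and resumes
# from the message after the checkpointed offset.
restarted = CountingTask()
restarted.restore(checkpoint)
run(restarted, partition, start=checkpoint["offset"] + 1)
```

Because the snapshot records the state and the offset together, the restarted task ends up with the same counts it would have had without the interruption, and no message is counted twice.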

Samza is partitioned and distributed at every level. It has a pluggable API that means it can be run with other messaging systems and in other execution environments, though it is designed to work out of the box with Kafka and YARN. Samza is written in Scala and Java.

The new release of Samza adds three main features. The first is Samza SQL, a high-level API designed to widen the audience for stream processing by making it accessible to anyone who can write SQL. The developers say Samza SQL can be used to obtain quick real-time insights and to quickly create stream processing applications.



Samza SQL is based on Apache Calcite, an open source SQL language framework used by several Apache projects. The way Samza SQL works is that you write a normal SQL query, and the API deals with creating, configuring, and managing the pipeline.
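As a rough illustration of the "just write a query" idea, the sketch below runs an ordinary SQL aggregation using Python's built-in `sqlite3` module. This is not Samza SQL: the query runs over a static in-memory table rather than a stream, the `page_views` table and its columns are made up for the example, and Samza's own dialect and stream naming may differ.

```python
import sqlite3

# An in-memory table stands in for a stream of events (an assumption
# for illustration; Samza SQL would read from a stream such as a Kafka topic).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (member_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [(1, "home"), (2, "jobs"), (1, "jobs"), (3, "home")],
)

# The only thing the author writes is the query itself; setting up and
# running the execution plan is left to the engine.
query = """
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY page
"""
results = conn.execute(query).fetchall()
conn.close()
```

The point of the comparison is the division of labor: the query states what is wanted, and the engine (SQLite here, the Samza SQL layer in the article) decides how to execute it.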

The second improvement is an Azure EventHubs producer, consumer and checkpoint provider. The third is a new AWS Kinesis consumer. Other improvements include durable state in the high-level API, ZooKeeper-based deployment stability, and multi-stage batch processing.


More Information

Samza Site

Related Articles

Apache Bigtop Adds OpenJDK 8 Support 

Apache Fluo Improves Spark Integration

Kafka 1 Becomes More Tolerant

Comparing Kafka To RabbitMQ

Apache Kafka Adds New Streams API

GoKa Stream Processing For Kafka
