Kafka: The Definitive Guide
Authors: Neha Narkhede, Gwen Shapira and Todd Palino
Kafka is increasingly popular for moving large amounts of streaming data. This guide, subtitled Real-time data and stream processing at scale, has been written to show how the people who built Kafka control and use it.
The authors are from Confluent and LinkedIn, and were among the team responsible for developing Kafka. They say that they wrote the book from the perspective of asking 'what are the most useful things we can share with new users to take them from beginner to expert'.
The book has some parts that are aimed at developers, others that are more useful for administrators of Kafka. It opens with a general introduction to Kafka and what it does, followed by a chapter on installing Kafka.
Having got those openers out of the way, the authors get into the heart of the book, beginning with a chapter on Kafka Producers and how to write messages to Kafka. Next comes a chapter on Kafka Consumers, and how to read data from Kafka. Both chapters have plenty of code snippets that illustrate the concepts being discussed. The samples are there to show the concepts; they are not complete programs that you could copy, paste and run.
A chapter on Kafka Internals is next, looking at how Kafka replication works, how it handles requests from producers and consumers, how it deals with message storage, and how it handles partitions. All these topics are explained with the aim of giving a better understanding of why Kafka behaves in certain ways in certain situations.
The next chapter is titled Reliable Data Delivery, and looks at reliability guarantees and how to configure brokers.
A chapter on building data pipelines comes next, starting with what to think about when building a pipeline, then going on to an introduction to Kafka Connect, with examples of connectors between a file source and a file sink, and between MySQL and Elasticsearch. There's also a discussion of alternatives to Connect.
Cross-cluster data mirroring is the next topic to be considered. The rest of the book concentrates on using a single Kafka cluster, but this chapter shows how to handle the situation where you need to copy data between clusters using MirrorMaker, Kafka's cross-cluster data replicator, including how to configure and tune it.
A chapter on administering Kafka is next, mainly looking at Kafka's command-line utilities that you can use for basic cluster administration. However, as the authors point out, there are better third-party tools available on the Kafka website. This chapter is followed by a look at how to monitor a Kafka cluster using the Java Management Extensions (JMX) interface. The authors discuss the different metrics, which ones are critical enough to monitor all the time, and what you should do in response to different results. They also look at which metrics are useful when debugging problems.
The final chapter looks at stream processing and how Kafka Streams works. This is Kafka's stream-processing library, and the authors show how to use it to build a topology and put that topology to work. The chapter ends with some stream processing use cases.
Overall, I found this book to be clearly written, and it gave me a good explanation of what Kafka is capable of. The code samples illustrated the points well, and the authors obviously have a detailed knowledge of everything about Kafka. The one drawback of that expertise is that they sometimes gave a much shorter explanation of a point or concept where I'd have preferred a slower, more detailed description. That's still a minor point, and if you need to learn about Kafka, this is a very good book.