Authors: Nathan Marz and James Warren
The problem with big data is the sheer amount of information that’s out there needing to be analyzed, and this book looks in depth at one approach to that problem – the Lambda Architecture Model.
Nathan Marz is the man behind Apache Storm, and he also invented the Lambda Architecture Model for big data systems, so it’s no surprise that this book, subtitled "Principles and best practices of scalable real-time data systems", is in fact an in-depth look at how you can use the Lambda Architecture Model (LAM) to manage big data.
The idea behind LAM is that big data databases are too large to manage and query in real time, so you split the work: a batch layer precomputes the queries over the master dataset, and the results are read from precomputed views in a serving layer. The serving layer is indexed so that it can be accessed quickly with random reads, and is updated as often as is feasible given the size of the data – perhaps every few hours. Data that arrives after the batch run is handled separately in a speed layer, which creates views on the data as it comes in; the real-time view is updated incrementally to minimize latency. The two sets of results are then combined to give an overall accurate answer in real time. One thing the book doesn’t explain is why the technique is called the Lambda Architecture. I’d assumed it was some tie-up to the lambda calculus, but there’s a persistent story that Marz called it Lambda because of the overall shape of his original system diagram.
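To make the batch/speed split concrete, here is a minimal sketch of LAM's query path in Python – not code from the book, and all the names (pageview counts, the view dictionaries) are purely illustrative:

```python
# Precomputed batch view: pageview counts up to the last batch run.
batch_view = {"/home": 1200, "/about": 310}

# Real-time view: counts for events that arrived after the batch run.
realtime_view = {"/home": 7, "/contact": 2}

def query(url):
    """Merge the batch and real-time views for an up-to-date answer."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

def absorb_new_event(url):
    """The speed layer updates the real-time view as each event arrives."""
    realtime_view[url] = realtime_view.get(url, 0) + 1

def batch_recompute(all_events):
    """The batch layer periodically recomputes its view from the whole
    master dataset; entries the batch run has absorbed are then dropped
    from the real-time view."""
    batch_view.clear()
    for url in all_events:
        batch_view[url] = batch_view.get(url, 0) + 1
    realtime_view.clear()
```

The point of the structure is that the expensive, accurate computation happens offline in `batch_recompute`, while `absorb_new_event` keeps answers current between runs.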
The book is divided into three parts, covering the three layers of LAM – batch, serving, and speed. One nice aspect of the book is that the chapters alternate between theory and examples, so chapter two discusses how you model your data and create schemas, then chapter three looks at actually doing the tasks using Apache Thrift. The next chapter looks at what you need in the way of storage, and the following one covers how to actually store a dataset using HDFS – the Hadoop Distributed File System.
In some ways the next two chapters are the key to the whole concept: how do you work out which queries to pre-compute to create the batch views, and what’s the best way to do the pre-computation? The use of MapReduce is discussed in these chapters, along with JCascalog, a Java library for expressing MapReduce computations as abstractions based on pipe diagrams. You describe the output in terms of the input, and JCascalog works out the best way to perform the necessary calculations as a series of MapReduce jobs.
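For readers unfamiliar with the underlying model, here is a toy illustration of MapReduce itself (not JCascalog, which operates at a higher level): a map function emits key/value pairs, the pairs are grouped by key, and a reduce function folds each group. The function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the mapper to every record, collecting emitted pairs."""
    return [pair for rec in records for pair in mapper(rec)]

def reduce_phase(pairs, reducer):
    """Group pairs by key (the 'shuffle/sort' step) and reduce each group."""
    pairs.sort(key=itemgetter(0))
    return {k: reducer(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))}

# Word count, the canonical MapReduce example.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)

lines = ["big data big systems", "data systems data"]
counts = reduce_phase(map_phase(lines, mapper), reducer)
# counts == {"big": 2, "data": 3, "systems": 2}
```

JCascalog's contribution, as the book presents it, is letting you state the output declaratively and leave the decomposition into jobs like this to the library.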
Part I concludes with chapters showing the implementation of a batch layer, from the architecture and algorithms to the working code.
Part II is much shorter, consisting of just two chapters on the ‘serving layer’. The first chapter explains the concepts, and the second works through an example serving layer database called ElephantDB.
The final part of the book looks at the ‘speed layer’, the element that handles the recent data changes so that query results are up to date. The authors discuss real-time views versus batch views, and there’s a chapter showing how Apache Cassandra can be used to provide real-time views. There’s an interesting discussion of asynchronous architectures, looking at using queues and stream processing for incremental computation. The authors then go on to show implementing one-at-a-time stream processing using Apache Kafka and Apache Storm. The alternative micro-batched stream processing is then explored, with examples based on Trident.
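The distinction between the two streaming styles can be sketched schematically – this is not Storm or Trident code, just an illustration of the state-update granularity, with invented names throughout:

```python
stream = ["a", "b", "a", "c", "a", "b"]

# One-at-a-time: state is updated per event, giving the lowest latency.
one_at_a_time = {}
for event in stream:
    one_at_a_time[event] = one_at_a_time.get(event, 0) + 1

# Micro-batched: events are processed in small atomic batches, which
# makes exactly-once state updates easier to achieve (Trident's approach).
micro_batched = {}
for i in range(0, len(stream), 3):
    for event in stream[i:i + 3]:
        micro_batched[event] = micro_batched.get(event, 0) + 1

# Both styles arrive at the same counts; the trade-off is latency
# versus the simpler fault-tolerance of batch-granularity updates.
```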
Overall, this is a really good exploration of the techniques of the Lambda Architecture and the advantages it might offer. The main caveat is that it isn’t really a book about big data in general; it’s about one solution to the big data problem.
Last Updated: Saturday, 19 September 2015