Authors: Nathan Marz and James Warren
The problem with big data is the sheer amount of information that’s out there needing to be analyzed, and this book looks in depth at one approach to that problem – the Lambda Architecture Model.
Nathan Marz is the man behind Apache Storm, and he also invented the Lambda Architecture Model for big data systems, so it’s no surprise that this book, subtitled "Principles and best practices of scalable real-time data systems", is in fact an in-depth look at how you can use the Lambda Architecture Model (LAM) to manage big data.
The idea behind LAM is that big data databases are too large to manage and query in real time, so you split the work: a batch layer precomputes the queries over the master dataset, and the results are read from precomputed views in a serving layer. The serving layer is indexed so that it can be accessed quickly with random reads, and is updated as often as is feasible given the size of the data – perhaps every few hours. Data that arrives after the batch run is handled separately in a speed layer, which creates views on the data as it comes in; the real-time view is updated incrementally to minimize latency. The two sets of results are then combined to give an overall accurate answer in real time. One thing the book doesn’t explain is why the technique is called the Lambda Architecture. I’d assumed it was some tie-up to the lambda calculus, but there’s a persistent story that Marz called it Lambda because of the overall shape of his original system diagram.
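To make the batch/speed split concrete, here is a minimal sketch of LAM's query path in Python – not code from the book, and all the names (pageview counts, the view dictionaries) are purely illustrative:

```python
# Precomputed batch view: pageview counts up to the last batch run.
batch_view = {"/home": 1200, "/about": 310}

# Real-time view: counts for events that arrived after the batch run.
realtime_view = {"/home": 7, "/contact": 2}

def query(url):
    """Merge the batch and real-time views for an up-to-date answer."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

def absorb_new_event(url):
    """The speed layer updates the real-time view as each event arrives."""
    realtime_view[url] = realtime_view.get(url, 0) + 1

def batch_recompute(all_events):
    """The batch layer periodically recomputes its view from the whole
    master dataset; entries the batch run has absorbed are then dropped
    from the real-time view."""
    batch_view.clear()
    for url in all_events:
        batch_view[url] = batch_view.get(url, 0) + 1
    realtime_view.clear()
```

The point of the structure is that the expensive, accurate computation happens offline in `batch_recompute`, while `absorb_new_event` keeps answers current between runs.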
The book is divided into three parts, covering the three layers of LAM – batch, serving, and speed. One nice aspect of the book is that the chapters alternate between theory and examples, so chapter two discusses how you model your data and create schemas, then chapter three looks at actually doing the tasks using Apache Thrift. The next chapter looks at what you need in the way of storage, and the following one covers how to actually store a dataset using HDFS – the Hadoop Distributed File System.
In some ways the next two chapters are the key to the whole concept: how do you work out which queries to pre-compute to create the batch views, and what’s the best way to do the pre-computation? The use of MapReduce is discussed in these chapters, along with JCascalog, a Java library for expressing MapReduce computations as abstractions based on pipe diagrams. You describe the output in terms of the input, and JCascalog works out the best way to perform the necessary calculations as a series of MapReduce jobs.
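For readers unfamiliar with the underlying model, here is a toy illustration of MapReduce itself (not JCascalog, which operates at a higher level): a map function emits key/value pairs, the pairs are grouped by key, and a reduce function folds each group. The function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the mapper to every record, collecting emitted pairs."""
    return [pair for rec in records for pair in mapper(rec)]

def reduce_phase(pairs, reducer):
    """Group pairs by key (the 'shuffle/sort' step) and reduce each group."""
    pairs.sort(key=itemgetter(0))
    return {k: reducer(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=itemgetter(0))}

# Word count, the canonical MapReduce example.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)

lines = ["big data big systems", "data systems data"]
counts = reduce_phase(map_phase(lines, mapper), reducer)
# counts == {"big": 2, "data": 3, "systems": 2}
```

JCascalog's contribution, as the book presents it, is letting you state the output declaratively and leave the decomposition into jobs like this to the library.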
Part I concludes with chapters showing the implementation of a batch layer, from the architecture and algorithms to the working code.
Part II is much shorter, consisting of just two chapters on the ‘serving layer’. The first chapter explains the concepts, and the second works through an example serving layer database called ElephantDB.
The final part of the book looks at the ‘speed layer’, the element that handles the recent data changes so that query results are up to date. The authors discuss real-time views versus batch views, and there’s a chapter showing how Apache Cassandra can be used to provide real-time views. There’s an interesting discussion of asynchronous architectures, looking at using queues and stream processing for incremental computation. The authors then go on to show implementing one-at-a-time stream processing using Apache Kafka and Apache Storm. The alternative micro-batched stream processing is then explored, with examples based on Trident.
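The distinction between the two streaming styles can be sketched schematically – this is not Storm or Trident code, just an illustration of the state-update granularity, with invented names throughout:

```python
stream = ["a", "b", "a", "c", "a", "b"]

# One-at-a-time: state is updated per event, giving the lowest latency.
one_at_a_time = {}
for event in stream:
    one_at_a_time[event] = one_at_a_time.get(event, 0) + 1

# Micro-batched: events are processed in small atomic batches, which
# makes exactly-once state updates easier to achieve (Trident's approach).
micro_batched = {}
for i in range(0, len(stream), 3):
    for event in stream[i:i + 3]:
        micro_batched[event] = micro_batched.get(event, 0) + 1

# Both styles arrive at the same counts; the trade-off is latency
# versus the simpler fault-tolerance of batch-granularity updates.
```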
Overall, this is a really good exploration of the techniques of the Lambda Architecture and the advantages it might offer. The main caveat is that it isn’t really a book about big data in general; it’s about one solution to the big data problem.
Last Updated: Saturday, 19 September 2015