Hadoop: The Definitive Guide (4th ed)

Author: Tom White
Publisher: O'Reilly 
Pages: 768
ISBN: 978-1491901632
Print: 1491901632
Kindle: B00V7B1IZC
Audience: Hadoop developers, architects and admins
Rating: 4.8
Reviewer: Ian Stirk

This very popular Hadoop book has reached its fourth edition. How does it fare?

The creation of ever-increasing amounts of data has changed the way data is processed. Hadoop is the most popular platform for processing this big data. This updated book covers Hadoop 2 exclusively, with new chapters on several of Hadoop’s components.

This is a wide-ranging book divided into five parts. It covers Hadoop’s core components, Hadoop installation and maintenance, various Hadoop-related projects, and some case studies, spread over twenty-four chapters.

This book is aimed at developers, architects, and administrators. Some experience of Java or equivalent is needed to get the most out of the book.

Below is a chapter-by-chapter exploration of the topics covered.



Part I Hadoop Fundamentals

Chapter 1 Meet Hadoop

This chapter opens with a look at the recent explosion in data volumes. It is estimated that digital data occupied 4.4 zettabytes in 2013 (a zettabyte is a billion terabytes), and this is expected to grow to 44 zettabytes by 2020. Big data systems have been created to process this data in a timely manner. Hadoop is the most popular big data platform.

The Hadoop Distributed File System (HDFS) allows files to be split, distributed, and duplicated over the servers. This gives resilience and allows parallel processing. MapReduce is the processing model used for distributed batch processing.

The earlier version of Hadoop had some problems. Many of these have been solved with the implementation of Yet Another Resource Negotiator (YARN), which is responsible for cluster resource management. YARN also allows other processing models, in addition to MapReduce, to run.

A useful overview of the history of Hadoop is provided, covering the milestones from its beginnings as Nutch in 2002 to its award-winning data processing benchmarks. The chapter ends with a brief look at what’s in the rest of the book.

This chapter provides a useful overview of how the need for distributed processing arose, and how Hadoop fulfils these needs. The core components of Hadoop (HDFS, MapReduce, and YARN) are outlined. It’s noted that increasingly, Hadoop is taken to mean both the core components and various related components.

The prose, explanations, examples, and diagrams are well written, relevant and helpful – as they are throughout the book.


Chapter 2 MapReduce

The chapter opens with a look at MapReduce, the batch processing model of Hadoop. It is relatively simple, can be used with a variety of languages (e.g. Java, Python), is inherently parallel, and works best with large data volumes.

The chapter continues with a look at a weather dataset, which is used as the sample data in the book’s examples. An exercise to find the minimum and maximum temperatures for various weather sites, for various years, is undertaken using a more traditional tool (i.e. bash), and compared with using Hadoop, showing that Hadoop can process large data volumes much more quickly.

MapReduce contains two main stages. The Map stage processes data in parallel over several servers. This processed data is then passed to the Reduce stage, which typically aggregates the data.
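
The two stages can be illustrated with a minimal single-machine sketch in Python. It is not Hadoop code and not taken from the book: the records, the function names and the in-memory grouping step are all assumptions made for illustration, and on a real cluster the map calls would be spread over several servers.

```python
from collections import defaultdict

# Hypothetical (year, temperature) records, standing in for parsed weather data.
RECORDS = [(1949, 111), (1949, 78), (1950, 0), (1950, 22), (1950, -11)]

def map_stage(record):
    """Emit one (key, value) pair for one input record."""
    year, temperature = record
    return (year, temperature)

def shuffle(pairs):
    """Group values by key, which happens between the Map and Reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_stage(key, values):
    """Aggregate all values for one key; here, the maximum temperature."""
    return (key, max(values))

def run_job(records):
    # On a cluster, the map calls would run in parallel on different servers.
    pairs = [map_stage(record) for record in records]
    grouped = shuffle(pairs)
    return dict(reduce_stage(key, values) for key, values in sorted(grouped.items()))

print(run_job(RECORDS))  # {1949: 111, 1950: 22}
```

Swapping `max` for `min` in `reduce_stage` gives the minimum temperature per year; the map and grouping steps stay the same.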

This chapter provides a useful overview of what MapReduce is, and its main stages. A useful comparison is made between data processing on a single machine using bash, and using MapReduce across many machines. 

Chapter 3 The Hadoop Distributed Filesystem

The chapter opens with the observation that a dataset can be larger than a single machine can store, thus emphasising the need for a distributed file system.

HDFS is designed for very large files, running on commodity hardware. It is less suitable for low-latency access, or accessing lots of small files. The chapter looks at various features of HDFS, including: 

  • Blocks – the default blocksize is 128MB

  • Namenode (master, coordination) and datanodes (slaves, do the work)

  • HDFS Federation – scales cluster by adding other namenodes

  • HDFS High Availability – uses a standby namenode 
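
To make the block arithmetic concrete, here is a small Python sketch. It assumes the Hadoop 2 default block size of 128 MB and the default HDFS replication factor of three (the latter is not mentioned in the review); the function names are made up for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed default block size: 128 MB
REPLICATION = 3                 # assumed default HDFS replication factor

def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return (file_size_bytes + block_size - 1) // block_size

def raw_storage_bytes(file_size_bytes, replication=REPLICATION):
    """Bytes stored across the cluster once every block has been replicated."""
    return file_size_bytes * replication

one_gb = 1024 ** 3
print(block_count(one_gb))        # 8
print(raw_storage_bytes(one_gb))  # 3221225472
```

A 1 GB file therefore occupies eight blocks, each of which can be read by a different task in parallel, and about 3 GB of raw disk across the cluster.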

The chapter continues with a look at some of the common commands for interacting with the HDFS file system (e.g. copyFromLocal). These commands are similar to their UNIX counterparts.

The chapter has a useful overview of the steps involved in reading data from HDFS and writing data to HDFS. In both cases, helpful diagrams are provided.

This chapter provides a useful explanation of what HDFS is, how it aids resilience and parallel processing, its underlying architecture, and how reads and writes are processed.

Chapter 4 YARN

YARN is a resource manager, added in Hadoop 2 to improve the implementation of MapReduce. It is also generic enough to enable other distributed processing models to run (e.g. Spark).

The chapter continues with a useful overview of how YARN runs an application. Next, a comparison is made between the original version of MapReduce and YARN. The advantages of YARN include improved scalability (double the number of nodes and tasks), availability (High Availability), utilization (pool of resources) and multitenancy (allows other types of processing model).

The chapter ends with a discussion of the various types of schedulers in YARN: 

  • FIFO – processes applications sequentially, so it is not good for sharing a cluster

  • Capacity – allocates static queues/resources to different users or groups

  • Fair – allocates resources dynamically among running applications 
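
The difference between the three policies can be seen in a toy Python model. This is a heavy simplification and an assumption throughout: it is not YARN's implementation or API, jobs are just (name, demand) pairs, and the queue percentages are invented for the example.

```python
def fifo_allocation(jobs, containers):
    """FIFO: jobs are served in arrival order; later jobs get what is left."""
    allocation, remaining = {}, containers
    for name, demand in jobs:
        granted = min(demand, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation

def capacity_allocation(jobs, containers, queue_percent):
    """Capacity: each job's queue has a fixed, pre-configured share."""
    return {name: min(demand, containers * queue_percent[name] // 100)
            for name, demand in jobs}

def fair_allocation(jobs, containers):
    """Fair: split evenly among running jobs, re-sharing whatever is unused."""
    allocation = {name: 0 for name, _ in jobs}
    unmet, remaining = dict(jobs), containers
    while remaining > 0 and unmet:
        share = max(remaining // len(unmet), 1)
        for name in list(unmet):
            granted = min(unmet[name], share, remaining)
            allocation[name] += granted
            unmet[name] -= granted
            remaining -= granted
            if unmet[name] == 0:
                del unmet[name]
    return allocation

# Hypothetical workload: a large job arrives first, then a small one.
JOBS = [("large", 100), ("small", 10)]
print(fifo_allocation(JOBS, 100))                                  # {'large': 100, 'small': 0}
print(capacity_allocation(JOBS, 100, {"large": 80, "small": 20}))  # {'large': 80, 'small': 10}
print(fair_allocation(JOBS, 100))                                  # {'large': 90, 'small': 10}
```

Under FIFO the small job waits behind the large one. With fixed queues it starts at once, but ten containers sit idle. The fair split gives the small job what it needs and hands the rest to the large job.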

This chapter provides a helpful overview of what YARN is, why it was created, and its advantages over the original MapReduce architecture. The walkthrough of how YARN runs jobs and the explanation of the different types of schedulers are both useful.


Chapter 5 Hadoop I/O

This chapter expands on some of the more important features associated with Hadoop I/O. Firstly, data integrity is examined. HDFS holds lots of data, so there’s an increased chance of having bad data. Checksums are used to detect bad data, and HDFS can heal itself by copying data from a good replica.
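
The detect-and-repair idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not HDFS code: `zlib.crc32` (plain CRC-32) stands in for the CRC-32C variant HDFS uses, replicas are modelled as dictionaries in memory, and all function names are hypothetical.

```python
import zlib

def store(data):
    """Keep the data together with a checksum computed at write time."""
    return {"data": data, "checksum": zlib.crc32(data)}

def is_corrupt(replica):
    """Recompute the checksum at read time and compare it with the stored one."""
    return zlib.crc32(replica["data"]) != replica["checksum"]

def read_with_repair(replicas):
    """Return data from the first good replica and overwrite any bad ones."""
    good = next(r for r in replicas if not is_corrupt(r))
    for replica in replicas:
        if is_corrupt(replica):
            replica["data"] = good["data"]
            replica["checksum"] = good["checksum"]
    return good["data"]

replicas = [store(b"temperature=22") for _ in range(3)]
replicas[0]["data"] = b"temperature=99"  # simulate silent corruption on one replica
print(is_corrupt(replicas[0]))           # True
print(read_with_repair(replicas))        # b'temperature=22'
print(is_corrupt(replicas[0]))           # False
```

The sketch assumes at least one replica is intact; if every copy were corrupt, `next` would raise `StopIteration`.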

The chapter continues with a look at compression. Compression has the advantage of reducing the space requirement and improving data transfer speeds. Various compression libraries are examined, and there is often a trade-off between speed and space.
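
The speed-versus-space trade-off is easy to observe with the codecs in Python's standard library. The sketch below is an assumption-laden illustration rather than a Hadoop benchmark: it uses `zlib` (DEFLATE, the algorithm behind gzip) and `bz2`, the sample data is invented, and timings will vary by machine.

```python
import bz2
import time
import zlib

# Each entry maps a label to a (compress, decompress) pair.
CODECS = {
    "zlib level 1 (fast)": (lambda d: zlib.compress(d, 1), zlib.decompress),
    "zlib level 9 (small)": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}

def compare(data):
    """Return {codec label: (compressed size in bytes, seconds to compress)}."""
    results = {}
    for label, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        if decompress(packed) != data:
            raise ValueError(f"{label} did not round-trip")
        results[label] = (len(packed), elapsed)
    return results

# Hypothetical, highly repetitive sample standing in for weather records.
sample = b"1950,0022,good\n1950,-011,good\n" * 50_000
for label, (size, seconds) in compare(sample).items():
    print(f"{label}: {len(sample)} -> {size} bytes in {seconds:.4f}s")
```

No expected output is shown because the sizes and timings depend on the data and the machine; the point is to compare the numbers side by side.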

Next, serialization is examined. Serialization turns structured objects into a byte stream for transmission over a network or for storage. Deserialization does the reverse.
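
A round trip through a byte stream can be shown with Python's `struct` module. This is only a sketch of the general idea, not Hadoop's own serialization framework: the record layout (two 4-byte big-endian integers) and the function names are assumptions chosen for the example.

```python
import struct

# Hypothetical record layout: 4-byte year then 4-byte temperature, big-endian.
RECORD_FORMAT = ">ii"

def serialize(year, temperature):
    """Turn a structured record into a compact byte stream."""
    return struct.pack(RECORD_FORMAT, year, temperature)

def deserialize(payload):
    """Rebuild the (year, temperature) record from the byte stream."""
    return struct.unpack(RECORD_FORMAT, payload)

payload = serialize(1950, -11)
print(len(payload))          # 8
print(deserialize(payload))  # (1950, -11)
```

A fixed binary layout like this is compact and quick to parse, but both sides must agree on the format in advance.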

This chapter provides useful information about advanced aspects of Hadoop I/O.

Last Updated ( Tuesday, 21 July 2015 )