Hadoop: The Definitive Guide (4th ed)

Part II MapReduce

Chapter 6 Developing a MapReduce Application

This chapter covers the more practical aspects of developing a MapReduce application in Hadoop. Generally, you need to write just the map and reduce functions, and ideally have some associated unit tests. A driver program runs the job. Typically, your initial tests will use a small amount of data, with the aim of discovering any obvious bugs. Once these bugs are fixed, testing on the cluster can begin.
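The shape of the code the chapter has you write can be seen in a minimal, Hadoop-free Python sketch of the classic word-count example; the function names and the toy driver are illustrative only, not the Hadoop API:

```python
from collections import defaultdict

def map_fn(_, line):
    # Emit (word, 1) for every word in the input line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum the counts for a single word
    yield word, sum(counts)

def run_job(lines):
    # A toy "driver": run all maps, group values by key, run all reduces
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)
    return {k: v for key in sorted(groups) for k, v in reduce_fn(key, groups[key])}

print(run_job(["the quick brown fox", "the lazy dog"]))
```

In real Hadoop code the map and reduce functions would be methods on Mapper and Reducer subclasses, and the framework, not the driver, handles the grouping.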

The chapter discusses the configuration files that your code can use. Various examples of using the configuration API to access the config files are given. A MapReduce program can be run locally or on the cluster, and this can be controlled by a config switch. Writing unit tests with MRUnit is discussed.
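The behaviour of the configuration API — properties layered from resource files, with later resources overriding earlier ones and lookups falling back to defaults — can be sketched in a rough Python analogue (this is not the real Java `Configuration` class; the property names echo the book's toy example):

```python
class Configuration:
    """Toy analogue of Hadoop's configuration object."""
    def __init__(self):
        self._props = {}

    def add_resource(self, props):
        # Later resources override earlier ones, mirroring Hadoop's
        # default-file -> site-file layering
        self._props.update(props)

    def get(self, name, default=None):
        return self._props.get(name, default)

conf = Configuration()
conf.add_resource({"color": "yellow", "size": "10"})  # defaults
conf.add_resource({"size": "12"})                     # site override wins
print(conf.get("size"), conf.get("weight", "heavy"))
```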

After testing the code locally, and fixing any problems, it can be run on the cluster. Here, monitoring and debugging are more complex. The logs and job progress status are suggested as debugging aids.

The chapter ends with a useful checklist of tuning best practices. These include: changing the number of mappers/reducers, using intermediate compression, and using custom serialization.

This chapter provided a useful overview of how to develop MapReduce applications. There is also a helpful suggestion about using higher-level tools (e.g. Hive) for easier and quicker development.

Chapter 7 How MapReduce Works

The chapter opens with a look at how Hadoop runs a MapReduce job. The stages are: job submission, job initialization, task assignment, task execution, progress/status update, job completion. Each of these stages is examined in detail.

The chapter next looks at failures, their causes (e.g. bad code), and how to handle them so your job can finish successfully. Failures are examined from the viewpoint of: Task, Application Master, Node Manager, and Resource Manager – in each case the severity of the failure and the possibility of recovery are examined.
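Task failure is the most common and most recoverable case: a failed task attempt is simply rescheduled, up to a configurable maximum (four attempts by default, via properties such as `mapreduce.map.maxattempts`). A schematic Python sketch of that retry policy:

```python
def run_with_retries(task, max_attempts=4):
    # Re-run a failed task attempt up to max_attempts times;
    # only after exhausting them does the failure become fatal to the job.
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception as err:
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

# A flaky task that only succeeds on its third attempt
result = run_with_retries(lambda attempt: "done" if attempt >= 3 else 1 / 0)
print(result)  # done
```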

Next, the step between map and reduce, known as the shuffle, is examined. The shuffle sorts the map output and transfers it to the reducers. The chapter ends with a look at the factors affecting task execution, including: task environment properties, and speculative execution.
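Conceptually, the shuffle partitions each mapper's output by key, sorts within each partition, and presents each reducer with its keys in order, values grouped per key. A toy Python rendering of that flow (hash partitioning, in the spirit of Hadoop's default partitioner):

```python
from itertools import groupby
from operator import itemgetter

def shuffle(map_outputs, num_reducers=2):
    # Partition each map's (key, value) pairs by hash of the key,
    # then sort each reducer's input by key - the "sort" half of the shuffle.
    partitions = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[hash(key) % num_reducers].append((key, value))
    return [sorted(p, key=itemgetter(0)) for p in partitions]

def reduce_partition(partition):
    # Each reducer sees keys in sorted order, with values grouped per key
    return {k: [v for _, v in group]
            for k, group in groupby(partition, key=itemgetter(0))}

maps = [[("b", 1), ("a", 1)], [("a", 1), ("c", 1)]]
merged = {}
for part in shuffle(maps):
    merged.update(reduce_partition(part))
print(merged)
```

In Hadoop the transfer happens over the network and the sorted runs are merged on disk, but the key-grouping contract is the same.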

This chapter provided a detailed look at how a MapReduce job runs. The steps have plenty of detail, and are tied together to give a coherent flow.

Chapter 8 MapReduce Types and Formats

The chapter takes a detailed look at the data types and formats involved in the MapReduce model. Useful tables are provided showing the configuration of MapReduce types in both the old and new MapReduce APIs – this should prove useful in any migration work.

The chapter continues with a detailed look at input formats, and how they are processed. Types examined include: text, binary, and database input. Output formats are then examined in a similar manner.
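The default text input format hands each line to the mapper as a (byte offset, line contents) record; a simplified Python version of that record reader shows the idea (the verse is the book's own example input):

```python
def text_records(data):
    # Mimic the text input format: key = byte offset of the line start,
    # value = the line contents without the trailing newline.
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

records = list(text_records(
    "On the top of the Crumpetty Tree\nThe Quangle Wangle sat,\n"))
print(records)
```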

Chapter 9 MapReduce Features

This chapter looks at the more advanced features of MapReduce, including:


  • Counters – user-defined and built-in counters are examined at various levels (task and job)

  • Sorting – sorting is central to MapReduce; various ways of sorting are examined

  • Joins – performing joins in MapReduce is often complex; using higher-level programs such as Hive or Pig makes joins much easier

  • Side data distribution – extra read-only data needed by a job to process the main dataset

  • MapReduce library classes – contains helpful commonly used functions 
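As the bullet on joins suggests, a reduce-side join can be expressed by tagging each record with its source and letting the shuffle bring matching keys together. A schematic Python illustration (the source tags and sample records are made up for the example, loosely echoing the book's weather-station data):

```python
from collections import defaultdict

def reduce_side_join(left, right):
    # Tag each record with its source, group by join key, then pair
    # left and right records sharing a key - the essence of a
    # MapReduce reduce-side join, where the shuffle does the grouping.
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, value in left:
        groups[key]["L"].append(value)
    for key, value in right:
        groups[key]["R"].append(value)
    return [(key, l, r)
            for key, tagged in sorted(groups.items())
            for l in tagged["L"] for r in tagged["R"]]

stations = [("011990", "Jan Mayen")]
readings = [("011990", -11), ("011990", 111), ("999999", 78)]
print(reduce_side_join(stations, readings))
```

Note how the unmatched reading for station 999999 drops out, as in an inner join.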

This chapter contains detailed advanced information concerning a miscellany of MapReduce features.  



Part III Hadoop Operations

Chapter 10 Setting Up a Hadoop Cluster

This chapter contains details on how to set up Hadoop to run on a cluster. While running on a single machine is great for learning purposes, a cluster is needed to process large datasets.

The chapter opens with an overview of the various methods of installation, including: 

  • Binary and source tarballs – the most flexible option, but also the most work

  • Packages – e.g. from the Apache Bigtop project; ensures components are consistent

  • Cluster management tools – e.g. Cloudera Manager; simple UI, good defaults, wizards, etc.

The chapter next examines how to install and configure a basic Hadoop cluster from scratch using the Apache Hadoop distribution – but makes the very valid point that using cluster management tools, especially for production clusters, is much easier.

The next section considers how to configure Hadoop using its various configuration files, and outlines the content of each of these. Each node has configuration information that needs to be kept synchronized; tools such as Cloudera Manager excel at this synchronization. Next, environment settings (e.g. JVM heap size) and the various Hadoop daemons are discussed. Security is briefly discussed, with Kerberos performing authentication and Hadoop itself handling permissions.
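The configuration files discussed here (core-site.xml, hdfs-site.xml, and so on) all share the same simple property/name/value XML layout; a short Python snippet parsing that format makes the layout concrete (the property values shown are placeholders, not recommended settings):

```python
import xml.etree.ElementTree as ET

CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
</configuration>
"""

def parse_hadoop_conf(xml_text):
    # Return {name: value} for every <property> element
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

props = parse_hadoop_conf(CORE_SITE)
print(props["fs.defaultFS"])
```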

The chapter ends with a look at benchmarking the cluster, to ensure it is set up correctly. This can be achieved by running some jobs and checking that the output is valid and that they complete in a timely manner. Hadoop comes with some benchmark jobs you can use.

This chapter provides a useful overview of how to set up Hadoop on a cluster. Various methods were given, including setting up from scratch. Using a cluster management tool is preferable.


Chapter 11 Administering Hadoop

This chapter is concerned with keeping the cluster running smoothly. The chapter opens with a look at how the various components of HDFS (e.g. namenode) organize their data. Directories, files, and logs are discussed. This information can be useful when diagnosing problems.

The chapter continues with a look at various admin tools, namely: 

  • dfsadmin – reports information about the state of HDFS and performs administrative tasks

  • fsck – checks the health of files

  • datanode block scanner – runs periodically on each datanode to verify its blocks

  • balancer – evens out the distribution of blocks across datanodes

Next, monitoring is discussed. All the daemons produce log files, and these can be configured to produce more verbose content. Monitoring the master daemons – the namenode and the resource manager – is noted as being particularly important.

The chapter ends with a look at maintenance. The section discusses the automation and scheduling of: metadata backups (for the namenode), data backups, fsck, and the balancer. The commissioning and decommissioning of nodes are given in step-by-step walkthroughs, as is the upgrade process.

This chapter discusses the regular tasks that need to be performed to keep the Hadoop cluster in good shape. Various associated tools are discussed.


Last Updated ( Tuesday, 21 July 2015 )