Authors: Arun C Murthy, Vinod Kumar Vavilapalli
Publisher: Addison Wesley, 2014
Aimed at: Programmers who want to learn about the most recent Hadoop
Reviewed by: Kay Ewbank
Subtitled "Moving Beyond MapReduce and Batch Processing with Apache Hadoop".
Most programmers will have heard of Apache Hadoop, the open-source framework designed for managing big data in a distributed way. Hadoop HDFS is the data storage layer, and it’s normally used with MapReduce, the data processing layer. YARN (Yet Another Resource Negotiator) is an attempt to provide an alternative layer to MapReduce, to enable more generic processing of HDFS data. Most Hadoop installations have a shared MapReduce cluster with shared HDFS instances. The main components of this shared architecture are a JobTracker daemon that runs all the jobs in the cluster, and TaskTrackers that act as the slaves, executing one task at a time under direction from the JobTracker. YARN splits the JobTracker role into two separate daemons, one for resource management and the other for job scheduling/monitoring.
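To make that split concrete, here is a toy sketch in Python. It is not Hadoop code and uses no YARN API: the class names simply borrow YARN's terms for the two roles, and the container counts and task names are invented for illustration. One object only hands out cluster capacity, while a separate per-job object decides what runs and tracks its progress.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Cluster-wide role: only tracks and hands out capacity (toy model)."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are free, up to the request.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        # Containers go back to the shared pool when tasks finish.
        self.free_containers += count


@dataclass
class ApplicationMaster:
    """Per-job role: schedules and monitors the tasks of one job (toy model)."""
    job_name: str
    pending_tasks: list
    finished_tasks: list = field(default_factory=list)

    def run(self, rm: ResourceManager) -> None:
        while self.pending_tasks:
            granted = rm.allocate(len(self.pending_tasks))
            if granted == 0:
                raise RuntimeError("no containers available")
            batch = self.pending_tasks[:granted]
            self.pending_tasks = self.pending_tasks[granted:]
            self.finished_tasks.extend(batch)  # pretend the tasks ran
            rm.release(granted)


# Hypothetical cluster with two containers and a three-task job.
rm = ResourceManager(free_containers=2)
am = ApplicationMaster("wordcount", ["map-0", "map-1", "map-2"])
am.run(rm)
```

In this sketch the job needs two rounds because only two containers are free at a time; the point is simply that the capacity bookkeeping and the per-job scheduling live in different places, which is the separation the JobTracker lacked.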
The authors of the book are the YARN project founder (Arun Murthy), and the project lead (Vinod Kumar Vavilapalli), so they certainly know what they’re talking about.
The book opens with brief descriptions of how Hadoop developed to the point where YARN emerged, which made the reasons for developing YARN a lot clearer. Once the authors have (hopefully) convinced you that YARN is a good idea, there’s a chapter on installing and getting started with YARN, showing how to configure a single-node YARN cluster. Next comes a description of YARN, its core components, and how it fits into Hadoop. With that overview in place, the authors then work through a functional description of each of the YARN components.
Having introduced the components, the book then has a chapter on installing YARN. This seems a little strange when we’ve already had a chapter on a quick simple install of YARN, but the idea seems to be to get you started, then show you a cluster-wide installation. The chapter works through a script installation and a GUI-based install using Apache Ambari. Next, the authors show how to administer YARN using scripts and open-source tools such as Nagios, Ganglia and Ambari.
Chapter 7 is a more detailed administration guide, working through the tasks of administering the ResourceManager, NodeManager, and ApplicationMaster. YARN’s Capacity Scheduler is the next topic to be covered, looking at how it can be used to manage applications in a shared cluster to maximize application throughput and make best use of the cluster.
One of the most common ways to begin using YARN, if you’re an existing Hadoop user, will be with existing MapReduce apps that might run better under YARN, and the next chapter shows how to go about this.
By Chapter 10 we finally get to the point of a YARN application example, specifically how to create a cluster of JBoss Application Servers. JBoss is an open-source Java EE server, and the example also shows how to write a YARN client. The next chapter looks at YARN’s Distributed-Shell application, which is a non-MapReduce app built on top of YARN. It gives a simple way to run shell commands and scripts on multiple nodes in a Hadoop cluster.
The final chapter in the book looks at YARN frameworks: Tez, Giraph, Hoya (HBase on YARN), Dryad, Spark, Storm, REEF and Hamster (Hadoop and MPI on the Same Cluster) each get a couple of paragraphs explaining what they do and how they fit with YARN. The book closes with a set of appendixes with scripts and reference libraries. One complaint some people have made about the book is the lack of downloadable scripts. The download page for the code still says the authors are in the process of setting up a GitHub page, but you can download the scripts for installation, administration, and the YARN example.
I found this a very understandable book, on the whole. The authors are obviously extremely knowledgeable about the topic, and by the end of the book I felt I knew a lot more about the basics of YARN. I think this is still an introduction to YARN, a way to find your feet rather than a route to becoming an expert. It’s also not a ‘press this, type that’ getting-started introduction – the scripts are examples, not instructions. So long as you bear these points in mind, it still represents an excellent introduction to YARN.