Author: Michael Frampton
Audience: Devs and admins new to Big Data
Reviewer: Ian Stirk
This book sets out to be "A Working Guide to the Complete Hadoop Toolset" and is both wide-ranging in content and practical in its approach.
Michael Frampton assumes some knowledge of Linux and SQL (but only a little), and no knowledge of Hadoop or its tools.
There is a step by step approach to tool download, installation, execution, and error checking. The following areas of functionality and associated tools are covered:
Hadoop installation (version 1 and version 2)
Web-based data collection (Nutch, Solr, Gora, HBase)
Map Reduce programming (Java, Pig, Perl, Hive)
Scheduling (Fair and Capacity schedulers, Oozie)
Moving data (Hadoop commands, Sqoop, Flume, Storm)
Monitoring (Hue, Nagios, Ganglia)
Hadoop cluster management (Ambari, CDH)
Analysis with SQL (Impala, Hive, Spark)
ETL (Pentaho, Talend)
Reporting (Splunk, Talend)
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1: The Problem with Data
This chapter opens with a review of what big data is – data that can’t be stored and processed in the required timeframe using traditional methods (e.g. relational databases). It is typically defined in terms of the 3Vs (i.e. volume, velocity and variety).
Various attributes of big data are discussed, including: massive scalability, commodity and cost-effective hardware, ability to recover from hardware failure, and parallel processing.
Increasing amounts of data are being created and stored. Big data processing typically starts with high terabyte amounts of data; however, big data can also be applicable to smaller systems that grow. Examples of high volume data generators are given, including the Large Hadron Collider, which produces 15PB of detector data per year. The chapter ends with an overview of the rest of the book.
This is a useful opening chapter, defining what big data is, its attributes, and the problems it solves. It helpfully explains the book’s practical approach i.e. introduce a tool, show how to obtain it, how to install it, and show example usage. There’s a very useful overview of each subsequent chapter.
Chapter 2: Storing and Configuring Data with Hadoop, YARN, and ZooKeeper
This chapter opens with an overview of Hadoop, its versions, where to download it and how to manually install it. For illustrative purposes the author uses a small four-server cluster, with a symbolic link to swap between the Hadoop version 1 and version 2 environments.
It’s noted that Hadoop version 1 (V1) offers less scalability than version 2 (V2). V1 is scalable to around 4,000 nodes, running around 40,000 concurrent tasks. Differences between V1 and V2 are discussed, with V2 offering better resource management via YARN (Yet Another Resource Negotiator) and support for processing models other than map reduce. ZooKeeper, the distributed configuration service, is discussed in terms of functionality, installation and execution. The Hadoop versions are compared using a simple word-count example program.
The chapter looks at an alternative to manually installing Hadoop and its components: using a Hadoop stack. A stack is a collection of tools and their versions that are known to work together successfully. Luckily, vendors such as HortonWorks and Cloudera pre-package these tested tools (the vendors’ tools and their versions are typically similar but not identical). These stacks make the installation process much easier and quicker.
This chapter provides a good understanding of Hadoop’s characteristics, including: large datasets, open source, large cluster, commodity hardware, resilient via data replication, automatic failover, moving processing to the data, and supporting large volumes. Additionally, it provides step by step guidance for installing, configuring, and running the different versions of Hadoop. Overall, this chapter provides a solid foundation for subsequent chapters.
Chapter 3: Collecting Data with Nutch and Solr
This chapter looks at tools that relate to obtaining data from the web. Nutch is a web crawler, used to obtain data from websites. Solr can be used to index this data and subsequently search it. The data can be stored in various storage systems (e.g. Hadoop Distributed File System (HDFS)).
The chapter shows how to download, install and configure Nutch and Solr. The example shows Nutch crawling a website, using Solr for indexing, and storing the data in HDFS. The example uses a file on the Linux file system containing details of the initial website to crawl. The chapter continues with a look at how to download, install and configure HBase (Hadoop’s NoSQL database), which is used in a second example to store the collected data. As always, a useful list of potential errors and solutions is given. The Nutch example given uses Hadoop v1.2.1 since the author couldn’t get Nutch to work with Hadoop version 2 (there is a version currently under development for use with YARN).
This chapter provides useful examples of collecting data using Nutch, Solr and HBase. Plenty of step by step instruction is provided, together with potential errors and their solutions.
Chapter 4: Processing Data with Map Reduce
This chapter shows how various tools can be used to perform map reduce. Map reduce involves splitting and processing the work of a problem over numerous nodes (map) and merging/aggregating the results (reduce). These numerous nodes are part of the Hadoop system.
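The map and reduce phases described above can be sketched in a few lines of plain Python. This is only an illustration of the idea using word counting, not the book's code and not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and each input line stands in for a split that a separate node would process.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    # Each line stands in for an input split handled by a separate node.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the framework performs the grouping step between the two phases.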
The chapter uses a simple word-count example. The tools examined to solve this problem include: Java, Pig, Hive (Hadoop’s data warehouse), and Perl. In each case, step by step instructions are given on how to download, install, configure and run the tool. It should be noted that while Java offers the most flexible and low-level solution, it requires the most code and has the longest development time. In contrast, higher level tools like Hive require relatively little code and development time, but are less flexible. In the real world, the problem itself will determine the appropriate tool to use. Chapter 10 has other tools (e.g. Talend, Pentaho) that offer a drag and drop approach to creating ETL, including map reduce steps.
This chapter provides an interesting overview of various tools that can perform map reduce work. Since the same problem (a word-count example) is used throughout, the tools can be compared directly. The higher level tools (e.g. Hive) can create solutions more quickly; however, they offer less flexibility than Java.
Chapter 5: Scheduling and Workflow
This chapter discusses the two types of scheduler together with their properties. The capacity scheduler handles large clusters that are shared amongst multiple organizations, and supports many job types and job priorities. The fair scheduler aims to share resources fairly amongst all the jobs on a cluster that’s owned by a single organization. The chapter shows how to configure each to work with the different versions of Hadoop.
The chapter describes the download, installation, and configuration of Oozie, the tool used for defining and executing workflows, including job dependencies. A workflow is a set of chained actions that call HDFS based scripts (e.g. Hive). The path through a workflow is directed by control nodes and by whether each action passes or fails. A fork allows actions to run in parallel. It is possible to send notifications by email. Useful example workflows are provided using both Pig and Hive.
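The idea of chained actions with pass/fail transitions can be illustrated with a toy runner in Python. This is a sketch under stated assumptions, not Oozie's workflow definition format or API: the run_workflow function, the "ok"/"error" outcome labels, the "end"/"kill" node names, and the action names in the example are all hypothetical, and forks and notifications are left out.

```python
def run_workflow(actions, transitions, start):
    """Run named actions, following pass/fail transitions to a terminal node."""
    node, visited = start, []
    while node not in ("end", "kill"):
        visited.append(node)
        try:
            actions[node]()      # an action stands in for a Pig or Hive script
            outcome = "ok"
        except Exception:
            outcome = "error"
        node = transitions[node][outcome]
    return node, visited


if __name__ == "__main__":
    # Hypothetical two-step workflow: a Pig step followed by a Hive step.
    actions = {"pig_step": lambda: None, "hive_step": lambda: None}
    transitions = {
        "pig_step": {"ok": "hive_step", "error": "kill"},
        "hive_step": {"ok": "end", "error": "kill"},
    }
    print(run_workflow(actions, transitions, "pig_step"))
```

A failing action sends the run to the "kill" node instead of on to the next step, which is the dependency behaviour the chapter describes.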
Chapter 6: Moving Data
This chapter discusses some common ways to move data in and out of Hadoop/HDFS. The chapter opens with a look at using the file system commands to move data. These are similar to the Linux file commands and include: cat, copyFromLocal, copyToLocal, cp, get, put, mv, and tail.
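As a rough illustration, the commands listed above are typically invoked as options to hadoop fs. The Python sketch below only builds the argument lists so that it runs without a cluster; the hdfs_command helper and the file paths are hypothetical and are not taken from the book.

```python
import shlex

# File system shell commands named in the review.
KNOWN_COMMANDS = {"cat", "copyFromLocal", "copyToLocal", "cp", "get", "put", "mv", "tail"}


def hdfs_command(command, *paths):
    """Build the argument list for one 'hadoop fs' call without running it."""
    if command not in KNOWN_COMMANDS:
        raise ValueError(f"unsupported command: {command}")
    return ["hadoop", "fs", f"-{command}", *paths]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for argv in (
        hdfs_command("put", "local.log", "/data/raw/local.log"),
        hdfs_command("cat", "/data/raw/local.log"),
    ):
        print(shlex.join(argv))
```

On a machine with Hadoop installed, each list could be passed to subprocess.run(argv, check=True) to execute the command.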
The chapter continues with a look at Sqoop, a well-known tool for moving data between HDFS and relational databases (e.g. Oracle). The data can then be moved into Hive (Hadoop’s data warehouse); alternatively, Sqoop can import database data directly into Hive. Sqoop supports incremental data loads and many data formats (e.g. Avro, CSV). Sqoop integrates with many other Hadoop tools. The section shows how to download and install Sqoop, and how to import data into HDFS and directly into Hive.
Next, Flume is examined. Flume is used for moving large volumes of log data. Its model is defined in terms of agents that respond to events, having event sources, channels, and sinks. The section shows how to download, install, set up and run a simple agent that moves messages from the Linux file system to an HDFS directory.
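The source, channel, and sink model can be pictured with a toy agent in Python. This is a conceptual sketch only, not Flume's configuration format or API: the Agent class and its method names are hypothetical, and a Python list stands in for the HDFS directory.

```python
from collections import deque


class Agent:
    """Toy agent: a source puts events on a channel, a sink drains them."""

    def __init__(self, sink):
        self.channel = deque()  # buffers events between source and sink
        self.sink = sink        # callable that delivers one event

    def source(self, event):
        """Accept an incoming event and queue it on the channel."""
        self.channel.append(event)

    def drain(self):
        """Deliver every queued event to the sink, in arrival order."""
        delivered = 0
        while self.channel:
            self.sink(self.channel.popleft())
            delivered += 1
        return delivered


if __name__ == "__main__":
    stored = []  # stands in for an HDFS directory
    agent = Agent(sink=stored.append)
    for line in ["msg 1", "msg 2"]:
        agent.source(line)
    agent.drain()
    print(stored)
```

The channel decouples the two ends, so the source can keep accepting events while the sink catches up.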
The chapter ends with a look at Storm, which processes streams in real time, so is especially useful for processing Twitter feeds, getting trending data, and discovering what people are talking about right now. The structure of streams is described (tuples, spouts, bolts), before describing its configuration via ZooKeeper.
This chapter provides a very useful overview of some of the common tools used to move data into and out of Hadoop/HDFS. Step by step instructions are provided for installation, configuration and running of each tool. Potential errors together with their meaning and solutions are also given.