Authors: Kevin Sitto & Marshall Presser
Audience: Managers, architects and developers new to Hadoop
Reviewer: Ian Stirk
This slim book sets out to provide an up-to-date overview of Hadoop and its various components, which seems a worthwhile aim.
Hadoop is the most common platform for storing and analysing big data. This book aims to be a short introduction to Hadoop and its various components. The authors compare this to a field guide for birds or trees, so it is broad in scope and shallow in depth. Each chapter briefly covers an area of Hadoop technology, and outlines the major players. The book is not a tutorial, but a high-level overview, consisting of 132 pages in 8 chapters.
For each component, details are listed for:
License – much is open source but there may be some conditions
Activity – how much development work is being done on the product
Purpose – what the technology does
Official Page – home page of the technology
Hadoop Integration – the technology’s level of integration with Hadoop
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Core Technologies
The chapter opens with a bit of history. The origins of Hadoop can be traced back to a project called Nutch, which stored large amounts of data, together with 2 seminal papers from Google – one relating to the Google File System, and the other about a distributed programming model called MapReduce. The ideas in the papers were incorporated into the Nutch project, and Hadoop was born. Yahoo! began using Hadoop for its search engine, and now Hadoop is the premier platform for processing big data.
Hadoop consists of 3 primary resources:
Hadoop Distributed File System (HDFS) – where you store data. It is optimized for high performance on read-intensive workloads, and provides resilience by holding multiple copies of the data on different machines. A large block size optimizes data movement.
MapReduce – involves 2 components: mappers that analyze chunks of data, and reducers which aggregate the results of the mappers.
Hadoop’s tools – other components, as described in this book
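The mapper/reducer split described above can be illustrated with a minimal, single-process sketch in Python. This is not Hadoop's API and does not appear in the book; the function names (mapper, shuffle, reducer, word_count) and the sample data are assumptions chosen for illustration. In a real cluster the chunks would be HDFS blocks processed on different machines, and the grouping step would be carried out by the framework.

```python
from collections import defaultdict
from itertools import chain


def mapper(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: aggregate all the values emitted for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run the map, shuffle and reduce steps over a list of text chunks."""
    mapped = chain.from_iterable(mapper(chunk) for chunk in chunks)
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


# Hypothetical input: two chunks standing in for two blocks of a large file.
chunks = ["big data big cluster", "data node data"]
counts = word_count(chunks)
print(counts)
```

Each mapper sees only its own chunk, which is what allows the map step to run in parallel; the reducers then need only the values grouped under each key.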
This was an interesting chapter, laying the groundwork for the rest of the book, identifying what Hadoop is, its major components, and how they work. Helpful links to tutorial information are provided, together with outline code examples (as they are throughout the book).
Perhaps some emphasis could have been given to describing the attributes of big data (i.e. volume, velocity and variety) that require a system like Hadoop to process it. I’m not sure why Spark was included in this core section.
Chapter 2 Database and Data Management
With so much data being stored, there is a need for some kind of database. The chapter opens with a look at the various types of NoSQL databases that exist (e.g. column store, document store, key-value). The chapter continues with a brief overview of the major databases, including:
Cassandra – distributed key-value data store. Quick and easy to use, scales easily.
HBase – ideal for sparse data, key-value data store. No joins, no indexes. Fast.
MongoDB – JSON document-oriented database. Supports secondary indexes.
Hive – access data using SQL-like language.
It should be noted that although the book says Hive does not support delete and update statements, they are supported in later versions (from version 0.14.0 onwards, released November 2014).
This chapter provides a useful, up-to-date view of the various types of data stores that can be used with Hadoop. Occasionally, helpful comparisons between the databases are made. The chapter notes that although MongoDB and Cassandra are currently the most popular databases, HBase is increasingly popular and may soon be the leader.
Chapter 3 Serialization
This chapter looks at the format of stored data. Various tools are described, and the trade-off between tool flexibility and complexity discussed. Tools discussed include:
Avro – used for data serialization. Integrates efficiently with Hadoop. Very efficient for large volumes of data. Creates a schema that describes the data dynamically at runtime.
Parquet – columnar data storage, very good for structured data with repetitions. Relatively complex, and doesn’t perform well if you only want a few records.
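The trade-off noted for columnar storage can be sketched in a few lines of Python. This is a conceptual illustration only, not Parquet's file format or API: the helper names (to_columns, run_length_encode, read_row) and the sample rows are assumptions. It shows why repeated values in a column compress well once stored together, and why fetching a single record means touching every column.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


def read_row(columns, index):
    """Rebuild one record; note that every column has to be visited."""
    return {name: values[index] for name, values in columns.items()}


# Hypothetical structured data with many repeated values.
rows = [
    {"country": "UK", "year": 2014, "sales": 10},
    {"country": "UK", "year": 2014, "sales": 12},
    {"country": "UK", "year": 2015, "sales": 9},
    {"country": "US", "year": 2015, "sales": 20},
]

columns = to_columns(rows)
encoded = {name: run_length_encode(values) for name, values in columns.items()}
print(encoded["country"])
print(read_row(columns, 2))
```

The country and year columns shrink because their values repeat, while the sales column gains nothing; a query that reads only one or two columns can skip the rest entirely.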
This chapter provides a useful, up-to-date view of the various types of tool that can be used to serialize/deserialize data. Helpful example code is provided.
Chapter 4 Management and Monitoring
With a diverse collection of tools, and a large collection of machines, it’s important to be able to monitor and manage the system. Various tools are described here: some are concerned with node configuration management, and others provide a system health overview. Tools examined include:
Ambari – web-based tool with many functions, including installing the various tools in the Hadoop stack. Integrates with Nagios and Ganglia for monitoring/alerts. Has a system health dashboard.
Ganglia – monitoring tool. Designed to work with many clusters and grids. Quickly visualizes how the system is being used, and general system welfare.
Nagios – monitoring tool. Alerts when things go wrong. Graphical tool showing what’s happening with the environment.
Oozie – workflow scheduler. Links jobs together, can stop, start, and restart jobs (cf. SSIS).
I enjoyed this chapter - anything that makes monitoring and management of a large number of tools across a large number of machines easier should prove very helpful. Tools like Ambari in particular are known to save hours of work and frustration when installing Hadoop and its tools.
Chapter 5 Analytic Helpers
This chapter is concerned with both cleansing/transforming data, and using machine-learning algorithms to categorize and discover things about data. The chapter first looks at MapReduce interfaces, which make MapReduce programming easier. It then looks at analytic libraries, which make data easier to analyze. The tools examined include:
Pig – compared with Java, Pig is a higher-level language. Typically slower, but gives easier/faster development. Often used for ETL. Translated/compiled to MapReduce.
Mahout – machine learning and data analytics. Growing number of libraries. Moving away from MapReduce to domain specific language (DSL) based on Scala.
MLlib – machine learning tool for Spark, best suited to a Spark shop, similar libraries to Mahout.
This chapter provides a very helpful overview of the current tools that make MapReduce programming easier, and tools that make data easier to analyze. Perhaps this chapter should have been split into two chapters, relating to the two discrete areas covered here.