Author: Shiva Achari
Publisher: Packt Publishing
Audience: Developers new to Hadoop
Reviewer: Ian Stirk
This book aims to give you an understanding of Hadoop and some of its major components, explaining how and when to use them, and providing scenarios where they should be used.
It is aimed at application and system developers who want to solve practical problems using the Hadoop framework. It is also intended for Hadoop professionals who want to find solutions to the different challenges they come across in their Hadoop projects.
A good understanding of Java programming is a prerequisite. Additionally, a basic understanding of distributed computing would be helpful.
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Introduction to Big Data and Hadoop
The chapter opens with the emergence of big data systems as a response to the limitations of relational databases (RDBMS), which were unable to process big data in a timely and cost-effective manner.
The next section explains the need for big data systems with reference to the 3 Vs of big data:
- Volume (1.8 zettabytes of data created in 2011, 35 zettabytes expected by 2020)
- Velocity (data arriving quickly)
- Variety (structured and semi-structured data e.g. emails)
The chapter continues with a look at the sources of big data, including: monitoring sensors, social media posts, videos/photos, logs etc. Some big data use case patterns are briefly described.
Next, Hadoop, the most popular big data platform, is examined. Hadoop is open source, and offers large-scale, massively parallel distributed processing. Hadoop has 2 major components: HDFS (Hadoop Distributed File System), which is Hadoop’s storage system, and MapReduce, which is Hadoop’s batch processing model. The section continues with a look at Hadoop’s history, advantages, uses, and related components.
The remainder of the chapter provides an overview of the other chapters of the book, namely:
- Pillars of Hadoop (HDFS, MapReduce, YARN)
- Data access components (Hive, Pig)
- Data storage component (HBase)
- Data ingestion in Hadoop (Sqoop, Flume)
- Streaming and real-time analysis (Storm, Spark)
This chapter explains clearly how big data processing arose, and how Hadoop fulfils this need. It includes overviews of the four main types of NoSQL database and of Hadoop itself: its history, advantages, uses, and associated components. There are plenty of helpful diagrams to aid understanding (as there are in the rest of the book), and a useful introduction to what’s coming in the rest of the book.
Sometimes, the English grammar is substandard; this occurs in various sections of the book. Some subsections seem disjointed (okay within themselves, but not part of a wider coherent section); again, this occurs in other parts of the book. There’s a small error relating to the total amount of data created in 2009: the value given is 800GB, whereas the correct value is 800 exabytes, or 0.8 zettabytes. All these problems should have been caught by the reviewers/editors.
Chapter 2 Hadoop Ecosystem
The chapter opens with a look at traditional database systems, which are good for Online Transaction Processing (OLTP) and Business Intelligence (BI). However, they are unable to scale to very large data volumes.
Hadoop is able to process large data volumes in a timely and cost-effective manner. Example Hadoop use cases are listed, including: fraud detection, credit and market risk, predictive aircraft maintenance, text mining, social media, and sentiment analysis.
There are useful diagrams showing various Hadoop components, the basic data flow between components, and how they’re linked together.
The chapter continues with a very brief look at various components that extend Hadoop, namely:
- distributed programming (MapReduce, Hive, Pig, Spark)
- NoSQL databases (HBase)
- data ingestion (Flume, Sqoop, Storm)
- service programming (YARN, ZooKeeper)
- scheduling (Oozie)
- machine learning (Mahout)
- system management (Ambari)
This chapter introduces some of the problems Hadoop can help solve, and continues with a helpful, if brief, look at some of the more common Hadoop-related components together with their areas of functionality (more detail is provided later). Helpful diagrams are provided throughout.
Chapter 3 Pillars of Hadoop – HDFS, MapReduce, and YARN
At its core, Hadoop consists of the storage system (HDFS) and the batch processing model (MapReduce). In Hadoop 2, YARN (Yet Another Resource Negotiator) was added; this allows greater scalability, improved performance, and hosting of non-batch processing frameworks (e.g. Spark).
The chapter opens with a look at HDFS. Like many Hadoop components, HDFS follows a master/slave pattern. Various features of HDFS are examined, including: scalability, reliability/fault tolerance, recovery from hardware failure, portability, and moving computation closer to the data. The architecture is then examined, in relation to the four types of node:
- NameNode (master; coordinates storage and holds metadata on the location of each block on the DataNodes)
- DataNode (holds data; sends a regular heartbeat to the NameNode)
- Checkpoint NameNode or Secondary NameNode (in case primary NameNode fails)
- BackupNode (similar to the checkpoint node, but keeps an up-to-date copy of the FsImage in RAM)
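The heartbeat relationship between DataNodes and the NameNode can be illustrated with a small plain-Java sketch. This is not code from the book and does not use any Hadoop API; the class name `HeartbeatSketch`, its methods, and the 10-second timeout are all hypothetical, chosen only to show the idea of a master flagging workers that have gone quiet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: a master that records when each worker last reported in.
public class HeartbeatSketch {
    // Last heartbeat time per DataNode id; TreeMap keeps ids in sorted order.
    private final Map<String, Long> lastSeenMillis = new TreeMap<>();
    private final long timeoutMillis;

    public HeartbeatSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a DataNode reports in.
    public void heartbeat(String dataNodeId, long nowMillis) {
        lastSeenMillis.put(dataNodeId, nowMillis);
    }

    // Nodes whose last heartbeat is older than the timeout are treated as failed.
    public List<String> suspectedDead(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeenMillis.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatSketch master = new HeartbeatSketch(10_000L); // assumed timeout
        master.heartbeat("dn1", 0L);
        master.heartbeat("dn2", 0L);
        master.heartbeat("dn1", 9_000L); // dn1 reports again, dn2 stays silent
        System.out.println(master.suspectedDead(12_000L)); // prints [dn2]
    }
}
```

In a real cluster, a node flagged this way would have its blocks re-replicated elsewhere; the sketch only covers the detection step.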
Next, data storage in HDFS is examined. Each file is split into blocks, and these blocks are distributed across the DataNodes. Additionally, copies of each block are stored on other nodes. This aids parallel processing and fault tolerance. The section ends with a look at some of the typical Linux-like HDFS commands used, including: creating a directory, listing a directory, viewing file contents, and copying a file.
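To make the block-and-replica idea concrete, here is a minimal plain-Java sketch. It is not taken from the book and uses no Hadoop API. The 128 MB block size and replication factor of 3 are the usual Hadoop 2 defaults rather than figures from this review, and the round-robin placement is a deliberate simplification: real HDFS uses a rack-aware placement policy. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: split a file into blocks and spread replicas over nodes.
public class BlockPlacementSketch {

    // Number of blocks needed for a file (ceiling division; the last block may be partly full).
    public static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // For each block, the indexes of the nodes holding a replica.
    // Simplified round-robin placement, NOT the real rack-aware HDFS policy.
    public static List<List<Integer>> place(long fileBytes, long blockBytes,
                                            int replication, int nodes) {
        if (replication > nodes) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        long blocks = blockCount(fileBytes, blockBytes);
        for (long b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((int) ((b + r) % nodes)); // consecutive nodes, wrapping around
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with 128 MB blocks needs 3 blocks; 3 replicas each over 4 nodes.
        System.out.println(place(300 * mb, 128 * mb, 3, 4));
        // prints [[0, 1, 2], [1, 2, 3], [2, 3, 0]]
    }
}
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block, and several nodes can work on different blocks of the same file at once.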
The chapter then looks at MapReduce. MapReduce is a parallel batch processing framework, allowing data to be processed in a scalable, fault-tolerant, distributed environment. There’s a very helpful MapReduce example provided, together with a diagram that is explained step by step. The MapReduce process is then examined in detail, discussing the mapper, shuffle and sort, and reducers. A helpful simple word-count MapReduce program, written in Java, is provided.
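The three stages discussed (mapper, shuffle and sort, reducer) can be sketched in plain Java as follows. This is not the book's word-count program and it does not use the Hadoop MapReduce API; it runs in a single JVM with ordinary collections, and the class and method names are hypothetical. It is meant only to show what each stage contributes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: word count as map -> shuffle/sort -> reduce, in one process.
public class WordCountSketch {

    // Map stage: emit a (word, 1) pair for every word in one input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and sort stage: group all values by key; TreeMap keeps keys sorted.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the values collected for one key.
    public static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Drive the three stages over a list of input lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("big data big cluster", "data node")));
        // prints {big=2, cluster=1, data=2, node=1}
    }
}
```

In Hadoop itself, the map and reduce calls would run in parallel on different nodes and the framework would perform the shuffle across the network; the per-key logic stays the same.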
The last section looks at YARN, which can run different types of distributed applications, including: batch (MapReduce), interactive and real-time. YARN delegates responsibility, allowing better performance and fault tolerance. A list of applications that use YARN is given.
This chapter provides a useful overview of the features and functionality of the core Hadoop components (HDFS, MapReduce, and YARN). The MapReduce word-count example, the various diagrams, the step-by-step explanation, and the Java code were all helpful.