Authors: Mark Grover et al
Audience: Hadoop developers and architects
Reviewer: Ian Stirk
This book aims to provide best practices and example architectures for Hadoop technologists. How does it fare?
This book is written for developers and architects who are already familiar with Hadoop and wish to learn some of the current best practices, example architectures and complete implementations. It assumes some existing knowledge of Hadoop and its components (e.g. Flume, HBase, Pig, and Hive). Book references are provided for those needing topic refreshers. Additionally, it’s assumed you are familiar with Java programming, SQL and relational databases. It consists of two sections, the first of which has seven chapters and looks at factors that influence application architectures. The second consists of three chapters, each providing a complete end-to-end case study.
Below is a chapter-by-chapter exploration of the topics covered.
Section I Architectural Considerations for Hadoop Applications
Chapter 1 Data Modeling in Hadoop
The chapter opens with a look at storage considerations. Various file types are discussed, and the importance of splittable compressed data is highlighted. Avro and Parquet are generally the preferred file formats for row-based and columnar storage respectively.
The chapter continues with a look at factors to consider when storing data in HDFS. Directory structures are recommended (e.g. /users/<username>). If you know what tools you intend to use to process the data (e.g. Hive), you can take advantage of partitioning (reduces IO), bucketing (improves the performance of joins), and denormalization (eliminates the need for joining data).
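As a rough illustration of partitioning and bucketing, here is a minimal Python sketch. It is not taken from the book. It assumes Hive-style key=value partition directories and uses CRC32 as a stand-in for whatever hash the processing engine really applies; the /data/clickstream path and both function names are hypothetical.

```python
import zlib
from datetime import date


def partition_path(base_dir: str, event_date: date) -> str:
    """Build a Hive-style partition directory (key=value path segments)."""
    return (
        f"{base_dir}/year={event_date.year:04d}"
        f"/month={event_date.month:02d}/day={event_date.day:02d}"
    )


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a join key to a bucket number.

    CRC32 is only an assumption for this sketch; a real engine
    uses its own hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


# A query restricted to one day only needs to read this directory.
print(partition_path("/data/clickstream", date(2015, 7, 1)))
# prints /data/clickstream/year=2015/month=07/day=01

# The same key always maps to the same bucket, so two tables bucketed
# on that key can be joined bucket by bucket.
print(bucket_for("user-42", 8) == bucket_for("user-42", 8))
# prints True
```

Partitioning reduces IO because a filter on the partition column limits which directories are read, while bucketing helps joins because matching keys are known to sit in matching buckets.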
Factors to consider when storing data in HBase are discussed next. HBase is a NoSQL database, often thought of as a huge distributed hash table. This key-value store is optimized for fast lookups, and is especially suitable for problems having relatively few get and put requests. HBase tables can have millions of columns and billions of rows. Important considerations for choosing the row key are discussed. Other aspects of HBase covered include: use of timestamps, hops, tables and regions, and the use of column families.
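To make the row key discussion more concrete, the following Python sketch builds a composite key from a salt prefix, an entity id and a reversed timestamp. These are common HBase techniques rather than the book's specific recommendations, and everything here (the function name, the "|" separator, the number of salts, the MAX_MILLIS bound) is an assumption made for illustration.

```python
import hashlib

# Hypothetical upper bound on epoch milliseconds (13 digits).
MAX_MILLIS = 10**13 - 1


def make_row_key(user_id: str, epoch_millis: int, num_salts: int = 16) -> str:
    """Build a row key of the form <salt>|<user_id>|<reversed timestamp>."""
    if not 0 <= epoch_millis <= MAX_MILLIS:
        raise ValueError("epoch_millis out of range")
    # The salt is derived from the user id, so one user's rows stay together
    # while different users are spread over num_salts key ranges.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    salt = int(digest, 16) % num_salts
    # Reversing the timestamp makes newer rows sort before older ones.
    reverse_ts = MAX_MILLIS - epoch_millis
    return f"{salt:02d}|{user_id}|{reverse_ts:013d}"


older = make_row_key("user-42", 1_435_708_800_000)
newer = make_row_key("user-42", 1_435_795_200_000)
print(older)
print(newer)
# Row keys sort lexicographically, so the newer event comes first.
print(newer < older)
# prints True
```

The design trade-off is that a salted key spreads writes more evenly but means a scan across all users has to visit every salt range.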
The chapter ends with a look at metadata, describing what metadata is, and why it’s important. The importance of the Hive metastore and its reuse by other tools is discussed.
This chapter provides a useful discussion of features to consider in data modeling. Some sections seem wordy, but probably need to be so. Some useful recommendations are given (e.g. use the Avro file format), together with supporting reasons.
From its start, it’s clear this is not a book for beginners. The chapter is well written, has useful explanations, discussions, diagrams, references, links to other chapters, and considered recommendations. A useful chapter conclusion is provided. These features apply to the whole book.
Chapter 2 Data Movement
This chapter discusses factors to consider when moving data into and out of Hadoop. Factors discussed include: the various data sources (e.g. RDBMS, mainframes, logs), how often the data should be extracted, access patterns, whether the data should be appended or overwritten, data transformations, and handling failures.
The chapter continues with a look at various tools for ingesting data. In each case, an introduction to the tool is provided, together with references on where to find detailed information. The advantages and disadvantages of each tool are discussed. Useful discussions, diagrams, and example code are provided for: file transfers (HDFS commands), Sqoop (transfers data between Hadoop and relational databases), Flume (typically processes log files), and Kafka (typically processes log files). Recommendations for when to use each tool are provided, together with tips for troubleshooting and resolving bottlenecks.
The chapter ends with a brief look at tools used for getting data out of Hadoop; these include Sqoop, distcp (distributed copy), and the get command.
This chapter provides a useful overview of factors to consider for data movement with Hadoop. Helpful example usage of various tools (e.g. Sqoop) is given.
Chapter 3 Processing Data in Hadoop
This chapter looks at the various factors to consider, so you can select an appropriate processing tool. The tools examined range from low-level Java routines to high-level SQL queries.
Traditionally, Hadoop processing has revolved around MapReduce. Here the work is split over many data nodes (Map phase), and the outputs are combined and sorted (Reduce phase). A helpful annotated example is provided (joining and filtering data in two datasets). Recommendations for when to use MapReduce are provided. It’s noted this is a low-level framework, and typically has more opportunities for bugs, and a longer development cycle.
The chapter continues with a look at Spark. In many ways, Spark is the replacement for MapReduce, typically processing data much faster. The section opens with an overview of Spark, with Resilient Distributed Datasets (RDDs) being at its core. RDDs can store data over various nodes, and are processed in parallel. Transformations can be lazily applied to RDDs to produce new RDDs. These are not evaluated until an action is performed, which allows some clever optimizations to occur but can make debugging more difficult. The benefits of using Spark are discussed.
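The shape of a MapReduce join-and-filter can be sketched in a few lines of plain Python. This is not the book's Java example and does not use the Hadoop API; it only mimics the map, shuffle and reduce steps in a single process, and the datasets and function names are invented for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_users(rows):
    """Map phase for the users dataset: tag each record with its source."""
    for user_id, name in rows:
        yield user_id, ("user", name)


def map_orders(rows, min_amount):
    """Map phase for the orders dataset: filter early, then tag."""
    for user_id, amount in rows:
        if amount >= min_amount:
            yield user_id, ("order", amount)


def shuffle(pairs):
    """Group mapped values by key and sort keys, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_join(key, values):
    """Reduce phase: pair every user record with every order for that key."""
    names = [v for tag, v in values if tag == "user"]
    amounts = [v for tag, v in values if tag == "order"]
    for name in names:
        for amount in amounts:
            yield key, name, amount


users = [("u1", "Ann"), ("u2", "Bob")]
orders = [("u1", 30), ("u1", 5), ("u2", 8), ("u3", 50)]

mapped = chain(map_users(users), map_orders(orders, min_amount=10))
joined = [
    row
    for key, values in shuffle(mapped).items()
    for row in reduce_join(key, values)
]
print(joined)
# prints [('u1', 'Ann', 30)]
```

Even in this toy form, the amount of plumbing needed for a simple join hints at why the chapter describes MapReduce as low-level.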
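Lazy evaluation is easier to see with a small example. The Python class below is a toy analogy, not Spark's API: LazyDataset, its methods and the traced_double helper are all hypothetical names. Transformations only record what should happen, and nothing runs until the collect action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: builds a pipeline, runs it only on an action."""

    def __init__(self, source):
        # source is a zero-argument callable that yields the records
        self._source = source

    def map(self, fn):
        """Transformation: returns a new dataset, evaluates nothing."""
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        """Transformation: returns a new dataset, evaluates nothing."""
        return LazyDataset(lambda: (x for x in self._source() if pred(x)))

    def collect(self):
        """Action: runs the whole pipeline and returns the results."""
        return list(self._source())


calls = []


def traced_double(x):
    """Double a value and record that the function actually ran."""
    calls.append(x)
    return x * 2


base = LazyDataset(lambda: [1, 2, 3, 4])
pipeline = base.map(traced_double).filter(lambda x: x > 4)

print(calls)
# prints [] because no transformation has been evaluated yet
print(pipeline.collect())
# prints [6, 8]
print(calls)
# prints [1, 2, 3, 4]
```

The debugging difficulty mentioned above shows up here too: an error inside traced_double would only surface at collect, far from the line that added it to the pipeline.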
The chapter moves on to look at tools that provide a higher level of abstraction, typically providing simple frameworks that allow MapReduce jobs to be created underneath. The tools examined are: Pig, Crunch, Cascading, Hive and Impala. For each tool, an overview is given, followed by example code, and a discussion of when to use the tool. Where possible, comparisons are provided.
This chapter provides a useful overview of tools that process data in Hadoop. Especially useful were the comparisons, advantages/disadvantages, and recommendations of when to use a given tool. The section on Cascading is not needed, since it’s similar to Crunch and Pig, and uncommon.
Chapter 4 Common Hadoop Processing Patterns
As you create solutions with new technologies, certain common problems appear. Eventually, common solutions to these problems also emerge. This chapter discusses these common big data problems together with their solutions.
The patterns examined are:
In each case, the underlying problem is described, test data is generated, and example code is supplied for several solutions to the problem (typically in both Spark and SQL).
This chapter provides useful solutions to common problems that occur in processing Hadoop data. It’s suggested that SQL will become increasingly powerful, but is unlikely to replace Spark.