Hadoop: The Definitive Guide (4th ed)

Part IV Related Projects


This part has nine chapters. There is much more to know about each of these topics than the book covers; its aim is to provide a working introduction.

Chapter 12 Avro

Avro is a language-neutral serialization system. This is important because it allows datasets to be shared across a wide range of languages and frameworks.

The chapter opens with a look at the Avro data types and schemas. Avro has a small number of primitive data types and some complex types – a table of these is given. A helpful table mapping Avro data types to Java data types is provided.

The chapter continues with a look at in-memory serialization and deserialization, which is helpful for integrating Avro into existing systems. Next, the structural content of Avro data files is discussed. Interoperability is demonstrated with examples of an Avro file being written from a Python program and being read by a Java program.
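To give a flavour of what in-memory serialization involves, here is a minimal Python sketch of how a record could be laid out in Avro's binary encoding. It assumes the rules given in the Avro specification (zig-zag variable-length longs, length-prefixed UTF-8 strings, fields written in schema order without their names). The function names and the two-field record are hypothetical, and this is not the Avro library's API.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag encode a signed 64-bit integer, then write it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf: bytes, pos: int = 0):
    """Inverse of encode_long; returns (value, next_position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


ENCODERS = {"long": encode_long, "string": encode_string}


def encode_record(fields, record) -> bytes:
    """Concatenate field encodings in schema order; field names are not written."""
    return b"".join(ENCODERS[ftype](record[name]) for name, ftype in fields)


# Hypothetical two-field schema, used only for illustration.
PAIR_FIELDS = [("left", "string"), ("right", "string")]
encoded = encode_record(PAIR_FIELDS, {"left": "L", "right": "R"})
print(encoded)  # b'\x02L\x02R'
```

Because the field names live in the schema rather than in the data, the reader needs the writer's schema to interpret the bytes.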

A section on the use of Avro in MapReduce follows, showing the various classes that make it easier to run MapReduce on Avro data. Sorting using Avro MapReduce is also examined. The chapter ends with a reiteration of the language-independent nature of Avro, showing Avro usage in Pig and Hive.

This chapter provides a useful overview of a very popular language-neutral serialization system.

Chapter 13 Parquet

Like Avro, Parquet is a popular serialization system. Parquet is a columnar storage format, and is especially suited to the efficient storage of nested data.
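As a rough illustration of what "columnar" means, the Python sketch below pivots a few rows into one list per column, so that a query needing a single column touches only that list. The sample rows and function names are made up for illustration. A real Parquet file also has row groups, encodings, compression and nested-data handling, none of which is modelled here.

```python
# Made-up sample rows, in the usual row-oriented form.
rows = [
    {"year": 1949, "station": "S1", "temp": 111},
    {"year": 1949, "station": "S2", "temp": 78},
    {"year": 1950, "station": "S1", "temp": 22},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def read_columns(columns, wanted):
    """Return only the requested columns; the others are never touched."""
    return {name: columns[name] for name in wanted}


columns = to_columns(rows)
temps = read_columns(columns, ["temp"])
print(max(temps["temp"]))  # 111
```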

The chapter takes a look at the Parquet file format (header, blocks, footer), and then looks at writing and reading Parquet files, which is often done by higher-level tools such as Pig and Hive. A low-level example using Java classes is given.

The chapter ends with a look at using Parquet in MapReduce. Parquet has some specific MapReduce input and output formats suitable for reading and writing Parquet files from MapReduce jobs. A MapReduce example is provided that converts text files into Parquet files.

This chapter provides a useful overview of a popular serialization system, optimized for nested data, and especially useful when you only need a few columns of data from each row.

Chapter 14 Flume

Hadoop excels at processing large amounts of data in a distributed manner. However, the data first needs to get into Hadoop. Flume is a popular tool for importing data into HDFS and HBase.

The chapter opens with a look at where to download Flume, and continues with installation details. Next, a simple Flume example is given: watching a local directory for new files, then sending each line to the console. The example shows the architecture of a Flume job, in which a Flume agent moves data from a source (e.g. a text file) to a sink (e.g. the console) via a channel (e.g. memory). This wiring is specified in configuration files. Later, an example using HDFS as the sink is given.

The chapter next discusses transactions (there are separate transactions on the source and sink sides) and reliability (if delivery fails, the events can be redelivered later). There’s a useful table showing the various Flume components together with a brief description of their purpose.
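The idea of separate source-side and sink-side transactions can be sketched in a few lines of Python. This is a toy model under my own assumptions, not Flume's API: the `MemoryChannel` class, its method names and the deliberately flaky sink are all hypothetical. The point it shows is that events leave the channel only when the sink confirms delivery, so a failed delivery can be retried later.

```python
from collections import deque


class MemoryChannel:
    """Toy in-memory channel sitting between a source and a sink."""

    def __init__(self):
        self._events = deque()

    def put_batch(self, events):
        """Source side: materialize the batch first, then add it in one step."""
        self._events.extend(list(events))

    def take_batch(self, size, deliver):
        """Sink side: events are removed for good only if deliver() succeeds."""
        count = min(size, len(self._events))
        batch = [self._events.popleft() for _ in range(count)]
        try:
            deliver(batch)
        except Exception:
            # Roll back: return the events to the head in their original order.
            self._events.extendleft(reversed(batch))
            return False
        return True

    def __len__(self):
        return len(self._events)


delivered = []
attempts = {"count": 0}


def flaky_sink(batch):
    """Hypothetical sink that fails on its first call and succeeds afterwards."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise IOError("sink unavailable")
    delivered.extend(batch)


channel = MemoryChannel()
channel.put_batch(["line 1", "line 2"])
first = channel.take_batch(10, flaky_sink)   # fails; events stay in the channel
second = channel.take_batch(10, flaky_sink)  # succeeds on redelivery
```

Redelivery of this kind means an event may occasionally be seen more than once downstream, which is worth bearing in mind when choosing a sink.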

This chapter provides a helpful introduction to Flume, a popular tool for getting data into Hadoop.

Chapter 15 Sqoop

Sqoop is a popular tool for moving data between relational databases and Hadoop (e.g. HDFS, Hive and HBase) via MapReduce jobs.

The chapter opens with a look at where to download Sqoop, and continues with installation details. Next, a simple Sqoop example is given, in which data from a MySQL table is imported into HDFS. The default file format is CSV, but other formats are possible, including Avro. In addition to HDFS, data can also be imported into HBase and Hive. Later, an export example is provided.

The chapter explains that when a Sqoop job starts, it inspects the source for details of the data (column names and data types) and generates Java classes from them, which are used in the MapReduce jobs that perform the import or export. These jobs run across the cluster, and the number of mappers can be supplied as a parameter.
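The inspect, generate and split steps can be sketched in Python using the standard-library `sqlite3` module as a stand-in for MySQL. Everything here is illustrative: the `widgets` table, the type mapping and the helper names are my own assumptions, and real Sqoop generates Java classes and runs the splits as map tasks rather than a loop.

```python
import sqlite3
from dataclasses import make_dataclass

# Assumed mapping from declared SQL types to Python types (illustration only).
SQL_TO_PY = {"INTEGER": int, "TEXT": str, "REAL": float}


def inspect_table(conn, table):
    """Read column names and types from the table's metadata."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(c[1], SQL_TO_PY.get(c[2].upper(), str)) for c in cols]


def generate_record_class(table, columns):
    """Build a record class from the inspected columns."""
    return make_dataclass(table.capitalize(), columns)


def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous per-mapper ranges."""
    size = -(-(hi - lo + 1) // num_mappers)  # ceiling division
    ranges = []
    start = lo
    while start <= hi:
        end = min(start + size - 1, hi)
        ranges.append((start, end))
        start = end + 1
    return ranges


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widgets (id INTEGER PRIMARY KEY, widget_name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO widgets VALUES (?, ?, ?)",
    [(1, "sprocket", 0.25), (2, "gizmo", 4.0), (3, "gadget", 99.99), (4, "cog", 1.5)],
)

columns = inspect_table(conn, "widgets")
Widget = generate_record_class("widgets", columns)
lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM widgets").fetchone()

records = []
for start, end in split_ranges(lo, hi, num_mappers=2):
    # Each range stands in for the slice of the table one mapper would import.
    cursor = conn.execute("SELECT * FROM widgets WHERE id BETWEEN ? AND ?", (start, end))
    records.extend(Widget(*row) for row in cursor)
```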

Since much data resides in relational databases, this chapter should prove very useful in your own projects that import data into Hadoop. Perhaps more examples could have been provided.

Chapter 16 Pig

Any significant MapReduce programming typically requires detailed programming knowledge. Using higher-level tools can make development easier and quicker. These tools provide powerful transformations and include joins, which can be troublesome in MapReduce. Pig is one such high-level tool. Pig consists of a data flow language called Pig Latin, and an execution environment in which to run it.

The chapter opens with a look at where to download Pig, and continues with installation details. Pig launches jobs and interacts with HDFS from a client workstation. Pig is typically run in local mode during testing, and in MapReduce mode when full cluster capabilities are required.

The chapter continues with an example using Pig Latin code to calculate the maximum temperature per year, which requires only a few lines of code, much shorter than the corresponding Java code. Next, a comparison between Pig and databases is made: Pig Latin is procedural, whereas SQL is declarative. Additionally, databases have schemas, constraints, indexes and transactions.
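For readers who have not seen a data flow language, the Python sketch below mirrors the load, filter, group and aggregate steps that such a Pig Latin script expresses. It is not Pig code, and the sample lines, the 9999 missing-value marker and the set of acceptable quality codes are assumptions made for illustration.

```python
from collections import defaultdict

MISSING = 9999                   # assumed marker for a missing reading
GOOD_QUALITY = {0, 1, 4, 5, 9}   # assumed set of acceptable quality codes


def max_temp_per_year(lines):
    """Step-by-step data flow: load, filter, group by year, take the maximum."""
    # LOAD: split each tab-separated line into (year, temperature, quality).
    records = (line.split("\t") for line in lines)
    parsed = ((year, int(temp), int(quality)) for year, temp, quality in records)
    # FILTER: drop missing or low-quality readings.
    filtered = (r for r in parsed if r[1] != MISSING and r[2] in GOOD_QUALITY)
    # GROUP: collect temperatures by year.
    grouped = defaultdict(list)
    for year, temp, _ in filtered:
        grouped[year].append(temp)
    # FOREACH ... MAX: one output value per group.
    return {year: max(temps) for year, temps in grouped.items()}


# Made-up sample input: year, temperature, quality code.
sample = [
    "1950\t0\t1",
    "1950\t22\t1",
    "1950\t-11\t1",
    "1950\t9999\t1",
    "1949\t111\t1",
    "1949\t78\t1",
]
result = max_temp_per_year(sample)
print(result)  # {'1950': 22, '1949': 111}
```

Each step names an intermediate result, which is the procedural style the comparison with SQL refers to.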

Next, a section provides a brief introduction to Pig Latin, covering basic syntax and semantics. Structures, statements, expressions, types, schemas, functions and User-Defined Functions (UDFs) are outlined. The next section gives details of various data processing operations, including loading, storing and filtering data. This is followed by a section on grouping and joining data.

The chapter ends with a look at using Pig in practice. Various tips are outlined, including parallelism (the number of reducers can be set via PARALLEL clauses) and parameter substitution, which can be useful in your scripts.

This chapter provides a helpful introduction to Pig, which allows MapReduce jobs to be written more easily and quickly at a higher level of abstraction.

Chapter 17 Hive

Hive is Hadoop’s data warehouse. It takes advantage of the large base of SQL skills that already exists amongst analysts (much larger than the base of Java skills).

The chapter opens with a look at where to download Hive, and continues with installation details. The chapter then looks at running Hive, specifically the various configuration files (e.g. hive-site.xml), the Hive services, and the metastore (the repository of Hive metadata).

Hive is then compared with traditional databases. The main way to interact with Hive data is via its query language, HiveQL, which is very similar to SQL. Recent releases of Hive have added support for updates, transactions, and indexes. Initiatives are currently progressing to move Hive towards non-MapReduce systems (e.g. Hive on Spark), aiming to provide much faster querying.

The next section takes a look at HiveQL, providing a helpful table comparing HiveQL with the equivalent SQL command. Hive’s various data types are described, together with its common operations and functions.

A large section relating to tables follows, highlighting the difference between internal tables (managed by Hive) and external tables (handled by you). Distributing data via partitioning is discussed; this often provides more efficient queries, since you only access the relevant partitions. Similarly, buckets are explained; these provide an additional structure that can be used in joins.
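A small Python model may help show why partitions and buckets make queries cheaper. It assumes a table partitioned by a date column `dt` and bucketed on an integer `id` using the value modulo the number of buckets. The column names, the bucket count and the helper functions are hypothetical, and the nested dictionary merely stands in for Hive's directories and files.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count for the sketch


def bucket_for(row_id, num_buckets=NUM_BUCKETS):
    """Assign an integer key to a bucket by taking it modulo the bucket count."""
    return row_id % num_buckets


def layout(rows):
    """Place each row under its partition value, then under its bucket number."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row["dt"]][bucket_for(row["id"])].append(row)
    return table


def query_partition(table, dt):
    """Partition pruning: only the matching partition is scanned."""
    return [row for bucket in table.get(dt, {}).values() for row in bucket]


# Made-up rows: four on the first day, two on the second.
rows = [{"dt": "2001-01-01", "id": i} for i in range(4)]
rows += [{"dt": "2001-01-02", "id": i} for i in range(4, 6)]

table = layout(rows)
second_day = query_partition(table, "2001-01-02")
```

Because two tables bucketed the same way on the join key place matching keys in the same bucket number, a join can pair buckets up instead of comparing every row with every other row.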

Another large section relating to querying data follows. This includes discussions and examples relating to sorting and aggregation, MapReduce processing, joins, subqueries and views. The chapter ends with a look at UDFs, which tend to be written in Java and then used from HiveQL.

This chapter provides a useful overview of what Hive is, and how it can be used. There’s a useful note about recent activity concerning Hive moving away from MapReduce and becoming more interactive, and a useful comparison between SQL and HiveQL. Perhaps something could have been said about buckets giving a more even distribution of data, which is important for MapReduce.

Chapter 18 Crunch

Similar to Pig, Crunch is another high-level tool that abstracts MapReduce programming, making it much easier. Crunch has programmer-friendly types and data transformations. In many ways it is like a Java version of Pig. It has the advantage that user-defined functions are written in the same language as the rest of the pipeline, so you can stay within the same development environment.

The chapter opens with a simple example Crunch application that finds the maximum temperature from the weather dataset. Next, the core Crunch API is examined, including primitive operations (union, parallelDo, groupByKey and combineValues), types, sources and targets, and functions.

The chapter continues with a look at pipeline execution, where a workflow is defined and Crunch builds an execution plan from it. Examples are provided for running and stopping a pipeline. The Crunch execution plan is examined, since this often gives a useful insight into pipeline processing.
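The notion of defining a workflow first and executing it later can be sketched in Python. The `Pipeline` class below is a hypothetical stand-in, not the Crunch API: it records stages as they are declared, can report a plan without running anything, and processes data only when `run()` is called. Real Crunch compiles its plan into MapReduce jobs, which this sketch does not attempt.

```python
class Pipeline:
    """Hypothetical lazily executed pipeline; stages run only on run()."""

    def __init__(self, source):
        self._source = source
        self._stages = []  # recorded, not yet executed

    def parallel_do(self, name, fn):
        """Record a stage; fn maps one input item to zero or more outputs."""
        self._stages.append((name, fn))
        return self

    def plan(self):
        """Describe what would run, without touching any data."""
        return ["read"] + [name for name, _ in self._stages] + ["write"]

    def run(self):
        """Execute the recorded stages in order over the source."""
        data = list(self._source)
        for _, fn in self._stages:
            data = [out for item in data for out in fn(item)]
        return data


def parse(line):
    """Emit (year, temperature) for well-formed lines; emit nothing otherwise."""
    year, temp = line.split("\t")
    if temp.lstrip("-").isdigit():
        yield year, int(temp)


def keep_warm(record):
    """Emit only records with a temperature above 50 (an arbitrary threshold)."""
    if record[1] > 50:
        yield record


# Made-up input lines, one of them malformed.
lines = ["1949\t111", "1950\tbad", "1950\t22"]
pipeline = Pipeline(lines).parallel_do("parse", parse).parallel_do("keep_warm", keep_warm)
plan = pipeline.plan()    # nothing has been processed yet
result = pipeline.run()   # the stages execute now
```

Inspecting the plan before running is what makes it possible to see how the declared operations will be grouped into actual work.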

The chapter ends with a look at some Crunch library functions. A useful listing of their methods is supplied, together with a brief description of each method's usage.

This chapter provides a good overview of Crunch, what it is and how it can be used. In many ways it provides similar functionality to Pig, so perhaps this chapter wasn’t required?

Last Updated ( Tuesday, 21 July 2015 )