Reading Your Way Into Big Data
Written by Ian Stirk   
Monday, 14 December 2015


To make sense of the huge amounts of data stored in Hadoop, it needs to be queried. Various tools exist for querying this data and, although they differ, most provide SQL-like functionality. It's been suggested that in the future, maybe 20% of Big Data roles will involve low-level programming knowledge (e.g. MapReduce, Scala), but 80% of the roles will involve querying the data via SQL-like queries. The development and maintenance cycles for SQL-like queries are much shorter than for low-level programming languages. Time will tell.

Apache Hive Essentials aims to introduce you to a popular platform for storing and analyzing Big Data on Hadoop. Hive tends to be popular because it uses a SQL-like syntax, familiar to many people. With plenty of built-in functionality, Big Data analysis can be done in Hive without advanced coding skills.

The book is aimed at both the beginner and the more advanced audience (data analysts, developers, and users). Some previous experience of SQL and databases is advantageous. 

Most topics are explained in a very readable manner, although a few sections could do with more detail (e.g. transactions). Throughout, there are helpful explanations, screenshots, practical code examples, and inter-chapter references. Some links to websites are provided for further information. This book is especially suitable for developers and data analysts starting out with Hive. Since it also contains advanced and up-to-date material, it is suitable for more advanced developers/analysts too. If you have a background in SQL the book is even easier to understand.

There are very few books dedicated to Hive, and these tend to be out of date now (especially since Hive changes regularly). If you want an up-to-date, practical, wide-ranging review of Hive's functionality, I highly recommend this book.

Getting Started with Impala aims to get you up-and-running with Impala – a tool for quickly querying Hadoop’s Big Data, and succeeds commendably. This is a short book, containing 110 pages split into five chapters.




Throughout, there are helpful explanations, screenshots, practical code examples, inter-chapter references, and links to websites for further information. It’s packed with useful instructions, but some sections could benefit from more code examples.

This book is suitable for analysts, developers and users that are starting out with Impala. Although aimed at the beginner, several later sections contain more advanced topics (e.g. performance). If you have a background in SQL, you will have a head start, and if you know about data warehousing, the book is even easier to understand. 

Impala is a popular tool for querying Hadoop’s data quickly, much quicker than other tools. Additionally, the development cycle for Impala queries is much shorter than for comparable approaches such as Java and MapReduce processing. I would suggest Impala should be your first choice for querying data, even if the underlying data is stored in some other component (e.g. Hive).

Obviously there is much more to learn about Impala than what’s given in this small book, but this book is a great place to start learning. Highly recommended.


Advanced books

This section contains details of advanced books. Ideally these should be read after the introductory books, otherwise you might be discouraged by the detail of their content.

Hadoop: The Definitive Guide is a very popular Hadoop book, which recently reached its fourth edition. This updated book covers Hadoop 2 exclusively, with new chapters on several of Hadoop’s components. It is aimed at developers, architects, and administrators.




This is a wide-ranging book divided into five parts. It covers Hadoop’s core components (HDFS, MapReduce, and YARN), Hadoop installation and maintenance, various related projects (e.g. Sqoop, Spark), and some case studies, spread over twenty-four chapters.

The book is well written, providing good explanations, examples, walkthroughs, and helpful diagrams. Useful links are given between chapters and to websites. Most chapters have footnotes and a “further reading” section so you can obtain more information. You probably need an understanding of Java or a similar language to get the most out of the book. It should take your general level of understanding from level 3 to level 8.

Since the book covers internals, administration, and development, I’m not sure who will read the entire book. Some sections seemed dry on first reading. Some of the books that are referenced are getting old. Not all components are covered (e.g. Storm), but many popular ones are.

I did wonder if there was too much emphasis on MapReduce, since there seems to be movement away from MapReduce batch processing towards interactive processing, as shown with the growing popularity of Spark.

Despite these minor criticisms, if you want to gain a good understanding of the current state of Hadoop and its components, I can highly recommend this book.

Hadoop Application Architectures aims to provide Hadoop current best practices, example architectures and complete implementations – and succeeds in each area.


This book is written for developers and architects that are already familiar with Hadoop, who wish to learn some of the current best practices, example architectures and complete implementations. It assumes some existing knowledge of Hadoop and its components (e.g. Flume, HBase, Pig, and Hive). Book references are provided for those needing topic refreshers. Additionally, it’s assumed you are familiar with Java programming, SQL and relational databases. It consists of two sections, the first of which has seven chapters and looks at factors that influence application architectures. The second consists of three chapters, each providing a complete end-to-end case study.

The book is well written, providing good explanations, examples, walkthroughs, and diagrams. Useful links are given between chapters, and there’s a valuable conclusion at the end of each chapter. The order of the chapters is helpful in understanding the flow of topics. This is not a book for beginners, but does contain useful references to books to get you up to speed.

In many ways, this book follows on naturally from “Hadoop: The Definitive Guide”. It provides practical discussions of the many factors to consider when presented with common Hadoop architectural concerns (e.g. whether to use HDFS or HBase?). The book offers recommendations, and provides supporting information that backs these up.

The book doesn’t cover all Hadoop technologies (e.g. it omits Machine Learning), but it does cover many popular ones. Some of the books referenced are getting old and some chapters have footnotes at the end, which would be better placed on the pages where they are referenced.

Hadoop is changing rapidly, and this book suggests the near future will see a decline in MapReduce processing and a rise in processing using Spark. Similarly, at the higher level of abstraction, SQL in its various flavours also appears to be in the ascendancy.

If you want to know the current state of Hadoop and its components, want a practical discussion of the pros and cons for using various tools, and want solutions to common problems, I can highly recommend this book. 





Last Updated ( Monday, 14 December 2015 )