Reading Your Way Into Big Data
Written by Ian Stirk   
Monday, 14 December 2015

Spark code tends to be written in Java, Python, or Scala. For a given piece of work, Python code is typically twice the size of the equivalent Scala code, and Java code is five times the size. Additionally, some Spark functionality may not yet exist in Python and Java. I therefore recommend you learn Scala in place of Java or Python. With this in mind, I reviewed Learning Scala. The audience for this book is primarily developers who have worked with object-oriented languages (e.g. Java, Ruby, and Python) and who want to learn Scala. Scala marries object-oriented features with functional programming, so any experience with these will help in understanding this book.
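To give a flavour of what marrying object-oriented features with functional programming can look like, here is a minimal sketch in plain Scala (no Spark required). It is not taken from Learning Scala: the names `Book` and `Catalogue`, and the sample data, are hypothetical and chosen purely for illustration.

```scala
// Hypothetical example: Book, Catalogue and the sample data are
// illustrative assumptions, not code from the book under review.

// Object-oriented side: a case class defines an immutable data type.
final case class Book(title: String, pages: Int, topics: Set[String])

object Catalogue {
  // Functional side: small pure functions built from filter, map and sum.
  def titlesAbout(topic: String, books: Seq[Book]): Seq[String] =
    books.filter(_.topics.contains(topic)).map(_.title)

  def totalPages(books: Seq[Book]): Int =
    books.map(_.pages).sum

  def main(args: Array[String]): Unit = {
    // Made-up sample data, used only to exercise the functions above.
    val books = Seq(
      Book("A", 100, Set("scala")),
      Book("B", 200, Set("hadoop")),
      Book("C", 300, Set("scala", "spark"))
    )
    println(titlesAbout("scala", books))
    println(totalPages(books))
  }
}
```

The same collection style (`filter`, `map`, and friends) is broadly what Spark's Scala API builds on, which is one reason experience with either paradigm helps.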


Scala is very popular with Big Data systems, being used increasingly with interactive processing (e.g. Spark). The book is relatively small, having around 220 working pages, consisting of two sections: Core Scala (7 chapters) and Object-Oriented Scala (3 chapters). It aims to help developers learn the Scala programming language, and succeeds admirably - provided you are already familiar with programming concepts using another language, especially an object-oriented language. I found myself constantly aligning my existing programming knowledge with Scala's syntax.

The book is well written, concise in its explanations, with plenty of helpful examples to follow along with. The summaries and exercises at the end of each chapter are useful. The answers to the chapter exercises can be found at Learning-Scala-materials.

While the book concentrates on how to use the Scala language, there is little on Scala's associated tools (e.g. Spark). Additionally, it might have been useful to include a section on where to find further information (books, websites, blogs etc). However, these are minor concerns.
Overall, this is a very useful, concise, introduction to the Scala language for existing developers. Highly recommended.


As you look further into Hadoop, you'll quickly become aware that it has a great many associated components. The Field Guide to Hadoop aims to give you a short introduction to Hadoop and its various components. The authors compare this to a field guide for birds or trees, so it is broad in scope and shallow in depth. It provides up-to-date but limited detail on the major components of the Hadoop Big Data system. Helpful links are provided for further information. Each chapter briefly covers an area of Hadoop technology, and outlines the major players. The book is not a tutorial, but a high-level overview, consisting of 132 pages in eight chapters.



The book is mostly easy to read, with a consistent layout of content (i.e. License, Activity, Purpose, Official Page, Hadoop Integration, description, tutorial link, and simple example code). Useful comparisons between tools are occasionally provided. It should prove helpful to managers, developers, and architects who are new to Big Data and want a quick overview of the major components of Hadoop. Most Hadoop books discuss some of the components listed here, but this book covers a much wider range of components than other books.
The authors intend to update this book regularly (every year or two), which is ideal if you want to know about the current popular components, and especially good if you have access to Safari Books Online (but bad if you need to keep buying the updated book).

If you're new to Big Data and Hadoop, and you want to quickly review what it is, and the current state of its major components, I highly recommend this small book.

Introductory Books Summary

If you are new to Big Data and Hadoop, I recommend you read Hadoop Finance Essentials to get a background understanding of Hadoop and its major components. If you already have some understanding of Hadoop, or you feel confident, the next book to read is Big Data Made Easy, which is both practical and wide-ranging.

Most introductory Hadoop books have a section on Spark, but for a more detailed approach, I recommend Learning Spark. Spark can be programmed using Java, Python, or Scala. I recommend you try Scala when using Spark, since it is more concise and tends to get new Spark functionality first. You can learn more about the language in Learning Scala.
To get a quick overview of the current state of the many Hadoop technologies/buzzwords, I recommend you dip into Field Guide to Hadoop.


Before you can process huge volumes of data, you first need to get the data into Hadoop. Typically, Sqoop is used to import data from relational databases into Hadoop, and Flume is used to import other data (e.g. log files).

I have read and used Apache Sqoop Cookbook extensively, but I haven't yet reviewed it, as I am waiting for an updated version of the book. Published in 2013, it is getting a bit old, and doesn't cover the latest developments in Sqoop 2. That said, it is a very useful introductory guide that is very easy to read, wide in scope, and provides plenty of example template code that you can integrate into your own solutions.

Apache Flume is a popular tool for moving log data into Hadoop. This book is aimed at people responsible for getting data from various sources into Hadoop. No previous experience of Flume is assumed. The book does assume a basic knowledge of Hadoop and Hadoop Distributed File System (HDFS), and some Java if you want to make use of any custom implementations. Most Flume development revolves around configuration settings.




This book has well-written discussions, useful hands-on walkthroughs, diagrams, configuration settings, website links, inter-chapter links, chapter summaries, and miscellaneous tips throughout. I enjoyed the author's approach: he's enthusiastic, and explains choices in a considered manner, acknowledging that other opinions exist. He encourages you to test your own use cases on your system. You do need to have an awareness of Hadoop to make full use of this book.

This book will enable you to create Flume agents to transfer log data into Hadoop, with due consideration. I highly recommend this book.










Last Updated ( Monday, 14 December 2015 )