This addition to Programmer's Bookshelf is a roadmap of the reading required to take you from novice to competent in areas relating to Big Data, Hadoop, and Spark.
It is based mainly on books reviewed on i-programmer, together with my own experiences. For each book I outline its purpose, list any prerequisites, and identify its level, range and depth. I also outline the book's positive and negative features. Clicking on the link associated with each book's title will take you to the detailed review of the book. Note that most Big Data technologies provide tools that let you experiment with them (e.g. the REPL for Spark). You should supplement your reading with practical use of these tools to get the most from the topic.
Increasing amounts of data are being generated. However, traditional means of processing this Big Data, e.g. Relational Database Management Systems (RDBMS), are unable to process it in a timely manner.
Hadoop, the most popular Big Data platform, provides a solution to processing this data. In essence, the data is stored across various connected servers (nodes), and this data is replicated across the nodes (to provide fault tolerance and a local copy of the data). To process this data, programming instructions are sent to the nodes (typically faster than sending data to the code), the nodes each perform a relatively small amount of processing with their local data, and the results from all the nodes are subsequently amalgamated and returned to the client. This distributed processing approach allows large amounts of data to be processed in a timely manner.
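The idea of sending the code to the data can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Hadoop's API: the `nodes` dictionary and the `map_on_node` and `reduce_counts` functions are hypothetical names, and the "nodes" are simply in-memory lists processed one after another rather than separate servers. Replication and fault tolerance are left out.

```python
from collections import Counter

# Hypothetical cluster: each "node" holds its own local slice of the data.
nodes = {
    "node1": ["big data big", "hadoop"],
    "node2": ["spark big"],
    "node3": ["hadoop spark hadoop"],
}

def map_on_node(lines):
    """Runs on a node: count words using only that node's local lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Runs once: amalgamate the partial results from every node."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# The same small function is "sent" to every node; only the compact
# partial counts travel back, not the raw data.
partials = [map_on_node(lines) for lines in nodes.values()]
word_counts = reduce_counts(partials)
print(dict(word_counts))
```

Each node does a small amount of work on its own data, and the final step only combines the small per-node results, which is the essence of the approach described above.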
Originally, Hadoop consisted of a distributed file system (HDFS, Hadoop Distributed File System), and a distributed batch processing model (MapReduce). Hadoop was subsequently improved with the addition of YARN (Yet Another Resource Negotiator), which can run various processing models, including MapReduce.
Spark provides an alternative method of processing Big Data. Spark typically provides significantly faster processing than Hadoop’s MapReduce batch processing.
Big Data related jobs are increasingly in demand. For example, according to the UK website ITJobsWatch, the job ranking for the skill Big Data was 225 in Dec 2013, 171 in Dec 2014, and 89 in Dec 2015. Other Big Data related skills (e.g. Hadoop) follow a similar trend.
You can clearly see the recent and on-going increase in demand for Big Data related skills. Hence learning about Big Data, Hadoop, and Spark should prove advantageous in terms of interest, job security, and finance!
If you are new to Big Data/Hadoop/Spark here are some titles that provide a helpful starting point.
Hadoop for Finance Essentials aims to introduce Hadoop from a finance perspective, covering a broad range of topics and tools, albeit briefly. The target audience includes developers, analysts, architects and managers. No previous knowledge of Hadoop or its components is assumed.
The book is generally easy to read, has good explanations, useful diagrams, and links to websites for further information. Assertions are backed by supporting evidence. There are plenty of finance use cases for you to consider, and a good section on recommended skills.
Sometimes the examples are unnecessarily complex (e.g. online archiving). This is an introductory book, so the examples should be simple. The book’s examples relate largely to investment banking rather than finance as a whole. Most sections are brief, and not deeply technical.
This book should give you a good basic understanding of Hadoop, its scope and possibilities. It is a useful, if brief, introduction to Hadoop and its major components, using examples from investment banking.
Big Data Made Easy sets out to be "A Working Guide to the Complete Hadoop Toolset" and is both wide ranging in content, and practical in its approach. The author assumes some knowledge of Linux and SQL (but only a little), and no knowledge of Hadoop or its tools.
There is a step by step approach to tool download, installation, execution, and error checking. The following areas of functionality and associated tools are covered:
Hadoop installation (version 1 and version 2)
Web-based data collection (Nutch, Solr, Gora, HBase)
MapReduce programming (Java, Pig, Perl, Hive)
Scheduling (Fair and Capacity schedulers, Oozie)
Moving data (Hadoop commands, Sqoop, Flume, Storm)
Monitoring (Hue, Nagios, Ganglia)
Hadoop cluster management (Ambari, CDH)
Analysis with SQL (Impala, Hive, Spark)
ETL (Pentaho, Talend)
Reporting (Splunk, Talend)
The book is easy to read, with helpful explanations, screenshots, listings, outputs, and a logical flow between the sections and chapters. There are good links between the chapters, and to websites containing further information. It also steps back and puts what’s being discussed into the larger context of Big Data. The book will certainly give you more confidence in the topic.
There is much more information available on all the tools discussed. However, this book is a great starting point, and it does an excellent job of introducing the many tools in an easily understandable manner.
If I have one concern, it relates to who will use the whole book. It contains both admin and development sections, but large companies typically separate their admin and development teams.
If you want a useful working introduction and overview of the current state of Big Data, Hadoop and its associated tools, I can highly recommend this very instructive book.
Traditionally (and here we're really only talking a few years ago), Big Data has been processed using batch MapReduce jobs. However, recently there has been a move towards using Spark.
Spark provides fast in-memory processing, which can be used for various types of processing (e.g. batch, interactive, Machine Learning etc). One of its major features is the Resilient Distributed Dataset (RDD), which provides a consolidated view of data stored over different nodes. Transformations can be applied to RDDs to alter the data. However, transformations are lazy and are not evaluated until an action is performed, which can have performance advantages.
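The lazy behaviour of transformations can be illustrated with a small toy class in plain Python. This is only a sketch of the idea, not Spark's API: `LazyDataset` and its `steps` attribute are hypothetical, the "partitions" are ordinary lists standing in for data held on different nodes, and nothing here is distributed or fault tolerant.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions  # list of lists, one per pretend node
        self.steps = steps            # pending transformations, not yet run

    def map(self, fn):
        # Transformation: return a new dataset with one more pending step.
        return LazyDataset(self.partitions, self.steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, not executed.
        return LazyDataset(self.partitions, self.steps + (("filter", pred),))

    def collect(self):
        # Action: only now is each pending step applied to every partition.
        out = []
        for part in self.partitions:
            for item in part:
                keep = True
                for kind, fn in self.steps:
                    if kind == "map":
                        item = fn(item)
                    elif not fn(item):
                        keep = False
                        break
                if keep:
                    out.append(item)
        return out

calls = []

def square(x):
    calls.append(x)  # record that work actually happened
    return x * x

data = LazyDataset([[1, 2], [3, 4], [5]])
evens = data.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has been evaluated yet
result = evens.collect()          # the action triggers the whole pipeline
print(calls_before_action, result)
```

Because the steps are only recorded until `collect` is called, the whole chain can be applied in a single pass over each partition, which hints at where the performance advantage of lazy evaluation comes from.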
The Spark title that I reviewed was very awkward to read so I cannot recommend it. I had to read Learning Spark, which was reviewed by Kay Ewbank, to understand the book I was reviewing! I agree with Kay’s conclusion:
“Overall, this is a good introduction to Spark. A lot of the material could be found separately on various Internet sites, but the authors pull it all together and give a cohesive view. If you’re interested in Spark, it’s a good buy.”