|Reading Your Way Into Big Data|
|Written by Ian Stirk|
|Monday, 14 December 2015|
Page 4 of 4
Mastering Apache Spark is for anyone who wants to know more about Spark. In particular, the basic Spark components are discussed, and then Spark is extended with some of the more experimental components.
The book assumes a basic knowledge of Linux, Hadoop, Spark, and SBT, and a reasonable knowledge of Scala. The author suggests using the internet to fill any gaps in your prerequisite knowledge.
This book has well-written discussions, helpful examples, diagrams, website links, inter-chapter links, and useful chapter summaries. It contains plenty of step-by-step code walkthroughs to help you understand the subject matter.
The book describes Spark’s major components (i.e. Machine Learning, Streaming, SQL, and Graph processing), each with practical code examples. Some of the template code could form the basis of your own application code.
Several of the core Spark components are extended using less well-known components, many of which are still works in progress. I’m not sure how many readers will find these chapters/sections useful, since they often involve workarounds, and the components might not exist or might be superseded later. They can also distract from the book’s core. That said, if you enjoy working at the bleeding edge of technology, you’ll enjoy what these extensions add.
Although the book assumes some knowledge of Spark, a short introduction to it would have been useful for completeness (e.g. explaining RDDs and introducing the spark-shell). Developers coming from a Windows environment might initially struggle with Linux, SBT, JARs etc.
Despite these concerns, it contains plenty of useful detail. Spark is a rapidly changing technology, so check http://spark.apache.org/ for the latest changes. The book is highly recommended.
This section contains books that don’t fall easily into the previous sections.
HBase Essentials aims at getting you started in programming with HBase. Hadoop is the most popular platform for processing big data, and HBase is the NoSQL database included with Hadoop. The book is aimed at software developers who have no previous experience of HBase and want a hands-on approach.
The book has well-written discussions which are generally easy to read, helpful diagrams, outputs, scripts, and brief practical walkthroughs. There are useful links to other chapters.
Although the book aims to get you started in programming with HBase, it deviates from this, containing as much administration detail as programming. Sometimes terms are used before they are defined (e.g. HMaster and Zookeeper), suggesting you need some knowledge of Hadoop. It would have been helpful to list where to go next to extend your HBase knowledge.
This book will help you get up and running with HBase, show you how to use HBase from various clients, and give you an understanding of its internal structure. I can recommend it as a starter book.
Monitoring Hadoop aims to provide details on how to implement monitoring on Hadoop, and succeeds. With more Big Data systems being implemented, there’s an increasing need to monitor and report any problems on these systems as soon as possible. This short book aims to show how to use popular tools (Nagios and Ganglia) to provide a monitoring system.
The book is generally well-written, if brief in many sections. It provides useful descriptions, diagrams and scripts to install and set up monitoring on Hadoop using the popular tools Nagios and Ganglia. It is written by a Hadoop administrator for other Hadoop administrators, so you need to be familiar with Linux, administration, scripts etc. It is a quick-start guide, rather than a deep reference.
It should be noted that many popular vendors (e.g. Cloudera) have cluster managers that make installation and configuration of tools like Nagios and Ganglia much easier than the manual instructions given here, and this should have been mentioned. Some of the book’s discussions are curt.
If you want to know how to implement monitoring on Hadoop, using the popular monitoring tools Nagios and Ganglia, I can recommend this book.
Hadoop Interview Guide is a Kindle-only e-book which aims to help you pass an interview for a job as a Hadoop developer at the junior or mid-level position.
It is targeted at existing Hadoop developers, and aims to provide in-depth knowledge of Hadoop and its components. It’s also a starting point for anyone wanting to venture into the Hadoop field from other IT fields.
It is divided into 10 sections (Hadoop, HDFS, MapReduce, Flume, Sqoop, Oozie, Hive, Impala, Pig, and Java) with a total of 434 questions, and 23 example tasks to program or follow along with.
This book contains a wide range of questions about Hadoop and its components, and the answers generally provide accurate explanations with sufficient detail. The tasks to program at the end of each chapter should prove useful in demonstrating your practical understanding of the topics.
Many other common Hadoop components could have been included, e.g. HBase (NoSQL database), Spark (fast MapReduce alternative), ZooKeeper (configuration), and Mahout (machine learning). Perhaps it can be expanded in the future to include these topics. That said, the book does cover many of the core Hadoop technologies.
I don’t think you can learn the subjects directly from this book, but you can use it as a benchmark to measure how much you have learned elsewhere.
Generally the book is well written, although some of the questions have substandard English grammar. Some of the longer example code is difficult to read because it is not formatted adequately. The references at the end of the book are useful, but details of publication date, edition, and publisher are missing (one has only the author’s first name!). There is a large list of websites at the end of the book, but none of the sites is annotated.
It should be remembered that Hadoop and its components are rapidly changing, so it’s important to view the answers in the context of the version used. For example, CDH 5.3 has Hive version 0.13.1, which does not support data modifications (as answered in the book), whereas Hive version 0.14.0 does.
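As a rough illustration of why the version matters, the check below compares a Hive version string against 0.14.0. It is only a sketch: the function names `parse_version` and `supports_data_modification` are hypothetical, and the 0.14.0 threshold is taken from the statement above rather than queried from a real cluster. Versions are compared as tuples of integers because a plain string comparison would wrongly rank "0.9.0" above "0.14.0".

```python
def parse_version(text):
    """Turn a dotted version string such as '0.13.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Assumption (from the review text): data modifications are supported
# from Hive 0.14.0 onwards, but not in 0.13.1.
FIRST_VERSION_WITH_DATA_MODIFICATION = parse_version("0.14.0")


def supports_data_modification(hive_version):
    """Return True if the given Hive version is 0.14.0 or later."""
    return parse_version(hive_version) >= FIRST_VERSION_WITH_DATA_MODIFICATION


if __name__ == "__main__":
    # Print the result for the two versions mentioned in the review.
    for version in ("0.13.1", "0.14.0"):
        print(version, supports_data_modification(version))
```

When reading an answer in the book, the same question applies: which version of the component was it written against?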
This book contains a wide range of useful questions about Hadoop and many of its components. Overall a helpful book for the interview!
This article provided a review of recommended reading material for Hadoop, Big Data, and Spark, covering the following areas:
Big Data, Hadoop, and Spark are evolving technologies, with many tools/components competing for attention. Additionally, many of these tools change relatively rapidly. This has resulted in several omissions (e.g. Machine Learning). Perhaps these could be included in a future updated version of this article.
I hope I’ve succeeded in providing a pathway to learning about Big Data, Hadoop, and Spark. Remember, this is just a start. You can keep up to date with Big Data developments, news and book reviews at I Programmer.