Data Analytics With Hadoop

Authors: Benjamin Bengfort & Jenny Kim
Publisher: O'Reilly
Pages: 150
ISBN: 978-1491913703
Print: 1491913703
Kindle: B01GGQKXO4
Audience: Data Scientists familiar with Python

Rating: 4.5
Reviewer: Kay Ewbank


A book that is short and to the point - recommended

This is a book that concentrates on using Hadoop for data analysis rather than wasting time on deployment and management of Hadoop. It shows how to work in Python with MapReduce and Spark, Hive and HBase. 



The first half of the book takes a high-level view of distributed computing and aims to tell you how to run computations on a cluster. The second half then looks at the tools and techniques you might use, and explains why particular types of analysis are useful.

Having introduced the concept of the data product, the authors introduce the core concepts of the Hadoop family, focusing on YARN and HDFS.

By Chapter 3, Bengfort and Kim get to MapReduce, and in particular how to write MapReduce jobs in Python even though the MapReduce API is written in Java. Spark is introduced next, as the tool of choice for everyday interactions and analysis on a Hadoop cluster.
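
To give a flavor of what writing a MapReduce job in Python can look like, here is a minimal word-count sketch. It is not taken from the book. The function names (mapper, reducer, run_local) are hypothetical, and the whole pipeline is simulated in one process rather than run on a cluster. With Hadoop Streaming, the mapper and reducer would normally be separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(pairs):
    """Reduce phase: sum the counts for each word.

    Expects pairs sorted by key, which is what the shuffle-and-sort
    step between map and reduce guarantees on a real cluster.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Hypothetical helper: simulate map -> sort -> reduce in one process."""
    return dict(reducer(sorted(mapper(lines))))


counts = run_local(["Spark and Hadoop", "hadoop and Hive"])
for word, total in counts.items():
    print(f"{word}\t{total}")
```

The sorted() call stands in for the cluster's shuffle phase; without it, groupby would not see all the pairs for a given word together.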

Chapter 5 takes a practical look at how to write distributed data analysis jobs. The authors say that coming into this chapter, you should understand the mechanics of writing Spark and MapReduce jobs, and by the end of it you should feel comfortable actually implementing them.

A chapter on data mining and warehousing comes next, focusing on Hive, Hadoop's SQL-based query engine, along with HBase, its NoSQL database. This is followed by a chapter exploring how to get data into a distributed system. There's a good description of how to use Sqoop for bulk loading and Apache Flume for dealing with unstructured data such as logs.

Analytics with higher-level APIs is covered next, with a look at Apache Pig and the Spark DataFrames API. There's a good chapter on machine learning and how to use Spark MLlib, and the book ends with a summary chapter called Doing Distributed Data Science, in which the authors go through the whole lifecycle of distributed data science, showing how it all fits together.



I liked this book; it gives a good introduction to the Hadoop ecosystem, concentrating on the analysis side and mainly ignoring the day-to-day administration. The descriptions are good and not too long-winded. The authors give pointers to places where you can read more, so you don't miss out where they give an overview rather than a detailed explanation. The Python examples are good, and used well to explain ideas. Perhaps the most useful part of the book is the final chapter, where they show how to do an entire analytic workflow from start to finish. Well worth a read.





Related Reviews
Field Guide to Hadoop

Data Science and Big Data Analytics

Hadoop: The Definitive Guide (4th ed)

Hadoop Application Architectures

See also Reading Your Way Into Big Data



Last Updated ( Friday, 23 September 2016 )

Copyright © 2017 All Rights Reserved.