Field Guide to Hadoop
Chapters 6-8, Conclusion

 

Authors: Kevin Sitto & Marshall Presser
Publisher: O’Reilly 
ISBN: 978-1491947937
Print: 1491947934
Kindle: B00U6P2Q9M  
Audience: Managers, architects and developers new to Hadoop
Rating: 4.7
Reviewer: Ian Stirk 

 

Chapter 6 Data Transfer

Typically, data arrives in Hadoop from other systems. These include relational databases and various flat file sources. This chapter outlines some of the major tools that move data in and out of Hadoop, and between Hadoop clusters. Tools discussed include:

  • Sqoop – moves data between relational databases and HDFS. Uses MapReduce.

  • Flume – used for data collection and aggregation, especially log files.

  • DistCp – Distributed Copy, used to copy files between Hadoop clusters, e.g. to move data from a live cluster to a test cluster.

Many big data projects will require data from external sources, so the popular tools discussed in this chapter should prove very helpful.

 

Chapter 7 Security, Access Control, and Auditing

This chapter opens with a bit of history. The initial approach to security was simply to ring-fence Hadoop with a firewall; once inside, you could do much as you pleased. Things are now changing, with more granular security. Tools examined include:

  • Kerberos – provides secure network-based authentication

  • Knox – provides a secure gateway between external systems and Hadoop

Security is increasingly important, perhaps more so given the growing amount of data stored in big data systems. The tools in this chapter should help secure your data.

 

Chapter 8 Cloud Computing and Virtualization

Most Hadoop systems run on physical hardware. However, there are advantages in using cloud and virtual systems, chief among them ease of creation, on-demand scalability, and lower up-front costs. Tools examined include:

  • Serengeti – provides Hadoop virtualization. It’s very good for quickly spinning up a cluster with no need for configuration, and the size of the cluster can be changed quickly as needs demand.

  • Whirr – provides cluster deployment. Building a cluster can be expensive and time-consuming; Whirr lets you spin one up quickly to test something.

This chapter looks at some of the cloud and virtualization tools available for Hadoop. Although there are some disadvantages (often slower performance, YARN/MapReduce do not have complete control of the box), they need to be weighed against the advantages (quick to create, on-demand scalability, low up-front cost).


 

Conclusion

This book is very broad in scope, and by necessity (since it’s a field guide), shallow in depth. It provides up-to-date but limited detail on the major components of the Hadoop big data system. Helpful links are provided for further information.

The book is mostly easy to read, with a consistent layout of content (i.e. License, Activity, Purpose, Official Page, Hadoop Integration, description, tutorial link, and simple example code). Useful comparisons between tools are occasionally provided.

This book should prove helpful to managers, developers, and architects who are new to big data and want a quick overview of the major components of Hadoop.

Most Hadoop books discuss some of the components listed here, but this book contains a much wider range of components than other books. That said, there are omissions, including:

  • Hue – a popular web-based tool providing centralised access to many underlying Hadoop tools (e.g. Sqoop, Hive, Pig, Oozie, HBase, ZooKeeper, Impala, HDFS, etc.)

  • Impala – a fast parallel processing SQL query engine for Hadoop

The authors intend to update this book regularly (every year or two), which is ideal if you want to know about the current popular components, and especially good if you have access to Safari Books Online (but bad if you need to keep buying the updated book).

Where should you go next after reading this book? I would suggest gaining some detail by reading Big Data Made Easy, which I recently reviewed.

If you’re new to big data and Hadoop, and you want to quickly review what it is, and the current state of its major components, I highly recommend this small book.

 


Last Updated ( Wednesday, 22 April 2015 )