Hadoop Interview Guide

Author: Monika Singla and Sneha Poddar
Publisher: Amazon Digital Services
Pages: 320
Kindle: B00UKPQAD6
Audience: Hadoop job candidates
Rating: 4
Reviewer: Ian Stirk

This Kindle-only e-book aims to help you pass an interview for a job as a Hadoop developer at the junior or mid-level position. How does it fare?

Targeted at existing Hadoop developers, it aims to provide in-depth knowledge of Hadoop and its components. It’s also a starting point for anyone wanting to venture into the Hadoop field from other IT fields.

It is divided into 10 sections (Hadoop, HDFS, MapReduce, Flume, Sqoop, Oozie, Hive, Impala, Pig, and Java) with a total of 434 questions, and 23 example tasks to program or follow along with.

Below is a chapter-by-chapter exploration of the topics covered. 


Chapter 1 Introduction to Hadoop

Traditionally, Hadoop was considered as a combination of the Hadoop Distributed File System (HDFS) and the batch programming model MapReduce. In Hadoop 2, Yet Another Resource Negotiator (YARN) was added. However, increasingly, Hadoop is taken to mean all these things, together with Hadoop’s wider range of components.

Example questions include:

  • What is big data?

  • Why do we need a new framework for handling big data?

  • What is Hadoop? 

This chapter provides a useful introduction to Hadoop in the wider context of big data. There’s a very useful step-by-step walkthrough on how to set up a standalone Hadoop environment using Cloudera’s QuickStart VM. 

Chapter 2 HDFS

HDFS is the underlying file system for Hadoop. It has built-in functionality to split and distribute files over multiple nodes in the cluster, and to store multiple copies of these files, which helps with both resilience and parallel processing.

Example questions include:

  • What is HDFS?

  • What are the problems with Hadoop 1.0?

  • What are the functions of NameNode?

  • What is HDFS Federation?

  • What is the purpose of dfsadmin tool? 

This chapter provides some helpful questions about one of the core components of Hadoop. It’s becoming clear that you will need to know what version of Hadoop and its components you’re dealing with, since defaults can change with version (e.g. the book says HDFS default block size is 64MB, but this changed to 128MB in Hadoop 2). 
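To see why the version matters, the effect of the default block size can be worked through with simple arithmetic. The sketch below is illustrative Python, not Hadoop code: the function names are hypothetical, and a replication factor of 3 is assumed as the usual HDFS default.

```python
MB = 1024 * 1024


def hdfs_block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def stored_replicas(file_size_bytes, block_size_bytes, replication=3):
    """Total block replicas kept across the cluster (replication=3 is an assumption)."""
    return hdfs_block_count(file_size_bytes, block_size_bytes) * replication


one_gb = 1024 * MB
for block_mb in (64, 128):
    blocks = hdfs_block_count(one_gb, block_mb * MB)
    replicas = stored_replicas(one_gb, block_mb * MB)
    print(f"{block_mb} MB blocks: {blocks} blocks, {replicas} replicas")
```

For a 1 GB file this gives 16 blocks at 64MB and 8 blocks at 128MB, so any answer that quotes block counts depends on which default is in force.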

Chapter 3 MapReduce

MapReduce is a batch programming model used by Hadoop, where the work to be done is split across multiple machines (Map) and the results combined and aggregated (Reduce).

Example questions include:

  • What is MapReduce?

  • What is the fundamental idea behind YARN?

  • What’s the function of the NodeManager?

  • What are the steps to submit a Hadoop job?

  • Explain speculative execution 

Again a wide range of useful questions is provided. One of the example programs to create is the standard “count the number of words in an input file”. Perhaps a question relating to the current vogue of using Spark in place of MapReduce could have been included.
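For readers who have not met the word count exercise, its shape can be sketched in a few lines. This is a plain Python simulation of the map, shuffle, and reduce steps, not the book's program and not the Hadoop API; all function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: aggregate all the values collected for one key."""
    return word, sum(counts)


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

In a real cluster the map calls run in parallel on different nodes and the framework performs the grouping, but the division of labour is the same.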


Chapter 4 Flume

Flume is a well-known tool for moving unstructured data from various sources (e.g. log files) to various destinations (e.g. HDFS for subsequent processing).

Example questions include: 

  • What is Apache Flume?

  • Explain a common use case for Flume?

  • How to start Flume agent?

  • Describe memory channel

  • Describe Avro sink

  • Explain Interceptor interface 

This chapter asks questions about both theoretical and practical aspects of Flume’s data transfer functionality. Sections include: Basics, Configuration, Channels, Sinks, and Interceptors. The example programs illustrate the movement of data from various sources (e.g. Twitter, Netcat) to various destinations (e.g. console logger, HDFS).

Chapter 5 Sqoop

Sqoop is a well-known tool for moving structured data (e.g. from relational databases) in and out of Hadoop.

Example questions include: 

  • Explain the Sqoop import command

  • How to control the import of the subset of rows only?

  • How can Sqoop Override Type Mapping?

  • What is the purpose of Sqoop eval tool?

  • How to updating an existing Data Set in export?

  • What are the options in Sqoop for HBase? 

The chapter has an in-depth look at Sqoop’s data transfer capabilities. I particularly liked the tables giving the meaning of the various import and export features of Sqoop. Sections include: Basics, Import, and Export.

Chapter 6 Oozie

Oozie is a workflow and scheduler system for Hadoop jobs.

Example questions include: 

  • How many types of jobs are there in Oozie?

  • What is an Oozie workflow?

  • List the various types of Workflow actions

  • What is the flow of coordinator job?

  • What is the Oozie bundle system? 

The chapter provides an in-depth look at Oozie’s workflow and scheduler capabilities. The useful example programs illustrate the integration of Sqoop and Oozie workflow.

Chapter 7 Hive

Hive is Hadoop’s data warehouse, allowing queries to be processed in batch using MapReduce.

Example questions include:

  • What are the limitations of Hive?

  • What is a Metastore and what it stores?

  • What are External Tables?

  • What is bucketing?

  • Why should normalization be avoided?

  • Describe the EXPLAIN command in hive

  • How do we get authentication with Hive? 

This wide-ranging chapter contains more than 30% of the book’s questions. Sections include: Basics, Hive Query Language (DDL, DML), Partitioning and Bucketing, Views, Query Optimization, Compression, Functions and Transformations, SerDe, and Advanced Hive.  

Chapter 8 Impala

Impala allows queries to be processed interactively, often against Hive tables.

Example questions include:

  • What problem does Impala solve?

  • Describe the functioning of the statestored daemon?

  • What are the similarities between Impala and hive? 

This short chapter contains just 3% of the book’s questions. I often wonder why anyone would want to use Hive queries when Impala is available, since it can query the Hive tables much faster.

Chapter 9 Pig

Pig provides workflow and scripting functionality, at a higher level than Java and MapReduce programming.

Example questions include: 

  • In which scenario MapReduce is a better fit than Pig?

  • What the different ways to develop PigLatin scripts?

  • What is a relation in Pig?

  • List the different Pig data loaders

  • How to register a UDF in Pig?

  • What is a skewed join? 

Java can be a difficult language to learn, and Pig provides an easier way of programming MapReduce. Perhaps in the future, higher level tools (e.g. Pig and the various querying languages) will be used for most processing, and Java for only the low-level complex work. Sections include: Basics, Datatypes, Pig Latin, and Joins.

Chapter 10 Java Refresher for Hadoop

Java is a general purpose language, often used as the default language with various Hadoop components.

Example questions include: 

  • What are transient variables?

  • What is final, finally, and Finalize?

  • What is the difference between an ArrayList and a LinkedList?

  • Does Java support multiple inheritance? 

This section really is just a brief refresher. It contains a list of Java questions that you might be asked when going for a junior Java developer role.

Conclusion

This book contains a wide range of questions about Hadoop and its components, and the answers generally provide accurate explanations with sufficient detail. The tasks to program at the end of each chapter should prove useful in demonstrating your practical understanding of the topics.

Many other common Hadoop components could have been included, e.g. HBase (NoSQL database), Spark (fast MapReduce replacement), ZooKeeper (configuration), and Mahout (machine learning). Perhaps it can be expanded in the future to include these topics. That said, the book does cover many of the core Hadoop technologies.

I don’t think you can learn the subjects directly from this book, but you can use it as a benchmark to measure how much you have learned elsewhere.

Generally the book is well written; however, some of the questions have substandard English grammar. Some of the longer example code is difficult to read because it is not formatted adequately. The references at the end of the book are useful, but details of publication date, edition, and publisher are missing (one has only the author’s first name!). There is a large list of websites at the end of the book, but none of the sites is annotated.

It should be remembered that Hadoop and its components are changing rapidly, so it’s important to view the answers in the context of the version used. For example, CDH 5.3 has Hive version 0.13.1, which does not support data modifications (as answered in the book), whereas Hive version 0.14.0 does.

This book contains a wide range of useful questions about Hadoop and many of its components. Overall a helpful book for the interview!


Last Updated ( Friday, 03 July 2015 )