Author: Rajiv Tiwari
Publisher: Packt Publishing
Audience: Hadoop beginners
Reviewer: Ian Stirk
This book aims to introduce Hadoop from a finance perspective. How does it fare?
Big data is a growing area of technological interest, and Hadoop is the most popular platform for implementing big data. An introductory book should prove a useful entry point into this field.
The target audience includes developers, analysts, architects and managers. No previous knowledge of Hadoop or its components is assumed, and the examples relate to the finance industry.
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Big Data Overview
This chapter opens with a look at what big data is, typically defined by the 3 Vs: huge volumes, high velocity, and a wide variety of data. In essence, traditional relational database management systems (RDBMS) are no longer able to process data for some applications in a timely manner, hence the need for big data systems.
The chapter continues with an overview of the history of big data, starting with the Nutch project and the Google papers (Google File System and MapReduce), followed by Doug Cutting joining Yahoo! and developing what was to become Hadoop. Many of the largest web companies run on Hadoop, and an increasing number of companies are now taking advantage of this distributed processing platform.
A high-level overview of the Hadoop architecture is given (clusters, nodes, racks). Next, Hadoop and its various components are briefly described, with core Hadoop consisting of the Hadoop Distributed File System (HDFS) storage, and MapReduce processing. Hadoop components outlined include: HBase, Hive, Pig, Flume, Oozie and Sqoop.
The chapter ends with a look at Hadoop distributions. These are pre-packaged stacks consisting of Hadoop and its various components, in which the versions of each are known to work together.
This chapter provides a useful introduction to big data and Hadoop. The need for Hadoop is described, together with a brief history. The Hadoop architecture is introduced, together with various associated components.
In describing NoSQL databases, the terms BASE and ACID are used, but not explained. Core Hadoop is described as being composed of HDFS and MapReduce. I think this is an old definition; perhaps YARN (Yet Another Resource Negotiator) and in-memory processing (e.g. Spark) should also be included, and in some circles the definition also includes Hadoop’s components.
This chapter is easy to read, has good if brief explanations, useful diagrams, and links to websites for further information. Assertions have supporting evidence. These traits apply to the book as a whole.
Chapter 2 Big Data in Financial Services
The chapter opens by discussing why the finance industry is especially suitable for big data processing: it’s the industry that generates the highest volume of data (e.g. the New York Stock Exchange creates a terabyte of data each day).
The chapter continues with a very brief look at big data use cases across various industries. A graphic from McKinsey shows how the finance industry is expected to gain the most from big data processing. Use cases within the finance industry are briefly described, including: archiving to HDFS, regulatory work, fraud detection, risk analysis, behaviour prediction, and sentiment analysis.
Next, some popular Hadoop tools to learn are outlined. These include:
- For querying HDFS data: Pig, Hive, and MapReduce
- For SQL querying: Hive, SparkSQL, and Impala
- For real-time processing: Spark and Kafka
- For analytics and Business Intelligence (BI): Tableau and Pentaho
The chapter ends with a look at implementing a big data project in finance. The standard practices of gathering user requirements, gap analysis and project planning still apply. Various concerns are briefly discussed, including: being the first big data project, getting skilled staff, and security.
This chapter looks at Hadoop in the context of the finance industry. It shows why the finance industry has the most to gain from the Hadoop platform. Useful non-finance and finance use cases are presented. There’s a very helpful section on popular Hadoop skills. There are a great many, and I wonder if anyone can be proficient in more than a few.
The chapter mentions that Hive does not allow updates or transactions. However, both these features were present in version 0.14.0, released in November 2014. Since this book was released in April 2015, the author and technical reviewers should have been aware of this.
Chapter 3 Hadoop in the Cloud
The cloud provides an environment where Hadoop applications can run with low setup costs. It’s especially suitable for proof-of-concept work, and for systems with unpredictable resource demand.
The chapter opens with the advantages of cloud computing, which include reduced infrastructure costs and elastic capacity. Security and performance are viewed as disadvantages. It goes on to look at how to implement a risk simulation project in the cloud. The risk measurement used is Value-at-Risk (VaR), which uses Monte Carlo simulation to estimate the probability of loss under various scenarios. The example given suggests it would take 20-30 hours to complete on a standard platform, but less than 1 hour on Hadoop. Details are provided to implement the solution using Amazon Web Services (AWS) and Elastic MapReduce (EMR).
This chapter provides a useful overview of the advantages of implementing Hadoop applications in the cloud. Some potential drawbacks are also discussed. The implementation of the VaR calculations should prove useful if you want to implement your own proof-of-concept project in the cloud.
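To make the VaR idea concrete, here is a minimal single-machine sketch of a Monte Carlo VaR estimate. It is not the book's code, which runs on EMR. The function name, the parameter values, and the assumption of normally distributed returns are all illustrative choices of mine.

```python
import random


def monte_carlo_var(portfolio_value, mean_return, volatility,
                    confidence=0.99, n_scenarios=100_000, seed=42):
    """Estimate one-period VaR by simulating portfolio returns.

    Assumption: returns are normally distributed. Real risk models
    are usually more elaborate than this.
    """
    rng = random.Random(seed)
    # Simulate the profit or loss of the portfolio in each scenario.
    pnl = [portfolio_value * rng.gauss(mean_return, volatility)
           for _ in range(n_scenarios)]
    pnl.sort()
    # Take the outcome at the (1 - confidence) quantile, i.e. the
    # boundary of the worst 1% of scenarios when confidence is 0.99.
    index = int((1.0 - confidence) * n_scenarios)
    # Report VaR as a positive loss figure.
    return -pnl[index]


if __name__ == "__main__":
    # Hypothetical portfolio: 1,000,000 with 2% daily volatility.
    var_99 = monte_carlo_var(1_000_000, 0.0, 0.02)
    print(f"1-day 99% VaR: {var_99:,.0f}")
```

Each scenario is independent of the others, which is why this kind of workload can be spread across many nodes on a platform such as Hadoop.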
Chapter 4 Data Migration Using Hadoop
This chapter discusses some common use cases for migrating trade data from an RDBMS into Hadoop, in essence providing online archiving. Hadoop’s cheaper storage provides both an archive facility and enables the data to be easily queried.
The first phase of the project splits new trade data between the RDBMS and Hadoop (the database holds the most important columns, whereas Hadoop holds all the columns). The second phase moves RDBMS transactions that are older than one year into Hadoop. Walkthroughs with code are provided for both project phases, together with code to query the resulting data stores.
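The logic of the two phases can be sketched in plain Python, with dictionaries standing in for table rows. This is only an illustration of the idea, not the book's code, which uses Hadoop tooling. The column names and function names are hypothetical, since the trade schema is not given here.

```python
from datetime import date, timedelta

# Hypothetical "most important" columns kept in the database.
KEY_COLUMNS = ("trade_id", "trade_date", "symbol", "quantity", "price")


def split_new_trade(trade):
    """Phase 1: the database row keeps the key columns only,
    while the Hadoop row keeps every column."""
    rdbms_row = {col: trade[col] for col in KEY_COLUMNS}
    hadoop_row = dict(trade)
    return rdbms_row, hadoop_row


def partition_by_age(rows, today, max_age_days=365):
    """Phase 2: separate rows that stay in the database from rows
    older than the cutoff, which move to Hadoop."""
    cutoff = today - timedelta(days=max_age_days)
    keep = [r for r in rows if r["trade_date"] >= cutoff]
    move = [r for r in rows if r["trade_date"] < cutoff]
    return keep, move
```

For example, calling `partition_by_age(rows, today=date(2015, 4, 1))` would mark any row with a `trade_date` before 1 April 2014 for the move to Hadoop.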
This chapter provides practical details to implement a very common use case: archiving data from an RDBMS to Hadoop. I thought the two examples given were unnecessarily complex. It would have been better to show the simple migration of an RDBMS table to Hadoop, with no need to complicate the picture with two phases and intricate detail about splitting the input. We are here to learn, so make it simple! I was surprised that Flume wasn’t discussed as another popular data migration tool.
Sqoop commands that supply a user ID and password to access the SQL Server database are included. However, many enterprise-level organizations do not allow this type of access. Instead, integrated security needs to be used; this is possible via Sqoop, but is not well documented.