Hadoop for Finance Essentials
Author: Rajiv Tiwari
This book aims to introduce Hadoop from a finance perspective. How does it fare?
Big data is a growing area of technological interest, and Hadoop is the most popular platform for implementing big data. An introductory book should prove a useful entry point into this field.
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Big Data Overview
The chapter opens with an overview of the history of big data, starting with the Nutch project, the Google papers (Google File System and MapReduce), and Doug Cutting joining Yahoo! and developing what was to become Hadoop. Many of the largest web companies run on Hadoop, and an increasing number of companies are now taking advantage of this distributed processing platform.
A high-level overview of the Hadoop architecture is given (clusters, nodes, racks). Next, Hadoop and its various components are briefly described, with core Hadoop consisting of the Hadoop Distributed File System (HDFS) storage, and MapReduce processing. Hadoop components outlined include: HBase, Hive, Pig, Flume, Oozie and Sqoop.
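To make the core HDFS-plus-MapReduce model concrete, here is a minimal sketch of the map, shuffle, and reduce phases using a word count, the canonical MapReduce example. This is a single-process illustration of the programming model only, not the Hadoop API; a real job runs mappers and reducers across HDFS blocks on many nodes.

```python
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs, as a Hadoop mapper would for each input record.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Group values by key, as the framework's shuffle/sort step does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word, as a Hadoop reducer would.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data needs big platforms", "hadoop processes big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts["big"])  # "big" appears three times across the documents
```

The value of the model is that the map and reduce functions contain no distribution logic; the framework handles partitioning, shuffling, and fault tolerance.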
The chapter ends with a look at Hadoop distributions; these are pre-packaged stacks consisting of Hadoop and its various components, in which the versions of each are known to work together.
In describing NoSQL databases, the terms BASE and ACID are used, but not explained. Core Hadoop is described as being composed of HDFS and MapReduce; I think this is an old definition – perhaps YARN (Yet Another Resource Negotiator) and in-memory processing (e.g. Spark) should also be included, and in some circles the definition also includes Hadoop's components.
This chapter is easy to read, has good if brief explanations, useful diagrams, and links to websites for further information. Assertions have supporting evidence. These traits apply to the book as a whole.
The chapter continues with a very brief look at big data use cases across various industries. A graphic from McKinsey shows how the finance industry is expected to gain the most from big data processing. Use cases within the finance industry are briefly described, including: archiving to HDFS, regulatory work, fraud detection, risk analysis, behaviour prediction, and sentiment analysis.
Next, some popular Hadoop tools to learn are outlined; these include:
The chapter ends with a look at implementing a big data project in finance. The standard practices of gathering user requirements, gap analysis and project planning still apply. Various concerns are briefly discussed, including: being the first big data project, getting skilled staff, and security.
This chapter looks at Hadoop from the context of the finance industry. It shows why the finance industry has the most to gain from the Hadoop platform. Useful non-finance and finance use cases are presented. There’s a very helpful section on popular Hadoop skills – there are a great many, I wonder if anyone can be proficient in more than a few?
The chapter opens with the advantages of cloud computing; these include reduced infrastructure costs and elastic capacity. Security and performance are viewed as disadvantages. It goes on to look at how to implement a risk simulations project in the cloud. The risk measure used is Value-at-Risk (VaR), which uses Monte Carlo simulation to estimate potential losses under various scenarios. The example given suggests it would take 20-30 hours to complete on a standard platform, but less than 1 hour on Hadoop. Details are provided to implement the solution using Amazon Web Services (AWS) and Elastic MapReduce (EMR).
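The Monte Carlo approach behind a VaR calculation can be sketched in a few lines. The version below is an assumption-laden simplification (normally distributed daily returns, a single position); the book's Hadoop/EMR implementation distributes many more simulations of richer scenarios, but the statistical idea is the same.

```python
import random

def monte_carlo_var(portfolio_value, mu, sigma, horizon_days,
                    confidence, n_sims, seed=42):
    """Estimate Value-at-Risk by simulating portfolio returns.

    Assumes i.i.d. normally distributed daily returns -- a deliberate
    simplification; real risk engines use far richer scenario models.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(n_sims):
        # Simulate a cumulative return over the holding horizon.
        ret = sum(rng.gauss(mu, sigma) for _ in range(horizon_days))
        losses.append(-portfolio_value * ret)
    losses.sort()
    # VaR at the given confidence level is that quantile of the loss
    # distribution: only (1 - confidence) of scenarios lose more.
    idx = int(confidence * n_sims)
    return losses[idx]

var_99 = monte_carlo_var(1_000_000, mu=0.0005, sigma=0.01,
                         horizon_days=10, confidence=0.99, n_sims=50_000)
print(f"10-day 99% VaR: {var_99:,.0f}")
```

Each simulated path is independent, which is exactly why the workload parallelizes so well on a MapReduce cluster: mappers run batches of simulations, and a reducer assembles the loss distribution.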
This chapter provides a useful overview of the advantages of implementing Hadoop applications in the cloud, some potential drawbacks are also discussed. The implementation of the VaR calculations should prove useful if you want to implement your own proof-of-concept project in the cloud.
Chapter 4 Data Migration Using Hadoop
The first phase of the project splits new trade data between the RDBMS and Hadoop (the database holds the most important columns, whereas Hadoop holds all the columns). The second phase moves RDBMS transactions that are older than one year into Hadoop. Walkthroughs with code are provided for both project phases, together with code to query the resulting data stores.
This chapter provides practical details to implement a very common use case – archiving data from an RDBMS to Hadoop. I thought the two examples given were unnecessarily complex – it would have been better to show the simple migration of an RDBMS table to Hadoop, with no need to complicate the picture with two phases and intricate detail about splitting the input etc. – we are here to learn, so make it simple! I was surprised that Flume wasn't discussed as another popular data migration tool.
Sqoop commands, which supply a userid and password to access the SQL Server database, are included. However, many enterprise-level organizations do not allow this type of access; instead, integrated security needs to be used. This is possible via Sqoop, but is not well documented.
Last Updated: Friday, 28 August 2015