Hadoop for Finance Essentials

Author: Rajiv Tiwari
Publisher: Packt Publishing
Pages: 168
ISBN: 978-1784395162
Print: 1784395161
Kindle: B00X3TVGJY

Chapter 5 Getting Started
Having looked at the background of big data, various Hadoop tools, and a sample data migration project, this chapter takes a look at how to implement a larger regulatory risk project in Hadoop.

The chapter opens with a look at why regulatory reporting is important: many companies have gone bust because of poor risk evaluation. A risk problem is outlined (duplicate, disparate data), along with a potential solution (a simpler, centralized data store in Hadoop). Standard naming conventions for configuration files, directories, etc. are given. Next come details of how to implement the required system. Code is provided to ingest data using Oozie (a workflow scheduler), and there is a discussion of how the same solution can be achieved with popular ETL tools.

Next, details are given on how to transform the stored data. Code examples are provided using Hive, Pig and Java MapReduce. There’s a helpful, if simple, flowchart on how to choose between using Hive, Pig and Java MapReduce.

The chapter ends with a look at data analysis, specifically the integration of BI tools with Hadoop. An example shows how to set up the Hortonworks Hive ODBC Driver and use it to access Hadoop data via a QlikView client.

This chapter provides a useful overview of how to implement a larger Hadoop project. The suggestions for standards are particularly useful. The chapter title seems inappropriate, since several Hadoop projects have already been implemented in previous chapters.

Chapter 6 Getting Experienced
This chapter provides details on how to implement real time processing on Hadoop. The specific use case examines credit card fraud detection, a huge problem for the finance industry.

The chapter opens with a look at what real time big data is. Definitions of real time vary, but it’s typically taken to mean a few seconds or less. Some real time processing is actually micro-batches (e.g. Spark). Tools for real time processing are briefly examined, including: Spark, Storm, and Kafka.

The chapter continues with an overview of the project, which identifies fraudulent transactions. The proposed solution involves identifying transactions that are outliers. Historic transactions are used as input to a Markov chain model (processed as MapReduce batch jobs), and current transactions (held on a queue in Kafka) are compared with the model to identify outliers. Code is provided for the MapReduce jobs and the Kafka queues. The Storm and Spark real time architectures are briefly outlined, and code is provided (for both) to implement data collection and transformations.
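To make the outlier idea concrete, here is a minimal single-machine sketch of Markov chain scoring. It is not the book's code: the state labels (LOW, MED, HIGH), the add-one smoothing, the function names, and the threshold are all assumptions chosen for illustration, and the book runs the training step as MapReduce jobs rather than in memory.

```python
from collections import defaultdict
import math


def train_transitions(histories, smoothing=1.0):
    """Estimate P(next_state | state) from historic state sequences.

    Add-one style smoothing (an assumption) keeps unseen transitions
    from getting zero probability.
    """
    states = sorted({s for seq in histories for s in seq})
    counts = defaultdict(lambda: defaultdict(float))
    for seq in histories:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev in states:
        total = sum(counts[prev].values()) + smoothing * len(states)
        probs[prev] = {
            nxt: (counts[prev][nxt] + smoothing) / total for nxt in states
        }
    return probs


def avg_log_likelihood(seq, probs, floor=1e-6):
    """Mean log-probability of the transitions in seq; lower is more unusual."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 0.0
    total = sum(math.log(probs.get(p, {}).get(n, floor)) for p, n in pairs)
    return total / len(pairs)


def is_outlier(seq, probs, threshold):
    """Flag a sequence whose average log-likelihood falls below threshold."""
    return avg_log_likelihood(seq, probs) < threshold


# Hypothetical histories: each transaction is bucketed by amount.
histories = [
    ["LOW", "LOW", "MED", "LOW"],
    ["LOW", "MED", "LOW", "LOW"],
    ["MED", "LOW", "LOW", "HIGH"],
]
probs = train_transitions(histories)
normal = ["LOW", "LOW", "MED"]
unusual = ["LOW", "HIGH", "HIGH"]
```

In a real time setting, the incoming sequence would be read from the queue and scored against a model that the batch jobs refresh periodically.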

This chapter provides a practical implementation of a real time fraud detection use case. The chapter’s title is incorrect; it should read “Real time processing in Hadoop”.

Chapter 7 Scale It Up
Having looked at implementing individual Hadoop applications, this chapter looks at how various Hadoop systems should be integrated. With a piecemeal approach, each department can end up with its own Hadoop system. A better approach is usually to have a single Hadoop system.

The chapter suggests getting business users involved in projects early. Various project considerations are introduced briefly, including: choosing projects with clear benefits, starting with small projects, data lakes, lambda architecture, and security/privacy. These are examined in more detail later.

The chapter looks at some more big data use cases, each outlining a problem and a brief solution. They include: customer complaints, algorithmic trading, and social media based trading.

Next, data lakes are examined. Their purpose is to prevent Hadoop silos by combining relational databases with Hadoop. The relational database processes low-volume but high-value data, while Hadoop processes high-volume and new types of data. Some analysis tools are listed (e.g. SAP).

The chapter then looks at lambda architecture, which combines batch and real-time processing on one platform. Aggregated views are built over the historical data, and these can be combined with newly received data at query time. Periodically the new data is moved into the historic data and the views are recalculated. The chapter ends with a look at security. This involves authentication (Kerberos and file system security), authorization (mapping Kerberos principals to usernames), and encryption.
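The lambda pattern can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions, not the book's implementation: the class name `LambdaStore`, its methods, and the per-account running total are hypothetical, and a real system would keep the historic data in HDFS and the recent data in a streaming layer.

```python
from collections import Counter


class LambdaStore:
    """Toy lambda architecture: a batch view plus a real-time layer."""

    def __init__(self):
        self.master = []             # historic events, append-only
        self.recent = []             # events received since the last batch run
        self.batch_view = Counter()  # aggregated view over the historic data

    def ingest(self, account, amount):
        """New data lands in the real-time layer first."""
        self.recent.append((account, amount))

    def run_batch(self):
        """Move new data into the historic data and recalculate the view."""
        self.master.extend(self.recent)
        self.recent = []
        self.batch_view = Counter()
        for account, amount in self.master:
            self.batch_view[account] += amount

    def query(self, account):
        """Combine the batch view with data not yet covered by it."""
        speed = sum(a for acc, a in self.recent if acc == account)
        return self.batch_view[account] + speed


store = LambdaStore()
store.ingest("A", 100)
store.ingest("A", 50)
total_before_batch = store.query("A")  # served entirely from recent data
store.run_batch()
store.ingest("A", 25)
total_after_batch = store.query("A")   # batch view plus one recent event
```

The point of the design is that queries stay correct whether or not the batch job has caught up, because the two layers never hold the same event at the same time.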

This chapter provides a useful list of topics to consider when you want to scale up your Hadoop systems. There’s a useful list of finance use cases, which should give you ideas for your own systems. Integration of Hadoop and relational databases via data lakes is explained, as is the integration of historic data and new data via lambda architecture.

Chapter 8 Sustain the Momentum
The chapter opens with a look at the Hadoop distribution upgrade cycle, with important updates every 6 to 12 months. This often runs contrary to the slow upgrade policy of many finance houses (e.g. some still use Microsoft Office 2003). It then looks at best practices and standards, including:

Business: share successes, prototype on public cloud
Infrastructure: have large HDFS blocksize, use compression
Coding: clean data early, unit test mappers etc
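
As an illustration of the coding advice, the sketch below shows a mapper written as a plain function so that it can be unit tested without a cluster. The record layout (`trade_id,desk,notional`) and the function name `map_trade` are assumptions made for this example and do not come from the book.

```python
def map_trade(line):
    """Map one CSV line 'trade_id,desk,notional' to (desk, notional) pairs.

    Malformed rows are dropped here, in the mapper, so that later
    stages only ever see clean data.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return  # wrong number of fields
    _, desk, raw = parts
    if not desk:
        return  # missing key
    try:
        notional = float(raw)
    except ValueError:
        return  # non-numeric amount
    yield desk.upper(), notional


# Simple checks that could live in a unit test suite.
good = list(map_trade("T1, rates ,1000.5"))
short = list(map_trade("bad line"))
non_numeric = list(map_trade("T2,fx,abc"))
```

Keeping the mapper logic free of I/O is what makes this kind of test cheap to write and fast to run.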

The chapter ends with a look at new trends, including: Hive/Pig getting faster with each release, the increasing use of in-memory processing (e.g. Spark), and the growth of machine learning and R.

This chapter provides some helpful discussions concerning the Hadoop distribution upgrade cycle and how it integrates (or not) with other finance related software upgrades. There are some useful best practices and standards suggested.

Conclusion
This book aims to introduce Hadoop from a finance perspective, and generally succeeds, covering a broad range of topics and tools, albeit briefly.

The book is generally easy to read, has good explanations, useful diagrams, and links to websites for further information. Assertions are backed by supporting evidence. There are plenty of finance use cases for you to consider, and a good section on recommended skills.

Sometimes the examples are unnecessarily complex (e.g. online archiving). This is an introductory book, so the examples should be simple. The book’s examples relate largely to investment banking rather than finance as a whole. Most sections are brief, and not deeply technical.

This book should give you a good basic understanding of Hadoop, its scope and possibilities. It should take your level of understanding from level 1 to level 4.

Perhaps the next book to read is Big Data Made Easy, which I gave a rating of 4.5 in my review. I found it to be a useful working introduction and overview of the current state of big data, Hadoop and its associated tools.

This book is a useful, if brief, introduction to Hadoop and its major components, using examples from investment banking.

Last Updated ( Friday, 28 August 2015 )