|Programming Elastic MapReduce|
Authors: Kevin J Schmidt and Christopher Phillips
Aimed at: data developers who want to learn Elastic MapReduce
Reviewed by: Kay Ewbank
This slim book has a simple aim – to show developers and data center managers how to make use of Elastic MapReduce, Amazon’s pay-as-you-go Hadoop solution.
Early on in the book, the authors make a really interesting point. They explain that when NASA landed the Curiosity rover on Mars in 2012, it used stacks of AWS servers to support 25 Gbps of data throughput, so that scientists and anyone else who was interested could get up-to-the-minute information about the rover and the landing.
The significance of this is that in the old days, only a government or large multinational corporation could have access to resources on that scale. If you wanted to analyze large volumes of data you started by building a data center, buying a mainframe, installing miles of networking and setting up complex database servers. Now, if you’ve a laptop and a credit card you can get started, paying as you go for the amount of data and analysis you actually use. Anyone can be a big data analyst.
The book has just five chapters. The first introduces cloud computing, Amazon Web Services and Elastic MapReduce. The authors use as their example throughout the book an application to analyze log data, partly on the basis that most of us have access to plenty of logfiles brimming with data that can be analyzed, without needing to look further for sample data. Having introduced the services, the next chapter covers the Amazon tools for collecting and examining the log data. They use a Linux Bash script to generate some syslog data, move it to S3 storage, and go as far as generating a custom JAR MapReduce job. You’re shown how to create and run an Amazon EMR cluster, view the results and debug the job flow.
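The book’s generator is a Bash script, but the idea is simple enough to sketch in a few lines of Python. This is my own illustration, not the authors’ code – the function names, hosts and messages are all invented stand-ins:

```python
import random
from datetime import datetime, timedelta

HOSTS = ["web01", "web02", "db01"]
PROCS = ["sshd", "cron", "nginx"]
MESSAGES = ["Connection closed", "Session opened", "ERROR authentication failure"]

def make_syslog_line(when, rng):
    """Build one syslog-style line: 'Mon DD HH:MM:SS host process: message'."""
    return "%s %s %s: %s" % (when.strftime("%b %d %H:%M:%S"),
                             rng.choice(HOSTS),
                             rng.choice(PROCS),
                             rng.choice(MESSAGES))

def generate_log(path, count=1000, seed=42):
    """Write `count` fake syslog lines, one second apart, to `path`."""
    rng = random.Random(seed)
    start = datetime(2014, 1, 1)
    with open(path, "w") as f:
        for i in range(count):
            f.write(make_syslog_line(start + timedelta(seconds=i), rng) + "\n")

# The resulting file would then be copied to S3 as the EMR job's input, e.g.:
#   aws s3 cp sample.log s3://your-bucket/input/
```

The bucket name above is a placeholder; in the book the upload target is whatever S3 bucket you have created for the job.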
Chapter 3 covers data filtering design patterns and scheduling work. The authors work through the Mapper, Reducer and Driver code for filtering the data, then for building summary counts. They then show how to schedule jobs using Amazon’s Elastic MapReduce Ruby client utility and AWS Data Pipeline.
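The book’s Mapper and Reducer are Java classes; the same division of labour can be sketched Hadoop-Streaming-style in Python. This is my sketch, with an invented filtering rule (keep syslog lines containing ERROR, count them per process), not the authors’ example:

```python
from itertools import groupby

def mapper(lines):
    """Filtering mapper: keep only lines containing 'ERROR' and emit
    (process, 1) pairs. Assumes syslog-style lines of the form
    'Mon DD HH:MM:SS host process: message'."""
    for line in lines:
        if "ERROR" in line:
            parts = line.split()
            if len(parts) >= 5:
                yield parts[4].rstrip(":"), 1

def reducer(pairs):
    """Summary-count reducer: sum the values for each key."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)
```

In real Hadoop Streaming the mapper and reducer would read stdin and write tab-separated pairs to stdout; the sort between the two phases is what the `sorted()` call stands in for here.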
Next to be tackled is data analysis with Hive and Pig in EMR. This starts by assuming you know nothing about Pig and Hive, and works through how to use both in EMR, how to explore data using Pig Latin and Hive, and how to find the Top 10 with Hive. You wouldn’t be either a Pig or a Hive expert after this chapter, but you’d know why they’re useful and how to run them with EMR.
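The Top 10 query in the chapter is written in HiveQL, along the familiar lines of a GROUP BY with ORDER BY and LIMIT. Purely as an illustration of what such a query computes (this is not the book’s code), the same idea fits in a few lines of Python:

```python
from collections import Counter

def top_n(values, n=10):
    """Count occurrences and return the n most common, mirroring a Hive
    query of the shape:
      SELECT v, COUNT(*) AS cnt FROM t GROUP BY v ORDER BY cnt DESC LIMIT n
    """
    return Counter(values).most_common(n)
```

The point of Hive, of course, is that it runs this as distributed MapReduce over data far too big to fit in one in-memory Counter.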
Machine Learning with EMR is next on the list, using Python to implement k-means clustering to find data clusters. The authors discuss how machine learning can be used to create systems that can take action or recommend a solution to a problem. As with earlier chapters, this is definitely an introduction to a massive topic, but it is well written and informative.
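The k-means algorithm itself is short enough to sketch. This is my own minimal version, not the book’s code, and it takes the naive shortcut of using the first k points as initial centroids (a real implementation would pick them randomly or with k-means++):

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))

def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its cluster; repeat."""
    centroids = [list(p) for p in points[:k]]  # naive initialisation
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster went empty
                dims = len(cluster[0])
                centroids[i] = [sum(p[d] for p in cluster) / len(cluster)
                                for d in range(dims)]
    return centroids
```

On EMR the interesting part is distributing the assignment step as a map phase and the re-averaging as a reduce phase, which is what the book’s chapter is about.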
The final chapter looks at planning AWS projects and managing costs, covering techniques such as making use of Amazon regions and availability zones, and options such as reserved and spot instances. The advice for reducing project costs gives you a list of things to bear in mind when developing an app, such as the fact that AWS charges by the hour, so if you set up a ten-instance cluster that fails almost immediately and only runs for a minute, you’ll still be charged for ten instance-hours – one hour on each instance. There’s some interesting advice scattered through the chapter that could definitely save you money.
I enjoyed this book a lot, and wished it were longer. On the other hand, the fact that the authors limit themselves to a really clear introduction is probably why it is so enjoyable. If you want to experiment with Elastic MapReduce, this is a good way to learn.