Mahout in Action
Authors: Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman
Machine learning is a topic that sounds theoretical but has an increasing number of practical uses. This book provides an introduction to how to make use of Mahout to create useful apps that analyze large data sets.
Mahout is an open source machine learning library from Apache that can be used on top of Hadoop for recommender engines (collaborative filtering), clustering, and classification. If you’re trying to write an app that recommends items to your customers based on what they’ve already bought, Mahout could provide the way to do it. Similarly, if you want to group data based on underlying patterns such as income range or location, clustering using Mahout is an option. The third area covered by Mahout is classification, the concept of working out how much a data item belongs in a particular group (spam email, for example) based on how much it matches previously identified patterns. Machine learning lets you carry out analysis of these types on very large data sets. Until recently this would have been too expensive for most developers, but open source frameworks such as Hadoop, MapReduce and Mahout bring it into everyone’s reach.
After a brief introduction to what Mahout is, Mahout in Action is split into three main parts, covering recommendations, clustering and classification. Each part starts with an introduction to the topic and moves through to a complete solution. One thing to be aware of is that the code in the book was written for and tested with Mahout 0.5. Mahout is now at 0.8, and there have been a lot of changes since 0.5, so the code samples won’t necessarily work ‘as is’ without some revisions. The concepts are still the same, and the code for much of the book is on GitHub, so it’s more of an irritation than a major drawback. When you buy the book you get access to a free e-book version that includes mini videos that go into more detail on the complex ideas.
The authors have concentrated on Mahout rather than the theories behind machine learning; this is a more practical approach. The section on recommendations starts with an explanation of what recommendations are, and how to evaluate recommenders using precision and recall. The authors then look at how to represent recommender data using preference arrays and in-memory data models. There’s a useful description of how you should deal with situations where you don’t have preference values, for example when a user and item are associated but you don’t know how strongly. Next, the authors look at how to make recommendations, starting with the user-based recommender (which recommends items liked by users whose tastes are similar to yours). The topic of similarity metrics is next. Essentially this is how to decide when one user is similar to another so you can use one user’s preferences to recommend items to another similar user. This topic is important enough to take up the rest of a sizeable chapter, with coverage of item-based recommendation (if this item is similar to that one, we can recommend it); slope-one recommendation (on average, people rate this item x points higher than that item); and a range of other new and experimental techniques. There’s a chapter looking at taking recommenders to production using a case study of a dating site, and the section ends with a chapter on how you can distribute recommendation computations. This includes how to analyze a massive data set from Wikipedia; how to produce recommendations with Hadoop and MapReduce; and finally how to pseudo-distribute a recommender. This last section shows how to run a non-distributed application on multiple machines.
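To make the idea of a similarity metric driving a user-based recommender more concrete, here is a minimal sketch. It is plain Python, not Mahout's Java API and not code from the book; the preference data, the function names and the choice of Pearson correlation as the metric are all illustrative assumptions.

```python
from math import sqrt

# Hypothetical preference data: user -> {item: rating}
PREFS = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 2.0, "C": 5.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "C": 2.0, "D": 1.0},
}

def pearson(prefs, u, v):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = sorted(set(prefs[u]) & set(prefs[v]))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [prefs[u][i] for i in common]
    ys = [prefs[v][i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def recommend(prefs, user, how_many=1):
    """Rank items the user has not rated by a similarity-weighted average
    of the ratings given by positively correlated users."""
    totals, weights = {}, {}
    for other in prefs:
        if other == user:
            continue
        sim = pearson(prefs, user, other)
        if sim <= 0:
            continue  # ignore dissimilar users
        for item, rating in prefs[other].items():
            if item in prefs[user]:
                continue  # only score unseen items
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    scored = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in scored[:how_many]]

print(recommend(PREFS, "alice"))
```

In this toy data only bob correlates positively with alice, so the one item bob has rated that alice has not is what gets recommended. A production recommender would also need neighbourhood size limits and handling for sparse data, which the sketch leaves out.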
Clustering is the topic covered in Part Two of the book. What is covered seemed to me to be well written, but Mahout has added a lot of extras to clustering since 0.5, including implementations of K-Trusses, Top-Down and Bottom-Up clustering, and a lot of cluster display options. Reasonably enough, these aren’t covered in the book; it’s a problem when software is changing so rapidly.
What is covered is an introduction to clustering and how to represent data using vectors. The ideas behind clustering are explained well, though there’s a bit of a step change from the gentle ‘we all cluster similar things even using simple measures such as salty or sweet’, to working out k-means and Euclidean distance measures without much of a pause for breath. The Mahout algorithms that are covered are K-means, Fuzzy K-means, and Dirichlet. There’s a good chapter on evaluating and improving clustering quality, and a useful case study of running clustering on Hadoop. The section finishes with a look at real-world applications of clustering, in particular finding similar users on Twitter; and suggesting tags for artists on Last.fm.
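For readers who want something between the gentle introduction and the full Mahout code, the following is a minimal sketch of k-means with a Euclidean distance measure. It is plain Python (3.8 or later for `math.dist`), not Mahout's implementation; the function name, the sample points and the hand-picked starting centroids are assumptions for illustration only.

```python
from math import dist  # Euclidean distance between two points

def kmeans(points, centroids, max_iter=20):
    """Basic k-means: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: dist(p, centroids[k]))
            clusters[nearest].append(p)
        new_centroids = []
        for k, members in enumerate(clusters):
            if members:
                # mean of each coordinate across the cluster's members
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2D data forming two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The starting centroids are passed in explicitly so the run is repeatable; real implementations usually pick them randomly or with a seeding step.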
Classification is tackled in Part Three of the book. Put simply, classification is the process of deciding on a limited number of potential results, then classifying data to fit into one of those results. In the case of Mahout, classification is an area where you have to carry out supervised machine learning. You train your classification system with example data that fits the targets you’re looking for, then once Mahout has used this to create a model, the model can be used on other data to classify items in the production data. There’s a chapter on training a classifier with advice on how to extract features to use for the trainer, and how to convert classifiable data into vectors. Once you’ve trained your classifier, the authors show how to evaluate and tune it, and how to deploy it. The section ends with a case study on how Shop It To Me has used classification.
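The train-then-classify workflow described above can be sketched with a tiny naive Bayes text classifier. This is plain Python, not Mahout's API and not the book's code; the training messages, the labels and the function names are made up for illustration, and splitting text on whitespace stands in for proper feature extraction.

```python
from collections import Counter
from math import log

def train(examples):
    """Build a model from (text, label) pairs: word counts per label,
    number of examples per label, and the overall vocabulary."""
    word_counts, label_counts, vocab = {}, Counter(), set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}

def classify(model, text):
    """Return the label with the highest log-probability for the text,
    using add-one (Laplace) smoothing; unknown words are ignored."""
    total = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, n in model["label_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = log(n / total)  # prior for this label
        for word in text.lower().split():
            if word in model["vocab"]:
                score += log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled training data
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

model = train(TRAINING)
print(classify(model, "cash prize offer"))
print(classify(model, "monday meeting"))
```

The model here is just a dictionary of counts, which makes the split between the training step and the later classification of unseen data easy to see.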
There are parts of this book that are excellent, and in general the topics covered were well written. The authors are good at coming up with understandable examples, and make their descriptions easy to read and comprehend. I thought there were some jumps in level from the very easy-to-follow introductions straight into fairly dense code and techniques, and personally would have appreciated some intermediate introductions to the more advanced material. This wasn’t helped by the code being written for Mahout 0.5, but overall the book gave me a good introduction to what Mahout can do, and some useful code to illustrate more complex ideas. I’d say be aware of the drawbacks, but if you want to learn Mahout, this is still a worthwhile read.