Author: Andreas C. Müller and Sarah Guido
Audience: Python programmers
Reviewer: Mike James
What exactly is machine learning?
Machine learning is a hot topic and lots of people want to know something about it. It is important to notice that the subtitle of the book is A Guide for Data Scientists. This is not a book about neural networks doing amazing things like playing games, driving a car or translating from one language to another. This is statistics seen through the lens of AI methods. The methods selected are also more towards the stats end of the spectrum and some of them could be presented as classical statistics with no mention of machine learning or AI. There is an awkward meeting place of AI and stats, and this book sits squarely in it.
The first chapter tells you a little about the Python Jupyter environment, which is where all of the work is done. You need to be reasonably good at Python, and with Jupyter notebooks in particular, because this chapter doesn't really tell you enough. If you are a complete beginner then you probably need a book on Python and one on using Jupyter. The chapter ends with a look at using the k-Nearest Neighbors classifier on the well-known iris data. This is the standard data set used by R. A. Fisher to develop the theory of discriminant analysis - a topic that is completely ignored in this and most other books on machine learning, even though its data set is used.
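For readers who want a feel for what the chapter's closing example looks like, here is a minimal sketch - not the book's exact code, just the standard scikit-learn pattern for fitting k-Nearest Neighbors to the iris data:

```python
# Sketch: k-Nearest Neighbors on the iris data, assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# Hold out a test set so accuracy is measured on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)  # accuracy on the held-out data
```

The same fit/score pattern recurs throughout the book, which is one of scikit-learn's strengths.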
Chapter 2 deals with supervised learning and here we find k-Nearest Neighbors again, linear models (regression), naive Bayes classifiers, decision trees, ensembles of decision trees, kernel SVMs, neural networks, and, finally, estimating the uncertainty in classification. The introductions to the techniques are very hands-on, with real data sets and programs to perform the analysis. You would be well advised to download the programs from the associated GitHub page. The explanations are fairly brief and practically oriented. Whenever some theory might help it is avoided with a comment to the effect that it is beyond the scope of the book. This is not a sentiment I can identify with, as the application of statistical methods and machine learning is crucially dependent on a good understanding of what is going on. While the book does contain details of how to think about the data, and sometimes the model being fitted, you will still need more information to really understand what is going on.
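To illustrate the chapter's final topic, here is a hedged sketch of how uncertainty estimates are typically obtained in scikit-learn - fit any probabilistic classifier and call predict_proba; the data set and model are my choices, not necessarily the book's:

```python
# Sketch: class-membership probabilities from a supervised classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

logreg = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = logreg.predict_proba(X_test)  # one row per sample, one column per class
```

Each row of proba sums to one, giving a direct, if sometimes over-confident, measure of how sure the model is about each prediction.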
Chapter 3 is on unsupervised learning and it first deals with the problems of preparing the data so that you have a chance of finding interesting classes. Next we have a section on various methods of dimension reduction - PCA, non-negative matrix factorization and manifold learning. Then on to more or less classical cluster analysis with k-means, agglomerative clustering, and so on.
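The chapter's three stages - preparation, dimension reduction, clustering - chain together naturally. A minimal sketch of one such chain, using scaling, PCA and k-means on the iris data (my choice of data and parameters, for illustration only):

```python
# Sketch: scale, reduce dimensions with PCA, then cluster with k-means.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)        # zero mean, unit variance
X_2d = PCA(n_components=2).fit_transform(X_scaled)  # keep two components
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
```

Note that the cluster labels are arbitrary integers - unsupervised methods find structure, not named classes.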
After Chapter 3 the book is much more about how to cope with the real world of data than it is about clever analysis, and many readers will find this useful. Chapter 4 is about representing data and finding features. It doesn't proceed logically, by first telling you about the different types of data you can encounter - interval, ordinal, categorical and so on - which would be my preferred way of proceeding. Instead we have some ad-hoc discussion of categorical data and how to code it. Then on to feature selection. No mention of contingency tables for feature selection and no mention of stepwise regression.
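The standard coding for categorical data is one-hot (dummy) encoding, which the book covers; a quick sketch using pandas, with made-up data:

```python
# Sketch: one-hot encoding of a categorical column with pandas.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "red", "blue"],
                   "size": [1, 2, 3, 4]})
# Each distinct category becomes its own indicator column
encoded = pd.get_dummies(df, columns=["colour"])
```

The result replaces the colour column with colour_blue, colour_green and colour_red indicator columns, leaving the numeric size column untouched.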
Chapter 5 is called Model Evaluation and Improvement. It is mostly about cross validation and again it lacks any theoretical background - no mention of the basic idea of resampling and the "leave one out" method, even though this is a special case of k-fold cross validation with k equal to the number of samples. When it comes to evaluating the model contingency tables do finally make an appearance in the form of the confusion matrix.
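The relationship is easy to see in code - leave-one-out is just cross validation with as many folds as samples. A sketch, with my own choice of data and model:

```python
# Sketch: k-fold cross validation versus leave-one-out.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores_kfold = cross_val_score(model, X, y, cv=5)            # 5 folds
scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut())  # one fold per sample
```

With 150 samples, leave-one-out fits the model 150 times - exact but expensive, which is why k-fold with a small k is the usual compromise.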
Chapter 6 takes us about as far away from classical statistics as it is possible to get with a look at algorithm chains and pipelines. Essentially this is about building multi-step procedures to process data in an effort to find the best model. The reason that this isn't classical statistics is that it is more about programming and data processing, and no classical statistical procedure would seek a model in this way - you are almost certain to find something that fits the data, even if it has no value at all. The only situation in which this isn't the case is if you have lots and lots of data and a fairly simple model.
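A hedged sketch of what such a chain looks like in scikit-learn - a Pipeline wrapped in a grid search, with the data set and parameter grid chosen by me for illustration:

```python
# Sketch: a scaler-plus-SVM pipeline tuned by grid search.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
# Search over the SVM's regularization strength, cross-validating each setting
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
score = grid.score(X_test, y_test)  # evaluated on data the search never saw
```

The pipeline keeps the scaler inside the cross-validation loop, which is the one genuinely statistical point here - fitting the scaler on the whole data set first would leak test information into the search.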
The penultimate chapter changes the topic to text data, which is a very different situation. Here we meet bag-of-words and the problems of dealing with natural language - stopwords, stemming and lemmatization. Very niche, but very important.
The final chapter is an overview of "where next". My own advice would be to learn some classical statistics and if you don't know enough math learn some math. Going into machine learning without a math and stats background sometimes has a happy outcome, but it is a bit like deciding to drive a car with a blindfold on.
So what is the final verdict?
This is a book packed with lots of practical examples and exercises. If what you are looking for is guidance on how to use Python to do small to medium scale data processing using machine learning, then it has lots of useful information.
It does do some explaining of what is going on, but it is very slight. It avoids the deeper math and the deeper ideas on purpose and this tends to make everything seem like a special case and all methods ad-hoc. If this is what you are looking for then it's great, but, as you can tell, it's not for me.
To keep up with our coverage of books for programmers, follow @bookwatchiprog on Twitter or subscribe to I Programmer's Books RSS feed for each day's new addition to Book Watch and for new reviews.