Author: Ian H Witten, Eibe Frank & Mark A Hall
Publisher: Morgan Kaufmann
Aimed at: Those wanting an in-depth introduction
Pros: Very readable and understandable
Cons: Stops short of hard core statistics
Reviewed by: Kay Ewbank
If you are looking for a book on data mining that also covers the concepts of machine learning, is this the one for you?
This is a very readable book that covers an important topic: how can you find patterns in your data? Most companies store quantities of data that would in the past have seemed unbelievable, but very few companies make good use of that data. If it’s customer data, you’ll get the catalogue through the post (or probably two catalogues, even if you asked not to receive any), but that’ll be more or less it as far as making use of the data goes.
Most data has underlying patterns that can be useful and enlightening. If you’re working with customer data, why have some previous customers stopped buying? If you’re storing data on diseases and patient recovery, why do some patients survive while others don’t?
The book is divided into three parts: the first is an introduction to data mining, what it means, what machine learning is, and how it is used. Knowledge representation and how you can evaluate the results produced by machine learning round off this introduction. Even if you’re not planning on doing real-world data mining, this first part of the book is worth reading so you know what the terms mean and what sort of things are possible.
Part Two of the book looks at more advanced data mining techniques. The chapters take you through real machine learning schemes, data transformations and ensemble learning, ending with an interesting chapter on the future of data mining. For me, this part was the heart of the book. If you’re interested in, for example, how to choose a test for classifying data, what an exemplar is and how to reduce the number you end up with, the different types of clustering, or Bayesian networks, they’re all covered. The chapter on ensemble learning was particularly interesting, showing how you can combine the outputs of different models in a variety of ways, use one model’s output to boost the strength of another, and generally improve the results you get by using more than one machine learning technique.
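To give a flavour of the simplest ensemble idea the book discusses, here is a minimal sketch of combining classifier outputs by majority vote. This is not taken from the book (whose examples use Weka); the function name and the toy predictions are invented for illustration, using only the Python standard library.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class predictions from several models by simple voting.

    `predictions` is a list of prediction lists, one per model, all the
    same length. For each instance, the class predicted by the most
    models wins.
    """
    combined = []
    for votes in zip(*predictions):  # one vote per model for this instance
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical classifiers' outputs on five instances:
model_a = ["yes", "no", "yes", "no", "yes"]
model_b = ["yes", "yes", "yes", "no", "no"]
model_c = ["no", "no", "yes", "no", "yes"]
print(majority_vote([model_a, model_b, model_c]))
# -> ['yes', 'no', 'yes', 'no', 'yes']
```

Boosting, which the chapter also covers, goes further than this by training each new model to concentrate on the instances the previous ones got wrong, but even plain voting like the above often beats any single model.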
Part Three of the book will either be incredibly useful or a complete waste of space, depending on the way you’re planning to do data mining. It is devoted entirely to the Weka data mining workbench, an open source data mining tool that was developed at the University of Waikato, New Zealand. All three of the authors have been involved in the development of the workbench, so this is an excellent introduction to it, but only if you plan on using it. However, even if you decide to use a different machine learning tool, the chapters on Weka do show how you can put together a data mining model and interpret the results; it’s just that you’d have to apply the techniques to the application you were planning to use.
The examples in the book use several sets of data, some of which are well known: Fisher’s Iris data will make anyone who’s ever studied statistics feel immediately at home. Other examples cover weather data, types of contact lenses prescribed to patients, classification of soybean diseases, and Canadian labor negotiations. While other books on data mining go into great depth on the statistical techniques, this book stops short of that. It does explain the concepts, and makes good use of diagrams and written explanations of how techniques work, but doesn’t really get into the actual equations and statistics. Whether this is an advantage or a drawback depends on your personal point of view, but even if you’re going to go on and get into the heavy statistics, it’s a good start so you know why you’re using one technique rather than another.
In summary, if you want a good introduction to data mining, get this book.