Editor: Q. Ethan McCallum
Audience: Data scientists
Reviewer: Mike James
Data doesn't have to be "big" to cause a problem. This book is a collection of essays on bad data - what makes it bad and how to deal with it.
Statistics has recently been resurrected from the boring bin by the advent of "big data". Now it is OK to dive into data, and data analysis is even regarded as cool. The problem is that statistics hasn't changed and it is still essentially a mathematical pursuit, but try telling that to the new generation of big data exploiters. It seems that simple descriptive statistics is enough to interpret much of the data on offer. The trouble is that descriptive statistics is a very difficult and messy real world subject - much like data.
This book is a collection of cautionary tales. If you have been working with any sort of data for any length of time then few of these stories will come as a surprise - but you will still want to read them if only to get that smug feeling of having been there and solved the problem before.
Any collection of essays has its good and its bad but this set of eighteen is a pleasant surprise in that even the worst of the essays is still readable and pleasing. All of the authors have some sort of sense of humor and perhaps the single thing that makes this book special is that it has the quirkiest set of essay titles imaginable, from "Is it just me or does this data smell funny" through "Blood, Sweat and Urine" to "Crouching Table, Hidden Network".
The level of the material rarely gets particularly technical but you do need to know something about handling data using a range of languages and tools to get the best out of the essays. For example, the first uses Perl scripts to explain how to work with formatted data.
The second essay "Data Intended for Human Consumption Not Machine Consumption" is essentially about scraping - the task of extracting data from text or web pages minus its formatting and in a form that can be processed further. It might come as a surprise to you to learn that the solution to the problem is to "write code". The language of choice here is R. The summary says:
"Another lesson to take away is that it is worthwhile learning about computer code so that we can work with data that is provided by others, regardless of the format that they use to store or present the data."
It is still a shock to think that this needs to be said.
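To make the idea of scraping concrete, here is a minimal sketch of pulling the cell text out of an HTML table while discarding the markup. It is written in Python with the standard library's `html.parser` rather than the R used in the essay, and the class name `TableTextExtractor` and the sample page are hypothetical, invented purely for illustration.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of each table cell, row by row, ignoring all markup."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # row currently being built, or None
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the open cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


# Hypothetical page fragment formatted for human readers.
page = (
    "<table>"
    "<tr><th>City</th><th>Pop</th></tr>"
    "<tr><td><b>Oslo</b></td><td>700,000</td></tr>"
    "</table>"
)

extractor = TableTextExtractor()
extractor.feed(page)
print(extractor.rows)
```

The values still arrive as strings, with "700,000" keeping its thousands separator, which is the kind of human-oriented formatting that needs a further cleaning step before analysis.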
"Will the Bad Data Please Stand Up" is a real eye-opener for the beginning data analyst. It explains how simple statistical models can fail to describe the data even when they should be able to do the job. The section on how averages can completely fail to represent the typical data point is something everyone should read.
"Blood, Sweat and Urine" is a very witty essay on how scientists, chemists in this case, manage to "do" both data and statistics in a way that is unique.
"All chemists are required to carry a lab book around. in which tey have to record the details of how they conduct each experiment. And if they forget to write it down? Oops, the experiment is invalid. Run it again. I sometime wonder what would happen if the same principles were applied to data scientists. You didn't document that function. Delete. I can't determine the origin of this dataset. Delete. There's no reference for this algorithm. Delete, delete,delete. The outcry would be enormous but I'm sure standards would improve."
"When Data and Reality Don't Match" is a thought-provoking tale of when you can regard data as being a true representation of the world. You gather some data but later discover that something special happened to make the data not representative. Do you throw that portion of the data away or regard it as true because special things do happen?
Next we have an essay on the less obvious sources of bias and one on the use of imperfect data. "When Databases Attack" makes the argument that resorting to a database isn't always the best thing to do - the simple file is often more powerful. It tells a heartbreaking story of data locked up in CouchDB never to be released because of the time it takes to process such a big data set in the face of bugs.
"Crouching Table, Hidden Network" tells another database horror story, but one that should be familiar to anyone who has worked with a real database. It uses the example of the Koch snowflake to get the idea across to anyone who hasn't met a real database. The Koch snowflake starts out very simple - just a triangle and a simple generator rule - but after a few iterations its a really complex fractal. Databases are like this. No matter how simple they start they grow complex and messy.
The collection ends with some general essays on matters of interest - cloud computing, best practices, machine learning, data provenance, social media and data quality. All good essays but not the stars of the show unless you have a particular interest in the topics they cover.
This is a very good general reading book on the topic of working with data. Don't buy it if you want a cookbook of solutions or something highly technical showing you how to perform the latest statistical analysis - it's not that sort of book. It is much more about remembering to keep a lot of common sense about you as you meet your data in the battle to extract its meaning, if any.
I enjoyed reading most of the essays and would recommend the book to any reader interested in data and perhaps even big data.