Becoming A Data Head
Authors: Alex J. Gutman and Jordan Goldmeier
You can tell how important 'big data' is from the snowstorm of jargon surrounding it. This book sets out to guide the reader through the jargon to understand why you might choose a particular technique. So does it live up to its subtitle - How to Think, Speak and Understand Data Science, Statistics and Machine Learning?
The back jacket of the book says it is a complete guide for data science in the workplace - quite a boast for a book that's only 272 pages long. There's definitely a need for some help, as the authors show in the introduction with an amusing example of the conversation they say happens when, as data scientists, they present their results to business people and get blank looks in response.
Having illustrated why the book is needed, part one of the book sets out to give people who aren't data scientists the tools to succeed in a data-centric world by 'thinking like a data head'.
Chapter 1, 'What is the problem?', poses a number of questions an informed 'data head' needs to ask about business problems, such as 'why is this problem important?', 'who does it affect?', and 'what if we don't like the results?'.
The next chapter asks 'what is data', and gives examples of data versus information, how data is collected and structured, and contrasts observational and experimental data.
Chapter 3 is probably where people from a non-science background will start to worry, as readers are asked to prepare to think statistically. In addition to recommending that people ask questions, the majority of the chapter covers the idea of variation in all things, along with probabilities and statistics.
Part 2 of the book has the title 'Speaking Like a Data Head', and it opens with a chapter on arguing with the data. The authors say this is necessary to find out 'if your data stinks', to avoid falling into the trap of garbage in, garbage out, and the chapter looks at a number of questions you ought to ask to find out just how good your data is. Next, they put forward ways to explore the data, including the questions to ask to determine whether the data you have can answer the question you want answered. This chapter also introduces the idea of correlation (and causation) to explore relationships in the data.
Chapter 6 is all about probabilities, still using a light touch so as not to scare the reader. The authors lay out the rules of the game, how to work out the likelihood of two things happening together, and common traps to avoid.
A chapter titled 'challenge the statistics' comes next, aiming to give the reader the tools to look at a statistical claim and work out whether it's realistic or not. In more scientific terms, it introduces inference, along with good questions to ask when presented with a statistical claim, including what's the context, what's the sample size, what are you testing, can I see the confidence intervals, and is this practically significant?
Part 3 of the book is about understanding the data scientist's toolbox, beginning with an explanation of how to look for hidden groups in a set of data using dimensionality reduction, principal component analysis, clustering and k-means clustering.
Regression modeling is the next topic, with a chapter looking at linear regression, what it does, what it gives you and what confusion it might cause. A chapter on understanding the classification model comes next, with information on logistic regression, decision trees, ensemble methods, and the pitfalls you might encounter.
Text analytics and how to understand it is tackled next, including topic modeling, text classification, and practical considerations when working with text.
This part of the book ends with a chapter called 'Conceptualize Deep Learning' that looks at neural networks and deep learning in a very accessible way.
The final part of the book is titled 'ensuring success', and has two very practical chapters on watching out for pitfalls and knowing the people and personalities you're working with. The book ends with a look to the future.
I suspect an alternative title for the book could be 'what we wish our business clients knew about data science'. I assume most developers know more about data than the average bear, but it's easy to be complacent and to assume that you know what problem you're solving, and that the data you're working with is correct and suitable for the purpose.
The authors do a good job of introducing ideas about statistics and data analysis in a logical way that explains why a technique is needed, and by the end of the book the reader has absorbed the essentials for understanding data. Highly recommended.
Last Updated (Tuesday, 29 June 2021)