Think Like a Data Scientist
Author: Brian Godsey
Everyone seems to want to be a data scientist at the moment. Does this book help?
A data scientist used to be called a statistician. I'm not completely sure why there was a need for a change but I suppose the new name reflects the deeper connection with computers. This book promises to take you step-by-step through the process, but I'm not at all sure that there is a process. The best data science is creative and probably requires you to step outside of the process - if there is one.
So what do you need to know to be a data scientist?
According to the first chapter of the book you don't need to be a statistician and you don't need to be a programmer. As far as I can make out the author thinks that you simply have to want to be a data scientist and be bright enough to think hard. This is like saying that you don't need to have a good voice to sing. It could be true occasionally, but it isn't particularly helpful.
I can understand the need to encourage the reader, but as far as I'm concerned there is little point in covering up the fact that you really do need a good understanding of statistics and programming. More important is a skill with analytical thinking and preferably a good grounding in math - after all it is data science.
Chapter 2 continues in this vague and consoling way, discussing some general issues of dealing with clients and with data. "Ask good questions" seems to be the motto, but how you arrange to answer them is something you have to work out for yourself.
Chapter 3 is about the forms that data takes - HTML, XML, JSON and so on. You might not need to be a programmer, but it seems you need to know a lot of things that programmers have to know about. The point is that none of these things is covered in any depth. It is almost as if it is enough to know they exist. Of course this is much too little to even get by, let alone do something useful. Chapter 4 continues the data focus with a look at a case study.
Chapter 5 starts to look at the statistics that you apparently don't need to know much about. What we have is a quick look at some simple descriptive statistics mixed with a few "handy tricks". This just isn't enough information.
The second part of the book, "Building a product with software and statistics", is very strange. What exactly is the product? No, it isn't about recommender systems, it is about statistics. After a quick review in Chapter 6, we dive into statistics in Chapter 7. The dive in question is a bit of a disappointment because the water isn't at all deep. The discussion of Bayesian v Frequentist left me in a state of complete shock. To say that Bayesians are more about probability than the Frequentists is simply crazy. Frequentists are the ones who demand that probabilities are actually verifiable rather than a stand-in for belief or subjective certainty. Why bring up such a contentious topic if it is going to be dismissed in a few paragraphs that don't adequately explain what each approach is all about? At the end of the chapter we are treated to tiny accounts of clustering, component analysis and the inevitable machine learning. None of these is enough to let you grasp even the basics of the topic.
Chapter 8 is about software - spreadsheets, programming and lots of generalities. Chapter 9 introduces databases, cloud services and big data technologies. Chapter 10 is about managing your project and has a brief discussion of significance; why no mention of power? Again we have lots of fairly useless tips and tricks and very little to explain what is going on.
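To show why power deserves a mention alongside significance, here is a minimal sketch of my own, not something taken from the book. It assumes the simplest possible setting, a one-sided, one-sample z-test with known unit variance, and the function name `z_test_power` and the effect size of 0.5 standard deviations are purely illustrative choices.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test with known unit variance.

    effect_size is the assumed true mean shift in standard-deviation units;
    n is the sample size; alpha is the significance level.
    """
    std_normal = NormalDist()
    # Rejection threshold for the test statistic under the null hypothesis.
    critical = std_normal.inv_cdf(1 - alpha)
    # Under the alternative the statistic is centred on effect_size * sqrt(n),
    # so power is the probability that it still exceeds the threshold.
    return 1 - std_normal.cdf(critical - effect_size * sqrt(n))


if __name__ == "__main__":
    # Same significance level throughout; only the sample size changes.
    for n in (10, 20, 50):
        print(n, round(z_test_power(0.5, n), 3))
```

The significance level is fixed at 5% in every row, yet the chance of detecting a real effect of this assumed size varies a great deal with the sample size. That is the sort of thing a discussion of significance on its own leaves out.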
Part 3 of the book is about finishing off the product, i.e. the data analysis, and wrapping up. Chapter 11 is about presenting your results to the customer. Chapter 12 is about responding to the customer and Chapter 13 deals with documenting and archiving your work and drawing lessons from it.
This is a very simple book that doesn't seem to set out to teach you anything much about the essential skills needed in data analysis like probability, statistics or math. It doesn't tell you anything serious about programming or even about using statistical software. It is a long distance view of the subject of the sort that you might need if you were trying to bluff your way through. There is a lot of advice based on experience, but for me it all seemed far too specific to generalize well or far too general not to be common sense. None of the statistical topics are covered in enough detail for the reader to understand what is being said (unless they already know) and none of the programming/software issues are made at all clear. If you are looking for a book that skirts around the material of data science this is it.
Last Updated (Tuesday, 10 October 2017)