Practical Statistics for Data Scientists

Authors: Peter Bruce and Andrew Bruce
Publisher: O'Reilly
Date: June 2017
Pages: 320
ISBN: 978-1491952962
Print: 1491952962
Kindle: B071NVDFD6
Audience: Data Scientists
Rating: 4.5
Reviewer: Mike James

Statistics for data scientists? Aren't they statisticians?

The idea of this book is to introduce data scientists to what you might call classical statistics. Personally, I find it terrifying that anyone practicing data science could be expected not to have a background in statistics, so this book should be preaching to the choir. Its subtitle, "50 Essential Concepts", is also worrying - fifty, just fifty! How can you sum up statistics in such a small number?

The book is structured into short blocks introducing an idea such as the mean or measures of location and so on. At first I really didn't like it as there seemed to be little scope for depth in short summaries of the sort you might use to cram for an exam. However, as I continued to read the book it slowly won me over. This doesn't mean it is suitable for every reader, but it is a really good short introduction to some very subtle ideas.

The book follows a fairly traditional path through statistics. Chapter 1 is about exploratory data analysis and it's the one that nearly made me give up on the book. It is fairly low level and from first impressions seems shallow, but there are some good nuggets of insight scattered between the bullet points.

Chapter 2 starts to deal with the issues of real statistics, and not just descriptive statistics, with a look at data and sampling distributions. This is where math becomes central to the argument, yet the book largely avoids it. There are equations, but no deep explanations of the calculations. You are given a working understanding of the ideas and perhaps a few sketched R statements that will calculate something. You don't need to know R, but it is used throughout the book without any explanation.

This is also the chapter where the idea of the bootstrap, and resampling in general, enters the story. I think you could say that this is the real difference between today's "data science" and last century's "statistics". Without a computer it was essential to have models and theoretical distributions to work with. With a computer you can resort to simulation, which is more or less what resampling is. The book champions resampling as a statistical method, and its biggest fault is that it doesn't emphasize the need for lots of data.
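To make the idea concrete, here is a minimal sketch of a percentile bootstrap for the mean. It is my own illustration in Python rather than anything taken from the book, which uses R. The function name `bootstrap_ci`, its parameters and the sample values are all made up for the example.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(data) (illustrative helper, not from the book)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    # Each resample draws n values from the data WITH replacement,
    # then the statistic is recomputed on that resample.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Read the interval off the empirical distribution of the estimates.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up measurements, purely for illustration.
sample = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.9, 12.5, 9.4, 11.8]
low, high = bootstrap_ci(sample)
print(f"sample mean = {statistics.mean(sample):.2f}")
print(f"approx. 95% bootstrap interval = ({low:.2f}, {high:.2f})")
```

No theoretical distribution appears anywhere in the calculation, which is the point. With only ten observations, though, the interval should not be trusted very far, which is exactly the "lots of data" caveat.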

Chapter 3 deals with significance testing and this is a very difficult area. It succeeds in explaining the ideas, but only if you read carefully and think about what is being explained. This is subtle, but the examples help.
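Resampling carries over to significance testing as well. The following is a rough sketch of a two-sample permutation test, again my own Python illustration and not code from the book. The name `permutation_p_value` and the two groups of numbers are hypothetical.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in means (illustrative helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled values and split them back into two groups.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Share of shuffles at least as extreme as the observed difference.
    return extreme / n_permutations


# Made-up measurements, purely for illustration.
control = [10.2, 9.8, 11.1, 10.5, 9.9, 10.7]
treatment = [11.4, 12.0, 10.9, 11.8, 12.3, 11.1]
p_value = permutation_p_value(control, treatment)
print(f"estimated p-value = {p_value:.4f}")
```

The returned value estimates how often a random relabeling of the data produces a difference at least as large as the one observed, which is the question a significance test is asking.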

From here the book moves through a set of well-known techniques - Chapter 4 regression, 5 classification, 6 machine learning, 7 principal components and clustering. What isn't covered, and it is a reasonable omission, is the whole subject of neural networks. Each of the chapters includes "classical" techniques that are often ignored in data science books, such as stepwise regression and linear discriminant analysis. It does cover the more modern approaches, such as ridge regression and splines. It also covers statistical topics that deserve a book in their own right in just a few paragraphs - particularly ANOVA.

There are things that you might fault this book on, but other readers might find them an advantage. There are no explanations of how things work. The coverage of technicalities is slight. The program snippets are short and really just illustrations - you shouldn't buy the book if you are looking for code. Many very big topics are covered in a few paragraphs. That said, it doesn't avoid difficult concepts, such as the idea of main effects and interaction terms in regression as well as ANOVA. There is nothing misleading in what is presented, and if you have some idea of what is going on before you read the book then a careful reading will expand your understanding.

I am still of the opinion that data scientists should be statisticians first, but if you disagree this will give you a glimpse of what lies on the other side of the divide. If you are a statistician then reading it might give you some idea of how data scientists think about things.

Recommended but with some reservations.

For more recommendations of Data Science books see Reading Your Way Into Big Data in our Programmer's Bookshelf section.

Discovering Modern C++, 2nd Ed

Author: Peter Gottschling
Publisher: Addison-Wesley
Pages: 576
ISBN: 978-0136677642
Print: 0136677649
Kindle: B09HTJRJ3V
Audience: C++ developers
Rating: 5
Reviewer: Mike James

Modern C++, who would want to write anything else? Is this a suitable introduction for the rest of us?

SQL Server 2022 Administration Inside Out

Author: Randolph West et al
Publisher: Microsoft Press
Pages: 992
ISBN: 978-0137899883
Print: 0137899882
Kindle: B0C4VKVP27
Audience: DBAs and developers
Rating: 5.0
Reviewer: Ian Stirk

This book aims to update your DBA skills to cover SQL Server 2022. How does it fare?

Last Updated (Tuesday, 19 March 2019)
