Pandas for Everyone: Python Data Analysis
Author: Daniel Chen
Python, a general-purpose language, for data analysis? With Pandas it might be possible.
This is a book about using the Python package Pandas - a data manipulation and analysis library. You could say that Python plus Pandas is the equal of say R or SAS, but with more flexibility.
So you want a book on Pandas but what do you expect from such a book?
The problem is that this is not a deep theoretical subject where there are great concepts to learn and skills to be mastered. Using a package such as this is mostly a matter of finding out what the implementors decided to call some function or exactly how they hide the feature you are looking for. This sounds like you probably need a cookbook - this book is not a cookbook exactly, but it is a collection of task-oriented descriptions.
The bottom line is that this is not really a book that you would sit down and read cover to cover. It is more what you might turn to to solve a problem or get you into a topic.
The book is divided into five parts, the first of which is an Introduction. Chapter 1 starts off this section with a look at the DataFrame. You are expected to know Python and, for the most part, how to get the programming environment set up. Chapter 2 moves on to consider more general data structures and data. The topics include how to import data. Chapter 3 moves outside of Pandas to use Matplotlib and Seaborn to create charts.
Part II covers Data Manipulation and this has an interesting approach to the subject, focusing on how the data's characteristics and origins affect how it is represented in a Pandas DataFrame. It introduces the idea of "tidy data", almost an informal normal form for statistical data - it's a nice idea. Chapter 4 deals with merging datasets. Chapter 5 is about missing data and what missing data actually means. Chapter 6 is dedicated to the idea of tidy data and explains "columns contain variables not values". This introduces the idea of melting or pivoting or whatever you want to call it, but without being clear what is happening to the data. You are expected to see what the transformation is from the examples, i.e. by looking at extracts from the data. I don't think that this is the best way of explaining the operation - it's an algorithm and you can tell the reader what the algorithm is. Unless you get the idea of how the different columns work together to create a variable, this is a difficult chapter.
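To make the point concrete, here is a minimal sketch of melting written out as an explicit algorithm in plain Python. It is not taken from the book: the function name, the sample rows and the column names are all made up for illustration. Pandas provides the same operation as `pandas.melt`, which takes `id_vars`, `var_name` and `value_name` arguments; the sketch emits rows in a different order (row by row rather than column by column).

```python
def melt(rows, id_vars, var_name="variable", value_name="value"):
    """Turn every non-identifier column of every row into its own output row."""
    long_rows = []
    for row in rows:
        # The identifier columns are copied unchanged into each output row.
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column in id_vars:
                continue
            # The column *name* becomes a value of the new variable,
            # and the cell contents become the measurement.
            long_rows.append({**ids, var_name: column, value_name: value})
    return long_rows


# Hypothetical "wide" data: the year is hidden in the column headers.
wide = [
    {"country": "A", "1999": 10, "2000": 20},
    {"country": "B", "1999": 5, "2000": 15},
]

tidy = melt(wide, id_vars=["country"], var_name="year", value_name="cases")
for record in tidy:
    print(record)
```

Each wide row with two year columns becomes two long rows, so the column headers that were really values ("1999", "2000") end up in a proper `year` variable.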
Part III is on Data Munging, which is another way of saying getting your data into shape for the analysis you plan to use. Chapters 7 and 8 cover some surprisingly basic topics - data types, strings, and categorical data. It goes into Python and general programming topics such as how to format strings, using regular expressions and so on. Again this is mostly learning by showing rather than by explanation. Chapter 9 is on the Apply method and this really should be fairly obvious material to any reasonable Python programmer. Chapter 10 is on the inevitable, in the sense that you often have to do it, Groupby and Split type operations. The section closes, with Chapter 11, with a look at the problems of using date and time data - never as easy as you might expect.
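As a rough illustration of the Groupby and Split type of operation mentioned above, the following sketch splits a small DataFrame by a key, applies a sum to each group and combines the results. The data and column names are invented for this example and are not from the book; the hand-written loop is only there to show what the one-line Pandas call is doing.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "cases": [10, 20, 5, 15],
})

# Split the rows by country, apply a sum to each group,
# and combine the per-group results into a single Series.
totals = df.groupby("country")["cases"].sum()

# The same computation spelled out by hand.
manual = {}
for country, cases in zip(df["country"], df["cases"]):
    manual[country] = manual.get(country, 0) + cases

print(totals)
print(manual)
```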
Part IV is about Data Modelling and in most cases this is the section that books on using statistical software should leave out unless they are prepared to write a full textbook on the subject. To pretend that you can understand even something as simple as linear regression from a page or two isn't realistic. Go read a statistics book. The section starts in Chapters 12 and 13 with the fairly simple models - regression through generalized linear models, but not in the ANOVA sense. Chapter 14 covers diagnostics. Chapter 15 is on regularization including LASSO and ridge regression. Chapter 16 goes over the basic methods of clustering and brings the section to a close. Being such a short section, there is a lot that isn't covered - contingency tables and categorical analysis, not to mention the whole world that ANOVA-style analysis is. There is also nothing on factor analysis, principal components, discriminant analysis etc. This isn't a huge problem as even if they were covered you would need another book to do them justice.
The final section, Conclusion, is a look at some fairly off topic subjects. Chapter 17 is about the wider Python community and Chapter 18 is about how to be a self-directed learner - go to meetings, conferences etc. The space could have been better spent on more Pandas.
The book closes with some appendixes on installing Pandas.
This is a book that is strong on showing you how to do things rather than explaining how to do things. There isn't much deep principle in a package like Pandas, but there are missed opportunities to point out the generalities of data preparation, model proposal and testing. There are places where it reads more like a set of lecture notes than a complete narrative account of using Pandas. In addition the presentation often makes it harder to see what is being demonstrated with tables split across pages where it would have been easy to adjust the layout to keep the lines together. Overall I found the book more difficult to read than it needed to be.
If you are a fairly good Python programmer there are also places in the book where you are told some very basic things about strings, functions and so on.
This book will suit you if you are prepared to actively investigate the examples you are being shown and think about what is happening. In many cases you will need to study the data to see how the commands are changing it.
Last Updated (Wednesday, 05 September 2018)