Data Science with Java

Author:  Michael R. Brzustowicz
Publisher:  O'Reilly
Pages: 236
ISBN: 978-1491934111
Print: 1491934115
Kindle: B072MKRQBQ
Audience: Java programmers wanting to use the Apache Commons Math library
Rating: 3
Reviewer: Alex Armstrong


Java is a candidate for doing data science, so a book on the subject seems like a good idea.

Java isn't the trendiest of languages at the moment, and for data science you might well think that Python or R would be a better choice - they probably are. However, if you have a lot invested in Java, why not use it? After all, that's what general purpose languages are all about - being general.

The first problem is deciding what exactly a book on the subject should be about. Is it about Java, or is it about what we used to call statistics? This particular book doesn't really major on either topic; instead it opts to show you code and approaches to various types of analysis realized in Java - mostly using the Apache Commons Math library. This is a reasonable idea, but unfortunately it fails because it doesn't address an audience at a specific level. One minute some very simple things are being explained and the next the level suddenly zooms up, with equations introducing ideas that, if you already knew them, would make the book unnecessary.


Chapter 1 starts off looking at how to get data into the program and how to perform simple cleaning. This is a logical place to start, but here we meet the main problem with this book - the change in levels. The first thing we are told about is the use of one-dimensional arrays, with no explanation of what an array is, and then the term BufferedReader is used, again with no explanation. Is the reader a Java beginner or an expert? Next we have a discussion of missing data and how to read various types of file format. The final few pages are on using SQL and JDBC and on how to plot data using JavaFX.
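To give a flavour of what is involved, here is a minimal sketch of my own - not the book's code - of reading a comma-separated file into one-dimensional arrays with BufferedReader; the file name data.csv is hypothetical:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CsvReadSketch {
    public static void main(String[] args) throws IOException {
        List<double[]> rows = new ArrayList<>();
        // data.csv is a hypothetical file of comma-separated numbers
        try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                double[] row = new double[fields.length];
                for (int i = 0; i < fields.length; i++) {
                    row[i] = Double.parseDouble(fields[i].trim());
                }
                rows.add(row);
            }
        }
        System.out.println("read " + rows.size() + " rows");
    }
}
```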


Chapter 2 moves on to linear algebra and we are asked to consider XW=Y as a prediction of Y given X and weight matrix W. Just below the equation the matrices are written out in subscript form as an example - I'm not sure why. Next we make the acquaintance of matrix and vector representations in the Apache Commons Math library. For the rest of the chapter topics are introduced at high speed with only a very basic explanation. Various norms are introduced without much explanation of where they originate or why. If you already know about such things you will recognize what is going on. The final part of the chapter deals with matrix decompositions - LU, QR, SVD, eigen and so on. A typical discussion is for the inverse matrix:

"A matrix inverse is used whenever matrices are moved from one side of the equation to the other via division. Another common application is in the computation of the Mahalanobis distance and, by extension, for the multinormal distribution."

That is a very strange way to introduce the inverse, and why mention the Mahalanobis distance - just to confuse? There are lots of places where the explanations left me feeling that what I was reading wasn't exactly wrong, but it wasn't really helpful. Similarly, the suggested ways of computing the matrix inverse using the SVD or QR decomposition are not exactly mainstream, but you do get some Java code to perform them using the math library. At the end of this chapter you may well wonder what it has to do with anything, let alone data. If you have done a course on linear algebra then it will be a lightning refresher with some Java code.
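Since the review keeps referring to the library, it may help to see what Commons Math matrix code actually looks like. This is a minimal sketch of my own, not code from the book: it forms the product XW from the prediction equation and then inverts a matrix via the QR decomposition's solver, one of the routes the chapter suggests:

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.QRDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

public class LinearAlgebraSketch {
    public static void main(String[] args) {
        // X is a 2x2 data matrix, W a 2x1 weight matrix, so Y = XW is 2x1
        RealMatrix x = new Array2DRowRealMatrix(new double[][] {{1.0, 2.0}, {3.0, 4.0}});
        RealMatrix w = new Array2DRowRealMatrix(new double[][] {{0.5}, {0.5}});
        RealMatrix y = x.multiply(w);
        System.out.println("Y = " + y);

        // Inverting via the QR decomposition's solver
        RealMatrix xInverse = new QRDecomposition(x).getSolver().getInverse();
        System.out.println("X^-1 = " + xInverse);
    }
}
```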

Chapter 3 is about statistics. It opens with the idea that a data point is an example of a Dirac delta function. This is once again true enough, but only a physicist would notice it or care much about it. Why complicate things that are hard enough already? The chapter goes through, at speed, the ideas of probability density, moments, entropy, some example distributions and so on. Then we are presented with the general idea of descriptive statistics - a shopping list of mean, median, mode, standard deviation and skewness. The last part of the chapter deals with multivariate stats - covariance and correlation, regression and working with large datasets.
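For the record, most of that shopping list maps onto a single Commons Math class. Here is my own minimal sketch, not the book's code, of computing some of the descriptive statistics the chapter lists:

```java
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class StatsSketch {
    public static void main(String[] args) {
        DescriptiveStatistics stats = new DescriptiveStatistics();
        for (double x : new double[] {1.2, 3.4, 2.2, 5.6, 4.4}) {
            stats.addValue(x);
        }
        System.out.println("mean:     " + stats.getMean());
        System.out.println("median:   " + stats.getPercentile(50));  // median = 50th percentile
        System.out.println("std dev:  " + stats.getStandardDeviation());
        System.out.println("skewness: " + stats.getSkewness());
    }
}
```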

Chapter 4 deals with working with data to transform it into something usable. It starts off looking at text processing, and I found this very difficult to read because of the way the terminology is used. For example, in the middle of a discussion of creating dictionaries we have:

"For much larger dictionaries we can skip the term storage and use the hashing trick."

Even with the rest of the context, I cannot work out what this means, and if you don't know what the "hashing trick" is then you have no hope of even getting started.
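For what it's worth, the hashing trick as it is generally understood elsewhere is easy enough to sketch in plain Java - a term's hash code is used directly as its feature index, so no term-to-index dictionary has to be stored. This illustration is my own, not the book's:

```java
import java.util.Arrays;

public class HashingTrickSketch {
    // Map each term straight to an index via its hash code,
    // so no dictionary of terms needs to be kept in memory
    static double[] vectorize(String[] terms, int dim) {
        double[] features = new double[dim];
        for (String term : terms) {
            int index = Math.floorMod(term.hashCode(), dim);
            features[index] += 1.0;  // count occurrences at the hashed index
        }
        return features;
    }

    public static void main(String[] args) {
        String[] doc = {"data", "science", "with", "java", "data"};
        System.out.println(Arrays.toString(vectorize(doc, 8)));
    }
}
```

The price you pay is the occasional collision, where two different terms share an index.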

After looking at some very simple rescaling techniques, we reach principal component analysis, which is introduced as a dimensionality reduction procedure. If you don't know what PCA is, then you are going to be none the wiser after reading this explanation, and specifically you won't know why it has anything to do with eigenvectors either. Then comes the idea that:

"One method for calculating the PCA is by finding the eigenvalue decomposition of the covariance matrix of X."

As far as I know this is the definition of classical PCA, not just one way to calculate it - the other ways of calculating it are simply different routes to the eigen decomposition of the covariance matrix.
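To make the connection with eigenvectors concrete, here is a minimal sketch of my own - not the book's code - of classical PCA in Commons Math: form the covariance matrix of the data, then take its eigen decomposition:

```java
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.correlation.Covariance;

public class PcaSketch {
    public static void main(String[] args) {
        // Toy data: rows are observations, columns are variables
        double[][] x = {
            {2.5, 2.4}, {0.5, 0.7}, {2.2, 2.9},
            {1.9, 2.2}, {3.1, 3.0}, {2.3, 2.7}
        };
        // Classical PCA: eigen decomposition of the covariance matrix
        RealMatrix cov = new Covariance(x).getCovarianceMatrix();
        EigenDecomposition eig = new EigenDecomposition(cov);
        // Eigenvalues give the variance along each principal component;
        // the eigenvectors are the component directions
        for (int i = 0; i < cov.getColumnDimension(); i++) {
            System.out.println("lambda_" + i + " = " + eig.getRealEigenvalue(i)
                + ", direction = " + eig.getEigenvector(i));
        }
    }
}
```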

Chapter 5 is about the currently very hot topic of learning and prediction. Here we go off into some very difficult math introduced without much explanation. Vector calculus appears without any warning and we are soon reading equations with del in them and using gradient descent without having met any of the basic ideas. Next we have a shopping list of cost functions, followed by k-means clustering, Gaussian mixture models and so on. This is followed by supervised learning - naive Bayes and linear models. The linear models section is more properly about generalized linear models, since a non-linear transformation is involved. This is then generalized a stage further by allowing multiple generalized linear models to be stacked together to form a deep neural network. A novel way to approach neural networks, but using TensorFlow would probably be more practical.
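As an illustration of how thin the wrapper over the library can be, here is my own minimal sketch - not the book's code - of k-means++ clustering using the Commons Math ml.clustering package:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.commons.math3.ml.clustering.CentroidCluster;
import org.apache.commons.math3.ml.clustering.DoublePoint;
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

public class KMeansSketch {
    public static void main(String[] args) {
        // Two obvious groups of 2D points
        List<DoublePoint> points = Arrays.asList(
            new DoublePoint(new double[] {1.0, 1.0}),
            new DoublePoint(new double[] {1.5, 2.0}),
            new DoublePoint(new double[] {8.0, 8.0}),
            new DoublePoint(new double[] {8.5, 9.0})
        );
        // Cluster into 2 groups with k-means++
        KMeansPlusPlusClusterer<DoublePoint> clusterer = new KMeansPlusPlusClusterer<>(2);
        List<CentroidCluster<DoublePoint>> clusters = clusterer.cluster(points);
        for (CentroidCluster<DoublePoint> c : clusters) {
            System.out.println("center: " + c.getCenter() + " points: " + c.getPoints());
        }
    }
}
```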

The final chapter is a lightning look at Hadoop and the idea of MapReduce as a way of implementing parallel algorithms.
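For readers who haven't met MapReduce, the canonical example is counting words. This is a hedged sketch of my own of the map half, assuming the standard Hadoop MapReduce API rather than anything from the book; a matching reducer would sum the emitted counts for each word:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for each token in the input line;
        // the framework groups the pairs by word for the reducers
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```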

I am not at all sure who this book is aimed at. It is too simple in places for the expert and much too advanced for the beginner. I suppose that, as the expert always has the option of skipping the simple parts, it is better suited to the expert. There is very little about Java in the book, just a lot of examples of using the math library. If you need some Java to look at then this might be helpful. As to the stats, there is so much missing - discriminant analysis, contingency tables, stepwise regression, significance testing - and what is covered is inadequately explained.

This book demonstrates that if you want to be a data scientist you really should learn some math, learn a lot of stats (mainly modeling) and learn to program - in Java or another language.  

Related Articles

What is a Data Scientist and How Do I Become One?

For recommendations of Data Science books see Reading Your Way Into Big Data in our Programmer's Bookshelf section.


To keep up with our coverage of books for programmers, follow @bookwatchiprog on Twitter or subscribe to I Programmer's Books RSS feed for each day's new addition to Book Watch and for new reviews.


Last Updated: Wednesday, 21 March 2018