Learning Spark

Authors: Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia
Publisher: O'Reilly
Pages: 274

ISBN: 9781449358624
Print: 1449358624
Kindle: B00SW0TY8O
Audience: Developers interested in Apache Spark
Rating: 4
Reviewer: Kay Ewbank

The subtitle "Lightning-Fast Big Data Analysis" promises a lot. How well does it deliver?

This introduction to Apache Spark shows how it can be used to create data analytics systems. It is written by Spark's own developers, so they are well placed to introduce it to the rest of us.

Spark is a cluster computing platform that extends the MapReduce model to support interactive queries and stream processing. It also covers batch applications and iterative algorithms. The main idea is that you can combine different processing types in a single app rather than having to build a data analysis pipeline from multiple separate tools. All this is explained in the high-level overview that opens the book.


Having given the high-level view, the authors start the more detailed material with a chapter on downloading Spark and getting started, going as far as building standalone applications. Programming RDDs (resilient distributed datasets) is next on the agenda. In Spark, you do pretty much everything by working with RDDs – creating new ones or manipulating existing ones. This chapter starts with the RDD basics, then looks at how to pass functions to Spark from Python, Scala, and Java, and gives a good intro to common transformations and actions.
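To make the split between transformations and actions concrete, here is a minimal plain-Python sketch. It is not Spark's API: `ToyDataset` and its methods are hypothetical names invented for illustration, and the only assumption carried over from Spark is that transformations are lazy (they describe work) while actions force that work to run.

```python
from typing import Callable, Iterable, List


class ToyDataset:
    """Plain-Python stand-in for an RDD; hypothetical, not Spark's API."""

    def __init__(self, source: Callable[[], Iterable]):
        # Store a recipe for producing the data, not the data itself.
        self._source = source

    @classmethod
    def from_list(cls, items: List) -> "ToyDataset":
        return cls(lambda: iter(items))

    def map(self, fn: Callable) -> "ToyDataset":
        # Transformation: returns a new dataset and computes nothing yet.
        return ToyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "ToyDataset":
        # Transformation: also lazy.
        return ToyDataset(lambda: (x for x in self._source() if pred(x)))

    def collect(self) -> List:
        # Action: forces evaluation and returns the results as a list.
        return list(self._source())

    def count(self) -> int:
        # Action: forces evaluation and returns the number of elements.
        return sum(1 for _ in self._source())


calls = []


def square(x):
    calls.append(x)  # record when the function really runs
    return x * x


numbers = ToyDataset.from_list([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has run at this point
result = even_squares.collect()   # the action triggers square()
```

Chaining `map` and `filter` only builds up a description of the work; `square` is not called until `collect` runs. This toy version recomputes from the source on every action and has no notion of a cluster or of caching.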

Next, the authors look at working with key/value pairs, extracting them using ETL (extract, transform, and load), and how they are used to perform aggregations. There’s also a useful section on data partitioning, covering the operations that benefit from it and those that affect it.
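The link between key/value aggregation and partitioning can be sketched in a few lines of plain Python. This is an illustration only, not Spark code: `hash_partition`, `reduce_by_key` and `partitioned_reduce` are hypothetical helpers, and the sample data is made up. The assumption being shown is that once pairs are partitioned by a hash of the key, every value for a given key sits in the same partition, so each partition can be aggregated on its own.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, int]


def hash_partition(pairs: Iterable[Pair], num_partitions: int) -> List[List[Pair]]:
    # Send each pair to a partition chosen by the hash of its key,
    # so equal keys always end up together.
    partitions: List[List[Pair]] = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def reduce_by_key(pairs: Iterable[Pair], fn: Callable[[int, int], int]) -> Dict[Hashable, int]:
    # Combine all values that share a key using fn.
    totals: Dict[Hashable, int] = {}
    for key, value in pairs:
        totals[key] = fn(totals[key], value) if key in totals else value
    return totals


def partitioned_reduce(pairs: Iterable[Pair], fn: Callable[[int, int], int],
                       num_partitions: int) -> Dict[Hashable, int]:
    # Aggregate each partition independently, then merge the results.
    # No key appears in two partitions, so a plain update is enough.
    merged: Dict[Hashable, int] = {}
    for part in hash_partition(pairs, num_partitions):
        merged.update(reduce_by_key(part, fn))
    return merged


# Made-up sample data for illustration.
sales = [("apples", 3), ("pears", 2), ("apples", 4), ("plums", 1), ("pears", 5)]
totals = partitioned_reduce(sales, lambda a, b: a + b, num_partitions=3)
```

In a real cluster the per-partition step would run on different machines; the point of partitioning by key is that the merge step then needs no further combining across partitions.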

As this is an introduction to Spark, the next chapter covers loading and saving data, starting from the different file formats such as text and JSON, before more useful material on filesystems including the local filesystem, Amazon S3, and HDFS. The chapter also covers structured data with Spark SQL, Hive and JSON; and databases – Cassandra, HBase and Elasticsearch.




This is where the book probably begins to be useful for most developers. There’s a good chapter on advanced Spark programming introducing accumulators, broadcast variables, and working on a per-partition basis. The next chapter on running Spark on a cluster is also useful, covering how to package your code and dependencies, and the various cluster managers you might have to interact with.

A chapter on Spark SQL isn’t a tutorial on SQL; instead it looks at the way you use it to load data from structured sources such as JSON, Hive and Parquet; the basics of querying through JDBC and BI tools like Tableau; and how to use it in a Spark program from Python, Java and Scala code.

There’s a good chapter on Spark Streaming and how to work with discretized streams when writing apps that deal with data as it is streamed from a source. However, though the chapter is a good introduction, I think you’d need more examples and information before you could arrive at anything workable.
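For readers new to the term, the idea behind a discretized stream can be sketched in plain Python: incoming events are chopped into small batches by time interval, and each batch is then processed like an ordinary dataset. This is not Spark Streaming's API. The function `discretize`, the event format and the sample data are all hypothetical, and the sketch works on a finished list rather than a live source.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def discretize(events: Iterable[Event], batch_seconds: float) -> List[List[str]]:
    # Group events into batches by which time interval they fall in.
    # Intervals that received no events are skipped in this sketch.
    batches: Dict[int, List[str]] = {}
    for timestamp, payload in events:
        index = int(timestamp // batch_seconds)
        batches.setdefault(index, []).append(payload)
    return [batches[i] for i in sorted(batches)]


# Made-up events for illustration.
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (2.7, "a")]
batches = discretize(events, batch_seconds=1.0)

# Each batch can now be processed with ordinary batch-style code,
# here by counting its events.
counts_per_batch = [len(batch) for batch in batches]
```

A real streaming system also has to deal with events arriving continuously, late data and recovery from failures, none of which this sketch attempts.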

The book closes with a chapter on machine learning with MLlib, Spark’s library of machine learning functions. MLlib is designed to run in parallel on clusters, and it lets you invoke various algorithms on distributed datasets. The authors say the chapter is most relevant to data scientists with a machine learning background who want to use Spark, and that seems a fair analysis; you couldn’t learn about machine learning from the chapter, but you could find out what Spark offers for this audience.

Overall, this is a good introduction to Spark. A lot of the material could be found separately on various Internet sites, but the authors pull it all together and give a cohesive view. If you’re interested in Spark, it’s a good buy.  




Last Updated ( Monday, 20 April 2015 )

Copyright © 2017 i-programmer.info. All Rights Reserved.