Author: Mohammed Guller
Audience: Devs new to Spark
Reviewer: Ian Stirk
This book aims to provide a “...concise and easy-to-understand tutorial for big data and Spark”. How does it fare?
Spark is increasingly the tool of choice for big data processing, being much faster than Hadoop’s MapReduce. After putting Spark into a big data context, the book aims to cover Spark’s core library, together with its more specialized libraries for Streaming, Machine Learning, SQL, and graph processing.
The book is aimed at developers who are new to Spark. Some general programming background is required, but little else.
Chapter 1 Big Data Technology Landscape
This chapter opens with a discussion about the current big data age, with data as the lifeblood of organizations, and growing exponentially. The standard 3Vs definition of big data is explored (velocity, variety, volume). Traditional relational database management systems (RDBMS) are unable to process these large volumes in a timely manner – this is where the scalability of big data systems comes into its own.
Next, the chapter discusses some technologies that are either used with Spark or compete with it. The first is Hadoop, which is fault tolerant and scalable, and runs on commodity hardware. The three major components of Hadoop are discussed: YARN (Yet Another Resource Negotiator), MapReduce (distributed processing model), and HDFS (Hadoop Distributed File System). Spark is increasingly being used in place of MapReduce owing to its faster speed. The section briefly discusses Hive, a data warehouse with a SQL-like interface; Spark SQL is expected to supersede Hive on many systems.
The chapter continues with a look at some common binary formats for serializing (storing on disk) big data, and their pros and cons. Specifically, Avro, Thrift, Protocol Buffers, and SequenceFile are examined. Next, some columnar storage formats, which have performance advantages when the client requires only a subset of columns, are briefly discussed, namely RCFile, ORC, and Parquet.
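To make the columnar advantage concrete, here is a minimal sketch in plain Python. It is not taken from the book and does not use any of the formats named above; the records and field names are hypothetical. It only illustrates why reading one column is cheaper when each column is stored separately.

```python
# Hypothetical three-field records; the field names are illustrative only.
rows = [
    {"user": "a", "country": "UK", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "UK", "bytes": 80},
]


def total_bytes_row_layout(records):
    # Row-oriented layout: each record is stored whole, so summing one
    # field still means visiting every complete record.
    return sum(record["bytes"] for record in records)


def to_columns(records):
    # Column-oriented layout: each field becomes its own list of values.
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_bytes_column_layout(cols):
    # Only the "bytes" column is read; "user" and "country" are never touched.
    return sum(cols["bytes"])


columns = to_columns(rows)
```

Both functions return the same total, but the column layout answers the query from a single list, which is the effect that formats such as Parquet exploit on disk.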
Then a brief overview of messaging systems is provided, together with the advantages of having a layer of abstraction between producers and consumers. Specifically, Kafka and ZeroMQ are discussed with the aid of useful supporting diagrams.
NoSQL is then examined. The various types of NoSQL databases have different aims to the traditional RDBMS, typically trading Atomicity, Consistency, Isolation, Durability (ACID) for scalability and flexibility. The specific NoSQL databases briefly discussed are Cassandra and HBase. I sometimes wonder if it is meaningful to group NoSQL databases together. Is it meaningful to divide sports into Football and NoFootball? Are all the NoFootball sports meaningful as a group?
The chapter ends with a look at some distributed SQL query engines. These do not use MapReduce batch jobs, and are thus more oriented to interactive querying. The engines briefly examined are: Impala, Presto, and Apache Drill.
This chapter provides an excellent overview of big data technology. It should be noted there are many more technologies than described, but the examples given are sufficient to explain the topic areas. This is possibly the best backgrounder to big data I’ve read.
The discussions are very well written, concise and clear, with helpful diagrams, and no wasted words. There’s a good flow between the topics, and useful links between chapters. There are website links for further information. These traits apply to all the chapters in the book.
Chapter 2 Programming in Scala
Scala is a modern programming language, featuring both functional and object-oriented programming. Spark itself is written in Scala, and although Spark supports multiple languages, new functionality is often added to Scala first. Scala is JVM-based, so can use Java libraries.
This chapter provides an overview of Functional Programming, in terms of functions, immutable data structures and expressions. The basic language features are discussed, and some programming language experience is assumed. Succinct details are provided on how to get started with a Scala environment (e.g. IDE, REPL). Brief discussions and example code are given for various programming features, including: variables, functions, classes, operators, traits, tuples and collections. The chapter ends with a very brief standalone Scala application.
This chapter provides a useful, if brief, introduction to Scala, the language used in the book’s examples. Maybe the chapter should have a pointer to where further information can be found? (Such a link is given in the book’s introduction.)
Chapter 3 Spark Core
The big data world is generally moving away from Hadoop’s MapReduce batch processing towards Spark’s in-memory processing. This chapter discusses the advantages of Spark, and the core Spark functionality that underlies Spark’s specialized libraries, which are discussed in subsequent chapters.
The chapter looks at the key features of Spark, compared with MapReduce (e.g. easier to use, faster, generic). Spark is especially suited to iterative algorithms, and interactive analysis. The high-level architecture of Spark is explained with a helpful diagram, before briefly looking at application execution and data sources (e.g. HDFS, HBase, Amazon S3).
Next, Spark’s API is examined, with reference to the SparkContext (a pointer to the Spark environment) and Resilient Distributed Datasets (RDDs). Sample code is provided for the creation of both. RDD operations are either transformations, which perform processing on a source RDD and create a new RDD (RDDs are immutable), or actions, which are RDD methods that return a value to the driver program. This section contains a useful list of transformation and action methods, each including brief example code. Various ways of saving an RDD are given with examples.
It is noted that transformations are lazy operations, requiring an action method to trigger them. Lazy evaluation allows Spark to optimize the RDD operations. Additionally, lineage information allows Spark to create and recreate the RDD when needed. Performance can be enhanced further by caching, and examples of this are provided. The chapter ends with a look at shared variables.
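The interplay of lazy transformations, lineage, actions, and caching can be sketched with a toy class in plain Python. This is not Spark's API and not code from the book: `LazyDataset` and its internals are hypothetical names, and the sketch assumes a single machine and in-memory lists.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = list(source)     # original input data, never modified
        self._lineage = tuple(lineage)  # recorded (kind, function) steps
        self._use_cache = False
        self._cached = None

    # Transformations: record a step and return a NEW dataset. Nothing runs yet.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def cache(self):
        # Only marks the dataset; the result is stored by the first action.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        data = self._source
        for kind, fn in self._lineage:  # replay the lineage from the source
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        if self._use_cache:
            self._cached = data
        return data

    # Actions: trigger the computation and return a value to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return len(self._compute())


calls = []


def double(x):
    calls.append(x)  # record every time the function really runs
    return x * 2


numbers = LazyDataset([1, 2, 3, 4])
doubled = numbers.map(double).filter(lambda x: x > 4).cache()
calls_before_action = len(calls)  # transformations have only recorded steps
result = doubled.collect()        # first action replays the lineage
calls_after_first = len(calls)
total = doubled.count()           # second action is served from the cache
calls_after_second = len(calls)
```

Because each dataset keeps its source and its list of steps, a lost result can always be rebuilt by replaying them, which is the idea behind lineage-based recovery described above.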
This chapter provides a useful overview of Spark’s advantages, and growing popularity. The section on architecture, execution, and data sources helps put Spark processing into context, before drilling down into the API (SparkContext and RDDs) – all illustrated with useful, if brief, example code.
Chapter 4 Interactive Data Analysis with Spark Shell
It’s very easy to get started with Spark, using the command-line tool Spark Shell, which is ideal for quickly testing ideas and learning Spark. This interpreted tool is very similar to the Scala shell. The chapter opens with details on where to download Spark, how to extract the files to your computer, and how to run the Spark shell.
The Spark shell is a REPL (Read-Evaluate-Print Loop) tool: you enter a command, it is run, the output is displayed, and the cycle repeats. Some simple Scala expressions are shown.
The chapter continues with an example of loading an RDD from a list, then applying various methods (filter, count, first, take) – all very useful for ensuring your environment is configured correctly and for getting started. The chapter ends with a similar exercise, but involving analysis of a log file.
This chapter provides a very useful and practical introduction to the Spark shell, great for testing ideas. However, I did wonder if the Spark shell should be introduced much earlier in the book, to provide feedback and encouragement to the reader.
Chapter 5 Writing a Spark Application
While the Spark shell is great for interactive testing of ideas, for production applications you will need to write, build, and deploy a Spark application. These details are discussed here.
The chapter opens with a simple WordCount example, with each line of code explained. Next, sbt (Simple Build Tool) and the associated sbt definition file are discussed. sbt is then used to compile the WordCount source code to create a jar file. The application is then run using the spark-submit script that comes with Spark (spark-submit can also be used for deployment to a live Spark cluster). While debugging distributed applications is difficult, the book suggests inspecting logs and using the supplied web-based monitoring application to get an insight into performance problems.
This chapter provides a gentle walkthrough of the entire process of creating a Spark application: writing, compiling, and running it, with each step, and each line of code or configuration file, clearly explained. You focus on the application logic, while Spark handles the distributed processing.
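For readers unfamiliar with word count, the logic can be sketched in plain Python. The book's own listing is in Scala and is not reproduced here; the three-step shape below (split lines into words, pair each word with 1, sum per word) is an assumption about the usual form of the example, and the function and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    # Step 1: split every line into words and flatten into one stream.
    words = chain.from_iterable(line.split() for line in lines)
    # Step 2: pair each word with an initial count of 1.
    pairs = ((word, 1) for word in words)
    # Step 3: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


sample_lines = ["to be or not to be", "to see or not to see"]
counts = word_count(sample_lines)
```

In a Spark application the same three steps would run in parallel across the cluster, which is what the reviewer means by Spark handling the distributed processing.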