Author: Mike Frampton
Publisher: Packt Publishing
Audience: Spark developers tolerant of bleeding edge technology
Reviewer: Ian Stirk
This book aims to provide a practical discussion of Spark and its major components. How does it fare?
Spark is an increasingly popular Big Data technology, generally performing processing much faster than traditional MapReduce jobs.
This book is for anyone who wants to know more about Spark. In particular, the basic Spark components are discussed, and then Spark is extended with some of the more experimental components.
The book assumes a basic knowledge of Linux, Hadoop, Spark, and SBT, and a reasonable knowledge of Scala. The author suggests using the internet to fill any gaps in your prerequisite knowledge.
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Apache Spark
The chapter opens with an overview of Spark, which is a distributed, scalable, in-memory, parallel processing data analytics system. Spark can be programmed in various languages, including Java, Python, and Scala. The examples in this book use Scala.
The chapter outlines the 4 major Spark components (i.e. Machine Learning, Streaming, SQL, and Graph processing), cloud integration, and the future of Spark. Cluster design is briefly examined. It’s noted that Spark doesn’t have its own storage system, so Hadoop is often used; this has the advantage that Spark can become another component in the Hadoop toolset.
The chapter continues with a look at cluster management, and configuring the Spark cluster. Useful discussions and diagrams explaining the Spark master, worker nodes, client nodes and Spark context are provided. This is followed by a section that examines cluster management running as: local, standalone, using YARN, using Mesos, and using Amazon’s Elastic Compute Cloud (EC2).
Next, performance is briefly examined. Topics include: cluster structure (cloud or shared boxes are often slower), putting applications on their own separate nodes, allocating sufficient memory, and filtering data early in the ETL process.
The chapter ends with a look at the cloud. It’s suggested this is the future direction of technology, with Spark as a service. Various providers are briefly discussed (e.g. Databricks and Google Cloud).
This chapter provides a helpful overview of what Spark is, its major components, its various cluster managers, Spark architecture, and its future. Subsequent chapters expand on the major Spark components, and discuss its promising future in the cloud.
Useful discussions, diagrams, configuration settings, practical example code, website links, and inter-chapter links are given throughout. These traits apply to the whole of the book.
Chapter 2 Apache Spark MLlib
This chapter opens with the Hadoop/Spark environment configuration used for the examples in this book. It assumes some knowledge of Hadoop, and discusses using Cloudera’s CDH 5.1.3 Hadoop components. The architecture of the Hadoop and Spark cluster is described (i.e. 1 NameNode and 4 DataNodes, 1 Spark master and 4 workers).
The chapter continues with a look at the development environment. Scala is used in the examples, because it is more concise than Java. SBT is used to compile the code and create JAR files, which can then be run. The Linux directory structure used by the examples is briefly explained. The section ends with details about how to install Spark manually, since CDH 5.1.3 contains Spark version 1.0, and some of the Machine Learning (ML) examples need Spark version 1.3.
The chapter now switches to the actual ML section. The ML topics examined are: classification with Naïve Bayes, clustering with K-Means, and Artificial Neural Networks (needs Spark 1.3). For each topic, its theoretical meaning is discussed, and then a code-driven practical example is provided.
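To give a flavour of what clustering with K-Means involves, here is a minimal plain-Python sketch of the standard iterative approach (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not the book's code and does not use Spark MLlib; the function name `kmeans`, the sample points, and the choice of the first k points as starting centroids are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points into k groups (illustrative sketch only)."""
    # Assumption for this sketch: start from the first k points,
    # which keeps the result deterministic.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            # Assign the point to the nearest centroid (squared distance).
            nearest = min(
                range(k),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical sample data: two well-separated groups of points.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(sample, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

A distributed implementation spreads the assignment step across worker nodes, but the underlying idea is the same.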
This chapter provides helpful step-by-step walkthroughs of some of the ML topics that Spark can process. The author makes a valid point that the same approach given in the examples can be used to examine the other Spark MLlib features.
It might have been better to have discussed installation and configuration of Hadoop, SBT, Spark, etc. in a discrete chapter of their own, since they are applicable to all the Spark components, and not specific to this ML chapter.
Perhaps a page could have been used to describe, generally, what ML is, and how it works (training and testing), rather than having it inferred from the text.
Chapter 3 Apache Spark Streaming
This chapter opens with an overview of Spark streaming (i.e. processing data as it continuously arrives), and describes various clients (e.g. Twitter). There’s a useful diagram illustrating the process flow between clients, Spark streaming, other Spark modules (e.g. SQL), and dashboards/databases. The stream is broken down into discrete streams (DStream), based on batch time. There’s a helpful base example showing the creation of a Spark stream context, using the Spark context.
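The idea of breaking a continuous stream into discrete batches by batch time can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the Spark Streaming API or the book's example; the function name `to_batches`, the event format, and the sample data are assumptions.

```python
from collections import defaultdict


def to_batches(events, batch_seconds):
    """Group (timestamp_seconds, value) events into fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Integer division maps each timestamp to the start of its batch window.
        window_start = (timestamp // batch_seconds) * batch_seconds
        batches[window_start].append(value)
    # Return the batches in time order, as a stream processor would emit them.
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0, "a"), (3, "b"), (10, "c"), (12, "d"), (25, "e")]
for start, values in to_batches(events, batch_seconds=10):
    print(start, values)
```

Each batch can then be processed as an ordinary small dataset, which is what lets the streaming module reuse the rest of Spark.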
The chapter continues with a look at errors and recovery. In some cases, it may be OK to ignore errors (e.g. when gathering performance metrics), and the application can just be restarted. For systems where the data cannot be lost, the system needs to be restarted from a checkpoint. An example is provided showing how to set up checkpointing in HDFS.
The chapter then enters the main section, concerning streaming sources. This section discusses various streaming options, providing practical code-based examples, and expanding on the underlying architecture where necessary. The sources examined are: TCP, File, Flume, and Kafka. In addition, Twitter was examined as a streaming source earlier in the chapter.
This chapter provides a practical discussion of Spark streaming using various sources. Helpful diagrams and code examples are provided throughout. The base code provided could be used as a template for your own streaming code.
Chapter 4 Apache Spark SQL
This chapter opens with a look at the SQL context (created from the Spark context), which is the entry point for processing table data. The recent releases of Spark have included DataFrames, which allow columns to be referenced by name and given specific data types rather than being addressed by offset, resulting in cleaner code. Example code is provided showing importing CSV data from a file, splitting it by separator, and then converting the RDD (Resilient Distributed Dataset) into a DataFrame via the toDF method.
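The benefit of named, typed columns over positional offsets can be shown with a small plain-Python analogy. This is not Spark code and not the book's example; the `Person` schema, the sample lines, and the `parse` helper are hypothetical.

```python
from collections import namedtuple

# Hypothetical schema: each CSV line holds a name and an age.
Person = namedtuple("Person", ["name", "age"])

# Hypothetical sample data standing in for lines read from a CSV file.
raw_lines = ["alice,34", "bob,45"]


def parse(line, separator=","):
    """Split a line by separator and attach names and types to the fields."""
    fields = line.split(separator)
    return Person(name=fields[0], age=int(fields[1]))


people = [parse(line) for line in raw_lines]

# Fields are now referenced by name (p.age) rather than by offset (fields[1]).
over_40 = [p.name for p in people if p.age > 40]
print(over_40)
```

Referring to `p.age` rather than `fields[1]` is the kind of readability gain the reviewer attributes to DataFrames.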
The chapter continues with a look at importing and saving data. Code examples and discussions are provided for processing text, JSON, and Parquet files. DataFrames are examined in greater detail. Manipulation via filtering, selection, display, and group by are all discussed with code examples.
SQL code is examined further, showing filtering via the WHERE clause, and joining tables. SQL can be extended by the creation of User Defined Functions (UDF), and an example UDF is provided.
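The three ideas mentioned here (filtering with WHERE, joining tables, and extending SQL with a user-defined function) can be tried out using Python's built-in `sqlite3` module. This uses SQLite rather than Spark SQL, so it is only an analogy; the tables, rows, and the `salary_band` function are invented for the sketch.

```python
import sqlite3

# In-memory SQLite database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept_id INTEGER, salary REAL)")
conn.execute("CREATE TABLE department (dept_id INTEGER, dept_name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("alice", 1, 50000.0), ("bob", 2, 30000.0), ("carol", 1, 42000.0)],
)
conn.executemany(
    "INSERT INTO department VALUES (?, ?)",
    [(1, "engineering"), (2, "sales")],
)


def salary_band(salary):
    """Hypothetical user-defined function: classify a salary."""
    return "high" if salary >= 40000 else "low"


# Register the Python function so it can be called from SQL.
conn.create_function("salary_band", 1, salary_band)

rows = conn.execute(
    """
    SELECT e.name, d.dept_name, salary_band(e.salary)
    FROM employee e
    JOIN department d ON e.dept_id = d.dept_id
    WHERE e.salary > 35000
    ORDER BY e.name
    """
).fetchall()

for row in rows:
    print(row)
```

The pattern of defining a function in the host language, registering it under a name, and then calling it from a query is the same one the chapter describes for Spark SQL.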
The chapter continues with a look at integration with Hive. Hive is Hadoop’s data warehouse, thus a primary source of data. The author suggests using Impala for fast in-memory processing, and using Spark with Hive for batch processing, perhaps as part of an ETL chain. Code is provided showing how Spark interacts with Hive.
This chapter provides a very helpful overview of Spark SQL and its many functions (e.g. executing SQL, registering UDFs, DataFrames, etc.). The example code and SQL processing should be useful in your own code.
The introduction of DataFrames in Spark 1.3 is undoubtedly an important step. Part of me laments that it wasn’t present from the beginning, but I realise Spark is a recent technology that is continuously adding functionality. This chapter illustrates the importance of tabular data and SQL processing, something that can be usefully carried over from traditional relational environments.