Mastering Apache Spark
Chapters 5 - 9, Conclusion

Author: Mike Frampton
Publisher: Packt Publishing
ISBN: 978-1783987146
Print: 1783987146
Kindle: B0119R8J00
Audience: Spark developers tolerant of bleeding edge technology
Rating: 5.0 
Reviewer: Ian Stirk

Chapter 5 Apache Spark GraphX

This chapter opens with an overview of graph terminology: a graph is a data structure with nodes (vertices) and connections (edges). Graphs have many uses, including fraud detection and social modelling. GraphX is Spark’s graph processing technology.

The chapter continues with a look at GraphX coding. A family tree is used in the examples to illustrate the concepts and processing. First the environment is discussed, specifically the directory structure, SBT environment, and source code compilation into JAR files.

Generic Scala GraphX code is described, and is used in all the subsequent examples. The generic code (a minimal sketch follows the list):

  • imports the Spark context, GraphX, and RDD functionality

  • defines the application (it extends the App class)

  • defines the data files (HDFS server and port, vertex and edge data files)

  • defines the Spark master and port

  • creates a Spark configuration object

  • creates a Spark context, using the Spark configuration object

  • loads the vertex data file and the edge data file into RDDs

  • creates a graph from the vertex RDD and the edge RDD
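
To give a flavour of the template, here is my own minimal reconstruction, not the book's code: the HDFS server and port, the file paths, the Spark master URL, and the comma-separated file layout are all placeholder assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object GenericGraph extends App {

  // Placeholder HDFS and Spark master settings
  val hdfsServer  = "hdfs://localhost:8020"
  val vertexFile  = hdfsServer + "/data/graph/vertices.txt"
  val edgeFile    = hdfsServer + "/data/graph/edges.txt"
  val sparkMaster = "spark://localhost:7077"

  // Create the Spark configuration object and the Spark context
  val conf = new SparkConf().setMaster(sparkMaster).setAppName("GenericGraph")
  val sc = new SparkContext(conf)

  // Load the vertex data: each line assumed to be "id,name,age"
  val vertices: RDD[(VertexId, (String, Int))] =
    sc.textFile(vertexFile).map { line =>
      val fields = line.split(",")
      (fields(0).toLong, (fields(1), fields(2).toInt))
    }

  // Load the edge data: each line assumed to be "srcId,dstId,relationship"
  val edges: RDD[Edge[String]] =
    sc.textFile(edgeFile).map { line =>
      val fields = line.split(",")
      Edge(fields(0).toLong, fields(1).toLong, fields(2))
    }

  // Build the graph from the vertex and edge RDDs
  val graph = Graph(vertices, edges)
}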

The chapter continues with various graph examples, including (see the sketch after this list):

  • counting (simple example, prints the number of vertices and edges)

  • filtering (filter based on a person’s age and relationship)

  • PageRank (assumes the vertex with the most links is the most important)
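
For a flavour of these operations, here is a minimal sketch of my own, assuming the graph built in the earlier skeleton, with (name, age) vertex attributes and relationship strings on the edges:

// Counting: print the number of vertices and edges
println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")

// Filtering: keep people older than 40, connected by Mother/Father edges
val filtered = graph.subgraph(
  epred = triplet => triplet.attr == "Mother" || triplet.attr == "Father",
  vpred = (id, person) => person._2 > 40
)
println(s"filtered vertices: ${filtered.vertices.count()}")

// PageRank: rank vertices by importance, then list the top five
val ranks = graph.pageRank(0.0001).vertices
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)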

Since Spark doesn’t have its own storage, it can’t do in-place processing. An example is provided of using the Neo4j graph database to provide in-place processing, via Mazerunner, a GraphX-Neo4j processing prototype. It shows how a graph-based database can be used for graph storage, with Spark used for graph processing.

This chapter provides a useful introduction to graph terminology, architecture, and processing. Useful practical example code is provided. The generic GraphX code template should prove a useful base for your own graph processing code. The section on Mazerunner was useful, illustrating the potential of combining graph storage and graph processing.

Chapter 6 Graph-based Storage

This chapter opens with a discussion about Spark not providing its own storage. Graph data needs to be sourced from somewhere, and after processing, needs to be stored somewhere. This chapter primarily discusses the use of the Titan graph database for storage.

The chapter proceeds with details on how to download, install, and configure Titan, an interesting but not yet mature product. The various storage options are discussed; these include the HBase and Cassandra NoSQL databases. The chapter shows how the associated Gremlin shell can be used interactively to develop graph scripts and Bash shell scripts.

The chapter continues with a look at using Titan with HBase. Details are given on how to install HBase from the CDH distribution. Gremlin HBase scripts are provided that define the storage backend, the ZooKeeper servers and ports, and the HBase tables. Useful example code is provided that shows the processing and storage of the HBase graph data. A similar exercise is undertaken for the Cassandra database.

The chapter ends with a look at accessing Titan with Spark, showing the use of Spark to create and access Titan-based graphs.
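
As a flavour of driving Titan from Scala, here is a minimal sketch of my own, not the book's code. It uses the TinkerPop3-style API of the Titan 0.9 era; the backend settings and host name are placeholders, and the method names are from memory, so treat them as assumptions and check the Titan documentation.

import org.apache.commons.configuration.BaseConfiguration
import com.thinkaurelius.titan.core.TitanFactory

// Placeholder storage settings; real values depend on your HBase/ZooKeeper setup
val conf = new BaseConfiguration()
conf.setProperty("storage.backend", "hbase")
conf.setProperty("storage.hostname", "zookeeper-host")  // hypothetical host

// Open (or create) the Titan graph over the HBase backend
val graph = TitanFactory.open(conf)

// Add two vertices and a relationship edge, then commit the transaction
val mike  = graph.addVertex("name", "Mike")
val sarah = graph.addVertex("name", "Sarah")
mike.addEdge("Sister", sarah)
graph.tx().commit()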

This chapter provides a helpful overview of some of the newer and more experimental graph storage technologies.

Chapter 7 Extending Spark with H2O

The book now switches back to machine learning. While Spark MLlib provides lots of useful functionality, more options are available when integrating Spark with the Sparkling Water component of H2O.

Details are provided on how to download, install and use H2O. This is followed by environment details, including directory structure, SBT config file content, and the use of Bash scripts to execute the H2O examples.

The chapter continues with a look at the H2O architecture. It is possible to interact with the data via a web interface, and this is described. Data is shared between Spark and H2O via the H2O RDD, and this is shown in an example (H2O does the processing and the results are passed back to Spark). H2O contains algorithms for deep learning, and this is examined next. Deep learning networks are feature rich, with extra hidden layers giving a greater ability to extract data features. Examples are provided.
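
To illustrate the Spark-to-H2O handover and the deep learning API, here is a minimal sketch of my own, not the book's code. The class and method names reflect the 2015-era Sparkling Water and H2O APIs as I recall them, and trainingData and the "label" column are placeholder assumptions.

import org.apache.spark.h2o._
import _root_.hex.deeplearning.DeepLearning
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters

// Start H2O on top of an existing Spark context (sc)
val h2oContext = new H2OContext(sc).start()

// Share data: convert an existing Spark DataFrame (trainingData, assumed
// to be defined elsewhere) into an H2O frame for H2O to process
val trainFrame: H2OFrame = h2oContext.asH2OFrame(trainingData)

// Configure a deep learning model with two hidden layers
val params = new DeepLearningParameters()
params._train = trainFrame._key          // the frame to train on
params._response_column = "label"        // hypothetical target column
params._hidden = Array(200, 200)         // two hidden layers of 200 neurons

// Train the model in H2O; predictions can be passed back to Spark
val model = new DeepLearning(params).trainModel.get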

The chapter ends with a look at H2O Flow, a web-based interface for H2O and Sparkling Water that’s used for monitoring, manipulating data, and training models. Example code is provided.

This chapter shows how Spark MLlib can be extended using H2O libraries. The general architecture of H2O is examined, together with download, installation and configuration details. Various extra data analysis and modelling features are shown.

Chapter 8 Spark Databricks

Creating a big data analytics cluster, together with importing data and ETL, can be difficult and expensive. Databricks aims to make this task easier. Databricks is a cloud-based service that provides functionality similar to that of an in-house Spark cluster. Currently only Amazon Web Services (AWS) is supported, but there are plans to cover other cloud platforms.

The chapter opens with an overview of Databricks: its clusters are similar to Spark clusters, with a master, slaves, and executors. Configuration and server size are defined, and monitoring and security are built in. With the cloud platform, you only pay for what you use. Work is organised into notebooks and folders, which can hold code and scripts. It is also possible to create jobs.

The chapter continues with a look at how to install Databricks. It’s noted that AWS offers one year of free access, 5GB of storage, and 750 hours of EC2 usage, all of which means low-cost access. The various steps needed to get up and running are then described (account ID, access key ID, and secret access key, used by Databricks to access your AWS storage). AWS billing is briefly discussed.

Various administration features are discussed, including Databricks menus (used to perform actions on folders) and account management (adding accounts, changing passwords, etc.). Cluster management is briefly discussed, and a step-by-step walkthrough of creating and configuring a new cluster is shown. Examples are provided on how to create and use notebooks and folders. Running various jobs and libraries is briefly discussed.

The chapter ends with a look at Databricks tables, again mostly via the admin website. Using menu options, it is possible to create a table via an import. It is also possible to create external tables, and to create tables programmatically. Sample table data can be previewed, and SQL commands run against the tables. Finally, the DBUtils package is examined and some of its methods discussed, as sketched below.
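
As a taste of DBUtils, a notebook cell might contain calls like these (a sketch of my own; the file path is a placeholder):

// List the root of the Databricks file system (DBFS)
display(dbutils.fs.ls("/"))

// Show the built-in help text for the file system utilities
dbutils.fs.help()

// Peek at the first bytes of a file (placeholder path)
dbutils.fs.head("/tmp/sample.csv")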

This chapter provides useful information about the current status of Spark in the cloud. There are useful walkthroughs for setting up a cluster to use in the cloud. Databricks knows a lot about Spark, having been founded by its creators!

Chapter 9 Databricks Visualization

The previous chapter laid the foundation for Spark in the cloud; this chapter extends that to data visualization. Databricks provides dashboards for data visualization, based on the tabular data that SQL produces. Various menu options allow the data to be presented in different formats.
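
For example, a notebook cell can run SQL and render the result with the built-in display function; the plot menu beneath the output then converts the table into a chart (a sketch of my own; the sales table and its columns are placeholders):

// Run SQL over a registered table and display the tabular result;
// the plot menu under the output turns it into a bar chart, pie chart, etc.
display(sqlContext.sql("SELECT category, COUNT(*) AS total FROM sales GROUP BY category"))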

The chapter continues with a step-by-step walkthrough of the creation of a simple dashboard, which is published so it can be accessed by an external client. This is followed by the creation of an RDD-based report and a stream-based report.

Next, the REST interface is discussed; this allows integration of your Databricks cloud instance with external applications. Code for this is given and discussed. Various methods of moving data in and out of Databricks are then described, with examples. The chapter ends with a brief mention of some resources from which you can obtain further information and help about Databricks.

This chapter provides a useful overview of Databricks visualization. The use of menus and the step-by-step walkthroughs make the chapter particularly easy to understand. The author believes the natural progression of big data processing is:

Hadoop → Spark → Databricks.

Time will tell.

Conclusion

This book has well-written discussions, helpful examples, diagrams, website links, inter-chapter links, and useful chapter summaries. It contains plenty of step-by-step code walkthroughs to help you understand the subject matter.

The book describes Spark’s major components (i.e. Machine Learning, Streaming, SQL, and Graph processing), each with practical code examples. Some of the template code could form the basis of your own application code.

Several of the core Spark components are extended using less well-known components, many of which are still works in progress. I’m not sure how many readers will find these chapters/sections useful, since they often involve workarounds, and the components might not survive, or might be superseded later – they can also distract from the book’s core. That said, if you enjoy working at the bleeding edge of technology, you’ll enjoy what these extensions add.

Although the book assumes some knowledge of Spark, for completeness it might have been useful to have some introduction to it (e.g. explaining RDDs, introducing the spark-shell, etc.). Developers coming from a Windows environment might initially struggle to understand Linux, SBT, JARs, etc.

Despite these concerns, I enjoyed this book; it contains plenty of useful detail. Spark is a rapidly changing technology, so check http://spark.apache.org/ for the latest changes. The book is highly recommended.

Related Articles

Reading Your Way Into Big Data - Ian Stirk recommends the reading required to take you from novice to competent in areas relating to Big Data, Hadoop, and Spark

Big Data Analytics with Spark - review by Ian Stirk 

Learning Spark - review by Kay Ewbank

