|Scalable Big Data Architecture|
Page 2 of 2
Author: Bahaaldine Azarmi
Chapter 4 Streaming Data
This chapter takes a look at the logging architecture used by the solution, involving ingesting and indexing data. The chapter opens with a discussion of streaming architecture, which here means an architecture capable of ingesting data as it arrives. The use of buffering or messaging brokers to hold the data is discussed, before looking at the specific technologies involved (Flume, Logstash, etc.).
The chapter continues with a look at the structure of the ingested data. Firstly, clickstream data is discussed; this logs the web-based activity of website visitors, and this visitor behaviour data can prove very valuable. The section continues with a look at the structure of the raw data (and the appropriate web server settings to log the data). The section ends with a look at a Log Generator, a short Python script that generates data (simulating website visitors) that can be used subsequently by the solution.
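To give a flavour of what such a Log Generator does, here is a minimal sketch of my own (the IP ranges, page names, and log format are invented for illustration; this is not the book's script):

```python
import random
import time

# Hypothetical clickstream log generator (not the book's script):
# emits access-log-style lines simulating website visitors.
PAGES = ["/", "/products", "/products/tv", "/products/phone", "/cart"]
IPS = ["10.0.0.%d" % i for i in range(1, 20)]

def log_line(now=None):
    """Build one fake access-log line in common log format."""
    now = now or time.strftime("%d/%b/%Y:%H:%M:%S +0000", time.gmtime())
    ip = random.choice(IPS)
    page = random.choice(PAGES)
    status = random.choice([200, 200, 200, 302, 404])  # mostly successes
    size = random.randint(200, 5000)
    return '%s - - [%s] "GET %s HTTP/1.1" %d %d' % (ip, now, page, status, size)

if __name__ == "__main__":
    for _ in range(5):
        print(log_line())
```

Generating synthetic traffic like this lets you exercise the whole ingest pipeline before any real visitors arrive.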
Next, details are provided on how to set up the streaming architecture, specifically how to move the logs into Apache Kafka; this solution uses the Logstash-forwarder. Code is provided to do this, and is discussed. Next come details of how to extract the logs from Apache Kafka and index them. Again, code and setup are described.
The chapter’s title is misleading, since the actual stream processing is in the next chapter. The author acknowledges the term could be misunderstood; I would call the chapter “Logging Architecture”.
The setup given here seems quite convoluted. There’s a lot to be said for a simpler system; systems with fewer parts are typically easier to understand and maintain.
Chapter 5 Querying and Analyzing Patterns
This chapter opens with a look at analytics strategy, providing two approaches to analyzing the data, namely: continuous processing and real-time querying.
The chapter continues with processing and indexing the data using Spark. First, details are provided on how to implement Spark streaming; this is followed by a discussion of the basics of how Spark applications work (code is provided). Next, details are provided on how to implement a Spark streamer and Spark indexer; a useful diagram is given and code discussed. The section ends with a look at the use of the PageStatistic object and the storing of relevant metrics.
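The Spark code itself is in the book; purely to illustrate the kind of per-page aggregation a Spark streamer performs before indexing, here is a plain-Python stand-in (a PageStatistic-style hit count, not the actual Spark API):

```python
from collections import Counter

def page_statistics(log_lines):
    """Count hits per page from raw access-log lines.
    Conceptual stand-in for a Spark-style PageStatistic aggregation."""
    pages = (line.split('"')[1].split()[1]  # request URI from "GET /x HTTP/1.1"
             for line in log_lines if '"' in line)
    return Counter(pages)

logs = [
    '10.0.0.1 - - [01/Jan/2016:00:00:00 +0000] "GET / HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2016:00:00:01 +0000] "GET /cart HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Jan/2016:00:00:02 +0000] "GET / HTTP/1.1" 200 2048',
]
stats = page_statistics(logs)
# stats["/"] == 2, stats["/cart"] == 1
```

In the book's architecture the same shape of result (page, metric) is what gets pushed into Elasticsearch for querying.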
Next comes a section on analytics with Elasticsearch, which has a powerful API for performing real-time queries in a scalable manner. Bucket aggregations (which group sets of documents) and metric aggregations (which compute the average, min, max, etc. over a set of documents) are discussed, and code provided.
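Elasticsearch expresses these as a JSON query body, with a metric aggregation nested inside a bucket aggregation; a sketch of the shape, with the field names (`page`, `bytes`) invented for illustration rather than taken from the book:

```python
# Hypothetical Elasticsearch query body: a terms bucket aggregation
# (group documents by page) with a nested avg metric aggregation
# (average response size per page). Field names are illustrative.
query = {
    "size": 0,  # return only aggregation results, no individual hits
    "aggs": {
        "pages": {                      # bucket aggregation
            "terms": {"field": "page"},
            "aggs": {
                "avg_size": {           # metric aggregation, per bucket
                    "avg": {"field": "bytes"}
                }
            }
        }
    }
}
```

Posting this body to an index's `_search` endpoint returns one bucket per page, each carrying its own average.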
The chapter ends with a look at using Kibana to visualize data. Kibana provides a dashboard that shows the Elasticsearch aggregations in action. There’s a short walkthrough on how to configure the various types of chart.
Perhaps the Spark REPL tool could have been mentioned for testing your Spark code/ideas (and ensuring your configuration is correct!).
Chapter 6 Learning From Your Data?
This chapter looks at Machine Learning, and its use to enhance the proposed solution. The chapter opens with a look at some machine learning concepts: supervised learning (where we know the expected results) and unsupervised learning (where we don’t). Next, machine learning in Spark is briefly examined, before discussing the use of the K-means algorithm (which groups similar data into clusters) in the solution; by clustering website visitors, better product recommendations can be made.
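The book uses Spark MLlib's K-means; as a concept refresher only, the core assign/update loop can be sketched in plain Python (one-dimensional data, fixed iteration count — nothing like the distributed MLlib implementation):

```python
import random

def kmeans_1d(points, k, iterations=10, seed=42):
    """Naive 1-D K-means, illustrating the assign/update loop only."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups of "visitor activity" values:
data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers = kmeans_1d(data, k=2)
# centers converge near [1.0, 9.07]
```

The same idea, applied to multi-dimensional visitor feature vectors, is what lets the solution group visitors with similar behaviour.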
The chapter continues with a section on enriching the clickstream data; Spark is used to map the category of product by visitor. Some useful diagrams and code are provided for this. Next, the data is given a category label. The data is then used for training, and afterwards to make predictions. Spark code is provided for this.
This chapter shows how machine learning can be used on training data to create a model that can then be used to make predictions, in this case, to allow the website visitor to receive better product recommendations.
Chapter 7 Governance Considerations
Governance here means how the architecture will be deployed. Recently, Docker has become a standard for distributed architecture deployment. The chapter opens with an overview of Docker, which provides a set of tools and APIs to help construct your infrastructure and reduce deployment time. Docker terms are defined, including: container (the smallest unit that runs your application), image (describes a container), client (the interface that interacts with Docker features), and daemon (co-ordinates it all); a helpful diagram shows how they’re related.
The chapter proceeds with a look at installing Docker, and then creating Docker images, one for each technology used in the solution. Docker Compose is then examined as a means of helping with container configuration and runtime preferences.
The chapter next looks at architecture scalability. Various factors are examined in considering sizing, including: data volumes, data structure, retention period, high availability. It’s suggested you experiment with your system to determine how sustainable the architecture is. Next, monitoring the infrastructure using the Elastic Stack is briefly discussed. The chapter ends with a brief discussion around security.
The chapter provides a brief but useful introduction to the use of Docker to package and deploy your architecture.
There are more nonsense sentences, e.g. “When you are sizing your architecture to handle a certain data volume and reach certain SLAs, xx. The answer is always it depends.”
This book aims to be “A practitioner’s guide to choosing relevant big data architecture”. Additionally, it is for “...developers, data architects, and data scientists looking for a better understanding of how to choose the most relevant pattern for a Big Data project and which tools to integrate into that pattern”.
I had expected a discussion of the different architectures and technologies, compared and contrasted. Instead, the first chapter tells me what technologies will be used, but gives insufficient justification for why they were selected.
While the architecture pattern and the tools chosen do create a scalable architecture, they are only one of many potential solutions. I note that the author works for Elastic, and Elastic software forms the basis of the proposed solution. I wonder if there is a conflict of interest here?!
Parts of the book were awkward to read, made worse by problems with the grammar. Missing sections and unresolved cross references occur throughout the book, indicating revision was incomplete, together with nonsense sentences such as:
“That may look weird to end with part, but I prefer that the streaming to pipe be ready to work before diving on the core usage of Spark, the data processing.”
The skill level expected of the reader seems to be variable. While the book explains machine learning for the beginner, it casually throws in terms like “infinity converging model”, “funnel conversion”, and “Spark Direct Acyclic Graph” without explanation. Many sections were very light on detail. Also, there is no introduction to the book itself (unless you count the book’s back cover?!). I wonder if the reviewers/editors were sleeping.
As I've already suggested, a more specific subtitle such as:
“An example big data architecture using preselected components, based around Elastic’s software”
might help potential purchasers.
|Last Updated ( Friday, 04 March 2016 )|