Author: Guy Harrison
Date: December 30, 2015
Audience: Architects, DBAs, and Devs
Reviewer: Ian Stirk
To mark the beginning of the New Year we are republishing our most popular book review of 2016.
If you've already read it, share it with others.
This book aims to help you choose the correct database technology in the era of Big Data, NoSQL, and NewSQL. How does it fare?
This book is aimed at:
“enterprise architects, database administrators, and developers who need to understand the latest developments in database technologies”.
Some existing knowledge of databases (relational and NoSQL) is useful in understanding the book.
Below is a chapter-by-chapter exploration of the topics covered.
Part I: Next Generation Databases
Chapter 1 Three Database Revolutions
The book opens with a diagram showing the timeline of major database releases, divided into three periods: pre-relational (1950-1972), relational (1972-2005), and Next Generation (2005-2015). This book is concerned with the Next Generation databases, but first a bit of history and context...
The chapter takes a brief look at the first database revolution, involving Database Management Systems (DBMS) such as hierarchical databases (e.g. IMS) and network databases (e.g. IDMS), running on mainframes. These systems were relatively inflexible and difficult to maintain.
Next, the second database revolution is examined, concerned with the widely used relational databases (RDBMS). These are based on relational theory, with its tuples, relations, constraints, normalization, and transactions. The widespread adoption of SQL enhanced their usage.
The chapter next looks at the third database revolution, initiated by massive internet growth which created pressure on RDBMS scalability. Realising this, Google, the largest website, looked at how the growing amounts of data could be processed in a timely manner. Google published 3 influential papers, one about a distributed file system, the second about distributing processing (MapReduce), and the third about a distributed database (BigTable). These ideas later formed the basis of Hadoop.
The old relational architecture was re-examined by Stonebraker in 2007, who suggested making changes relating to in-memory and columnar databases; this became NewSQL. There is a brief overview of the non-relational database explosion. The author acknowledges NoSQL is an unfortunate term, since it defines what the database isn’t rather than what it is.
The chapter ends with the conclusion that there is no longer a one-size-fits-all database solution. Unlike the relational model, the NoSQL databases do not have a common architectural pattern. The growth of the Internet of Things (IoT), social media, and ever increasing amounts of (often unstructured) data, all indicate the growing need for scalable Next Generation databases.
This chapter provides a great overview of each database revolution, in the context of broader technology changes. It is the best overall explanation I’ve read of what happened in the database world, together with the underlying reasons.
There are a few small errors. For example, “... a huge number of relational database systems emerged in the first half of the 2000s” should read ‘Next Generation’ instead of ‘relational’.
The chapter is easy to read, with useful explanations, considered discussions, helpful diagrams, inter-chapter references, and website links. These traits apply to the whole of the book.
Chapter 2 Google, Big Data, and Hadoop
The chapter opens with a look at the ever increasing amounts of data being generated, together with the impact of cloud, mobile and social media as part of the Big Data revolution.
Next, Google’s pioneering Big Data processing is discussed, including: use of low-cost commodity servers, distributed file system, MapReduce, and BigTable. Helpfully, Google made details of these innovations available for others, and they formed the basis of Hadoop.
The chapter next looks at the origins of Hadoop, and its relatively quick adoption by companies looking to process massive amounts of data. The power of Hadoop is discussed in terms of its scalability, cost, and reliability. Next, Hadoop’s high-level architecture is described, with its Hadoop Distributed File System (HDFS), MapReduce processing model, NameNode (controller), DataNodes (workers), and YARN (resource manager).
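To make the MapReduce processing model mentioned above more concrete, here is a minimal single-machine sketch of its map, shuffle, and reduce phases, using the classic word-count example. It is an illustration only, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel across DataNodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of text documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big tables"]))
```

Because each map call looks at only one document and each reduce call at only one key, the work can be spread over many machines, which is the property that gives the model its scalability.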
The chapter continues with a look at some of Hadoop’s related technologies. HBase is a fault-tolerant distributed database, based on Google’s BigTable. Hive is Hadoop’s data warehouse, and provides SQL-like querying via Hive Query Language (HQL). Pig is a scripting language, which like Hive translates into batch MapReduce jobs. Pig provides much more functionality than Hive. Both Hive and Pig provide a simpler abstraction of MapReduce, allowing many more users to query Big Data.
The chapter ends with a very brief look at some other Hadoop technologies: Flume, Sqoop, Zookeeper, Oozie, and Hue. The recent growing use of Spark is noted.
This chapter provides a useful overview of the technologies that drove the creation of the Next Generation databases, with useful discussions of Google’s foundational distributed processing papers, and Hadoop’s major components and related technologies.
Chapter 3 Sharding, Amazon, and the Birth of NoSQL
It’s noted that between 1995 and 2005, the importance of the Internet grew enormously. Relational databases supporting the larger websites could no longer cope, giving rise to new web-scale database systems, called NoSQL. Web 1.0 refers to static web pages, while Web 2.0 refers to dynamic content; the latter drove the demand for improved scalability.
Dynamic websites require database content, and initially scalability was limited by the power of a single database server; for more scalability, a bigger machine was purchased. The chapter discusses other methods of improving the scalability of relational databases including: Memcached, read replication, and sharding. With scale-up limitations, eventually a different architecture was required.
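As a rough illustration of the sharding idea mentioned above, the sketch below routes each key to one of several independent stores by hashing the key. It assumes simple hash-based partitioning, which is only one possible scheme; the ShardedStore class and its methods are hypothetical, and each in-memory dictionary stands in for a separate database server.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys over several shards."""

    def __init__(self, shard_count=4):
        # Each dict stands in for an independent database server.
        self.shards = [{} for _ in range(shard_count)]

    def shard_index(self, key):
        """Pick a shard deterministically from a hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_index(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_index(key)].get(key)


if __name__ == "__main__":
    store = ShardedStore()
    store.put("user:42", {"name": "Ada"})
    print(store.shard_index("user:42"), store.get("user:42"))
```

A lookup by key touches one shard only, but a query spanning many keys has to visit every shard, which hints at why sharded relational systems lose some of the conveniences of a single server.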
Next, the chapter discusses the reasoning behind this different approach: the CAP theorem, which says that when a network partition occurs, a distributed system can provide either consistency or availability, but not both. On many NoSQL databases, consistency is typically sacrificed for availability, giving rise to ‘eventual consistency’.
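The following toy model may help show what ‘eventual consistency’ means in practice. It is a sketch under simplifying assumptions: the EventuallyConsistentStore class is hypothetical, replication is triggered by an explicit sync() call rather than by a background process, and conflicting writes are not handled at all.

```python
class EventuallyConsistentStore:
    """Toy replicated store: writes land on one replica and spread later."""

    def __init__(self, replica_count=2):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value, replica=0):
        # The write is acknowledged at once, so the system stays available.
        self.replicas[replica][key] = value
        self.pending.append((replica, key, value))

    def read(self, key, replica=0):
        # A read may return stale data if replication has not caught up.
        return self.replicas[replica].get(key)

    def sync(self):
        """Copy every pending write to the replicas that have not seen it."""
        for source, key, value in self.pending:
            for index, data in enumerate(self.replicas):
                if index != source:
                    data[key] = value
        self.pending.clear()


if __name__ == "__main__":
    store = EventuallyConsistentStore()
    store.write("cart", ["book"])
    print(store.read("cart", replica=1))  # stale read before replication
    store.sync()
    print(store.read("cart", replica=1))  # replicas now agree
```

Between the write and the sync, the two replicas disagree; the system has chosen to keep answering requests rather than wait for every copy to be updated.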
This chapter provides a very helpful summary of the limitations of relational database servers, some methods of extending their scalability, the growth of the internet as a driver for scalability, and finally how NoSQL databases could use the CAP theorem to provide much greater scalability.
Chapter 4 Document Databases
The chapter opens with a brief history, giving the reasons why document databases have become increasingly popular. The early document databases were based on XML, but soon JSON became more popular, especially with web-based systems.
The chapter continues with a detailed look at JSON document databases, showing the interaction between JSON and AJAX, and the typical storage hierarchy (documents and collections). Next, some specific document databases are briefly examined, including Couchbase and MongoDB.
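For readers new to the document model, the sketch below shows the document and collection hierarchy described above using plain Python data. It does not use any particular database's API: the find helper, the field names, and the sample data are all hypothetical.

```python
import json

# A collection is a group of documents; each document is a self-describing
# JSON object, and documents in one collection need not share a structure.
posts = [
    {"_id": 1, "title": "Choosing a database", "author": "alice",
     "tags": ["nosql", "review"]},
    {"_id": 2, "title": "Hadoop notes", "author": "bob", "tags": ["hadoop"],
     "comments": [{"by": "alice", "text": "Useful"}]},
]


def find(collection, **criteria):
    """Return the documents whose top-level fields equal all the criteria."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]


if __name__ == "__main__":
    # Documents serialise directly to the JSON a web client would consume.
    print(json.dumps(find(posts, author="bob"), indent=2))
```

Note how the second document embeds its comments instead of referring to a separate table, which is typical of how document databases avoid joins.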
This chapter provides a useful overview of document databases, their rise, main features, and some specific vendor examples.
Chapter 5 Tables are Not Your Friends: Graph Databases
Graphs describe the relationships between things, e.g. friends. The chapter opens by defining some graph terminology: vertices (nodes), edges (relationships), and properties. RDBMSs can define graphs using self-joins; however, the SQL is often convoluted and performance can be a concern.
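To illustrate why a graph representation suits relationship queries, here is a small sketch of a ‘friends of friends’ lookup over an adjacency structure. In a relational table of friendships this would need a self-join for each hop. The data and the friends_of_friends function are hypothetical, and this is plain Python, not Cypher or Gremlin.

```python
# Each vertex maps to the set of vertices it shares a "friend" edge with.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
}


def friends_of_friends(graph, person):
    """Return people exactly two hops away who are not already friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        # Following an edge is a direct lookup, not a table join.
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}


if __name__ == "__main__":
    print(friends_of_friends(friends, "alice"))
```

Each extra hop is another pass over the neighbour sets, whereas the SQL equivalent grows by one more self-join per hop.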
The chapter continues with a look at the property graph model, which links nodes, relationships, and attributes; these form the basis of the popular graph database Neo4j. Some sample Neo4j graph code, using the declarative Cypher language, is discussed. This is followed by a brief look at Gremlin, a more procedural language, again with sample code.
This chapter provides a useful, if brief, introduction to graph databases and their processing languages.
Chapter 6 Column Databases
If you don’t need all the columns in rows of data, storing the data in columns (physically next to each other) instead of rows can have many advantages - notably improved performance. This chapter opens with a look at data warehousing, which typically works on subsets of columns. Star and snowflake schemas, fact and dimension tables are briefly discussed.
The chapter next discusses how column storage differs from row storage, with helpful diagrams. The advantages of having related data on the same storage, together with the ability to compress related data, are discussed.
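The difference between the two layouts, and why columns compress well, can be sketched in a few lines. The sample data and the helper names (to_columns, run_length_encode) are hypothetical, and run-length encoding is used here only as one simple example of a compression scheme that benefits from similar values sitting together.

```python
# Row storage: all the values of one record are kept together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "EU", "amount": 80.0},
    {"order_id": 3, "region": "US", "amount": 200.0},
    {"order_id": 4, "region": "US", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress runs of repeated values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


if __name__ == "__main__":
    columns = to_columns(rows)
    # An aggregate over one column never touches the other columns.
    print(sum(columns["amount"]))
    print(run_length_encode(columns["region"]))
```

An aggregate such as a total of amounts reads a single contiguous list in the column layout, while the row layout has to step through every record.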
The chapter ends with a look at column database architectures. To help reduce the cost of single-row changes, a delta store is often used. It holds details of the modifications and is periodically merged into the main data store; this process is illustrated with diagrams.
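The delta store idea described above can be sketched as follows. This is a simplified model and not any vendor's implementation: the ColumnStoreWithDelta class is hypothetical, it handles inserts only, and the choice to have queries read both the main store and the delta is an assumption made so that results stay up to date before a merge.

```python
class ColumnStoreWithDelta:
    """Toy column store that buffers single-row inserts in a delta store."""

    def __init__(self, columns):
        # Main store: one list of values per column, cheap to scan.
        self.main = {name: list(values) for name, values in columns.items()}
        # Delta store: recent rows kept in row form, cheap to append to.
        self.delta = []

    def insert(self, row):
        """Record a new row without rewriting the column data."""
        self.delta.append(dict(row))

    def merge(self):
        """Fold the buffered rows into the main column store."""
        for row in self.delta:
            for name in self.main:
                self.main[name].append(row.get(name))
        self.delta.clear()

    def column_sum(self, name):
        """Aggregate over the main store and any unmerged delta rows."""
        return sum(self.main[name]) + sum(row[name] for row in self.delta)


if __name__ == "__main__":
    store = ColumnStoreWithDelta({"order_id": [1, 2], "amount": [120.0, 80.0]})
    store.insert({"order_id": 3, "amount": 200.0})
    print(store.column_sum("amount"))
    store.merge()
    print(store.main["amount"])
```

The insert path only appends to a small buffer, and the more expensive reorganisation of the column data is deferred to the periodic merge.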
This chapter provides a useful introduction to column databases, and their performance advantages.
Chapter 7 The End of Disk? SSD and In-Memory Databases
The chapter opens with a comparison of disk and memory speeds. Solid State Disks (SSDs) are examined, together with the impact of the falling price of memory. Next, various in-memory databases are discussed, namely: TimesTen, Redis, SAP HANA, VoltDB, and Oracle 12c.
The chapter continues with a look at Spark. Spark is increasingly replacing Hadoop’s MapReduce batch processing, owing to its better performance. Various Spark components are outlined, including GraphX, SparkSQL, and MLlib. Next, Resilient Distributed Datasets (RDDs) are examined; these allow parallel in-memory processing across nodes.
This chapter provides an interesting overview of in-memory databases, both those based on the relational database model, and Spark’s in-memory RDDs.