Authors: Nick Dimiduk & Amandeep Khurana
Publisher: Manning, 2012
Aimed at: Database developers who want to learn how to use HBase
Reviewed by: Kay Ewbank
This book sets out to teach those with experience of other databases how to build applications using HBase. How well does it succeed?
One barrier to learning about open-source software is that the documentation is usually sketchy or non-existent. This is understandable - all of us know that writing the manual is time-consuming and fairly thankless, and when there aren’t limitless resources to throw at the problem, it’s going to be low on the priority list. All that doesn’t make it any easier to break into a new technology, though.
HBase is the NoSQL database that was developed as part of Apache’s Hadoop project. HBase in Action has been described by the authors, Nick Dimiduk and Amandeep Khurana, as the HBase User’s Guide, with the intention of teaching developers who probably have some experience with other databases how to build applications using HBase.
As such, the book opens with a couple of chapters introducing HBase and getting started using it. These chapters also introduce the example database application used throughout the book: TwitBase, a simplified clone of Twitter. The authors use the HBase Java client library for these early chapters, and there’s code on most pages. The material is covered at an ideal level, and because Dimiduk and Khurana have developers as the target audience, they focus on what you actually need to know.
By Chapter 3 the authors are on to Distributed HBase, HDFS and MapReduce. They start with a description of the problem MapReduce is used to solve - efficient batch handling of large amounts of offline data - then give an overview of MapReduce and how to use it on dataflows. After showing how HBase works in distributed mode, they go on to put HBase and MapReduce together in an HBase MapReduce app.
Part 2 of the book is titled Advanced Concepts, although it starts with a look at designing schemas in HBase. The next chapter moves on to using HBase with the observer and endpoint coprocessors. In HBase terms, this refers to pushing some computation to HBase nodes, where the computation is run in parallel across all of HBase’s RegionServers.
Coprocessors were added to HBase in the 0.92 release and the authors stress they’re untested in production deployments. HBase coprocessors, being so new, aren’t that well understood, and this chapter makes the book worth getting even if it’s the only bit you use. The final chapter in this part of the book covers alternative HBase clients - scripting from UNIX, JRuby, REST, Python, and an alternative Java HBase client called asynchbase.
Part 3 of the book shows two example HBase applications, an online time-series database and a geographical information system. The final part of the book looks at putting HBase into operation. It starts with a chapter on deploying HBase with discussions of how to plan your cluster, which distribution to use, and how to configure the system. The final chapter looks at ongoing management - monitoring your cluster, performance testing and tuning, cluster management, and backup and replication.
This is a really interesting book. It’s well written and readable, even when explaining difficult topics. The code is well explained, and used to illustrate relevant points rather than just to fill space. Some aspects seemed to be in an odd order - deployment comes last, and schemas are placed among the ‘advanced’ topics. That’s a very minor caveat, though. By the end of the book I felt I had a much clearer understanding of HBase, and would be reasonably happy to write a real system using it.