Author: Lars George
Aimed at: HBase administrators and developers
Pros: A really in-depth guide to HBase and how to manage it
Cons: No-holds-barred in-depth coverage; not one for the casual reader.
Reviewed by: Kay Ewbank
If you are trying to implement a scalable storage system based on Apache HBase, you will welcome a guide to help.
HBase can (somewhat simplistically) be described as the database that runs on top of Apache Hadoop, the open source framework that can be used for distributed processing of large data sets across clusters of computers. The nature of both Hadoop and HBase means things are a bit more complicated than this might sound, so a good book on the subject could well save you a lot of grief. Lars George was one of the first people to actually use HBase for a real system, and has worked on the documentation since those early days, so is well placed to write this book.
The first thing to say about the book is that it is in no way a fluffy intro; it’s designed for people who are serious about using HBase, and it doesn’t pull any punches in its descriptions. It’s heavy on technical descriptions and code, light on irrelevant asides. If you want a gentle introduction to the social implications of database operations, you need to look elsewhere. On the other hand, running or developing for a database that’s supposed to handle vast amounts of data spread across lots of computers isn’t going to be a walk in the park, so if you need gentle chivvying you’re probably in the wrong place.
The book does start with an introduction, but that covers the alternative non-relational database systems and the building blocks: tables, auto-sharding and the storage API. There’s then a chapter on installation where George takes you through the filesystems; whether to build HBase from source or use the Apache binary release; standalone or distributed mode; possible configurations; deployment; and operating a cluster. This is all really useful if you’re trying to work out just how to set up your system.
The next three chapters look at the client API, starting with the basics (CRUD operations) and then going on to batch operations, row locks and scans. The next chapter covers filters, counters and the use of coprocessors, and the third chapter on the client API is all about administration in HBase. There’s a guide to the clients you might use, covering the relative merits of the native Java client, REST, Thrift and Avro, along with batch clients such as MapReduce, Hive, Pig and Cascading (the alternative to MapReduce that in fact uses MapReduce but makes it easier). There’s a guide to using the HBase Shell, and a quick look at the HBase web-based UI that you can use to look at the cluster status, the tables, and the region servers. This is not exactly a web-based UI in the usual sense of the word; there’s a definite feel that GUIs are for wimps. However, this section is at least notable for giving you the thrill of a screenshot, the second in the book, 200 pages after the previous ‘ooh, screenshot’ moment. Don’t get too attached to the soft life, though, as this is just a temporary aberration before you’re back to the true religion of code and facts.
Next comes a good chapter on MapReduce integration that shows how it works in the abstract before going on to look at how to use it with HBase. If anything, I’d have liked a slightly longer chapter here, but you’re told the essentials. The next chapter, Architecture, starts with the sentences ‘It is quite useful for advanced users (or those who are just plain adventurous) to fully comprehend how a system of their choice works behind the scenes. This chapter explains the various moving parts of HBase and how they work together.’ Oh, good; if George is giving us a bit of flannel explaining how this is going to be good for us, it’s definitely not a chapter for skim reading. The chapter starts with a look at the difference between B+ trees and log-structured merge trees, then goes on to look at how the data in HBase is actually stored, the write-ahead log, regions, and replication.
Having come out with aching eyeballs from the Architecture chapter, you’ll find the next chapter, on Advanced Usage, is hardly a rest. There are fascinating discussions of tall-narrow versus flat-wide tables, partial key scans, secondary indexes, bloom filters and versioning, but calling it light going would be a fib. The chapter on cluster monitoring looks at the various tools you can use to keep a beady eye on the running of your HBase cluster, with detailed explanations of the metrics framework that HBase exposes, and how you can monitor the metrics using JMX, Nagios and Ganglia (ooh, another screenshot, 200 more pages on from the last).
Next comes a chapter on performance tuning, which George says means turning many knobs to make your cluster hum. He obviously enjoys all this far too much. The book finishes with a chapter on cluster administration, and if you’ve stayed with the author this far, you should know enough about HBase to be able to do your job.
This is a great book, but not one to read just for the fun of it. If you need to know about HBase, for once I’d say the title is accurate: this IS the definitive guide.