Cloudera Extends Apache HBase To Use Amazon S3
Written by Kay Ewbank   
Friday, 04 October 2019

Cloudera has updated Cloudera Data Platform so that Apache HBase deployments can use Amazon Simple Storage Service (S3) as their main persistence layer for saving table data.

The advantage is that Amazon S3 is billed on a pay-per-use basis and has no server-side component for users to run or manage. Cloudera Data Platform (CDP) is described as combining the best of Hortonworks' and Cloudera's technologies to create an enterprise data cloud that includes cloud-native services for data warehousing, machine learning, streaming ingest, and operational data stores.


Apache HBase is Hadoop's open-source, distributed, versioned, non-relational database, modeled after Google's Bigtable, which offers random, real-time read/write access to big data. Apache's goal for this project is for it to host very large tables (billions of rows by millions of columns) on top of clusters of commodity hardware.

Amazon Simple Storage Service (S3) is designed to offer secure, durable, highly scalable object storage at a low cost.


Until now, it has not been possible to use S3 directly from HBase because HBase requires a consistent and atomic file system, whereas S3 provides an eventually consistent object store. This means that HBase has been limited to using HDFS rather than being able to use S3 natively. Cloudera has now created a solution that is being offered via CDP. When you launch an Operational Database (HBase) cluster on CDP, HBase StoreFiles (the backing files for HBase tables) are stored in S3, while HBase write-ahead logs (WALs) are stored as usual in an HDFS instance run alongside HBase.
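
The split between S3 and HDFS can be pictured as two storage locations on different file systems. The following is only a sketch: `hbase.rootdir` and `hbase.wal.dir` are standard HBase property names, but the article does not say which settings CDP applies, and the bucket name, NameNode address, paths and the `StorageLayoutSketch` class are hypothetical.

```java
import java.net.URI;
import java.util.Properties;

/** Illustrative only: models table data on S3 and WALs on HDFS as two URIs. */
class StorageLayoutSketch {

    /** Builds a property set with hypothetical locations for StoreFiles and WALs. */
    static Properties layout(String bucket, String hdfsAuthority) {
        Properties props = new Properties();
        // Table data (StoreFiles) under an S3A URI; the path is an assumption.
        props.setProperty("hbase.rootdir", "s3a://" + bucket + "/hbase");
        // Write-ahead logs on HDFS; the path is an assumption.
        props.setProperty("hbase.wal.dir", "hdfs://" + hdfsAuthority + "/hbase-wal");
        return props;
    }

    /** Returns the file system scheme that a given property points at. */
    static String schemeOf(Properties props, String key) {
        return URI.create(props.getProperty(key)).getScheme();
    }
}
```

Calling `StorageLayoutSketch.layout("example-bucket", "namenode.example.internal:8020")` yields one location per file system, which is the arrangement the paragraph above describes.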

Under the covers, this relies on the Hadoop S3A filesystem adapter, which accesses data in S3 via the standard FileSystem APIs. Hadoop's S3Guard is also used to provide directory listings and object status for the S3A adapter, so that HBase sees when new StoreFiles are added to an HBase table.

The new element is HBase Object Store Semantics (HBOSS), a software project that has been added to the Apache HBase project to bridge the gap between S3Guard and HBase. HBOSS is a facade on top of the S3A adapter and S3Guard that uses a distributed lock to ensure that HBase operations can atomically manipulate their files on S3.
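
The idea of a locking facade can be sketched in a few lines. This is not the HBOSS code: `ToyObjectStore` and `LockingStoreFacade` are hypothetical names, the store is an in-memory map, and a single in-process read/write lock stands in for the distributed lock the article mentions. It only illustrates how wrapping every operation in a lock makes a multi-step rename look atomic to other callers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy stand-in for an object store: flat keys, no directories, no atomic rename. */
class ToyObjectStore {
    private final ConcurrentSkipListMap<String, byte[]> objects = new ConcurrentSkipListMap<>();

    void put(String key, byte[] data) {
        objects.put(key, data);
    }

    /** Returns a snapshot of all keys starting with the prefix, in sorted order. */
    List<String> list(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : objects.keySet()) {
            if (key.startsWith(prefix)) {
                keys.add(key);
            }
        }
        return keys;
    }

    /** Renaming a prefix is a per-object copy then delete, so it can be seen half-done. */
    void renamePrefix(String from, String to) {
        for (String key : list(from)) {
            objects.put(to + key.substring(from.length()), objects.get(key));
            objects.remove(key);
        }
    }
}

/** Hypothetical facade: every operation takes a lock, so readers never see a partial rename. */
class LockingStoreFacade {
    private final ToyObjectStore store;
    // Assumption: one in-process lock for the whole store, purely for illustration.
    // Separate HBase processes would need a lock shared across machines.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    LockingStoreFacade(ToyObjectStore store) {
        this.store = store;
    }

    void put(String key, byte[] data) {
        lock.writeLock().lock();
        try {
            store.put(key, data);
        } finally {
            lock.writeLock().unlock();
        }
    }

    List<String> list(String prefix) {
        lock.readLock().lock();
        try {
            return store.list(prefix);
        } finally {
            lock.readLock().unlock();
        }
    }

    void renamePrefix(String from, String to) {
        lock.writeLock().lock();
        try {
            store.renamePrefix(from, to);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

A caller that goes through `LockingStoreFacade` and lists a prefix sees either all of the files at their old location or all of them at the new one, never a mixture. The cost is that operations serialise on the lock, which is why a real design would want finer-grained locking than this single global lock.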


More Information

Trial Installation Of HBase Running On S3 In CDP

Cloudera Data Platform



Last Updated ( Friday, 04 October 2019 )