Press "Enter" to skip to content

HBase and S3

Krishna Maheshwari, et al, explain how we can allow Apache HBase to use S3 for storage:

Cloudera Data Platform (CDP) provides an out-of-the-box solution that allows Apache HBase deployments to use Amazon Simple Storage Service (S3) as its main persistence layer for saving table data. Amazon S3 is an object store which offers a high degree of durability with a pay-per-use cost structure. There is no server-side component to run or manage for S3 — all that is needed is the S3 client library and AWS credentials. However, HBase requires a consistent and atomic filesystem which means that it cannot directly use S3 because it is an eventually consistent object store. Both CDH and HDP have only provided HBase solely using HDFS because there have been long-standing impediments that prevented HBase from natively using S3. To address these issues, we’ve built an out-of-the-box solution which we are delivering for the first time via CDP. When you launch an Operational Database (HBase) cluster on CDP, HBase StoreFiles (the backing files for HBase tables) are stored in S3 and HBase write-ahead-logs (WAL) are stored in an HDFS instance run alongside HBase per usual.

I hadn’t thought of using S3, but it’s an interesting post.