Thursday, November 29, 2012

How HBase Works

HBase is a distributed column family database. It is designed to support random read write access to large data. This wikipedia link http://en.wikipedia.org/wiki/Column-oriented_DBMS gives a nice description of column oriented DBMS. Though this link describes column oriented DBMS, HBase is infact column family oriented DBMS. For more reading on the difference between column and column family DBMS, refer to this excellent blog post.
http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html.

HBase is built over HDFS as the underlying data store. Though it is possible to run HBase over other distributed file systems like Amazon s3, GFS etc, by default it runs over HDFS. HDFS proved a highly distributed and fail safe storage to HBase tables.

This is a simplified view of HBase from a user point of view.

User --> HBase Client ---> HBase Server --> HDFS

User interacts with HBase client which connects to HBase server and reads/writes the data. HBase server in turn reads/writes the table data files from/to HDFS.


HBase Architecture

HBase follows the master slave pattern. It has a HBase master and multiple slaves called Regionservers. HBase tables (except ROOT table which is a special table) are dynamically partitioned into row ranges called regions. Region is a unit of distribution and load balancing in HBase. Each table is divided into multiple regions which are then distributed across the Regionservers. Each Regionserver hosts multiple regions. Master assigns the regions to Regionservers and recovers the regions in case of Regionserver crash. Regionserver serves the client read/write requests directly so that the master is lightly loaded. 

HFile: HBase uses HFile as the format to store the tables on HDFS. HFile stores the keys in a lexicographic order using row keys. It's a block indexed file format for storing key-value pairs. Block indexed means that the data is stored in a sequence of blocks and a separate index is maintained at the end of the file to locate the blocks. When a read request comes, the index is searched for the block location. Then the data is read from that block. 

HLog: Regionserver maintains the inmemory copy of the table updates in memcache. In-memory copy is flushed to the disc periodically. Updates to HBase table is stored in HLog files which stores redo records. In case of region recovery, these logs are applied to the last commited HFile and reconstruct the in-memory image of the table. After reconstructing the in-memory copy is flushed to the disc so that the disc copy is latest.

HBase writes the tables (in HFile format) and log files in HDFS so they are highly available even in case of region server crash.

HBase in Action:

HBase has a ROOT table which stores the information of .META table regions. ROOT table is not partitioned into regions so as to minimize the tables lookups for the client request for a particular key. META table stores the user table regions. Region name is made of table name and region start row. .META table itself is treated as any other HBase table and is divided into regions. As user request comes, .META table is searched for the region to which this query belongs. Searching for a region is easy as the HBase table is lexicographically ordered and hence locating the regions to which this client request belongs is just a matter of doing binary search.

This is flow as it happens for looking up the key in a HBase table.

Client --> ROOT --> .META Table Region --> Requested Table Region --> Requested Row


References:

MemCache: It is an in-memory key value pair store. As it is not write through, once the cache is full or mem cache server crashes, the immemory data is lost. http://en.wikipedia.org/wiki/Memcached

ZooKeeper: It is a distributed, coordination service for distributed applications. HBase uses it for storing the metadata relates to ROOT file. This is a pretty good description of ZooKeeper http://www.igvita.com/2010/04/30/distributed-coordination-with-zookeeper/