Thursday, November 29, 2012

How HBase Works

HBase is a distributed column family database, designed to support random read/write access to large data sets. This Wikipedia article gives a nice description of column-oriented DBMSs: http://en.wikipedia.org/wiki/Column-oriented_DBMS. Note that while that article describes column-oriented DBMSs, HBase is in fact a column family oriented DBMS. For more on the difference between the two, refer to this excellent blog post:
http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html

HBase is built over HDFS as its underlying data store. Though it is possible to run HBase over other distributed file systems such as Amazon S3 or GFS, by default it runs over HDFS. HDFS provides highly distributed, fail-safe storage for HBase tables.

This is a simplified view of HBase from a user's point of view.

User --> HBase Client ---> HBase Server --> HDFS

The user interacts with the HBase client, which connects to an HBase server to read and write data. The HBase server in turn reads/writes the table data files from/to HDFS.


HBase Architecture

HBase follows the master-slave pattern. It has an HBase master and multiple slaves called RegionServers. HBase tables (except the special ROOT table) are dynamically partitioned into row ranges called regions. A region is the unit of distribution and load balancing in HBase. Each table is divided into multiple regions, which are then distributed across the RegionServers; each RegionServer hosts multiple regions. The master assigns regions to RegionServers and recovers regions when a RegionServer crashes. RegionServers serve client read/write requests directly, so the master stays lightly loaded.
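The split-and-assign idea above can be sketched in a few lines of Java. This is a hypothetical illustration, not HBase's actual balancer code: a table's row-key space is split into regions (contiguous row ranges identified by their start key), and the master hands each region to a RegionServer, round-robin here for simplicity. All names and keys are made up.

```java
import java.util.*;

// Hypothetical sketch: regions (identified by start row key, each covering
// [startKey, next region's startKey)) assigned round-robin to RegionServers.
public class RegionAssignment {
    static Map<String, String> assign(List<String> regionStartKeys,
                                      List<String> regionServers) {
        Map<String, String> assignment = new LinkedHashMap<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            // Master picks a server for each region; real HBase also
            // rebalances and reassigns regions after a server crash.
            assignment.put(regionStartKeys.get(i),
                           regionServers.get(i % regionServers.size()));
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, String> a = assign(
                Arrays.asList("", "row-d", "row-m", "row-t"),
                Arrays.asList("rs1", "rs2", "rs3"));
        System.out.println(a); // {=rs1, row-d=rs2, row-m=rs3, row-t=rs1}
    }
}
```

Note that the first region starts at the empty row key, so every possible row falls into exactly one region.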

HFile: HBase uses HFile as the format for storing tables on HDFS. HFile is a block-indexed file format for storing key-value pairs: entries are sorted lexicographically by row key, the data is laid out as a sequence of blocks, and a separate index, kept at the end of the file, records where each block starts. When a read request arrives, the index is searched to locate the right block, and the data is then read from that block.
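A block-indexed lookup can be sketched as follows. This is a toy in-memory model (not the real HFile format): the index maps each block's first key to the block, so a read is a floor search on the index followed by a scan of that single block.

```java
import java.util.*;

// Hypothetical sketch of a block-indexed lookup, loosely modelled on HFile:
// sorted key-value data lives in blocks; an index maps each block's first
// key to the block, so a read touches the index plus exactly one block.
public class BlockIndexedLookup {
    // blockIndex: first key of each block -> that block's sorted {key, value} pairs
    static String get(TreeMap<String, List<String[]>> blockIndex, String key) {
        Map.Entry<String, List<String[]>> e = blockIndex.floorEntry(key);
        if (e == null) return null;               // key sorts before the first block
        for (String[] kv : e.getValue())          // scan only inside that block
            if (kv[0].equals(key)) return kv[1];
        return null;                              // key absent from the block
    }

    public static void main(String[] args) {
        TreeMap<String, List<String[]>> index = new TreeMap<>();
        index.put("apple", Arrays.asList(new String[]{"apple", "1"},
                                         new String[]{"cat", "2"}));
        index.put("dog",   Arrays.asList(new String[]{"dog", "3"},
                                         new String[]{"fish", "4"}));
        System.out.println(get(index, "cat")); // prints 2
    }
}
```

Because keys are sorted, the floor search is a binary search over block start keys, which is why only one block needs to be read per lookup.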

HLog: A RegionServer keeps an in-memory copy of recent table updates in its memstore (called "memcache" in earlier HBase versions), and this in-memory copy is flushed to disk periodically. Every update to an HBase table is also recorded in HLog files, which store redo records. During region recovery, these logs are replayed over the last committed HFile to reconstruct the in-memory image of the table; the reconstructed copy is then flushed to disk so that the on-disk copy is up to date.
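The recovery step can be sketched like this. It is a simplified model, not HBase's actual recovery code: the last flushed state stands in for the committed HFile, and the redo log is a list of {row, value} records replayed in order.

```java
import java.util.*;

// Hypothetical sketch of write-ahead-log recovery: updates are appended to
// a redo log before the in-memory copy is updated, so after a crash,
// replaying the log over the last flushed (on-disk) state rebuilds the
// in-memory image, which can then be flushed back to disk.
public class WalRecovery {
    static Map<String, String> recover(Map<String, String> lastFlushed,
                                       List<String[]> redoLog) {
        Map<String, String> inMemory = new HashMap<>(lastFlushed);
        for (String[] record : redoLog)      // re-apply each logged update in order
            inMemory.put(record[0], record[1]);
        return inMemory;                     // ready to be flushed to disk
    }

    public static void main(String[] args) {
        Map<String, String> flushed = new HashMap<>();
        flushed.put("row1", "old");
        List<String[]> log = Arrays.asList(
                new String[]{"row1", "new"},     // update lost from memory...
                new String[]{"row2", "added"});  // ...but preserved in the log
        System.out.println(recover(flushed, log).get("row1")); // prints new
    }
}
```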

HBase writes both the tables (in HFile format) and the log files to HDFS, so they remain highly available even if a RegionServer crashes.

HBase in Action:

HBase has a ROOT table which stores the locations of the .META table's regions. The ROOT table is never partitioned into regions, which bounds the number of table lookups needed to resolve a client request for a particular key. The .META table stores the locations of user table regions; a region name is made up of the table name and the region's start row. The .META table itself is treated like any other HBase table and is divided into regions. When a user request comes in, the .META table is searched for the region to which the requested key belongs. Because HBase tables are lexicographically ordered by row key, locating the region for a given key is just a matter of doing a binary search.

This is the flow for looking up a key in an HBase table.

Client --> ROOT --> .META Table Region --> Requested Table Region --> Requested Row
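The final step of that flow, mapping a row key to its region, can be sketched with a sorted map. This is an illustrative model of the .META lookup, not HBase client code: since regions are sorted by start row, the region for a key is the one with the largest start row less than or equal to the key, i.e. a floor/binary search.

```java
import java.util.*;

// Hypothetical sketch of the .META region lookup: region start row -> region
// name ("table,startRow"). TreeMap.floorEntry does the binary search.
public class MetaLookup {
    static String findRegion(TreeMap<String, String> metaTable, String rowKey) {
        // Largest region start row <= rowKey; the first region starts at the
        // empty row key, so every key matches some region.
        return metaTable.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, String> meta = new TreeMap<>();
        meta.put("",      "users,");      // first region: empty start row
        meta.put("row-m", "users,row-m");
        meta.put("row-t", "users,row-t");
        System.out.println(findRegion(meta, "row-p")); // prints users,row-m
    }
}
```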


References:

MemCache: an in-memory key-value store. As it is not write-through, the in-memory data is lost when the cache evicts entries because it is full, or when the memcache server crashes. http://en.wikipedia.org/wiki/Memcached

ZooKeeper: a distributed coordination service for distributed applications. HBase uses it to store metadata related to the ROOT table. This is a pretty good description of ZooKeeper: http://www.igvita.com/2010/04/30/distributed-coordination-with-zookeeper/

Tuesday, October 2, 2012

Pig Vs MapReduce

I tried a few examples on Hadoop MapReduce. After some initial hiccups, setting up Hadoop on my local box turned out to be pretty hassle free. As Hadoop is written in Java, that definitely helped me; understanding the whole business of running MapReduce jobs was a breeze.

Later I was reading about Pig Latin. It is a high-level scripting language for analyzing large data sets. Since it is built on top of MapReduce, out of curiosity I tried the same examples with Pig that I had tried earlier with MapReduce. These are my observations:

  • Pig is good for modelling and prototyping. You can develop iteratively, as it is easy to change the script and run it again; there is no need to package/compile for every change.
  • Pig is definitely slow compared to plain MapReduce jobs.
  • There is not much documentation on how to optimize a Pig script. A user may end up writing the script in such a way that it generates lots of MapReduce jobs.
  • If you are a programmer, you probably want the extra power to optimize your code that comes with MapReduce. I personally prefer the solution where I have more understanding of how things work.
  • Pig may involve packaging/compiling code if you use custom functions. In a real problem, a user may end up writing a lot of custom functions, which may make Pig development almost as complex as MapReduce.



Tuesday, August 28, 2012

how rsync works

There is a nice, short description of how the rsync algorithm works at http://psteitz.blogspot.in/2012/01/rsync-how-it-works.html

If you are interested in the details, look at the technical paper on rsync (written by rsync's authors).
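The core trick those write-ups describe is rsync's weak rolling checksum: it can be updated in constant time as a fixed-size window slides one byte forward, which lets rsync cheaply test every offset of a file for matching blocks. Below is a sketch of that idea (the Adler-32-style sum from the rsync paper); the class and method names are my own.

```java
// Hypothetical sketch of rsync's weak rolling checksum: s = a + b*2^16,
// where a is the sum of the window's bytes and b is a position-weighted sum.
public class RollingChecksum {
    static final int M = 1 << 16;

    // Checksum of data[offset .. offset+len-1], computed from scratch.
    static int checksum(byte[] data, int offset, int len) {
        int a = 0, b = 0;
        for (int i = 0; i < len; i++) {
            a = (a + (data[offset + i] & 0xff)) % M;
            b = (b + (len - i) * (data[offset + i] & 0xff)) % M;
        }
        return a + (b << 16);
    }

    // Slide the window one byte: drop 'out' from the front, append 'in'.
    static int roll(int sum, int len, byte out, byte in) {
        int a = sum & 0xffff, b = (sum >>> 16) & 0xffff;
        a = (a - (out & 0xff) + (in & 0xff) + M) % M;
        b = (b - len * (out & 0xff) % M + a + 2 * M) % M;
        return a + (b << 16);
    }

    public static void main(String[] args) {
        byte[] data = "the quick brown fox".getBytes();
        int len = 8;
        int sum = checksum(data, 0, len);
        boolean ok = true;
        for (int k = 0; k + len < data.length; k++) {
            sum = roll(sum, len, data[k], data[k + len]);
            ok &= (sum == checksum(data, k + 1, len)); // rolled == recomputed
        }
        System.out.println(ok ? "rolled checksums match recomputed ones" : "mismatch");
    }
}
```

rsync pairs this cheap weak checksum with a strong hash: only windows whose weak checksum matches a block are checked with the expensive hash.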

Sunday, August 26, 2012

Latency vs Throughput

These terms are sometimes confused.

Latency is the time it takes to serve a single request. Throughput, on the other hand, measures the total number of requests served in a given unit of time.
For example, in the context of web servers, the time it takes to serve one HTTP request is the latency.
The number of HTTP client requests served per unit of time (second/hour/day etc.) measures the throughput. Throughput can also be measured in bytes of data served per unit of time.
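A small worked example (with made-up numbers) makes the distinction concrete: latency is per-request time, while throughput is requests completed per unit time, so adding parallel workers can raise throughput without changing the latency of any single request.

```java
// Toy model of the latency/throughput distinction. Assumes workers are
// independent and fully busy with no contention -- a simplification.
public class LatencyVsThroughput {
    static double throughputPerSecond(double latencyMillis, int concurrentWorkers) {
        // Each worker completes 1000/latency requests per second.
        return concurrentWorkers * (1000.0 / latencyMillis);
    }

    public static void main(String[] args) {
        // A web server that takes 20 ms to serve one HTTP request:
        System.out.println(throughputPerSecond(20, 1)); // 50.0 requests/sec
        // Eight workers raise throughput; each request still takes 20 ms.
        System.out.println(throughputPerSecond(20, 8)); // 400.0 requests/sec
    }
}
```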

The choice between latency and throughput is driven by application requirements. Generally, applications strive for high throughput without incurring latency delays. For example, an e-commerce application should be able to serve a large number of customers (throughput) with minimum latency. On the other hand, an application doing batch processing of data, such as log-file analysis, is more interested in high throughput of data access.

Multiple factors affect latency. It may depend on how well the application is written, as well as on external factors like shared data access. Latency is at its minimum when there is no contention for shared resources while processing a request; the ideal case is a single thread processing one request at a time. This is not practical, as applications need to serve multiple requests concurrently, which requires higher throughput.

Throughput can be improved by increasing the number of threads running the application, without causing too much latency delay. Once the CPU is mostly utilized, however, any further increase in the number of threads degrades throughput. Applications need to tune themselves for optimum results.
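The tuning trade-off above can be sketched with a toy capacity model (my own simplification, with made-up numbers): offered throughput grows linearly with thread count until the CPU saturates, after which extra threads buy nothing.

```java
// Hypothetical model: throughput scales with threads (Little's law,
// N = X * R, so X = N / R) until it hits a CPU capacity ceiling.
public class ThreadTuning {
    static double throughput(int threads, double latencyMillis, double cpuMaxReqPerSec) {
        double offered = threads * (1000.0 / latencyMillis); // uncapped throughput
        return Math.min(offered, cpuMaxReqPerSec);           // capped by the CPU
    }

    public static void main(String[] args) {
        // 10 ms per request, CPU tops out at 1000 requests/sec:
        System.out.println(throughput(4, 10, 1000));  // 400.0  -- scales with threads
        System.out.println(throughput(16, 10, 1000)); // 1000.0 -- CPU saturated
        System.out.println(throughput(32, 10, 1000)); // 1000.0 -- no further gain
    }
}
```

In reality the curve past saturation bends downward rather than flat, since extra threads add context-switching and lock contention, which is exactly why tuning is needed.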

If the application is scalable, adding more machines to the system should ideally increase throughput proportionally without changing latency. Scalability refers to the capability of a system to increase throughput under increased load when resources are added. Scalability constraints can come from the data access layer or from shared resources that the application accesses while serving requests.