Tuesday, October 2, 2012

Pig Vs MapReduce

Tried a few examples with Hadoop MapReduce. After some initial hiccups, setting up Hadoop on my local box turned out to be pretty hassle-free. Since Hadoop is written in Java, my Java background definitely helped, and understanding the whole business of running MapReduce jobs was a breeze.
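To give an idea of the kind of example I was playing with, here is a minimal sketch along the lines of the classic word-count job from the Hadoop docs. The class names and paths are just placeholders, and the exact API can differ a bit between Hadoop versions:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum up the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The part that matters for the comparison below: every change to this class means recompiling and repackaging the jar before re-running the job.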

Later I was reading about Pig Latin. It is a high-level scripting language for analyzing large data sets. Since it is built on top of MapReduce, out of curiosity I tried the same examples with Pig that I had earlier tried with MapReduce. These are my observations:

  • Pig is good for modelling and prototyping purposes. You can do iterative development, as it is easy to change the script and run it again; there is no need to package/compile for every change.
  • Pig is definitely slower than plain MapReduce jobs.
  • There is not much documentation on how to optimize a Pig script. Users may end up writing the script in a way that generates a lot of MapReduce jobs.
  • If you are a programmer, you will probably want the extra control over optimizing your code that MapReduce gives you. I personally prefer a solution where I have a better understanding of how things are working.
  • Pig may involve packaging/compiling code if you are using custom functions. In a real problem, users may end up writing a lot of custom functions, which may end up making Pig development almost as complex as MapReduce (see the sketch after this list).
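To illustrate the last point: even a trivial custom function (UDF) has to be written in Java, compiled and packaged into a jar before a Pig script can use it. A minimal sketch, with a made-up class name and behaviour:

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Trivial Pig UDF: upper-cases a chararray field. Even something this small
// has to be compiled and packaged into a jar before a script can call it.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```

The script then needs a REGISTER statement pointing at that jar, so every change to the UDF brings back the compile/package cycle that plain Pig avoids.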



Tuesday, August 28, 2012

how rsync works

Nice, short description of how the rsync algorithm works at http://psteitz.blogspot.in/2012/01/rsync-how-it-works.html

If you are interested in the details, look at the technical paper on rsync (written by the rsync authors).
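The core trick, as I understand it from the paper, is a weak checksum that can be "rolled" forward one byte at a time instead of being recomputed for every window position. A toy Java sketch of just that rolling property (the modulus and window size here are picked arbitrarily; this is not rsync's actual code):

```java
public class RollingChecksum {
    private static final int M = 1 << 16;

    // Weak checksum of the block data[offset .. offset+len-1], computed from scratch.
    static int weakChecksum(byte[] data, int offset, int len) {
        int a = 0, b = 0;
        for (int i = 0; i < len; i++) {
            int x = data[offset + i] & 0xFF;
            a = (a + x) % M;
            b = (b + (len - i) * x) % M;
        }
        return a + M * b;
    }

    public static void main(String[] args) {
        byte[] data = "the quick brown fox jumps over the lazy dog".getBytes();
        int len = 16;

        // Checksum of the first window, computed once.
        int a = 0, b = 0;
        for (int i = 0; i < len; i++) {
            int x = data[i] & 0xFF;
            a = (a + x) % M;
            b = (b + (len - i) * x) % M;
        }

        // Slide the window one byte at a time; each update only touches the
        // byte leaving the window and the byte entering it.
        for (int k = 1; k + len <= data.length; k++) {
            int out = data[k - 1] & 0xFF;      // byte leaving the window
            int in = data[k + len - 1] & 0xFF; // byte entering the window
            a = ((a - out + in) % M + M) % M;
            b = ((b - len * out + a) % M + M) % M;
            if (a + M * b != weakChecksum(data, k, len)) {
                throw new AssertionError("rolled checksum does not match");
            }
        }
        System.out.println("rolled checksums match the from-scratch values");
    }
}
```

The receiver can use this cheap checksum to find candidate matching blocks and only then fall back to a strong hash; the paper covers that part properly.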

Sunday, August 26, 2012

Latency vs Throughput

These terms are sometimes confusing.

Latency is the time it takes to serve a single request. Throughput, on the other hand, measures the total number of requests served in a given unit of time.
For example, in the context of web servers, the time it takes to serve one HTTP request is the latency.
The number of HTTP client requests served per unit of time (second/hour/day, etc.) measures the throughput. Throughput can also be measured in terms of bytes of data served per unit of time.
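A rough, toy illustration of the difference (the 5 ms "request" and the request count here are made up):

```java
public class LatencyThroughputDemo {
    // Simulate serving one request (~5 ms of work).
    static void handleRequest() throws InterruptedException {
        Thread.sleep(5);
    }

    public static void main(String[] args) throws InterruptedException {
        int requests = 200;
        long batchStart = System.nanoTime();
        long worstLatencyMs = 0;
        for (int i = 0; i < requests; i++) {
            long t0 = System.nanoTime();
            handleRequest();
            long latencyMs = (System.nanoTime() - t0) / 1000000; // latency of this one request
            worstLatencyMs = Math.max(worstLatencyMs, latencyMs);
        }
        double elapsedSec = (System.nanoTime() - batchStart) / 1e9;
        double throughput = requests / elapsedSec; // requests served per second
        System.out.printf("worst latency: %d ms, throughput: %.1f req/s%n",
                worstLatencyMs, throughput);
    }
}
```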

The choice between latency and throughput depends on the application's requirements. Generally, applications strive for high throughput without adding latency. For example, an e-commerce application should be able to serve a large number of customers (throughput) with minimal latency. On the other hand, an application doing batch processing of data, such as log-file analysis, will be more interested in higher throughput of data access.

Multiple factors affect latency. It depends on how well the application is written as well as on external factors like shared data access. Latency will be lowest when there is no contention for shared resources while processing the request; the ideal case is a single thread processing the request. In practice this is not feasible, as applications need to serve multiple requests concurrently and therefore need higher throughput.
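To make the contention point concrete, here is a toy sketch (the thread counts and the fake workload are arbitrary): the same work takes visibly longer per request when every request has to go through one shared lock.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ContentionDemo {
    private static final Object SHARED = new Object();

    // Simulate the CPU work of one request.
    static void work() {
        long x = 0;
        for (int i = 0; i < 2000000; i++) x += i;
        if (x == 42) System.out.println(); // keep the JIT from removing the loop
    }

    static double averageLatencyMs(boolean contended) throws InterruptedException {
        int threads = 8, requestsPerThread = 50;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong totalNanos = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < requestsPerThread; i++) {
                    long t0 = System.nanoTime();
                    if (contended) {
                        synchronized (SHARED) { work(); } // all requests serialize on one lock
                    } else {
                        work();                           // no shared resource
                    }
                    totalNanos.addAndGet(System.nanoTime() - t0);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        return totalNanos.get() / 1e6 / (threads * requestsPerThread);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.printf("uncontended: %.2f ms/request%n", averageLatencyMs(false));
        System.out.printf("contended:   %.2f ms/request%n", averageLatencyMs(true));
    }
}
```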

Throughput can be improved by increasing the number of threads running the application, without causing much extra latency. Once the CPU is mostly utilized, any further increase in the number of threads will degrade throughput. Applications need to be tuned for optimum results, as in the rough experiment sketched below.
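A hypothetical experiment (made-up workload and thread counts) that shows the saturation point: the same fixed batch of CPU-bound "requests" is pushed through pools of increasing size, and throughput climbs until the cores are busy, then flattens or drops.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadSweep {
    // Simulate a CPU-bound request.
    static void handleRequest() {
        long x = 0;
        for (int i = 0; i < 2000000; i++) {
            x += i * 31L;
        }
        if (x == 42) System.out.println(); // keep the JIT from removing the loop
    }

    public static void main(String[] args) throws InterruptedException {
        int requests = 400;
        for (int threads = 1; threads <= 64; threads *= 2) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            long start = System.nanoTime();
            for (int i = 0; i < requests; i++) {
                pool.submit(ThreadSweep::handleRequest);
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.MINUTES);
            double sec = (System.nanoTime() - start) / 1e9;
            System.out.printf("%2d threads -> %.0f requests/sec%n", threads, requests / sec);
        }
    }
}
```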

If the application is scalable, adding more machines to the system should ideally increase throughput proportionally without changing latency. Scalability refers to the capability of a system to increase throughput under increased load when resources are added. Scalability constraints can be the data access layer or some shared resource that the application needs in order to serve a request.