Sanjay Sharma’s Weblog

July 26, 2010

Next Hadoop India User Group Meetup – July 2010

I am pretty excited and looking forward to attending the next HUG meetup on 31st July 2010 in Noida. I really hope to see energetic Indian Hadoop-ers discuss what's happening in the Indian Hadoop community as well as the rest of the world.

I guess I may have been the culprit behind the delay; otherwise we would have had the event at least 2-3 months earlier. Will now try to hold similar events more frequently. I already have thoughts about planning one around NoSQL databases, again one of my favourites as a technology of the future. Unlike last time in Nov 2009, a group of young Impros, the Absolute Zero forum, is organizing the event and sparing me lots of pain :). Of course, none of this would have been possible without iLabs and Impetus' support, pushing us to participate in the open source community as much as possible.

The HUG event this time will have some interesting sessions. Sonal Goyal will be talking about ‘hiho’, an open source solution for bridging the gap between the RDBMS world and Hadoop. As I foresee it, all software-based businesses, including SMEs, would like to ride the bandwagon of using BI and consumer analytics to enhance business, and Hadoop is going to enable that in a cost-effective way. RDBMSes will continue to be used for real-time applications, since they are time-tested and essentially do not face serious competition (not yet!) from the new-age NoSQL databases. So the demand for tools that bring RDBMS data into Hadoop analytics systems is going to be hot! ‘hiho’ and Sqoop are the two top contenders in this category. Hopefully Sonal will be able to share the power of hiho as well as its pros and cons versus Sqoop.
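To give a flavour of what this category of tools automates, a Sqoop table import is a single command. This is only a sketch: the JDBC URL, credentials, table name and target directory below are made-up placeholders, and exact flags can vary between Sqoop releases:

```shell
# Import an RDBMS table into HDFS with Sqoop (illustrative values only;
# -P prompts for the database password instead of putting it on the command line)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/hadoop/orders
```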

The JAQL talk from Himanshu of IBM should again be interesting, showing that people are trying out approaches other than map-reduce Java/streaming coding and the traditional Pig and Hive high-level interfaces. The challenge for Himanshu will be to help us understand how JAQL is better than Hive or Pig.

Sajal will be talking about Hive + Intellicus, a window to the unstoppable future of Hadoop in DW and BI.

I have always been more biased towards Hive, as SQL and Java usually go hand in hand in almost all business applications. So it will be interesting to see how Hadoop, through Hive, is slowly becoming ready for enterprise applications and providing a visual interface for data analytics. It seems, at last, Hadoop is ready to come out of the developer-only world and enter the domain of business users.


July 2, 2010

kundera- making life easy for Apache Cassandra users

Filed under: Cassandra, HPC, Java world, NoSQL — indoos @ 4:54 am

One of my colleagues, Animesh, has been working on an annotation-based wrapper over Cassandra, and we have finally decided to open source it so that it can be nurtured as part of the bigger community.

kundera is hosted on and can be reached here –

Here is how to get started with kundera in 5 minutes –

The logic behind kundera is quite simple: provide an ORM-like wrapper over the difficult-to-use Thrift APIs. Eventually, all NoSQL databases would want similar APIs, so that NoSQL databases become easy to use.

The initial release includes a JPA-like annotation library. The roadmap is to subsequently turn it into a Cassandra-specific JPA extension. The other important feature to be added is index/search support using Lucandra/Solandra.

October 8, 2009

My memcached experiences with Hadoop

Filed under: Advanced computing, Hadoop, HPC — indoos @ 12:44 pm

Memcached, as I have heard and can acknowledge, is the de-facto leader in web-layer caching.

Here are some interesting facts from Facebook's memcached usage statistics:

  • Over 25 TB (whopping!!!) of in-memory cache
  • Average latency < 200 microseconds (wow!!)
  • Caches serialized PHP data structures
  • Lots of multi-gets

Facebook memcached customizations

  • Over UDP
    • Reduced memory overhead of TCP connection buffers
    • Application-level flow control (an optimization for multi-gets)
  • On-demand aggregation of per-thread stats
    • Reduces global lock contention
  • Multiple kernel changes to optimize for memcached usage
    • Distributing network interrupt handling over multiple cores
    • Opportunistic polling of the network interface

My Memcached usage experience with Hadoop

  • Problem definition: using memcached for key-value lookups in the Map class. Each mapper method required lookups in around 7-8 different types of key-value Maps. This meant that for each row in the input data (million+ rows), around 7 lookups were required. The Maps could not be used as an in-memory cache due to their size (overall 700-800 MB of hierarchical value-object Maps with simple keys).
  • Trial 1: using a single memcached server running at the Namenode with the entire lookup data in memory as key-value pairs. The map name plus the key was used as the lookup key, while the value was a serialized Java object. Also tried an Externalizable implementation for some performance boost. The cache worked as a pure persistence cache, filled up by a start-up job and then working in read-only mode in the subsequent Map Reduce jobs requiring the lookups. Did have problems choosing the right Java client, but finally used Danga over spymemcached, as spymemcached was not working properly as a persistent read-only cache.
    • Result: a no-no. The map processes were really slow.
  • Trial 2: using 15 memcached servers, 3 running at the Namenode and the rest running on the individual data-node machines. The entire lookup data could be seen segregated as key-value pairs across the memcached nodes using the memcached command-line console. Did a lot of memcached optimizations as well.
    • Result: still a no-no. The throughput was around 10,000 gets per second per memcached server, which amounts to around 150,000 (yes!!) lookups per second. BUT still too slow to match our requirements!
  • Final solution: used Tokyo Cabinet (a Berkeley DB-like file-based storage system), which is as good as it gets! (performance almost the same as in-memory lookups)
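The arithmetic behind the "still too slow" verdict is easy to sketch. Using only the figures quoted above, this gives the best-case time for one pass of lookups, ignoring per-request latency, network hops and serialization costs:

```shell
# Back-of-envelope check of the Trial 2 numbers quoted above
rows=1000000            # million+ input rows
lookups_per_row=7       # 7-8 lookup Maps consulted per mapper call
gets_per_server=10000   # observed gets/sec per memcached server
servers=15
total_lookups=$((rows * lookups_per_row))
cluster_rate=$((gets_per_server * servers))
echo "total lookups : $total_lookups"
echo "aggregate rate: $cluster_rate gets/sec"
echo "best-case time: $((total_lookups / cluster_rate)) seconds"
```

Even in this best case, the cache round-trips alone add the better part of a minute per pass over the data, before counting latency, which is why a local file-based store like Tokyo Cabinet won out.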

May 29, 2009

Setting up a Hadoop 0.20.0 cluster with a Windows slave

Filed under: Advanced computing, Java world, Tech — indoos @ 8:46 am

Here are the steps for setting up a Hadoop cluster and plugging in a Windows machine as a slave-

a. First set up a pseudo-distributed Hadoop on a Linux machine as explained in

I was able to use this excellent tutorial with minor changes to get a pseudo-distributed Hadoop cluster running on CentOS/Ubuntu and a Windows machine.

I used a common user, hadoop, created on all machines.

b. The next step was to get all the pseudo-machines to work together as a real cluster. Again, was an easy reckoner to get it working.

Some easy tips to get Hadoop working in cluster mode are:

  1. Use machine names everywhere instead of IP addresses, and change /etc/hosts on all machines
  2. Configure the setup on the master machine, i.e. the conf xml files (including the masters and slaves files) as well as /etc/hosts, and then copy all these conf files and the /etc/hosts entries to the slave nodes
  3. The same copying trick helps for the authorized_keys file: enter all public keys from the slaves into the master machine's authorized_keys, then copy that authorized_keys file to all slaves
  4. Set JAVA_HOME in each installation's conf/hadoop-env.sh. I had some issues with setting it in .profile and still got JAVA_HOME problems
  5. Another easy option is to create a gzip of your master Hadoop install and copy it for setup on the slave nodes
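The authorized_keys shuffle in tip 3 can be sketched as plain shell. This is a local simulation: the demo/ directories stand in for the master and two slaves, and the key strings are placeholders; on a real cluster each copy would be an scp between machines as the hadoop user:

```shell
# Simulate gathering slave public keys into the master's authorized_keys,
# then pushing the combined file back to every slave (demo dirs stand in
# for machines; placeholder key material).
mkdir -p demo/master demo/slave1 demo/slave2
echo "ssh-rsa AAAA...key1 hadoop@slave1" > demo/slave1/id_rsa.pub
echo "ssh-rsa AAAA...key2 hadoop@slave2" > demo/slave2/id_rsa.pub

# Master collects every slave's public key into one authorized_keys
cat demo/slave1/id_rsa.pub demo/slave2/id_rsa.pub > demo/master/authorized_keys

# The combined file is copied back so each machine trusts the same key set
cp demo/master/authorized_keys demo/slave1/authorized_keys
cp demo/master/authorized_keys demo/slave2/authorized_keys
```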

c. So now for the Windows bit-

  1. Install Cygwin if you have not already done so
  2. Check that you have the sshd server installed in your Cygwin setup- if not, install it
  3. Double-check that you have a CYGWIN sshd service running under Windows services
  4. Create a hadoop user by-
cygwin> net user hadoop password /add /yes
cygwin> mkpasswd -l -u hadoop >> /etc/passwd
cygwin> chown -R hadoop /home/hadoop

d. Treat windows machine as *nix-

  1. Now use PuTTY to log in to your local Windows machine as the newly created hadoop user
  2. Set up Hadoop as you would for any Linux machine- the easy option is to copy over the master Hadoop installation
  3. Do not forget to set up the .ssh files, copying the pub key into the master's authorized_keys file and copying that authorized_keys file back to this Windows machine. Also add JAVA_HOME in conf/hadoop-env.sh, which should be a /cygdrive/<path to java6> entry
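For reference, the JAVA_HOME line in the slave's conf/hadoop-env.sh ends up looking something like this (the JDK path below is hypothetical; substitute your own install location):

```shell
# conf/hadoop-env.sh on the cygwin slave (hypothetical JDK path)
export JAVA_HOME=/cygdrive/c/Java/jdk1.6.0_21
```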

e. Assuming that the master server is already running, start this slave using “bin/hadoop-daemon.sh start datanode” or “bin/hadoop-daemon.sh start tasktracker” to run the datanode or tasktracker instance.

Next, I will write about how I managed to get the Hive-0.30 release working with Hadoop 0.20.0 on my small Hadoop cluster of 3 Linux machines and 1 Windows machine.

December 11, 2008

Love at first bite- GROOVY

Filed under: Java world — indoos @ 5:44 am

While looking at Ruby on Rails some time back, I was enticed by its neat, clean way of creating fast data-driven web sites. Being a hard-core JAVA-ite, I know the LABOR PAINS of achieving similar things in the Java world of JSF, Struts, etc.

The first SIGHT of GROOVY aka GRAILS- I was enticed

The first BITE of GROOVY aka GRAILS- I was in Love!!!!

So I have a Rails clone powered by Java- a deadly combo!!!!

The first few weeks were truly amazing as I tried my hands on a new project. Fast UI development, magical Ajax support, and the convention-over-configuration MIRCHI (spice) were what I had been wanting for so long.

Some weeks later, as Grails and I settle down together, I am becoming aware of our weaknesses (both mine and Grails/Groovy's). It is not that bad yet, and with Big B Java as the heavenly godfather covering up the setbacks, it has been good so far.

I am not too concerned about Grails/Groovy being slow (not sure, though, whether that is true). Why? Because Groovy's heart is actually Made in JAVA, and I will know what to pull where to get it beating faster.

Will keep you posted on whether this LOVE lasts forever.
