Sanjay Sharma’s Weblog

February 8, 2010

BI with MapReduce

Filed under: Advanced computing, Hadoop — indoos @ 2:12 pm

Have any of you used map reduce in the context of business intelligence?

While collating my thoughts on this LinkedIn Hadoop discussion, I found that I needed more visuals to explain it to myself first :)

So, here are the ways in which Hadoop MapReduce offers an alternative in the big, big BI world-

Scenario 1: Use Hadoop and Hive as an interface to BI tools. Pentaho reporting is already supported as of Hive 0.4.0.

Scenario 2: Use Hadoop for initial data polishing, and then dump the result to a SQL-supporting column-based database for near-real-time BI reporting. Aster Data/Vertica/Greenplum sell themselves by heavily advertising MapReduce connectors (or similar). The cost of a SQL-supporting column-based database is the only pain point here (plus the risk of how these actually scale versus what they promise).

Scenario 3: Use Hadoop for initial data polishing, and then dump the result to a SQL-supporting column-based database for near-real-time BI reporting. For real-time reporting, the data can be polished further and moved from the column-based database to a fast regular RDBMS with BI support.

 

Scenario 4: The free way :)- Use Hadoop for initial data polishing, and then dump the result to a regular SQL database with BI support. The export from HDFS can be done the Un-sqoop way. The onus is then on the developer to dump only the (smaller) ready-for-report data, with most of the BI already completed in additional MR steps.
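
To make the "dump only ready-for-report data" idea concrete, here is a minimal Python sketch (plain Python, not Hadoop code). A map step and a reduce step pre-aggregate the raw rows, and only the small aggregated result is loaded into a SQL database. The sample rows, the sales_by_region table and the function names are all made up for illustration, and an in-memory SQLite database stands in for the regular SQL database.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw rows as they might sit in HDFS: (region, product, amount).
raw_rows = [
    ("north", "widget", 10.0),
    ("north", "widget", 5.0),
    ("south", "gadget", 7.5),
    ("north", "gadget", 2.5),
]


def map_row(row):
    """Map step: emit a (key, value) pair for each raw row."""
    region, _product, amount = row
    return region, amount


def reduce_by_key(pairs):
    """Reduce step: sum all values that share the same key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Most of the "BI" happens here, before anything reaches the database.
report_rows = reduce_by_key(map_row(r) for r in raw_rows)

# Only the aggregated, ready-for-report rows are dumped to SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO sales_by_region VALUES (?, ?)", sorted(report_rows.items())
)
loaded = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(loaded)
```

Four raw rows shrink to two report rows here; on real data the reduction is what keeps the regular SQL database small enough to stay fast.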

The important thing to note is that there may be additional costs in moving the major chunk of the BI data analysis to programmatic interfaces (SQL or MR).

I am not much of a database-lover, so I do like the way Hive could emerge as a potential BI reporting tool.

October 9, 2009

Hadoop optimization and tuning

Filed under: Advanced computing, HPC, Hadoop — indoos @ 7:05 am

I have recently been part of some Hadoop-related projects and co-authored a white paper on Hadoop optimization and tuning with a colleague at my company, Impetus.

The white paper can now be downloaded from the Impetus website, http://www.impetus.com. Look for White papers (or use this link- http://www.impetus.com/impetusweb/whitepapers_main.jsp?download=HadoopPerformanceTuning.pdf).

There are very few similar resources out there, and it should be helpful for those trying to take Hadoop into production environments.

October 8, 2009

My memcached experiences with Hadoop

Filed under: Advanced computing, HPC, Hadoop — indoos @ 12:44 pm

Memcached, as I have heard and acknowledge, is the de facto leader in web-layer caching.

Here are some interesting facts from Facebook memcached usage statistics (http://www.infoq.com/presentations/Facebook-Software-Stack)

  • Over 25 TB (whopping!!!) of in-memory cache
  • Average latency <200 microseconds (wow!!)
  • Caches serialized PHP data structures
  • Lots of multi-gets

Facebook memcached customizations

  • Over UDP
    • Reduced memory overhead of TCP connection buffers
    • Application-level flow control (optimization for multi-gets)
  • On demand aggregation of per-thread stats
    • Reduces global lock contention
  • Multiple kernel changes to optimize for Memcached usage
    • Distributing network interrupt handling over multiple cores
    • Opportunistic polling of network interface

My Memcached usage experience with Hadoop

  • Problem definition- using memcached for key-value lookups in the Map class. Each map call had to look up around 7-8 different types of key-value Maps, which meant 7-8 lookups for every row of input data (million+ rows). The Maps could not simply be held as an in-memory cache because of their size (700-800 MB overall of hierarchical value-object Maps with simple keys).
  • Trial 1- using a single Memcached server running on the Namenode, with the entire lookup data in memory as key-value pairs. The map name plus the key was used as the lookup key, while the value was a serialized Java object. I tried an Externalizable implementation as well for some performance boost. The cache worked as a pure persistent cache, filled by a start-up job and then used in read-only mode by the subsequent Map Reduce jobs requiring the lookups. I did have trouble choosing the right Java client, but finally used Danga over spymemcached, as spymemcached was not working properly as a persistent read-only cache.
    • Result- no-no. The Map processes were really slow.
  • Trial 2- using 15 Memcached servers, 3 running on the Namenode and the rest on individual data node machines. The lookup data could be seen distributed as key-value pairs across the memcached nodes using the memcached command line console. I did a lot of memcached optimizations as well.
    • Result- still no-no. The throughput was around 10,000 gets per second per memcached server, which amounts to around 150,000 (yes!!) lookups per second overall. BUT that was still too slow for our requirements!
  • Final solution- used Tokyo Cabinet (a Berkeley DB-like file-based storage system), which is as good as it gets! (Performance was almost the same as in-memory lookups.)
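
A quick back-of-the-envelope check of the Trial 2 numbers, as a small Python sketch. The throughput and server count are the figures quoted above. The row count of exactly one million and the 7 lookups per row are assumptions taken from the low end of "million+ rows" and "7-8 lookups", and the calculation assumes the load spreads perfectly across all servers, so treat the result as a best case and not as a measurement.

```python
# Figures quoted in the post.
servers = 15
gets_per_sec_per_server = 10_000

# Assumptions: low end of "million+ rows" and of "7-8 lookups per row".
rows = 1_000_000
lookups_per_row = 7

# Best case: every server is kept busy all the time.
aggregate_gets_per_sec = servers * gets_per_sec_per_server  # 150,000
total_lookups = rows * lookups_per_row                      # 7,000,000
seconds_per_million_rows = total_lookups / aggregate_gets_per_sec

print(f"aggregate throughput: {aggregate_gets_per_sec} gets/sec")
print(f"lookup time per million rows: {seconds_per_million_rows:.1f} s")
```

So even in the ideal case, the lookups alone cost the better part of a minute for every million input rows, before the mappers do any real work.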

August 27, 2009

Hadoop- some revelations

Filed under: Advanced computing, Java world, Tech — indoos @ 5:44 am

My recent experience with using Hadoop in production grade applications was both good and bad.

Here are some of the bad ones to start with-

  • Using commodity servers - not entirely true, as is even expressed somewhere on the Hadoop web site. Anything below 8 GB RAM may not be enough for a heavy production application, particularly if each Map/Reduce task uses 1-2 GB of RAM
    • The task tracker and data node JVM instances take at least around 1 GB RAM each, effectively leaving 5-6 GB RAM for the Map Reduce JVMs
    • 512 MB for each Map and Reduce JVM leaves room for 5-8 Map + 3-6 Reduce instances
  • Real-time applications usually use lookup data or metadata. Although Hadoop does offer the Distributed Cache and Configuration-based (pseudo) replication of small shared data, the very nature of heavy Java in-memory object handling (serialization/deserialization) and HDFS access does not allow performant lookup handling
  • I would love to see more, easier, default control over the various settings/parameters in the config files, as the current mechanism is really a pain in the back
  • Hadoop uses a lot of temp space. It is easy NOT to notice that you may only be able to use 1/4 of your total available hard disk space for business data. This is because 2 parts go to the extra replicas (3 is the default, and a good, replication factor) and 1 part to temporary (working/intermediate) processing. So to process, say, 1 TB of data, you may require around 4 TB+ of hard disk. I learned this the hard way, after wasting good precious time!!
  • Last but not the least- it is really easy to write Map Reduce code using Hadoop's genius framework, but really difficult to convert business logic to the Map Reduce paradigm
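
The RAM and disk arithmetic from the list above can be written down as a small Python sketch. The input numbers (8 GB RAM, roughly 1 GB each for the task tracker and data node, 512 MB per task JVM, replication factor 3, one extra part for temp space) come from the points above. The function names and the temp_ratio parameter are made up for illustration, and this is rough sizing only, not an official Hadoop formula.

```python
def task_slots(total_ram_gb, daemon_ram_gb, task_heap_mb):
    """How many task JVMs of the given heap size fit in the remaining RAM."""
    free_mb = (total_ram_gb - daemon_ram_gb) * 1024
    return int(free_mb // task_heap_mb)


def raw_disk_tb(data_tb, replication=3, temp_ratio=1.0):
    """Raw disk needed: every replica of the data plus temp working space.

    temp_ratio is an assumed rule of thumb (temp space about equal to
    the data size), following the "1 part for temporary" estimate.
    """
    return data_tb * replication + data_tb * temp_ratio


# 8 GB node, about 2 GB taken by task tracker + data node, 512 MB per task.
slots = task_slots(total_ram_gb=8, daemon_ram_gb=2, task_heap_mb=512)

# 1 TB of business data with the default replication factor of 3.
disk = raw_disk_tb(1.0)

print(f"Map + Reduce slots per node: {slots}")
print(f"Raw disk needed for 1 TB of data: {disk} TB")
```

The 12 slots line up with the "5-8 Map + 3-6 Reduce" split, and the 4 TB matches the 1/4 usable-space rule.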

To be continued ……………….

May 29, 2009

Setting up a Hadoop 0.20.0 cluster with a Windows slave

Filed under: Advanced computing, Java world, Tech — indoos @ 8:46 am

Here are the steps for setting up a Hadoop cluster and plugging in a Windows machine as a slave-

a. First set up a pseudo-Hadoop on a Linux machine as explained in http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

I was able to use this excellent tutorial with minor changes to get a pseudo-Hadoop cluster running on a CentOS/Ubuntu machine and a Windows machine.

I used a common user, hadoop, created on all machines.

b. The next step was to get all the pseudo-machines working together as a real cluster. Again, http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) was an easy reckoner to get it working.

Some easy tips to get hadoop working in cluster mode are

  1. Use machine names everywhere instead of IP addresses, and update /etc/hosts on all machines
  2. Configure the setup on the master machine, i.e. the conf XML files, including the masters and slaves files, as well as /etc/hosts, and then copy all these conf files and the /etc/hosts entries to the slave nodes
  3. The same copying trick helps for the authorized_keys file: add the public keys of all slaves to the master machine’s authorized_keys file and then copy this authorized_keys file to all slaves
  4. Set JAVA_HOME in each installation's hadoop-config.sh file. I had some issues with setting it in .profile and still getting JAVA_HOME problems
  5. Another easy option is to create a gzip of your master Hadoop install and copy it to the slave nodes for setup
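
One way to follow tips 1 and 2 is to keep a single list of machines and generate both the /etc/hosts entries and the slaves file from it, so every node gets identical copies. Here is a minimal Python sketch of that idea. The host names and IP addresses are placeholders, the function names are made up, and leaving the master out of the slaves file is just an assumption for this example.

```python
# Hypothetical cluster: names and addresses are placeholders.
nodes = {
    "master": "192.168.0.1",
    "slave1": "192.168.0.2",
    "winslave": "192.168.0.3",
}


def render_hosts(nodes):
    """Lines to append to /etc/hosts on every machine (IP, tab, name)."""
    return "\n".join(f"{ip}\t{name}" for name, ip in nodes.items()) + "\n"


def render_slaves(nodes, master="master"):
    """Contents of the slaves file: one worker host name per line."""
    return "\n".join(name for name in nodes if name != master) + "\n"


print(render_hosts(nodes), end="")
print(render_slaves(nodes), end="")
```

The generated text can then be copied to every node along with the rest of the conf files.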

c. So now for the Windows bit-

  1. Install Cygwin if you have not already done so
  2. Check whether you have the sshd server installed in your Cygwin setup; if not, install it
  3. Double check that you have a CYGWIN sshd service running under Windows services
  4. Create a hadoop user by-
cygwin> net user hadoop password /add /yes
cygwin> mkpasswd -l -u hadoop >> /etc/passwd
cygwin> chown hadoop -R /home/hadoop

d. Treat the Windows machine as *nix-

  1. Now use PuTTY to log in to your local Windows machine as the newly created hadoop user
  2. Set up Hadoop as you would on any Linux machine; the easy option is to copy-paste the master Hadoop installation
  3. Do not forget to set up the .ssh files: copy the public key into the master's authorized_keys file and copy that authorized_keys file back to this Windows machine. Also add JAVA_HOME to the hadoop-config.sh file; it should be a /cygdrive/<path to java6> entry

e. Assuming that the master server is already running, run this slave using “bin/hadoop-daemon.sh start datanode” or “bin/hadoop-daemon.sh start tasktracker” to run datanode or task tracker instances.

Next, I will write about how I managed to get the Hive 0.3.0 release working with Hadoop 0.20.0 on my small Hadoop cluster of 3 Linux machines and 1 Windows machine.

May 28, 2009

A must read for opponents of Code Quality and TDD

Filed under: Code quality, Java world — indoos @ 8:21 am

All test-driven development (TDD) and pair programming (PP) opponents- here is something really straightforward and easy to understand-

http://anarchycreek.com/2009/05/26/how-tdd-and-pairing-increase-production/

April 9, 2009

GAE+Groovlets – local+remote with Eclipse plugin

Filed under: Java world — indoos @ 8:19 am

After trying GAE for Java using the core GAE SDK, I went ahead and tried Grails+GAE. Sorry, it doesn’t work yet.

However, Groovy+GAE does work, as explained in a little tutorial. But only the production environment works, while development doesn’t ;)

Local deployment does not work due to a groovy.security.GroovyCodeSourcePermission (/groovy/shell) problem.

I then started trying the Google Plugin for Eclipse and got Groovlets+GAE working in the local as well as the remote environment.

Here are the steps-

GAE+Groovy+Eclipse

  • Changed the build.groovy file to use the war folder instead of the deploy folder {webinf = "war/WEB-INF" instead of webinf = "deploy/WEB-INF"}
  • Changed /.settings/com.google.appengine.eclipse.core.prefs to include groovy-all-1.6.1.jar in filesCopiedToWebInfLib

#Thu Apr 09 10:24:45 IST 2009
eclipse.preferences.version=1
filesCopiedToWebInfLib=appengine-api-1.0-sdk-1.2.0.jar|datanucleus-appengine-1.0.0.final.jar|datanucleus-core-1.1.0.jar|datanucleus-jpa-1.1.0.jar|geronimo-jpa_3.0_spec-1.1.1.jar|geronimo-jta_1.1_spec-1.1.1.jar|jdo2-api-2.3-SNAPSHOT.jar|groovy-all-1.6.1.jar|

  • The project can now be run locally using Run As >> Web Application without any groovy permission issues
  • The project can be deployed to Remote GAE using the cute little Deploy button provided by Google Eclipse Plugin {the button below Eclipse Menu bar-> Project menu in the above image}

December 11, 2008

Toyota Innovation Blog

Filed under: General — indoos @ 6:03 am

Found a really impressive and inspiring blog on innovation (Impetus, my employer, is all about innovation)-

http://creativityandinnovation.blogspot.com/2006/10/toyotas-innovation-factory.html

Love at first bite- GROOVY

Filed under: Java world — indoos @ 5:44 am

While looking at Ruby on Rails some time back, I was enticed by its mean, clean way of creating fast data-driven web sites. Being a hard-core JAVA-ite, I know the LABOR PAINS of achieving something similar in the Java world of JSF, Struts etc.

The first SIGHT of GROOVY aka GRAILS- I was enticed

The first BITE of GROOVY aka GRAILS- I was in Love!!!!

So I have a Rails Clone powered by  Java- deadly combo!!!!

The first few weeks were truly amazing as I tried my hand at a new project. Fast UI development, magical Ajax support, convention-over-configuration MIRCHY: it was what I had been wanting for so long.

Some weeks later, as Grails and I settle down together, I am becoming aware of our weaknesses (both mine and those of Grails/Groovy). It is not that bad yet, and with Big B Java as the heavenly Godfather covering up the setbacks, it has been good so far.

I am not too concerned about Grails/Groovy being slow (not sure whether that is even true). Why? Because Groovy's heart is actually Made in JAVA, and I will know what to pull where to get it beating faster.

Will keep you posted on whether this LOVE lasts forever.

October 16, 2008

Financial Crisis 2008- IT strategy change- from “Make Money” to “SAVE Money”

Filed under: General — indoos @ 8:07 am

The focus for the last many years in product based IT has been to find ways to MAKE MONEY. This meant creating solutions so that financial institutions and entrepreneurs could make more and more money.

However, now with the Money Making strategy apparently in the doldrums, as the current financial crisis is proving, the need of the hour is to find ways to Save whatever can be saved.

— THE BOAT IS SINKING – SALVAGE WHATEVER IS POSSIBLE–

What this translates into is that IT sales teams should be selling “Save Money” initiatives instead of the traditional solutions. The “Save Money” trend is already one of the main patterns driving development projects in big organizations like GE and Citibank, as well as in SMEs.

 

So what should product based IT sell-

  • Tools to reduce internal non-production costs – optimize cost on materials/resources 
  • Tools to reduce production costs
  • Tools to reduce IT costs- software as well as hardware
  • More …
  • And last but not the least important – tools for SHRINK:) (psychiatrist/psychologist)- they are certainly going to get good clientele.
