Sanjay Sharma’s Weblog

August 27, 2009

Hadoop- some revelations

Filed under: Advanced computing, Java world, Tech — indoos @ 5:44 am

My recent experience using Hadoop in production-grade applications was both good and bad.

Here are some of the bad ones to start with:

  • Using commodity servers – not entirely true, despite what the Hadoop web site itself suggests in places. Anything below 8 GB of RAM may not be enough for a heavy production application, particularly if each Map/Reduce task uses 1-2 GB of RAM (see the configuration sketch after this list)
    • The TaskTracker and DataNode JVM instances take at least around 1 GB of RAM each – effectively leaving 5-6 GB of RAM for the Map/Reduce task JVMs
    • At 512 MB per task JVM, that leaves room for roughly 5-8 Map plus 3-6 Reduce instances
  • Real-time applications usually depend on lookup or metadata tables. Although Hadoop does offer the DistributedCache and Configuration-based (pseudo) replication of small shared data, the overhead of heavy Java in-memory object handling (serialization/deserialization) and HDFS access does not allow performant lookup handling (see the DistributedCache sketch below)
  • I would love to see more/easier/default control over the various settings/parameters in the config files, as the current mechanism is a real pain in the back
  • Hadoop uses a lot of temp space. It is easy to NOT notice that only about 1/4 of your total disk space is available for business use: with the default (and sensible) replication factor of 3, every block is stored three times, and roughly one more part goes to temporary (working/intermediate) processing. So to process, say, 1 TB of data you may require around 4 TB+ of disk (1 TB × 3 replicas + ~1 TB temp). I learned about this the hard way after wasting good precious time!!
  • Last but not the least – it is really easy to write Map/Reduce jobs with Hadoop's ingenious framework, but really difficult to convert business logic into the Map/Reduce paradigm (a minimal skeleton of the easy part follows below)
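
On the memory point above, here is a minimal sketch of how that budget translates into Hadoop 0.20 settings. The heap size and slot counts are my assumptions for an 8 GB node, not recommendations: mapred.child.java.opts is a per-job property, while the slot maxima are daemon-side settings that belong in mapred-site.xml.

```java
import org.apache.hadoop.mapred.JobConf;

public class MemoryBudget {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MemoryBudget.class);

        // Per-task child JVM heap (a job-level property in Hadoop 0.20;
        // 512 MB is the assumed figure used in the numbers above).
        conf.set("mapred.child.java.opts", "-Xmx512m");

        // The slot maxima are daemon-side properties: they go in
        // mapred-site.xml and are read once at TaskTracker startup, so
        // setting them per job has no effect. Listed here only to spell
        // out the budget on an 8 GB node:
        //   8 GB - ~1 GB TaskTracker - ~1 GB DataNode = ~6 GB for tasks
        //   ~6 GB / 512 MB per task = ~12 slots, e.g. 8 maps + 4 reduces
        //
        //   mapred.tasktracker.map.tasks.maximum    = 8
        //   mapred.tasktracker.reduce.tasks.maximum = 4
    }
}
```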
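
On the lookup-data point, this is a minimal sketch of the DistributedCache pattern using the old (org.apache.hadoop.mapred) API. The cache file path and the tab-separated key/value format are assumptions for illustration. Note that the whole table still has to be read and deserialized into the heap of every single task JVM, which is exactly the overhead complained about above.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    // During job setup (elsewhere), the file is registered with:
    //   DistributedCache.addCacheFile(new URI("/meta/lookup.txt"), conf);
    // The path /meta/lookup.txt is a hypothetical example.

    public void configure(JobConf job) {
        try {
            // Hadoop has already copied the cached file to the task's local disk.
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2); // assumes tab-separated key/value
                if (kv.length == 2) {
                    lookup.put(kv[0], kv[1]);
                }
            }
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("failed to load lookup file", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // In-memory join of each input record against the cached table.
        String[] fields = value.toString().split("\t");
        String enriched = lookup.get(fields[0]);
        if (enriched != null) {
            out.collect(new Text(fields[0]), new Text(enriched));
        }
    }
}
```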
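
And on the last point, the framework side really is the easy bit: the canonical word count job fits on one page with the old API. The hard part, as said above, starts when the business logic does not decompose this neatly into a map step and a reduce step.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    public static class TokenMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                out.collect(word, ONE);
            }
        }
    }

    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            // Sum the counts emitted for each word.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(TokenMapper.class);
        conf.setReducerClass(SumReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```

Packaged into a jar, this runs with the standard `hadoop jar wordcount.jar WordCount <input> <output>` invocation. Anything that is a per-record transform plus an aggregation maps onto this shape; anything with cross-record dependencies or multi-pass logic is where the conversion pain begins.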

To be continued ……………….
