Sanjay Sharma’s Weblog

August 5, 2013

Global Big Data Conference Hyderabad-2Aug2013- Finance/Manufacturing Use Cases

Filed under: Uncategorized - indoos @ 6:30 pm

November 1, 2012

Big Data Technologies Landscape

Filed under: Cassandra, Cloud, Hadoop, Hive, NoSQL - indoos @ 2:24 pm

August 16, 2010

Hadoop Ecosystem World-Map

While preparing the keynote for the recently held HUG India meetup on 31st July, I decided to keep my session short, but useful and relevant to the lined-up sessions on hiho, JAQL and Visual Hive. I have always been a keen student of geography (and still take pride in it!) and thought it would be great to draw a visual geographical map of the Hadoop ecosystem. Here is what I came up with, along with a nice little story behind it:

  1. How did it all start? Huge data on the web!
  2. Nutch was built to crawl this web data
  3. Huge data had to be saved- HDFS was born!
  4. How to use this data?
  5. MapReduce framework built for coding and running analytics – Java, or any language via streaming/pipes
  6. How to get unstructured data in – web logs, click streams, Apache logs, server logs – FUSE, WebDAV, Chukwa, Flume, Scribe
  7. Hiho and Sqoop for loading data into HDFS – RDBMS can join the Hadoop bandwagon!
  8. High-level interfaces required over low-level MapReduce programming – Pig, Hive, Jaql
  9. BI tools with advanced UI reporting (drill-down etc.) – Intellicus
  10. Workflow tools over MapReduce processes and high-level languages
  11. Monitor and manage Hadoop, run jobs/Hive, view HDFS (high-level view) – Hue, Karmasphere, Eclipse plugin, Cacti, Ganglia
  12. Support frameworks – Avro (serialization), ZooKeeper (coordination)
  13. More high-level interfaces/uses – Mahout, Elastic MapReduce
  14. OLTP is also possible – HBase

Would love to hear feedback about this and how to grow it further to add the missing parts!

Hadoop ecosystem map

July 11, 2010

Hive BI analytics: Visual Reporting

Filed under: Hadoop, Hive, HPC, Java world - indoos @ 5:23 pm

I had earlier written about using Hive as a data source for industry-proven BI reporting tools, and here is a list of the various official announcements from Pentaho, Talend, MicroStrategy and Intellicus:

The topic is close to my heart since I firmly believe that while Hadoop and Hive are true large-data analytics tools, their power is currently limited to use by software programmers. The advent of BI tools in the Hadoop/Hive world would certainly bring them closer to the real end users: business users.

I am currently not too sure how these BI reporting tools decide how much of the analytics should be left in MapReduce and how much in the reporting tool itself; I guess it will take time to find the right balance. Chances are that I will find out a bit earlier than others, as I am working closely (read here) with the Intellicus team to get the changes into the Hive JDBC driver for Intellicus' interoperability with Hive.

June 24, 2010

Hive remote debugging

Filed under: Hadoop, Hive - indoos @ 2:40 am

Recently spent some time looking under the Hive hood while working with my colleague Sunil on HIVE-1346 in the Hive JDBC implementation.

It turned out that the code is not very easy to debug, so here is a useful script we used to enable remote debugging in Hive. We used Eclipse remote debugging with Hadoop 0.20.1 running in standalone mode and Hive 0.5.0.

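The long classpath and java commands below are wrapped for formatting; each one must reach the shell as a single logical line, so join the wrapped lines back together (or use backslash line continuations). Also, a better job can be done by using something like a 'for' loop to collect all the lib jars from the Hadoop and Hive lib directories.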

export HADOOP_HOME=/home/hadoop/hadoop-0.20.1
export HIVE_HOME=/home/hadoop/hive-0.5.0-bin
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HIVE_LIB=$HIVE_HOME/lib
export HIVE_CLASSPATH=$HIVE_HOME/conf:$HIVE_LIB/antlr-runtime-3.0.1.jar:$HIVE_LIB/asm-3.1.jar:\
$HIVE_LIB/commons-cli-2.0-SNAPSHOT.jar:$HIVE_LIB/commons-codec-1.3.jar:\
$HIVE_LIB/commons-collections-3.2.1.jar:$HIVE_LIB/commons-lang-2.4.jar:\
$HIVE_LIB/commons-logging-1.0.4.jar:$HIVE_LIB/commons-logging-api-1.0.4.jar:\
$HIVE_LIB/datanucleus-core-1.1.2.jar:$HIVE_LIB/datanucleus-enhancer-1.1.2.jar:\
$HIVE_LIB/datanucleus-rdbms-1.1.2.jar:$HIVE_LIB/derby.jar:$HIVE_LIB/hive-anttasks-0.5.0.jar:\
$HIVE_LIB/hive-cli-0.5.0.jar:$HIVE_LIB/hive-common-0.5.0.jar:$HIVE_LIB/hive_contrib.jar:\
$HIVE_LIB/hive-exec-0.5.0.jar:$HIVE_LIB/hive-hwi-0.5.0.jar:$HIVE_LIB/hive-jdbc-0.5.0.jar:\
$HIVE_LIB/hive-metastore-0.5.0.jar:$HIVE_LIB/hive-serde-0.5.0.jar:\
$HIVE_LIB/hive-service-0.5.0.jar:$HIVE_LIB/hive-shims-0.5.0.jar:\
$HIVE_LIB/jdo2-api-2.3-SNAPSHOT.jar:$HIVE_LIB/jline-0.9.94.jar:\
$HIVE_LIB/json.jar:$HIVE_LIB/junit-3.8.1.jar:$HIVE_LIB/libfb303.jar:\
$HIVE_LIB/libthrift.jar:$HIVE_LIB/log4j-1.2.15.jar:\
$HIVE_LIB/mysql-connector-java-5.0.0-bin.jar:$HIVE_LIB/stringtemplate-3.1b1.jar:\
$HIVE_LIB/velocity-1.5.jar:

export HADOOP_LIB=$HADOOP_HOME/bin/../lib

export HADOOP_CLASSPATH=$HADOOP_HOME/bin/../conf:$JAVA_HOME/lib/tools.jar:\
$HADOOP_HOME/bin/..:$HADOOP_HOME/bin/../hadoop-0.20.1-core.jar:\
$HADOOP_LIB/commons-cli-1.2.jar:$HADOOP_LIB/commons-codec-1.3.jar:$HADOOP_LIB/commons-el-1.0.jar:\
$HADOOP_LIB/commons-httpclient-3.0.1.jar:$HADOOP_LIB/commons-logging-1.0.4.jar:\
$HADOOP_LIB/commons-logging-api-1.0.4.jar:$HADOOP_LIB/commons-net-1.4.1.jar:$HADOOP_LIB/core-3.1.1.jar:\
$HADOOP_LIB/hsqldb-1.8.0.10.jar:$HADOOP_LIB/jasper-compiler-5.5.12.jar:\
$HADOOP_LIB/jasper-runtime-5.5.12.jar:$HADOOP_LIB/jets3t-0.6.1.jar:$HADOOP_LIB/jetty-6.1.14.jar:\
$HADOOP_LIB/jetty-util-6.1.14.jar:$HADOOP_LIB/junit-3.8.1.jar:$HADOOP_LIB/kfs-0.2.2.jar:\
$HADOOP_LIB/log4j-1.2.15.jar:$HADOOP_LIB/oro-2.0.8.jar:$HADOOP_LIB/servlet-api-2.5-6.1.14.jar:\
$HADOOP_LIB/slf4j-api-1.4.3.jar:$HADOOP_LIB/slf4j-log4j12-1.4.3.jar:$HADOOP_LIB/xmlenc-0.52.jar:\
$HADOOP_LIB/jsp-2.1/jsp-2.1.jar:$HADOOP_LIB/jsp-2.1/jsp-api-2.1.jar:

export CLASSPATH=$HADOOP_CLASSPATH:$HIVE_CLASSPATH:$CLASSPATH
# JDWP options: listen for a remote debugger (e.g. Eclipse) on port 8001 without suspending startup
export DEBUG_INFO="-Xmx1000m -Xdebug -Djava.compiler=NONE -Xrunjdwp:transport=dt_socket,address=8001,server=y,suspend=n"

# Start HiveServer with remote debugging enabled
$JAVA_HOME/bin/java $DEBUG_INFO -classpath $CLASSPATH -Dhadoop.log.dir=$HADOOP_HOME/bin/../logs \
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=$HADOOP_HOME/bin/.. \
-Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=$HADOOP_LIB/native/Linux-i386-32 \
-Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.util.RunJar $HIVE_LIB/hive-service-0.5.0.jar \
org.apache.hadoop.hive.service.HiveServer

# Or start the Hive CLI instead (both commands listen on debug port 8001, so run one at a time)
$JAVA_HOME/bin/java -Xmx1000m $DEBUG_INFO -classpath $CLASSPATH -Dhadoop.log.dir=$HADOOP_HOME/bin/../logs \
-Dhadoop.log.file=hadoop.log \
-Dhadoop.home.dir=$HADOOP_HOME/bin/.. -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console \
-Djava.library.path=$HADOOP_LIB/native/Linux-i386-32 -Dhadoop.policy.file=hadoop-policy.xml \
org.apache.hadoop.util.RunJar $HIVE_LIB/hive-cli-0.5.0.jar org.apache.hadoop.hive.cli.CliDriver
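
The 'for' loop idea mentioned before the script could look roughly like the sketch below. It is only an illustration: the helper name build_classpath is made up for this example (it is not part of Hadoop or Hive), and it assumes the jars sit directly inside the given lib directories, as they do in the script above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop or Hive): print a colon-separated
# classpath with every *.jar found directly inside the given directories.
build_classpath() {
    cp=""
    for dir in "$@"; do
        for jar in "$dir"/*.jar; do
            # Skip the unexpanded pattern when a directory contains no jars.
            [ -e "$jar" ] || continue
            # Append the jar, adding a ':' separator only after the first entry.
            cp="${cp:+$cp:}$jar"
        done
    done
    printf '%s\n' "$cp"
}

# Assumed usage, with the same variables as the script above:
# export HIVE_CLASSPATH="$HIVE_HOME/conf:$(build_classpath "$HIVE_LIB")"
# export HADOOP_CLASSPATH="$HADOOP_HOME/conf:$JAVA_HOME/lib/tools.jar:$(build_classpath "$HADOOP_HOME" "$HADOOP_LIB" "$HADOOP_LIB/jsp-2.1")"
```

This picks up whatever jar versions happen to be installed, so the script would not need editing for every Hadoop or Hive upgrade; the trade-off is that the jar order follows the shell's glob ordering rather than a hand-picked one.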

February 8, 2010

BI with MapReduce

Filed under: Advanced computing, Hadoop - indoos @ 2:12 pm

Have any of you used map reduce in the context of business intelligence?

While collating my thoughts on this LinkedIn Hadoop discussion, I found that I needed more visuals to explain it to myself first :).

So, here are the many ways in which Hadoop MapReduce offers an alternative in the big-big BI world:

Scenario 1: Use Hadoop and Hive as the interface to BI tools. Pentaho reporting is already supported as of Hive 0.4.0.

Scenario 2: Use Hadoop for initial data polishing, and then dump to a SQL-supported column-based database for near-real-time BI reporting. Aster Data, Vertica and Greenplum sell themselves by heavily advertising MapReduce connectors (or similar). The cost of a SQL-supported column-based database is the only pain point here (plus the risk of how these actually scale versus what they promise).

Scenario 3: Use Hadoop for initial data polishing, and then dump to a SQL-supported column-based database for near-real-time BI reporting. For real-time reporting, the data can be further BI-polished from the column-based database into a fast regular RDBMS with BI support.

 

Scenario 4: The free way :) Use Hadoop for initial data polishing, and then dump to a regular SQL database with BI support. The export from HDFS can be done the un-Sqoop way. The onus would be more on the developer to dump only ready-for-report (smaller) data, with most of the BI already completed as part of additional MR steps.
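
A hand-rolled export of the kind Scenario 4 describes could be sketched as below. This is only a rough illustration under several assumptions: the MR output is tab-separated text, the target database is MySQL, and the path, database and table names (/user/hadoop/report_ready, bi, daily_report) are placeholders invented for the example. With DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch of a manual HDFS -> RDBMS export; every name below is a placeholder.
HDFS_DIR=${HDFS_DIR:-/user/hadoop/report_ready}   # assumed MR output directory
LOCAL_FILE=${LOCAL_FILE:-/tmp/report_ready.tsv}   # assumed local staging file
DB=${DB:-bi}                                      # assumed MySQL database
TABLE=${TABLE:-daily_report}                      # assumed target table

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

export_to_sql() {
    # Merge the part-* files of the MR output into one local file.
    run hadoop fs -getmerge "$HDFS_DIR" "$LOCAL_FILE" || return 1
    # Bulk-load the file; LOAD DATA splits fields on tabs by default.
    run mysql --local-infile=1 "$DB" -e \
        "LOAD DATA LOCAL INFILE '$LOCAL_FILE' INTO TABLE $TABLE"
}
```

Setting DRY_RUN=1 and calling export_to_sql shows the two commands without touching the cluster or the database. The less data the final MR step leaves in HDFS_DIR, the cheaper this copy becomes, which is the point of pushing most of the BI work into MR first.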

The important fact to note is that there might be additional costs in moving the major chunk of the BI data analysis to programmatic interfaces (SQL or MR).

I am not much of the fallen-in-love-with-databases type, so I do like the way Hive can emerge as a potential BI reporting tool.
