Sanjay Sharma’s Weblog

July 27, 2011

Webinar- Big Data Analytics Platform: Beyond Traditional Enterprise Data Warehouse

Filed under: Uncategorized — indoos @ 4:46 am

July 28, 2011 (10:00 am PT / 1:00 pm ET)

Register here- http://www.impetus.com/webinar?eventid=45

Free webinar on ‘Big Data Analytics Platform: Beyond Traditional Enterprise Data Warehouse’ covering-

• Traditional EDW v/s Big Data Analytics Platform – What’s missing?
• Building Big Data Analytics Platform
• Is it possible to reuse the existing EDW investments?
• Using open source effectively for enhancing/replacing EDW solutions
• Best practices/ lessons learnt in building Big Data Analytics Platform
• Real-life examples

July 14, 2011

Cloud Computing for SMBs: A Level Playing Field | Cloud Computing Journal

Filed under: Cloud, Hadoop — indoos @ 2:18 pm

Cloud Computing for SMBs: A Level Playing Field | Cloud Computing Journal

May 10, 2011

Datastax Brisk Quick Start in 10 minutes using git source

Filed under: Cassandra, Hadoop, Hive, NoSQL — indoos @ 5:08 pm

Steps-
1. git clone <brisk git url – the brisk1 branch was used> into <brisk dir>
2. cd <brisk dir>
3. ant
4. <brisk dir>/bin/brisk cassandra -t
This should get the JobTracker/TaskTracker running.
5. <brisk dir>/bin/brisk hive
This should get the Hive CLI running.
The Hive commands from http://www.datastax.com/docs/0.8/brisk/about_hive can be used to test the setup.
6. <brisk dir>/resources/cassandra/bin/cassandra-cli
This starts the Cassandra command-line client.

The demo application, Portfolio Manager, works almost as documented at http://www.datastax.com/docs/0.8/brisk/brisk_demo.
It fails, however, while running “./bin/pricer -o UPDATE_PORTFOLIOS”.
This can be resolved by first running the CREATE TABLE commands from “<brisk dir>/demos/portfolio_manager/10_day_loss.q” to create the missing tables.

The rest works fine, in line with the website documentation.

These steps were used to run Brisk from source on 64-bit openSUSE in single-node cluster mode.

January 20, 2011

Some quotes- Unit testing in Product Development

Filed under: Code quality — indoos @ 10:26 am

“Is unit testing like a health insurance plan or a term life plan? Both, if used correctly.”

“Software testing is like buckling your seat belt in an airbag-equipped car: you never know what might happen.”

“TDD is like putting up an anti-theft security system at home from day one!”

November 18, 2010

Multicore impact on software development

Filed under: Uncategorized — indoos @ 4:52 pm

Recently I got the chance to rub shoulders with academia and chip-design software engineers in Bengaluru at http://www.innovate-it.in/, a conference on “Issues in design of complex Multi-Core Systems”. I was speaking on “Multicore: Choice of Middleware and Framework” and was one of the few attendees from the application software world, while most came from chip design or hard-core hardware/system-level programming backgrounds.

A few of my revelations from the event:

– There are no silver bullets (yet!) for migrating traditional software to multi-core

– There is certainly a huge vacant playground for new players to come up with technologies that allow software to harness multi-core power with no or minimal software changes. Azul Systems is a real-world example of this.

– One interesting finding was that Hadoop’s design is very similar to multi-core internals. Distribution of work, data sharing/synchronization and better cache management are problems common to both, and they are being solved in much the same fashion. Nice to know that the same fundamental solutions fit at the mega level as well as the micro level!

Hadoop’s programming model fits quite well in the multi-core world, as evidenced by some of the success reported in running MapReduce on GPUs (Mars).

One practical tip for Hadoop clusters is to set the maximum number of maps + reducers on a single node according to the number of cores: 8-core machines can run more parallel maps + reducers than 4-core machines. Also keep in mind that the DataNode and TaskTracker daemons will consume some of the cores’ resources. Refer to this paper for more details: White paper – HadoopPerformanceTuning.pdf
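As a rough back-of-the-envelope sketch of that heuristic (my own illustration, not taken from the white paper; the slot properties are the standard Hadoop 0.20-era settings, while the 60/40 split and the two cores reserved for daemons are assumptions):

    public class SlotSizing {
        public static void main(String[] args) {
            // Cores available on this node
            int cores = Runtime.getRuntime().availableProcessors();

            // Leave roughly two cores for the DataNode and TaskTracker daemons (assumption)
            int slotsForTasks = Math.max(1, cores - 2);

            // Split the remaining capacity between map and reduce slots (60/40 is an assumption)
            int mapSlots = Math.max(1, (int) Math.round(slotsForTasks * 0.6));
            int reduceSlots = Math.max(1, slotsForTasks - mapSlots);

            // Per-node settings that would go into mapred-site.xml
            System.out.println("mapred.tasktracker.map.tasks.maximum = " + mapSlots);
            System.out.println("mapred.tasktracker.reduce.tasks.maximum = " + reduceSlots);
        }
    }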

August 16, 2010

Hadoop Ecosystem World-Map

While preparing the keynote for the recently held HUG India meetup on 31st July, I decided to keep my session short but useful and relevant to the lined-up sessions on hiho, JAQL and Visual Hive. I have always been a keen student of geography (and still take pride in it!), so I thought it would be great to draw a visual geographical map of the Hadoop ecosystem. Here is what I came up with, along with the little story behind it:

  1. How did it all start? Huge data on the web!
  2. Nutch was built to crawl this web data
  3. Huge data had to be saved – HDFS was born!
  4. How to use this data?
  5. The MapReduce framework was built for coding and running analytics – in Java, or in any language via Streaming/Pipes (see the sketch after this list)
  6. How to get unstructured data in – web logs, click streams, Apache logs, server logs – FUSE, WebDAV, Chukwa, Flume, Scribe
  7. hiho and Sqoop for loading data into HDFS – RDBMSs can join the Hadoop bandwagon!
  8. High-level interfaces required over low-level MapReduce programming – Pig, Hive, JAQL
  9. BI tools with advanced UI reporting (drill-down etc.) – Intellicus
  10. Workflow tools over MapReduce processes and high-level languages
  11. Monitor and manage Hadoop, run jobs/Hive, view HDFS – a high-level view – Hue, Karmasphere, the Eclipse plugin, Cacti, Ganglia
  12. Support frameworks – Avro (serialization), ZooKeeper (coordination)
  13. More high-level interfaces/uses – Mahout, Elastic MapReduce
  14. OLTP also possible – HBase
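To make point 5 concrete, here is a minimal sketch of the Java side of that programming model (the classic word-count mapper, written against the old org.apache.hadoop.mapred API of that era; the class name is mine):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Classic word-count mapper: turns each input line into (word, 1) pairs;
    // the framework groups the pairs by word and a reducer sums the counts.
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    output.collect(word, ONE);
                }
            }
        }
    }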

Would love to hear feedback about this and how to grow it further to add the missing parts!

Hadoop ecosystem map

July 26, 2010

Next Hadoop India User Group Meetup – July 2010

I am pretty excited about and looking forward to attending the next HUG meetup on 31st July 2010 in Noida. I really hope to see energetic Indian Hadoopers discussing what’s happening in the Indian Hadoop community as well as in the rest of the world.

I guess I may have been the culprit behind the delay; otherwise we would have held the event at least 2-3 months earlier. I will now try to hold similar events more frequently, and I already have thoughts about planning one around NoSQL databases, again one of my favourites as a technology of the future. Unlike last time in November 2009, a group of young Impros, the Absolute Zero forum, is organizing the event and sparing me lots of pain :). Of course, none of this would have been possible without the support of iLabs and Impetus, pushing us to participate in the open source community as much as possible.

The HUG event this time will have some interesting sessions. Sonal Goyal will be talking about ‘hiho’, an open source solution for bridging the gap between the RDBMS world and Hadoop. As I foresee it, all software-based businesses, including SMEs, would like to ride the bandwagon of using BI and consumer analytics to enhance their business, and Hadoop is going to enable that in a cost-effective way. RDBMSs will continue to be used for real-time applications, since they are time-tested and essentially do not face serious competition (not yet!) from the new-age NoSQL databases. So the demand for tools that bring RDBMS data into Hadoop analytics systems is going to be hot! ‘hiho’ and Sqoop are the two top contenders in this category. Hopefully Sonal will be able to share with us the power of hiho as well as its pros and cons versus Sqoop.

The JAQL talk from Himanshu of IBM will again be interesting; it is good to know that people are trying out approaches other than MapReduce Java/streaming coding and the traditional Pig and Hive high-level interfaces. The challenge for Himanshu will be to help us understand how JAQL is better than Hive or Pig.

Sajal will be talking about Hive + Intellicus, a window into the unstoppable future of Hadoop in DW and BI.

I have always been biased towards Hive, as SQL and Java usually go hand in hand in almost all business applications. So it will be interesting to see how Hadoop, through Hive, is slowly becoming ready for enterprise applications and providing a visual interface for data analytics. It seems that, at last, Hadoop is ready to come out of the developer-only world and enter the domain of business users.

July 11, 2010

Hive BI analytics: Visual Reporting

Filed under: Hadoop, Hive, HPC, Java world — indoos @ 5:23 pm

I had earlier written about using Hive as a data source for industry-proven BI reporting tools, and here is a list of the various official announcements from Pentaho, Talend, MicroStrategy and Intellicus:

The topic is close to my heart, since I firmly believe that while Hadoop and Hive are true large-data analytics tools, their power is currently limited to use by software programmers. The advent of BI tools in the Hadoop/Hive world will certainly bring them closer to the real end users: business users.

I am currently not too sure how these BI reporting tools decide how much of the analytics is left in MapReduce and how much lives in the reporting tool itself; I guess it will take time to find the right balance. Chances are that I will find out a bit earlier than others, as I am working closely (read here) with the Intellicus team on the changes to the Hive JDBC driver needed for Intellicus’ interoperability with Hive.
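For context, here is a minimal sketch of how a reporting tool talks to Hive over JDBC (assuming the HiveServer-era driver class and default port of Hive 0.x; the table and column names are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // Old HiveServer JDBC driver and URL format (Hive 0.x)
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");

            Statement stmt = con.createStatement();
            // Hypothetical table; Hive compiles the query into MapReduce jobs behind the scenes
            ResultSet rs = stmt.executeQuery("SELECT category, COUNT(1) FROM sales GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }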

July 2, 2010

kundera- making life easy for Apache Cassandra users

Filed under: Cassandra, HPC, Java world, NoSQL — indoos @ 4:54 am

One of my colleagues, Animesh, has been working on an annotation-based wrapper over Cassandra, and we have finally decided to open source it so that it can be nurtured as part of the bigger community.

kundera is hosted on Google Code and can be reached here – http://code.google.com/p/kundera/

Here is how to get started with kundera in 5 minutes – http://anismiles.wordpress.com/2010/06/30/kundera-knight-in-the-shining-armor/

The logic behind kundera is quite simple: provide an ORM-like wrapper over the difficult-to-use Thrift APIs. Eventually, all NoSQL databases would benefit from similar APIs, so that using them becomes easy.

The initial release includes a JPA-like annotation library. The roadmap is to subsequently turn it into a Cassandra-specific JPA extension. The other important feature to be added is index/search support using Lucandra/Solandra.
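To give a feel for the JPA style that kundera’s annotation library mirrors, here is a sketch of an entity using the standard javax.persistence annotations (the entity and column names are made up, and kundera’s own Cassandra-specific annotations are documented on the project page):

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;

    // A plain JPA-style entity; the idea is that kundera maps something like this
    // onto a Cassandra column family instead of a relational table.
    @Entity
    public class UserProfile {

        @Id
        private String userId;            // row key

        @Column(name = "display_name")
        private String displayName;       // a column in the column family

        @Column(name = "email")
        private String email;

        // getters/setters omitted for brevity
    }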

June 24, 2010

Webinar details – Large data and compute HPC offerings at Impetus

Filed under: HPC — indoos @ 2:28 pm

