Conor O'Mahony's Database Diary

Your source of IBM database software news (DB2, Informix, Hadoop, & more)

Archive for the ‘InfoSphere BigInsights’ Category

How IBM and Oracle Approach Big Data Solutions

This blog post refers to the definition of Big Data commonly in use today. I do not include mainframe-based solutions, which some people might argue tackle Big Data challenges.

Both IBM and Oracle are going after the Big Data market. However, they are taking different approaches. I’m going to take a few moments to have a very brief look at what both companies are doing.

First of all, Oracle have introduced an “appliance” for Big Data. IBM have not. I put the word appliance in quotes because I consider this Oracle appliance to be closer in nature to an integrated collection of hardware and software components, rather than a true appliance that is designed for ease of operation. But the more important consideration is whether an appliance even makes sense for Big Data. There is a decent examination of this topic in the following blog post from Curt Monash and the accompanying comment stream: Why you would want an appliance — and when you wouldn’t. But, regardless of your position on this subject, the fact remains that Oracle currently propose an appliance-based approach, while IBM does not.

The other area I will briefly look at is the scope of the respective vendor approaches. In the press release announcing the Oracle Big Data Appliance, Oracle claim that:

Oracle Big Data Appliance is an engineered system optimized for acquiring, organizing, and loading unstructured data into Oracle Database 11g.

IBM takes a very different approach. IBM does not see its Big Data platform as primarily being a feeder for its relational database products. Instead, IBM sees this as being one possible use case. However, the ways that customers want to use Big Data technologies extend well beyond that use case. IBM is designing its Big Data platform to cater for a wide variety of solutions, some of which involve relational solutions and some of which do not. For instance, the IBM Big Data platform includes:

  • BigInsights for Hadoop-based data processing (regardless of the destination of the data)
  • Streams for analyzing data in motion (where you don’t necessarily store the data)
  • TimeSeries for smart meter and sensor data management
  • and more

So, as you can see, there are fundamental differences in the ways that IBM and Oracle are developing products for Big Data solutions. For more information, see IBM Big Data and Oracle Big Data.


Get a Free Copy of the Forrester Wave™ for Enterprise Hadoop Solutions

Today, Forrester published its Wave analysis for enterprise Hadoop solutions. It has detailed coverage of the Hadoop solutions from vendors like IBM, MapR, Cloudera, Hortonworks, and others. If you are considering an enterprise Hadoop solution, such as IBM InfoSphere BigInsights, it will make for very interesting reading. You can download a free copy of the report from The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012.

Written by Conor O'Mahony

February 2, 2012 at 2:41 pm

Need Help Determining Hadoop Split Sizes? Use Adaptive MapReduce Instead!

IBM is actively working on adaptive features for the Map and Reduce phases of its InfoSphere BigInsights product (which is based on Apache Hadoop). In some cases, this involves applying techniques commonly found in mature data management products, and in some cases it involves developing new techniques. While a number of these adaptive features are still under development, there are some features in the product today. For instance, BigInsights currently includes an Adaptive Mapper capability that allows Mappers to successively process multiple splits for a job, and avoid the start-up costs for subsequent splits.

When a MapReduce job begins, Hadoop divides the data into multiple splits. It then creates Mapper tasks for each split. Hadoop deploys the first wave of Mapper tasks to the available processors. Then, as Mapper tasks complete, Hadoop deploys the next Mapper tasks in the queue to the available processors. However, each Mapper task has a start-up cost, and that start-up cost is repeated each time a Mapper task starts.

With BigInsights, there is not a separate Mapper task for each split. Instead, BigInsights creates Mapper tasks on each available processor, and those Mapper tasks successively process the splits. This means that BigInsights significantly reduces the Mapper start-up cost. You can see the results of a benchmark for a set-similarity join workload in the following chart. In this case, the tasks have a high start-up cost. The AM bar (Adaptive Mapper) in the chart is based on a 32MB split size. You can see that by avoiding the recurring start-up costs, you can significantly improve performance.

Adaptive MapReduce Benchmark: Set-Similarity Join Workload

Of course, if you chose the largest split size (2GB), you would achieve similar results to the Adaptive Mapper. However, you might expose yourself to the imbalanced workloads that sometimes accompany very large splits.
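
To make this trade-off concrete, here is a toy cost model in Python. It is only a sketch under stated assumptions: the function names, the greedy scheduler, and every number in it are invented for illustration. It does not show how Hadoop or BigInsights is implemented, and it does not reproduce the benchmark above. It simply compares "one Mapper task per split" with "one long-lived Mapper per processor that pulls splits in turn".

```python
import heapq


def makespan(task_costs, workers):
    """Finish time when each task goes to whichever worker frees up first."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)        # earliest available worker
        heapq.heappush(free_at, start + cost)
    return max(free_at)


def classic_job(split_work, workers, startup):
    """One Mapper task per split: every split pays the start-up cost."""
    return makespan([startup + work for work in split_work], workers)


def adaptive_job(split_work, workers, startup):
    """One Mapper per worker: start-up is paid once, then splits are pulled."""
    return startup + makespan(split_work, workers)


def merge_splits(split_work, factor):
    """Model a larger split size by merging `factor` consecutive splits."""
    return [sum(split_work[i:i + factor])
            for i in range(0, len(split_work), factor)]


if __name__ == "__main__":
    workers, startup = 8, 0.5                 # hypothetical values

    # Evenly sized work: 64 small splits, 1 unit of work each.
    even = [1.0] * 64
    print(classic_job(even, workers, startup))                   # 12.0
    print(adaptive_job(even, workers, startup))                  # 8.5
    print(classic_job(merge_splits(even, 8), workers, startup))  # 8.5

    # Skewed work: the last 8 small splits are 3x as expensive.
    skewed = [1.0] * 56 + [3.0] * 8
    print(adaptive_job(skewed, workers, startup))                   # 10.5
    print(classic_job(merge_splits(skewed, 8), workers, startup))   # 24.5
```

In this model, very large splits match the adaptive approach when the work is evenly spread, but a single expensive large split ends up on one processor when the work is skewed. Small splits with long-lived Mappers keep both the start-up cost and the imbalance low.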

The following chart shows the results of a benchmark for a join query on TERASORT records. Again the AM bar (Adaptive Mapper) in the chart is based on a 32MB split size.

Adaptive MapReduce Benchmark: TERASORT Join Workload

In this case, the Adaptive Mapper results in a more modest performance improvement, although it is still an improvement. The key benefit of these Adaptive MapReduce features is that they eliminate some of the hassles associated with determining the split sizes, while also improving performance.

As I mentioned earlier in this post, a number of additional Adaptive MapReduce features are currently in development for future versions of BigInsights. I look forward to telling you about them when they are released…

In the meantime, make sure to check out the free online Hadoop courses at Big Data University. I previously blogged about my experiences with these courses in Hadoop Fundamentals Course on

Written by Conor O'Mahony

December 7, 2011 at 1:07 pm

Comparing HDFS and GPFS for Hadoop

Here is a chart that compares the performance of Hadoop Distributed File System (HDFS) with General Parallel File System-Shared Nothing Cluster (GPFS-SNC) for certain Hadoop-based workloads (it comes from the Understanding Big Data book). As you can see, GPFS-SNC easily outperforms HDFS. In fact, the book claims that a 10-node GPFS-SNC-based Hadoop cluster can match the performance of a 16-node HDFS-based Hadoop cluster.

Comparing HDFS and GPFS for Hadoop Workloads

GPFS was developed by IBM in the 1990s for high-performance computing applications. It has been used in many of the world’s fastest computers (including Blue Gene and Watson). Recently, IBM extended GPFS to develop GPFS-SNC, which is suitable for Hadoop environments. A key difference between GPFS-SNC and HDFS is that GPFS-SNC is a kernel-level file system, whereas HDFS runs on top of the operating system. This means that GPFS-SNC offers several advantages over HDFS, including:

  • Better performance
  • Storage flexibility
  • Concurrent read/write
  • Improved security

If you are interested in seeing how GPFS-SNC performs in your Hadoop cluster, please contact IBM. Although GPFS-SNC is not in the current release of InfoSphere BigInsights (IBM’s Hadoop-based product), GPFS-SNC is currently available to select clients as a technology preview.

Written by Conor O'Mahony

November 30, 2011 at 1:07 pm

Demo: Analyzing Twitter Data with IBM Big Data

Last week, I included a demonstration of Using Hadoop to Extract and Analyze Unstructured Information. Now I’d like to share another demo. This demo also shows InfoSphere BigInsights and InfoSphere BigSheets. BigInsights is essentially Apache Hadoop together with extensions for installation, management, security, and integration, while BigSheets is basically an easy-to-use interface for creating and running Map and Reduce jobs.

This demo shows you how to run sentiment analysis on Tweets. Some of the details of creating the specific text analytics are not included, but it is interesting and useful nonetheless. It also shows how you can easily run some cool visualizations on that data. Make sure to keep watching until the end, where David Barnes shows a great visualization on the UK Parliament data.

Don’t forget there is no charge for BigInsights Basic Edition. You can freely download it from InfoSphere BigInsights.

Written by Conor O'Mahony

October 3, 2011 at 8:30 am

Demo: Using Hadoop to Extract and Analyze Unstructured Information

Here’s a nice demo. It shows InfoSphere BigInsights, which is IBM’s Hadoop product. BigInsights is essentially Apache Hadoop together with extensions for installation, management, security, integration, and so on. The demo also shows InfoSphere BigSheets. BigSheets is basically an easy-to-use interface for creating and running Map and Reduce jobs. As you can see from the demo, BigSheets makes it quick and easy to apply text analytics extractors and filters to unstructured or semi-structured data. The demo itself shows how you can quickly analyze several aspects of revenue information pulled from earnings press releases. It even includes a nice round-trip to the annotated source data to see “why” certain conditions occurred.

Don’t forget there is no charge for BigInsights Basic Edition. You can freely download it from InfoSphere BigInsights.

Written by Conor O'Mahony

September 27, 2011 at 8:30 am

Hadoop Fundamentals Course on

After spending some time reading about Apache Hadoop, I decided it was time to get my hands dirty. So this weekend, I took the Hadoop Fundamentals 1 self-paced course on It is a really nice way to play with Hadoop. You have the choice of downloading the software and installing it on your computer, working with a VMware image, or working in the cloud. I chose the option of working in the cloud. Within a few minutes I had an Amazon AWS account, a RightScale account, and the software installed in the cloud. By the way, although the course is free, I did incur some cloud-related usage charges. They amounted to approximately $1 in Amazon charges for the time it took me to complete the course.

The course itself is quite good. It is, as the abstract implies, a high-level overview. It describes the concepts involved in Hadoop environments, describes the Hadoop architecture, and provides an opportunity to follow tutorials for using Pig, Hive, and Jaql. It also provides a tutorial on using Flume. Because of my experience with JavaScript and JSON, I feel most comfortable using Jaql to query data in Hadoop. However, the DBAs among you will probably feel most comfortable with Hive, given its SQL-friendly approach.

If you are curious about Hadoop, I’d recommend this course. I’m eagerly anticipating the availability of the follow-on Hadoop course…

Written by Conor O'Mahony

September 6, 2011 at 11:53 am
