Top Posts of 2011

It's that time of year again. Here are the top posts from this blog in 2011, as judged by number of views.

  1. IBM DB2 Welcomes Oracle Database/HP Itanium Customers
  2. New IBM DB2 vs. Oracle Database Advertising Campaign
  3. A Closer Examination of Oracle’s “Database Performance” Advertisement
  4. Comparing Price for Oracle Exadata and IBM Smart Analytics System
  5. IBM DB2 Strikes Another Blow to Oracle Database

As you can see, a strong DB2/Oracle Database competitive theme runs through these popular topics. And here are the top posts of 2011 as judged by reader participation, that is, by the number of comments (or perhaps the amount of controversy).

  1. New IBM DB2 vs. Oracle Database Advertising Campaign (20 comments)
  2. A Closer Examination of Oracle’s “Database Performance” Advertisement (19 comments)
  3. The Future of the NoSQL, SQL, and RDBMS Markets (12 comments)
  4. Update on the IBM DB2 “SQL Skin” for Migrating from Sybase ASE (8 comments)
  5. Industry Benchmark Result for DB2 pureScale: SAP Transaction Banking (TRBK) Benchmark (7 comments)

December 19, 2011

Deploying DB2 and InfoSphere Warehouse on Private Clouds

Cloud computing is certainly a hot topic these days. If an organization is not already using cloud computing, it has plans to do so. The economics, agility, and value offered by cloud computing are just too persuasive for IT organizations to ignore.

Even the high-profile Amazon outage couldn’t slow cloud computing’s relentless march towards mainstream adoption. If anything, that outage helped make cloud computing more robust by highlighting the need for hardened policies and procedures around provisioning in the cloud.

IBM recently announced updates to a set of products that make it easy to deploy DB2 and InfoSphere Warehouse on private clouds:

  • IBM Workload Deployer (previously known as WebSphere CloudBurst), which is a hardware/software appliance that streamlines the deployment and management of software on private clouds.
  • IBM Transactional Database Pattern, which works with the IBM Workload Deployer to generate DB2 instances that are suitable for transactional workloads.
  • IBM Data Mart Pattern, which generates InfoSphere Warehouse instances for data mart workloads.

These patterns do more than deploy virtual images with pre-configured software. You should instead think of them as mini-applications for configuring and deploying cloud-based database instances. Users specify information about the database, and then the pattern builds and deploys the database instance.

The Transactional Database Pattern is for OLTP deployments. It includes templates for sizing the virtual machine, database backup scheduling, database deployment cloning capabilities, and tooling (including Data Studio). The Data Mart Pattern incorporates the features of the OLTP pattern, together with deep compression and data movement tools. It is, of course, configured and optimized for data mart workloads in a virtual environment.

December 12, 2011

Need Help Determining Hadoop Split Sizes? Use Adaptive MapReduce Instead!

IBM is actively working on adaptive features for the Map and Reduce phases of its InfoSphere BigInsights product (which is based on Apache Hadoop). In some cases, this involves applying techniques commonly found in mature data management products, and in some cases it involves developing new techniques. While a number of these adaptive features are still under development, there are some features in the product today. For instance, BigInsights currently includes an Adaptive Mapper capability that allows Mappers to successively process multiple splits for a job, and avoid the start-up costs for subsequent splits.

When a MapReduce job begins, Hadoop divides the data into multiple splits. It then creates Mapper tasks for each split. Hadoop deploys the first wave of Mapper tasks to the available processors. Then, as Mapper tasks complete, Hadoop deploys the next Mapper tasks in the queue to the available processors. However, each Mapper task has a start-up cost, and that start-up cost is repeated each time a Mapper task starts.

With BigInsights, there is not a separate Mapper task for each split. Instead, BigInsights creates Mapper tasks on each available processor, and those Mapper tasks successively process the splits. This means that BigInsights significantly reduces the Mapper start-up cost. You can see the results of a benchmark for a set-similarity join workload in the following chart. In this case, the tasks have a high start-up cost. The AM bar (Adaptive Mapper) in the chart is based on a 32MB split size. You can see that by avoiding the recurring start-up costs, you can significantly improve performance.

Adaptive MapReduce Benchmark: Set-Similarity Join Workload

Of course, if you chose the largest split size (2GB), you would achieve similar results to the Adaptive Mapper. However, you might then expose yourself to the imbalanced workloads that sometimes accompany very large splits.
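
To make the trade-off between split size and start-up cost concrete, here is a toy cost model in Python. It is only a sketch and not how Hadoop or BigInsights is implemented: the function names, the 5-second start-up cost, the 32 MB/s processing rate, and the 8 processor slots are all assumptions chosen for illustration, and the numbers it produces are not the benchmark results. It compares a job where every split is its own Mapper task with a job where each slot starts one Mapper that works through splits in turn.

```python
import heapq


def split_sizes(total_mb, split_mb):
    """Divide total_mb of input into splits of at most split_mb each."""
    full, rest = divmod(total_mb, split_mb)
    return [split_mb] * full + ([rest] if rest else [])


def makespan(splits, slots, startup_s, mb_per_s, adaptive):
    """Estimate job time with a greedy schedule (toy model, assumed costs).

    adaptive=False: every split is a separate Mapper task and pays startup_s.
    adaptive=True:  each slot pays startup_s once, then processes splits
                    one after another.
    """
    if not splits:
        return 0.0
    # Each heap entry is the time at which one slot becomes free.
    initial = startup_s if adaptive else 0.0
    free_at = [initial] * min(slots, len(splits))
    heapq.heapify(free_at)
    for size in splits:
        start = heapq.heappop(free_at)  # the slot that frees up first
        cost = size / mb_per_s + (0.0 if adaptive else startup_s)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


if __name__ == "__main__":
    SLOTS, STARTUP_S, MB_PER_S = 8, 5.0, 32.0  # hypothetical cluster
    for total_mb in (64 * 1024, 20 * 1024):
        small = split_sizes(total_mb, 32)
        large = split_sizes(total_mb, 2048)
        print(f"input = {total_mb} MB")
        print("  32MB splits, one task per split:",
              makespan(small, SLOTS, STARTUP_S, MB_PER_S, adaptive=False))
        print("  2GB splits, one task per split: ",
              makespan(large, SLOTS, STARTUP_S, MB_PER_S, adaptive=False))
        print("  32MB splits, adaptive mappers:  ",
              makespan(small, SLOTS, STARTUP_S, MB_PER_S, adaptive=True))
```

In this model, the 64GB input divides evenly across the slots, so 2GB splits come close to the adaptive case. The 20GB input yields ten 2GB splits for eight slots, so two slots do twice the work of the others. That is the kind of imbalance that small splits combined with reused Mappers avoid.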

The following chart shows the results of a benchmark for a join query on TERASORT records. Again the AM bar (Adaptive Mapper) in the chart is based on a 32MB split size.

Adaptive MapReduce Benchmark: TERASORT Join Workload

In this case, the Adaptive Mapper results in a more modest performance improvement, although it is still an improvement. The key benefit of these Adaptive MapReduce features is that they eliminate some of the hassles associated with determining the split sizes, while also improving performance.

As I mentioned earlier in this post, a number of additional Adaptive MapReduce features are currently in development for future versions of BigInsights. I look forward to telling you about them when they are released…

December 7, 2011

