
Conor O'Mahony's Database Diary

Your source of IBM database software news (DB2, Informix, Hadoop, & more)

Data Compression in IBM DB2 and Oracle Database

with 6 comments

Last week, in my blog post titled Truth in Advertising – Advanced Data Compression in Oracle 11g, I discussed the fact that data compression rates are highly dependent upon 1) the nature of the data and 2) the database environment. I’d like to follow up on that post with a data point that directly compares IBM DB2 and Oracle Database.

Last week, I mentioned that TPC-C data contains random strings. As such, it is not an ideal data set for measuring data compression rates. TPC-H is actually a better data set, as it contains data that is closer in nature to “real world” data. There are issues with TPC-H data as well. Since the data is programmatically generated, most columns have a uniform distribution, which limits the clustering of data that tends to happen in “real world” data sets. Also, there are some “fill columns” with long unique content that is difficult to compress. Nonetheless, the TPC-H data is a better data set for comparing data compression rates than TPC-C.
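
To see why the distribution matters, here is a quick illustrative sketch. It uses a general-purpose compressor on made-up column values, so it only stands in for the effect; it is not how DB2 or Oracle dictionary compression works, and the numbers are purely illustrative:

    import random
    import zlib

    random.seed(42)

    # Hypothetical "column" of 100,000 eight-character values.
    # Uniform: every value drawn independently from a large domain,
    # roughly what a programmatic data generator produces.
    uniform = [format(random.randrange(10**8), "08d") for _ in range(100_000)]

    # Skewed and clustered: a small set of popular values stored in runs,
    # closer to what many real-world columns look like.
    popular = [format(v, "08d") for v in random.sample(range(10**8), 20)]
    skewed = []
    while len(skewed) < 100_000:
        skewed.extend([random.choice(popular)] * random.randint(50, 200))
    skewed = skewed[:100_000]

    for name, column in (("uniform", uniform), ("skewed/clustered", skewed)):
        raw = "".join(column).encode()
        packed = zlib.compress(raw, 9)
        saved = 100 * (1 - len(packed) / len(raw))
        print(f"{name:17s} raw={len(raw):,} B  compressed={len(packed):,} B  saved={saved:.0f}%")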

Luckily for us, Oracle has published its compression results for TPC-H data in a paper presented at the Conference on Very Large Data Bases (VLDB), titled Data Compression in Oracle. In that paper, they write:

When loaded with the compression feature, compression for LINEITEM is the highest at a compression factor of about 1.6, while ORDER compresses at a compression factor of about 1.2… LINEITEM shrinks from 79 GB to 49 GB while ORDERS shrinks from 17 GB to 14 GB.

When we compare these figures to the compression rates that DB2 achieves for TPC-H data, we see that DB2 compresses the data significantly better. Higher compression rates are better, of course, because they mean you need less storage. Storage-related costs are often among the largest components of a system's total cost, so lowering them can have a very real impact on IT budgets.

TPC-H Table   Oracle (space saved)   DB2 (space saved)
LINEITEM      38%                    58%
ORDERS        18%                    60%
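
For clarity, the percentages in the table are space savings, and they follow directly from the sizes Oracle quotes above. Converting between a compression factor and a savings percentage is simple arithmetic:

    # Space saved = 1 - (compressed size / original size)
    # Compression factor = original size / compressed size
    def pct_saved(original_gb, compressed_gb):
        return 100 * (1 - compressed_gb / original_gb)

    # Oracle's published TPC-H sizes from the VLDB paper quoted above:
    print(round(pct_saved(79, 49)))  # LINEITEM: 38 (factor 79/49 ~ 1.6)
    print(round(pct_saved(17, 14)))  # ORDERS:   18 (factor 17/14 ~ 1.2)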

 
Please keep in mind that these results are only for TPC-H data, and your data may achieve different compression rates. For instance, Tellabs and SunTrust Bank both report using DB2 to reduce the size of database tables by as much as 83%. The only way to know what compression rates you will actually experience is to have the database vendors run their tools against your own data and report the results.


Written by Conor O'Mahony

August 10, 2009 at 11:27 am

6 Responses


  1. Your observation that artificial TPC-H data is more challenging to compress than real-world data is a point that long needed to be made, but my experience is that high compression factors can nevertheless be achieved more easily than most people think. All that is required is a sufficiently intelligent (spelled c-o-m-p-l-e-x) generative data model.
    What do I mean? Consider, for example, the TPC-H lineitem table’s 44-byte comment field, not just semantically and syntactically but also morphologically. Observe that the data consist only of separator sequences and words, that those objects alternate, that there are 9 separator sequences, that there are 207 root words, and that the word fragments at the field ends are simply truncations of the original 207. Simple statistical analyses further reveal that the frequency distributions of the separator sequences and of root words (just 6 bits of zero-order entropy there, BTW) are so highly skewed that a good starting approach should be entropy coding the words, fragments, and separator sequences in separate classes. That’s a divide-and-conquer strategy applied at a high level.
    OK, now to the point. Using that approach (well, perhaps one that wound up being just a wee bit more complicated), Xtreme Compression’s (i.e., my) TPC-H lineitem compressor achieves just over 75 percent compression (yes, that’s really 4-to-1) on the comment field, and an unprecedented 85 percent on the entire 7,835,713,740-byte ‘scale factor 10’ lineitem table file. Compressed, it takes 1,165,688,989 bytes, to be exact. And we are exact!
    So numbers like 38 and 58 percent don’t impress me. They shouldn’t impress anyone. Perhaps some people in Redwood Shores and in Armonk (?) are just not trying very hard. Or maybe it’s something else…
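
    For the curious, here is a minimal sketch of that class-separation idea, not my actual compressor: split the field into words and separator sequences and measure the zero-order entropy of each class on its own (the sample string below is invented, not real dbgen output):

        import math
        import re
        from collections import Counter

        def zero_order_entropy(symbols):
            """Average bits per symbol under an ideal zero-order entropy coder."""
            counts = Counter(symbols)
            total = sum(counts.values())
            return -sum(c / total * math.log2(c / total) for c in counts.values())

        # Invented stand-in for a TPC-H-style comment field (not real dbgen output).
        comment = "furiously final deposits sleep quickly. furiously regular deposits haggle slyly. final requests"

        words = re.findall(r"[a-z]+", comment)
        separators = re.findall(r"[^a-z]+", comment)

        print("bits/word      :", zero_order_entropy(words))
        print("bits/separator :", zero_order_entropy(separators))
        # Coding each class against its own (highly skewed) model is the
        # divide-and-conquer step; a single model over raw bytes does worse.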

    Cheers, and keep up the VERY interesting blog! -GD

    Glenn Davis

    September 14, 2009 at 4:58 pm

  2. Here is an interesting blog post, where Radim Kolar shares information about compression tests that he performed with DB2 and Oracle Database: http://www.channeldb2.com/profiles/blogs/oracle-111g-vs-db2-97

    Conor O'Mahony

    January 30, 2010 at 10:17 am

  3. I don’t know about TPC-H data, but here’s a real-world example – Oracle tables in an SAP system, after compression, went from 255GB to 36GB.

    http://www.sdn.sap.com/irj/scn/weblogs?blog=/pub/wlg/19496

    I’d say that’s pretty damn good.

    On the other hand, the last results I’ve personally seen with DB2 compression, also for an SAP (BW) system, involved tables going from 246GB to 107GB.

    Sure that’s good, but I think the better results are obvious.

    sapdba

    June 1, 2010 at 12:05 am

    • Hi sapdba,

      If you would like to come out publicly and state where the DB2 result came from, I’m sure the world would love to know. Or, if you would prefer to remain anonymous, feel free to contact me directly (I’ll even put my email address here and take the load of spam I’m likely to get) so I can understand what you are seeing and help you make better use of DB2. It’s ceaton at ca.ibm.com

      Note that in the link above, the person created a new table and loaded it with the data from the original table. They did not say how much free space was in the original table to begin with, so the compression figure is suspect. I’m not suggesting they are wrong, but if you truly want to measure just the effect of compression, you would create an uncompressed table and fully populate it so there was no free space left in the table, and then compress the table to see how much it shrank.
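
      To make the free-space point concrete, here is a rough illustration using the sizes from that SDN post; the 30% free-space figure is purely an assumption for the sake of the example:

          # Hypothetical numbers, only to illustrate the free-space effect.
          allocated_gb = 255     # space allocated to the original table
          free_space = 0.30      # assumed fraction of that space that was empty
          compressed_gb = 36     # size after loading into a compressed table

          actual_data_gb = allocated_gb * (1 - free_space)

          # ~86%: credits compression with reclaimed free space
          apparent = 100 * (1 - compressed_gb / allocated_gb)
          # ~80%: the effect of compression alone
          from_compression = 100 * (1 - compressed_gb / actual_data_gb)

          print(round(apparent), round(from_compression))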

      Don’t forget to look at the entire system as well, because DB2 compression is superior there too (in fact, the base DB2 index page/block format is already more efficient than Oracle’s, so adding rid-list and prefix compression on top in DB2 makes it even better).

      Send me an email and I’d be glad to discuss it with you.

      Chris Eaton
      IBM Toronto Lab

      Chris Eaton

      June 1, 2010 at 9:33 am

      • I’d rather keep this anonymous, thanks. Why? Because I have good relationships with people at IBM, including the Toronto lab (though I don’t know you). I’ve wanted to get into these deep technical discussions in the past, but I can’t do so without seeming argumentative, and I don’t want to do that with people I associate with. If we keep this anonymous, on the other hand, I can argue the facts without worrying about personal opinions either way. And I don’t mean to offend (though I do occasionally use humor to make my points – no offense is intended).

        I agree with you that the SDN reference does not have enough data to indicate true compression rates, due to the unknown free-space issue. I do suspect, however, that this “sandbox” was a recent refresh, and thus would have little to no free space in the tables to begin with. Either way, you can’t dismiss it out of hand and assume it’s not valid based on this alone.

        The fact of the matter is, I see benefits to both IBM’s and Oracle’s compression strategies. I see the benefits of a table-based dictionary, but I see the benefits of block-based dictionaries as well. And I’m simply trying to dismiss the FUD associated with claims that either way is *better*.

        Is there a white paper that discusses your last paragraph? Because I see no evidence that DB2 compression is across-the-board “superior”, nor that the “index page/block format is already more efficient.” Evidence, please.

        sapdba

        June 1, 2010 at 11:58 am

      • Hi sapdba, no offense taken on my side… in fact, that’s why I offered you my direct email address, so we can exchange more information directly. I’d be glad to offer more details.

        On the last paragraph, I will try to explain here, although it’s probably a case where I could write an entire blog post on it (maybe I will 🙂).

        So let me start with the Oracle index format. As you may be aware, an Oracle index block is made up of a set of key/rid pairs. That is, for every key in the index, they store that key and the associated rid for the row that can be found in the data block. The main point here is that there is a key/rid pair for every key, even in a non-unique index. So if I have a customer table and I put an index on last name, and the name Eaton appears 10 times on 10 different (or the same, for that matter) data blocks, then the value “Eaton” will appear 10 times in the index block. If you don’t believe it, you can just dump an Oracle index block and see for yourself. DB2, on the other hand, stores indexes so that the page contains key/rid-list pairs. That is, for every key there is a list of rids for the rows with that key value. So in the same example, with DB2 the word “Eaton” would appear once, followed by a list of rids for that key. Even before any compression is applied, you can see the advantage DB2 has with this format.
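
        To put rough numbers on that structural difference, here is a back-of-the-envelope sketch; the key and rid sizes are assumptions, and per-entry overheads are ignored for both products:

            # Assumed sizes, purely illustrative: a 5-byte key ("Eaton") and 6-byte rids,
            # ignoring per-entry header overhead in both products.
            key_len, rid_len, duplicates = 5, 6, 10

            # key/rid pairs: the key is repeated alongside every rid.
            pairs_bytes = duplicates * (key_len + rid_len)

            # key/rid-list: the key is stored once, followed by the list of rids.
            rid_list_bytes = key_len + duplicates * rid_len

            print("key/rid pairs :", pairs_bytes, "bytes")     # 110
            print("key/rid list  :", rid_list_bytes, "bytes")  # 65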

        Then we add compression. Oracle offers static prefix compression. That is, as a DBA you must tell Oracle how many columns of a multi-column index you want it to use as the prefix to compress (by default it selects n-1, where n is the number of columns in the index). With DB2, the prefix compression is dynamic. That is, we look at the keys on a given index page, find any repeating string (full column, multi-column, or substring), and compress those out into the page header. And we have compression for the rid list as well (Oracle only has static prefix compression, not rid-list compression or dynamic prefix compression).
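
        Here is a toy sketch of the page-level idea, simplified to factoring out a single common leading string per page; the real format also handles multi-column and substring repeats, and the keys below are invented:

            import os

            def compress_page_keys(keys):
                # Toy version of dynamic prefix compression: factor the longest common
                # leading string on this "page" into a single header entry and keep
                # only the suffixes per key.
                prefix = os.path.commonprefix(keys)
                return prefix, [k[len(prefix):] for k in keys]

            page_keys = ["SMITH,JOHN", "SMITH,JANE", "SMITH,JIM", "SMITHERS,W"]
            prefix, suffixes = compress_page_keys(page_keys)
            print(prefix)    # 'SMITH'  -> stored once for the page
            print(suffixes)  # [',JOHN', ',JANE', ',JIM', 'ERS,W'] -> stored per key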

        So I’m not trying to be argumentative here, but when I look at the compression algorithms for indexes, I see clear advantages with DB2. And in the general case, I think table-level compression (or partition-level compression) has better opportunities to compress data down than page-level compression does. As I already stated, there will be edge cases (and I could favor page-level compression, perhaps unfairly, by presorting the data), but we are going after the best compression for the most customers.

        I leave the opportunity to contact me directly open and would love to chat with you more.

        Chris Eaton
        ceaton at ca.ibm.com.nospam
        IBM Toronto Lab

        Chris Eaton

        June 1, 2010 at 12:27 pm

