PC Backup – Object based vs. block based deduplication
Posted by Puneesh Chaudhry on Tue, Dec 28, 2010 @ 09:29 AM
a) Format awareness
b) Extracting objects in their native form, for example images, slides, Excel worksheets, and PST file attachments
c) Extracting the same object, e.g. an image, regardless of its order or location within a file
d) Ability to reconstruct a logical object that may be split into multiple physical locations due to file format storage conventions
As I promised, this post compares the Object Based and block level deduplication approaches using the same example I used for block level deduplication: a document called NYC.ppt, which contains two familiar NYC icons: Times Square and the Statue of Liberty.

Block level deduplication will start with a fixed size block, say 16K, and create equal sized chunks out of this document as shown below. You can see how the Times Square image is cut about 70% of the way to the right and the Statue of Liberty’s head is cut off horizontally right at the nose! Object Based deduplication, on the other hand, is format aware, so it will extract the underlying objects in their native form. In this case those objects are the two images, Times Square and the Statue of Liberty, each captured in its entirety.
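To make the fixed size cut concrete, here is a minimal sketch in Python. It is not how any particular backup product works; the byte layout of NYC.ppt (a small header followed by the two images back to back) and all sizes are assumptions made up for illustration, and a real .ppt file is laid out quite differently.

```python
BLOCK_SIZE = 16 * 1024  # the 16K block size used in the example above


def fixed_size_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into equal sized chunks; only the last chunk may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Hypothetical stand-in for NYC.ppt. Each object is filled with one letter
# so that we can see which objects end up inside which block.
header = b"H" * 5_000          # assumed file header
times_square = b"T" * 40_000   # assumed Times Square image bytes
liberty = b"L" * 30_000        # assumed Statue of Liberty image bytes
nyc_ppt = header + times_square + liberty

blocks = fixed_size_blocks(nyc_ppt)

for number, block in enumerate(blocks):
    # Collect the distinct letters present in the block, e.g. "LT".
    contents = "".join(sorted({chr(b) for b in block}))
    print(f"block {number}: {len(block)} bytes, contains {contents}")
```

In this layout no block lines up with an object boundary: the first block mixes the header with the start of Times Square, and the third block holds the tail of Times Square together with the start of the Statue of Liberty.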

Intuitively, capturing the objects in their native form seems like the right thing to do, but let’s take a look at another example, which really drives home the advantage of the Object Based method over the block based approach. Here’s another document, titled Fav Places.ppt, which contains both the Statue of Liberty and the Times Square image from the last document, but with some important differences: the two images are stored in a different order, and the document also contains an additional image of the Empire State Building:

Let’s put the two approaches to the test again. Block level backup dedupe will again divide the data into 16K chunks and, as a result, will end up with overlapping and chopped off logical objects inside the same block, making it impossible to find any commonality. On the other hand, Object Based deduplication will again capture the logical objects in their native form, and a simple checksum comparison will indicate that two out of the three objects have already been encountered before in a different file and don’t need to be stored again. The only new object to be stored will be the image of the Empire State Building.
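The comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "images" are pseudo-random bytes, each file is assumed to be a short header followed by its images back to back, and SHA-256 stands in for whatever checksum a real product would use. The names fake_image, block_digests, and the header strings are hypothetical.

```python
import hashlib
import random

BLOCK_SIZE = 16 * 1024  # same 16K block size as before


def fake_image(seed: int, size: int) -> bytes:
    """Deterministic pseudo-random bytes standing in for real image data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> set[str]:
    """Checksums of every fixed size block in data."""
    return {digest(data[i:i + block_size]) for i in range(0, len(data), block_size)}


times_square = fake_image(1, 40_000)
liberty = fake_image(2, 30_000)
empire_state = fake_image(3, 35_000)

# Same two images in both files, but in a different order, plus one new image.
nyc_objects = [times_square, liberty]
fav_objects = [liberty, empire_state, times_square]

# Assumed on-disk layout: a header followed by the objects, back to back.
nyc_file = b"NYC-HEADER" + b"".join(nyc_objects)
fav_file = b"FAV-PLACES-HEADER" + b"".join(fav_objects)

# Block level: look for 16K blocks that the two files have in common.
shared_blocks = block_digests(nyc_file) & block_digests(fav_file)

# Object based: checksum each extracted object and compare with what is stored.
stored = {digest(obj) for obj in nyc_objects}
new_objects = [obj for obj in fav_objects if digest(obj) not in stored]

print("shared 16K blocks:", len(shared_blocks))
print("objects already stored:", len(fav_objects) - len(new_objects))
print("new objects to store:", len(new_objects))
```

Because the images sit at different offsets in the second file, the block boundaries fall in different places and no block checksum repeats, while the object checksums match regardless of order or position.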

To reiterate, this is a microcosm of how your data is processed for the first backup by block level deduplication vs. Object Based dedupe. There are tens of thousands, if not millions, of documents in your company that contain a lot of duplicate embedded data such as company logos, images, and PST file attachments, all likely in different locations within different documents. With block level deduplication, you have little hope of finding this duplicate data, because it has no notion of objects and how they are laid out. Object Based deduplication will find this common data across millions of those documents and store and transmit it only once! Imagine storing an image, a slide, or a PST file attachment only once in the entire company – you’re looking at around a 10-25x reduction in the amount of data you have to store and transmit! Imagine not having to buy storage for another 2 years!
In my next post, I’ll tackle the so-called “variable length deduplication” and show that it is really just fixed length block level deduplication unless you're backing up the same file again and again.