Laptop Backup – Block Level Deduplication: Not Enough!

This post is part of a sub-series on deduplication requirements within a larger series on planning corporate PC backup for your organization.  My last post examined the effectiveness of the four main approaches to subdividing your data into chunks, which can then be checked for duplicates against your existing data.  As promised, this post looks at the limitations of block level deduplication, with a specific focus on the data most likely to live on laptops and desktops: documents, email and so on.

 

Quick refresher: block level deduplication breaks a file into fixed-size blocks and backs up only the unique blocks, using the process described here.  While better than file-level and delta-block technologies, this approach is best suited to database-type stores whose physical block layout doesn’t change.  For document-type data, the kind most prevalent on PCs, a simple save can alter the layout of the document, so block level dedupe isn’t very effective.  It has two limitations:

a)      Identifying common data for the first backup

b)      Identifying common data when the data layout changes

To understand why, let’s take a look at how block level deduplication works on a simple document.  The image shows a document called NYC.ppt, which contains two familiar NYC icons: Times Square and the Statue of Liberty.

[Image: PC backup deduplication example]

Block level deduplication starts with a fixed block size, say 16K, and cuts the document into equal-sized chunks.  It has no idea where a logical object begins or ends.  As a result, the chunking process looks something like the image below: the Times Square image is cut about 70% of the way across, and the Statue of Liberty’s head is cut off horizontally right at the nose!
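To make the cut concrete, here is a minimal Python sketch of fixed-size chunking. The object sizes (a 23,000-byte Times Square image followed by a 17,000-byte Statue of Liberty image) are made up for illustration, and the document is modelled as nothing more than the two images stored back to back; a real .ppt file has more structure than that. The helper names `fixed_size_blocks` and `cut_points` are hypothetical.

```python
BLOCK_SIZE = 16 * 1024  # the 16K block size from the example above


def fixed_size_blocks(data, block_size=BLOCK_SIZE):
    """Cut data into equal-sized blocks; only the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def cut_points(object_sizes, block_size=BLOCK_SIZE):
    """Report where block boundaries land inside back-to-back objects.

    Returns one (object_index, fraction_into_object) pair per boundary
    that falls strictly inside an object.
    """
    total = sum(object_sizes)
    cuts = []
    for boundary in range(block_size, total, block_size):
        start = 0
        for index, size in enumerate(object_sizes):
            if start < boundary < start + size:
                cuts.append((index, round((boundary - start) / size, 2)))
            start += size
    return cuts


# Assumed sizes: Times Square image first, Statue of Liberty image second.
sizes = [23_000, 17_000]
document = bytes(sum(sizes))  # placeholder content, only the length matters here

block_lengths = [len(b) for b in fixed_size_blocks(document)]
print(block_lengths)      # [16384, 16384, 7232]
print(cut_points(sizes))  # [(0, 0.71), (1, 0.57)]
```

With these assumed sizes the first boundary lands about 71% of the way through the first image and the second boundary lands in the middle of the second one. The chunker never looks at what it is cutting.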

[Image: block level deduplication limitations]

Seeing logical objects chopped up arbitrarily makes it apparent that something is wrong with this approach, but another example makes it patently clear.  Here’s a second document, Fav Places.ppt, which contains both the Statue of Liberty and the Times Square images, with some important differences: the two images are stored in a different order, and the document also contains an additional image of the Empire State Building:

[Image: block level deduplication challenges]

To the human eye, it is clear that the Statue of Liberty and Times Square images are identical in both documents and should be identified as duplicates.  Block level dedupe, however, won’t find any commonality here.  It will again divide the data into 16K chunks, and because the images now sit at different offsets, each block ends up holding different chopped-off pieces of neighbouring objects, so no block in one document matches a block in the other.
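The two-document comparison can be sketched in Python as well. Everything specific here is an assumption for illustration: the image sizes, the order of the images inside Fav Places.ppt, the use of SHA-256 as the block fingerprint, and the modelling of each document as a plain concatenation of image bytes. The "images" are just repeatable pseudo-random bytes produced by a hypothetical `fake_image` helper.

```python
import hashlib
import random

BLOCK_SIZE = 16 * 1024  # same 16K block size as before


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return the set of SHA-256 fingerprints of the fixed-size blocks."""
    return {hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}


def fake_image(seed, size):
    """Stand-in for image data: repeatable pseudo-random bytes."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


times_square = fake_image(1, 23_000)
liberty = fake_image(2, 17_000)
empire_state = fake_image(3, 20_000)

# Assumed layouts: same two images, different order, one extra image.
nyc = times_square + liberty
fav_places = liberty + empire_state + times_square

shared_blocks = block_hashes(nyc) & block_hashes(fav_places)
duplicate_bytes = len(times_square) + len(liberty)

print(duplicate_bytes)     # 40000 bytes are byte-for-byte identical...
print(len(shared_blocks))  # ...yet 0 blocks match
```

In this sketch, all 40,000 bytes of NYC.ppt also appear somewhere in Fav Places.ppt, but because they start at different offsets no 16K block lines up with a block in the other file.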

[Image: problems with block level deduplication]

This is a microcosm of how block level deduplication processes your data for the first backup.  Your company has tens of thousands, if not millions, of documents containing a lot of duplicate embedded data such as company logos, images and PST file attachments, all likely at different locations within different documents.  Block level deduplication has little hope of finding this duplicate data, because it has no notion of objects or how they are laid out.  The same problem shows up when the data layout of a document changes.  In future posts, I’ll introduce Object Based Deduplication and show how it solves the problems described above.
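As a footnote on the layout-change case, here is a rough Python sketch contrasting an in-place overwrite (the database-style change that fixed blocks handle well) with a small insertion at the front of a file (closer to what a document save can do). The file contents, the 10-byte edit and the SHA-256 fingerprint are all assumptions chosen for illustration.

```python
import hashlib
import random

BLOCK_SIZE = 16 * 1024  # 16K fixed blocks


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return the set of SHA-256 fingerprints of the fixed-size blocks."""
    return {hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}


# Hypothetical file: exactly five blocks of repeatable pseudo-random bytes.
rng = random.Random(7)
original = bytes(rng.getrandbits(8) for _ in range(5 * BLOCK_SIZE))
note = b"New title\n"  # a 10-byte edit

overwritten = note + original[len(note):]  # same length, layout unchanged
inserted = note + original                 # everything shifts by 10 bytes

before = block_hashes(original)
kept_after_overwrite = len(before & block_hashes(overwritten))
kept_after_insert = len(before & block_hashes(inserted))

print(kept_after_overwrite)  # 4 of the 5 blocks are still recognised
print(kept_after_insert)     # 0, every block boundary has moved
```

The overwrite touches only the first block, so the other four still deduplicate. The insertion changes just 10 bytes, yet every block after it holds shifted content and has to be backed up again.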
