Enterprise Laptop Backup: Deduplication - It’s the chunking, Stupid!

This post is part of a sub-series on deduplication requirements within an overall series on planning for corporate PC backup in your organization. In my last post, I evaluated the best place to perform backup dedupe for laptops and desktops.

This post looks at the various approaches to dividing your data into smaller chunks so they can be analyzed for duplication. As mentioned in a previous post, chunking is the first step in every deduplication process and is perhaps the most important one: the better an approach is at dividing data into chunks that are likely to occur multiple times, the better the dedupe efficiency. Broadly speaking, there are four approaches to dividing data into chunks:

[Figure: Comparison of deduplication methods]

  1. File-level: This is the most basic form of deduplication, which identifies identical files and stores them only once. Also known as Single Instance Storage, it is perhaps the easiest approach for a vendor to implement. The downside is that if a file changes by even a single byte, the entire file needs to be stored again. This happens more often than one may think. For example, let’s say you create a Word document or a PowerPoint presentation and email it to a colleague, who doesn’t make any changes to it. You’d expect the two files to be identical, right? You’ll be surprised to know that, more often than not, the files, while visually identical, will be ever so slightly different. This is because every time a document is opened, applications store metadata such as the last user and the last open time, which changes the file.
  2. Delta-block: While there is debate about whether delta-block approaches fall under deduplication, they merit a mention. There are several variations, but essentially delta-block technologies identify changes to an already backed-up document and back up only those changes. The key is that the data must already have been backed up under the same name, because that earlier copy provides the file ancestor information that delta-block processing needs. Therefore, if you change a file and save it under a different name, the entire file will be backed up again. While better than purely file-level deduplication, this approach is still pretty basic and is only useful where you have large files that keep changing but preserve their names.
  3. Block-level: Block-level deduplication breaks the file into fixed-size blocks and backs up only the unique blocks, using the process described here. While better than file-level and delta-block technologies, this approach is best suited to database-type stores whose physical block layout doesn’t change. For document-type data, the most prevalent kind on PCs, a simple save can completely alter the layout of the document, so block-level dedupe isn’t very effective. It has two limitations: identifying common data for the first backup, and identifying common data when the physical layout changes. I’ll cover these in my next post.
  4. Object-based: Object-based data deduplication is the current state of the art and is the most effective solution for detecting duplicate data. It can detect common embedded data on the first backup, across completely unrelated files, and even when the physical block layout changes. Unlike block-based technologies, object-based dedupe is “content aware” and chunks the file into well-known logical objects like slides, images, paragraphs, worksheets, attachments etc. The advantage is that even if the physical layout of a file changes, which can happen with a simple save operation, the logical objects can still be detected and stored only once. As a result, object-based dedupe provides the best efficiency for PC data, with as much as 5-10x better performance vs. block-based deduplication.
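To make the difference between these chunking approaches concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's implementation: the function names, the 16-byte block size, and the tiny sample "document" are all made up for the example, and splitting on blank lines is only a toy stand-in for real content-aware parsing of slides, images, or worksheets. It measures how many bytes would need to be stored again after two kinds of edits: a single byte changed in place (like a metadata update), and a single byte inserted at the front (which shifts the physical layout).

```python
import hashlib


def digest(chunk: bytes) -> str:
    """Fingerprint a chunk so duplicates can be detected by hash."""
    return hashlib.sha256(chunk).hexdigest()


def whole_file(data: bytes) -> list[bytes]:
    """File-level: the entire file is a single chunk."""
    return [data]


def fixed_blocks(data: bytes, block_size: int = 16) -> list[bytes]:
    """Block-level: cut the file at fixed byte offsets (size is an assumption)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def logical_objects(data: bytes) -> list[bytes]:
    """Toy stand-in for object-based chunking: split on blank lines.

    A real content-aware chunker would parse the file format instead.
    """
    return [part for part in data.split(b"\n\n") if part]


def new_bytes(chunker, old: bytes, new: bytes) -> int:
    """Bytes of `new` that must be stored if `old` is already backed up."""
    stored = {digest(chunk) for chunk in chunker(old)}
    total = 0
    for chunk in chunker(new):
        fingerprint = digest(chunk)
        if fingerprint not in stored:
            stored.add(fingerprint)
            total += len(chunk)
    return total


# Hypothetical sample data: eight short "slides" separated by blank lines.
original = b"\n\n".join(f"slide-{i:02d}-notes".encode() for i in range(8))
touched = original[:-1] + b"S"   # one byte changed in place
shifted = b"X" + original        # one byte inserted, layout shifts

for name, chunker in [("file-level", whole_file),
                      ("block-level", fixed_blocks),
                      ("object-based (toy)", logical_objects)]:
    print(f"{name:20s} in-place edit: {new_bytes(chunker, original, touched):4d} bytes"
          f"   shifted layout: {new_bytes(chunker, original, shifted):4d} bytes")
```

With this sample, the in-place edit forces the file-level approach to store the whole file again, while the other two store only the chunk that changed. The inserted byte is where fixed-size blocks suffer: every block boundary moves, so nothing matches, whereas the toy logical chunker still recognizes all but the first object. Delta-block is left out because it depends on matching a file to its earlier version by name rather than on how the data is chunked.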

The graphic above shows the relative efficacy of the four methods. While a lot goes into determining actual dedupe efficiency, the graphic is a good indicator of how the methods compare with one another.

What do you think? In my next post, I'll cover the limitations of block-level deduplication and why object-based deduplication is the better choice.
