Laptop Backup – Introducing Object Based Deduplication

This post is part of a sub-series on deduplication requirements within an overall series on planning for corporate laptop backup in your organization. My last post examined the challenges that block based deduplication, until recently the state of the art, faces when backing up documents and email, which make up the bulk of laptop data. To summarize, block level deduplication has two significant weaknesses:

a) Identifying common data for the first backup of your laptop and desktop population

b) Identifying common data when the data layout changes, i.e. when someone edits a document on their laptop
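
To make weakness (b) concrete, here is a minimal Python sketch. It assumes the simplest form of block dedupe, fixed-size blocks hashed with SHA-256; the name `block_hashes` and the 64-byte block size are illustrative assumptions, not any product's actual design.

```python
import hashlib
import random

def block_hashes(data: bytes, block_size: int = 64) -> list[str]:
    """Split data into fixed-size blocks and hash each block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

# Stand-in for a document's bytes (seeded so the run is repeatable).
rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(4096))

# A one-byte edit at the front shifts every block boundary after it.
edited = b"X" + original

shared = set(block_hashes(original)) & set(block_hashes(edited))
print(f"{len(shared)} blocks in common after a one-byte insert")
```

Even though the edited data still contains the original in full, none of the fixed-size blocks line up any more, so a dedupe scheme like this one would treat the whole file as new.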

As I promised, this post introduces Object Based Deduplication, which overcomes both of the above limitations.

Object Based Backup Dedupe comes into the picture at the very first step of backup deduplication, i.e. the chunking step itself. Before we get into how it works, let’s understand the characteristics of the data that we’re trying to dedupe. Pretty much every file stored on the disk of a laptop or desktop has an inherent structure to it. It is made up of underlying objects that are specific to that type of file. For example, in a text document the objects are words and paragraphs; in a PowerPoint document they are slides, images, templates etc.; in a PST file they are messages, attachments etc. You get the point.

These objects are stored in different locations in different files, depending on when and how the owners of those documents chose to create or insert them. For example, you may have the same image stored on slide 1 of a presentation on Rex Ryan’s PC in NYC and on slide 99 of a presentation on Tom Brady’s laptop in Boston. Which image, you ask? Why, the one of them both being married to supermodels, of course! BTW, I’m no Rex Ryan fan, but his wife does look wonderful in that catalog and, I hate to admit it, he’s definitely more FUN than our genius hoodie! Jokes aside, the challenge is to detect that common embedded image, even though it may be stored in a different place, in different files and on different machines.

Object Based Backup Dedupe has the following unique characteristics, which help it overcome this challenge:

a) Format aware: it understands the file formats and can “chunk” the file into underlying objects that make up that file.  For example, a PowerPoint is chunked into slides, images and other objects.  This results in getting the SAME set of objects every time even though one may change their order, i.e. when one rearranges the slides in a presentation.

b) Get the objects in their native form: often the underlying objects have a lot of metadata associated with them as part of storing them in a document.  For example, an image may have positional data indicating which slide it is stored on.  Being just format aware is not enough; one also has to extricate the native object from all the metadata encasements around it.  This is important because the surrounding metadata will be different in different files.

c) Order doesn’t matter: It follows from the previous point, but bears explaining.  It doesn’t matter if the object is first, last or in the middle somewhere.  An object can occur in any order at any place and Object Based Backup Dedupe will uniquely identify it and not store it again if it has already been encountered anywhere else.

d) Complete Logical Object, regardless of physical storage: Some file formats split a logical object like an image into multiple smaller blocks and store it in different physical locations to match their internal storage structure.  This can happen if the object size is too big to fit into one of their internal storage units.  PowerPoint is notorious for doing this.  Object Based Dedupe can construct a logical object in its entirety, regardless of the different physical locations it may have been split into before storage.
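
Characteristics (b), (c) and (d) can be sketched in a few lines of Python. This is a toy model, not a real file-format parser: the `Fragment` record, its fields and the `logical_object_digests` function are all hypothetical, standing in for whatever a format-aware chunker would produce after reading a document.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Fragment:
    object_id: int   # which logical object this piece belongs to
    seq: int         # position of this piece within that object
    metadata: dict   # container-specific wrapping, e.g. the slide number
    payload: bytes   # raw bytes of the native object

def logical_object_digests(fragments: list[Fragment]) -> set[str]:
    """Reassemble each logical object and return the set of content digests."""
    pieces = defaultdict(list)
    for frag in fragments:
        pieces[frag.object_id].append(frag)
    digests = set()
    for frags in pieces.values():
        # Join payloads in sequence order; metadata is deliberately ignored.
        ordered = sorted(frags, key=lambda f: f.seq)
        body = b"".join(f.payload for f in ordered)
        digests.add(hashlib.sha256(body).hexdigest())
    return digests

image = b"\x89PNG pretend image bytes " * 10
notes = b"speaker notes for the quarterly review"

# Document A: the image is stored whole on slide 1.
doc_a = [
    Fragment(1, 0, {"slide": 1}, image),
    Fragment(2, 0, {"slide": 1}, notes),
]

# Document B: the same image on slide 99, split in two and stored out of order.
half = len(image) // 2
doc_b = [
    Fragment(7, 1, {"slide": 99}, image[half:]),
    Fragment(5, 0, {"slide": 3}, b"a different text box"),
    Fragment(7, 0, {"slide": 99}, image[:half]),
]

common = logical_object_digests(doc_a) & logical_object_digests(doc_b)
print(f"{len(common)} object in common")
```

Because the digest is computed over the reassembled payload only, the slide number, the storage order and the way the image was split have no effect on whether the two copies match.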

On the strength of the four characteristics above, an Object Based Dedupe approach will locate that image on slide 1 of the first presentation, extricate it in its native form by removing all metadata around it and construct a complete logical object regardless of the internal PowerPoint storage structure.  Similarly, when it encounters the second presentation, even though that same image is stored on slide 99, Object Based Dedupe will be able to extricate the same object out of that presentation because order and metadata don’t matter.  Once the same logical object has been extricated, a simple checksum comparison will show that it has already been backed up from another machine, so it does not need to be backed up again.
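
Here is a rough end-to-end sketch of that checksum step in Python. It is an illustration under stated assumptions, not how any particular backup product works: it relies on the fact that ZIP-packaged formats such as .pptx expose their parts as archive members, and it treats each member as an "object". The `ObjectStore` class, the `make_package` helper and the member names are made up for the example.

```python
import hashlib
import io
import zipfile

def make_package(members: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP package standing in for a document file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()

class ObjectStore:
    """Hypothetical backup store keyed by the checksum of each object."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def backup(self, package: bytes) -> tuple[int, int]:
        """Back up one package; return (objects stored, objects skipped)."""
        stored = skipped = 0
        with zipfile.ZipFile(io.BytesIO(package)) as zf:
            for name in zf.namelist():
                data = zf.read(name)  # uncompressed bytes of the member
                digest = hashlib.sha256(data).hexdigest()
                if digest in self.objects:
                    skipped += 1      # already backed up from some machine
                else:
                    self.objects[digest] = data
                    stored += 1
        return stored, skipped

logo = b"pretend-logo-image-bytes" * 100
deck_nyc = make_package({
    "ppt/media/image1.png": logo,
    "ppt/slides/slide1.xml": b"<slide>NYC</slide>",
})
deck_boston = make_package({
    "ppt/slides/slide99.xml": b"<slide>Boston</slide>",
    "ppt/media/image42.png": logo,
})

store = ObjectStore()
first = store.backup(deck_nyc)
second = store.backup(deck_boston)
print(first, second, len(store.objects))
```

The second backup stores only the new slide and skips the logo, because the lookup is by content checksum rather than by file name or position inside the package.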

So, you’re saying all that’s well and good but what does it mean for me?  It means the following:

a) That copy of the company logo embedded in potentially millions of documents in your company? Stored and transmitted only once.

b) Those images you see embedded in different documents: stored and transmitted only once.

c) Attachments in PST files: stored and transmitted only once across the entire company!

To further highlight the advantages of Object Based Deduplication, I’ll compare it side by side with block level dedupe in my next post.
