Corporate PC backup: Whither Deduplication?

  
  
  

This post is part of a sub-series on Deduplication requirements within a larger series on planning corporate PC backup for your organization.  In my last post, I explained the anatomy of a dedup system.  This post looks at the places where Deduplication can be performed and how well each approach suits corporate PC backup.

The old adage "location, location, location" applies to Deduplication too.  Where the 3 steps of Deduplication (chunk, compute hash, and lookup then store; see the post anatomy of a dedup system) are performed is crucial, because it determines WAN efficiency.  That matters most if you have a large number of laptops or remote users, who are likely to connect over WAN links.  Broadly, the following approaches are available:

  • Target based: In a target based Deduplication approach, all 3 steps of Deduplication are performed on the storage device that serves as the backup target.  All data is sent over the network to that device, which then identifies duplicate data and stores only unique data.  This approach is the least network efficient because all data must cross potentially slow WAN links on every backup.  It should be ruled out for PCs.
  • Purely source based: In purely source based Deduplication, all 3 steps are performed on the source system itself, without communication with the central server.  The source system, for example a desktop or a laptop, identifies duplicate data on that system and sends only the data that is unique on that PC to the server.  However, if other systems hold the same data, each of them will transmit it, and it will be stored again on the server.  This approach can work for a small number of PCs, but should otherwise be avoided because too much duplicate data is transmitted and stored.
  • Global: Global Deduplication combines the best of source and target based Deduplication.  Responsibility is shared between the source system and the server: the source system performs the first 2 steps, chunking and hash computation, while the 3rd step, lookup, is performed on the central server.  The source machine, in conjunction with the server, thereby identifies duplicate data across the entire organization, so only data that is truly unique is transmitted and stored.  Of the three approaches, this is the most WAN and storage efficient, and also the most scalable.
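To make the difference between the three approaches concrete, here is a minimal Python sketch that counts the bytes sent over the WAN and the bytes stored on the server for each one. It is an illustration under stated assumptions, not how any particular backup product works: the fixed 4-byte chunk size, the use of SHA-256, the function names and the two sample "PCs" are all invented for the example, and the small cost of sending hashes to the server for lookup is ignored.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, purely for illustration


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Step 1: split the data into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(piece: bytes) -> str:
    """Step 2: compute a hash that identifies a chunk."""
    return hashlib.sha256(piece).hexdigest()


def target_based(pcs: dict) -> tuple:
    """All 3 steps run on the storage target, so every byte crosses the WAN."""
    store = {}
    sent = 0
    for data in pcs.values():
        sent += len(data)  # the whole backup is transmitted
        for piece in chunk(data):
            store.setdefault(digest(piece), piece)  # step 3 on the target
    return sent, sum(len(p) for p in store.values())


def source_only(pcs: dict) -> tuple:
    """All 3 steps run on each PC, with no knowledge of other PCs' data."""
    sent = stored = 0
    for data in pcs.values():
        local_index = {}  # lookup index private to this one PC
        for piece in chunk(data):
            local_index.setdefault(digest(piece), piece)
        unique_bytes = sum(len(p) for p in local_index.values())
        sent += unique_bytes
        stored += unique_bytes  # the server keeps each PC's copy separately
    return sent, stored


def global_dedup(pcs: dict) -> tuple:
    """PCs chunk and hash; the lookup runs against one server-wide index."""
    server_index = {}
    sent = 0
    for data in pcs.values():
        for piece in chunk(data):
            key = digest(piece)
            if key not in server_index:  # step 3 on the central server
                server_index[key] = piece
                sent += len(piece)  # only new chunks are transmitted
    return sent, sum(len(p) for p in server_index.values())


# Hypothetical sample data: two PCs that share the chunks AAAA and BBBB.
pcs = {
    "pc1": b"AAAABBBBAAAACCCC",
    "pc2": b"AAAABBBBDDDDDDDD",
}

for name, approach in [("target based", target_based),
                       ("purely source based", source_only),
                       ("global", global_dedup)]:
    sent, stored = approach(pcs)
    print(f"{name}: {sent} bytes sent, {stored} bytes stored")
# target based: 32 bytes sent, 16 bytes stored
# purely source based: 24 bytes sent, 24 bytes stored
# global: 16 bytes sent, 16 bytes stored
```

Even in this toy case the pattern from the list above shows up: target based stores little but sends everything, purely source based sends less but stores the shared chunks once per PC, and global sends and stores each unique chunk exactly once.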


Figure: Global Deduplication - Save Storage and Bandwidth

 

 

Comments

I think global object based de-duplication is a powerful tool and is becoming more and more required with de-dupe technology. Well drawn out.
Posted @ Friday, October 08, 2010 12:42 PM by Joe