Corporate PC backup: Anatomy of a Data Deduplication System
Posted by Puneesh Chaudhry on Sat, Aug 21, 2010 @ 10:31 AM
This post is part of a sub-series on de-duplication requirements within an overall series on planning for corporate PC backup in your organization. It looks at the components that make up a deduplication system.
Deduplication is the process of finding duplicate chunks of data and then acting on them only once. For example, if you can identify duplicate data stored on your file server and store it only once, you could free up a lot of storage, which translates to immediate savings. Across all the data in a company, there is a lot of duplication: identical files, PowerPoint presentations that reuse the same slides, or PST files that contain the same message attachment. The ability to identify this common data and to transmit and store only unique data is crucial for cost-effectiveness. The industry jargon for this capability is data de-duplication, or dedup for short.

Broadly speaking, a de-duplication system consists of the following 3 components:
- Chunking original data: The first step is to divide the original data into chunks, which are then analyzed for duplicates already in the system. Typically, this means dividing the data into smaller units, whether sub-blocks or sub-objects. This is perhaps the most important step in the entire de-duplication process: the more likely the resulting chunks are to recur as duplicates, the better the dedup efficiency. Broadly speaking, there are 4 approaches to chunking data:
  - File based
  - Delta block based
  - Block based
  - Object based
All of the approaches above yield a set of data that can then be analyzed for duplication within the data repository. I’ll analyze each of these approaches as part of this series.
- Computing a unique identifier for the chunks created in step 1: Once candidate chunks have been identified, we need an efficient way to detect whether each one already exists in our data repository. We could compare entire objects byte by byte, but that would be computationally expensive and wouldn’t scale. Hence, the most common approach is to create a hash or checksum of the data and then look up that checksum in the data repository. The checksums are usually much smaller (3-4 orders of magnitude smaller) than the original data, making it much faster to check whether a set of data already exists, even with terabytes of data.
- Lookup in data repository: The data repository consists of two components:
  - Unique data repository: this is where all unique data objects are stored.
  - Metadata repository: a catalog of unique hashes corresponding to all the unique data objects stored in the system, optimized for quick lookup.
Once the hash or checksum has been created, it needs to be looked up against the catalog of unique hashes in the metadata repository. If the checksum already exists there, the corresponding data is already stored and there is no need to store it again. If the checksum doesn’t exist, the data object corresponding to it is added to the unique data repository and the checksum/hash itself is added to the metadata repository.
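To make the three components concrete, here is a minimal Python sketch of the chunk, hash, and lookup flow. It is an illustration under stated assumptions, not how any particular product works: it uses fixed-size block chunking with a 4 KB block, SHA-256 as the checksum, and plain in-memory structures for the two repositories. The names `chunk_fixed`, `fingerprint`, and `DedupStore` are made up for this example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed block size; real systems choose their own


def chunk_fixed(data, size=CHUNK_SIZE):
    """Component 1: split data into fixed-size blocks (the last may be shorter)."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def fingerprint(chunk):
    """Component 2: compute an identifier for a chunk (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()


class DedupStore:
    """Component 3: a toy data repository held entirely in memory."""

    def __init__(self):
        self.catalog = set()    # metadata repository: hashes seen so far
        self.unique_data = {}   # unique data repository: hash -> chunk bytes

    def put(self, data):
        """Store data, keeping only unseen chunks; return its list of hashes."""
        recipe = []
        for chunk in chunk_fixed(data):
            digest = fingerprint(chunk)
            if digest not in self.catalog:
                # New chunk: add the data and record its hash in the catalog.
                self.unique_data[digest] = chunk
                self.catalog.add(digest)
            # Known or new, the original is described by its hash alone.
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original data from its list of hashes."""
        return b"".join(self.unique_data[d] for d in recipe)


if __name__ == "__main__":
    store = DedupStore()
    first = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
    second = b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE  # shares its first block
    r1 = store.put(first)
    r2 = store.put(second)
    print(len(r1) + len(r2), "chunks written,", len(store.unique_data), "stored")
```

In this sketch the two inputs share one block, so four chunks are written but only three are stored. A production system would keep the catalog in a persistent index built for fast lookup, and could use any of the other chunking approaches listed above.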
Above is a simple view of a de-duplication system. Beyond the chunking algorithm and the scalability of the data repository, another important consideration is where the chunking and lookups are performed. That’s the subject of the next post.