Key considerations when investing in mobile device data sync, access, and sharing.

  
  
  

For the last 15 years, IT administrators dealing with client technologies have had a fairly manageable set of challenges. The advent of smartphones and tablets over the last 2 years has been a hugely disruptive force that has made administration, management and data access far harder.

A typical mobile user demands that their data be available anytime, anywhere, on any device. “Consumerization of IT” has resulted in the proliferation of Bring-Your-Own-Device (BYOD) at work, and again, the user expects IT to support any and all devices. As a result, the challenge of dealing with a multi-vendor laptop/desktop scenario has quickly grown into a multi-device, multi-vendor, multi-OS environment. And keep in mind, 2011 is the year that smartphone/tablet shipments exceed those of laptops!

In today’s enterprise, corporate data sits in multiple locations – laptops, file shares, backup servers, SharePoint, Documentum, FileNet, etc. It is extremely important to consider all these different “data stores” when implementing a solution to keep your mobile users happy. When listing your requirements and evaluating vendors to solve BYOD data access challenges, think of the following considerations:

Key Considerations when Investing in Device Data Sync, Access, and Sharing

  1. Where does corporate data live? Is it a combination of laptops, file shares, SharePoint, etc.? Are my users using Dropbox/Box.net or some public cloud service because IT can’t support access demands?
  2. How do I allow access to corporate data stores like SharePoint, file servers, or backup servers, etc.?
  3. Data deduplication – Is it global or localized? Is it block, file or object based? How does this impact my storage costs?
  4. In the case of data sync, is self service restore important for my end-users? Is the process integrated into the operating system?
  5. Are service level thresholds for error management important for my organization?
  6. Audit requirements – can I produce a report that shows all sync and restore actions performed?
  7. Can I perform a federated search across all endpoints for e-discovery? Can I place documents on legal hold?
  8. Can the vendor solution scale to manage remote offices with slow WAN connections?
  9. Does the vendor solution allow secure access from anywhere, without requiring a VPN (which promotes unfettered access to the entire network)?
  10. Is multi-factor authentication important for my organization? Does Active Directory/LDAP integration matter to my IT organization?
  11. Is centralized administration important for my IT organization? Will it require extensive training of end users?

The answers to these questions will go a long way in making sure that you have a comprehensive solution to keep your mobile users fully productive, secure and happy!

iPad, iPhone, iPod: How many mobile devices do you carry?

  
  
  
How many devices do you carry? A friend recently posted on his Facebook status: "am I an Apple geek if I'm packing 3 Apple devices for a day long trip?" I replied: "Why only 3?" (!) Turns out he was missing an iPad! Then I saw a Cisco ad during the NBA finals saying there will be 50 billion network devices roaming around by 2020! That's 7 devices per person!! Seriously, though: this got me thinking... How many devices do we carry these days? I routinely carry 3 devices: my PC (I'm a PC, geddit?), my iPhone and my iPad. When I travel, I also carry my iPod nano (easier to take on a run). Turns out I'm not alone; many of you do the same. Recently, on a flight, I saw someone carrying 7 devices: a Mac, an iPod nano, an iPad, an iPhone (personal), a BlackBerry (work), a Nikon D7000 and a Sony camcorder! So how many devices do you carry? Typed on my iPad!

Laptop Backup - Recovery Requirements

  
  
  

This post is part of a Series on planning for laptop backup in your organization.  This post looks at key recovery requirements when considering a laptop backup solution.

There are 3 stakeholders when it comes to corporate data on a laptop: End Users, IT and Business/Legal.

When considering the recovery requirements, we need to consider the needs of all 3 stakeholders.  The following attempts to capture the various recovery requirements that are likely to be expected from a laptop backup solution at some point by one of these stakeholders:

  • Self-service recovery: Most IT organizations trust their users to perform self-service recovery, but we still see some who don’t want self-service recovery, because they’ve been burnt in the past.  You should consider the following to ensure you can deliver a functional self-service recovery capability to end users:
    1. Single file recovery or full system rebuild: Do you trust your users only to recover individual files from their backup, or are they savvy enough to recover their entire data set as well?
    2. Allowing end users to overwrite files: Are there groups of users who have shown a propensity for performing hara-kiri, and to whom you don’t want to give the ability to overwrite files?
    3. Web based recovery: Do you want to allow your end users to recover data from a web browser, for times when they are not on their primary computer?
    4. Training end users: This often gets overlooked and causes a lot of issues.  How much training do you need to perform for your end users?  Look for a solution that is integrated into operating system tools like Windows Search and Explorer so they don’t have to learn anything new.
  • Laptop rebuild model – Central depot vs. distributed sites: Do you have a central site where you intend to rebuild all laptops and then ship them to users at different sites? Or do you have IT staff at remote sites who will be responsible for rebuilding laptops for users at those sites?
  • Migration in case of lease renewal: Do you intend to use the PC backup as a source for data migration in case of laptop renewal?
  • Recovery Time Objective:
    1. How much time is acceptable for a laptop to be rebuilt?
    2. Are there requirements to give users access to their data almost immediately via a loaner laptop while their main laptop gets rebuilt?
    3. How long are users willing to wait for recovering a single file?
  • Loaner machines: Do you have loaner machines which help users tide over the duration when their laptop is being rebuilt?  What data recovery requirements are imposed by loaner machines?
  • Recovery to non-corporate devices: Do you need to explicitly disallow recovery to non-corporate devices to prevent data leakage?
  • Bare metal recovery: Are you one of those brave souls who believes that bare metal recovery for PCs can work and are willing to tackle the myriad dissimilar hardware issues, or do you have a golden image model on top of which you intend to recover user data and settings?
  • Recovery for E-Discovery or other forensics: Is it likely that your legal or HR group will ask you for access to a user’s laptop data for E-Discovery or internal forensics?
I’ll tackle some of these in more detail in later blogs.

Carbonite IPO - Online Backup Companies Can Go Public!

  
  
  

By now all of you must’ve seen the announcement re: online PC/ laptop backup company Carbonite filing an S-1 pursuing an IPO.  There has been a lot of social media discussion re: whether they can be a profitable and viable business, with a loss of $25.7M on revenues of $38M in 2010 and in light of recent announcements from Iron Mountain re: their digital business.  I decided to investigate and take a deeper look at the S-1 to get a better sense of Carbonite’s business.  

Here’s a summary of my analysis:

  • 2010 GAAP loss of $25.7M not as alarming as it looks: this figure is misleading for a rapidly growing SaaS business, as the majority of the revenue is deferred.  Measured against bookings of $54M, their net loss is only about $9M, which is not too bad given their approach of growing as fast as possible.  Read below for a more detailed explanation.
  • Good gross margins: Carbonite has 62% gross margins – not bad for a service business.
  • Already profitable per customer: Contrary to popular belief, their cost to service a customer is actually not bad and they already make an annual profit of $7.6 per customer when you take out the Sales and Marketing costs.
  • High cost of customer acquisition: their cost of customer acquisition is actually quite high: ~$70 per customer.  At their current profit per customer, it will take a long time to recoup this (10 years).  However, this isn’t unusual for a company trying to rapidly grab market share.  I expect them to augment their current offerings to increase the revenue per customer and reduce their customer acquisition cost over the next few years.

Read on for the detailed analysis:

  1. 2010 GAAP Loss of $25.7M: this figure has been much bandied about, but is quite misleading because rapidly growing SaaS businesses have large portions of deferred revenues, while expenses are usually incurred upfront.  Bookings are a better indicator for a SaaS business and Carbonite had bookings of $54M in 2010.  Using $54M as a better indicator of what the top line should be, their 2010 loss shrinks to $9M – which is acceptable given their tear-away spend on Sales and Marketing ($33M in 2010!).
  2. Great Bookings Growth (indication of new business): $54M in new bookings in 2010 and growing at: 198% CAGR (’06-’10).
  3. Decent Gross Margin: 62% gross margin in March 2011: This is one of the biggest indicators of whether they have a sustainable business.  Gross margins have been steadily increasing from 48% to 62% as of March 2011.  This should put to rest speculations about whether they can profitably service whatever customers they can bring on board.
  4. Free Cash Flow: This is another important figure, but no clear trends on this one.  FCF was -23% of bookings in 2010, an improvement from -88% in 2008.  Overall as a % of bookings, it’s not out of the realm of possibility that they could turn cash flow positive if they start holding back a bit on Sales and Marketing investments. 
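The bookings adjustment in point 1 is simple arithmetic. Here is a rough sketch in Python using only the rounded figures quoted above. The variable names are mine, and treating the whole gap between bookings and revenue as extra top line is this post's simplification, not a GAAP measure. With rounded inputs the result lands a little under $10M, in the same neighborhood as the ~$9M quoted above.

```python
# Rounded 2010 figures quoted in this post, in millions of dollars.
revenue = 38.0
gaap_loss = 25.7
bookings = 54.0

# Amount booked in 2010 but not yet recognized as revenue.
deferred = bookings - revenue

# Simplification: treat bookings as the top line, so the loss shrinks
# by the deferred amount.
adjusted_loss = gaap_loss - deferred

print(f"Deferred: ${deferred:.1f}M")
print(f"Bookings-adjusted loss: ${adjusted_loss:.1f}M")
```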

So, those are the conventional metrics, and they are not bad actually.  But they don’t tell the whole story.  The real story is in their per-customer metrics, which give a pretty good idea of the economics of the business and its future potential.

  1. Revenue per customer: Carbonite’s annual revenue per customer is increasing: from $22.7 (2007) to $40.6 in 2010.  It’s good, but it needs to increase to $60 - $75 per customer per year to offset their high customer acquisition cost.
  2. Cost per customer: If you take out sales and marketing costs, you can work out Carbonite’s cost to serve one customer, taking into account COGS, Engineering and G&A.  That number for 2010 is $33 per customer per year.
  3. Profit per customer: Carbonite’s profit per customer is: $7.6/year (revenue/customer – cost/customer).  Not bad, but is it enough to offset their high customer acquisition cost?
  4. Cost of customer acquisition: The table below shows trends in Carbonite’s cost of customer acquisition:

Carbonite customer acquisition cost

This is high and needs to come below $50/customer, which should be within their reach.  At the current rate, with an annual profit per customer of ~$8, it would take a long time to recoup the cost of customer acquisition, even with the really high retention rate that Carbonite has (97%).  But I believe that’s temporary while a company is trying to grow really fast.
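To make the payback math concrete, here is a back-of-the-envelope sketch in Python using the per-customer figures quoted in this post. The function name `revenue_needed` is mine, and the model is deliberately naive: it ignores the ~3% annual churn and the time value of money, so treat the output as a rough guide only.

```python
# Per-customer figures quoted in this post (2010, US dollars per year).
revenue_per_customer = 40.6
cost_to_serve = 33.0       # COGS + Engineering + G&A, no Sales and Marketing
acquisition_cost = 70.0    # approximate cost to acquire one customer

profit_per_customer = revenue_per_customer - cost_to_serve
payback_years = acquisition_cost / profit_per_customer


def revenue_needed(cac: float, cost: float, payback_months: float) -> float:
    """Annual revenue per customer needed to recover `cac` in `payback_months`."""
    return cost + cac * 12.0 / payback_months


print(f"Profit per customer: ${profit_per_customer:.1f}/year")
print(f"Simple payback: {payback_years:.1f} years")
print(f"Revenue needed for a 24-month payback at $50 CAC: "
      f"${revenue_needed(50.0, cost_to_serve, 24.0):.0f}/year")
```

Under these assumptions, a 24-month payback at a $50 acquisition cost needs roughly $58 of revenue per customer per year, which is in the same range as the $60 - $75 target mentioned above.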

So, there you have it.  I think it is a solid business that is growing very rapidly and has good fundamentals.  Right now, they are focused on becoming the dominant player in the market and are spending a lot on sales and marketing, and that makes some of the numbers unattractive on the surface, but I believe that is all short term.  Moving forward, I’d have to believe that they will come out with new offerings to increase their revenue per customer, and that their acquisition cost per customer will go down as well.  They could easily reach a point in a couple of years when they can recover their acquisition cost per customer within 18-24 months – which would bode well for a really good business.  I for one am rooting for them as that would make a really good Boston tech story – why should Silicon Valley have all the fun!

Disclaimer: opinions expressed in this article are solely my personal opinions and not of my employer.  All data was extracted from the Carbonite S-1 filing here: http://edgar.sec.gov/Archives/edgar/data/1340127/000095012311049041/b86123sv1.htm

Laptop Backup: Backup Dedupe and Encryption

  
  
  
We often get asked about how Copiun deduplication works with the various forms of encryption technologies that are prevalent on the enterprise laptops or desktops, so I decided to write a post about it. This post is part of an overall series on backup deduplication and PC or laptop backup which can be accessed here.

As a refresher, there are 3 types of encryption technologies on Windows based laptops:

a)   Drive level: e.g. BitLocker, McAfee Safeboot, Credant etc.  These technologies encrypt the laptop hard drive in such a way that requires a correct password to be entered before the PC will even boot. Accessing the hard drive directly by attaching it to another system doesn’t work because all data is encrypted. For a comparison of various disk based encryption methods, see here.

b)   Encrypted File System: In this technology, one or more files or folders on a user’s PC are encrypted with a user’s certificate, such that only that user can access their data.  This means that even the system administrator cannot access the user’s data.  The user on the other hand can access their EFS encrypted data directly as if they were accessing unencrypted data – as long as they are logged in with their credentials.

c)    Password protected files: The third type of encryption is when someone sets a document level password, e.g. in a Word or Excel file. The document is then stored encrypted using the password, and even the user must enter the password every time they open the document. See here for instructions.

Copiun can back up all 3 types of data without any issues. The deduplication efficiency varies with the encryption type. As much as possible, Copiun performs object based deduplication, which requires Copiun software to be able to read the document in its native format.  If the software cannot read the file in its native format, then it falls back to less efficient deduplication methods like file or block level deduplication.

a)   For Drive level and EFS encryption: Full deduplication efficiency: since Copiun runs as the user itself, it can read the encrypted data in its original format (i.e. un-encrypted) and as a result is able to achieve full deduplication efficiency for both drive level and EFS encryption.

b)   Password protected files: for password protected files, Copiun provides block and file-level deduplication, because it doesn’t have access to un-encrypted data without the document level password.

Many laptop backup products that run as a Windows service cannot perform block deduplication for files and folders protected with Windows EFS encryption. This is because only the user whose certificate was used to encrypt the files can read them in their native format; administrator or local system accounts don’t have access to those files. This means that if a document is stored on one machine in an EFS folder, chances are you won’t be able to find duplicate copies of that document on other machines.  As you evaluate different solutions for laptop backup, make sure you understand the user account under which backups are performed so you can understand the deduplication efficiency you’re likely to get.
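Here is a small Python sketch of why the account matters. It is a toy model, not how EFS or Copiun actually work: `toy_encrypt` is a made-up stream cipher for illustration only, and I am modelling the service account as seeing per-user ciphertext, which is my simplification. The point is just that identical documents fingerprint identically when read decrypted, and differently once each copy is encrypted under its own key.

```python
import hashlib


def toy_encrypt(plaintext: bytes, key: bytes) -> bytes:
    """Toy stream cipher for illustration only; NOT EFS and NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    # XOR each plaintext byte with the key stream (zip stops at the shorter input).
    return bytes(p ^ s for p, s in zip(plaintext, stream))


def fingerprint(data: bytes) -> str:
    """Checksum used to spot duplicate content."""
    return hashlib.sha256(data).hexdigest()


document = b"Quarterly results deck, identical on both laptops." * 4

# An agent running as each user reads decrypted content: fingerprints match.
as_user_a = fingerprint(document)
as_user_b = fingerprint(document)

# An agent that only sees ciphertext gets different bytes on each machine.
as_service_a = fingerprint(toy_encrypt(document, b"key-of-user-a"))
as_service_b = fingerprint(toy_encrypt(document, b"key-of-user-b"))

print("Running as user, duplicate found:   ", as_user_a == as_user_b)
print("Running as service, duplicate found:", as_service_a == as_service_b)
```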

Laptop Backup Vendor Comparison

  
  
  

Selecting a data backup solution is like buying grass seed – there are tons of varieties and qualities to keep us guessing. Buy the wrong kind for the targeted application and you may have to wait another season to try again. This is even more true when you consider solutions for enterprise laptop backup (aka endpoint backup). Often overlooked, laptop data backup is no longer as simple as having users copy files to a shared drive. Today’s users have become more mobile, and since they have regular access to email and their CRM, they VPN or dock into the network less often – or not at all – going longer intervals without backup. To address this sea change, innovation is needed that can overcome the many points of failure associated with mobility and endpoint backup.

With the subtitle, “Endpoint Backup Gains in Importance,” Gartner just released their Desktop/Laptop Backup research update (G00211731) on March 21st where Principal Analyst, Sheila Childs, sheds light on purchase considerations by breaking down choices into three categories …

1.       Server Backup Vendor Products Used for Desktops and Laptops

2.       Backup Services (Hosted Service Providers)

3.       Built-From-the-Ground-Up Desktop/Laptop Backup Products

Looking at the three categories, I have broken them down into 3 key areas of consideration: Handling Mobile Users, User Experience, and IT Administration.

1. Server Backup Vendor Products Used for Desktops and Laptops

Handling Mobile Users – Sheila indicates in this category, “Some of these vendors start with their server-based backup product and ‘dumb it down,’ so that it’s appropriate for desktop/laptop backup.” Be careful here – even though these vendors are almost all venerable names, a solution originally designed for always-connected servers, may not apply to frequently disconnected laptops.

User Experience - Users are typically empowered to do their own restores, but agents can be intrusive and often require users to adhere to strict schedules – another limitation for the mobile user. Test the solution on a few less technical users and see if the results will scale.

IT Administration – For this criterion, understand how thoroughly the solution can deduplicate files. Global (source & target) ought to be the goal. Good backup deduplication will conserve bandwidth as well as storage while making continuous backup viable instead of scheduled batches. Look also for central administration where the endpoint doesn’t have to be visited for implementation or updates.

2. Backup Services (Hosted Service Providers)

Handling Mobile Users – Assuming mobile users can connect often enough, online backup can work well. However, if schedules are missed, the delta change could impact laptop resources and affect productivity. Recoveries can be lengthy, so make sure you can meet RTO and RPO thresholds.

User Experience - Like the previous category, many of these vendors have attempted to trim down their server agents to accommodate endpoints. Expect schedules, user training, and intrusive agents.

IT Administration – According to Sheila, “Endpoint backup services are a good option when organizations don’t have the in-house resources to tackle this issue, or for those that simply don’t want to tackle it.” There is very little IT overhead with this method. Additionally data is protected offsite, often to multiple data centers. Costs can vary widely - understand the costs over 3 years as compared to owning the storage and doing it yourself.

3. Built-From-the-Ground-Up Desktop/Laptop Backup Products

Handling Mobile Users – Look for features in this category that can take advantage of any Internet connection and store data automatically behind the firewall. Aggressive deduplication at the source and target should enable support for thousands of users.

User Experience – It’s difficult to maintain training for users in larger enterprises – seek features that avoid schedules and cumbersome interfaces. Backup should be non-intrusive, offering continuous processing during idle CPU cycles. In the case of Copiun, the full-text file restore process is integrated into the OS, making a recovery as familiar as searching for a file in your directory.  

IT Administration – IT and other business stakeholders ought to have full visibility into the backup repository for risk management and e-discovery. Also, look for features that easily track enterprise backup success and can differentiate between anticipated errors such as disconnects and “real” failures that are out of service level compliance.
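One simple way to separate anticipated disconnects from “real” failures is to judge each endpoint by the time since its last successful backup rather than by individual failed attempts. Here is a minimal Python sketch of that idea. The function name, the status strings and the 3-day threshold are all hypothetical choices of mine, not taken from any vendor's product.

```python
from datetime import datetime, timedelta


def backup_status(last_success: datetime, now: datetime,
                  max_gap: timedelta = timedelta(days=3)) -> str:
    """Classify an endpoint by the time since its last successful backup."""
    gap = now - last_success
    if gap <= max_gap:
        # Missed attempts inside the window count as expected disconnects.
        return "ok"
    return "out of compliance"


now = datetime(2011, 6, 15, 9, 0)
overnight = backup_status(datetime(2011, 6, 14, 22, 0), now)  # offline overnight
week_old = backup_status(datetime(2011, 6, 8, 9, 0), now)     # a week with no backup
print(overnight)
print(week_old)
```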

Hopefully, this helps break down the choices available to you. Try our side-by-side vendor comparison tool to evaluate over 50 features of vendors on your short list.

I welcome your comments.

PC Backup – Object based vs. block based deduplication

  
  
  

This post is part of a sub-series on requirements to deduplicate files in an overall Series on planning for corporate PC backup in your organization.  My last post introduced Object Based backup deduplication. To summarize: Object Based backup dedupe does a much better job of the first and most important part of deduplication, i.e. dividing a file into smaller chunks, because of the following characteristics that are unique to Object Based backup deduplication:

a)      Format awareness

b)      Extracting objects in their native form: examples of objects are: images, slides, excel worksheets, PST file attachments etc.

c)       Extracting the same object, e.g. an image, regardless of order or location

d)      Ability to reconstruct a logical object that may be split into multiple physical locations due to file format storage conventions

As I promised, this post compares Object Based and block level deduplication approaches using the same example I used for block level deduplication: a document called NYC.ppt, which contains two familiar NYC icons: Times Square and the Statue of Liberty.

NYC-image

Block level deduplication will start with a fixed size block, say 16K, and create equal sized chunks out of this document as shown below.  You can see how the Times Square image is cut about 70% of the way to the right and the Statue of Liberty’s head is cut off horizontally right at the nose!  However, Object Based deduplication, because it is format aware, will extricate the underlying objects in their native form, which in this case are: the two images: Times Square and the Statue of Liberty, captured in their entirety.

BlockvsObjectNYCimage

Intuitively, capturing the objects in their native form seems like the right thing to do, but let’s take a look at another example, which really drives home the advantage of the Object Based method over the block based approach.  Here’s another document, titled Fav Places.ppt, which contains both the Statue of Liberty and the Times Square images from the last document, but with some important differences: the two images are stored in a different order, and the document also contains an additional image of the Empire State Building:

deduplicate files

Let’s put the two approaches to the test again.  Block level backup dedupe will again divide data into 16K chunks and as a result will have overlapping and chopped off logical objects as part of the same block, making it impossible to find any commonality.  On the other hand, Object Based deduplication will again capture the logical objects in their native form, and a simple process of checksum comparison will indicate that two out of the 3 objects have already been encountered before in a different file and don’t need to be stored again.  The only new object to be stored will be the image of the Empire State Building.

favorite-places-image
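The comparison above can be reproduced with a toy Python sketch. Everything in it is an assumption for illustration: the "file format" (a 1-byte slide number and a 2-byte length in front of each image) is invented and far simpler than PowerPoint's, `fake_image` just generates stand-in bytes, the block size is 16 bytes rather than 16K, and SHA-256 stands in for the checksum.

```python
import hashlib

BLOCK_SIZE = 16  # toy value in bytes; the example in this post uses 16K


def fake_image(name: str, size: int = 96) -> bytes:
    """Deterministic stand-in for image bytes (hypothetical helper)."""
    out = b""
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{name}-{counter}".encode()).digest()
        counter += 1
    return out[:size]


def write_deck(images: list[bytes]) -> bytes:
    """Toy format: 1-byte slide number + 2-byte length, then the payload."""
    return b"".join(
        bytes([slide]) + len(img).to_bytes(2, "big") + img
        for slide, img in enumerate(images, start=1)
    )


def block_chunks(data: bytes) -> set[str]:
    """Fixed-size chunking: cut every BLOCK_SIZE bytes regardless of content."""
    return {hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)}


def object_chunks(data: bytes) -> set[str]:
    """Format-aware chunking: parse headers, checksum each payload on its own."""
    found, pos = set(), 0
    while pos < len(data):
        length = int.from_bytes(data[pos + 1:pos + 3], "big")
        payload = data[pos + 3:pos + 3 + length]
        found.add(hashlib.sha256(payload).hexdigest())
        pos += 3 + length
    return found


times_square, liberty, empire = (
    fake_image(n) for n in ("times-square", "liberty", "empire-state"))
nyc = write_deck([times_square, liberty])
fav_places = write_deck([liberty, empire, times_square])

shared_blocks = block_chunks(nyc) & block_chunks(fav_places)
shared_objects = object_chunks(nyc) & object_chunks(fav_places)
print("Shared fixed-size blocks:", len(shared_blocks))
print("Shared objects:          ", len(shared_objects))
```

In this toy setup the reordering shifts every image to a different offset relative to the block boundaries, so the fixed-size pass finds nothing in common, while the format-aware pass recognizes both repeated images.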

To reiterate, this is a microcosm comparing how your data is processed for the first backup by block level deduplication vs. Object Based dedupe. There are tens of thousands, if not millions, of documents in your company which have a lot of duplicate embedded data like company logos, images, PST file attachments etc – all likely in different locations within different documents.  With block level deduplication, you have little hope of finding this duplicate data, because it has no notion of objects and how they are laid out.  Object Based deduplication will find this common data across millions of those documents and store and transmit it only once!  Imagine storing an image, a slide or a PST file attachment only once in the entire company – you’re looking at around 10-25x reduction in the amount of data you have to store and transmit!  Imagine not having to buy storage for another 2 years!

In my next post, I’ll tackle the so-called “variable length deduplication” and show that it is really just fixed length block level deduplication unless you're backing up the same file again and again.

Laptop Backup – Introducing Object Based Deduplication

  
  
  

This post is part of a sub-series on deduplication requirements in an overall Series on planning for corporate laptop backup in your organization. My last post examined the challenges of block based deduplication – which has until recently been the state of the art – when it comes to backing up documents and email, which make up the bulk of laptop data.  To summarize: Block level deduplication has two significant weaknesses:

a) Identifying common data for the first backup of your laptop and desktop population

b) Identifying common data when the data layout changes, i.e. when someone edits a document on their laptop

As I promised, this post introduces Object Based Deduplication, which overcomes both of the above limitations.

Object Based Backup Dedupe comes into the picture at the very first step of backup deduplication, i.e. the chunking step itself.  Before we get into how it works, let’s understand the characteristics of the data that we’re trying to dedupe.  Pretty much every file stored on the disk of a laptop or desktop has an inherent structure to it.  It is made up of underlying objects that are unique to that type of file.  For example: in a text document, the objects are words and paragraphs; in a PowerPoint document, the objects are slides, images, templates etc; in a PST file, the objects are messages, attachments etc…  You get the point.  These objects are stored in different locations in different files, depending on when and how the owners of those documents chose to create or insert those objects.  For example, you may have the same image stored on slide 1 of a presentation on Rex Ryan’s PC in NYC and on slide 99 of a presentation on Tom Brady’s laptop in Boston.  Which image, you ask?  Why, this one of course, of them both being married to supermodels! BTW, I’m no Rex Ryan fan, but his wife does look wonderful in that catalog and I hate to admit it, but he’s definitely more FUN than our genius hoodie!  Jokes aside, the challenge is to detect that common embedded image, even though it may be stored in a different place, in different files and on different machines.

Object Based Backup Dedupe has the following unique characteristics, which help it overcome this challenge:

a)      Format aware: it understands the file formats and can “chunk” the file into underlying objects that make up that file.  For example, a PowerPoint is chunked into slides, images and other objects.  This results in getting the SAME set of objects every time even though one may change their order, i.e. when one rearranges the slides in a presentation.

b)      Get the objects in their native form: often the underlying objects have a lot of metadata associated with them as part of storing them in a document.  For example, an image may have positional data indicating which slide it is stored on.  Being just format aware is not enough; one also has to extricate the native object from all the metadata encasements around it.  This is important because the surrounding metadata will be different in different files.

c)       Order doesn’t matter: It follows from the previous point, but bears explaining.  It doesn’t matter if the object is first, last or in the middle somewhere.  An object can occur in any order at any place and Object Based Backup Dedupe will uniquely identify it and not store it again if it has already been encountered anywhere else.

d)      Complete Logical Object, regardless of physical storage: Some file formats split a logical object like an image into multiple smaller blocks and store it in different physical locations to match their internal storage structure.  This can happen if the object size is too big to fit into one of their internal storage units.  PowerPoint is notorious for doing this.  Object Based Dedupe can construct a logical object in its entirety, regardless of the different physical locations it may have been split into before storage.

On the strength of the above 4 characteristics, an Object Based Dedupe approach will locate that image on slide 1 of the first presentation, extricate it in its native form by removing all metadata around it and construct a complete logical object regardless of internal PowerPoint storage structure.  Similarly, when it encounters the 2nd presentation, even though that same image is stored on slide 99, Object Based Dedupe will be able to extricate the same object out of that presentation because order and metadata don’t matter.  Once the same logical object has been extricated, a simple checksum will indicate that it has already been backed up on another machine and will not need to be backed up again.
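The last step, the checksum lookup, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `ObjectStore` class is hypothetical, the objects are assumed to have already been extracted in their native form, and SHA-256 stands in for the “simple checksum”.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store shared by all machines (hypothetical)."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def backup(self, extracted_objects: list[bytes]) -> int:
        """Store objects not seen before; return the bytes that had to be sent."""
        sent = 0
        for obj in extracted_objects:
            checksum = hashlib.sha256(obj).hexdigest()
            if checksum not in self.objects:
                self.objects[checksum] = obj
                sent += len(obj)
        return sent


store = ObjectStore()
shared_image = b"shared-image-bytes" * 100  # same image on both machines

# Machine 1: image on slide 1. Machine 2: same image on slide 99.
first = store.backup([shared_image, b"NYC slide text"])
second = store.backup([b"Boston slide text", shared_image])

print("Bytes sent by first machine: ", first)
print("Bytes sent by second machine:", second)
```

The second machine only sends its own slide text, because the image's checksum is already in the store.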

So, you’re saying all that’s well and good but what does it mean for me?  It means the following:

a)      That copy of the company logo embedded in potentially millions of documents in your company? Only stored and transmitted once.

b)      Those images you see embedded in different documents: stored and transmitted only once.

c)       Attachments in PST files: only stored and transmitted once across the entire company!

To further highlight advantages of Object Based Deduplication, I’ll compare it side-by-side with block level dedupe in my next blog.

PST Backup: Laptop Backup Series

  
  
  

PST backup is a pain for most organizations. Recently, a LinkedIn group that I’m a part of had a big debate re: whether or not PST files should be backed up and how to back them up.  There were several opinions, mostly colored by the fact that historically, backing up PST files across a large number of distributed PCs has been a really hard task.  In this blog, I’ll look at the key requirements around a successful PST file backup in a large environment. This post is part of an overall series on planning for corporate PC backup in your organization.

For better or for worse, if you have a situation where your PCs (desktops and laptops) have PST files on them, you need to back up those files, or you're looking at a potential loss of a large amount of data, leading to significant productivity loss and, worst case, loads of regulatory trouble.  Historically, backing up PST files has been hard because they are large, are always open and change every day.  There are 6 main things to consider when taking on the backup of PST files:

a)      Open file backup: This counts as table stakes.  If a backup solution can't back up open files well, it's not a good fit for PST file backup, period.

b)      Restartability: Restartability is key.  If a user kills the backup in the middle of a 5 GB PST file backup, is the backup process going to start all over?  Will it send the entire data again, or is it smart enough to know what has already been sent and only send the remainder?

c)       Detect PST files anywhere: Users do all kinds of stuff on their PCs, like storing PST files anywhere on the system.  A PST backup solution needs to be able to locate PST files anywhere on the PC and back them up.

d)      Attachment backup deduplication: There is a tremendous amount of duplicate data in your PST files.  When someone emails an attachment, that attachment is stored in the sender’s and all of the recipients’ PST files.  If your PST file solution is not smart enough to detect and store those attachments only once, you’re going to incur a huge storage burden for backing up multiple copies of those attachments.  Look for a solution that can deduplicate the attachments across ALL PST files ANYWHERE in the company, as a result transmitting and storing each attachment only ONCE across the ENTIRE Organization.

e)      Bandwidth efficiency: Ensure that the deduplication savings on PST file backups apply to savings on bandwidth as well.  This means that the Agent on the PC needs to be smart enough to detect common data across the ENTIRE organization and transmit it only once.  Otherwise, the process will simply not work for your remote users because sending GBs of data over slow WAN links every day is just not going to scale.

f)       Remote user considerations: Increasingly, remote users connect via VPN only infrequently.  However, their PST files still need to be backed up on a regular basis.  Make sure your PST file backup solution can back up data even when your mobile users are not connected over the VPN.  Beware of solutions that require you to open firewall ports to the backup server, which exposes your backup server and the entire backup data set to the Internet.  Look for a solution that can back up remote user data without opening any firewall ports or putting the backup server in the DMZ.
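Items (d) and (e) can be pictured together with a small sketch. This is only an illustration under assumptions, not how any particular product works: the `BackupServer` class, the `backup_attachments` function, and the sample data are all hypothetical, and attachments are assumed to have already been extracted from the PST files as byte strings. The agent hashes each attachment locally and sends the bytes only if the server has never seen that hash, so storage and bandwidth savings come from the same check.

```python
import hashlib


class BackupServer:
    """Hypothetical backup server that stores each attachment once, keyed by hash."""

    def __init__(self):
        self.blobs = {}          # digest -> attachment bytes
        self.bytes_received = 0  # bytes that actually crossed the WAN

    def has(self, digest):
        return digest in self.blobs

    def put(self, digest, data):
        self.blobs[digest] = data
        self.bytes_received += len(data)


def backup_attachments(server, attachments):
    """Agent side: hash locally, ask the server, send only unseen attachments."""
    manifest = []
    for data in attachments:
        digest = hashlib.sha256(data).hexdigest()
        if not server.has(digest):
            server.put(digest, data)
        manifest.append(digest)  # the backup records a reference either way
    return manifest


server = BackupServer()
deck = b"quarterly-deck" * 1000  # the same attachment in two users' PST files
alice = backup_attachments(server, [deck, b"alice-notes"])
bob = backup_attachments(server, [deck, b"bob-notes"])
print(len(server.blobs), server.bytes_received)
```

The shared attachment is stored and transmitted once, even though it appears in both users' backups. Each user's manifest still references it, so either user can restore it.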

Any backup product can back up large files, but these are real-world issues that need to be answered for a scalable and reliable solution for backing up PST files across PCs.
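Of the six considerations, restartability (item b) is probably the easiest to picture in code. The sketch below is a simplification under stated assumptions: `Server`, `backup`, the chunk size, and the `max_chunks` parameter (which simulates a user killing the backup) are hypothetical names, and the file is assumed not to change between attempts, which a real PST file often will.

```python
class Server:
    """Hypothetical server that remembers how much of a file it has received."""

    def __init__(self):
        self.received = bytearray()
        self.bytes_transferred = 0

    def offset(self):
        return len(self.received)

    def append(self, chunk):
        self.received.extend(chunk)
        self.bytes_transferred += len(chunk)


def backup(server, data, chunk_size, max_chunks=None):
    """Resume from the server's offset; return True once the file is complete."""
    sent = 0
    offset = server.offset()  # ask what has already arrived
    while offset < len(data):
        if max_chunks is not None and sent >= max_chunks:
            return False  # simulated interruption
        chunk = data[offset:offset + chunk_size]
        server.append(chunk)
        offset += len(chunk)
        sent += 1
    return True


pst = bytes(range(256)) * 40  # stand-in for a large PST file
server = Server()
first = backup(server, pst, 1024, max_chunks=3)  # user kills the backup early
second = backup(server, pst, 1024)               # next run sends only the rest
print(first, second, server.bytes_transferred == len(pst))
```

The second run starts from where the first one stopped, so the total transferred equals the file size rather than the file size plus the wasted first attempt.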

What do you think?

Laptop Backup – Block Level Deduplication: Not Enough!

This post is part of a sub-series on deduplication requirements in an overall series on planning for corporate PC backup in your organization.  My last post examined the effectiveness of the 4 main approaches to sub-dividing your data into chunks, which can then be searched for duplicates in your existing data.  As I promised, this post looks at the limitations of block level deduplication, with specific focus on data that is likely to occur on laptops and desktops, i.e. documents, email, etc.

 

Quick refresher: Block level deduplication breaks the file into fixed-size blocks and only backs up unique blocks using the process described here.  While better than file-level and delta-block technologies, this approach is best suited for database-type stores whose physical block layout doesn’t change.  For document-type data, which is most prevalent on PCs and where a simple save can alter the layout of the document, block level dedupe isn’t very effective.  This is because it has two limitations:

a)      Identifying common data for the first backup

b)      Identifying common data when the data layout changes

To understand, let’s take a look at how block level deduplication works on a simple document.  The image shows a document called NYC.ppt, which contains two familiar NYC icons: Times Square and the Statue of Liberty.

PC backup deduplication example


Block level deduplication will start with a fixed block size, say 16K, and create equal-sized chunks out of this document.  It has no idea where a logical object begins and where it ends.  As a result, the chunking process will look something like the image below.  You can see how the Times Square image is cut about 70% of the way to the right and the Statue of Liberty’s head is cut off horizontally right at the nose!
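The same point can be made with a few lines of arithmetic. In this rough sketch, the object sizes in `nyc` are made up for illustration (the real NYC.ppt layout is not known), and `objects_cut` is a hypothetical helper. It reports which logical objects have a 16K block boundary falling somewhere inside them.

```python
BLOCK = 16 * 1024  # fixed block size used in the example above


def objects_cut(layout, block=BLOCK):
    """Return names of objects that contain at least one block boundary.

    layout is a list of (name, length_in_bytes) in the order stored on disk.
    """
    cut = []
    start = 0
    for name, length in layout:
        end = start + length
        # First block boundary strictly after the object's start offset.
        first_boundary = (start // block + 1) * block
        if first_boundary < end:
            cut.append(name)
        start = end
    return cut


# Hypothetical layout of NYC.ppt: a small header followed by two images.
nyc = [
    ("header", 3000),
    ("times_square", 40000),
    ("statue_of_liberty", 30000),
]
cut = objects_cut(nyc)
print(cut)
```

Any object larger than one block is guaranteed to be split, and where the cuts land depends entirely on what happens to be stored in front of it.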

block level deduplication limitations

 

Seeing the logical objects getting chopped off randomly makes it apparent that there is something wrong with this approach, but let’s look at another example, which makes it patently clear.  Here’s another document, titled Fav Places.ppt, which contains both the Statue of Liberty and the Times Square image, but with some important differences: the two images are stored in a different order, and the document also contains an additional image of the Empire State Building:

block level deduplication challenges

To the human eye, it is clear that the Statue of Liberty and Times Square images are identical in both documents and should be identified as duplicates.  However, the way block level dedupe works, it won’t be able to find any commonality here.  It will again divide the data into 16K chunks, so each block ends up holding overlapping, chopped-off pieces of logical objects, and no block in one document matches a block in the other.
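A small experiment shows the effect. Everything here is a stand-in: `fake_image` generates deterministic pseudo-random bytes in place of real pictures, and the sizes and headers are invented. The two documents share two identical images, yet comparing 16K block hashes finds nothing in common, while comparing hashes of the whole embedded objects (roughly what the human eye is doing) finds both.

```python
import hashlib

BLOCK = 16 * 1024


def fake_image(seed, length):
    """Deterministic pseudo-random bytes standing in for image data."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(f"{seed}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:length])


def block_hashes(data, block=BLOCK):
    """Hashes of the fixed-size blocks a block level dedupe would compare."""
    return {hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)}


def object_hashes(objects):
    """Hashes of whole logical objects, shown only for contrast."""
    return {hashlib.sha256(obj).hexdigest() for obj in objects}


times_square = fake_image("times_square", 40000)
statue = fake_image("statue_of_liberty", 30000)
empire = fake_image("empire_state", 25000)

# Same two images, different order, different header, one extra image.
nyc_objects = [fake_image("nyc_header", 3000), times_square, statue]
fav_objects = [fake_image("fav_header", 5000), statue, empire, times_square]

nyc_doc = b"".join(nyc_objects)
fav_doc = b"".join(fav_objects)

shared_blocks = block_hashes(nyc_doc) & block_hashes(fav_doc)
shared_objects = object_hashes(nyc_objects) & object_hashes(fav_objects)
print(len(shared_blocks), len(shared_objects))
```

Because the images sit at different offsets in the two files, every block contains a different slice of bytes, so the block hashes never line up even though 70,000 bytes of content are identical.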

problems with block level deduplication

 

This is a microcosm of how your data is processed for the first backup by block level deduplication.  There are tens of thousands, if not millions, of documents in your company that have a lot of duplicate embedded data like company logos, images, PST file attachments, etc., all likely in different locations within different documents.  With block level deduplication, you have little hope of finding this duplicate data, because it has no notion of objects and how they are laid out.  The same problem manifests itself when the data layout changes for a document.  In future posts, I’ll introduce Object Based Deduplication and show how that solves the problems mentioned above.
