Sunday, November 23, 2008

Deduplication - bit of a show down

Looking at Data de-duplication today:

This article contains summaries from most of the industry players, so I'm going to put it here in full in case it gets lost from drunkendata.com:



Invitation to De-Duplication Vendors

There are some questions I would like to get answers to in the area of De-Duplication. I am hoping that some of the vendor readers of this blog will help out.

Here is an opportunity to shine, folks, and to tell the world why, what, how and where. Here is the question list. You can either respond on line through comment cut and paste or email me your response at jtoigo@toigopartners.com and I will put your response on-line for you. From where I am sitting, these are the kinds of questions that consumers would ask.

1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.

2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?

3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that theirs is best because it collapses two functions — de-dupe then ingest — into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?

4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?

5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?

6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a subpoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?

7. Some say that de-dupe obviates the need for encryption. What do you think?

8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?

9. Some vendors are claiming de-dupe is “green” — do you see it as such?

10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?

11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?

Thanks in advance for your response.

15 Responses to “Invitation to De-Duplication Vendors”

  1. draft_ceo Says:

    If Diligent is the best, then I am curious to understand why it sold for less than $200M.

  2. Administrator Says:

    IBM hasn’t revealed how much it spent on Diligent. Not sure where you are getting your numbers. Also, no one, except maybe IBM, has suggested that Diligent was best.

  3. Administrator Says:

    Chris P, over at eWeek, has pointed to this “quiz” and encouraged de-dupe vendors to open their corporate kimonos. Thanks, Chris.

  4. draft_ceo Says:

    I got the $200M number from here:
    http://www.byteandswitch.com/document.asp?doc_id=151339

  5. Administrator Says:

    IBM said in its conference call that they did not, as a matter of company policy, disclose acquisition prices. I don’t know where B&S got its numbers or if they are accurate.

  6. Howard.Marks Says:

    Jon,

    As you know, I’m not a vendor, but I play Blogger at InformationWeek. Starting with question 4, your description of deduping as using stubs isn’t a good analogy.

    Think of a deduped data store as a file system. In the case of a NAS device like a Data Domain or NetApp A-SIS it really is a file system. In a VTL, think of each virtual tape as a file.

    Somewhere there’s a directory that says the file FOO.BAR is stored on blocks 123-345, 500-510 and 12999-14090. That’s true of ANY file system. The difference between the deduped and normal file system is that more than one file can use the same block. If I edit FOO.BAR, add 10Kbytes to the end and save it as FOO2.BAR, then at some point (real time or later) the deduper will recognize (via hashes or a byte-by-byte compare) that my file has the same data and will build a directory entry that says FOO2.BAR uses 123-345, 500-510, 12999-14090 and 66666-66669. So the second file takes up just 10K bytes.

    Now the file system needs to keep track of how many files point to each block and update that list when files are deleted.
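
    To make that concrete, here is a minimal sketch in Python (a toy model, not any vendor's code) of the file system Howard describes: a directory maps each file to a list of block references, identical blocks are stored once, and a per-block reference count is updated as files are written and deleted.

    ```python
    import hashlib

    class DedupStore:
        def __init__(self, block_size=4096):
            self.block_size = block_size
            self.blocks = {}      # fingerprint -> block contents (stored once)
            self.refcounts = {}   # fingerprint -> how many file references point at it
            self.directory = {}   # file name -> ordered list of block fingerprints

        def write_file(self, name, data):
            refs = []
            for i in range(0, len(data), self.block_size):
                chunk = data[i:i + self.block_size]
                fp = hashlib.sha256(chunk).hexdigest()
                if fp not in self.blocks:          # only unique blocks consume space
                    self.blocks[fp] = chunk
                    self.refcounts[fp] = 0
                self.refcounts[fp] += 1
                refs.append(fp)
            self.directory[name] = refs

        def read_file(self, name):
            return b"".join(self.blocks[fp] for fp in self.directory[name])

        def delete_file(self, name):
            for fp in self.directory.pop(name):
                self.refcounts[fp] -= 1
                if self.refcounts[fp] == 0:        # no file points at this block any more
                    del self.blocks[fp]
                    del self.refcounts[fp]
    ```

    Writing FOO.BAR and then FOO2.BAR (the same data plus 10K on the end) only adds the trailing blocks to the store; everything else becomes another reference, which is why the second file costs roughly 10K of space.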

    Re 5: I reject that dedupe is changing data. It’s storing it differently. Now, LZS compression is changing data and AES encryption is changing data, but dedupe as I described it above (which is a good enough description of all the techniques and 99% accurate for NetApp) isn’t. Strictly speaking, RLL encoding in the disk drive is modifying the data.

    Re 6: not any more than LZS or AES. The truth is those regulations mean “tamper with to change the meaning” when they say “do not modify.”

    7 no it don’t

    8 - The only use of tape for deduped data would be to backup/restore the WHOLE deduped data store in one fell swoop.

    9 - If I dedupe and store 1/20th the data on 1/20th the drives using 1/20th the power it seems greenish. Tape is greener as I blogged a couple days ago.

    10 - If you think about hash based dedupe and CAS you could use dedupe to replace any of the online archive apps CAS is used for. Riverbed and Silverpeak use it for WAN acceleration and NetApp is pitching it for primary file storage. The downside is that reading files back is slower because it’s not a sequential read as it would be if the file were on contiguous blocks. In fact reading from a deduped store is VERY much like reading from a badly fragmented disk on a file server. Since these are devices made for backup/restore they could use long read-ahead queues to speed it up.

    11 - The hard part in deduping is finding the right places to divide data into blocks. Think of the corporate file server. There are 10,000 Word docs with the corporate logo embedded. If the blocking algorithm can put that logo into a block by itself you’ll get much better data reduction than if it uses fixed size 4K blocks.

    The other hard part is building the index so you can QUICKLY check if a block being stored now has been stored before.
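
    For illustration, here is a toy content-defined chunker in Python (a generic sketch, not Hifn's or any product's algorithm; the window, mask and size limits are invented parameters). A rolling hash over the last few dozen bytes decides where block boundaries fall, so a repeated object such as that embedded logo tends to produce the same blocks wherever it appears in a file.

    ```python
    def chunk_data(data, window=48, mask=0x0FFF, min_size=2048, max_size=65536):
        BASE, MOD = 257, (1 << 31) - 1
        pow_w = pow(BASE, window - 1, MOD)            # weight of the byte leaving the window
        chunks, start, h = [], 0, 0
        for i in range(len(data)):
            if i < window:
                h = (h * BASE + data[i]) % MOD
            else:                                     # slide the window: drop oldest, add newest
                h = ((h - data[i - window] * pow_w) * BASE + data[i]) % MOD
            size = i - start + 1
            # Cut where the windowed hash matches a pattern (roughly every 4KB with
            # this mask), or at the hard size cap.
            if (size >= min_size and (h & mask) == 0) or size >= max_size:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks
    ```

    The index problem is then a lookup keyed on each chunk's fingerprint (a SHA-1 or SHA-256 hash, say); the engineering challenge Howard points to is keeping that lookup fast once the fingerprint table no longer fits in memory.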

    All the HiFn card does is calculate the hashes for blocks. So chips can help but there’s no such thing as chip dedupe.

    Howard Marks
    Backup and Business Continuity Blogger

  7. Administrator Says:

    Thanks for your feedback, Howard. The questions on the list are for clarification from the vendors, none of whom — by the way — have seen fit to respond as yet. The points you make are very valid, but the questions are not a reflection of my misunderstanding of de-dupe as much as they are concerns raised to me by consumers who really don’t understand how de-dupe works.

    Stubbing is still a technique used by certain products, though not by all. I wanted vendors to clarify what techniques they actually use. As for the other questions, consumers believe that de-dupe is changing data, that it imposes a hit on access speeds, that it may jeopardize compliance. I have actually had several de-dupe vendors tell me that de-duped data “is already encrypted.”

    Bottom line: there are equal parts hype and marketecture around the technologies in play. Lots of players are doing things differently. There are no standards for doing it at all. Hence, the questionnaire.

    Thanks again for your thoughtful insights. I hope some of the vendors actually chime in.

  8. Administrator Says:

    Larry Freeman, Senior Marketing Manager, Storage Efficiency Solutions, Network Appliance, writes

    1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.

    Company: NetApp
    Dedupe product: NetApp deduplication
    NetApp deduplication is a fundamental component of NetApp’s core operating architecture - Data ONTAP. NetApp deduplication is the first dedupe technology that can be used broadly across many applications, including primary data, backup data, and archival data.

    2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?

    Storage admins are reluctant to delete data or send it to permanent tape archival (or are prohibited from doing so). But as everyone knows, data keeps growing. This presents a quandary. You can’t just keep buying more and more storage; rather, you need to figure out the best way to compress the data you are required to store on disk. Of all the storage space reduction options, deduplication provides the highest degree of data compression, requires the least compute resources, and is usually very simple to implement. This is the reason for the broad interest in and adoption of deduplication.

    3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that theirs is best because it collapses two functions — de-dupe then ingest — into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?

    As with many other technologies, the gating factor for selecting the right deduplication technology is “What are you trying to accomplish?”

    Inline deduplication’s main benefit is that it never requires the storage of redundant data; that data is eliminated before it is written. The drawback of inline, however, is that the decision to “store or throw away” data must be made in real time, which precludes any data validation to guarantee that the data being thrown away is in fact a duplicate. Inline deduplication is also limited in scalability: since fingerprint compares are done “on the fly,” the preferred method is to store all fingerprints in memory to prevent disk look-ups. When the number of fingerprints exceeds the storage system’s memory capacity, inline deduplication ingest speeds will become substantially degraded.

    Post-processing deduplication, the method that NetApp uses, requires data to be stored first, then deduplicated. This allows the deduplication process to run at a more leisurely pace. Since the data is stored and then examined, a higher level of validation can be done. Post-processing also requires fewer system resources during the deduplication process, since fingerprints can be stored on disk.

    So, bottom line: if your main goal is to never write duplicate data to the storage system, and you can accept “false fingerprint compares”, inline deduplication might be your best choice. If your main objective is to decrease storage consumption over time while ensuring that unique data is never accidentally deleted, post-processing deduplication would be the choice.
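
    As a rough illustration of the trade-off described above (a hedged sketch, not NetApp's or anyone else's actual pipeline; the "store" object and its append/scan/read/remap/free operations are hypothetical stand-ins):

    ```python
    import hashlib

    def fingerprint(block):
        return hashlib.sha256(block).digest()

    def inline_write(block, index, store):
        # Inline: decide "store or throw away" before the block hits disk, so the
        # fingerprint index must be consulted at ingest speed (ideally from memory).
        fp = fingerprint(block)
        if fp not in index:
            index[fp] = store.append(block)       # only unique data is ever written
        return index[fp]                          # location of the single stored copy

    def post_process(store, index):
        # Post-process: everything lands on disk first; a later pass finds candidate
        # duplicates, validates them byte for byte, and releases the redundant copies.
        for loc, block in store.scan():
            fp = fingerprint(block)
            if fp in index and store.read(index[fp]) == block:   # the extra validation step
                store.remap(loc, index[fp])
                store.free(loc)
            else:
                index[fp] = loc
    ```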

    4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?

    A better way to describe this would be “removes bit string patterns and substitutes a reference pointer or stub.” In NetApp’s case, a single data block can be referenced 255 times. When we identify and validate two identical data blocks, we re-reference the data pointer of the duplicate block to the original block, and release this duplicate block back to the “free” block pool. No stubs are required.
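
    A hedged sketch of that re-reference step (illustrative only, not Data ONTAP code; the block map, reference counts and free pool here are made-up structures for the example):

    ```python
    MAX_REFS = 255   # per the description above: one block can be referenced 255 times

    def share_duplicate(dup_blkno, orig_blkno, block_map, refcount, free_pool):
        # Redirect every pointer at the validated duplicate to the original block,
        # then return the duplicate block to the free pool.
        if refcount[orig_blkno] + refcount[dup_blkno] > MAX_REFS:
            return False                              # original cannot absorb more references
        for entry in block_map:                       # entries look like {"file": ..., "blkno": ...}
            if entry["blkno"] == dup_blkno:
                entry["blkno"] = orig_blkno           # re-reference to the original
        refcount[orig_blkno] += refcount[dup_blkno]
        refcount[dup_blkno] = 0
        free_pool.add(dup_blkno)                      # duplicate goes back to the free block pool
        return True
    ```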

    5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?

    This is generally referred to as the “reconstitution” of the deduped data. With NetApp, when we deduplicate, we are merely reorganizing the data structure of the filesystem by using multiple block references. Once the data set is deduplicated, there is no external algorithm needed to reconstitute the data. The direct and indirect nodes that make up the filesystem are traversed and the blocks are recovered, just as they would be in a “normal” NetApp filesystem.

    6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a subpoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?

    The regulators want proof that the data is immutable, or in other words that it has not been altered or tampered with. NetApp deduplication does not alter one byte of data from its original form; it’s just stored differently on disk. I use this analogy: if a disk volume is defragmented, isn’t it still the same data, just stored in a different place? The same is true of data compression: data that has been compressed and then uncompressed is stored differently, but it is still in its original form. One interesting point, though, is what happens if a “false fingerprint compare” as described above with inline deduplication occurs. Now the data HAS been changed. Because of this, inline deduplication may not be acceptable in regulatory environments.

    7. Some say that de-dupe obviates the need for encryption. What do you think?

    Interesting concept, but it has one big flaw. Unlike encryption, deduplication does not guarantee that files will be unreadable.

    8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?

    It’s a customer decision, which again depends on the customer’s objective. Many customers are willing to take the trade-off of a proprietary dedupe format written to tape in exchange for a significantly reduced number of tapes to manage. Others view this as a show-stopper and don’t want to rely on the deduplication vendor’s ability to recover data from tapes.

    9. Some vendors are claiming de-dupe is “green” — do you see it as such?

    In many respects, yes, deduplication is green. If I can reduce my physical storage needs by, say, 50% through deduplication, that means I need 50% fewer spinning disks to house the same data. The trouble is that as soon as any disk space becomes available, it manages to fill back up pretty quickly, as in your “storage junk drawer” example above. NetApp believes that deduplication is just one component of overall green storage, and should be combined with features like thin provisioning, writeable snapshots, and higher capacity disk drives for optimal “greening.” We have published a whitepaper, “Buying Less Storage With NetApp,” that addresses just this topic.

    10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?

    Approximately 30% of all NetApp deduplication users are deduplicating primary storage applications, and this is the area where we are seeing the greatest growth in our deduplication. VMware, Exchange, SQL, Oracle, and SharePoint are the primary apps we predict will see the greatest adoption of NetApp deduplication in 2008.

    11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?

    Most customers have not adopted software-based deduplication because of the challenges in managing multiple agents and deduplication points across their environment. Users seem to clearly prefer deduplication at the destination storage system: simple to implement, manage and control. In NetApp’s case, a 10 minute installation. As far as the Hifn card goes, it is important to note that this card does not actually perform deduplication; it merely provides a “hash” function to create fingerprints. The fingerprint cataloging, and the stub or data pointer creation, will still be the responsibility of the OEM storage provider.

    Thanks, Larry.

  9. Administrator Says:

    From Bill Andrews, CEO of ExaGrid.

    1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.

    ExaGrid Systems
    ExaGrid 1000, 2000, 3000, 4000 and 5000 as well as the 5000-GWi (iSCSI gateway).

    2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?

    There are two values. The first is to store a lot of data in a small footprint of disk. This works great for backup, as each backup job that comes in is 98% the same as the backup job before it, so de-dup works great because so much of the data is redundant. The additional value is that the only practical way to keep an offsite copy of the data is to compare one backup to another and only move the changes. De-dup is required for backup because the backup file names change, so you need to compare one backup to the other to find the differences. For example, primary storage snapshots would not see each backup as the same file and would try to move the entire backup across the WAN. The net is that de-dup reduces storage but also enables WAN-efficient offsite copies.

    3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that theirs is best because it collapses two functions — de-dupe then ingest — into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?

    There are 3 basic methods we see.

    • Break backup jobs or files into roughly 8KB blocks and then compare blocks to store only unique blocks (Data Domain, Quantum, etc.)
    • Byte-level delta, where each backup job is compared to the previous backup jobs and only the bytes that change are stored (ExaGrid, Sepaton, etc.)
    • Near-dup block level, which brings the dedup part way there and then takes the big segments and compares them for byte-level deltas (Diligent, etc.)

    The benefit of the last two, which ultimately use byte-level comparison, is that they allow for great scalability. If you use 8KB blocks, then for every 10TB there are over 1 billion hash table entries (10TB / 8KB). For byte-level or near-dup-to-byte-level, the segments that are compared are typically 100MB each, so there are significantly fewer pieces to track. This allows data to be managed across servers in a scalable solution. If you notice, the 3 players that have scalability (Sepaton for the enterprise, Diligent for the enterprise, and ExaGrid with its scalable GRID architecture) all use byte-level deltas.

    4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?

    As described above, it compares a backup job to the former backup job to find the bytes that change. About 2% of the bytes change from backup to backup. For a 10TB backup job this means that the differences are about 200GB each. We store the most recent backup compressed (2 to 1) and then all previous backups as just the bytes that change. Therefore, if someone were keeping 20 copies (200TB of straight disk), the result would be a 5TB most recent backup plus 19 byte deltas of 200GB each, or a total of 8.8TB. 200TB / 8.8TB = 22.7 to 1, as an example.
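
    Working through that arithmetic with the figures given above:

    ```python
    full_tb       = 10                                    # nightly 10TB full backup
    copies        = 20                                    # 20 backups retained
    change_rate   = 0.02                                  # ~2% of bytes change backup to backup
    compression   = 2                                     # most recent full stored 2:1 compressed

    straight_disk = full_tb * copies                      # 200 TB of plain disk
    latest_full   = full_tb / compression                 # 5 TB
    byte_deltas   = (copies - 1) * full_tb * change_rate  # 19 deltas x 0.2 TB = 3.8 TB
    dedup_disk    = latest_full + byte_deltas             # 8.8 TB
    print(round(straight_disk / dedup_disk, 1))           # 22.7 (to 1)
    ```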

    5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?

    At the primary site we store the most recent backup in its complete form – compressed. We store all previous versions as just the bytes that change. We replicate the bytes that change over the WAN to the second site system. We store the bytes that change, but we also merge those bytes, each time, into the full backup on the other side. Therefore, both sides are identical, with the most recent backup in its complete form and all previous backups as byte deltas. To do a DR recovery, all you do is set up a backup application and restore, as the most recent backup is sitting there in its entirety. You can also, at any time, do a test recovery to make sure the data is there for when you need it.
    An additional benefit of having the most recent copy always ready to go in its entirety (compressed) is that 90% of restores come from the most recent backup, and therefore restores are fast. Even more important, if you are still making offsite tapes, 100% of offsite tape copies come from the Friday night full backup. If the Friday full is a de-duped set of blocks, the tape copy is slow; if the Friday full is a complete un-duped full, the tape copy is fast.
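
    A simplified sketch of that scheme in Python (illustrative only, not ExaGrid's implementation; it assumes backup images of equal length and uses a fixed block size to find the changed bytes):

    ```python
    BLOCK = 4096

    def delta(old, new):
        # Offsets and contents of the blocks that differ between two backup images.
        return [(i, new[i:i + BLOCK])
                for i in range(0, len(new), BLOCK)
                if old[i:i + BLOCK] != new[i:i + BLOCK]]

    def apply_delta(full, d):
        buf = bytearray(full)
        for offset, block in d:
            buf[offset:offset + len(block)] = block
        return bytes(buf)

    def ingest(site, changed):
        # Keep the old contents of the changed blocks as the previous version,
        # then merge the changes into the site's full copy.
        rollback = [(off, site["latest"][off:off + BLOCK]) for off, _ in changed]
        site["history"].append(rollback)              # older backup survives only as a byte delta
        site["latest"] = apply_delta(site["latest"], changed)

    def nightly_backup(new_full, primary, offsite, wan_send):
        changed = delta(primary["latest"], new_full)  # ~2% of the bytes, per the example above
        ingest(primary, changed)                      # primary keeps newest full + deltas
        wan_send(changed)                             # only the changes cross the WAN
        ingest(offsite, changed)                      # offsite full is always ready for DR
    ```

    Restoring an older version just means walking back through the stored deltas from the current full, which is how the bytes that change get merged back in during a point-in-time restore.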

    6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a subpoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?

    We have not heard this. All data can be restored from any point in time. The backup application controls the catalog and the retention periods. If you want to do a restore from 3 years ago, you do the restore, the bytes that change merge into the full backup, and the backup application receives the full backup as of that date. All bytes are checksummed. Even with block level de-dup the hash table can put a backup back together from any point in time.

    7. Some say that de-dupe obviates the need for encryption. What do you think?

    If you have a two-site disk-backup system with de-dup there is no need for encryption, as the system uses physical data center security, standard network security and standard VPN security. The requirement for encryption is to protect data leaving the building on tape, typically in cartons. Obviously, if data leaves the building on physical media you want the data encrypted. We have a lot of health care customers who need to encrypt. With these types of systems there is no need for encryption because the security and encryption are built into the network. On the primary side, the primary site disk-based backup system sits in the data center, secured by data center and network security. It is as secure as the primary data. The second site or offsite system also sits behind data center and network security.

    Again, it is as secure as all data or applications in that data center. The data from one system to the other traverses the WAN over an encrypted VPN. Therefore, the data moves from the primary site to the second / offsite over the same encrypted VPN that all the company’s traffic goes over. So what you have is two systems, sitting in two secured data centers with data going over an encrypted VPN. Therefore, security is inherent in the infrastructure.

    8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?

    We don’t have this problem as we keep the most recent backup in its complete form. When tapes are made, whether a nightly incremental or a full, it is always made from the full copy.

    The real process needs to be thought through not from a de-duplication perspective but from a total backup process perspective. As an IT person I care about:

    • Fast backups – so I want the system to take in data as fast as it can. Writing the backup job to disk first and then doing all de-duplication is the fastest (post process). This keeps my backup window to an absolute minimum, which is my biggest challenge.
    • Fast restores – 90% of my restores come from my most recent backup, so I want that in its complete form, ready to be restored, versus having to be put back together before it can be restored.
    • Fast tape copy – my tape copies are made as soon as the backups are done. Therefore, I want a complete backup job on disk so I can make quick tape copies versus waiting until the data gets put back together to get a tape copy.
    • Storage efficiency – I want all the above but I also want storage efficiency and I don’t care how you get there as long as it takes the least space possible and when I want my data back it is there.

    If I eliminate tape offsite I want an updated copy on the other side ready to restore in case of a disaster.

    My data is growing at 30% a year, which means it doubles every 2.5 years. I need a system I can add capacity to that keeps performance up with my ever-growing data. Therefore, I need more than just disk capacity added; I need each set of disks to be accompanied by the appropriate memory, processor and bandwidth such that I am not degrading in performance as I add more data.

    I want all this at the lowest price possible because IT budget is tight.

    So therefore, the right system offers:

    • Post-processing, to get the backup job off the network fast (short backup window), with all de-dup performed after the backup is done
    • The most recent backup in its complete form, for quick restores and quick tape copies
    • The ability to store only the changes from backup to backup, to keep a small footprint of disk
    • Movement of only the changes offsite for WAN efficiency, with the offsite full constantly kept up to date for quick Disaster Recovery
    • Storage servers in a scalable system, so that each group of disks is not just storage but adds memory, processor and bandwidth to keep up with the increased data
    • The lowest price
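
    As a quick check of the growth arithmetic above (30% a year does indeed double the data in roughly two and a half years):

    ```python
    import math

    growth = 0.30                                  # 30% data growth per year
    doubling_years = math.log(2) / math.log(1 + growth)
    print(round(doubling_years, 1))                # ~2.6 years
    ```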

    9. Some vendors are claiming de-dupe is “green” — do you see it as such?

    It is and it isn’t

    Against the amount of straight disk you would need, it absolutely is, as it takes a smaller footprint and less power and cooling to store long-term retention.

    Against tape we think it might be the same but we are not 100% sure.

    10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?

    De-dup has 4 uses that we can see

    1. Efficiently store backup data
    2. Efficiently store primary storage archival data (nearline)
    3. Efficiently store primary data
    4. Only move unique data from remote site to a central site (Symantec PureDisk, EMC Avamar) or de-dup any data over a WAN for WAN efficiency ala Riverbed.

    11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?

    Hifn, as far as we can tell, is just compression accelerated by hardware. In other words, typically 2 to 1.

    It is not storing only unique blocks or bytes, and therefore will not achieve 20 to 1 or more as retention/history grows.

    Most customers do not want to put together their own storage servers and then load software. Who do they call if they have a problem? Is it the RAID card, is it the controller, is it the disk drives, is it the OS, is it the de-dup software, is it the backup software configuration? Customers want to call one vendor and get their answer. In the enterprise, software might win; in the mass market, an appliance approach will win, as storage is hardware.

    Thanks, Bill.

  10. pete Says:

    1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.

    Company: Data Storage Group, Inc.
    Product Family: ArchiveIQ™
    Shipping Products:
    Quantum GoVault Data Protection (SOHO)
    ArchiveIQ™ Enterprise Server (SMB – Medium Enterprise)
    Differentiators:
    Source-based data deduplication – redundant data is identified and removed at the source, before it is transferred across the network. This allows organizations to protect remote office data and dramatically reduce the backup window and recovery point objective for all systems. The system can also be configured for post-process data deduplication if the source system is a non-Windows server.
    Multiple data deduplication techniques – high levels of data reduction are achieved through multiple data deduplication techniques. With minimal server impact advanced data compression, single instance storage and sub-file data reduction are all included with ArchiveIQ.
    Simple and fast data recovery – Restoring individual files is easy because every file is included in a filename index which allows wildcard searching across several months of recovery points. Additionally, every recovery point can be quickly explored like a normal file share. Full folder recovery is a simple drag-and-drop from the explorer window. Finally, since ArchiveIQ does not “chunk” all the data into small pieces the restore jobs are at full disk speeds.
    Automated Data Validation – every recovery point is continuously validated based on administrative policies. Any unexpected problems with the storage media or deduplicated data will be identified early and the system will repair itself from the source data.
    Automated Data Retention – the administrator simply specifies how long recovery points should be retained. The system automatically identifies and removes deduplicated data that does not meet the defined data retention policy. This process increases the available storage capacity and limits litigation and compliance liability.
    Source data space management – optionally increase the available storage capacity on Windows file servers that are constantly running out of space. ArchiveIQ will transparently “stub” inactive file data and free expensive storage capacity for new files and active data. If a user or application needs to access the stubbed data, it is transparently cached back from the ArchiveIQ Server.
    No Hardware ties – the administrator has the freedom to use existing server and storage capacity, or purchase new capacity based on various considerations like replication, expansion, migration and price. As long as the storage platform supports NTFS volumes, ArchiveIQ can use it for data deduplication. Future purchases will also cost less because of this freedom of choice.

    2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?

    The one primary goal of data deduplication is to have a longer period of data history on disk and readily available. This is more appealing to organizations because it enables efficient data recovery and discovery. It also improves data reliability because the system can be continuously validating backup images. With tape-based data retention, a media problem will go undetected until the production data has been lost and needs to be recovered.
    Data deduplication can also improve the process of creating offsite copies. Instead of copying all of the source data, the system can focus on the deduplicated data. This reduces the total amount of replicated data and network impact. Instead of managing several replication plans, one for each production volume, the focus can be on the unique bits of data.

    3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that theirs is best because it collapses two functions — de-dupe then ingest — into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?

    When talking specifically about the data deduplication process there are two main questions to ask.
    1) What data reduction techniques are being applied?
    At a high level there are four basic data deduplication techniques being deployed. Every vendor has their own IP, or has licensed IP, but the end-results are similar. Products that claim data deduplication need to deliver at least the first three of the four (excluding data “chunking”).
    a) Advanced Data Compression – data that is highly compressible should be transferred and stored in the compressed state.
    b) Single Instance Storage – redundancy at the file level should be completely removed and a single compressed version should be transferred and stored (a brief sketch of this technique follows this answer).
    c) Sub-file data reduction – Active data and structured data (Exchange, SQL, System State, VHD, VMDK, PST) that do not deduplicate well with SIS should be processed for sub-file data reduction.
    d) Data “chunking” - A fourth data deduplication technique breaks all the data into small “chunks” and identifies redundancy at the chunk level. This technique will identify the most redundancy, but it comes at a cost. The processing power, core memory and time required to recover 1TB of data after it has been broken into 8 kilobyte chunks are significant. Also, the master index that maps all these chunks back together will become very large over time. For most organizations it is difficult to know if the overall data reduction from this fourth technique is worth the system and recovery impact.
    2) Where can the system perform data deduplication?
    Today there are three basic areas where data deduplication is taking place. Source-based data deduplication offers the most cost-savings when it comes to backup window, space management and ROBO protection.
    a) Post-process data deduplication
    b) Inline data deduplication
    c) Source-based data deduplication
    NOTE: Don’t be too focused on data deduplication and ignore data recovery, validation, retention and hardware dependencies.
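
    As an illustration of technique (b), here is a minimal single-instance store in Python (a generic sketch, not ArchiveIQ's implementation):

    ```python
    import hashlib

    class SingleInstanceStore:
        def __init__(self):
            self.instances = {}   # file digest -> stored content (one copy per unique file)
            self.catalog = {}     # (recovery point, path) -> file digest

        def protect(self, recovery_point, path, content):
            digest = hashlib.sha256(content).hexdigest()
            if digest not in self.instances:        # first time this exact file has been seen
                self.instances[digest] = content    # (a real product would also compress it)
            self.catalog[(recovery_point, path)] = digest

        def restore(self, recovery_point, path):
            return self.instances[self.catalog[(recovery_point, path)]]
    ```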

    4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?

    Unlike block level de-dupe technology, which substitutes variable or fixed length blocks of data with references (usually hash codes) to identical previously stored blocks of data (a well-known global compression technique), ArchiveIQ uses advanced single-pass sub-file content factoring technology to identify and store only the new and unique content of a given data source.

    5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?

    Our deduplicated data is stored on standard NTFS volumes. Any replication product that supports NTFS can be used for offsite copies. There is no additional charge or appliance required.

    6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a subpoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?

    Data on tape is usually compressed at a given blocking factor. As long as the immutability of the content can be assured, deduplication should be acceptable and can be used in conjunction with digital signatures (one or more) and/or write-once media formats for nonrepudiation requirements. The management practices for compliance and non-repudiation requirements do not change with the application of de-dupe. Source-based deduplication adds a level of data integrity by being able to verify the contents of the source against what is in the destination repository.

    7. Some say that de-dupe obviates the need for encryption. What do you think?

    No, encrypting data over the wire or at rest should still be considered. Since our store is NTFS you can use Windows encryption or third-party products that support NTFS.

    8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?

    Sounds like FUD - I guess the same rationale should be applied to tape formats. If we asked most D-2-D-2-T vendors to restore the backup image before moving it to tape they would say no. ArchiveIQ should raise fewer concerns since the deduplicated store is fully self-describing and just a series of NTFS files and folders.

    9. Some vendors are claiming de-dupe is “green” — do you see it as such?

    Research is still being done to determine if a VTL draws less power than a tape library for the same environment. Article Cite. Blog Cite.

    One thing is certain – a software only solution like ArchiveIQ draws less power than both a VTL and tape. Especially if you just reuse an existing server and storage.

    10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?

    Space management of primary storage, E-Discovery, and data protection all using the same deduplicated backend data store, all in a unified product.

    11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?

    Typically, hardware based deduplication happens at the server or appliance. This requires the totality of the data to be streamed across the network, so network bandwidth and time to back up are unchanged from traditional backup strategies. Software deduplication, at least for our product, allows us to deduplicate the data at the source and send only the changes across the network, thereby saving network bandwidth and reducing the time to perform a backup.
    If the deduplication happens at the “client” machine, then the enterprise backup cycles are distributed across a large number of computers which also means that the “server” on which the deduplicated store resides can handle a large number of clients.
    Finally, ask administrators how they feel when tape hardware changes. Often months or years worth of data is trapped on the old media formats. The same can happen with data deduplication that is tied at the hip to hardware. You can’t avoid being tied to the data deduplication techniques, but you can try to avoid being tied to hardware from the same vendor.
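
    One common way source-side deduplication avoids moving redundant data is a fingerprint negotiation between the client and the backup server; a generic sketch follows (not necessarily how ArchiveIQ or any particular product does it; the ask_server and send_chunks callables are placeholders for the transport layer):

    ```python
    import hashlib

    def fingerprints(chunks):
        return [hashlib.sha256(c).hexdigest() for c in chunks]

    def client_backup(chunks, ask_server, send_chunks):
        # Hash locally, ask which fingerprints the server already holds, and
        # transfer only the chunks it is missing.
        fps = fingerprints(chunks)
        missing = ask_server(fps)                        # set of fingerprints the server lacks
        send_chunks({fp: c for fp, c in zip(fps, chunks) if fp in missing})
        return fps                                       # ordered recipe for this recovery point

    def make_ask_server(store):
        # Server side: report which of the offered fingerprints are new to the store.
        return lambda fps: {fp for fp in fps if fp not in store}
    ```

    Only chunks the server has never seen cross the network; the ordered fingerprint list is all that has to be recorded to describe the recovery point.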

  11. jgagne Says:

    From Jay Gagne, Global Solution Architect, COPAN Systems

    1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.

    COPAN Systems - Copan Revolution 300 SIRM, COPAN Revolution 100T

    COPAN Systems provides a purpose built enterprise-class persistent data storage platform designed for long-term retention of persistent (fixed) data.

    The COPAN Systems persistent data platform differentiation starts by maximizing both scalability and access to data. While a traditional storage device provides 100 percent access, it is limited in scalability. On the other hand, traditional tape devices provide scalability but limit access. When coupled with de-duplication, the COPAN Systems persistent data storage platform is massively scalable while maintaining accessibility. This differentiation is only magnified with the addition of our de-duplication technology. The result is an incredibly cost effective and power efficient system built to scale to massive enterprise storage needs.

    2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?

    There are two main drivers pushing the momentum of de-duplication. First is the reduction of physical space, but as important or perhaps more so, is the reduction of network bandwidth required. WAN accelerators don’t help when you are sending the same data over and over again, but de-duplication before replication does help significantly reduce the bandwidth required. I believe these two elements are the driving force behind the buzz. The benefits include the decrease in RAW capacity, decrease in the costs of offsite storage and management and the instant access to data when it is needed.

    3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that theirs is best because it collapses two functions — de-dupe then ingest — into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?

    There are many questions a potential de-duplication customer must ask themselves before selecting the right de-duplication vendor. The process of demystifying de-duplication breaks down into 10 questions:

    1. How will de-dupe impact performance on backup/restores - and combined backup and restores? Depending on where in the backup cycle the de-duplication is performed, there will be impacts to your cycle. The same is true of the restore process as well since some solutions require a two-step restore process. COPAN Systems de-duplication solution is based on post processing, which means it does not impact the current backup scenario, and in many cases enhances it. The restoration of data from our solution is a single-step streamlined operation.

    2. How do I compare compression ratios? Scalability of a de-dupe solution is crucial. The actual de-dupe ratio will vary, but comparing each solution with a constant de-dupe ratio will clearly show the scalability of a given solution. Always investigate the assumptions the vendor is using for their de-dupe ratio to ensure consistent comparisons.

    3. How do I effectively scope the size of the storage solution? You need to not only factor in the immediate needs of the current environment, but also factor in the annual growth over time and the retention requirements. This will clearly show the scale (capacity and access), efficiency and complexity of managing each solution’s de-dupe infrastructure. Given the massive scalability of the COPAN Systems de-duplication solution and its efficient design, scaling over time is simple, efficient and cost effective.

    4. How do all the parts of the de-duplication solution communicate with each other? The more data that can be seen by a single system (or cluster of systems), the more efficient the solution will be at reducing the storage requirements, complexity of management and overall total cost of ownership. Also beware of the extra costs associated with having more units, appliances, IP and FC switch ports, etc.

    The COPAN Systems de-duplication platform was purpose built to minimize the amount of components while still delivering guaranteed access to data. Given the massive scalability and clustering options, it also provides the largest single data repository of up to 8PB in a single system.

    5. Is the solution file format aware? (i.e. does it understand the type of data being backed up?) Not all of the de-dupe solutions are aware of the file format. Being aware of the actual file type increases the efficiency of the solution. Some solutions only look at blocks of data without the ability to understand the whole file. The most efficient solutions have the ability to understand the file as well as break it into blocks to achieve maximum de-dupe efficiency. Both de-duplication approaches are candidates for implementation onto COPAN’s persistent data storage platform.

    6. How easily can I create tape media? The flexibility for creating tape media is essential to many organizations. Similar to the restore process, some solutions require a two-step process to re-hydrate the virtual tape first and then create the tape replica as a second step. The COPAN Systems de-duplication solution has the ability to easily create tape media in its original format using a single streamlined process.

    7. How does the product replicate data? The ability to replicate data is vital for a de-dupe solution. Any single block of data stored in the de-duped state may be part of many original backups. Having multiple copies of your de-dupe data will ensure the level of protection required for an enterprise-ready solution. The COPAN Systems de-duplication solution provides an efficient, bandwidth-friendly replication option.

    8. Should I be concerned with the disk type and disk failure rates? Given the volume and criticality of data being stored for five years or more in a de-duped environment, the protection of this data is essential. Drive failures can lead to data loss. Minimizing the number of drive failures will increase the level of protection for your data. Also, the type of disk (Fibre Channel, SATA, MAID) needs to be considered. Some de-dupe applications require high-end Fibre Channel disks and connectivity to meet performance requirements, while others can operate with lower cost drives and still achieve the necessary performance specifications. The need to use higher performing drives will increase the cost of the solution, especially when factoring in five years or more of annual growth.
    The COPAN Systems de-duplication solution uses Massive Array of Idle Disks (MAID) technology on SATA-based disk drives that provide 6X greater reliability. It also uses patented Disk Aerobics® technology to proactively monitor and ensure data integrity. The measured disk failure rate for COPAN Systems is 0.03 percent per year compared to an average of 4-5 percent with standard SATA storage devices.

    9. How does the de-duplication solution consume or conserve power? Does this help in my infrastructure costs? Given the amount of data stored in your de-dupe infrastructure, the floor space, power and cooling requirements should be considered when calculating the total cost of ownership. Since your de-dupe solution will be a massive repository, utilizing a purpose built archive platform will help guarantee data integrity as well as cost effectiveness. Given the fact that the COPAN Systems de-duplication solution is based on MAID, it guarantees power savings of up to 85 percent and up to 7X savings in data center floor space.

    10. What is the working life of the system and what migration strategy do you offer at that time? Many systems use standard transactional storage systems in the backend. These were designed for a 3-4 year technical refresh cycle. Based on the amount of data stored in a de-duplication platform and the product refresh cycle, thought must be given to how data will be migrated and how often you will be required to migrate it. Due to the massive scalability, increased reliability and low failure rate, COPAN Systems Persistent Data Storage Platform has a product life of 7-plus years.

    4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?

    The COPAN Systems de-duplication solution operates in the manner described, as do most, if not all, options in the marketplace. Since the differentiation is not in the algorithm, then where is it? I think the whole solution needs to be taken into account. It boils down to the same 10 questions above. If you don’t look at the big picture, you run the risk of choosing a solution that won’t meet your long-term needs.

    5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?

    To efficiently and effectively perform a disaster recovery, you would require a system in the recovery site that had the data replicated to it. The recovery then would be exactly like a normal restore function performed in the primary site.

    6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a subpoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?

    I believe that immutability refers to the changing of (or tampering with) data. COPAN Systems’ de-duplication does not change the data; rather, it stores it differently. Traditional backup applications have always done something similar by combining files (i.e. tarring them up). Since that, too, only stores the data differently (and more efficiently) and the data is easily returned to its original form, de-duplication is not very different.

    7. Some say that de-dupe obviates the need for encryption. What do you think?

    To be more clear, we need to say “tape encryption.” The greatest risk for data is when it is in motion, either on a network or on a truck. If encryption for data at rest was a requirement, we would see it in all tiers of storage, starting at the top. A de-duplication solution still needs to provide a means for encryption of data in motion. Potentially the network itself already has that ability. If the question was “Does de-duplication with encrypted replication obviate the need for physical tape encryption?” provided you can de-dupe and replicate everything you need to….yes it does. De-duplication can be used for “tape shredding” when the pointers to the data have been removed, just as if the “encryption keys” were lost for data that had been encrypted.

    8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?

    This one gets the good ol’ answer of “it depends”. Only a customer can decide what the objectives and goals are. It’s then a vendor’s job to deliver a solution to meet those goals. However, one of the benefits of a standardized format for the physical media is that it could be easily restored. And, more importantly, restored in any order. The above example leads me to believe I would have to restore an entire repository first before I could restore any data. That may make meeting your DR SLA of 24-72 hours for the mission critical data a bit of a challenge.

    9. Some vendors are claiming de-dupe is “green” — do you see it as such?

    The cynic in me wants to say that’s like saying just because I doubled the size of my disk drives, I am twice as power efficient. However at the end of the day, one of the best ways to determine storage power efficiency is Terabytes per Kilowatt. Given the fact that de-duplication does actually increase the amount of data stored per kilowatt, I would have to agree that it is “green.” That doesn’t mean that the storage platform the data is sitting on provides any efficiency. To be truly considered a “green” technology, I think you need to combine things like compression and de-dupe with a storage platform that expands upon the benefits. Then you will have a truly efficient solution, one that provides hundreds of Terabytes to Petabytes of storage per kilowatt, which COPAN does by powering on disk drives only when needed.

    10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?

    Anywhere there is a high occurrence of repetitive data is a candidate for de-duplication. It started with VTL (i.e. backup) because that is likely the location of the largest amount of repetitive data. There are many applications and general storage repositories throughout most customer environments that could benefit from de-duplication, such as user home directories and departmental data stores. De-duplication can be applied to data that has multiple generations, allowing the commonality of like records to be optimized: the “changed data” from each record generation is “factored,” then “common-factored” across the entire set of data. This approach is best utilized with file and record data types.

    11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?

    The move to hardware based de-dupe, or to be more accurate, hardware based hashing (since the HW cards don’t actually do the de-duping, they only create the index) is very likely, provided there is value either in time or cost or even better both. It’s not likely to be a differentiator since speed and cost are usually part of the vendor battleground anyway. An enterprise deduplication approach will need multiple processing/memory combinations for creating hash values within an enterprise system. There will be use cases for both approaches depending upon the implementation used by a storage vendor.

    Thanks, Jay.

  12. Peter E Says:

    1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.

    Company Name: Symantec
    Deduplication Products: NetBackup PureDisk, part of the NetBackup Platform

    Veritas NetBackup PureDisk provides optimized protection for decentralized data using data deduplication, resulting in reduced total storage consumed from backups by 10–50 times and reduced network bandwidth required for daily full backups by up to 500 times.
    NBU PureDisk:

    • Reduces complexity and risk from remote offices by allowing companies to eliminate tape, encrypt backup data, and centralize data protection in the data center.
    • Improves the return-on-investment (ROI) of disk-based backups versus traditional methods with a scalable and open software based storage system.
    • Centralizes data protection administration, management and compliance by providing a reliable and consistent backup and recovery process.
    • Controls and manages the retention of backup data and enables recovery from remote offices, the data center, or other sites.

    2. InfoPro has said that de-dupe is the number one technology that companies are seeking today - well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?

    Yes, deduplication is about more than just storage reduction. A deduplication engine can reduce the bandwidth required to move backup data as well. We refer to this as “client-side” deduplication. It can be deployed on a server to be protected, in a physical or virtual environment, and reduce the size of the backup at the source, before any data is moved. As a result the bandwidth requirements needed to move the data decrease dramatically. Client-side deduplication is very effective for remote office data and applications. For example, the NetBackup PureDisk client can send data directly to the data center over the WAN, eliminating the need for a remote backup application and/or remote backup storage. Client-side deduplication can also be an effective means to protect virtual server environments because of how it reduces the I/O requirements (90% less data, less bandwidth) and consequently reduces the backup load on a virtual host.

    The client-side deduplication approach eliminates the need for more than one full backup as it identifies changed blocks and only backs up the unique blocks. While every backup is a block incremental, a “full image” can be restored at any time. A casual observer familiar with backups may ask how this differs from “synthetic backups.” The difference lies in the size of the incremental backup and the data movement. Client-side deduplication records only the changed blocks in an incremental or subsequent backup pass, not every file that has changed. In a deduplicated file system, the file metadata references the new and existing blocks on disk, thus a new synthetic backup, ready for restore, is available immediately after the backup completes, without any data movement.
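
    As an illustration of how a block-incremental backup can still yield a restorable “full image”, here is a minimal sketch. It is not PureDisk’s actual protocol; the block size, hash choice, and data structures are assumptions made for clarity.

    ```python
    import hashlib

    # Minimal sketch of client-side deduplicated backup (not PureDisk's actual
    # protocol): the client fingerprints fixed-size blocks, ships only blocks the
    # server has not seen, and records the full fingerprint list per backup, so
    # every backup is restorable as a complete image with no extra data movement.

    BLOCK_SIZE = 128 * 1024          # assumed segment size

    server_blocks = {}               # fingerprint -> block data (the dedupe pool)
    backup_catalog = {}              # backup_id -> ordered fingerprint list

    def backup(backup_id, data: bytes):
        fingerprints = []
        sent = 0
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            if fp not in server_blocks:      # only unique blocks cross the wire
                server_blocks[fp] = block
                sent += len(block)
            fingerprints.append(fp)
        backup_catalog[backup_id] = fingerprints
        return sent                          # bytes actually transmitted

    def restore(backup_id) -> bytes:
        # Any backup is a "full image": just reassemble its fingerprint list.
        return b"".join(server_blocks[fp] for fp in backup_catalog[backup_id])
    ```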

    NetBackup supports both client-side deduplication and target-side deduplication; the latter is what most storage companies posting here offer. In target-side deduplication, the deduplication engine lives on the storage device. We often place NetBackup PureDisk in this category for convenience’s sake, but what we really offer is proxy-side deduplication. By proxy, I mean that we use our NetBackup server (specifically the media server component) as a proxy to perform the deduplication process. With this approach a customer can increase throughput on both backups and restores with additional media servers.

    3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that their’s is best because it collapses two functions - de-dupe then ingest - into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?

    Deduplication is an important component to any backup strategy, but it needs to be used based on RTO and RPO requirements of data (and the business). There is no single factor for selecting a solution, but rather a series of factors that should be considered.

    All of the other responses seemed to immediately jump to explaining the pros & cons of the secret sauce approaches. We believe this to be only one of several selection criteria.

    GATING FACTORS IN SELECTING THE RIGHT DEDUPE TECHNOLOGY

    • Dedupe Process / Efficiency (bit, byte, block, chunk, etc…)
    • Integration with the backup application
    • Hardware Flexibility & Cost
    • Scalability
    • High Availability
    • Disaster Recovery

    First, let’s clear up earlier misconceptions where someone replied that “in-line” deduplication somehow impacts the “data validation process” and results in “false fingerprint compares”. This is simply FUD.

    DEDUPE PROCESS / EFFICIENCY: The efficiency factor refers to the data reduction effectiveness of the deduplication process, as this will impact how much storage a customer buys.

    The level of granularity in your dedupe process will affect the performance of your dedupe solution and the storage consumed. In other words, 4 KB blocks produce 16 times more pointers and block comparisons than 64 KB blocks. This is why NetBackup PureDisk has a default block size of 128 KB. Years of implementations have shown this to be an excellent point at which to achieve both optimization and performance. We also allow the customer to choose the size of the segment for different backup jobs. The size can range from 64 KB to 16 MB.
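
    A quick back-of-the-envelope calculation shows how segment size drives index size. The dataset size and per-entry overhead below are assumed figures for illustration, not PureDisk internals.

    ```python
    # Back-of-the-envelope check of how segment size drives index size.
    # Dataset size and per-entry overhead are assumed figures for illustration.

    dataset_bytes = 10 * 2**40                 # 10 TB of backup data (assumed)
    entry_bytes = 64                           # assumed per-fingerprint overhead

    for block_kb in (4, 64, 128):
        blocks = dataset_bytes // (block_kb * 1024)
        index_gb = blocks * entry_bytes / 2**30
        print(f"{block_kb:>4} KB segments: {blocks:>14,} entries, ~{index_gb:,.1f} GB of index")
    ```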

    NetBackup PureDisk detects dedupe patterns using a hash-based approach that combines two hashes for identification and verification.
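
    The identify-then-verify pattern can be sketched as follows. The specific hash algorithms here (MD5 for lookup, SHA-1 for verification) are assumptions for illustration and not necessarily what PureDisk uses.

    ```python
    import hashlib

    # Sketch of a two-hash dedupe check: one hash to look a segment up in the
    # index, a second independent hash to verify the match before the segment
    # is treated as a duplicate. Hash choices are illustrative assumptions.

    index = {}  # lookup_hash -> verify_hash

    def is_duplicate(segment: bytes) -> bool:
        lookup = hashlib.md5(segment).hexdigest()
        verify = hashlib.sha1(segment).hexdigest()
        if index.get(lookup) == verify:
            return True                    # both hashes agree: treat as duplicate
        index.setdefault(lookup, verify)   # record new segments; keep the first entry on a rare lookup collision
        return False
    ```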

    As stated earlier, we can place our dedupe engine in two places – on a client (or source server) and within the NetBackup server. As such, PureDisk deduplication is supported both in-line (during the backup) and as a post-process operation from staging disk on a NetBackup media server (when using PureDisk inside of NetBackup). With our dedupe engine on the NetBackup media server we can increase throughput by spreading the load across multiple NetBackup media servers (load balanced).

    Integration with Backup Application – Replication of backup data without backup application awareness creates storage and management headaches. How does an administrator know when to delete an image? In a disaster recovery scenario, how does the backup application handle recovery of data not in its catalog? The NetBackup Platform eliminates that problem by providing a means for the backup application to manage the replication and deletion of duplicate images, wherever they may be. This functionality is available with NBU PureDisk as well as with qualified OpenStorage partners (some of whom have posted here).

    Hardware Flexibility & Cost – We asked ESG to write a whitepaper on the differences between hardware and software-based deduplication. We encourage readers to check it out.

    [Link Redacted -- DD does not link to analyst papers, especially the pay-per-view variety, unless it is to poke fun. If they think ESG papers are useful, readers can find the link on Symantec's web site. -- The Management]

    NetBackup PureDisk is software-based which means that you can build out a deduplication system with legacy storage or new storage. In fact, you can even use different types of storage within a given storage pool or across locations.

    So we think customers should consider how a deduplication solution might lock you into specific hardware, and ask whether it can be used with your legacy datacenter servers and storage.

    Scalability – Questions to consider here include the following:
    • How does the deduplication solution scale in performance and capacity?
    • If capacity is being added, does this increase your aggregate dedupe pool of storage or create another pool?
    • How can the aggregate performance of the solution be increased without major reconfiguration of the backup environment?

    NetBackup PureDisk delivers scalability by breaking apart several components: where dedupe occurs, where metadata is stored, and where file content data is stored. PureDisk stores metadata in a metabase engine and file content data in a Content Router. These are the two primary components of our storage pool.

    The benefit to this approach for customers is that performance and storage can be improved by adding additional servers with any one or both of these components. Both the metabase engine and content router components are horizontally scalable. So when you want to expand capacity with NetBackup PureDisk, you add another content router node (each node holds 8 TB of dedupe data – much more backup data). PureDisk automatically load balances the content across the two content routers to improve performance of backup and restore. The same concept can be applied to the metabase engine, which is an integrated relational database, where we store file references. In short, when needed, aggregate performance can be improved by adding additional nodes (a server with one or both of these components).
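
    The metabase/content-router split can be pictured with a small sketch. The routing rule below (fingerprint modulo node count) and the class names are invented for illustration; they are not PureDisk’s actual placement algorithm.

    ```python
    import hashlib

    # Sketch of separating file metadata from deduplicated content and spreading
    # the content across nodes. The routing rule (fingerprint modulo node count)
    # is an illustrative assumption, not PureDisk's placement algorithm.

    class ContentRouterNode:
        def __init__(self):
            self.chunks = {}               # fingerprint -> chunk bytes

    class StoragePool:
        def __init__(self, node_count=2):
            self.nodes = [ContentRouterNode() for _ in range(node_count)]
            self.metabase = {}             # file path -> ordered fingerprint list

        def _node_for(self, fp):
            return self.nodes[int(fp, 16) % len(self.nodes)]

        def store(self, path, chunks):
            """chunks: list of byte strings making up the file, in order."""
            fps = []
            for chunk in chunks:
                fp = hashlib.sha256(chunk).hexdigest()
                self._node_for(fp).chunks.setdefault(fp, chunk)   # dedupe per node
                fps.append(fp)
            self.metabase[path] = fps      # only references live in the metabase

        def read(self, path):
            return b"".join(self._node_for(fp).chunks[fp]
                            for fp in self.metabase[path])
    ```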

    High Availability – Does the deduplication solution have high-availability failover to spare nodes built in to protect it against server (node or controller) failure? What happens when one controller or node goes down in a distributed storage system?

    NetBackup PureDisk can protect against this with integrated high availability using Veritas Cluster Server.

    See question 5 for more details on the Symantec solution.

    Disaster Recovery – Does the deduplication solution have recovery features to recover data in case of disk failure or data corruption on disk?

    NetBackup PureDisk provides several disaster recovery options including optimized replication, reverse replication, and of course the ability to recover a complete system from tape.

    Again, see question 5 for more detail on the Symantec solution.

    4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?

    At the highest level of abstraction, all dedupe systems use pointers to reference duplicate blocks, bytes, or bits. And this is how NetBackup PureDisk operates. With this in mind, customers should think about how the deduplication architecture, specifically the storage rather than the deduplication engine, tracks and manages those references and what happens when the number of references grows very large. For example, when your dedupe system has grown to hundreds of terabytes of information, how does the expiration of backup data affect the system?
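
    One common way to handle that reference bookkeeping is reference counting, sketched below. Whether any particular product uses refcounts or a garbage-collection style sweep is not stated in this thread; the sketch is illustrative only.

    ```python
    # Sketch of reference counting as one way to expire backup images from a
    # dedupe pool. This is an illustration, not a description of any product.

    block_refs = {}      # fingerprint -> reference count
    block_store = {}     # fingerprint -> block data
    images = {}          # image_id -> ordered fingerprint list

    def ingest(image_id, fingerprint_blocks):
        """fingerprint_blocks: iterable of (fingerprint, block bytes) pairs."""
        fps = []
        for fp, data in fingerprint_blocks:
            block_store.setdefault(fp, data)
            block_refs[fp] = block_refs.get(fp, 0) + 1
            fps.append(fp)
        images[image_id] = fps

    def expire(image_id):
        for fp in images.pop(image_id):
            block_refs[fp] -= 1
            if block_refs[fp] == 0:        # no remaining image references this block
                del block_refs[fp]
                del block_store[fp]        # space can be reclaimed
    ```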

    When you expand your deduplication system with another node (or controller) are you expanding the same dedupe pool or creating another pool of storage? If you can grow a single pool, how does the system balance metadata and content data (the blocks) across the whole system?

    The architecture of NetBackup PureDisk offers the best of both worlds by storing metadata in a horizontally scalable database, as opposed to the file system, and content blocks in a horizontally scalable file system. The separation of these two components improves scalability and performance.

    5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?

    This question appears to have two parts – standard recovery and disaster recovery. It seems not all responses have consistently addressed the DR scenarios for their dedupe systems. I will address how a regular recovery works and how DR works for NetBackup PureDisk.

    For standard data recovery with NetBackup PureDisk, a customer selects the data they wish to recover and initiates the recovery request. The data is reassembled by the PureDisk dedupe engine. We deliver this engine in two different places – on the PureDisk client for client-side deduplication and in the NetBackup media server for proxy deduplication. Besides network speed and disk speed, the two primary variables that affect restore speed are how fast you can get that data to the dedupe engine and how many engines you run. First, if a customer needs to get data to the PureDisk dedupe engine more quickly, they can add additional storage nodes (we call these content routers) to increase performance. Second, we can increase the dedupe engine speed, so to speak, with additional engines, or specifically NetBackup media servers. These engines can run in parallel on the NetBackup media servers to restore data even more quickly. Finally, PureDisk stores the most recent backup data in such a way that it can be recalled more quickly than older data.

    NetBackup PureDisk provides protection against disk failure by using Storage Foundation (SF) in combination with hardware array-based redundant array of independent disks (RAID) or SF software RAID protection. Storage Foundation can also manage multiple storage paths for PureDisk to provide redundancy and performance. Protection against node failure, by failing over to a spare node in the storage pool, is provided using Veritas Cluster Server (VCS). PureDisk can also provide a scripted manual failover in cases where VCS is not desired. VCS can provide protection against network failure in an HA configuration; PureDisk can provide native protection against network failure in non-HA configurations. Protection against site failure is provided using PureDisk’s native replication capability to perform bandwidth-optimized replication from the datacenter to a DR site. The PureDisk storage pool, including the deduplicated backup data, can be protected by using PureDisk’s optimized Disaster Recovery (DR) backup capability with NetBackup. This NetBackup integration enables users to perform both incremental backups of a multi-node storage pool, including configuration and all deduplicated data, to any medium (including tape), and synthetic full images to improve recovery times.

    Similar to the data center, for remote sites where DR backup is not possible onsite, both configuration and backup data from a remote storage pool can be replicated to a data center, which allows for fast recovery of a remote storage pool to a spare system in the datacenter.

    Finally, PureDisk can also export data (out of a dedupe state) to NetBackup to create standard tapes of backup data at desired intervals for long term data vaulting or archival purposes.

    PureDisk software, related configuration information, and the relevant data are all required to recover any data written into a PureDisk storage pool.

    6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a subpoena or discovery request? Does de-dupe conflict with the non-repudiation requirements of certain laws?

    We have not encountered this legal question. Data deduplication does not change the underlying content of the data; it merely breaks the data up into pieces to store it more efficiently.

    7. Some say that de-dupe obviates the need for encryption. What do you think?

    While it is true that the layout of deduped data differs from a standard file system, someone could potentially inspect the data and reassemble information from the blocks stored on disk. The first time a dedupe engine encounters a new file with all unique data, it will need to send every block to storage; in this manner, someone could reassemble a file or data from the pieces. Each block could also contain sensitive data; thus even if the whole file cannot be easily reassembled from the blocks, encryption will still be required.

    Though physical and network security may exist within the data center and when transferring data between sites, we find some customers still want additional levels of security, and encryption provides that additional layer. Symantec’s PureDisk offers client-side encryption of data for those customers that need an additional layer of security. PureDisk goes beyond this and offers a feature called Data Lock, which allows users outside of IT, such as HR or legal, to add a password to a backup selection and prevent browsing and/or recovery of data without a unique password that is separate from application access controls.

    8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?

    Tape is an excellent medium for sequential read/write processes. Data deduplication is the epitome of a random access process, with variation driven by the location of metadata, the number of nodes, the type of data, etc.

    The need/requirement to back up a dedupe system to tape stems from a disaster recovery concern. As companies reduce the copies of data down to one, they become more reliant on that single copy. Similarly, as the size of a dedupe system containing single copies of backup data grows, the need to have periodic recovery points in case of some type of corruption or disaster increases. PureDisk replication can provide recovery in case of local disaster or corruption.

    If no replication can be implemented, the PureDisk storage pool, including the deduped backup data, can be protected by using PureDisk’s optimized Disaster Recovery (DR) backup capability with NBU. This NetBackup integration enables users to perform incremental-forever backups of a multi-node storage pool, including configuration and all deduplicated data, to any NBU medium (including tape), and to synthesize them into full backups for faster recovery.

    In addition, PureDisk can export deduplicated backup data to NetBackup, in which the data is indeed re-inflated prior to writing it to tape (or any other NBU supported media). This feature supports customers that have a tape archive requirement for long term data retention or compliance. Data is written to tape in standard NBU format which is accepted as a long term data retention format.

    Writing data in deduplicated form to tape for long term retention (and single file restore) does not seem feasible: e.g. a file that consists of 100 blocks could potentially require 100 tapes to recover from. This is not practically usable.

    9. Some vendors are claiming de-dupe is “green” - do you see it as such?

    Yes, we see dedupe as a “green type” technology because it allows customers to store more data on a given amount of disk in the data center. If we assume that in lieu of dedupe disk, a customer were to use regular disk, then a savings in floor space and electricity has already been realized.

    For long-term retention of backup or archive data (e.g., beyond 1-2 years), tape may become the preferred storage medium when the data is no longer expected to be accessed. The assumption here would be that the number of recovery points required would drop such that a weekly or monthly full would be sufficient.

    10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?

    We addressed how a deduplication engine can be deployed out to the physical or virtual server to back up data in bandwidth or I/O constrained environments. Deduplication storage can be an excellent media for both backup and archive data with medium term (months range) retention times.

    Symantec released its OpenStorage API last year, which allows customers to better leverage the capabilities of intelligent disk systems (including deduplication appliances) without having to go through the limiting intermediate tape emulation step.

    Deduplication also enables on-line vaulting and disaster recovery. As the amount of data is dramatically reduced, replication of the data over the WAN to a DR site becomes economically viable, eliminating the need for tape collection and vaulting services.

    11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?

    One submission’s statement that all “software based dedupe” presents a challenge because it involves multiple agents or deduplication points is a mischaracterization of software-based deduplication. The heart of software-based dedupe is both the dedupe engine and the storage architecture that supports it. We have a dedupe engine for our clients as well as for our backup media server.

    Again, per an earlier question Enterprise Strategy Group recently wrote an interesting paper for Symantec entitled “Differentiating Hardware and Software-based Data De-duplication.” Symantec’s software approach to deduplication lies in the PureDisk storage pool architecture where we separate out metadata and content into two horizontally scalable components called the metabase engine and the content router (see previous answers).

    With regard to compression, it is important to understand that compression and deduplication are radically different. Because compression only looks for repetitive patterns within a single file, it is fairly easy to build the algorithm and look-up cache into a chip. Deduplication compares patterns in new incoming data against the total dataset already stored in the deduplicated backend. While these accelerator chips can accelerate parts of the process, such as MD5 or other fingerprint calculation, the whole deduplicated storage system still requires software to control the global index, data removal, scalability, HA and DR.

    Symantec is looking into supporting some of these hardware boards through the appropriate drivers.

    Thanks,
    Peter

    No, thank you, Peter.

  13. Jered Says:

    Jon,

    Thanks for the opportunity to comment on this.

    > 1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.

    Permabit Technology Corporation delivers Permabit Enterprise Archive, a disk-based storage system with standard NAS interfaces. Permabit Enterprise Archive provides enterprise class archival storage with the flexibility and speed of disk, but at or below the cost of tape. The system includes Scalable Data Reduction, combining traditional compression with sub-file deduplication, and has a grid architecture that uniquely allows scaling to petabytes of real, physical disk (and many more times that of data).

    > 2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?

    If all that data were junk, we wouldn’t have this problem! Pretty much all the analysts point to data growth rates of 60 to 80% annually. While some of this data is perhaps unnecessary, the bulk of it does need to be kept around for future reference. Digital documents keep growing in both size and volume, and either regulations require that they be kept around, or businesses see value in later data mining.

    Most of this data is being dumped onto existing primary storage, and those primary storage environments (the very costly “junk drawers” out there) keep growing — at an average cost of around $43/GB. That’s an outrageous price, and the number one driver for deduplication. Customers don’t want deduplication, per se; what they want is cheaper storage. Deduplication is a great way of helping deliver that, but it’s only one way in which Permabit drives down costs.

    > 3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that their’s is best because it collapses two functions — de-dupe then ingest — into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?

    Requirements will differ by use case; dedupe for backup is different from dedupe for archive. For the market space we address, Enterprise Archive, we see four key factors:

    - Scalability: Enterprise Archive environments range from 50 terabytes to multiple petabytes today, and if current growth rates are sustained, a 100TB archive will be over 3PB by 2012. To see significant cost savings, the archive must be managed and maintained as a single unit. Any storage system for archive, deduplicating or not, must be able to scale to petabytes of real physical disk, not pie-in-the-sky 50X deduped data.

    - Cost: To escape the pain of growing primary storage costs, an enterprise archive has to deliver a major change in storage costs. Primary storage averages $43/GB; Permabit is $5/GB before any savings due to deduplication. With even 5X deduplication that realized cost is $1/GB, and competitive with tape offerings. Deduplication is not the feature; low cost of acquisition is the feature. On top of that, we deliver lower TCO through ease of management, and by eliminating the need to ever migrate data to a new system by having hardware upgrades managed entirely internal to our system.

    - Availability: Archive data must be always available. When data is required, it needs to be available in milliseconds, not hours or days. Legal discovery may require a full response in as little as three days, and tape is just not a valid option.

    - Reliability: An Enterprise Archive system must be as reliable as possible, as it may hold the only remaining copy of a critical piece of information. Tapes don’t cut it — failure rates are quoted as high as 20%. Even RAID 6 shows weakness when considered across petabytes of data and dozens of years.

    > 4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?

    I wouldn’t use the word “stub”, but otherwise that’s a generally fair statement. As data is ingested into a Permabit Enterprise Archive system, we break it up into variable-sized sub-file chunks. For each of those chunks, we determine if it already exists anywhere in the system; if not, we store it. A file is then a list of named chunks that, in order, contain all the data for the file. This is not terribly different from a file in a conventional file system, which is just a list of named disk blocks that, in order, contain all the data for that file. We simply have variable sized “blocks”, and those “blocks” may be in use by multiple files, if they contain the same data.
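
    The variable-sized chunking described above is generally done with content-defined chunking; here is a minimal sketch. The rolling checksum, boundary mask, and chunk size limits are arbitrary illustrative choices, not Permabit’s actual parameters.

    ```python
    # Minimal sketch of content-defined chunking, the general technique behind
    # variable-sized sub-file chunks. The toy rolling hash, boundary mask, and
    # size limits below are arbitrary illustrative choices, not Permabit's.

    MASK = 0x1FFF            # boundary roughly every 8 KB on average
    MIN_CHUNK = 2 * 1024
    MAX_CHUNK = 64 * 1024

    def chunk(data: bytes):
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            rolling = ((rolling << 1) + byte) & 0xFFFFFFFF   # toy rolling hash
            length = i - start + 1
            at_boundary = (rolling & MASK) == 0 and length >= MIN_CHUNK
            if at_boundary or length >= MAX_CHUNK:
                chunks.append(data[start:i + 1])
                start, rolling = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])
        return chunks    # a file is then an ordered list of these chunks' names (hashes)
    ```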

    > 5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?

    For disaster recovery purposes, Permabit Enterprise Archive incorporates replication features that allow replication to a remote site, be it another office, a data center, or a service provider. Permabit’s replication takes advantage of our Scalable Data Reduction (SDR), our combination of compression and sub-file deduplication, to minimize bandwidth over the WAN.

    > 6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a supoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?

    Again, dedupe does not change data any more than compression changes data, or traditional file systems change data. Plain old LZW compression gives you a different output bitstream than what went in, with redundant parts removed. Conventional file systems break up files into blocks and scatter those blocks across one or more disks, requiring complicated algorithms to retrieve and return the data. Dedupe is no different. Nonrepudiation requirements are satisfied by the reliability and immutability of the system as a whole, deduplicating or not.

    > 7. Some say that de-dupe obviates the need for encryption. What do you think?

    Anyone who says that is selling snake oil; would you care to name names here? Encryption technologies make it mathematically infeasible to determine the contents of a message (or file) without the cipher’s key. Dedupe has nothing to do with this — however, the two technologies can be combined. Permabit uses AES, the current federal encryption standard, for both data protection on disk and over the wire.

    > 8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?

    Deduplicating to tape is fine, as long as the data set is entirely self-contained, and the only sort of restore expected is a full-system restore. If you have a dedupe pool across multiple tapes, a restore operation will turn into a messy experience of “please insert tape number 263”, and if the restore is not a full-system restore, the performance will be terrible due to seeking along the tape for each individual chunk.

    For the case of the “NDMP-like” feature I’d have to understand the use case better; there are certainly sensible things I can imagine.

    > 9. Some vendors are claiming de-dupe is “green” — do you see it as such?

    Certainly; it’s as green as any other technology that reduces the number of disks spinning. 10X dedupe means 10X fewer disk spindles. Larger drive capacities are green too.

    > 10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?

    This question is asking a few different things. The first thing: dedupe and VTL come up together frequently because VTL is a blindingly obvious use case. The driving factor behind VTL, versus other backup-to-disk technologies, is that the only thing that needs to change in the environment is swapping the tape library for the VTL. No need to rearchitect your backup scheme, no need to change the software, just plug and play. So, VTL vendors tell customers to just keep doing what they’re doing, which involves things like weekly full backups that don’t really make sense in the disk world. Of course VTL vendors can get 25X dedupe — they’re telling their customers to write the same data 25 times!

    The second thing is that backup and archive are very different things. Backups are generally additional copies of data you have elsewhere, and backups are things that you hope you never, ever have to use. They don’t have to be completely reliable, because you have many copies of the same data on other tapes. They don’t have to be always available, because you have a nightly backup window. Archives, on the other hand, contain the last and final copy of data that you don’t need right now, but probably will in the future. These need to be completely reliable and available.

    As I talked about above, dedupe is very important in archives as well, strictly from the perspective of cost savings. But it’s also much harder to dedupe archives, because you don’t have the built-in advantage that VTL backups have — telling customers to save the same data over and over. Building deduplication for archives is a much harder problem, because you have to work harder to find opportunities for dedupe, and you must be able to scale to enormous amounts of disk. In the archive space, you can’t sell your 30TB box as a “one petabyte” appliance.

    > 11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?

    Anyone who’s pitching “hardware deduplication” is just selling a coprocessor that helps with operations common to deduplication, like cryptographic hashing. If hashing is the performance bottleneck for a vendor, adding in a hardware accelerator will help; if it isn’t, it won’t. Software vs. hardware deduplication will have no user-visible differences other than perhaps performance, but generally the hashing isn’t the part that’s resource intensive, it’s the indexing of all the data in the system. Oh, and hardware dedupe systems will be more expensive, because it’s one more piece of hardware to buy and put in the box.

  14. TonyLovesLinux Says:

    Hi Jon,
    Response from IBM here:

  15. Administrator Says:

    A late response from Sepaton.

    1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.

    Company: SEPATON
    Dedupe Product: DeltaStor® ContentAware™ deduplication
    SEPATON’s DeltaStor technology is a software feature for existing S2100-ES2 VTL solutions. It leverages the grid architecture of the S2100-ES2 and can scale capacity or performance independently to meet the needs of enterprise customers.

    2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?

    Absolutely, that is why SEPATON introduced a ContentAware approach to deduplication, which makes it possible to significantly leverage the content that is deduplicated. Solutions should have an inherent understanding of the data that is being stored, including the application type, and should allow deduplication to be turned on or off depending on business and regulatory requirements. In addition, metadata about the content should also be stored, enabling much more efficient content indexing and search, and therefore the ability to meet discovery requests. Deduplication solutions should not simply perpetuate the “storage junk drawer” scenario and should instead enable much higher value functions for IT and the business to leverage. That is SEPATON’s approach.

    3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that their’s is best because it collapses two functions — de-dupe then ingest — into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?

    Customers need to understand the problem that they are trying to solve. Deduplication provides a reduction in disk footprint, but is not a panacea. Typically, we find that customers have core business SLAs that they need to meet around data protection.

    Customers need to evaluate solutions around how they can meet these requirements in a cost effective manner. Some solutions focus on a single system metaphor where they provide separate and independent boxes with limited capacity and performance metrics.

    Implementing these solutions will typically require multiple separate instances, which adds to complexity and cost. Some vendors also aggressively promote inline deduplication, which typically results in a decrease in performance and limits capacity within the appliance. Concurrent-process solutions like SEPATON’s DeltaStor typically don’t have these limitations, but will initially require some incremental disk space.

    In short, the customer must first evaluate their data protection requirements:

    • What is their backup window?

    • Do they have requirements on restore time? (Remember, restore performance impacts not just DR, but also physical tape creation.)

    • What is their data growth rate?

    Once customers understand their requirements they should then look for a deduplication solution that meets those needs. All too often, we see customers taking the opposite approach, where they decide they need dedupe for whatever reason without giving thought to the impact of the technology on their SLAs and costs.

    4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?

    At a high level, you are correct that all deduplication algorithms do the same thing. They use varying approaches to identify what is unique data and what is redundant and then replace the redundant data with pointers. The interesting thing is that while the high level process is the same, the vendors use radically different approaches to process the data and these approaches can offer dramatically different metrics around scalability, deduplication ratios, performance and TCO.

    SEPATON leverages ContentAware technology for our DeltaStor deduplication. Through it we gather information about the content of the backup at the object level to identify objects that contain duplicate data. By narrowing the search we can then compare data at the byte level for much more granular deduplication. Additionally, we can perform the various deduplication activities across multiple nodes allowing us to easily scale deduplication performance. This approach enables DeltaStor to find more redundancies and to outperform other solutions.
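
    The two-stage idea (match likely-duplicate objects first, then compare them at byte level) can be sketched with Python’s difflib standing in for a real delta encoder. None of this reflects DeltaStor’s actual implementation; the function names are invented.

    ```python
    import difflib

    # Sketch of the two-stage idea: first pair up likely-duplicate objects (for
    # example, the same file across backup generations), then compare them at
    # byte level and keep only the changed ranges. difflib stands in for a real
    # delta encoder; this is illustrative only.

    def byte_delta(old: bytes, new: bytes):
        """Return the new object as copy/insert instructions against the old one."""
        ops = []
        matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2))         # byte range shared with the old version
            else:
                ops.append(("insert", new[j1:j2]))   # literal bytes unique to the new version
        return ops

    def apply_delta(old: bytes, ops) -> bytes:
        out = bytearray()
        for op in ops:
            if op[0] == "copy":
                out += old[op[1]:op[2]]
            else:
                out += op[1]
        return bytes(out)
    ```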

    5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?

    It depends. In SEPATON’s case, the newest data (i.e. the last backup) is kept in its native, non-de-duplicated format. Older versions of data are de-duplicated. Once de-duplication is accomplished, all information necessary to reconstruct that data (the fragments of unique data plus any required pointers) is kept in SEPATON’s filesystem directly and no longer requires any “recipe” – data is directly recoverable. In particular, the filesystem is built to be robust and reliable (i.e. self-discoverable, self-healing, redundant, etc.).
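
    A rough sketch of this forward-differencing idea, keeping the newest backup whole and re-expressing older generations as deltas against it, follows. The delta encoding (difflib opcodes) and storage layout are invented for illustration; this is not DeltaStor’s implementation.

    ```python
    import difflib

    # Sketch of forward differencing: the newest backup stays in native form,
    # and the previous generation is re-expressed as a delta against it.

    def delta(base: bytes, target: bytes):
        """Encode target as copy/insert ops against base."""
        ops = []
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
                None, base, target, autojunk=False).get_opcodes():
            ops.append(("copy", i1, i2) if tag == "equal" else ("insert", target[j1:j2]))
        return ops

    def patch(base: bytes, ops) -> bytes:
        out = bytearray()
        for op in ops:
            out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
        return bytes(out)

    class GenerationStore:
        def __init__(self):
            self.latest = None     # (backup_id, full image bytes) of the newest backup
            self.deltas = {}       # backup_id -> (ops against the next-newer backup, its id)

        def add_backup(self, backup_id, data: bytes):
            if self.latest is not None:
                prev_id, prev_data = self.latest
                # The older generation becomes a delta against the new full image.
                self.deltas[prev_id] = (delta(data, prev_data), backup_id)
            self.latest = (backup_id, data)

        def restore(self, backup_id) -> bytes:
            if backup_id == self.latest[0]:
                return self.latest[1]          # newest restore: no re-assembly needed
            ops, newer_id = self.deltas[backup_id]
            return patch(self.restore(newer_id), ops)
    ```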

    SEPATON further believes that our appliance should be transparent to the data protection environment. That is, it should work with existing policies and/or procedures. Most customers use the VTL as the primary target for their backups on their local site.

    Most large enterprises still have a substantial investment in tape and prefer to use that medium for long-term archival. In these environments, the VTL will hold the data onsite for local restores and un-deduplicated tapes will be created for offsite storage by the backup application. This ensures that the tapes are fully recoverable in a remote site even without the VTL.

    Also remember, restore performance is vital here since the process of creating tapes depends on data being read from the VTL at high speed. DeltaStor’s forward differencing technology maintains a complete copy of the newest backup ensuring the fastest restores on the data with no re-assembly required.

    A replication solution is also offered. This product integrates with the backup application and replicates data to a remote VTL based on policies established within the backup software. In this case, both VTLs will hold deduplicated data. The DR process in this scenario is essentially the same as described above, since the remote VTL will present itself to the remote backup server as a tape library and drives that exactly match the ones on the primary site.

    Customers need choice, and SEPATON offers multiple solutions. They can maintain tape procedures and use tape for DR or they can use SEPATON replication and use a second system for their remote site. Either way, there is very little change in the customer’s policies or procedures.

    6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a supoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?

    As previously mentioned, deduplication does not change what data is available for recovery or restore. It simply changes the way it is stored. However with limited case law on deduplication, there is still some uncertainty here. A potential issue is that some deduplication algorithms rely on hashing for deduplication.

    There is a known risk of hash collisions in these algorithms which would result in silent data corruption. While the likelihood is small, it is still a possibility and it is unclear what the legal implications are. Some approaches, like that used in DeltaStor, avoid relying on hashes for this exact reason.
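
    For readers who want a sense of scale, a birthday-bound estimate of collision probability is sketched below. The hash widths and block counts are assumed figures, not a claim about any particular product.

    ```python
    # Rough birthday-bound estimate of silent-collision probability for a
    # hash-based dedupe index: p is roughly n^2 / 2^(b+1) for n unique blocks
    # and a b-bit hash. Block counts and hash widths are assumed examples.

    def collision_probability(unique_blocks: int, hash_bits: int) -> float:
        return unique_blocks ** 2 / 2 ** (hash_bits + 1)

    for n in (10**9, 10**12):                 # a billion and a trillion unique blocks
        for bits in (128, 160, 256):          # e.g. MD5-, SHA-1-, SHA-256-sized hashes
            p = collision_probability(n, bits)
            print(f"n={n:.0e}, {bits}-bit hash: p ~ {p:.3e}")
    ```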

    In the end, the decision regarding your question comes down to the customer. They have to decide for themselves about these issues, and what we can do is provide them a tested, low-risk, high-availability platform to leverage as a part of their corporate data governance practices. We have seen some cases where customers prefer to avoid deduplicating certain data types, backup jobs or even servers which they deem most likely to be subpoenaed. Many solutions do not allow the flexibility to enable or disable deduplication by application, which SEPATON’s DeltaStor does. In short, a customer should examine their legal and discovery requirements carefully and should take the requirements into consideration when evaluating deduplication options.

    7. Some say that de-dupe obviates the need for encryption. What do you think?

    These two technologies solve different problems. Encryption is about limiting data access and preventing inappropriate parties from accessing private data. In many environments, encryption is based on military grade algorithms that are virtually impossible to decipher without the appropriate key, and typically encryption strength is valued over performance.

    Deduplication, on the other hand, is designed to reduce the footprint of data on disk. It allows customers to store more data in a smaller footprint. Performance is typically an important element of a deduplication solution because it can be a bottleneck in data protection. Most of the solutions are based on NAS and/or VTL access methods, and while they provide access controls, they are not designed to provide the level of protection of military-grade encryption.

    In summary, encryption and data deduplication are technologies targeted at two different problems. In fact these technologies can be complementary and we have seen many companies looking to use the two technologies together.

    8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?

    It depends on the customer’s need, but it seems like it could be a challenge. By breaking data into smaller chunks and creating pointers, deduplication essentially fragments data. This works in a random access environment as seen on a disk subsystem. Once you move the deduplicated data to tape, your tape now contains fragmented data, which is impossible to restore directly. Instead, all tapes from a de-duplication “NDMP-like” backup will need to first be restored onto the disk-based system, and only then is access to the backup data possible.

    Finally, customers need to think about accessing their data in the future. The beauty of today’s backup applications is that they use a consistent tape format, so you can be confident that data written can be recovered. As soon as you create a proprietary tape format, as suggested here, the customer becomes completely dependent on the deduplication system for all future restore requirements. This may not seem like a huge problem in the near term, but what if you need to restore the data in two years?

    9. Some vendors are claiming de-dupe is “green” — do you see it as such?

    It depends what you are comparing it to. It is clearly greener than non-deduped disk; it is unclear how more or less green it is than physical tape. That said, most customers are implementing or have implemented disk in the datacenter for data protection due to its reliability and performance profile. Many customers are actively looking to implement non-deduped disk to retain more data onsite. Deduplication solutions are often considered instead of implementing more traditional disk. In these environments, it is clear that deduplication provides strong green benefits.

    10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?

    The redundant nature of backup data makes it an ideal target for deduplication. In what other environments are you making a full and completely redundant copy of your data on a weekly basis? Thus data protection is naturally the first market for deduplication because it provides the opportunity for substantial disk savings.

    Going forward, we would anticipate seeing deduplication available across a wide range of storage devices. It is unlikely that you will ever see it in high-end Fibre Channel arrays where performance is the number one priority, but we would expect to see similar technologies implemented in a wide variety of second-tier storage applications. The deduplication ratios experienced will likely be much lower than in data protection environments, but they can still provide footprint savings.

    11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?

    The whole concept of software vs. hardware deduplication is a bit confusing. We make dedicated VTL appliances that are deduplication enabled: would you consider that a hardware or software solution? Does your opinion change when we tell you that we specifically engineer our appliances for performance by optimizing the software for the included hardware infrastructure? In the end, all deduplication solutions rely on some kind of software to run.

    The Hifn card accelerates the creation of hashes for hash-based deduplication solutions. Remember, these algorithms include numerous different steps of which creating the hash is only one small step. Thus adding a Hifn card to one of these solutions does not necessarily mean that the performance will suddenly skyrocket; there are numerous other elements that could bottleneck performance. This brings me back to the first point which is that at the lowest level all deduplication solutions are based off of software and the distinction between “software” and “hardware” deduplication is vague.

    Customers should not focus on whether a solution is hardware or software based, but rather on how individual solutions meet their business requirements.

    Thanks for the responses, Sepaton.