This article contains summaries from most of the industry players, so I'm reposting it here in full in case it disappears from drunkendata.com:
Invitation to De-Duplication Vendors
There are some questions I would like to get answers to in the area of De-Duplication. I am hoping that some of the vendor readers of this blog will help out.
Here is an opportunity to shine, folks, and to tell the world why, what, how and where. Here is the question list. You can either respond on line through comment cut and paste or email me your response at jtoigo@toigopartners.com and I will put your response on-line for you. From where I am sitting, these are the kinds of questions that consumers would ask.
1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.
2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?
3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that theirs is best because it collapses two functions, de-dupe then ingest, into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?
4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?
5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?
6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a subpoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?
7. Some say that de-dupe obviates the need for encryption. What do you think?
8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?
9. Some vendors are claiming de-dupe is “green” — do you see it as such?
10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?
11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?
Thanks in advance for your response.
 
 
 
 
 
April 21st, 2008 at 11:07 pm
If Diligent is the best, then I am curious to understand why it sold for less than $200M.
April 22nd, 2008 at 10:40 am
IBM hasn’t revealed how much it spent on Diligent. Not sure where you are getting your numbers. Also, no one, except maybe IBM, has suggested that Diligent was best.
April 22nd, 2008 at 10:56 am
Chris P, over at eWeek, has pointed to this “quiz” and encouraged de-dupe vendors to open their corporate kimonos. Thanks, Chris.
April 23rd, 2008 at 1:27 am
I got the $200M number from here:
http://www.byteandswitch.com/document.asp?doc_id=151339
April 23rd, 2008 at 7:47 am
IBM said in its conference call that they did not, as a matter of company policy, disclose acquisition prices. I don’t know where B&S got its numbers or if they are accurate.
April 24th, 2008 at 8:48 pm
Jon,
As you know I’m not a vendor, but I play blogger at InformationWeek. Starting with question 4: your description of deduping as using stubs isn’t a good analogy.
Think of a deduped data store as a file system. In the case of a NAS device like a Data Domain or NetApp A-SIS, it really is a file system. In a VTL, think of each virtual tape as a file.
Somewhere there’s a directory that says the file FOO.BAR is stored on blocks 123-345, 500-510 and 12999-14090. That’s true of ANY file system. The difference between the deduped and normal file system is that more than one file can use the same block. If I edit FOO.BAR, add 10Kbytes to the end and save it as FOO2.BAR at some point (real time or later) the deduper will recognize (via hashes or a byte by byte compare) that my file has the same data and will build a directory entry that says FOO2.BAR uses 123-345, 500-510, 12999-14090 and 66666-66669. So the second file takes up just 10K bytes.
Now the file system needs to keep track of how many files point to each block and update that list when files are deleted.
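The directory-plus-shared-blocks model described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not any vendor's implementation: the `DedupStore` class, the `make_block` helper, the fixed 4096-byte block size and the use of SHA-256 as the block identifier are all chosen for illustration. Fixed blocks work here only because the edit appends data on a block boundary.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


class DedupStore:
    """Toy deduplicating store: a directory of files over shared blocks."""

    def __init__(self):
        self.blocks = {}     # block hash -> block bytes, stored once
        self.refcount = {}   # block hash -> number of references from files
        self.directory = {}  # file name -> ordered list of block hashes

    def write(self, name, data):
        if name in self.directory:
            self.delete(name)  # overwriting releases the old references
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            key = hashlib.sha256(block).hexdigest()
            if key not in self.blocks:      # new data: store it
                self.blocks[key] = block
            self.refcount[key] = self.refcount.get(key, 0) + 1
            recipe.append(key)
        self.directory[name] = recipe

    def read(self, name):
        return b"".join(self.blocks[key] for key in self.directory[name])

    def delete(self, name):
        for key in self.directory.pop(name):
            self.refcount[key] -= 1
            if self.refcount[key] == 0:     # last reference gone: free block
                del self.refcount[key]
                del self.blocks[key]

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


def make_block(seed, size=BLOCK_SIZE):
    """Deterministic filler so each seed yields a distinct block."""
    return hashlib.sha256(str(seed).encode()).digest() * (size // 32)


store = DedupStore()
foo = b"".join(make_block(i) for i in range(10))       # 40 KB original
tail = make_block(100) + make_block(101) + make_block(102, 2048)
foo2 = foo + tail                                      # original plus 10 KB
store.write("FOO.BAR", foo)
store.write("FOO2.BAR", foo2)
print(store.stored_bytes() - len(foo))  # extra space used by FOO2.BAR: 10240
```

Deleting FOO.BAR afterwards frees nothing, because FOO2.BAR still references every one of its blocks; the reference counts are what make that safe.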
Re 5: I reject that dedupe is changing data. It’s storing it differently. Now LZS compression is changing data and AES encryption is changing data, but dedupe as I described it above (which is a good enough description of all the techniques and 99% accurate for NetApp) isn’t. Strictly speaking, RLL encoding in the disk drive is modifying the data.
Re 6: Not any more than LZS or AES does. The truth is those regulations mean “tamper with or change the meaning” when they say “no modify.”
7 - No, it doesn’t.
8 - The only use of tape for deduped data would be to backup/restore the WHOLE deduped data store in one fell swoop.
9 - If I dedupe and store 1/20th the data on 1/20th the drives using 1/20th the power it seems greenish. Tape is greener as I blogged a couple days ago.
10 - If you think about hash based dedupe and CAS you could use dedupe to replace any of the online archive apps CAS is used for. Riverbed and Silver Peak use it for WAN acceleration and NetApp is pitching it for primary file storage. The downside is that reading files back is slower because it’s not a sequential read as it would be if the file were on contiguous blocks. In fact reading from a deduped store is VERY much like reading from a badly fragmented disk on a file server. Since these are devices made for backup and restore they could use long read-ahead queues to speed it up.
11 - The hard part in deduping is finding the right places to divide data into blocks. Think of the corporate file server. There are 10,000 Word docs with the corporate logo embedded. If the blocking algorithm can put that logo into a block by itself you’ll get much better data reduction than if it uses fixed size 4K blocks.
The other hard part is building the index so you can QUICKLY check if a block being stored now has been stored before.
All the HiFn card does is calculate the hashes for blocks. So chips can help but there’s no such thing as chip dedupe.
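To illustrate the point about where to divide data into blocks, here is a minimal sketch comparing fixed 4K blocks with content-defined chunking on two documents that embed the same "logo" at different offsets. Everything in it is an assumption for illustration: the window, chunk-size limits and divisor are arbitrary, and a CRC of the last few bytes stands in for the rolling hash a real product would use for speed.

```python
import random
import zlib

WINDOW = 16       # bytes examined when deciding whether to cut
MIN_CHUNK = 64    # never cut before this many bytes
MAX_CHUNK = 1024  # always cut by this many bytes
DIVISOR = 256     # a cut is declared at roughly 1 in 256 positions


def content_defined_chunks(data):
    """Cut wherever a checksum of the last WINDOW bytes hits a target value."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        length = end - start
        if length < MIN_CHUNK:
            continue
        window = data[end - WINDOW:end]
        if zlib.crc32(window) % DIVISOR == 0 or length >= MAX_CHUNK:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks


def fixed_chunks(data, size=4096):
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_bytes(chunks_a, chunks_b):
    """Bytes in chunks_b whose exact content also appears in chunks_a."""
    seen = set(chunks_a)
    return sum(len(c) for c in chunks_b if c in seen)


def random_bytes(rng, n):
    return bytes(rng.getrandbits(8) for _ in range(n))


rng = random.Random(2008)
logo = random_bytes(rng, 20000)   # stand-in for the embedded corporate logo
doc_a = random_bytes(rng, 100) + logo + random_bytes(rng, 3000)
doc_b = random_bytes(rng, 517) + logo + random_bytes(rng, 5000)

fixed = shared_bytes(fixed_chunks(doc_a), fixed_chunks(doc_b))
cdc = shared_bytes(content_defined_chunks(doc_a), content_defined_chunks(doc_b))
print("bytes shared with fixed 4K blocks:", fixed)
print("bytes shared with content-defined chunks:", cdc)
```

Because the cut points depend on the data rather than on the offset, the two chunkings should fall into step somewhere inside the logo and share chunks from there on, while the misaligned fixed blocks share nothing.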
Howard Marks
Backup and Business Continuity Blogger
April 25th, 2008 at 8:16 am
Thanks for your feedback, Howard. The questions on the list are for clarification from the vendors, none of whom — by the way — have seen fit to respond as yet. The points you make are very valid, but the questions are not a reflection of my misunderstanding of de-dupe as much as they are concerns raised to me by consumers who really don’t understand how de-dupe works.
Stubbing is still a technique used by certain products, though not by all. I wanted vendors to clarify what techniques they actually use. As for the other questions, consumers believe that de-dupe is changing data, that it imposes a hit on access speeds, that it may jeopardize compliance. I have actually had several de-dupe vendors tell me that de-duped data “is already encrypted.”
Bottom line: there are equal parts hype and marketecture around the technologies in play. Lots of players are doing things differently. There are no standards for doing it at all. Hence, the questionnaire.
Thanks again for your thoughtful insights. I hope some of the vendors actually chime in.
April 30th, 2008 at 11:20 am
Larry Freeman, Senior Marketing Manager, Storage Efficiency Solutions, Network Appliance, writes
Thanks, Larry.
April 30th, 2008 at 11:25 am
From Bill Andrews, CEO of ExaGrid.
Thanks, Bill.
May 5th, 2008 at 8:14 am
May 9th, 2008 at 10:06 am
From Jay Gagne, Global Solution Architect, COPAN Systems
Thanks, Jay.
May 9th, 2008 at 8:21 pm
No, thank you, Peter.
May 27th, 2008 at 10:47 am
Jon,
Thanks for the opportunity to comment on this.
> 1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.
Permabit Technology Corporation delivers Permabit Enterprise Archive, a disk-based storage system with standard NAS interfaces. Permabit Enterprise Archive provides enterprise class archival storage with the flexibility and speed of disk, but at or below the cost of tape. The system includes Scalable Data Reduction, combining traditional compression with sub-file deduplication, and has a grid architecture that uniquely allows scaling to petabytes of real, physical disk (and many more times that of data).
> 2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?
If all that data were junk, we wouldn’t have this problem! Pretty much all the analysts point to data growth rates of 60 to 80% annually. While some of this data is perhaps unnecessary, the bulk of it does need to be kept around for future reference. Digital documents keep growing in both size and volume, and either regulations require that they be kept around, or businesses see value in later data mining.
Most of this data is being dumped onto existing primary storage, and those primary storage environments (the very costly “junk drawers” out there) keep growing — at an average cost of around $43/GB. That’s an outrageous price, and the number one driver for deduplication. Customers don’t want deduplication, per se; what they want is cheaper storage. Deduplication is a great way of helping deliver that, but it’s only one way in which Permabit drives down costs.
> 3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that theirs is best because it collapses two functions, de-dupe then ingest, into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?
Requirements will differ by use case; dedupe for backup is different from dedupe for archive. For the market space we address, Enterprise Archive, we see four key factors:
- Scalability: Enterprise Archive environments range from 50 terabytes to multiple petabytes today, and if current growth rates are sustained, a 100TB archive will be over 3PB by 2012. To see significant cost savings, the archive must be managed and maintained as a single unit. Any storage system for archive, deduplicating or not, must be able to scale to petabytes of real physical disk, not pie-in-the-sky 50X deduped data.
- Cost: To escape the pain of growing primary storage costs, an enterprise archive has to deliver a major change in storage costs. Primary storage averages $43/GB; Permabit is $5/GB before any savings due to deduplication. With even 5X deduplication that realized cost is $1/GB, and competitive with tape offerings. Deduplication is not the feature; low cost of acquisition is the feature. On top of that, we deliver lower TCO through ease of management, and by eliminating the need to ever migrate data to a new system by having hardware upgrades managed entirely internal to our system.
- Availability: Archive data must be always available. When data is required, it needs to be available in milliseconds, not hours or days. Legal discovery may require a full response in as little as three days, and tape is just not a valid option.
- Reliability: An Enterprise Archive system must be as reliable as possible, as it may hold the only remaining copy of a critical piece of information. Tapes don’t cut it — failure rates are quoted as high as 20%. Even RAID 6 shows weakness when considered across petabytes of data and dozens of years.
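The cost argument in the list above is simple division, sketched here in Python. The $43/GB and $5/GB figures are the ones quoted in the response; the function name and the 1X and 10X ratios are added for illustration only.

```python
def effective_cost_per_gb(raw_cost_per_gb, dedupe_ratio):
    """Cost per GB of logical data when each physical GB holds dedupe_ratio GB."""
    if dedupe_ratio <= 0:
        raise ValueError("dedupe_ratio must be positive")
    return raw_cost_per_gb / dedupe_ratio


PRIMARY = 43.0  # $/GB, average primary storage cost quoted above
ARCHIVE = 5.0   # $/GB, archive cost quoted above, before deduplication

for ratio in (1, 5, 10):
    cost = effective_cost_per_gb(ARCHIVE, ratio)
    print(f"{ratio}X dedupe: ${cost:.2f}/GB, "
          f"{PRIMARY / cost:.1f}x cheaper than primary")
```

At 5X the result is the $1/GB figure given in the response.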
> 4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?
I wouldn’t use the word “stub”, but otherwise that’s a generally fair statement. As data is ingested into a Permabit Enterprise Archive system, we break it up into variable-sized sub-file chunks. For each of those chunks, we determine if it already exists anywhere in the system; if not, we store it. A file is then a list of named chunks that, in order, contain all the data for the file. This is not terribly different from a file in a conventional file system, which is just a list of named disk blocks that, in order, contain all the data for that file. We simply have variable sized “blocks”, and those “blocks” may be in use by multiple files, if they contain the same data.
> 5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to backup your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?
For disaster recovery purposes, Permabit Enterprise Archive incorporates replication features that allow replication to a remote site, be it another office, a data center, or a service provider. Permabit’s replication takes advantage of our Scalable Data Reduction (SDR), our combination of compression and sub-file deduplication, to minimize bandwidth over the WAN.
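As a rough illustration of why deduplication reduces replication traffic, the sketch below sends a file to a remote site and transfers only the chunks the remote does not already hold. It is not Permabit's protocol: the `replicate` and `restore` functions, the dictionary standing in for the remote store, and SHA-256 as the chunk name are all assumptions.

```python
import hashlib


def chunk_ids(chunks):
    """Name each chunk by the hash of its contents."""
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]


def replicate(file_chunks, remote_store):
    """Copy a file to remote_store, sending only chunks it lacks.

    Returns the file's recipe (ordered chunk names) and the bytes sent.
    """
    recipe = chunk_ids(file_chunks)
    bytes_sent = 0
    for cid, chunk in zip(recipe, file_chunks):
        if cid not in remote_store:   # remote already has it? send nothing
            remote_store[cid] = chunk
            bytes_sent += len(chunk)
    return recipe, bytes_sent


def restore(recipe, store):
    """Rebuild a file at the recovery site from its recipe."""
    return b"".join(store[cid] for cid in recipe)


remote = {}
version1 = [b"a" * 100, b"b" * 100, b"c" * 100]
version2 = version1 + [b"d" * 50]          # same file with 50 bytes appended

recipe1, sent1 = replicate(version1, remote)
recipe2, sent2 = replicate(version2, remote)
print(sent1, sent2)                        # 300 50
```

The recovery site needs both the chunk store and the recipes; with those, `restore(recipe2, remote)` returns the full second version even though only 50 bytes crossed the wire for it.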
> 6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a subpoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?
Again, dedupe does not change data any more than compression changes data, or traditional file systems change data. Plain old LZW compression gives you a different output bitstream than what went in, with redundant parts removed. Conventional file systems break up files into blocks and scatter those blocks across one or more disks, requiring complicated algorithms to retrieve and return the data. Dedupe is no different. Nonrepudiation requirements are satisfied by the reliability and immutability of the system as a whole, deduplicating or not.
> 7. Some say that de-dupe obviates the need for encryption. What do you think?
Anyone who says that is selling snake oil; would you care to name names here? Encryption technologies make it mathematically infeasible to determine the contents of a message (or file) without the cipher’s key. Dedupe has nothing to do with this; however, the two technologies can be combined. Permabit uses AES, the current federal encryption standard, for both data protection on disk and over the wire.
> 8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?
Deduplicating to tape is fine, as long as the data set is entirely self-contained, and the only sort of restore expected is a full-system restore. If you have a dedupe pool across multiple tapes, a restore operation will turn into a messy experience of “please insert tape number 263”, and if the restore is not a full-system restore, the performance will be terrible due to seeking along the tape for each individual chunk.
For the case of the “NDMP-like” feature I’d have to understand the use case better; there are certainly sensible things I can imagine.
> 9. Some vendors are claiming de-dupe is “green” — do you see it as such?
Certainly; it’s as green as any other technology that reduces the number of disks spinning. 10X dedupe means 10X fewer disk spindles. Larger drive capacities are green too.
> 10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?
This question is asking a few different things. The first thing: dedupe and VTL come up together frequently because VTL is a blindingly obvious use case. The driving factor behind VTL, versus other backup to disk technologies, is that the only thing that needs to change in the environment is swapping the tape library for the VTL. No need to rearchitect your backup scheme, no need to change the software, just plug and play. So, VTL vendors tell customers to just keep doing what they’re doing, which involves things like weekly full backups that don’t really make sense in the disk world. Of course VTL vendors can get 25X dedupe: they’re telling their customers to write the same data 25 times!
The second thing is that backup and archive are very different things. Backups are generally additional copies of data you have elsewhere, and backups are things that you hope you never, ever have to use. They don’t have to be completely reliable, because you have many copies of the same data on other tapes. They don’t have to be always available, because you have a nightly backup window. Archives, on the other hand, contain the last and final copy of data that you don’t need right now, but probably will in the future. These need to be completely reliable and available.
As I talked about above, dedupe is very important in archives as well, strictly from the perspective of cost savings. But it’s also much harder to dedupe archives, because you don’t have the built-in advantage that VTL backups have — telling customers to save the same data over and over. Building deduplication for archives is a much harder problem, because you have to work harder to find opportunities for dedupe, and you must be able to scale to enormous amounts of disk. In the archive space, you can’t sell your 30TB box as a “one petabyte” appliance.
> 11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?
Anyone who’s pitching “hardware deduplication” is just selling a coprocessor that helps with operations common to deduplication, like cryptographic hashing. If hashing is the performance bottleneck for a vendor, adding in a hardware accelerator will help; if it isn’t, it won’t. Software vs. hardware deduplication will have no user-visible differences other than perhaps performance, but generally the hashing isn’t the part that’s resource intensive, it’s the indexing of all the data in the system. Oh, and hardware dedupe systems will be more expensive, because it’s one more piece of hardware to buy and put in the box.
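The split between hashing and indexing can be sketched with a toy content-addressed store. This is a minimal illustration under assumptions of my own (fixed-size chunks, SHA-256, an in-memory dict as the index, a hypothetical `ChunkStore` class); it does not describe any particular product.

```python
import hashlib

class ChunkStore:
    """Toy dedupe store: one index entry per unique chunk."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.index = {}    # digest -> position in self.chunks
        self.chunks = []   # unique chunk payloads

    def write(self, data):
        """Store data and return the list of digests needed to rebuild it."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()   # the part a coprocessor could offload
            if digest not in self.index:              # the index lookup happens on every chunk
                self.index[digest] = len(self.chunks)
                self.chunks.append(chunk)
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original bytes from a recipe of digests."""
        return b"".join(self.chunks[self.index[d]] for d in recipe)

store = ChunkStore(chunk_size=4)
recipe = store.write(b"AAAABBBBAAAACCCC")
print(len(recipe), len(store.index))   # 4 chunks written, 3 unique chunks indexed
print(store.read(recipe))              # b'AAAABBBBAAAACCCC'
```

The hash is a fixed amount of work per chunk, while the index grows with every unique chunk in the system and is consulted on every write. In this toy it is a dict in memory; at real scale that lookup structure is the part the comment above calls resource intensive.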
May 30th, 2008 at 7:13 pm
Hi Jon,
Response from IBM here:
June 18th, 2008 at 11:06 am
A late response from Sepaton.
1. Please provide the name of your company and the de-dupe product(s) you sell. Please summarize what you think are the key values and differentiators of your wares.
2. InfoPro has said that de-dupe is the number one technology that companies are seeking today — well ahead of even server or storage virtualization. Is there any appeal beyond squeezing more undifferentiated data into the storage junk drawer?
3. Every vendor seems to have its own secret sauce de-dupe algorithm and implementation. One, Diligent Technologies (just acquired by IBM), claims that theirs is best because it collapses two functions (de-dupe then ingest) into one in-line function, achieving great throughput in the process. What should be the gating factors in selecting the right de-dupe technology?
4. Despite the nuances, it seems that all block level de-dupe technology does the same thing: removes bit string patterns and substitutes a stub. Is this technically accurate or does your product do things differently?
5. De-dupe is changing data. To return data to its original state (pre-de-dupe) seems to require access to the original algorithm plus stubs/pointers to bit patterns that have been removed to deflate data. If I am correct in this assumption, please explain how data recovery is accomplished if there is a disaster. Do I need to back up your wares and store them off site, or do I need another copy of your appliance or software at a recovery center?
6. De-dupe changes data. Is there any possibility that this will get me into trouble with the regulators or legal eagles when I respond to a subpoena or discovery request? Does de-dupe conflict with the nonrepudiation requirements of certain laws?
7. Some say that de-dupe obviates the need for encryption. What do you think?
8. Some say that de-duped data is inappropriate for tape backup, that data should be re-inflated prior to write to tape. Yet, one vendor is planning to enable an “NDMP-like” tape backup around his de-dupe system at the request of his customers. Is this smart?
9. Some vendors are claiming de-dupe is “green” — do you see it as such?
10. De-dupe and VTL seem to be joined at the hip in a lot of vendor discussions: Use de-dupe to store a lot of archival data on line in less space for fast retrieval in the event of the accidental loss of files or data sets on primary storage. Are there other applications for de-duplication besides compressing data in a nearline storage repository?
11. Just suggested by a reader: What do you see as the advantages/disadvantages of software based deduplication vs. hardware (chip-based) deduplication? Will this be a differentiating feature in the future… especially now that Hifn is pushing their Compression/DeDupe card to OEMs?
Thanks for the responses, Sepaton.