Deduplication Internals: Part 1
Deduplication is one of the hottest technologies in the current market because of its ability to reduce costs. But it comes in many flavours, and organizations need to understand each of them to choose the one that best fits their needs. Deduplication can be applied to data in primary storage, backup storage, cloud storage, or data in flight for replication, such as LAN and WAN transfers. In short, it offers the following benefits:
– Substantially saves disk space, reducing storage requirements and hardware,
– Improves bandwidth efficiency,
– Improves replication speed,
– Shrinks the backup window and improves RTO and RPO objectives,
– and finally, reduces COST.
What is data deduplication?
This concept is a familiar one that we see daily: a URL is a type of pointer. When someone shares a video on YouTube, they send the URL for the video instead of the video itself. There’s only one copy of the video, but it’s available to everyone. Deduplication uses this concept in a more sophisticated, automated way.
Data deduplication is a technique to reduce storage needs by eliminating redundant or duplicate data in your storage environment. Only one unique copy of the data is retained on the storage media, and redundant or duplicate data is replaced with a pointer to that unique copy.
That is, it looks at the data at a sub-file (i.e. block) level and attempts to determine whether it has seen the data before. If it hasn’t, it stores it. If it has, it ensures the data is stored only once, and all other references to that duplicate data are merely pointers.
How does data deduplication work?
Dedupe technology typically divides data into smaller chunks/blocks and assigns each chunk a unique hash identifier called a fingerprint. To create the fingerprint, it uses an algorithm that computes a cryptographic hash value from the chunk’s contents, regardless of the data type. These fingerprints are stored in an index.
The deduplication algorithm compares the fingerprint of each data chunk/block to those already in the index. If the fingerprint exists in the index, the chunk is replaced with a pointer to the existing copy. If the fingerprint does not exist, the data is written to disk as a new unique chunk.
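The fingerprint-and-index flow above can be sketched in a few lines of Python. Everything here is a simplifying assumption on my part (SHA-256 as the cryptographic hash, a tiny fixed chunk size, a Python dict as the index, a list as the "disk"); real products differ in all of these, but the insert-or-point decision is the same:

```python
import hashlib

CHUNK_SIZE = 4     # unrealistically small, just to make the example readable

index = {}         # fingerprint -> position of the unique chunk on "disk"
disk = []          # stand-in for the storage media

def write(data: bytes) -> list:
    """Split data into fixed-size chunks; return a list of pointers
    (positions of the unique chunks on 'disk')."""
    pointers = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()   # the fingerprint
        if fp not in index:        # new unique chunk: write it once
            disk.append(chunk)
            index[fp] = len(disk) - 1
        pointers.append(index[fp]) # duplicate: keep only a pointer
    return pointers

def read(pointers: list) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(disk[p] for p in pointers)

p = write(b"AAAABBBBAAAACCCC")         # the chunk "AAAA" occurs twice
print(p)                               # [0, 1, 0, 2]
print(len(disk))                       # 3 unique chunks for 4 logical chunks
print(read(p) == b"AAAABBBBAAAACCCC")  # True: the data reconstructs exactly
```

Four logical chunks are backed by only three physical ones; the second occurrence of "AAAA" costs nothing but a pointer, which is exactly the space saving dedupe promises.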
Different types of deduplication – There are several broad classifications of dedupe methods:
1- Based on the Technology, or how it is done.
Fixed-Length or Fixed Block Deduplication
Variable-Length or Variable Block Deduplication
Content Aware or application-aware deduplication
2- Based on the Process, or when it is done.
In-line (or as I like to call it, synchronous) de-duplication
Post-process (or as I like to call it, asynchronous) de-duplication
3- Based on the Type, or where it happens.
Source or Client-side Deduplication
Target or Storage-side Deduplication
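To see why the fixed-length vs variable-length distinction in the first classification matters, consider what a single inserted byte does to fixed-length chunking. In this illustrative sketch (hypothetical data, SHA-256 fingerprints, an 8-byte chunk size chosen only for readability), the insertion shifts every later block boundary, so almost no fingerprints match even though the two streams are nearly identical; variable-length (content-defined) chunking exists to avoid exactly this:

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    """Naive fixed-length chunking: cut at every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprints(chunks):
    """Fingerprint each chunk with SHA-256 (a stand-in hash choice)."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}

original = b"The quick brown fox jumps over the lazy dog!"
shifted  = b"XThe quick brown fox jumps over the lazy dog!"  # one byte inserted

# Fingerprints shared between the two versions:
shared = fingerprints(fixed_chunks(original)) & fingerprints(fixed_chunks(shifted))
print(len(shared))   # 0 -> no chunk deduplicates after the one-byte shift
```

One byte of change destroys all chunk matches here; a content-aware or variable-length chunker would resynchronize its boundaries after the insertion and keep deduplicating the unchanged tail.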
My next post will discuss these dedupe technologies and processes in detail.