Deduplication Internals – Hash based deduplication : Part-2
Now in the second part we discuss about the process or technique used to do the deduplication.
Different data de-duplication products use different methods of breaking up the data into elements or chunks or blocks, but each product uses some technique to create a signature or identifier or fingerprint for each data element. As shown in the below figure, the data store contains the three unique data elements A, B, and C with a distinct signature. These data element signature values are compared to identify duplicate data. After the duplicate data is identified, one copy of each element is retained, pointers are created for the duplicate items, and the duplicate items are not stored.
The basic concepts of data de-duplication are illustrated in below.
Based on the Technologies, or how how it is done. Two methods frequently used used for de-duplicating data are hash based and content aware.
1 – Hash based Deduplication
Hash based data de-duplication methods use a hashing algorithm to identify “chunks” of data. Commonly used algorithms are Secure Hash Algorithm 1 (SHA-1) and Message-Digest
Algorithm 5 (MD5). When data is processed by a hashing algorithm, a hash is created that represents the data. A hash is a bit string (128 bits for MD5 and 160 bits for SHA-1) that
represents the data processed. If you processed the same data through the hashing algorithm multiple times, the same hash is created each time.
Here are some examples of hash codes:
MD5 – 16 byte long hash
– # echo “The Quick Brown Fox Jumps Over the Lazy Dog” | md5sum
– # echo “The Quick Brown Fox Dumps Over the Lazy Dog” | md5sum
SHA-1 – 20 byte long hash
– # echo “The Quick Brown Fox Jumps Over the Lazy Dog” | sha1sum
– # echo “The Quick Brown Fox Dumps Over the Lazy Dog” | sha1sum
Hash based de-duplication breaks data into “chunks”, either fixed or variable length, and processes the “chunk” with the hashing algorithm to create a hash. If the hash already exists, the data is deemed to be a duplicate and is not stored. If the hash does not exist, then the data is stored and the hash index is updated with the new hash.
In Figure 1-2, data “chunks” A, B, C, D, and E are processed by the hash algorithm and creates hashes Ah, Bh, Ch, Dh, and Eh; for purposes of this example, we assume this is all new data.
Later, “chunks” A, B, C, D, and F are processed. F generates a new hash Fh. Since A, B, C, and D generated the same hash, the data is presumed to be the same data, so it is not stored again. Since F generates a new hash, the new hash and new data are stored.
A – Fixed-Length or Fixed Block
In this data deduplication algorithm, it breaks the Data in to chunks or block, and the block size or block boundaries is Fixed like 4KB, or 8KB etc. And the block size never changes. While different devices/solutions may use different block sizes, the block size for a given device/solution using this method remains constant.
The device/solution always calculates a fingerprint or signature on a fixed block and sees if there is a match. After a block is processed, it advances by exactly the same size and take another block and the process repeats.
Requires the minimum CPU overhead, and fast and simple
Because the block size or block boundaries is Fixed, the main limitation of this approach is that when the data inside a file is shifted, for example, when adding a slide to a Microsoft PowerPoint deck, all subsequent blocks in the file will be rewritten and are likely to be considered as different from those in the original file. Smaller block size give better deduplication than large ones, but it takes more processing to deduplicate. Larger block size give low depulication, but it takes less processing to deduplicate.
So the Bottom line is Less storage savings and not efficient.
B – Variable-Length or Variable Block
In this data deduplication algorithm, it breaks the Data in to chunks or block, and the block size or block boundaries is variable like 4KB, or 8KB or 16KB etc. And the block size changes dynamically during the entire process. The device/solution always calculates a fingerprint or signature on a variable block size and sees if there is a match. After a block is processed, it advances by taking another block size and take another blocks and the process repeats.
Higher deduplication ratio, high storage space savings.
While the variable block deduplication may yield slightly better deduplication than the fixed block deduplication approach, it does require you to pay a price. The price being the CPU cycles that must be spent in trying to determine the file boundaries. The variable block approach requires more processing than fixed block because the whole file must be scanned, one byte at a time, to identify block boundaries.
Check out the following example based on the following sentence will explain in detail: “deduplication technologies are becoming more an more important now.”
Notice how the variable block deduplication has some funky block sizes. While this does not look too efficient compared to fixed block, check out what happens when I make a correction to the sentence. Oops… it looks like I used ‘an’ when it should have been ‘and’. Time to change the file: “deduplication technologies are becoming more and more important now.” File –> Save
After the file was changed and deduplicated, this is what the storage subsystem saw:
The red sections represent the changed blocks that have changed. By adding a single character in the sentence, a ‘d’, the sentence length shifted and more blocks suddenly changed. The Fixed Block solution saw 4 out of 9 blocks changed. The Variable Block solution saw 1 out of 9 blocks changed.
Variable block deduplication ends up providing a higher storage density and good storage space savings.
2 – Content Aware or application-aware Deduplication
My next blog will be about the content aware dedupe.