Deduplication Internals – Hash based Deduplication: Part 2

In this second part, we discuss the processes and techniques used to perform deduplication.

Different data de-duplication products use different methods of breaking up the data into elements, chunks, or blocks, but each product uses some technique to create a signature, identifier, or fingerprint for each data element. As shown in the figure below, the data store contains three unique data elements A, B, and C, each with a distinct signature. These signature values are compared to identify duplicate data. After the duplicate data is identified, one copy of each element is retained, pointers are created for the duplicate items, and the duplicate items are not stored.

The basic concepts of data de-duplication are illustrated below.

[Figure: basic data de-duplication concept – unique elements A, B, and C retained, duplicates replaced by pointers]

Based on the technology used, or how de-duplication is done, two methods are frequently used: hash based and content aware.

1 – Hash based Deduplication

Hash based data de-duplication methods use a hashing algorithm to identify “chunks” of data. Commonly used algorithms are Secure Hash Algorithm 1 (SHA-1) and Message-Digest Algorithm 5 (MD5). When data is processed by a hashing algorithm, a hash is created that represents the data. A hash is a bit string (128 bits for MD5 and 160 bits for SHA-1) that represents the data processed. If you process the same data through the hashing algorithm multiple times, the same hash is created each time.

Here are some examples of hash codes:
MD5 – 16-byte hash
– # echo "The Quick Brown Fox Jumps Over the Lazy Dog" | md5sum
9d56076597de1aeb532727f7f681bcb0
– # echo "The Quick Brown Fox Dumps Over the Lazy Dog" | md5sum
5800fccb352352308b02d442170b039d
SHA-1 – 20-byte hash
– # echo "The Quick Brown Fox Jumps Over the Lazy Dog" | sha1sum
f68f38ee07e310fd263c9c491273d81963fbff35
– # echo "The Quick Brown Fox Dumps Over the Lazy Dog" | sha1sum
d4e6aa9ab83076e8b8a21930cc1fb8b5e5ba2335
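The same property can be shown with Python's hashlib (a sketch of my own, not part of the original examples). Note that the shell commands above hash the string plus the trailing newline that echo appends, so the digests below will differ from those, but the key point holds: the same input always produces the same fixed-length hash.

```python
import hashlib

# Identical input always yields an identical, fixed-length digest.
text = b"The Quick Brown Fox Jumps Over the Lazy Dog"

print(hashlib.md5(text).hexdigest())    # 32 hex chars = 16 bytes (128 bits)
print(hashlib.sha1(text).hexdigest())   # 40 hex chars = 20 bytes (160 bits)
print(hashlib.md5(text).hexdigest() == hashlib.md5(text).hexdigest())  # True
```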

Hash based de-duplication breaks data into “chunks”, either fixed or variable length, and processes the “chunk” with the hashing algorithm to create a hash. If the hash already exists, the data is deemed to be a duplicate and is not stored. If the hash does not exist, then the data is stored and the hash index is updated with the new hash.

In Figure 1-2, data “chunks” A, B, C, D, and E are processed by the hash algorithm, creating hashes Ah, Bh, Ch, Dh, and Eh; for the purposes of this example, we assume this is all new data.

Later, “chunks” A, B, C, D, and F are processed. F generates a new hash, Fh. Since A, B, C, and D generate the same hashes as before, the data is presumed to be identical and is not stored again. Since F generates a new hash, both the new hash and the new data are stored.

[Figure 1-2: hash-based de-duplication – chunks A–E hashed and stored, then chunks A–D deduplicated and F stored]
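To make the lookup-then-store flow concrete, here is a minimal Python sketch of a hash index, assuming SHA-1 fingerprints and an in-memory dictionary; the DedupStore class and its method names are my own illustrative choices, not from any specific product.

```python
import hashlib

class DedupStore:
    def __init__(self):
        self.index = {}    # hash -> stored chunk (the "hash index")
        self.refs = []     # per-chunk pointers into the index

    def write_chunk(self, chunk: bytes) -> str:
        h = hashlib.sha1(chunk).hexdigest()   # fingerprint the chunk
        if h not in self.index:               # new hash -> store data and hash
            self.index[h] = chunk
        # duplicate hash -> only a pointer is kept, the data is not stored again
        self.refs.append(h)
        return h

store = DedupStore()
first_backup  = [b"A", b"B", b"C", b"D", b"E"]   # all new data
second_backup = [b"A", b"B", b"C", b"D", b"F"]   # only F is new

for chunk in first_backup + second_backup:
    store.write_chunk(chunk)

print(len(store.refs), "chunks written,", len(store.index), "chunks stored")
# -> 10 chunks written, 6 chunks stored (A-E plus F)
```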

A – Fixed-Length or Fixed Block
In this data deduplication algorithm, the data is broken into chunks or blocks whose size and boundaries are fixed, for example 4 KB or 8 KB, and the block size never changes. While different devices/solutions may use different block sizes, the block size for a given device/solution using this method remains constant.

The device/solution always calculates a fingerprint or signature on a fixed block and checks whether there is a match. After a block is processed, it advances by exactly the block size, takes the next block, and the process repeats.

[Figure: fixed-length block de-duplication]
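A rough sketch of fixed-length chunking in Python follows; the 4 KB block size and the function name are assumed example values, not taken from any particular device or solution.

```python
import hashlib

BLOCK_SIZE = 4 * 1024  # assumed example; constant for a given device/solution

def fixed_chunks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield (fingerprint, block) pairs, advancing by exactly block_size."""
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        yield hashlib.sha1(block).hexdigest(), block

# Each fingerprint would then be looked up in the hash index as shown earlier.
```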

Advantages

Requires minimal CPU overhead; fast and simple.

Disadvantages

Because the block size and block boundaries are fixed, the main limitation of this approach is that when the data inside a file is shifted, for example when a slide is added to a Microsoft PowerPoint deck, all subsequent blocks in the file are rewritten and are likely to be considered different from those in the original file. Smaller block sizes give better deduplication than larger ones, but require more processing; larger block sizes give lower deduplication, but require less processing.

The bottom line: less storage savings and lower efficiency.

B – Variable-Length or Variable Block

In this data deduplication algorithm, the data is broken into chunks or blocks whose size and boundaries are variable, for example 4 KB, 8 KB, or 16 KB, and the block size changes dynamically throughout the process. The device/solution calculates a fingerprint or signature on each variable-sized block and checks whether there is a match. After a block is processed, it advances to the next block, whose size may again differ, and the process repeats.

[Figure: variable-length block de-duplication]
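The post does not say how a product picks the variable boundaries; a common technique is content-defined chunking with a rolling hash, sketched below in Python. The gear table, mask, and size limits here are illustrative assumptions, not the algorithm of any specific product.

```python
import hashlib

# Pseudo-random per-byte values (a "gear" table), derived deterministically.
GEAR = [int.from_bytes(hashlib.md5(bytes([b])).digest()[:4], "big")
        for b in range(256)]
MASK = 0x1FFF                        # cut on average roughly every 8 KB
MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024

def variable_chunks(data: bytes):
    """Yield (fingerprint, chunk) pairs with boundaries chosen by content."""
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        # Rolling hash: old bytes fall out as their bits are shifted away.
        rolling = ((rolling << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (rolling & MASK) == 0) or size >= MAX_SIZE:
            chunk = data[start:i + 1]
            yield hashlib.sha1(chunk).hexdigest(), chunk
            start, rolling = i + 1, 0
    if start < len(data):            # emit the trailing partial chunk
        chunk = data[start:]
        yield hashlib.sha1(chunk).hexdigest(), chunk
```

Because the cut points depend on the bytes themselves rather than on fixed offsets, inserting data only disturbs the boundaries near the change, which is exactly the behavior described in the example that follows.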

Advantages

Higher deduplication ratio and greater storage space savings.

Disadvantages

While variable block deduplication may yield slightly better deduplication than the fixed block approach, it does require you to pay a price: the CPU cycles that must be spent determining block boundaries. The variable block approach requires more processing than fixed block because the whole file must be scanned, one byte at a time, to identify block boundaries.

The following example, based on this sentence, explains the difference in detail: “deduplication technologies are becoming more an more important now.”

[Figure: fixed-block vs. variable-block chunking of the original sentence]

Notice how the variable block deduplication has some funky block sizes. While this does not look too efficient compared to fixed block, check out what happens when I make a correction to the sentence. Oops… it looks like I used ‘an’ when it should have been ‘and’. Time to change the file: “deduplication technologies are becoming more and more important now.”  File –> Save

After the file was changed and deduplicated, this is what the storage subsystem saw:

[Figure: changed blocks after the edit – fixed block vs. variable block]

The red sections represent the blocks that have changed. By adding a single character to the sentence, a ‘d’, the sentence length shifted and more blocks changed. The fixed block solution saw 4 out of 9 blocks change; the variable block solution saw only 1 out of 9 blocks change.

Variable block deduplication ends up providing a higher storage density and good storage space savings.
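To see the boundary-shift effect in miniature, here is a small self-contained Python demo of my own; the 4-byte block size is chosen only to keep the output readable.

```python
# Tiny demo of the boundary-shift problem with fixed blocks (illustrative only).
def fixed_blocks(text: bytes, size: int = 4):
    return [text[i:i + size] for i in range(0, len(text), size)]

before = b"more an more important now."
after  = b"more and more important now."   # one character inserted

old, new = fixed_blocks(before), fixed_blocks(after)
changed = sum(1 for a, b in zip(old, new) if a != b)
print(changed, "of", len(new), "fixed blocks differ after inserting one byte")
# Every fixed block from the insertion point onward shifts, so most blocks
# differ; a content-defined (variable block) chunker would typically re-align
# shortly after the change and leave the remaining blocks untouched.
```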

2 – Content Aware or application-aware Deduplication

My next blog post will cover content aware deduplication.
