Deduplication Internals – Content-Aware Deduplication: Part 3

Continuing from Part 1 and Part 2, this part discusses content-aware (also called application-aware) deduplication.

This type of deduplication is generally called byte-level deduplication, because the deduplication happens at the deepest level: bytes.

Content-aware technologies (also called byte-level deduplication or delta-differencing deduplication) work in a fundamentally different way. The key element of the content-aware approach is that it uses a higher level of abstraction when analyzing the data: it looks at the data as objects. Unlike hash-based dedupe, which tries to find redundancies at the block level, content-aware dedupe compares objects to other objects of the same kind (e.g., Word document to Word document or Oracle database to Oracle database).

Here, the deduplication engine sees the actual objects (files, database objects, application objects, etc.) and divides the data into larger segments (usually 8 MB to 100 MB in size). Then, typically by using knowledge of the content of the data (this is what "content-aware" means), it finds segments that are similar and stores only the bytes that changed between the objects. That is, a byte-level comparison is performed.
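The idea of "store only the changed bytes" can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product implements it: the function names (make_delta, apply_delta, stored_bytes) are made up for this post, the segments are tiny compared to the 8 MB to 100 MB mentioned above, and the standard library's difflib stands in for a real delta-differencing engine.

```python
import random
from difflib import SequenceMatcher

def make_delta(base: bytes, new: bytes) -> list:
    """Describe `new` as copies from `base` plus the literal bytes that changed."""
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    delta = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            delta.append(("copy", i1, i2 - i1))   # reference into the base segment
        elif j2 > j1:
            delta.append(("data", new[j1:j2]))    # replaced or inserted bytes
        # a pure "delete" needs no entry: those base bytes are simply not copied
    return delta

def apply_delta(base: bytes, delta: list) -> bytes:
    """Rebuild the new segment from the base segment and the delta."""
    out = bytearray()
    for entry in delta:
        if entry[0] == "copy":
            _, offset, length = entry
            out += base[offset:offset + length]
        else:
            out += entry[1]
    return bytes(out)

def stored_bytes(delta: list) -> int:
    """Number of literal bytes the delta has to keep."""
    return sum(len(entry[1]) for entry in delta if entry[0] == "data")

# Hypothetical 2000-byte "segment" and a second version with one byte flipped.
rng = random.Random(42)
base = bytes(rng.getrandbits(8) for _ in range(2000))
new = base[:1000] + bytes([base[1000] ^ 0xFF]) + base[1001:]

delta = make_delta(base, new)
print(len(new), "bytes in the new segment,", stored_bytes(delta), "literal bytes stored")
```

The second version is kept as a handful of "copy" references into the first one plus the few bytes that actually differ, and apply_delta restores it exactly when the data is read back.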

Looking in more detail: before the dedupe process, unique metadata is typically inserted into the data for each data type, such as database, email, photo, or document. Then, during the dedupe process, this metadata is extracted from the data block and examined to understand what kind of data the block contains. Based on the type of data, a block size is assigned that is optimized for the type of data being backed up.
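As a rough sketch of that flow, the snippet below tags a payload with its data type, reads the tag back, and picks a segment size from a lookup table. Everything here is an assumption made for illustration: the one-line header format, the function names and the sizes in the table are invented, and real products use their own metadata formats and far larger segments.

```python
# Hypothetical per-type segment sizes in bytes (kept tiny so the demo is readable).
SEGMENT_SIZE_BY_TYPE = {"database": 64, "email": 16, "photo": 32}
DEFAULT_SEGMENT_SIZE = 8

def tag(content_type: str, payload: bytes) -> bytes:
    """Insert metadata before dedupe: a made-up one-line header naming the data type."""
    return content_type.encode("ascii") + b"\n" + payload

def split_by_type(tagged: bytes):
    """Extract the metadata, look up the segment size, and cut the payload accordingly."""
    header, _, payload = tagged.partition(b"\n")
    content_type = header.decode("ascii")
    size = SEGMENT_SIZE_BY_TYPE.get(content_type, DEFAULT_SEGMENT_SIZE)
    segments = [payload[i:i + size] for i in range(0, len(payload), size)]
    return content_type, size, segments

content_type, size, segments = split_by_type(tag("email", bytes(40)))
print(content_type, size, [len(s) for s in segments])
```

The point of the sketch is that the segment size comes from a table lookup on the data type, so no scanning of the payload is needed to decide where the boundaries go.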

Since the block size is already optimized for that particular data type, CPU cycles do not need to be spent determining block boundaries. Compare this to other (hash-based) deduplication techniques, where the data is chopped up blindly to find the boundaries, that is, to determine the block length, before duplicate segments can be identified.
The example below gives more insight into this.

Say I have saved a photo, then I open it, edit one pixel and save the new version as a new file. In a compressed image format there may not be a single duplicate block at the disk level, yet almost the entire file is duplicate information. Or take a graphic that was used in a PowerPoint, a Word document, and a PDF: can you find the duplicate? PowerPoint and Word both compress their files with a variant of zip. Even if the graphic is identical, block-level dedupe won't find the duplicate graphics because they are not stored identically on disk. You need something that can find duplicate data at the information level, and the answer is a content-aware or application-aware dedupe solution.
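The graphic example can be imitated with Python's zipfile module. This is only a sketch under some assumptions: two in-memory zip archives stand in for the PowerPoint and Word files, random bytes stand in for the graphic, and the 512-byte block size and all function names are chosen just for the demo. The block-level view hashes fixed-size blocks of the raw archive bytes, while the content-aware view opens the archive and hashes each object inside it.

```python
import hashlib
import io
import random
import zipfile

def make_container(text: str, graphic: bytes) -> bytes:
    """Build a zip-style document in memory: some text plus an embedded graphic."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.txt", text)
        zf.writestr("media/image1.bin", graphic)
    return buf.getvalue()

def block_hashes(raw: bytes, size: int = 512) -> set:
    """Block-level view: hash fixed-size blocks of the raw container bytes."""
    return {hashlib.sha256(raw[i:i + size]).hexdigest()
            for i in range(0, len(raw), size)}

def object_hashes(raw: bytes) -> dict:
    """Content-aware view: open the container and hash each decompressed object."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return {hashlib.sha256(zf.read(name)).hexdigest(): name
                for name in zf.namelist()}

# Stand-in for the identical graphic used in both documents.
rng = random.Random(7)
graphic = bytes(rng.getrandbits(8) for _ in range(8192))

deck = make_container("Slide deck: quarterly results", graphic)
report = make_container("Report: quarterly results, with a much longer introduction", graphic)

shared_blocks = block_hashes(deck) & block_hashes(report)
deck_objects = object_hashes(deck)
report_objects = object_hashes(report)
shared_objects = [deck_objects[h] for h in deck_objects if h in report_objects]

print("shared raw blocks:", len(shared_blocks))
print("shared objects:", shared_objects)
```

Because the text in front of the graphic differs in length, the graphic sits at a different offset in each archive, so fixed-size blocks of the raw bytes are unlikely to line up. Looking at the objects inside the containers finds the common graphic regardless of where it is stored.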

This approach provides a good balance between performance and resource utilization.

CommVault Simpana, Symantec Backup Exec, Symantec NetBackup, Dell DR4000, Sepaton's DeltaStor, and ExaGrid are among the solutions that use this technology.

About GK_RAJ

An enthusiastic IT person with an intense passion for datacenter technologies. I hold the VMware vExpert title and work as a Technical Consultant in Qatar. I work with VMware vSphere, storage, BladeCenters, datacenter operations, Symantec backup and deduplication technologies, and carry rich and diversified experience in these domains. I specialize in designing and consulting on VMware vSphere and on the integration of storage and network stacks with vSphere. With my experience, I help organizations and enterprises achieve CAPEX and OPEX savings, develop DR and BCP strategies, deliver consolidation services through virtualization with vSphere, and prepare to move to the cloud. In the meantime, I would like to share my knowledge and make a good contribution to the community. I am an Indian citizen and have an engineering degree in Electronics and Communication. I am certified in VCAP5-DCD, VCP-Cloud, VCP 4 & 5, MCITP, and MCSE.

Posted on February 17, 2013, in Storage Technology.
