Deduplication Internals – Content Aware deduplication : Part-3

Posted by GK_RAJ

Continuing from Part 1 and Part 2, this part discusses content-aware (also called application-aware) deduplication.

This type of deduplication is generally called byte-level deduplication, because the deduplication happens at the deepest level: bytes.

Content-aware technologies (also called byte-level deduplication or delta-differencing deduplication) work in a fundamentally different way. The key element of the content-aware approach is that it uses a higher level of abstraction when analyzing the data: it looks at the data as objects. Unlike hash-based dedupe, which tries to find redundancies at the block level, content-aware dedupe compares objects with other objects of the same kind (e.g., Word document to Word document, or Oracle database to Oracle database).

Here the deduplication engine sees the actual objects (files, database objects, application objects, etc.) and divides the data into larger segments (usually 8 MB to 100 MB in size). Then, typically by using knowledge of the content of the data (hence “content-aware”), it finds segments that are similar and stores only the bytes that changed between the objects. That is, a byte-level comparison is performed.
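The idea of storing only the changed bytes between two similar segments can be sketched roughly as follows. This is a minimal illustration, not how any particular product implements it: the helper names make_delta and apply_delta are hypothetical, and Python's difflib.SequenceMatcher stands in for a real delta-differencing engine. It is only practical for small buffers, far below the 8 MB to 100 MB segments mentioned above.

```python
import random
from difflib import SequenceMatcher


def make_delta(base: bytes, new: bytes) -> list:
    """Describe `new` as copies from `base` plus literal changed bytes."""
    ops = []
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse bytes already stored in base
        elif j2 > j1:
            ops.append(("insert", new[j1:j2]))  # store only the bytes that changed
        # a pure "delete" needs nothing stored
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the new segment from the base segment and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# Hypothetical 4 KiB "segment" with one byte changed in the new version.
rng = random.Random(7)
base = bytes(rng.getrandbits(8) for _ in range(4096))
new = bytearray(base)
new[500] ^= 0xFF
new = bytes(new)

delta = make_delta(base, new)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
rebuilt = apply_delta(base, delta)
print(f"segment: {len(new)} bytes, literal bytes stored in delta: {literal_bytes}")
```

The second version of the segment is represented by references into the first version plus a handful of literal bytes, which is the saving that byte-level comparison is after.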

Looking in more detail: before the dedupe process, unique metadata is typically inserted into each data type (database, email, photo, document, etc.). During the dedupe process, this metadata is extracted from the data block and examined to understand what kind of data the block contains. Based on the type of data, a block size is assigned that is optimized for the kind of data being backed up.
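As a rough sketch of choosing a block size from the data type, the snippet below identifies the type and then looks up a block size for it. Everything here is an assumption for illustration: real products rely on metadata inserted by their own agents, whereas this sketch simply sniffs well-known leading "magic" bytes, and the block sizes in BLOCK_SIZE are made-up values, not vendor settings.

```python
# Leading "magic" bytes of a few common formats (used as a stand-in
# for the product-specific metadata described above).
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also the container format of .docx and .pptx
]

# Hypothetical block sizes per data type, for illustration only.
BLOCK_SIZE = {
    "png": 64 * 1024,
    "jpeg": 64 * 1024,
    "pdf": 32 * 1024,
    "zip": 128 * 1024,
    "unknown": 8 * 1024,
}


def detect_type(data: bytes) -> str:
    """Return the data type suggested by the leading bytes."""
    for magic, kind in MAGIC:
        if data.startswith(magic):
            return kind
    return "unknown"


def split_blocks(data: bytes) -> list:
    """Split data using the block size assigned to its detected type."""
    size = BLOCK_SIZE[detect_type(data)]
    return [data[i:i + size] for i in range(0, len(data), size)]


sample = b"%PDF-1.4\n" + b"x" * (70 * 1024)
blocks = split_blocks(sample)
print(detect_type(sample), [len(b) for b in blocks])
```

Because the block size comes from a table lookup, no time is spent scanning the data to discover where the block boundaries should fall.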

Since the block size is already optimized for that particular data type, CPU cycles don’t need to be wasted on determining the block boundaries. Compare this with other (hash-based) deduplication techniques, where the data is chopped blindly to find the boundaries, that is, to determine the block length, before duplicate segments can be identified.
The example below gives more insight into this:

Suppose I save a photo, then open it, edit one pixel, and save the new version as a new file. There typically won’t be a single duplicate block at the disk level, yet almost the entire file is duplicate information. Likewise, can you find a duplicate graphic that was used in a PowerPoint presentation, a Word document, and a PDF? PowerPoint and Word both compress with a variant of zip. Even if the graphic is identical, block-level dedupe won’t find the duplicates because they are not stored identically on disk. You need something that can find duplicate data at the information level, and the answer is content-aware (application-aware) dedupe solutions.
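The duplicate-graphic case can be reproduced in a small, hedged sketch. Two zip containers (hypothetical stand-ins for a presentation and a document) embed the same graphic. A fixed-size block pass over the raw container bytes is compared with a pass that opens the containers and hashes each embedded object. The 4096-byte block size, the member names, and the pseudo-random "graphic" are all assumptions chosen for the illustration.

```python
import hashlib
import io
import random
import zipfile

BLOCK = 4096  # assumed fixed block size for the block-level pass


def make_container(text_name: str, text: bytes, graphic: bytes) -> bytes:
    """Build an in-memory zip holding some text and an embedded graphic."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Text is stored uncompressed only to keep this sketch's offsets predictable.
        zf.writestr(text_name, text, compress_type=zipfile.ZIP_STORED)
        zf.writestr("media/image1.png", graphic, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def block_hashes(blob: bytes) -> set:
    """Block-level view: hash fixed-size blocks of the raw container."""
    return {hashlib.sha256(blob[i:i + BLOCK]).hexdigest()
            for i in range(0, len(blob), BLOCK)}


def object_hashes(blob: bytes) -> set:
    """Content-aware view: open the container and hash each object inside."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {hashlib.sha256(zf.read(name)).hexdigest() for name in zf.namelist()}


rng = random.Random(42)
graphic = bytes(rng.getrandbits(8) for _ in range(20000))  # stand-in image data

slides = make_container("slide1.xml", b"slide text " * 10, graphic)
report = make_container("document.xml", b"report paragraph " * 50, graphic)

shared_blocks = block_hashes(slides) & block_hashes(report)
shared_objects = object_hashes(slides) & object_hashes(report)
print("shared fixed-size blocks:", len(shared_blocks))
print("shared embedded objects:", len(shared_objects))
```

The graphic sits at a different offset in each container, so the fixed-size blocks never line up, while the object-level pass recognizes the graphic as the same item in both files.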

This approach provides a good balance between performance and resource utilization.

CommVault Simpana, Symantec Backup Exec, Symantec NetBackup, Dell DR4000, Sepaton’s DeltaStor, and ExaGrid are solutions that use this technology.

About GK_RAJ

An enthusiastic IT person with an intense passion for datacenter technologies. I hold the VMware vExpert title and work as a Technical Consultant in Qatar. I have worked with VMware vSphere, storage, BladeCenters, datacenter operations, Symantec Backup, and deduplication technologies, and carry rich and diversified experience in these domains. I specialize in designing and consulting on VMware vSphere and the integration of storage and network stacks with vSphere. With my experience, I help organizations and enterprises achieve CAPEX and OPEX savings, develop DR and BCP strategies, consolidate services through virtualization with vSphere, and prepare to move to the cloud. In the meantime, I would like to share my knowledge and contribute to the community. I am an Indian citizen and have an engineering degree in Electronics and Communication. I am certified in VCAP5-DCD, VCP-Cloud, VCP 4 & 5, MCITP, and MCSE.


Posted on February 17, 2013, in Storage Technology.
