Understanding DataCove’s Deduplication Capability

Mar 27

What is Deduplication?

Deduplication, in a nutshell, is the process of removing duplicate copies of identical data when they are placed into the same location. Deduplication is an independent technology, but in almost all practical usage it actually comprises two different technologies: Deduplication and Single Instance Storage. “Deduplication” is simply used as an umbrella term for both functions, since they’re almost always paired together.

If Deduplication were used alone on DataCove, removing duplicate content would leave “gaps” in a compliant archive, since it would simply carve out swaths of data that already exist elsewhere on the system and leave “partial” emails for review. Imagine not being able to find emails in a search because content was removed from them since it already existed on some other email. That is not a good way of trying to find or review data, as one would imagine.

Single Instance Storage is what makes Deduplication work. While Deduplication removes data, Single Instance Storage “adds” a smaller segment of data in its place. Whenever Deduplication removes data, Single Instance Storage appends “pointer” files that redirect the system to that original content. This allows for the Deduplicated content to still show up under a search, since the information is technically there via logical redirection to the original content, which then matches the search criteria.
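As a rough illustration of why the pointer matters for search, the Python sketch below keeps one stored copy of an attachment's text and lets two emails reference it. The identifiers (att-1, email-A, email-B) and the search function are hypothetical and only model the idea; they do not describe DataCove's internal implementation.

```python
# The single stored copy of an attachment, keyed by a hypothetical identifier.
originals = {"att-1": "Q3 budget forecast with revised headcount"}

# Two archived emails; each holds only a pointer to the shared attachment.
emails = [
    {"id": "email-A", "body": "Budget attached.", "attachment": "att-1"},
    {"id": "email-B", "body": "Forwarding this along.", "attachment": "att-1"},
]


def search(term: str) -> list:
    """Return ids of emails whose body or referenced attachment contains term."""
    hits = []
    for email in emails:
        attachment_text = originals[email["attachment"]]  # follow the pointer
        if term.lower() in (email["body"] + " " + attachment_text).lower():
            hits.append(email["id"])
    return hits


matches = search("forecast")
print(matches)
```

Both emails match a search for a word that appears only in the attachment, even though the attachment content is held once.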

As a simple example of Deduplication:

On the small scale, this is a near-universal experience for anyone working with Folders in Windows: if a user attempts to place a file inside a folder that already contains a file with the same name and type, Windows will identify this and ask whether the new file should replace the file currently inside the folder, or whether it should be renamed and added (to make it unique). Windows compares the file names and treats the two as the same file, even if one has a larger size or a newer modified date; as far as Windows’ filesystem is concerned, those are the same file and cannot exist in the same place at the same time.

While the Windows use case is meant to make the user experience better by not letting users confuse themselves with multiple files of the same name, it’s also a filesystem limitation in that multiple files of the same name are not allowed to coexist in the same space. This is true for almost every computer system, including DataCove.

DataCove’s email archiving scenario is a bit different: user confusion is not the crux of the issue; efficiency in space consumption is.

When operating on the scale of tens or hundreds of millions of emails being stored and tens of thousands of emails being ingested daily, disk capacity and overall storage become a factor over time. This leads to a dichotomy: storing every single copy of every single email is DataCove’s job, but holding every copy of the same email is inefficient.

Using a combined series of Deduplication, Single Instance Storage and Compression technologies, DataCove removes the tremendous overhead that comes from storing multiple copies of the most space-consumptive document types: Email Attachments. Emails themselves, HTML heavy as they may be, are all of a few kilobytes each. Essentially negligible in today’s storage capacities. Attachments, however, tend to be a few hundred kilobytes to tens of megabytes a pop. With attachments being so commonly blasted around to multiple people at a time, replied to and forwarded back and forth, this can add up exceedingly fast in terms of how much space is being eaten.

Given that those attachments rarely change from one transmission to the next, it makes the most sense to focus on Deduplicating those files, which limits the overhead the Deduplication process entails while providing the most benefit.

As some practical breakdowns of these technologies:

Deduplication:

Definition: Deduplication is the process of automatically removing duplicate email messages within an archive.
Scenario: Imagine you’re sending a “Happy Holidays” message with a 1MB attachment to all 2000 employees in your company. Without deduplication, this seemingly tiny attachment would occupy almost 2GB of your DataCove’s storage space.
Functionality: Deduplication ensures that only one copy of a duplicate message is stored. It prevents unnecessary redundancy and optimizes storage efficiency.
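The arithmetic behind this scenario can be checked quickly. The sketch below assumes decimal units (1 MB = 1,000,000 bytes) and ignores the few kilobytes of each message body and any pointer records, so the figures are approximations rather than measurements from a real DataCove.

```python
ATTACHMENT_BYTES = 1_000_000  # 1 MB attachment (decimal units assumed)
RECIPIENTS = 2000

# Every recipient's copy of the attachment stored in full.
without_dedup = ATTACHMENT_BYTES * RECIPIENTS

# One stored copy; the other 1,999 emails only reference it.
with_dedup = ATTACHMENT_BYTES

saved_fraction = 1 - with_dedup / without_dedup

print(f"Without deduplication: {without_dedup / 1e9:.1f} GB")
print(f"With deduplication:    {with_dedup / 1e6:.1f} MB")
print(f"Space saved:           {saved_fraction:.2%}")
```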

Single-Instance Storage:

Explanation: Single Instance Storage is often confused with deduplication but serves a slightly different purpose.
Use Case: Suppose two different users send emails containing the same attachment. Single Instance Storage recognizes that these attachments are identical and retains a single copy. However, it indexes the attachment in a way that allows it to be located when searching for either email or either user.
Advantage: By maintaining a single instance of shared attachments, Single Instance Storage further reduces storage requirements.

Compression:

Purpose: Compression ensures that the size of email files is minimized.
Impact: Smaller file sizes translate to more available storage space on your DataCove.
Cost Savings: Ultimately, efficient compression leads to lower costs associated with storage.
Combined Power: When Deduplication, Single Instance Storage, and Compression work together, they significantly reduce the storage overhead on DataCove while still providing an extremely rapid search and retrieval experience.
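To give a feel for the compression step, the sketch below compresses a repetitive, HTML-like email body using Python's standard zlib module. zlib (DEFLATE) is used here purely as a stand-in; the article does not state which compression algorithm DataCove uses, and the sample text is invented.

```python
import zlib

# Hypothetical sample: an HTML-like email body with a lot of repetition.
body = ("<tr><td>Quarterly status update</td></tr>\n" * 200).encode("utf-8")

compressed = zlib.compress(body, level=9)
restored = zlib.decompress(compressed)

ratio = len(compressed) / len(body)
print(f"original: {len(body)} bytes, compressed: {len(compressed)} bytes ({ratio:.1%})")
```

Compression is lossless: decompressing returns the original bytes exactly, so nothing is lost for search or review. Real-world ratios depend heavily on content, and formats that are already compressed (such as many image and archive types) shrink very little.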

With DataCove’s licensing model being based on storage capacity, the value of Deduplication cannot be overstated. The more efficient DataCove is at Deduplication, the more space is saved and consequently, the smaller the system needed to meet an organization’s retention requirements.

How does DataCove's Deduplication work?

DataCove’s order of operations for Deduplication comes as part of its multi-layered Email Processing Stack.

A visualization of this is contained below, with individual clarifications of each section broken out further below.

Note: For definitional purposes, the computing term “hash” will be used repeatedly below. A “hash,” in DataCove’s context, is a fixed-length alphanumeric code that traces to an individual email or sub-document of that email, for the purposes of rapid data comparison and as a unique identifier.

When DataCove first receives an email, it enters Layer 1 of the Email Processing Stack. Regardless of the source of the email (such as POP or IMAP fetchers, SMTP, Exchange or Google Crawlers, PST uploads, etc.), the very first step that DataCove conducts is creating a hash of that full email. This hash gets added to a list of entries that will be referenced in the future for immediate deduplication if a match is detected.

If this hash matches one that DataCove has created previously and holds on that list of previously seen emails, the entire email is marked as a duplicate and deleted, as the existence of that hash means DataCove has already seen and dealt with that email.
Emails that were deleted from the system via the Individual Email Deletion or Retention Policy functions will have their hashes removed from this Hash Index, allowing their possible reinsertion in the future.
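The Layer 1 behavior described above can be sketched as a set of previously seen hashes. This is a minimal model under stated assumptions: the class name EmailHashIndex, its method names and the use of SHA-256 are all illustrative choices, since the article does not specify the hash function or data structures DataCove uses.

```python
import hashlib


class EmailHashIndex:
    """Toy model of the Layer 1 check: one entry per email already archived."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def fingerprint(raw_email: bytes) -> str:
        # SHA-256 is an assumption; any fixed-length digest fits the description.
        return hashlib.sha256(raw_email).hexdigest()

    def ingest(self, raw_email: bytes) -> bool:
        """Return True if the email is new, False if it is a duplicate to drop."""
        digest = self.fingerprint(raw_email)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True

    def forget(self, raw_email: bytes) -> None:
        """Remove the hash after a deletion, so the email can be re-ingested."""
        self._seen.discard(self.fingerprint(raw_email))


index = EmailHashIndex()
message = b"From: a@example.com\r\nSubject: Hello\r\n\r\nHi there\r\n"

results = [index.ingest(message), index.ingest(message)]  # new, then duplicate
index.forget(message)                                     # email deleted from archive
results.append(index.ingest(message))                     # accepted again
print(results)
```

The first copy is accepted, the second is rejected as a duplicate, and after the hash is removed the same email can be inserted again.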

Past the original email deduplication check, Layer 2 of the Email Processing Stack contains the “Shredder” function. The Shredder essentially dices up each email into multiple different components, known as Documents. These Documents are the different pieces of the email: the Headers, the Body and each Attachment become individual Documents.

These Documents then get hashed in the same fashion as the total email does in Layer 1, with Attachments then getting passed to Layer 3 for Deduplication.
Email Headers and Body Documents proceed directly to the Indexing Layer and are not deduplicated.
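A rough picture of what a Shredder-style step might look like, using Python's standard email package. The shred function, the Document dictionary layout and the SHA-256 hash are assumptions made for illustration; the sample message is invented, and none of this is DataCove's actual code.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def shred(raw_email: bytes) -> list:
    """Split an email into header, body and attachment Documents, hashing each."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    documents = []

    headers = "\n".join(f"{name}: {value}" for name, value in msg.items())
    documents.append({"kind": "headers", "data": headers.encode("utf-8")})

    body = msg.get_body(preferencelist=("plain", "html"))
    if body is not None:
        documents.append({"kind": "body", "data": body.get_content().encode("utf-8")})

    for part in msg.iter_attachments():
        documents.append({
            "kind": "attachment",
            "name": part.get_filename(),
            "data": part.get_payload(decode=True),  # raw attachment bytes
        })

    for doc in documents:
        doc["hash"] = hashlib.sha256(doc["data"]).hexdigest()
    return documents


# Build a small sample message with one attachment.
sample = EmailMessage()
sample["From"] = "sender@example.com"
sample["To"] = "team@example.com"
sample["Subject"] = "Happy Holidays"
sample.set_content("Season's greetings to everyone.")
sample.add_attachment(b"%PDF-1.4 pretend card", maintype="application",
                      subtype="pdf", filename="card.pdf")

docs = shred(sample.as_bytes())
for doc in docs:
    print(doc["kind"], doc["hash"][:12])
```

In this model only the Documents of kind "attachment" would be handed on for deduplication, mirroring the split described above.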

Layer 3 of the Email Processing Stack is the Deduplication of Attachments phase, where the hashes of the Attachments get compared against an index of all Attachments that DataCove has seen before. If an exact hash match is found, the Attachment is removed and a Pointer file is put in its place that directs the system to where DataCove stores the original Attachment.

Attachments must be the exact same as one the DataCove has already seen and presently has stored. Any changes to the Attachment would mean it would possess a unique hash and thus be considered a unique attachment, which won’t be deduplicated.
When an email holding an original Attachment is deleted by the Individual Email Deletion or Retention Policy features, the Attachment is “moved” to the next oldest surviving Email that possesses that same Attachment.
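The attachment rules above can be modeled as one stored copy per hash plus an ordered list of the emails that reference it. The AttachmentStore class, its method names, the email identifiers and the SHA-256 hash are hypothetical; this is a sketch of the described behavior, not DataCove's storage layout.

```python
import hashlib


class AttachmentStore:
    """Toy Layer 3 model: one stored copy per attachment hash, plus its referrers."""

    def __init__(self):
        self._blobs = {}      # hash -> attachment bytes (the single stored copy)
        self._referrers = {}  # hash -> email ids in arrival order (oldest first)

    def add(self, email_id: str, attachment: bytes) -> str:
        """Record the attachment for this email and return its pointer (hash)."""
        pointer = hashlib.sha256(attachment).hexdigest()
        if pointer not in self._blobs:
            self._blobs[pointer] = attachment  # first sighting: keep the bytes
        self._referrers.setdefault(pointer, []).append(email_id)
        return pointer

    def owner(self, pointer: str) -> str:
        """The oldest surviving email that carries this attachment."""
        return self._referrers[pointer][0]

    def delete_email(self, email_id: str, pointer: str) -> None:
        """Drop one email's reference; discard the bytes only when none remain."""
        self._referrers[pointer].remove(email_id)
        if not self._referrers[pointer]:
            del self._referrers[pointer]
            del self._blobs[pointer]

    def unique_attachments(self) -> int:
        return len(self._blobs)


store = AttachmentStore()
slides = b"slide deck v1"
p1 = store.add("email-001", slides)
p2 = store.add("email-002", slides)            # exact duplicate: same pointer
p3 = store.add("email-003", b"slide deck v2")  # changed content: new hash

store.delete_email("email-001", p1)            # original email removed
print(store.owner(p1), store.unique_attachments())
```

After the oldest email is deleted, the attachment survives and is now attributed to the next oldest email that references it; a changed attachment gets its own hash and is stored separately.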

Once the Documents have finished the Processing phases, they move onto the Indexing Layer. Indexing is the process of “reading” the Documents and noting down each and every word into many searchable indices. These indices form the backbone of DataCove’s search functionality, allowing for all sorts of words in many different combinations to then be searched for across the email bodies, subject lines, attachments and more. Once the Documents finish Indexing, they are compressed for any final storage savings to be gleaned from them.
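Indexing of this kind is commonly implemented as an inverted index, which maps each word to the Documents that contain it. The sketch below is a generic, minimal example of that technique; the function names and sample Documents are made up, and DataCove's real indices are certainly more elaborate.

```python
import re
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index


def search_all(index: dict, *words: str) -> set:
    """Return ids of documents containing every one of the given words."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()


documents = {
    "email-1/subject": "Quarterly budget review",
    "email-1/body": "Please find the budget attached.",
    "email-2/body": "Lunch on Friday?",
}
index = build_index(documents)
print(sorted(search_all(index, "budget")))
print(sorted(search_all(index, "budget", "review")))
```

Looking up a word is a dictionary access rather than a scan of every Document, which is what keeps searches fast as the archive grows.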

Any Documents that are encrypted, password protected or malformed won’t be indexed due to DataCove not being able to read them properly. In such situations, these emails will be marked as “unindexable” and placed into a separate category for visibility. Additional information on what these are, how they happen and how their data can be viewed is found in another article concerning Unindexable Documents, linked here.

Once Indexing is complete, the emails are ready for searching in DataCove and all deduplication procedures have finished.

In general, the overall combination of these different functions provides a whopping 40% to 60% space savings on any given DataCove.

DataCove tracks the amount of deduplication occurring on the system constantly, which can be found by clicking on Status in the top header bar, followed by selecting Summary Counts on the left hand side.

Under the Totals section near the top of the page, DataCove tracks the total number of Attachments seen by the system, along with what is actually being retained, described as Unique Attachments.

In the example below, 461 million attachments were detected across some 233 million stored emails, with only 307 million of those attachments actually being retained by the system due to their distinctive content. With a savings of some 154 million attachments that never had to be stored a second time, DataCove provides great efficiency to every organization, with especial benefit to attachment-heavy ones.
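The figures quoted in this example work out as follows. Note that this is a count of attachments, not bytes, so the percentage is not directly comparable to the overall space savings figure, which also reflects attachment sizes and compression.

```python
total_attachments = 461_000_000   # attachments seen by the system
unique_attachments = 307_000_000  # attachments actually retained
stored_emails = 233_000_000

deduplicated = total_attachments - unique_attachments
dedup_rate = deduplicated / total_attachments
attachments_per_email = total_attachments / stored_emails

print(f"Attachments not stored again: {deduplicated:,}")
print(f"Share of attachments deduplicated: {dedup_rate:.1%}")
print(f"Average attachments per email: {attachments_per_email:.2f}")
```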

Michael Singh
