What is deduplication and how does it work?

Logikcull's deduplication feature removes exact duplicates, streamlining data review. Choose between Global, Custodian, or No Dedupe views.

Written by Leah Keilty
Updated over 9 months ago

Introduction

Logikcull provides deduplication views that remove exact duplicates from search results, making it easier to review and manage data. Users can toggle between Global, Custodian, and No Dedupe views to display unique data or all data in a project. Deduplication works by comparing file content and metadata to identify duplicates, using specific fields for email, non-email, and calendar invitation data. This feature does not delete any files but merely hides them based on the chosen deduplication view.

Viewing Duplicates

In the Document Viewer

To view duplicates in your search results, select "Dupes" from the document card's drop-down options.

Toggle Between Dedupe Views

Simply open the filter carousel, navigate to Deduplication View, and select the Dedupe filter you want to use. Changing deduplication views is not permanent.

  1. Global deduplication shows only unique data across your project, inclusive of families (i.e. attachments).

  2. Custodian deduplication shows unique data within each custodian's (person's) data in your project.

  3. No Dedupe shows ALL the data in your project, regardless of whether a document duplicates another document in your project.
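The difference between the three views can be illustrated with a small sketch. This is not Logikcull's implementation: the `Doc` record, the `apply_view` function, and the rule that the first copy encountered is the one kept are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record: dedupe_hash stands in for whatever value
# identifies two documents as exact duplicates.
Doc = namedtuple("Doc", ["doc_id", "custodian", "dedupe_hash"])


def apply_view(docs, view):
    """Return the documents visible under a dedupe view.

    view is "global", "custodian", or "none". Nothing is deleted;
    duplicates are only left out of the returned list.
    """
    if view == "none":
        return list(docs)
    if view not in ("global", "custodian"):
        raise ValueError(f"unknown view: {view}")
    seen = set()
    visible = []
    for doc in docs:
        # Global compares across the whole project; custodian compares
        # only within the same custodian's documents.
        if view == "global":
            key = doc.dedupe_hash
        else:
            key = (doc.custodian, doc.dedupe_hash)
        if key not in seen:  # assumption: the first copy seen is kept
            seen.add(key)
            visible.append(doc)
    return visible


docs = [
    Doc(1, "alice", "h1"),
    Doc(2, "alice", "h1"),  # duplicate within alice's data
    Doc(3, "bob", "h1"),    # duplicate of alice's document, held by bob
    Doc(4, "bob", "h2"),
]

global_ids = [d.doc_id for d in apply_view(docs, "global")]
custodian_ids = [d.doc_id for d in apply_view(docs, "custodian")]
all_ids = [d.doc_id for d in apply_view(docs, "none")]
```

In this sketch the global view shows documents 1 and 4, the custodian view shows 1, 3, and 4, and No Dedupe shows all four.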

Your project has a default filter view that you see when you first log in. Clearing a search also returns you to that default view. If you did not change it during project creation, the default is the horizontal/global deduplication view.

⚠️ Important note on filter views and saved searches:

The deduplication filter view and culled/unculled filter view are each saved as parameters in a saved search. For example, if you are in the Global Dedupe view and you run a saved search that was saved in the No Dedupe view, the filter view will update to reflect the parameters of the saved search. However, if you run a saved search using the Advanced Search Builder, the filter view will NOT be updated automatically. The best practice is to limit use of the Advanced Search Builder to saved searches that all share the same filter view.

Changing the default Dedupe view

  1. Click on the Project Settings button (gear icon)

  2. Select "Preferences" from the drop-down

  3. Change the "Preferred Deduplication View" preference, then click the "Save Search Preferences" button, and that's it! Whenever you log in to the project, the deduplication view preference saved above will be used.

What documents are hidden by Deduplication?

To review documents that were removed from the Global Dedupe view, switch to "No Dedupe" and enter this in the search bar:

horizontal_duplicate:true

To review documents that were removed from the Custodian Dedupe view, switch to "No Dedupe" and select a custodian from the filter carousel, then enter this in the search bar:

vertical_duplicate:true

How to bulk tag duplicates

To tag all duplicates of a document the same way for consistency, hover over the ellipsis (...) next to the tag name in the document info panel and select "Tag Dupes". This option appears only if the document currently open in the image viewer has duplicates.

What is the "Has Duplicate" Auto Tag?

The “Has Duplicate” Auto Tag (QC Tag) applies to both dupes and the original copy. Logikcull compares the metadata, and if there is an exact match, all instances of the duplicative document receive this Auto Tag.

Filtering on documents with duplicates

You can also make use of the Auto Tag "Has Duplicate" to easily and quickly find documents that have duplicates associated with them. Keep in mind, this Auto Tag is applied to all instances of the duplicative document, including the original.

How does deduplication work?

When files are uploaded into Logikcull as part of a File Upload*, the file content (proxied by an MD5 hash) and the metadata are compared. If both are identical, the file is identified as a duplicate. The MD5 hash is the fingerprint of the file.

*Note that Logikcull does not deduplicate database uploads. If documents are uploaded using the Production Upload format, they will not be deduplicated.
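As a rough illustration of the comparison described above, the sketch below treats two files as duplicates only when both the MD5 of their content and their metadata match. The function names, the metadata keys, and the choice to keep the first copy as the original are assumptions for this example, not Logikcull's actual logic.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Fingerprint of the file content."""
    return hashlib.md5(data).hexdigest()


def find_duplicates(files):
    """files: iterable of (name, content_bytes, metadata_dict).

    Returns a list of (duplicate_name, original_name) pairs.
    """
    first_seen = {}
    duplicates = []
    for name, content, metadata in files:
        # Both the content hash and the metadata must match.
        key = (md5_hex(content), tuple(sorted(metadata.items())))
        if key in first_seen:
            duplicates.append((name, first_seen[key]))
        else:
            first_seen[key] = name
    return duplicates


files = [
    ("a.docx", b"quarterly report", {"created": "2023-01-05", "size": 16}),
    ("copy_of_a.docx", b"quarterly report", {"created": "2023-01-05", "size": 16}),
    # Same content, different metadata: not a duplicate in this sketch.
    ("b.docx", b"quarterly report", {"created": "2023-02-01", "size": 16}),
]

dupes = find_duplicates(files)
```

Here only `copy_of_a.docx` is reported as a duplicate of `a.docx`; `b.docx` has identical content but a different created date, so it stays unique.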

How is email uniqueness determined?

Logikcull deduplication uses the following fields to calculate the hash value for email data:

  • From

  • To

  • CC

  • BCC

  • Email Subject

  • Sent Date + Time

  • Extracted text of the email
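One way to picture a hash built from the fields listed above is the sketch below. The dictionary keys, the separator, and the absence of any normalization are assumptions; the article does not say how Logikcull combines or normalizes these fields.

```python
import hashlib

# Hypothetical key names for the seven fields listed above.
EMAIL_FIELDS = ("from", "to", "cc", "bcc", "subject", "sent", "text")


def email_hash(email: dict) -> str:
    """MD5 over the listed fields, joined in a fixed order."""
    # A unit-separator character keeps adjacent fields from running together.
    joined = "\x1f".join(str(email.get(field, "")) for field in EMAIL_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


original = {
    "from": "ann@example.com",
    "to": "bob@example.com",
    "cc": "",
    "bcc": "",
    "subject": "Contract draft",
    "sent": "2023-04-01T09:30:00",
    "text": "Please see the attached draft.",
}
same_email_other_mailbox = dict(original)
with_bcc = dict(original, bcc="legal@example.com")

is_duplicate = email_hash(original) == email_hash(same_email_other_mailbox)
differs_on_bcc = email_hash(original) != email_hash(with_bcc)
```

In this sketch, a copy of the same message found in another mailbox produces the same hash, while a copy that differs in any listed field (here, BCC) does not.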

How is non-email uniqueness determined?

For e-docs and non-email items, the MD5 hash value is calculated at the binary level, bit by bit, from the file's content. It is based mainly on the following:

  • Content/body of the image

  • Created Date

  • File Size

How are calendar invitations deduplicated?

For calendar invitations, a special set of fields is also used to calculate the hash values for these documents:

  • From

  • To

  • CC

  • Subject

  • Attachment Name

If you have recurring calendar invite entries, we typically recommend reviewing them outside of Logikcull or, alternatively, uploading these entries as part of a database upload with the following fields populated in your metadata load file: Appointment Start Date, Appointment Start Time, Appointment End Date, and Appointment End Time.
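The sketch below suggests one possible reason recurring entries need separate handling: none of the fields listed above carries the appointment's date or time, so a key built only from those fields cannot tell two occurrences apart. This is an assumption-laden illustration; the key names and hashing scheme are hypothetical, and Logikcull may use additional inputs.

```python
import hashlib

# Hypothetical key names for the five fields listed above.
CALENDAR_FIELDS = ("from", "to", "cc", "subject", "attachment_name")


def calendar_key(invite: dict) -> str:
    """MD5 over the listed calendar fields only (an assumption)."""
    joined = "\x1f".join(str(invite.get(f, "")) for f in CALENDAR_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


weekly_sync = {
    "from": "pm@example.com",
    "to": "team@example.com",
    "cc": "",
    "subject": "Weekly sync",
    "attachment_name": "invite.ics",
}
# Two occurrences of the same recurring meeting, one week apart.
occurrence_1 = dict(weekly_sync, start="2024-03-04 10:00")
occurrence_2 = dict(weekly_sync, start="2024-03-11 10:00")

same_key = calendar_key(occurrence_1) == calendar_key(occurrence_2)
```

Under these assumptions `same_key` is true even though the start times differ, which is why the appointment start and end fields are useful for telling occurrences apart.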

Family status impacts the dedupe view

Logikcull deduplicates at the family level: when the fields referenced above match for an entire family, that family is identified as a duplicate. Exact duplicates are hidden by default at the family level. For example, if a Word document sits in a folder and the same Word document is also attached to an email, you’ll see the document twice, because one copy stands alone and the other belongs to the email's family.

We preserve these family-level relationships because a file’s context may differ as part of a family. As another example, if the same attachment is sent to two different parties in emails with different body text, the attachment is identified as a duplicate, but because the parent emails differ, both families remain in the dedupe view.
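The family-level behavior described above can be sketched as follows. Representing a family as a parent hash followed by its attachment hashes, and comparing whole families by that tuple, are assumptions made for illustration rather than Logikcull's documented method.

```python
def visible_families(families):
    """Hide a family only when every member matches an earlier family.

    families: list of (name, [member_hashes]) with the parent first.
    """
    seen = set()
    visible = []
    for name, members in families:
        key = tuple(members)  # assumption: a family is compared as a whole
        if key not in seen:
            seen.add(key)
            visible.append(name)
    return visible


def unique_documents(families):
    """One instance of each document hash, ignoring family structure."""
    seen = set()
    unique = []
    for _name, members in families:
        for member_hash in members:
            if member_hash not in seen:
                seen.add(member_hash)
                unique.append(member_hash)
    return unique


families = [
    ("loose_word_doc", ["W"]),      # Word document sitting in a folder
    ("email_1", ["E1", "W"]),       # email with the same Word document attached
    ("email_2", ["E2", "W"]),       # different email body, same attachment
    ("email_1_copy", ["E1", "W"]),  # exact copy of the first email family
]

shown = visible_families(families)
one_of_each = unique_documents(families)
```

In this sketch the dedupe view keeps `loose_word_doc`, `email_1`, and `email_2` and hides only `email_1_copy`, so the Word document appears three times. Ignoring family structure leaves one instance each of `W`, `E1`, and `E2`.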

To see only one instance of every document, regardless of family structure, run this search: file_duplicate:false

How do I locate duplicate custodian(s) in the document viewer?

Duplicate custodian information is available directly in the document viewer. In the document info panel, scroll down to the "Duplicates" section and hover over the custodian information.
