Near Deduplication in eDiscovery Document Reviews: Its Meaning and Advantages

The old cliché is that close only counts in horseshoes and hand grenades. Well, sometimes a document review can feel like both, and close can represent millions of dollars and countless attorney hours. The eDiscovery world goes to great lengths to discuss data analytics in terms of predictive coding, but there are additional analytic processes that can be employed to mitigate cost, time, and risk. One of these technology-assisted services is Near Deduplication.


When preparing electronically stored information (ESI) for document review, a common practice is Deduplication. This is the process of identifying exact duplicate copies of documents in a data collection by comparing the hash values of the electronic records. Once documents are identified as exact duplicates, they may be excluded from the full data set on a custodial level (one copy of each exact duplicate is kept for each custodian, the person to whom the data belongs) or on a global level (one copy of each exact duplicate is kept across the entire universe of documents). Because this process compares hash values derived either from the binary content of files (for loose files or attachments) or from email metadata values (for emails), it provides absolute results.
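The difference between custodial and global Deduplication can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dedupe`, the `(doc_id, custodian, content)` record layout, and the choice of SHA-256 are all hypothetical, and a real processing tool may hash different inputs (for example, selected email metadata fields) with a different algorithm.

```python
import hashlib

def dedupe(docs, scope="global"):
    """Return the doc_ids kept after exact deduplication.

    docs  -- list of (doc_id, custodian, content_bytes) tuples (hypothetical layout)
    scope -- "global" keeps one copy overall; "custodial" keeps one copy per custodian
    """
    seen = set()
    kept = []
    for doc_id, custodian, content in docs:
        # Hash of the binary content; SHA-256 is an assumption, not a requirement.
        digest = hashlib.sha256(content).hexdigest()
        key = digest if scope == "global" else (custodian, digest)
        if key not in seen:
            seen.add(key)
            kept.append(doc_id)
    return kept

docs = [
    ("DOC-1", "alice", b"Quarterly report"),
    ("DOC-2", "alice", b"Quarterly report"),
    ("DOC-3", "bob", b"Quarterly report"),
    ("DOC-4", "bob", b"Meeting notes"),
]

print(dedupe(docs, "global"))     # ['DOC-1', 'DOC-4']
print(dedupe(docs, "custodial"))  # ['DOC-1', 'DOC-3', 'DOC-4']
```

Global scope drops bob's copy of the report because alice already holds one. Custodial scope keeps it, since it is the only copy in bob's data.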

As a separate issue, sometimes it's important to know which documents within a data collection have almost the same content. The key documents relevant to a case may contain only minor differences from one another. A party being represented in litigation may own documents that contain a large amount of form or standard language. Custodians may have documents that have only been slightly altered and stored or shared as multiple files. Documents may contain fonts that match their background in color, thus rendering the text invisible upon visual review. Whether it's for one of these reasons or another, sometimes grouping documents based upon similar text content represents a strategic advantage during review. This is the analysis Near Deduplication provides.

Near Deduplication works based upon the searchable text content of a document. This text is extracted during common eDiscovery processing or, for documents without extractable text, via Optical Character Recognition (OCR, the conversion of images of text into searchable text). The extracted and/or OCR text is analyzed, the documents are grouped based on their similarity, and each document is assigned a percentage score based upon its textual content. The higher the percentage, the more similar the document's text is to that of the Pivot or Master document (the document with the highest word/character count in a Near Duplicate grouping). The output is delivered in the form of a CSV load file (viewable in any text editor or Microsoft Excel) and may contain additional field information including the number of near duplicates, associations of near duplicates, and even the number of words contained within each document.
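To make the grouping idea concrete, here is a minimal sketch in Python. It is not how any particular eDiscovery product computes its scores: the word-level comparison with the standard library's `difflib.SequenceMatcher`, the 80% threshold, the greedy assignment, and the names `similarity_pct` and `group_near_duplicates` are all assumptions chosen for illustration. It does follow the description above in one respect: the longest document in each group serves as the pivot.

```python
from difflib import SequenceMatcher

def similarity_pct(a_words, b_words):
    # autojunk=False stops difflib from ignoring very frequent words in long texts.
    matcher = SequenceMatcher(None, a_words, b_words, autojunk=False)
    return round(matcher.ratio() * 100, 1)

def group_near_duplicates(texts, threshold=80.0):
    """texts: {doc_id: extracted_text}. Returns one row (dict) per document."""
    words = {doc_id: text.lower().split() for doc_id, text in texts.items()}
    # Longest documents first, so each group's pivot has the highest word count.
    order = sorted(words, key=lambda d: (-len(words[d]), d))
    assigned = set()
    rows = []
    for pivot in order:
        if pivot in assigned:
            continue
        assigned.add(pivot)
        rows.append({"doc_id": pivot, "pivot_id": pivot,
                     "similarity_pct": 100.0, "word_count": len(words[pivot])})
        for other in order:
            if other in assigned:
                continue
            score = similarity_pct(words[pivot], words[other])
            if score >= threshold:
                assigned.add(other)
                rows.append({"doc_id": other, "pivot_id": pivot,
                             "similarity_pct": score, "word_count": len(words[other])})
    return rows

texts = {
    "A": "the parties agree to the terms set out in this contract dated 1 May",
    "B": "the parties agree to the terms set out in this contract",
    "C": "lunch menu for friday",
}

for row in group_near_duplicates(texts):
    print(row)
```

Each row carries the kind of fields a load file might hold (document, pivot, percentage, word count). Writing the rows out with Python's `csv.DictWriter` would produce a CSV along those lines.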

Be aware that because Near Deduplication relies on text that may come from OCR, it does not produce absolute results. Issues like handwriting and variations in font styles and sizes can reduce OCR accuracy and therefore alter Near Deduplication scores. Because of this, there is risk in making coding decisions about documents based upon the Near Deduplication output alone. It is recommended that near duplicates be reviewed before making any final decisions with regard to document production. Keep in mind that one document may share 98% of its text with another, and the remaining 2% could be the difference between corroboration and contradiction.
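As a small illustration of why a high score still calls for human review, the sketch below lists only the words that differ between two near duplicates. It uses Python's standard `difflib` module; the function name `word_differences` and the sample sentences are invented for the example.

```python
from difflib import SequenceMatcher

def word_differences(text_a, text_b):
    """Return (change, words_in_a, words_in_b) for every span that is not identical."""
    a_words, b_words = text_a.split(), text_b.split()
    matcher = SequenceMatcher(None, a_words, b_words, autojunk=False)
    diffs = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag,
                          " ".join(a_words[a_start:a_end]),
                          " ".join(b_words[b_start:b_end])))
    return diffs

first = "the supplier did approve the shipment"
second = "the supplier did not approve the shipment"

print(word_differences(first, second))  # [('insert', '', 'not')]
```

The two sentences differ by a single word, yet they say opposite things. A reviewer shown only the similarity percentage would not see that.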

The main advantage of Near Deduplication is that it groups documents based upon text. This can allow for batching of documents to specific reviewers and/or experts, the ability to prioritize the review of a group of near duplicates based on the relevancy of their topic, and the ability to facilitate a more efficient review as opposed to relying on traditional linear review tactics. For these reasons, along with the ability to resolve the data issues described earlier in this article, Near Deduplication has the potential to save time and thereby money during the most expensive phase of eDiscovery, document review.

Whether Near Deduplication is the right analytic process to use on your data set will be situation-specific. This should be decided after collaboration with the legal team, litigation support, and/or the service provider assisting in getting the data prepared for review. As with any eDiscovery process, gaining an understanding of the nature and types of electronic files being provided by the litigating parties will assist in deciding which tools are appropriate to apply. Whether you get handed server data, loose files, horseshoes, or hand grenades, you can add new dynamics to your case management by exploring the data analytic resources available to you and putting them to good use.

