Creative culling cuts legal costs

A useful guide designed to help legal professionals reduce data volumes for eDiscovery.

Written by the Customer Success team at


Over 100 billion emails a day. That’s how much email the business world generates, according to a recent Radicati Group report. And email is only part of the mountain of electronic data generated daily.

When the inevitable lawsuit arises, all this data can become a huge problem. Whether your client is filing a lawsuit or fending one off, pre-trial electronic discovery, also called eDiscovery, is a growing cost center in litigation. In fact, a recent RAND study found that the review process alone accounts for 73 percent of the total cost of producing electronic documents relevant to a civil lawsuit.

For beleaguered legal professionals looking for ways to tame this mountain of data for litigation, this paper delivers the techniques you need to cull the dataset as much as possible, so you review only the data that is actually relevant to the legal matter at hand, saving money and those precious hours in the day.

Understanding the Data Culling Basics

Culling data is easier than reviewing unnecessary data; however, that doesn’t mean it’s easy. Before you start implementing creative solutions to your data culling problems, take a moment to make sure you understand the basics of data culling.

Classic culling

Let’s start with two starter strategies commonly used for data culling: de-duplication and de-NISTing. De-duplication removes duplicate files that users may have created, such as the same copy of a Word document sitting in a user’s Desktop and Documents folders. De-NISTing removes non-user-generated system files, such as applications and files in the Windows operating system folder, based on a list published by the National Institute of Standards and Technology (NIST). Your eDiscovery software may offer de-duplication and de-NISTing as features that run automatically. While these are great first steps in culling data, they won’t be enough to reduce large datasets to manageable levels.
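De-duplication is typically driven by content hashing, which your eDiscovery platform handles for you. As an illustrative sketch only (the `dedupe` helper and file paths are hypothetical, not a real product’s API), hash-based de-duplication works roughly like this in Python:

```python
import hashlib
from pathlib import Path

def dedupe(paths):
    """Keep only the first file seen for each content hash.

    Two files with identical bytes produce identical MD5 digests,
    so the second (and later) copies are dropped from review.
    """
    seen = set()
    unique = []
    for p in paths:
        digest = hashlib.md5(Path(p).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique
```

Real platforms usually also support "global" versus "per-custodian" de-duplication; the sketch above is the global variant, where a file is kept only once across the whole dataset.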

The meat’s in the meta

Before we get into the really savvy data culling techniques that will pare those monster sets down to size, we need to take a moment to discuss the tool that unlocks your culling potential: metadata. Metadata is data about data, written into data files. It describes the structure and content of a data file, which makes the file searchable. Before performing a document review, you will want to familiarize yourself with the metadata available within your documents, so you know what to look for and what to exclude. By taking a little time to explore which documents matter, which don’t, and the metadata that differentiates the two, you’ll be better prepared to cull your data.

You might want to know:

  • What time zone the data was processed in
  • Whether a normalized, searchable date field is available
  • What format custodian names are in
  • What language the documents are written in
  • Whether metadata exists to indicate a document’s file type
  • What other metadata fields are available and how they are defined

The more metadata you can access, the better you will be able to cull your dataset to a manageable volume using the workflows you’ll learn here.
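As a quick illustration of exploring what metadata you have, here is a small Python sketch that reads a CSV-style load file (a common export format for document metadata; the field names shown are hypothetical) and reports each field alongside a sample value:

```python
import csv
import io

def inspect_load_file(csv_text):
    """List the metadata fields in a load file with a sample value for each.

    Takes the load file's text and returns {field_name: first_row_value},
    a quick way to see which fields are populated and how they look.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    first_row = next(reader)
    return {field: first_row[field] for field in reader.fieldnames}
```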

Minding the quality matters

Before you can cull your dataset using in-text search parameters, you need to ensure the data files are searchable as text. To do this, randomly select and copy (pro tip: never operate on a live patient, so please use caution with original files) a few documents in different formats (e.g., native, image, and text). View these documents and select fairly unique text samples from each. Then, perform a search for that text. If a document comes up in your search, you know it is searchable. If a document doesn’t come up as expected, first check that you performed the test properly. If you still don’t get the results you expect, you will need to determine whether the problem is ineffective optical character recognition (OCR) or the search rules of the system you are using.
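The sampling step above can be sketched in Python. This is an illustrative stand-in, not a real platform API: `search_fn` represents whatever search your review tool exposes, and the snippet selection simply takes a phrase from each document’s extracted text:

```python
import random

def searchability_check(documents, search_fn, sample_size=3, seed=None):
    """Sample documents and confirm the search engine finds a phrase
    known to appear in each one's extracted text.

    documents: {doc_id: extracted_text}
    search_fn: callable taking a phrase, returning matching doc ids
    Returns the ids of documents the search failed to find.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(documents.items()), min(sample_size, len(documents)))
    failures = []
    for doc_id, text in sample:
        # Use the first few words as a "fairly unique" test phrase.
        snippet = " ".join(text.split()[:5])
        if doc_id not in search_fn(snippet):
            failures.append(doc_id)
    return failures
```

A non-empty failure list points at either bad OCR (the extracted text doesn’t match what you see) or search rules you didn’t expect (stemming, stop words, punctuation handling).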

Creative Culling Techniques

Now that you understand what you have to work with, it’s time to get creative! First, realize that no two cases are the same and no two datasets are the same. Some of these techniques will work on the data you have; others might not. If you know your data, you’ll be able to select the culling techniques that can drastically reduce the number of files you need to review. Each of these cost-effective culling workflows can be used with almost any kind of eDiscovery software.

Excluding irrelevant signature text from specific search results

Signature text is a standardized block of text included in outgoing emails. By excluding this text, you can reduce the possibility of pulling up irrelevant data files in your searches. For example, searching for the word “confidential” can pull up any emails that include a confidentiality warning like this:

The contents of this email message and any attachments are confidential and are intended solely for addressee. The information may also be legally privileged. This transmission is sent in trust, for the sole purpose of delivery to the intended recipient. If you have received this transmission in error, any use, reproduction or dissemination of this transmission is strictly prohibited. If you are not the intended recipient, please immediately notify the sender by reply email or phone and delete this message and its attachments, if any.

You can exclude these false leads by combining a proximity search with a NOT condition. Test this technique by searching for the word “confidential” occurring within 3 words of “intended.” If the results include only those messages with the irrelevant signature block, you can then create a search for “confidential” NOT (“confidential” within 3 words of “intended”). This should produce relevant results.
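Most review platforms implement proximity operators internally; as a rough illustration of the logic only (all function names here are hypothetical), a Python sketch of “X within n words of Y” combined with a NOT condition might look like:

```python
import re

def within_n_words(text, a, b, n=3):
    """True if word a occurs within n words of word b (either order)."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

def relevant_hits(emails, term="confidential", near="intended", n=3):
    """Emails containing the term, excluding those where it appears
    near 'intended' -- i.e. only inside the boilerplate signature."""
    return [
        body for body in emails
        if term in body.lower() and not within_n_words(body, term, near, n)
    ]
```

Note the trade-off: an email that is genuinely confidential but also carries the signature block would be excluded by this exact query, which is why the text above tells you to test the proximity search on its own first.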

Identifying and removing signature images

People also like to include images in their signatures. It seems harmless enough, but it can be quite a nuisance when performing a document review. The same image file appears repeatedly, wasting the reviewer’s time and resources. Luckily, the right metadata can easily cull these images from the review process.

In order to isolate and exclude these images, you need to locate the MD5 hash value of the image, which is more effective than trying to search by file name or file size. Your eDiscovery software must also be able to search for this value. By locating these files, you can remove them from the dataset fairly quickly, or you can bulk tag the files as non-responsive.
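As a sketch of the idea (the function and tag names are hypothetical, not a specific product’s feature), hashing attachment bytes and bulk-tagging matches could look like this in Python:

```python
import hashlib

def tag_signature_images(files, signature_md5s):
    """Bulk-tag files whose content hash matches a known signature image.

    files: {file_name: raw_bytes}
    signature_md5s: set of MD5 hex digests of known signature logos
    Returns {file_name: "non-responsive" | "review"}.
    """
    tags = {}
    for name, content in files.items():
        digest = hashlib.md5(content).hexdigest()
        tags[name] = "non-responsive" if digest in signature_md5s else "review"
    return tags
```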

Identifying privileged and irrelevant email domains

Often, you will find that emails that are irrelevant or inappropriate for your purposes have been included in your dataset. This includes emails covered by Attorney-Client Privilege and commercial emails with no relevance to the legal matter at hand. These emails can be easily identified using the email domain name.

You can exclude these emails by searching email metadata fields, including To, From, CC, and BCC, for the specific email domain names you want to exclude.

You may want to review emails to an attorney separately or exclude them altogether, but don’t stop there. Visit Alexa’s shopping category to discover the domain names used by popular shopping sites, so you can exclude irrelevant commercial messages from your dataset as well. With the right software, you can even generate a report listing all the domains found within your dataset and select the ones to exclude from that list.
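A domain report of this kind is easy to sketch. Assuming emails are represented as dictionaries of metadata fields (a hypothetical structure, not any particular platform’s format), counting the domains in Python might look like:

```python
from collections import Counter

def domain_report(emails):
    """Count sender/recipient domains across a set of emails.

    emails: list of dicts with optional "to"/"from"/"cc"/"bcc" fields,
    each a list of addresses. Returns Counter({domain: occurrences}).
    """
    counts = Counter()
    for email in emails:
        for field in ("to", "from", "cc", "bcc"):
            for addr in email.get(field, []):
                counts[addr.split("@")[-1].lower()] += 1
    return counts
```

Sorting the resulting counter by frequency gives you the report described above: a ranked list of domains from which to pick law-firm domains for privilege review and shopping-site domains for exclusion.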

Placing non-standard file types in a separate workflow

You can speed up the review process by separating your dataset into different workflows based on the file types. Ideally, this wouldn’t be necessary, because all data files would be equally viewable and searchable. Unfortunately, audio, video, and industry-specific files (like engineering blueprints) can cause significant problems during the eDiscovery process. By separating these files from your searchable workflow, you can minimize the time you would otherwise waste by repeatedly wading through files that are difficult to view and search.

Start by generating a report that lists all the file types in your dataset. Identify file types that are not viewable with your review software and separate them into their own workflow to be reviewed as a group at a later time. By delivering this new, smaller dataset to a workstation that can process these files, you can better allocate your firm’s resources. You may also use this technique to exclude executable files and other irrelevant file types from your review process.
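As an illustrative sketch (the reviewable-extension list is a placeholder you would tailor to what your review software can actually display), routing files into workflows by type could look like:

```python
from pathlib import Path

# Hypothetical list of types the main review tool can display and search.
REVIEWABLE = {".doc", ".docx", ".xls", ".xlsx", ".pdf", ".msg", ".txt"}

def split_workflows(paths):
    """Route standard file types to the main review workflow and
    everything else (audio, video, CAD, executables) to a side queue."""
    main, side = [], []
    for p in paths:
        (main if Path(p).suffix.lower() in REVIEWABLE else side).append(p)
    return main, side
```

Note that extension alone can be spoofed or wrong; production tools typically identify file types by content signature, which is the more reliable basis for this split.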

Removing near-duplicates

Most eDiscovery programs remove exact duplicates easily enough, but you’ll find they don’t target near-duplicates quite so easily. Minor differences, like an email’s received date, will cause many programs to list near-duplicates, or near-dupes, even though you don’t need to review both documents. You can use the body text of a Word document, or a combined string of the From, To, CC, BCC, and Subject fields from emails, to generate an MD5 hash value. If two documents produce the same hash, you can consider them near-dupes and exclude the unnecessary copies from further review. Just be careful here and trust-but-verify the results by sampling the near-dupes first using a well-defined sampling technique like a simple random sample.
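The hashing approach described above can be sketched as follows. The field names and dictionary structure are assumptions for illustration, and commercial near-duplicate detection is usually more sophisticated (e.g., text shingling), but the principle is the same: hash only the fields that should match, ignoring the volatile ones:

```python
import hashlib

def near_dupe_key(email):
    """Hash a normalized from/to/cc/bcc/subject string so emails that
    differ only in volatile fields (e.g. received date) collide."""
    fields = (email.get(f, "").strip().lower()
              for f in ("from", "to", "cc", "bcc", "subject"))
    return hashlib.md5(",".join(fields).encode("utf-8")).hexdigest()

def drop_near_dupes(emails):
    """Keep the first email seen for each near-dupe key."""
    seen, kept = set(), []
    for e in emails:
        key = near_dupe_key(e)
        if key not in seen:
            seen.add(key)
            kept.append(e)
    return kept
```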

Searching email subject lines

If the email subject line is one of the metadata fields available to you, then you may use the subject line to exclude irrelevant, generic messages from your dataset. For example, many companies distribute weekly newsletters and quarterly reports that are irrelevant for review purposes. There are also plenty of HR/Administrative type emails that can safely be categorized as non-responsive. When you run across an email with a subject line you know is generic and irrelevant, jot it down and run a search for a keyword or phrase within the subject line field to isolate these irrelevant documents. You can use this to eliminate or cull a lot of “Out of Office” automatic responses as well.
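As a minimal sketch (the phrase list is a hypothetical starting point you would build up as you encounter generic messages in review), isolating irrelevant subject lines could look like:

```python
# Hypothetical phrases collected while reviewing; extend as you go.
GENERIC_SUBJECTS = ("out of office", "weekly newsletter", "quarterly report")

def cull_by_subject(emails, phrases=GENERIC_SUBJECTS):
    """Split emails into (review, culled) by generic subject-line phrases."""
    review, culled = [], []
    for e in emails:
        subject = e.get("subject", "").lower()
        (culled if any(p in subject for p in phrases) else review).append(e)
    return review, culled
```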

Removing personal emails

You may also receive a dataset that is cluttered with personal emails that are irrelevant to your case. Employees use their work email to send messages to their spouses, their family, and their friends. Identify personal email addresses and search for emails with those addresses in the sender or recipient fields. Then, you can simply remove or cull them from the review.
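The same filtering pattern applies here; a brief sketch, again assuming a hypothetical dictionary representation of email metadata:

```python
def cull_personal(emails, personal_addresses):
    """Remove emails where any sender or recipient is a known personal address."""
    personal = {a.lower() for a in personal_addresses}
    return [
        e for e in emails
        if not personal & {a.lower()
                           for f in ("to", "from", "cc", "bcc")
                           for a in e.get(f, [])}
    ]
```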

Suppressing embedded objects

Another easy way to reduce your dataset is to suppress embedded objects occasionally found within Office documents, such as Microsoft Word or PowerPoint files. If embedded content is fully viewable within the original document, suppressing it means the container and its embedded object are treated as a single file, which can significantly reduce the number of individual documents you need to review. If embedded content is not fully viewable within the original document, you might choose to cull embedded graphics from review but keep embedded Excel files. If your processing software is sophisticated enough, it can identify embedded files that may not be fully viewable, in which case you can keep only those files in your review queue.

In Summary

Culling your dataset can make the messy eDiscovery process more manageable, leading to cost and time savings. It all begins with getting to know as much as you can about the case and the data set. The outcome of many culling efforts depends on this initial knowledge and check of data quality.

Datasets can be reduced using automated techniques like duplicate identification along with more creative approaches like separating non-standard file types and identifying junk email by subject line. Whatever combination of culling tools and techniques you use on your case’s dataset, testing them and verifying the resulting reduced set is essential for culling success.
