As the calendar races to Election Day, the long-running probe into Hillary Clinton’s use of a private email server has reemerged as the central non-Donald J. Trump-generated issue in the campaign. While some (namely, FBI director James Comey’s boss’s boss’s boss) have questioned the timing of Comey’s announcement that the Bureau is reopening the case, others are more perplexed by the logistics of the investigation itself and what, exactly, investigators hope to find in a sea of email that, they’ve previously stated, would be impossible to sift through by Tuesday. To be sure, this is a calamity of the highest order, highlighted by the irony that the FBI is apparently using a homebrewed document review tool. For real.
Throughout the week, in interviews with national media outlets, including CBS News, Logikcull has attempted to explain just what the hell is going on and, specifically, in layman’s terms, how the FBI’s “special computer program” is going to make sense of some 650,000 emails in four days. In short, it’s not.
Below we present a primer on where the investigation stands, what the FBI is looking for, how they propose to find it and why that approach is completely futile. As always, the only horse we have in this race is America. We’re just here for the email.
What is the FBI looking for anyway?
Though Comey announced in July that the Bureau would not bring criminal charges against Clinton for her alleged mishandling of classified government information related to her use of a personal server, the emergence of a laptop belonging to Anthony Weiner, estranged husband of top Clinton aide Huma Abedin, appears for some reason to have put that conclusion in doubt. Why? It’s not totally clear. What’s even less clear is why it took the FBI more than three weeks to uncover emails from a laptop it seized on October 3 as part of a separate investigation into whether Weiner had been sending illicit texts to an underage girl.
At any rate, the FBI appears to be reexamining, in light of this new trove of emails, some of which appear to have been sent or received over Clinton’s personal server, whether anyone in her inner circle intentionally mishandled, covered up or withheld communications containing national security information. This maze is about as convoluted as the previous sentence. Suffice it to say, there is no new news here, per se, just more email masquerading as news. Now if it turned out that, for instance, the recipient of Weiner’s texts was actually a Russian operative attempting to secure nuclear codes just floating around on HRC’s private server, then we’d have a real story. But even the internet can’t concoct a conspiracy that rich… Fact is, there’s not much to see here, people. At least, not at the moment.
What are the challenges associated with searching 650,000 emails in a few days?
This was the focus of a CBS News story in which our own Andy Wilson, CEO, was featured. Pull quote: “If they do it in a more modern way, it could be done literally in a couple of hours or a day or two.” This is to say that perhaps the FBI’s greatest challenge is the technology itself. To be fair, the Bureau is hamstrung by the Justice Department’s retrograde review protocols (we wrote about them here), but the old-school review platform is no great shakes either.
There isn’t much known about this magic box, other than that one of its primary purposes is to de-duplicate messages the FBI has already reviewed. But the very fact that agents have had to gather at the FBI’s Operational Technology Division in Quantico to build, de-dupe and review the database tells us this project is not particularly scalable. It is limited by the same forces (on-premises software, physical location, device-specificity, linear workflow) that constrict all “legacy” review projects.
Apparently the FBI does not take its cues from its more covert sibling, which is a shame because, as we point out, a review of this magnitude could be knocked out in, literally, hours with the benefit of infinitely distributable resources. On a related note, Logikcull has offered its software to the FBI for free. And that offer stands.
What is the FBI’s likely course of action to review the emails?
The goal of any review of this magnitude is to separate the wheat from the chaff as quickly and precisely as possible. While there may be 650,000 emails at issue, likely only a small fraction of those is pertinent to the investigation. This is the case with almost all legal reviews. A basic rule of thumb is that 90% of any raw document corpus will be junk: duplicate files, clearly irrelevant material, spam, system files, etc. Still, that would leave the FBI with some 65,000 documents to review. Estimates from other experts have ranged from 1,000 to 20,000.
Regardless, once agents have removed useless and duplicative material from the review set, they will begin to pinpoint a subset of specific types of communications by searching on metadata fields like email sender and recipient, creation date, file type and so forth. With Logikcull, categorization and indexing by these data points are performed automatically. It is almost certain the FBI’s technology will require manual input.
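For the curious, here is roughly what those first two passes (de-dupe, then filter on metadata) look like in code. This is a minimal Python sketch, not a description of how the FBI's tool or Logikcull actually works. The `Email` record, the `fingerprint` recipe and the filter parameters are all our own assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Email:
    """Hypothetical, simplified email record used only for this sketch."""
    sender: str
    recipients: tuple
    sent: date
    subject: str
    body: str


def fingerprint(msg):
    """Hash the fields that (by assumption) make two messages 'the same'."""
    key = "\n".join([
        msg.sender.lower(),
        ",".join(sorted(r.lower() for r in msg.recipients)),
        msg.sent.isoformat(),
        msg.subject.strip(),
        msg.body.strip(),
    ])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def dedupe(messages, already_reviewed=frozenset()):
    """Keep the first copy of each message; skip fingerprints seen before."""
    seen = set(already_reviewed)
    unique = []
    for msg in messages:
        fp = fingerprint(msg)
        if fp not in seen:
            seen.add(fp)
            unique.append(msg)
    return unique


def filter_by_metadata(messages, sender_domain=None, start=None, end=None):
    """Narrow the set by sender domain and/or an inclusive date range."""
    kept = []
    for msg in messages:
        if sender_domain and not msg.sender.lower().endswith("@" + sender_domain.lower()):
            continue
        if start and msg.sent < start:
            continue
        if end and msg.sent > end:
            continue
        kept.append(msg)
    return kept
```

Passing the fingerprints of previously reviewed messages as `already_reviewed` is how a pass like this would skip emails investigators have already seen. Real de-duplication is fussier (attachments, threading, near-duplicates), so treat the hashing recipe as a starting point only.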
From there, reviewers will likely use keyword searches (e.g., “Classified,” “Carlos Danger,” “nuclear codes” JK ;) to further winnow the data. Most modern Legal Intelligence platforms allow for these searches to be performed not just on the document, but on the document’s metadata: for instance, running a search for the keyword “HRC” within email subject lines or in recipient domain fields. But it is uncertain whether the FBI has that ability.
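To make the "search the metadata, not just the document" idea concrete, here is a small Python sketch of a field-scoped keyword search. It assumes each message is a plain dictionary with keys such as `from`, `to`, `subject` and `body`. Those key names, and the `search` function itself, are ours for illustration and do not come from any particular review platform.

```python
import re


def search(messages, keyword, fields=("subject", "body")):
    """Return messages where `keyword` appears as a whole word in any chosen field.

    Matching is case-insensitive. `fields` decides whether we look at the
    document text, the metadata, or both.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for msg in messages:
        if any(pattern.search(str(msg.get(field, ""))) for field in fields):
            hits.append(msg)
    return hits
```

Calling `search(messages, "HRC", fields=("subject",))` would look only at subject lines, while `fields=("to",)` with a domain as the keyword would restrict the search to recipients. A production system would use a prebuilt index rather than scanning every message on each query, but the scoping idea is the same.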
Finally, once an “eyes-on” review set has been identified (and it is likely, given the sensitivity and importance of the investigation, that this corpus will be larger than it needs to be), agents will examine the documents one at a time for information they feel Clinton’s camp should have previously turned over, or that is otherwise sensitive.
How can one person’s laptop contain 650,000 emails?
This isn’t just any person. But, in seriousness, that these emails reside on a laptop in the first place suggests that the parties at issue here (Weiner, Abedin, Clinton, etc.) had been using a mail app to connect to cloud-hosted email. If Weiner uses a MacBook (Google says he does; also, don’t click on that link), it’s likely those emails were synced between the cloud and his, um, device via the Mac Mail email client.
And why does that matter?
The resultant mailboxes take the form of .MBOX files, within which all of the emails and attachments, along with the complete folder structure and metadata, are stored. Parsing an .MBOX mail database is not only fast, but easy with the right tools. For instance, using Logikcull, you can drag-and-drop the .MBOX files into the app and Logikcull will automatically extract emails and attachments, de-duplicate all the files, make images searchable, detect languages spoken (alas, not the language of love), preserve all the metadata, identify email domains and categorize them, detect privileged content, render each email to PDF… You get the point. Basically everything the FBI is doing manually and linearly.
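To show that reading an .MBOX file really isn't exotic, here is a short sketch using nothing but Python's standard-library `mailbox` module. It pulls sender, sender domain, subject, date and attachment filenames out of each message. The `summarize_mbox` function and the shape of its output are our own invention for illustration, and this is in no way a picture of Logikcull's internals.

```python
import mailbox
from email.utils import parseaddr


def summarize_mbox(path):
    """Read an mbox file and return one metadata dict per message."""
    rows = []
    # create=False: raise an error instead of silently creating a missing file
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            _, sender = parseaddr(str(msg.get("From", "")))
            # walk() visits every MIME part; parts with a filename are attachments
            attachments = [
                part.get_filename() for part in msg.walk() if part.get_filename()
            ]
            rows.append({
                "sender": sender,
                "sender_domain": sender.rpartition("@")[2].lower(),
                "subject": str(msg.get("Subject", "")),
                "date": str(msg.get("Date", "")),
                "attachments": attachments,
            })
    finally:
        box.close()
    return rows
```

Each row can then be fed into whatever de-duplication, filtering or search step comes next. OCR, language detection and PDF rendering are separate jobs that this sketch does not attempt.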
That the FBI appears to be building its own review database is not unlike building your own microwave to cook a 12-course dinner. High-powered, accessible and highly distributed software is widely available. There’s no need for custom software or, as the New York Times called it, a “special computer program.” There’s also no need to shuttle agents in and out of Quantico, because, with a cloud-based Legal Intelligence solution, you’d just extend secure access to investigators working remotely. It’s no different than creating a Facebook account, and just as easy.
It can’t be overstated what an unnecessary constraint the Bureau has placed on itself by opting to conduct this review in a single physical location. In addition to literally having to fly people to Quantico, as the Times reports, should agents find previously undiscovered emails containing classified information, “copies would have to be sent to other government agencies to determine their classification.” Compare that with just creating different user permissions and conducting this review simultaneously, on the same documents, collaboratively and in real time.
Again, FBI, we’re right here for you. There are roughly three days left. Let’s end this national nightmare together.