Authored by Prof J. Mark Bishop
Ahead of the latest episode in the Boyes Turner tech podcast series, Prof J. Mark Bishop shares his thoughts on ‘e-Discovery and Artificial Intelligence...’
Can you imagine this? Events unfold and you are dropped into the opening of a long and complex case with 500,000 emails to sift through and you’re not even sure what you are looking for, who you are looking for, or when any incidents of interest may have occurred.
Currently, document review is the most labour-intensive task in an e-Discovery investigation, often consuming more than 75% of the project budget. This is largely because researchers review the documents manually. To put this into context, reviewing half a million documents by hand at 25 documents an hour would take 20,000 person-hours. Because it is practically impossible to review every document in the target corpus by hand, results are too often limited to what simple keyword searches return. Unfortunately, coming up with responsive keywords is not trivial, as a researcher often does not know exactly what she is looking for beforehand. Surprisingly, perhaps, this problem is quite old...
In 380 BC the Greek philosopher Plato posed the following questions in the Socratic dialogue that subsequently became known as the Meno:
- How do you enquire, Socrates, into that which you know not?
- What will you put forth as the subject of enquiry?
- And if you find what you want, how will you ever know that this is what you did not know?
And this, of course, relates to the core problem of e-Discovery because, typically, we don’t know who we are looking for, how they will present, or when any communications of interest occurred.
Plato’s response to the problem was his “Theory of Recollection”, wherein knowledge discovery is merely “our ‘recollection’ from a period long before the soul was imprisoned inside its physical body”. Slightly more prosaically, modern e-Discovery tools reveal answers by the application of Artificial Intelligence and Analytics.
1. Forensic Data Collection
The first stage of e-Discovery is the collection of forensically sound data for litigation and investigations. Courts can sanction organisations if reasonable steps are not taken to preserve electronic data. Forensic data collection tools such as OpenText’s ‘EnCase’ enable organisations to precisely collect and preserve potentially relevant data, either on premises or in the cloud, with a defensible process that ensures a strict chain of custody. Once litigation data has been collected, it must be processed, reviewed and produced.
2. ‘Classical’ e-Discovery Analytics
Market-leading e-Discovery analytics from vendors such as OpenText and Nuix typically offer variants on the following core technologies:
Once data has been loaded, documents can be analysed to determine common ‘concepts’. Such ‘concept analysis’ may reveal that users A and B have often communicated on, say, ‘share pricing’ whilst users A and C may have most often communicated on ‘Manchester United’ etc.
On the flip side of ‘data-driven’ concept analysis, e-Discovery platforms will also offer sophisticated ‘user-driven’ key-phrase search. This allows users to search the documents for specified instances of two-, three- or four-word phrases (and phrases ‘suitably’ close to these). In addition, sophisticated e-Discovery platforms may also allow ‘predictive search’, wherein users can search for similar documents across the entire corpus.
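By way of illustration, the idea of matching a phrase and phrases ‘suitably’ close to it can be sketched in a few lines of Python. This is a minimal sketch, not how any particular vendor implements search: the function names and the `max_gap` parameter are hypothetical, and ‘closeness’ is assumed to mean that the phrase words appear in order with at most `max_gap` other words between neighbours.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def contains_phrase(text, phrase, max_gap=0):
    """Return True if the words of `phrase` occur in order in `text`,
    with at most `max_gap` other words between consecutive phrase words.
    max_gap=0 is an exact phrase match. (Hypothetical helper.)"""
    words = tokenize(text)
    targets = tokenize(phrase)
    if not targets:
        return False

    def match_from(pos, idx):
        # `pos` is where targets[idx - 1] matched; look for targets[idx]
        # within the next (1 + max_gap) positions.
        if idx == len(targets):
            return True
        window_end = min(len(words), pos + 2 + max_gap)
        for nxt in range(pos + 1, window_end):
            if words[nxt] == targets[idx] and match_from(nxt, idx + 1):
                return True
        return False

    return any(
        word == targets[0] and match_from(i, 1)
        for i, word in enumerate(words)
    )

# Exact match, then a looser match that tolerates two intervening words.
print(contains_phrase("The share price fell sharply", "share price"))
print(contains_phrase("the share of the price", "share price", max_gap=2))
```

Raising `max_gap` trades precision for recall: more documents match, but more of them will be irrelevant.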
Next, the data can be visualised. Market-leading e-Discovery tools from vendors such as OpenText and Nuix provide data visualisation tools to plot ‘social network graphs’, foregrounding communications between individuals of interest over specified time windows on particular topics. Such ‘Hypergraphs’ enable users to easily see who a given user has been communicating with. Furthermore, perhaps targets A and B haven’t directly communicated on a given topic but have both communicated on it via a third person, C, hence revealing the ‘degree of separation’ between A and B on a given topic.
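The ‘degree of separation’ idea can be made concrete with a small graph search. The sketch below is illustrative only: it assumes each message has already been given a topic label (for instance by concept analysis), represents messages as hypothetical `(sender, recipient, topic)` tuples, and uses a standard breadth-first search to count the fewest communication hops linking two people on one topic.

```python
from collections import deque

def degrees_of_separation(messages, source, target, topic):
    """Fewest communication hops linking `source` to `target` on `topic`,
    or None if they are not connected on that topic."""
    # Build an undirected graph from the messages tagged with the topic.
    graph = {}
    for sender, recipient, msg_topic in messages:
        if msg_topic == topic:
            graph.setdefault(sender, set()).add(recipient)
            graph.setdefault(recipient, set()).add(sender)

    if source == target:
        return 0

    # Breadth-first search visits people in order of increasing hop count.
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        person, hops = queue.popleft()
        for neighbour in graph.get(person, ()):
            if neighbour == target:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None

# Invented example data: A and B only discuss share pricing via C.
messages = [
    ("A", "C", "share pricing"),
    ("C", "B", "share pricing"),
    ("A", "B", "Manchester United"),
]
print(degrees_of_separation(messages, "A", "B", "share pricing"))      # 2
print(degrees_of_separation(messages, "A", "B", "Manchester United"))  # 1
```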
Predictive coding prioritises documents for review, hence reducing the number that need human review. Typically, predictive coding exploits a mini corpus of documents tagged ‘relevant to a case’ and deploys AI to probe the remaining corpus for similar documents. User feedback enables the engine to improve over time and, by gathering data on the percentage of relevant documents automatically retrieved, predictive coding enables robust confidence measures to be determined, which in turn supports the visibility and transparency of the review.
Modern e-Discovery tools typically embed the above technologies within a sophisticated user interface to significantly speed up the document review process.
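To give a feel for the ‘find similar documents’ step, here is a deliberately simple sketch. It is not how commercial predictive coding engines work (they typically train and retrain a classifier on reviewer feedback); instead it swaps in plain bag-of-words cosine similarity against the tagged seed set. The function names and the sample documents are invented for illustration.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Count how often each lower-cased word occurs in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_for_review(seed_docs, unreviewed):
    """Order unreviewed documents by similarity to the tagged seed set,
    most similar first. Returns a list of (score, document) pairs."""
    profile = Counter()
    for doc in seed_docs:
        profile.update(term_counts(doc))
    scored = [(cosine(profile, term_counts(doc)), doc) for doc in unreviewed]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Invented seed set (tagged 'relevant') and unreviewed documents.
seeds = [
    "share price forecast before the announcement",
    "sell shares before the price drops",
]
unreviewed = [
    "lunch on friday?",
    "move the shares before the price announcement",
    "match tickets for saturday",
]
for score, doc in rank_for_review(seeds, unreviewed):
    print(f"{score:.2f}  {doc}")
```

Reviewers would then work down the ranked list; a real system would also discount common words such as ‘the’ and update the ranking as new documents are tagged.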
3. FACT360: a radically different approach to Document Analytics
At TCIDA (The Centre for Intelligent Data Analytics at Goldsmiths, University of London) our research in e-Discovery stemmed from a long-standing interest in a difficult, but related, cyber-security issue: the so-called ‘Insider threat’ problem. Specifically, we were interested to see if it was possible to detect ‘early warning signals’ typical of aberrant employee behaviour from email communications. In this search we were not, of course, looking for evidence of the proverbial ‘smoking gun’ (“I am going to defraud my company tomorrow”), but rather searching for subtle linguistic cues characteristic of changing employee motivation.
Our key insight was based on an observation from Claudia Aradau, Professor of International Relations in the Department of War Studies, King’s College London, that what people actually do [message transactions] quite often yields more useful intelligence than what people say [message semantics]. This field, Transactional Analytics, goes back to the pioneering work of Gordon Welchman at Bletchley Park in WW2.
At Bletchley Park, Alan Turing (the BBC’s “Icon of the century”) was primarily concerned with decrypting the meaning of coded German communications [message semantics], whereas his colleague Gordon Welchman was primarily concerned with ‘message transactions’: for example, that German commander A in Bayeux is communicating with German commander B in Caen at 9:17am on 6th June 1944. NB: it is interesting to note that, unlike Turing’s, much of Welchman’s research still falls within the UK’s ‘Official Secrets Act’.
FACT360 uses a simple three-phase workflow model (Ingest; Process; Review) and deploys sophisticated new AI algorithms to automatically detect the features that best characterise documents and communication channels between users over time. Furthermore, because FACT360 is fundamentally embedded in the time domain, it can easily foreground temporal anomalies. For example, in characterising communications between two groups of people, it can alert on communication ‘anomalies’ (when communications across that channel become, in some sense, ‘unusual’).
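What ‘unusual’ might mean for a communication channel can be illustrated with a very simple statistical rule. To be clear, this is not FACT360’s algorithm, which is not described here; it is a generic sketch that assumes we already have message counts per period (say, per week) for one channel, and flags any period that deviates sharply from the few periods before it. The function name, the `window` and `threshold` parameters and the sample counts are all hypothetical.

```python
from statistics import mean, stdev

def flag_anomalies(counts, window=4, threshold=3.0):
    """Return the indices of periods whose message count differs from the
    mean of the preceding `window` periods by more than `threshold`
    standard deviations. `window` must be at least 2."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as unusual.
            if counts[i] != mu:
                flagged.append(i)
        elif abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Invented weekly message counts between two groups; week 6 spikes.
weekly_counts = [10, 12, 11, 9, 10, 11, 40, 12]
print(flag_anomalies(weekly_counts))  # [6]
```

Note that the rule looks only at how much traffic there is and when, not at what the messages say.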
Furthermore, FACT360 Transactional Analytics can also reveal an individual’s ‘importance/impact’ within an organisation, show how that evolves over time (going up or down, perhaps as an employee is promoted or becomes more ‘in the loop’) and flag anomalies in employee impact.
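As a rough proxy for ‘importance/impact’, one could count how many distinct people each individual corresponds with in each period and watch how that number moves. This is an assumption made purely for illustration, not the measure FACT360 uses; the function name and the `(period, sender, recipient)` tuples are hypothetical.

```python
from collections import defaultdict

def contacts_per_period(messages):
    """Map each period to {person: number of distinct correspondents}."""
    contacts = defaultdict(lambda: defaultdict(set))
    for period, sender, recipient in messages:
        contacts[period][sender].add(recipient)
        contacts[period][recipient].add(sender)
    return {
        period: {person: len(others) for person, others in people.items()}
        for period, people in contacts.items()
    }

# Invented example: A is drawn into more conversations in February.
messages = [
    ("2020-01", "A", "B"),
    ("2020-02", "A", "B"),
    ("2020-02", "C", "A"),
    ("2020-02", "A", "D"),
]
impact = contacts_per_period(messages)
print(impact["2020-01"]["A"], impact["2020-02"]["A"])  # 1 3
```

A sudden rise or fall in such a score for one person, relative to their own history, is the kind of change an analyst might want flagged.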
Fundamentally misconstruing what Transactional Analytics (TA) offers, the CEO of a major UK data security company once remarked to me, “I would never communicate anything ‘too sensitive’ in corporate communication systems”. However, because TA focuses on subtle changes in communication transactions (and not message semantics), it does not depend on whatever topics are being discussed.
‘Classical’ e-Discovery significantly helps legal teams manage complexity by allowing users to search through emails and documents either by specific search term or by data-derived ‘concepts’. Classical systems, however, do not automatically identify ‘anomalies’. In contrast, FACT360 brings anomaly detection to the world of e-Discovery: by leveraging corpus structure, it reveals documents (and people) to help seed the initial investigation, so revealing the “unknown unknowns” that Plato (and Donald Rumsfeld) so famously alluded to...