Beyond Key Word Searching in Electronic Discovery

Key Word Searching
Listen to Jason R. Baron, the National Archives’ Director of Litigation, give his personal views on better search and retrieval methods. Jason will point out the flaws in key word searching. Jason has been an active member of the Sedona Conference and he is a frequent speaker and writer on this topic. Professionally, Jason Baron has served as the National Archives’ Director of Litigation since May 2000. In this position he is responsible for overseeing all litigation-related activities confronting the National Archives, including complex Federal court litigation involving access to Federal and Presidential records in the National Archives’ custody.

In addition, Jason Baron is the National Archives’ representative to The Sedona Conference, where he is a member of the Steering Committee for Working Group 1 on Electronic Document Retention and Production, and serves as Editor-in-Chief of The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval in E-Discovery. Mr. Baron is also a founding coordinator of the TREC Legal Track, an international research project organized through the National Institute of Standards and Technology to evaluate search protocols used in e-discovery.

In this show we candidly address some of the issues facing search and retrieval: lawyers’ misperceptions of how effective key word searching is; some of the groundbreaking studies on search and retrieval, such as the Blair & Maron study in 1985; why many current search and retrieval processes do not scale and are so expensive; the work Jason Baron has done coordinating current studies on search and retrieval with TREC; and the overall challenges with search and retrieval.

Karl Schieneman-Interviewer

Jason R. Baron-Guest

K: Hello everyone. Welcome to another edition of ESI Bytes. This is, of course, a free podcast for electronic discovery where we try to offer the national content that appears in conferences around the country with national speakers at a price everyone can afford – free, and at a time when a topic might mean something to you. Today I’m real pleased with the show. We’re doing a show on search and retrieval with Jason Baron. Jason has served as the National Archives’ Director of Litigation since May of 2000. In this position, he’s responsible for overseeing all litigation activities confronting the National Archives, including complex federal court litigation involving access to federal and presidential records in the National Archives’ custody. He also serves as NARA’s representative to The Sedona Conference (as many of the past speakers have been). Jason’s very involved with Sedona. He’s a member of the steering committee for Working Group One on electronic document retention and production. He serves as the Editor-in-Chief of the Sedona Conference’s Best Practices commentary on the use of search and information retrieval in e-discovery. He’s also a founding coordinator of Legal TREC – an international research project organized through the National Institute of Standards and Technology to evaluate search protocols used in e-discovery. If you Google Jason on “search and retrieval” you’d see that he’s everywhere on the Internet at this point. I’d like to add one caveat. He’s only expressing his personal opinions here on search and retrieval, and he is not endorsing any product. Does that sound good, Jason?

J: Yes, thanks Karl, and thanks for having me on the show.

K: I always like to start off generally…well, tell me how you got started in electronic discovery.

J: Well, for some reason (even though I’m from New England), I had Potomac Fever and I wanted to work in Washington. I got my wish (because) for the past 28 years I’ve worked in the federal government. Part of that time was at HHS, 12 years at the Justice Department as a trial lawyer and Senior Counsel in the Civil Division, and then my present job that I’ve held for the last 9 years as Director of Litigation at the National Archives. So, I’m a lifer in government. Part of that time (while at the Justice Department) I spent 7 years of my career on one case involving Ollie North’s PROFS Notes – his White House email at The National Security Council. That’s known as Armstrong vs. Executive Office of the President – I spent 7 years and 10,000 hours on that case. It was the first case (really a landmark case for the government) where email and backup tapes were at issue in terms of long-term preservation. That was back in the early 1990s and it carried through. Then, in my present job, I was hit with this wave of litigation involving tobacco. That produced a whole set of questions that we’ll get into here. It seems that email has followed me during the course of my career. It seemed a natural segue to ask questions of Sedona and be involved in activities that involved electronic records since it was so much a part of my practice.

K: It’s amazing when you spend so much time on a case. It’s like dog years. I had a 4,000-hour case in my life as an attorney. I remember saying that the first 3,000 hours were interesting, but uh…Okay. Well, let’s talk about search and retrieval issues. Most everyone uses key words to identify buckets of data they might preserve or produce in a case. From your perspective, does this work well?

J: Yes and no. It definitely is the baseline for all of us of a certain age. We grew up in the kind of an automated environment with Lexis and Westlaw, practicing law in the past couple of decades. We applied that skill of putting in key words in our early analysis in terms of e-discovery. It is still to this day the de facto baseline standard for how lawyers think about search and retrieval. They get a case (and) usually a Senior lawyer hands a Junior lawyer a case (or a team) assigned to some people to dream up interrogatories, discovery requests, requests to produce, and along with that (on the receiving end of those) you’re expected to come up with strategies for search. The predominant paradigm is to put in key words and see what you get. The problem is (as I found out at various points in my career) that you put in just a few key words to a very large database of ESI – electronically stored information, and you get a tremendous number of hits that are really false-positives that are noise. So the central question that’s animated me for the last half-decade is whether there are more efficient ways to go about searching (rather) than just putting in a few keywords. We see keyword developments – case law that is very interesting, Karl. Prior to the Federal Rule changes in 2006, there were a few scattered cases where courts tentatively reached out and asked whether one side or the other had some keywords. There was some limited negotiation. Post December 2006, you have a veritable cottage industry here (of cases) where courts are seriously engaged in questioning the amount of time and effort and sophistication of the keywords that are being entered into. Beyond that, some judges out there have said that there are alternative ways (beyond just putting in a few keywords) for lawyers to think about retrieving evidence out of large data sets — so we can get into that.

K: Let’s dig a little deeper into some of your experience with keywords as a basis for identifying potentially relevant data that’s produced. That can have some serious cost implications for discovery. I’ve seen in past presentations (that) you used a chart describing a case you were involved in called U.S. v. Philip Morris. Philip Morris served some 1,700-odd requests to produce on 30 federal agencies for tobacco-related topics. There were over 32 million Clinton-era emails (that) the government had the burden to search. It was a real interesting analysis you did when you applied it out to other people’s potential litigation. If you wanted to share some of what you discovered…

J: Right. It is my lot in life to worry about White House emails since the National Archives gets all of the White House emails every 4-8 years. Part of what we got are 20 million Presidential emails and another 12 million federal emails from the Clinton Administration White House Executive Office of the President. The Philip Morris case is a RICO action, which involved 1,726 requests to produce to 30 federal agencies, requiring the government to search 50 years of its records (paper and electronic) to come up with tobacco-related documents. My task was not only to search Presidential libraries going back to Eisenhower, but also to search against 20 million Presidential emails from the Clinton era. We did that with a set of keywords – the kind of keywords that anybody listening to this broadcast would think of (such as) tobacco, smoking, tar, nicotine, and a set of others. It turned out, however, that there were 200,000 hits that we got against the 20 million emails. We had to use a team of 25 lawyers and archivists going through email by email, attachment by attachment (for 6 months) to parse out what was relevant and what was not and ultimately to produce a privilege log of entries. That task doesn’t scale. I have spoken about and written about this with George Paul in a law review article in The Richmond Journal of Law and Technology called Information Inflation – Can the Legal System Adapt? The effort doesn’t scale in this way. You certainly can’t have the time to do a search if you’re going up to much more than 20 million objects, if you’re at a level where you’re dealing with hundreds of millions. And we’re going to get to a billion emails at the National Archives. If you look at the figure – 200,000 hits from 20 million is 1%. That 1% figure (if you’re up at a billion objects) is 10 million. There just isn’t even time to do a manual search of 1% of what you’ve got after an automated search using keywords.
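
[Editor's sketch, not part of the recording: the scaling arithmetic above can be written out in a few lines of Python. The input figures are the ones quoted in the interview. The assumption that the 1% hit rate and the per-reviewer throughput stay constant as the collection grows is ours, made only for illustration.]

```python
# Figures quoted in the interview.
CORPUS_SIZE = 20_000_000   # Clinton-era Presidential emails searched
KEYWORD_HITS = 200_000     # emails that matched the keyword list
REVIEWERS = 25             # lawyers and archivists on the review team
REVIEW_MONTHS = 6          # length of the manual review


def projected_review(corpus):
    """Return (expected hits, reviewer-months) for a collection of `corpus` objects.

    ASSUMPTION (ours, for illustration): the hit rate and the number of hits
    one reviewer clears per month stay the same at every collection size.
    """
    hits = corpus * KEYWORD_HITS // CORPUS_SIZE
    reviewer_months = hits * REVIEWERS * REVIEW_MONTHS / KEYWORD_HITS
    return hits, reviewer_months


for size in (20_000_000, 100_000_000, 1_000_000_000):
    hits, effort = projected_review(size)
    print(f"{size:>13,} objects -> {hits:>10,} hits, {effort:,.0f} reviewer-months")
```

Under those assumptions, a billion objects would mean 10 million hits to review by hand, which is the point being made about manual review not scaling.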

It’s obvious that we’re at a tipping point in the profession. Those of us who are practicing in cases that have millions and tens of millions of electronic objects know that we’re going to have to figure out strategies for reducing the volume of ESI on the front end by (using) a lot of techniques that aren’t presently used, and then using something beyond keywords to really try to get the data set down to the relevant and highly relevant documents that are most useful in litigation. So, it was out of that experience with the Philip Morris case that I started asking questions.

I went (naturally) to The Sedona Conference – the leading legal think tank (and) you’ve had a number of people on your show from The Sedona Conference, including Richard Braman and others. I went to them (and) posed questions and of course, if one volunteers in life, you get tagged as “it” – so Richard asked me to think about putting out a commentary and leading an effort in that direction on search and retrieval. We did that. We pointed out the conditions where keyword searching is helpful but (where) there are also limitations. We (for the first time in the profession) set out in that document – the search commentary that Sedona issued in 2007 – a set of alternatives to just throwing a few words at a large data set. That means talking about Boolean operators and having negotiations over Boolean strings where you’re utilizing the power of a Boolean string for “ands” and “ors” and “and nots”, but beyond that, using fuzzy terms and using alternative search techniques that loosely can be talked about as concept searching. Statistical methods, language methods – the kind of search techniques that I would say many of the vendors (and) many of the legal service providers that are out there when you go to a Legal Tech Show know they have as a bundle of techniques in their technology. Lawyers generally are not well versed in asking the question of “What are useful tools out there to search against large data sets?” So, The Sedona Conference has put out a commentary that recognizes that there are alternatives to keyword searching. I think both the bench and the bar are more cognizant now, as we do this broadcast today, than they were a few years ago. There are a range of techniques out there. We know that Judge Grimm in Victor Stanley set out a recitation of alternatives to keyword searching and criticized the lawyers in that case for not really explaining their keyword search analysis. You see that in similar cases where judges are now familiar with a range of techniques and they’re suggesting to the profession that lawyers should at least ask some questions of themselves when they’re going into any relatively complex litigation about how they’re going to search the evidence and document the process.
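
[Editor's sketch, not part of the recording: a minimal Python illustration of what a negotiated Boolean string with “ands”, “ors”, “and nots” and a fuzzy term might look like when applied to documents. The sample documents, the query, the `matches` helper and the 0.8 similarity threshold are all hypothetical. Fuzzy matching here uses the standard library's `difflib.SequenceMatcher`, which is one simple option and not the method of any particular e-discovery product.]

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lower-case a document and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def fuzzy_in(term, words, threshold=0.8):
    """True if any word is at least `threshold` similar to `term`."""
    return any(SequenceMatcher(None, term, w).ratio() >= threshold for w in words)


def matches(doc, fuzzy=(), any_of=(), none_of=()):
    """Evaluate: (every fuzzy term) AND (any of any_of) AND NOT (any of none_of)."""
    words = tokens(doc)
    return (
        all(fuzzy_in(t, words) for t in fuzzy)
        and (not any_of or any(t in words for t in any_of))
        and not any(t in words for t in none_of)
    )


# Hypothetical documents; d3 carries an OCR-style misspelling of "tobacco".
docs = {
    "d1": "Memo on tobacco advertising and nicotine levels",
    "d2": "Tar and feathering in colonial history",
    "d3": "Report on tobaco smoking among teenagers",
    "d4": "Tobacco farm subsidies: crop insurance policy",
}

# tobacco~ AND (smoking OR nicotine) AND NOT subsidies
hits = [
    doc_id
    for doc_id, text in docs.items()
    if matches(text, fuzzy=["tobacco"], any_of=["smoking", "nicotine"], none_of=["subsidies"])
]
print(hits)
```

In this toy example an exact-match search for “tobacco” would skip d3 because of the misspelling, while the fuzzy term picks it up, and the “and not” clause drops the subsidies memo.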

K: You’re making me feel real good, Jason, that I’m focusing on legal analytics professionally now. Let’s talk about my previous life. The sort of things you’re going to talk about now used to make me cringe when I owned a legal staffing company with a couple hundred people out there on projects. You’ve cited a number of studies on search and retrieval problems with accuracy and cost. The first one that I’d like you to talk about is the Blair and Maron Study from 1985. What did this study reveal?

J: This is interesting because there just aren’t a whole lot of studies out there that test concepts of accuracy in a legal domain. The Blair and Maron Study was conducted in 1985 by David Blair and M.E. Maron. The case involved a San Francisco Bay Area Rapid Transit (BART) accident in which a computerized BART train failed to stop at the end of the line. The case as it was litigated had about 40,000 documents totaling about 350,000 pages in a discovery database. What the researchers did is that they asked the attorneys to estimate how well they had done. They estimated that they had found about 75% of relevant documents during the course of litigation. The research project revealed that the searches, using all sorts of different keywords against what was an IBM STAIRS database, had found only around 20% of the relevant documents, and so there was a tremendous gap between the perception of lawyers and what the research found. That study really was not replicated for 20 years until the work (that I’m now going to talk about with you) of the TREC Legal Track. There really is a kind of gap in research that I saw and wanted to do something about.

K: You can talk about TREC now, but I wanted to add my own 2 cents on this. I used to (after I sold my staffing company) go to national conferences and talk about the staffing side – the costs and doing it more efficiently and what you’re getting. I could never get on earlier than 4pm on the last day of the conference in front of a slate of speakers and everyone who had already left. This is obviously where most of the cost is, you know, the review. Let’s update the studies that you’ve been instrumental in. What have you been finding with TREC?

J: The TREC Legal Track is sort of an update on Blair and Maron for the age of e-discovery. What I had noticed was that when I had asked my question, “Are there better methods out there?” I didn’t get far. Because I’m a federal employee I went naturally to NIST – The National Institute of Standards and Technology – and I saw that they had been running this Text Retrieval Conference for 15 years, but they never invited lawyers to the party. My colleague, Doug Oard (Doug and I co-teach at Maryland) and I went and approached NIST with a proposal to do a Legal Track – to basically run a text retrieval project through a legal domain (kind of a simulated universe of legal discovery). So (in) the Legal Track for the last 3 years (and now we’re in our 4th year) what we have done is come up with a virtual universe of hypothetical complaints on a variety of subjects. Whether it’s products liability or shareholder actions, wrongful death, etc., we come up with 80 or more requests to produce that are just hypotheticals based on those complaints. I had my Sedona colleagues negotiate with each other to come up with consensus Boolean strings. One side would propose one set of terms for how to go about searching against a topic and the other side would come out with the rejoinder and they’d come up with the consensus. That would be the baseline Boolean search method where we then challenge the world’s academics and legal service providers to basically do better – to come in and use their own homegrown search algorithms (whatever they are) to go against a data set to see if they can do a better job in finding relevant documents with less inaccuracy and less noise. For the last 3 years we’ve used the Master Settlement Agreement Database (which is a tobacco database) of about 7 million OCR documents. This year we are (for the first time) using the public Enron dataset.
The last 3 years of the project we’ve had mostly academics, but last year we had a couple of legal service providers come in and participate in what’s known as an “interactive task” where we use an expert attorney as a feedback mechanism for them to ask questions of (in order) to tune their systems. By the end of the day, we have hypothetical topics, databases, etc. I get a corps of volunteers from lawyers, law students, legal assistants, etc. to do relevancy assessments each year, and then findings are reported on how all the different methods did. If you’d like, I can sort of hit the bottom line here…

K: Yeah…

J: What is paradoxical about the TREC Legal Track research to date is that, on the one hand, it has only emerged in the third year that there are certain methods that on a one-to-one kind of basis beat the Boolean baseline. It’s very difficult to construct alternative search methods that do a lot better than a well-constructed Boolean string. On the other hand, in the slides that I have put together in my public talks and writings on this, it appears that as much as 78% of relevant documents are left on the table if you only do Boolean strings. Across all the topics that we’ve run for 3 years, on some topics the Boolean strings that lawyers routinely run do very well. They get almost all of the relevant documents. But on some topics they get 0%. On average, it’s only about 22% of the relevant documents that have been found across all of the topics and methods being run. What that (to me) suggests is that there is dark matter in the universe. There are a lot of relevant documents that are just left behind, and we need to be thinking about a fusion approach – what Gartner has called a “Cocktail Approach” – of using search methods to really grab at getting a richer set of relevant documents, hopefully while trying to trim costs and be more efficient about the process. This is only (I say this with some trepidation) year four of the project. I thought when I got into it that it would be a one-year deal and I’d be out of it. All of the scientists that I worked with laughed at me. We are in year four (and) have an Open Letter to the legal profession that was just signed on Sedona Conference stationery. It’s up on the Legal Track site and it will be up on The Sedona Conference site, signed by Richard Braman and Ken Withers, myself and Ellen Voorhees, who runs the NIST Track generally. We’re urging legal service providers in year four (given that we have the Enron datasets to work with) to come in and work with us and be part of the research (by) participating in the ongoing program.
All lawyers will benefit from a broader participation in this project. Ultimately, the goal is to try to find ways to more efficiently search for electronic evidence and reduce costs.
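
[Editor's sketch, not part of the recording: a toy Python illustration of why a fusion or “cocktail” approach can pick up relevant documents that a Boolean string alone leaves behind. The document IDs and the three result sets are invented for illustration. This is not the TREC Legal Track's evaluation methodology, and the numbers are not its results.]

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that a method actually retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical ground truth: documents 1 through 10 are the relevant ones.
relevant = set(range(1, 11))

# Hypothetical result sets from three different search methods.
boolean_hits = {1, 2, 11}
concept_hits = {2, 3, 4, 12}
fuzzy_hits = {1, 5}

# Fusion in its simplest form: take the union of what every method found.
fused_hits = boolean_hits | concept_hits | fuzzy_hits

print(f"Boolean only: recall {recall(boolean_hits, relevant):.0%}")
print(f"Fused:        recall {recall(fused_hits, relevant):.0%}")
```

A plain union is the bluntest form of fusion and also pools each method's noise, so a real workflow would pair it with ranking, sampling or review to keep the added hits manageable.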

K: It would seem to me that there’s a big part driving the lack of people jumping in. It’s fear. I think you’ve pointed out (that) it’s a cocktail approach. The fear that maybe your technology won’t look as good because you won’t build the right process to go with the technology and all of a sudden, you get tagged with bad technology. It seems like it’s just pretty darn hard to do this extremely well just with technology or just with people.

J: I agree completely that the entire process here is not just using tools, but using a process that works as part of these tasks. I will say this – Legal Track is not U.S. News & World Report. There is not a ranking system – in fact there’s a prohibition on advertising and marketing. What someone who steps up as a legal service provider to be part of this project is saying to the world is that they’re willing to participate in a neutral evaluation, a government-run project, etc. that is attempting to do something of value for the profession. I think I would think well of an organization that is willing to participate. There won’t be a kind of…there shouldn’t be a level of anxiety about participation. Having said that, we will be looking for additional organizations to come in this year. We have about a month before the starting gun that will be sometime after Memorial Day (maybe) right at the beginning of June when we’re really running this Track for this year and we’ll go on. It won’t be the last year of the Track, I don’t think. I think the research is so important for the profession. Reducing costs by better searches…this kind of evaluation program should go on in whatever form, whether it’s TREC or something else.

K: I agree. Why do you think information retrieval’s so hard in general? Is it the cocktail and this need to sort of project manage different skill sets?

J: I think there are a number of reasons. The primary reason is that information retrieval is a very difficult domain to make a lot of progress in. That’s because of the ambiguity of language. Every keyword that you can mention has a variety of meanings and there are many ways of describing any particular search. There are just problems with language. At the National Archives, George Bush is a fundamentally ambiguous term (depending on if) we are talking about Bush 43’s or Bush 41’s Presidency. Every term that you can come in with has ambiguity and what’s known as “synonymy”. If you want to search for ambassadors, you forget about searching for “diplomats”, “consuls”, “officials” and other ways of describing the same concept. You need a really good, smart approach to keyword analysis. This is what lawyers don’t do because we are not trained in scientific terms or in thinking about best practices in terms of processes. That’s why I’m so heartened to see that Sedona’s coming out with a commentary that builds on the search commentary that we did a couple of years ago that did touch on process. The new commentary is called Achieving Quality in the E-Discovery Process. Part of it is about project management and putting together a team led by an attorney that really will be grappling with these issues on the front end, to think through problems not only about search but about the entire e-discovery lifecycle. In addition, that commentary that’s going to be out imminently will be talking about metrics and sampling and other analytics that I think are important and build on the search commentary that Sedona’s done. We’re not very good at this stuff. I know that I’m an English Major and Political Science Major. We’re at some level faking it, but lawyers are by and large pretty smart people who, when they’re tasked to understand a problem, really get into it. In trial lawyering – my experience is I’ve had to learn something about such topics as the nuclear regulatory industry.
I learned it fast and I could cross-examine people or cross-examine them on statistics. Similarly, I think with respect to e-discovery lawyers, this is not so impossible; this is not quantum physics. The concepts of recall, precision, and information retrieval that are set out in The Sedona commentary are the kind of things lawyers should have a baseline knowledge of. I would urge people listening to this to go to the Sedona commentary – it’s free on the Sedona website. Take a look at the appendix for the different alternatives to search methods. Be up on the cases where judges (like Judge Facciola, Judge Peck recently in the William A. Gross Construction opinion, and Judge Grimm) are laying down a gauntlet for the profession and saying, “Look, we all need to think through these search issues in a more strategic way, a more sophisticated way and be able to document our processes so they’re defensible. At the end of the day, if we do that and if we step back and think about this in a holistic way, we’ll all be better off.” Richard Braman’s Sedona Cooperation Proclamation has a key plank in it about negotiating search protocols. That’s what we’ve been doing in TREC for the last 3 years and that is what is being urged by judges and Sedona – that we all have a more cooperative attitude going into Meet and Confers so that you can think through the search process in a way that is more transparent than lawyers have been used to in the past. I believe (this) will lead to more fruitful results and, for the bottom line, save costs for clients.
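
[Editor's sketch, not part of the recording: the synonymy problem described above can be illustrated with a small query-expansion example in Python. The synonym map reuses the terms mentioned in the interview, but the map itself, the `expand` and `search` helpers, and the sample documents are hypothetical. A real matter would build its term list with people who know the data set.]

```python
import re

# Hypothetical, hand-built synonym map using the terms from the interview.
SYNONYMS = {
    "ambassador": {"diplomat", "consul", "official"},
}


def words(text):
    """Lower-case a document and split it into a set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def expand(terms):
    """Return the query terms plus any synonyms listed for them."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def search(docs, terms):
    """Return the indexes of documents containing at least one query term."""
    return [i for i, doc in enumerate(docs) if words(doc) & set(terms)]


docs = [
    "The ambassador arrived in Paris on Monday",
    "A senior diplomat met the trade delegation",
    "Budget memo for the next fiscal year",
]

plain = search(docs, {"ambassador"})
broad = search(docs, expand({"ambassador"}))
print(plain, broad)
```

The single keyword misses the second document, while the expanded query finds it. A broad synonym such as “official” would also pull in more noise on a real collection, which is the recall and precision trade-off discussed later in the show.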

K: When we talk about this topic (and I’m thrilled that we’re able to dedicate a whole show on this), I think about all the conferences that I’ve been to. I’ve put on a conference every 3 years with a couple of lawyers from Pittsburgh on bests of legal technology and people come in that are in certain offices or who are litigators who want to learn more about this. They’re looking at a room full of vendors and they’re listening to all of these general cases and there’s very little education on how to actually attack this (the nuts and bolts). Even if you walk into a room of vendors, it would be nice if you had an EDRM model of what each device maybe does in order. When you go to a car dealership, you know it’s a Hyundai, a Buick, or a Cadillac. You have branding and a sense of different standards for vehicles. It’s really hard…

J: This is the question that’s animated me for a number of years: (Why are there) no Consumer Reports red dots or black dots for buying products that will do the kind of jobs we’re talking about – search and retrieval? TREC is a means towards that end – it’s not a complete answer. I support lawyers asking questions – technical questions. You have to be grounded enough to do that: to ask hard questions if you’re at a Legal Tech Show, or if you’re engaged in a bake-off or some sort of demo with vendors coming in, to look at their services and see exactly what it is that you’ll be getting and what kind of benchmarking or standards or objective analysis they have done of their own products and services. Sometimes I’m not invited back by certain forums that I speak at because I’m challenging the host and the sponsor to do exactly that. I’m going to continue on my mission to try to urge that lawyers feel more comfortable with and are better educated on these kinds of technical subjects. I’m so heartened that the Sedona Conference has taken this up and has devoted time, resources and energy to putting out really good papers that are analytical in nature (and) that address the analytics that are going on. I’m also very excited about the future. I think there are solutions that can actually reduce costs of e-discovery (so) that it’s not all Malthusian gloom and doom, that costs will only spiral and increase in this area. I think there are changes in technology that have the power of reducing overall costs across the discovery process. That’s really what I think clients want to hear and what motivates me to continue the research that I’m involved with.

K: Let’s go briefly into one other area of the problem with finding information. It’s the battle with recall and precision. If you want to spend just a minute describing the basic conflicting battle that you have pulling the information…

J: Well, I think these are terms that lawyers should be generally familiar with. The Sedona Conference’s publication explains them at length. Recall is a measure of completeness. If you have 50 documents that are really relevant in a large data set and you only find 25 of them, you’ve got 50% recall. Precision is a measure of accuracy. If you have pulled out 100 hits and only 20 of them are relevant and 80 of them are noise, then you only have a 20% level of precision. Recall and precision tend to be inversely related to each other. What you really want to do is have high recall and high precision. You want to find all the relevant documents you can in a large collection and you also want to have a minimum of noise. You don’t want to go through and be inefficient with your processes. That’s the problem I have with keyword searching that isn’t really well thought out or negotiated. If you just throw out a few keywords in a large data set, you’re going to get a tremendous amount of noise and you’re not going to get all the relevant documents, especially if you only spell the keywords correctly. For anyone who’s listening who has a keyword issue in their practice and wants to put (whatever the term is), for instance, “tobacco” in a data set: if you don’t purposely misspell the word tobacco and find out what the data set has in terms of OCR errors or anything else, you’re missing potentially relevant documents. These are things I’ve learned along the way. I think there are lots of concepts that are out there in the world of information retrieval that would be very useful to understand. I think the Sedona commentary is of help here.
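
[Editor's sketch, not part of the recording: the two measures defined above, written as Python functions and applied to the numbers used in the interview. The function and parameter names are ours.]

```python
def recall(relevant_found, relevant_total):
    """Completeness: the share of all relevant documents that the search found."""
    return relevant_found / relevant_total


def precision(relevant_found, retrieved_total):
    """Accuracy: the share of the retrieved hits that are actually relevant."""
    return relevant_found / retrieved_total


# 50 relevant documents exist in the collection and the search finds 25 of them.
print(f"recall    = {recall(25, 50):.0%}")

# 100 hits come back; 20 are relevant and 80 are noise.
print(f"precision = {precision(20, 100):.0%}")
```

Note that recall can only be computed if you know, or can estimate by sampling, how many relevant documents exist in the whole collection, which is one reason sampling comes up alongside these measures.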

K: If you’re trying to potentially…I’m trying to wrap this thing up here, if you were attacking this problem of trying to find the right information on a project (you’ve got your data set that you want to attack), any sort of final thoughts or thoughts you want to reinforce or tips for people?

J: I think of the lonely attorney listening to this…I think he or she has been dumped on by some Senior Partner type and is all alone trying to dream up words. That’s the wrong paradigm. You need to reach out and have an interdisciplinary team – people who understand the corporate data set and the corporate culture, either your own or the other side’s. Figure it out with IT people; figure it out with people who are knowledgeable in other disciplines. Figure out how to attack the search problem. I think the days are over where just one attorney is assigned to dream up a bag of words to go after a large data set – you’re just going to miss out. The Sedona commentary on Achieving Quality is aimed at a better project management perspective on this. I urge people to think in new and creative ways about the problem of searching for evidence rather than just relying on techniques that they inherited from law school and knowing only about Lexis and Westlaw.

K: Great. We’ve mentioned The Sedona Conference as a place to find additional information on this topic. Any other advice on where to look?

J: Well, I would be delighted to have both participants in the TREC Legal Track as well as volunteers throughout the year for the assessment phase. Thanks to Ralph Losey and his marketing of the TREC project on his blog (which I so much appreciate) and others, we have, I think, on the order of 20-25 volunteers so far. To go to the Legal TREC homepage, all you have to do is type in your favorite browser “trec|2009|legal|track” and it will come up. There will be an overview paper and an open letter to the profession and we will populate it with more documents as we go on this year. I would be delighted for people to take a look at that. The ABA Law Journal had an article about TREC. It’s online in the April issue – an article by Jason Krause called In Search of the Perfect Search. Gartner has just done a small paper that talks about the Legal Track, and the project is gaining some attention. I’m always happy to take calls or emails from people who want to know what we’re doing and be part of the research effort. I trust that you will have my email as part of this broadcast – .

K: Yes. We will have that all up in the description. That’s basically it. Jason, thank you so much for joining us on this show on search and retrieval. I really appreciate it. You’re someone who’s inspired my career path direction and hopefully you’ll inspire many more people with the good work you’ve been doing.

J: Well thanks very much, Karl.

K: You’ve been listening to ESI Bytes. My name’s Karl Schieneman and I’m with JurInnov. Remember to come to ESI Bytes to learn about electronically stored information before electronic discovery “bytes” you back. If you want to hear more podcasts, there’s a whole library of shows located at our website: . Thanks again and take care everyone.

Recorded 04/28/2009

