What Is Concept Searching?

Listen to Herb Roitblat Ph.D., OrcaTec LLC EDI Chairman & Principal and former owner and creator of Dolphin Search, discuss search and retrieval issues in electronic discovery. Dr. Herb Roitblat is the primary inventor of the core DolphinSearch technology (patent No. 6,189,002). Herb led the design of the DolphinSearch review tools, DIAD, and ComplianSeek, and was part of the team that brought concept searching and native file review to the eDiscovery industry. Herb is a recognized expert in cognitive science, information management, data mining, statistics, and eDiscovery processes. He is a frequent author and speaker on how data mining and technology can ease the burden of eDiscovery. Herb is also a member of The Sedona Working Group on Electronic Document Retention and Production.

We will focus this podcast on what exactly concept searching is. Lots of tools say they are concept searching. What does this mean? What are the ranges in concept searching? How do you effectively do concept searching? How important is process to the entire search and retrieval effort? How do you build a “cocktail approach” to designing the right electronic discovery process? These and other issues will be addressed by Dr. Herb Roitblat.

Karl Schieneman-Interviewer

Herb Roitblat-Guest

K: Hello everyone. Welcome to another edition of ESI Bytes. This is Karl Schieneman, Director of Legal Analytics and Review at JurInnov. I’m real excited about today’s show as it’s an area that I work in. We’re going to talk about concept searching and what people mean when they say, “concept searching.” It means so many different things (depending on who’s saying it). We have with us one of the pioneers of the concept searching and search engine fields – Herb Roitblat, or Dr. Herb Roitblat we could say. He’s a PhD and co-founder and principal at OrcaTec LLC. Before starting at OrcaTec, Herb was also the Executive Vice President and Chief Scientist (and co-founder) of Dolphin Search – one of the early search engines in the field. Herb led the design of the Dolphin Search review tools. He’s part of the team that brought concept searching and native file review to the e-discovery industry. Herb’s recognized as an expert in cognitive science, information management, data mining, statistics and e-discovery processes. He’s been writing for years about data mining and how technology can ease the burden of e-discovery. Herb, thanks for joining us on the show.

H: Thanks for having me. (It’s) a great pleasure.

K: Let’s start off…I always like to ask everyone on the show…how did you first become interested in electronic discovery?

H: It kind of happened by accident. We were trying to use some technologies for doing knowledge management and ended up finding out that lawyers needed to do discovery more than they needed to do knowledge management. We had the technology to do it, so we responded to those two things. At the time it seemed easy and then we learned what it really was all about. It hasn’t been easy for at least 10 years.

K: Okay. Let’s dive into the topic here. We’ve heard a lot about concepts searching over the past few years in electronic discovery. Help us out here (and help the listeners) – what is concept searching?

H: Basically, concept searching is using meaning to help find responsive documents. There are a number of approaches to using that meaning, but they all revolve around the same idea. Instead of just looking for strings of letters as words, rather, let’s look for words as meaningful things. We can identify what the words mean using a number of different tools that we can talk about in a bit. Once we identify that meaning we should be a whole lot better at identifying what the documents are about and which documents are responsive and which ones aren’t.

K: Is the term, “concept searching” overused at this point? Are there different people attaching different meanings to it?

H: There are somewhat different meanings attached to “concept searching”. I don’t know that it’s necessarily being overused so much as there are a variety of tools you can use to get at concept searching. For example, you could use a thesaurus. You’re familiar from your intermediate school days (maybe high school) with using Roget’s Thesaurus – other ways of saying things. In fact, people are very creative in how they say things. Your job as a searcher is to undo that creativity in the sense of trying to figure out how they could have said something and how you can find it afterwards. You could also use a taxonomy. A taxonomy is a hierarchical list of categories. In a taxonomy, if you’re interested in, say, cars – searching for the word “cars” – you might also be interested in documents that are in supersets of the category “cars”, (such as) documents that talk about vehicles. If you’re interested in various things, you can move up and down the hierarchy and find things that name something at either a higher or lower level. A third kind of system for doing concept search involves an ontology. An ontology is like a taxonomy in that it points to things that are related to one another, but it’s different in that it isn’t required to be just hierarchical. You can talk about things that are associated; for example, “lawyer” and “attorney” are synonyms. You’d find that in a thesaurus, but they’re also related words for a legal professional of some sort. The legal professional also might be called other things. There are other words that are associated with “lawyers”, such as “judge” and “case” and “matter”. You might be interested in documents that talk about those words when you search for a particular word like “lawyer”. There’s yet another approach to concept searching – this is the one that I’ve tended to follow, and that’s a machine learning kind of approach.
Rather than having somebody sit down and explicitly design a taxonomy or an ontology, you can let the documents tell you what words are related. In this we follow, say, the philosopher Wittgenstein, who argued that the meaning of a word is its use in the language – that’s pretty much right. Back to our “lawyer” example, any document that has the word “lawyer” in it is also likely to have things like “Esq.” and “judge” and “case” and “matter”. Conversely, documents that talk about “judge” and “case” and “matter” are likely to be about (the word) “lawyer”, whether “lawyer” appears in them or not. All of these different approaches try to use meaning to help get at what you’re searching for. The way they do it is essentially query expansion. So if you search for “lawyer”, you can use any of these approaches. What’s going to happen behind the scenes (sometimes where you can see it and sometimes where you can’t) is going to be a search for “lawyer” + “judge” + “matter” + whatever other words your system tells you are associated. What that’s going to do is bring back documents that focus on the meaning of the word at the very top of your list, and it’s going to find documents that you wouldn’t have otherwise thought of. It’s going to search for these other words in context, and use that context to find the documents that you might not know to look for (even if they don’t have that particular word in them).
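
The query expansion Herb describes can be sketched in a few lines of Python. This is only a minimal illustration, assuming a toy corpus where each document is a set of words and "association" means nothing more than appearing in the same document; the corpus, the function names expand_query and search, and the top_n cutoff are all hypothetical and are not a description of how DolphinSearch or any other product works.

```python
from collections import Counter

# Toy corpus (hypothetical): each document is reduced to a set of lowercase words.
docs = [
    {"lawyer", "judge", "case", "matter"},
    {"lawyer", "esq", "case"},
    {"judge", "case", "matter", "ruling"},
    {"invoice", "shipping", "pallet"},
    {"lawyer", "judge", "invoice"},
]

def expand_query(term, docs, top_n=3):
    """Return the term plus the top_n words that most often share a document with it."""
    counts = Counter()
    for words in docs:
        if term in words:
            counts.update(words - {term})
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term] + [word for word, _ in ranked[:top_n]]

def search(terms, docs):
    """Rank documents by how many of the query terms they contain."""
    wanted = set(terms)
    scored = [(len(words & wanted), doc_id) for doc_id, words in enumerate(docs)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]

expanded = expand_query("lawyer", docs)   # the expanded query
hits = search(expanded, docs)             # ranked document ids
print(expanded)
print(hits)
```

In this toy corpus, document 2 never mentions "lawyer", yet the expanded query still retrieves it because it shares "judge" and "case" with documents that do.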

K: Okay. Typically, at least the industry standard for the last…ever since people started searching for electronic discovery has been word searches. You agree on your word terms and then go out and search. Do you think concept searching improves recall rates on searching for ESI more than word searches?

H: It certainly does. It certainly improves the recall rate, where recall is your completeness. Have you found all the documents that are responsive? Well, you have found more of them because you’ve expanded your query. You’ve searched for “lawyer” and you’ve searched for documents that have other words in them. It’s going to improve the number of responsive documents that you find. It also has the capability of increasing precision (at least at the top of the list). In this sense, precision is your accuracy – the percentage of retrieved documents that are responsive. It will improve that because it adds the context for your query, so you’re not only looking for the particular word (but) you’re searching for its context. It’s going to be more tightly related to what you’re actually interested in, or at least what the document collection is actually about when you talk about that word. It can improve the quality of recall. It’s not so much a choice between word search and concept search, in that concept search is basically a form of word search; it’s just a form that includes more information than the specific searcher usually thinks of by him/herself.
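
Recall and precision as Herb defines them here reduce to two ratios over sets of document ids. The sketch below uses made-up id sets purely for illustration; the numbers are assumptions, not results from any study mentioned in the conversation.

```python
def recall(retrieved, responsive):
    """Completeness: share of all responsive documents that were retrieved."""
    return len(retrieved & responsive) / len(responsive)

def precision(retrieved, responsive):
    """Accuracy: share of retrieved documents that are responsive."""
    return len(retrieved & responsive) / len(retrieved)

# Hypothetical ground truth and two hypothetical result sets.
responsive = {1, 2, 3, 4, 5}
keyword_hits = {1, 9}              # a narrow keyword search
expanded_hits = {1, 2, 3, 9, 10}   # the same search after query expansion

keyword_scores = (recall(keyword_hits, responsive), precision(keyword_hits, responsive))
expanded_scores = (recall(expanded_hits, responsive), precision(expanded_hits, responsive))
print(keyword_scores)
print(expanded_scores)
```

In this invented example the expanded search finds more of the responsive documents, so recall rises, while precision depends on how much unrelated material the extra terms pull in.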

K: That was actually one of the points Jason Baron made on the show we did on search and retrieval – don’t put one person in a room and tell them to think of all the words they can think of (because) it doesn’t work very well.

H: No. There’s the famous study by Blair and Maron that was done in 1985. It found that attorneys were only capable of guessing query words that would bring back 20% of the responsive documents out there – a 20% recall. It’s not the fault of the search system, because the search system found whatever it was asked to search for; rather, it was the difficulty of guessing the right words to search. The parties used different terms to talk about things. The same people who wrote about stuff (relative to the accident that they studied) used different words to talk about the same thing. The lawyers figured out 3 keywords for one thing and Blair and Maron (the guys who ran the study) found 27 more words that meant essentially the same thing. One side talks about the disaster (and) the other side talks about the unfortunate incident. Sometimes they mentioned people’s names in the document (and) sometimes they didn’t. It turns out to be a very formidable problem that they can’t figure out exactly what words to search for. That’s where concept search comes in; it helps you to figure out additional words to search for.

K: The Gartner Group did a recent analysis about discovery that talked about the “cocktail approach” of finding any electronic evidence. The analogy is treatments for various viruses where you try different ingredients to treat different symptoms. Different cases require different treatments. Do you think this is the direction search and retrieval is heading (when) trying to come up with different “cocktails”?

H: I think that people would find it very nice to say, “Well here’s a quiver full of arrows where each arrow is a different search engine. Now when you do this, my friend, you don’t have to think about your search. Just run this set of tools and what comes out of it are going to be the documents you magically need in your case.” I think about that as sort of a Hellmann’s Mayonnaise jar theory of e-discovery. West of the Rockies it’s Best Foods, by the way. The idea is that you throw in a bunch of stuff and what you pour out are e-discovery documents (the responsive ones). That’s just not the case. All of these tools that we’re talking about are tools. They’re power tools. You can talk about how good it is to have an electric screwdriver. You can even talk about when one electric screwdriver is better than another electric screwdriver at screwing in screws, but to say you’re going to get a better building if you use electric screwdrivers (rather than) building it with a manual screwdriver? I don’t think that makes a whole lot of sense. So, the idea of a “cocktail” is one that’s appealing, but I think it’s (in essence) barking up the wrong tree; rather, what we want to know is how we go about using the tools that we have in a way that will give us the best results. And it’s (also) how we use the tools. The quality of the carpenter is a major part of what determines how good the building is, rather than the quality of the particular drill or the particular power saw that is being used. Certainly you could build more buildings (and probably better buildings) with power tools, but it still takes a good engineer and a good carpenter to build a good building that you’re going to be happy to live in. This “cocktail” notion – I think it’s a good idea. I think that people try to do too much all in one query, so if you mean by “cocktail” you have to use different kinds of queries to get the same information (then) yes.
I think that people try to do too much in their queries because they try to have one question that’s going to give them the definitive answer. Maybe what you need to do is do a bunch of smaller queries that are easily understood and more manageable and then combine the results. So (we) were talking about Jason Baron before, and Jason’s been very kind to the field in the sense that he’s been forthcoming about what he’s searched for and how he did it. He talks about the Philip Morris litigation where he did a query for some very complex things. One of the queries sounds kind of like this: “Master Settlement Agreement” or “MSA” and not “Medical Savings Account” or “Metropolitan Standard Area” or “Section 1415” or “ETS” and a bunch of other things. It goes on. The way I have it written out is probably 15 different lines of “ands” and “or’s” and “not’s” and combinations of things. It’s impossibly difficult to understand things like that. When I first saw a query like that I thought, “Oh, that can’t be right.” It took me about a half an hour to go through to see if just the grammar of the query was right. Then, does it give him the results that he wants? Well I don’t know; it’s hard. If you had broken it up into smaller pieces it would have been easier to deal with and easier to understand. What Jason did is he had this demand from Philip Morris to produce documents from the government archives. He went through and did a search based on the keywords that he could think of. He looked at that and thought that it brought back too many documents. The next thing that he did was (to) go through and take a sample of those documents and try to figure out how he could limit the number of documents that came back with each query. That’s where you get the “not’s”. He revised it based on the evidence he could get from doing the queries; he didn’t just willy nilly do it and say, “Okay. That’s it, I’m done.” Rather, it was an iterative process.
I think that’s really an important lesson to learn (no matter) what technology you’re using. Some technologies are better than others, but whatever technology you’re using (be sure to) analyze whatever it is (and) think about it. Can you do better? Can you get more information out of what you’re doing? Use a full process; don’t just throw it over the wall and hope for the best.
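
Herb’s suggestion to run several small queries and combine the results maps naturally onto set operations. The following is a rough sketch under stated assumptions: the four documents are invented, the helper matching is hypothetical, and plain substring matching stands in for a real search engine (it would, for instance, also match “msa” inside longer words).

```python
# Hypothetical mini-collection keyed by document id.
docs = {
    1: "the master settlement agreement was signed",
    2: "msa terms for tobacco companies",
    3: "open a medical savings account (msa) today",
    4: "quarterly sales report",
}

def matching(phrase, docs):
    """One small, easily understood query: ids of documents containing the phrase."""
    return {doc_id for doc_id, text in docs.items() if phrase in text.lower()}

# Each piece can be inspected or sampled on its own before being combined.
wanted = matching("master settlement agreement", docs) | matching("msa", docs)
noise = matching("medical savings account", docs)
result = wanted - noise

print(sorted(wanted))
print(sorted(noise))
print(sorted(result))
```

Keeping `wanted` and `noise` as separate named sets means each exclusion can be justified from a sample of what it removes, which is the iterative step described above.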

K: So if we were to go to the “cocktail” analogy (where) the mix might be one part process, one part technology blended to your satisfaction…we’re making a well-made drink here that we’re looking to drink, right?

H: I think it’s more than technology and process. It’s also evaluation. You not only have to…maybe that’s just part of the process, but you have to look at what you’ve done and evaluate whether you’ve done a good job or haven’t done a good job. I think that in e-discovery, we’ve been generally remiss in doing that kind of evaluation. In fact, I think the standards aren’t all that high. The TREC study that came out this year – the Text Retrieval Conference Legal Track (again we’re talking about Jason) – did something where they compared a small set of documents that had been classified in a previous year with documents that were classified this year, in 2008 (the year that they’re talking about). It turns out that of the documents that were in this test that were classified as responsive the first year, 58% of them were identified as responsive the second year. Only 58% of whatever number there was in this evaluation that were identified as responsive the first year, only 58% were identified as responsive the second year. Of those identified as non-responsive in the earlier year, 18% of them were identified as responsive in the second year. Overall, the agreement was around 73%. That’s remarkable in the sense that we kind of have this intuitive notion that if you put together a team of people out there who are reviewing documents they’re going to come up with a pretty accurate set of documents that are going to be responsive. That doesn’t seem to be the case. If you take 2 teams to review the same documents, they only agree around the low 70’s. The accuracy of all of this is not as high as one would like. Adding…
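
The three percentages Herb quotes fit together as simple arithmetic over a two-by-two table of first-year versus second-year decisions. The counts below are hypothetical, chosen only so that the ratios come out to 58%, 18% and 73%; they are not the actual TREC figures, and agreement_rates is an illustrative helper name.

```python
def agreement_rates(both_resp, resp_then_not, not_then_resp, both_not):
    """Summarize how two rounds of review agree, given the four cell counts."""
    first_resp = both_resp + resp_then_not   # responsive in the first review
    first_not = not_then_resp + both_not     # non-responsive in the first review
    total = first_resp + first_not
    return {
        "responsive_confirmed": both_resp / first_resp,
        "nonresponsive_flipped": not_then_resp / first_not,
        "overall_agreement": (both_resp + both_not) / total,
    }

# Hypothetical counts for 2,000 re-reviewed documents (NOT the TREC data).
rates = agreement_rates(both_resp=435, resp_then_not=315,
                        not_then_resp=225, both_not=1025)
for name, value in rates.items():
    print(f"{name}: {value:.0%}")
```

With these invented counts, 750 of the 2,000 documents were responsive in the first review; other splits would give a different overall agreement for the same 58% and 18%.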

K: That makes it really hard to evaluate this stuff. I remember reading in Blair and Maron that they spent about 4 million dollars and basically could have kept going, but they ran out of time and money.

H: My recollection is that they spent $100,000.00, but that was back in 1984 when they did an ’85 study and published it. Yeah, they ran out of money before they found all of the synonyms for one of the terms they were looking for (as they were going through it). This is a tiny, tiny set of documents by today’s standards. When we first started, a gigabyte was a huge collection. Now (it’s like), “Gigabyte, don’t bother me.”

K: And this is 40,000 documents – 350,000 pages.

H: Yeah, that’s what the Blair and Maron study was. Now one gigabyte is usually 50,000-70,000 pages. When we first started, lawyers would tell us, “We know we should be looking at the electronic stuff, but what we did was take a gigabyte of email and loaded it into our desktop Outlook and tooled through it the night before.” By today’s standards, that just won’t cut it.

K: So for evaluating, maybe one of the ways to approach this is to use a statistical tool while you’re on a project?

H: Well, I think that’s exactly right. You have to do some careful evaluation, sampling, etc. It’s not enough to be haphazard about this; you have to be systematic about it so that you can respond to questions in court about “How do you know that you did a reasonable job?” One example is the Victor Stanley v. Creative Pipe case. They said, “Oh yeah, we did this and we had these 70 search terms,” (and the response was), “Well what were those 70 search terms?” (The answer was), “Well I’m sorry but I can’t tell you.” The reason why they can’t tell you (I believe) is because they either didn’t keep track (and that’s being generous) or they weren’t really designed to do what they were supposed to do. That’s just my opinion about it.
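
One simple form of the sampling Herb mentions is to review a random sample of documents and estimate what share are responsive, with a margin of error. The sketch below is illustrative only: the collection size, the sample size of 400, the simulated "true" responsive set and the name estimate_rate are all assumptions, and it uses the basic normal-approximation interval rather than any method named in the conversation.

```python
import math
import random

def estimate_rate(sample_labels, z=1.96):
    """Estimate a proportion from reviewed sample labels (True = responsive).

    Returns (estimate, low, high) using a normal-approximation interval;
    z=1.96 corresponds to roughly 95% confidence.
    """
    n = len(sample_labels)
    p = sum(sample_labels) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical collection of 10,000 documents; in practice the labels would
# come from human reviewers, here they are simulated from a made-up truth set.
truly_responsive = set(range(0, 10_000, 20))
rng = random.Random(0)                      # fixed seed so the draw is repeatable
sample_ids = rng.sample(range(10_000), 400)
labels = [doc_id in truly_responsive for doc_id in sample_ids]

estimate, low, high = estimate_rate(labels)
print(f"estimated responsive rate: {estimate:.1%} (about {low:.1%} to {high:.1%})")
```

Recording the seed, the sample ids and the resulting interval gives a documented answer to “How do you know that you did a reasonable job?”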

K: You should listen in when we have Judge Grimm (who authored that opinion) on a show sometime in early June. It should be an interesting show. I think Tim Opsitnick (from our company at JurInnov) is going to be an additional guest on that one.

H: I’m sure it will be very interesting. He’s quite discreet, though (about his own decisions). He’s willing to talk about other judges’ positions but he’s quite discreet about his own.

K: It is an incredibly useful decision (and) I think it’s keeping it in the right direction. What do you say are the biggest barriers to getting wider acceptance of using better technology within a process and then evaluating what you’re doing? Is it lawyers, judges, corporate clients, bad technology, bad process? What do you think?

H: I think it’s history, in that the way many of the people involved in e-discovery these days (especially attorneys) learned their craft is with boxes of documents. That’s how they know how to do it and they want to do their best to make everything fit within the methodology that they learned in their adolescence, if you will (maybe that’s an exaggeration). People like to do the things they liked to do when they were in their late teens and 20’s. They learned how to do something then and that’s how they like to do it. At the same time, I think lawyers recognize that the volumes today are just too huge to keep on doing that sort of thing. I think you just can’t possibly do it (with) the amount of time and money it would take. It threatens the legal system when it costs more than many cases are worth, so what do you do? Well, you either find a better way of doing it or you settle. Settling isn’t a good long-term strategy because it encourages people to just take advantage of you if you don’t want to spend the money. I heard a story about a guy who used to sue AT&T every year for patent infringement. In order to avoid the cost of discovery, they would just settle with him. So the next year he’d be back again, until finally they decided that they weren’t going to just settle anymore. They actually did the discovery and I think he went away after that. Something’s got to get done or the squeaky wheels are going to run the justice system. So, what things are resisting adoption? I think it’s a comfort level. I think there’s a lot of fear that concept searching is a black box that you dump stuff in and stuff comes out. Back to my Hellmann’s Mayonnaise jar analogy – you dump stuff in (and) you dump stuff out and you have no idea what the causal relationship was between the two. I don’t think that it’s appropriate. I don’t think that concept searching is a black box nor is it a black hole; rather, it can be done in a very transparent way.
This is how we searched, this is what we searched for, this is what we found, this is why we think we found it, etc. Make it so that you can easily point to it and defend it. People ask, “Could concept search ever be defensible?” (The answer is) Yes. It can be because judges are asking for it. We talked about Judge Grimm (and) Judge Facciola – they’re both recommending concept searching. Then the third objection (I think) people have is, “Well, I already have too many documents. If we do a concept search I’m going to have even more.” The answer to that is you may have a few more, but you have an obligation and a need (aside from the obligation) to find the responsive documents. If you don’t find them, you’re opening yourself up for a lot of trouble later on and you may miss things that will be helpful. Not everything is going to turn out to be negative. Not everything is going to turn out to be expensive. The intelligent use of these power tools can help you to defend your case even as you meet your responsibilities for producing information for the other side. I think that it took a long time to get over the idea that you should print out every document before you review it. I think it took a long time to do searches and not just review every single document. We’ve been at this for 10 years, but I think it’s starting to take hold – the idea that you should be as intelligent as you can be and you should use as powerful a tool as you can manage. Do your analysis for e-discovery. I think that people are getting the notion that there’s value in using that tool and that we can recover that value and take advantage (a strategic advantage) by using good strong power tools.

K: The other thing, I guess, is you might recall more responsive documents this way. I think (if used appropriately) you still might review substantially fewer documents than if you looked at every single document (which is what a lot of people still do today).

H: Most certainly true.

K: There is an efficiency angle that could be here, too. I didn’t want that to get lost on people. Well Herb, thank you very much. This has been a really interesting topic that I don’t think is covered enough out there. We’ll keep doing it here at ESI Bytes. Where should people look if they’re interested in finding more information on search and retrieval?

H: There’s the TREC Conference – the Legal Track. If you Google search “TREC Legal” you’ll come across it. There’s The Electronic Discovery Institute that is doing research along these same lines in order to find out how comparable machine-assisted reviews are. Then there’s our OrcaTec website; we have lots of white papers that are useful at: www.orcatec.com . You can sign up for our newsletter on the website.

K: Well thanks again. This has been helpful and interesting. We look forward to having you on more shows here. Just to let everyone know, for more shows you can look at our complete library at www.esibytes.com . Just to tie it into the material, we use sort of a taxonomy approach to search and retrieval. If you want to find a show, I encourage you to look in the right-hand scroll bar and track the EDRM model by speaker name; a little bit of tagging, but it’s a little bit easier to do it this way (from my perspective). Just remember (that) the tagline here at ESI Bytes is: Listen to ESI Bytes before ESI and electronic discovery bites you back. Thanks again, everyone. I look forward to doing another show with you soon. Take care. This is Karl Schieneman from JurInnov.

Recorded 05/18/2009

