Freetext Search ≠ Meta Data Search ?

Gav reminded me the other day, when we were discussing the problem of searching, that freetext search does not equal metadata search (also known as field restricted search). Gav was right, of course, as it takes no stretch of the imagination to see that doing a (freetext) search, for documents whose content contains the words "John Doe", is not exactly the same as searching for documents whose Author is "John Doe" (metadata search). The problem with the former is that you may get results back with documents having "John Doe" the Author, as well as "John Doe" the Reviewer.

I claim, however, that metadata search and freetext search are computationally equivalent problems. That is, you can use a freetext search solution to solve the metadata search problem, and conversely, you can use a metadata search solution to solve the freetext search problem. In this post, though, I shall endeavour to prove the first part only, since that is the crux of the original point of discussion.

First, let's do a little bit of review in computational theory. We say that a problem A is reducible to problem B if we can apply a series of transformations that turn problem A into problem B, which, we hope, is less complex and more solveable than problem A. In effect, we're saying that if we could find a solution to problem B, then we would know a solution to problem A as well. It is a common approach that academic researchers like to use when dealing with computational complexity involving NP-complete problems, i.e. problems for which no solution has been found that can be computed in polynomial time. Researchers try to prove that a problem is NP-complete by trying to reduce it to another known NP-complete problem.

Well, I claim that the metadata search problem is reducible to freetext search. Or, in more layman's terms, you can satisfactorily simulate metadata searching using freetext searching capability. And below are the transformations that take the problem from the metadata search domain into the freetext search domain.

Let's say you want to index a PDF document with the metadata field: Invoice_Number = ABC100.
For this document, you provide a secondary file which will serve as the "index file" whose content contains some sort of encoding of the above name-value pair, in addition to the link to the real document. Something like:


<DocumentSummary>
<Invoice_Number>ABC100</Invoice_Number>
<A HREF="file:///C:/Accounting_Data/DOC00000018567.PDF">View Document</A>
<HASHCODE>51E88D049A06D7018D38740772FCAA0A<HASHCODE</CODE>
</DocumentSummary>


where 51E88D049A06D7018D38740772FCAA0A is the computed MD5 hash of the text <Invoice_Number>ABC100</Invoice_Number>.

You will then feed those two files to the freetext indexing engine.

Now then, since the MD5 hash is fairly unique, if I were to do a freetext search for the keyword 51E88D049A06D7018D38740772FCAA0A, it would be highly unlikely that I would get a mixed set of index file and real files in the search result, but rather I would only get a list of index files containing Invoice ABC100, which contained the link to the actual invoice PDF file itself.

Hence, we've proven that we can use a freetext search capability to do metadata search.
QED.

3 comments:

Anonymous said...

The key is "fairly unique". Would you feel comfortable using this with critical data? I wouldn't want my patient record being mistaken for some one elses in an emergency for example. Take a look at http://www.cits.rub.de/MD5Collisions/.

Another problem you would have with this method is trying to do searches based on criteria other than simply equality. For example, people often want to find things based on a date range.

Thanh Hai Tran said...

If MD5 is not good enough, then we use a better hash alorithm, like SHA2 for instance.

Concerning the problem of range search operations:

A bounded range search query can be reduced to equality search of discrete units with OR operators.
e.g. the query "GIVE ME ALL PATIENT RECORDS WHERE ADMITTED_DATE ≥ 'jan-01-2006' AND ADMITTED_DATE ≤ 'jan-31-2006'" can be reformulated as: "GIVE ME ALL PATIENT RECORDS WHERE ADMITTED_DATE='jan-01-2006' OR ADMITTED_DATE='jan-02-2006' OR ... OR ADMITTED_DATE='jan-31-2006'.

Unbounded range search can be reduced to iterative bounded searches, with paging.

It may not be pretty, but...it can be done.

Thanh Hai Tran said...

On some general and intuitive level, I'd tend to agree with you on the issue of speed, except:

* Redundancy: May be, and may be not. More likely, the metadata exists separately in another application and needs to be brought over using an integration tool such as AppConnector.
Even if the same data were already in the content, you couldn't really rely on the fulltext engine to OCR them with 100% accuracy, especially if the document was scanned in, with partially hand-written text.

* Efficiency: This could be dependent upon the implementation. How do you know that the fulltext data itself isn't stored in a
relational database? ;-)

P.S. Welcome to the blogosphere, Scott. Can't wait to read your blog. ;-)