Freetext Search ≠ Meta Data Search ?

Gav reminded me the other day, when we were discussing the problem of searching, that freetext search does not equal metadata search (also known as field restricted search). Gav was right, of course, as it takes no stretch of the imagination to see that doing a (freetext) search, for documents whose content contains the words "John Doe", is not exactly the same as searching for documents whose Author is "John Doe" (metadata search). The problem with the former is that you may get results back with documents having "John Doe" the Author, as well as "John Doe" the Reviewer.

I claim, however, that metadata search and freetext search are computationally equivalent problems. That is, you can use a freetext search solution to solve the metadata search problem, and conversely, you can use a metadata search solution to solve the freetext search problem. In this post, though, I shall endeavour to prove the first part only, since that is the crux of the original point of discussion.

First, let's do a little bit of review in computational theory. We say that a problem A is reducible to problem B if we can apply a series of transformations that turn problem A into problem B, which, we hope, is less complex and more solveable than problem A. In effect, we're saying that if we could find a solution to problem B, then we would know a solution to problem A as well. It is a common approach that academic researchers like to use when dealing with computational complexity involving NP-complete problems, i.e. problems for which no solution has been found that can be computed in polynomial time. Researchers try to prove that a problem is NP-complete by trying to reduce it to another known NP-complete problem.

Well, I claim that the metadata search problem is reducible to freetext search. Or, in more layman's terms, you can satisfactorily simulate metadata searching using freetext searching capability. And below are the transformations that take the problem from the metadata search domain into the freetext search domain.

Let's say you want to index a PDF document with the metadata field: Invoice_Number = ABC100.
For this document, you provide a secondary file which will serve as the "index file" whose content contains some sort of encoding of the above name-value pair, in addition to the link to the real document. Something like:


<DocumentSummary>
<Invoice_Number>ABC100</Invoice_Number>
<A HREF="file:///C:/Accounting_Data/DOC00000018567.PDF">View Document</A>
<HASHCODE>51E88D049A06D7018D38740772FCAA0A<HASHCODE</CODE>
</DocumentSummary>


where 51E88D049A06D7018D38740772FCAA0A is the computed MD5 hash of the text <Invoice_Number>ABC100</Invoice_Number>.

You will then feed those two files to the freetext indexing engine.

Now then, since the MD5 hash is fairly unique, if I were to do a freetext search for the keyword 51E88D049A06D7018D38740772FCAA0A, it would be highly unlikely that I would get a mixed set of index file and real files in the search result, but rather I would only get a list of index files containing Invoice ABC100, which contained the link to the actual invoice PDF file itself.

Hence, we've proven that we can use a freetext search capability to do metadata search.
QED.

GMail ATOM feeds

Hmm...It seems that Google has been providing ATOM feeds into your GMail inbox for a while now. The question is, then, whether I should trust either of Bloglines or NewsGator enough to give them my GMail ID and password, so that they can fetch the mails for me. :-)