Prospective search using Python

Prospective search, or persistent search, is a less common way of implementing search: the list of queries (keywords) is defined up front, and when given a single document, the engine determines which of those queries apply to it. This is different from traditional (or "retrospective") search, where many documents are stored in an index and, when given a search term, the search engine returns the list of documents that best match the query.

The best real-world example would be how Google News Alerts (or, IMHO, categorization/clustering in Google News) works. When Google finds a new news story, it makes more sense to run a prospective search on that story to find which alert subscriptions (or news categories) it belongs to, rather than repeatedly running every alert against the entire index.

Lucene has a MemoryIndex class for just this purpose, and I've made a simple implementation in Python using PyLucene. MemoryIndex is a special class in Lucene for on-the-fly searching. It can contain only one document, which may have more than one field. This is ideal for prospective search. Installation and setup of PyLucene are out of scope for this post... RTFM! (Do note that you need to edit the Makefile.)
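import lucene

# Start the JVM once per process instead of on every call.
lucene.initVM(lucene.CLASSPATH)

def ProspectiveSearch(body, terms):
    """Return the queries in `terms` that match the single document `body`."""
    analyzer = lucene.StandardAnalyzer()
    # MemoryIndex holds exactly one document; index it under the "content" field.
    index = lucene.MemoryIndex()
    index.addField("content", body, analyzer)
    parser = lucene.QueryParser("content", analyzer)
    matches = []
    for term in terms:
        # search() returns a relevance score; a score above 0 means the query matched.
        score = index.search(parser.parse(term))
        if score > 0:
            matches.append(term)
    return matches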
Sample usage:
body = "hi my name is sajal kayan"
terms = ["sajal", "good", "boy", "name", "sajal AND NOT kayan", "sajal AND kayan"]
matches = ProspectiveSearch(body, terms)
In this case it returns ['sajal', 'name', 'sajal AND kayan'].

Note: initVM() is giving problems on mod_wsgi.

On my computer, this is the benchmark I noticed for a 244-word content.

If you know a better method to achieve prospective search in Python, do let me know. I would also be interested to know if any RPC-based search software does this.
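To make the idea itself concrete without a JVM, here is a minimal pure-Python sketch of prospective matching. It is not what Lucene does internally and not a replacement for the PyLucene version: `QueryRegistry`, `tokenize` and the tiny "word AND word AND NOT word" syntax are all hypothetical names and assumptions made up for this illustration. There is no stemming, no stop-word handling, no scoring and no OR. The point is only to show the inversion: the queries are stored first, and each incoming document is checked against them.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class QueryRegistry:
    """Hypothetical helper: stores queries up front, then matches one document at a time."""

    def __init__(self):
        self.queries = {}                 # query string -> (required words, excluded words)
        self.by_term = defaultdict(list)  # required word -> queries that need it

    def add(self, query):
        # Assumed mini-syntax: parts joined by AND; a part starting with NOT is excluded.
        required, excluded = set(), set()
        for part in re.split(r"\s+AND\s+", query.strip()):
            if part.startswith("NOT "):
                excluded |= tokenize(part[4:])
            else:
                required |= tokenize(part)
        self.queries[query] = (required, excluded)
        for word in required:
            self.by_term[word].append(query)

    def match(self, body):
        words = tokenize(body)
        # Only queries sharing a required word with the document can match,
        # so a query made of NOT parts alone never matches in this sketch.
        candidates = {q for w in words for q in self.by_term.get(w, ())}
        return [
            q for q, (required, excluded) in self.queries.items()
            if q in candidates and required <= words and not (excluded & words)
        ]


registry = QueryRegistry()
for q in ["sajal", "good", "boy", "name", "sajal AND NOT kayan", "sajal AND kayan"]:
    registry.add(q)

matches = registry.match("hi my name is sajal kayan")
print(matches)
```

For the sample body and terms used above, this gives the same three matches as the Lucene version. The `by_term` lookup is there so that a document only touches queries that share a word with it, which is what would matter once the number of stored queries gets large.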
Tags: Python search
Categories: Python