Prospective search using python
Wed, Jul 22, 2009
Vote on HN
Prospective search, or
persistent search, is a relatively less common method of implementing search where the list of keywords is defined, and when provided a single document it determines the list of keywords applicable to it.
This is different from traditional (or "retrospective") search, where many documents are stored into an indexed and when provided with a search term, the search engine returns the list of documents which best match the query.
The best real world examples would be how Google News Alerts(or IMHO categorization/clustering in Google News) works. When a new news story is found by Google, it makes more sense to run a prospective search on the news story to find which alert subscriptions (or news category) it belongs to, rather than searching for all the alerts repeatedly on their entire index.
Lucene has a
MemoryIndex class for just this purpose, ive made a simple implementation in python using
pylucene. MemoryIndex is a special class in lucene for on-the-fly searching. It can contain only one doccument which may have more than one field. This is ideal for prospective search.
Installation and setup of pylucene is out of scope of this post...
RTFM! (do note u need to edit the MakeFile)
import sys, os, lucene, time, threading
def ProspectiveSearch(body, terms):
lucene.initVM(lucene.CLASSPATH)
index = lucene.MemoryIndex()
index.addField("content", body, lucene.StandardAnalyzer())
parser = lucene.QueryParser("content", lucene.StandardAnalyzer())
matches = []
for term in terms:
score=index.search(parser.parse(term))
if score > 0:
matches += [term]
return matches
sample usage :-
body = "hi my name is sajal kayan"
terms = ["sajal", "good", "boy", "name", "sajal AND NOT kayan", "sajal AND kayan"]
matches = ProspectiveSearch(body, terms)
In this case returns ['sajal', 'name', 'sajal AND kayan']
Note:initVM() is
giving problems on mod_wsgi
On my computer, this is the benchmark i noticed for a 244 word content.
- 1,492 queries : 0.79 seconds (for whole script only 248ms for the search loop)
- 14,920 queries : 1.519 seconds
- 74,600 queries : 3.425 seconds
- 149,200 queries : 5.552 seconds
- 298,400 queries : 10.328 seconds
If you know a better method to achieve prospective search in python do let me know. Would also be interested to know if any RPC based search software does this.