Full-Text Search in CouchDB Using... CouchDB
I have a project I am working on where I want to be able to use full-text search on CouchDB, and spent way too much time today looking into the available options. Along the more practical route there are the CouchDB-Lucene and CouchDB-Solr projects, but I was pretty determined to get full-text search working without resorting to any external projects.
It sort of happened.
Approach one: one-word limitation.
So I'd probably be publicly stoned for calling this "full-text search", but it does let you retrieve all documents that contain a given word. (Although it does take a while to build the initial index if you have a large database: it took something like 40 minutes for my sad MacBook to build the index for 60k documents which contain a total of 2.3 million words. After the index is created, though, the retrievals are as quick as they are useless.)
For example, let's say you want a view named word that looks up words in either the document's title or desc attributes. This view function does the trick:
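    function(doc) {
      // The space keeps the last word of the title from fusing with the
      // first word of the description.
      var txt = doc.title + " " + doc.desc;
      // Strip basic punctuation, lowercase, and split on spaces.
      var words = txt.replace(/[!.,;]+/g, "").toLowerCase().split(" ");
      for (var i = 0; i < words.length; i++) {
        emit(words[i], doc._id);
      }
    }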
Then you can retrieve all documents with the word hello using this URL: http://localhost:5984/mydb/_view/search/word?key="hello". Taking this further, you could send a POST request to http://localhost:5984/mydb/_view/search/word which contains multiple keys, but that performs an OR operation rather than an AND operation, so this doesn't provide a sufficient tool for matching documents that contain a set of words.
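Since the multi-key request gives you a union, one workaround is to do the intersection yourself on the client. Here is a rough sketch of that, assuming the rows come back in the usual key/id/value shape. The name intersectByWord and the sample rows are made up for illustration; nothing here is part of CouchDB itself.

```javascript
// Hypothetical helper: given the rows from a multi-key query against the
// "word" view, keep only the document ids that matched every requested word.
function intersectByWord(rows, wantedWords) {
  var seen = {}; // doc id -> { word: true }
  rows.forEach(function (row) {
    if (!seen[row.id]) seen[row.id] = {};
    seen[row.id][row.key] = true;
  });
  return Object.keys(seen).filter(function (id) {
    return wantedWords.every(function (w) {
      return seen[id][w] === true;
    });
  });
}

// Made-up rows, shaped like a CouchDB view response.
var sampleRows = [
  { key: "couchdb", id: "doc1", value: "doc1" },
  { key: "couchdb", id: "doc2", value: "doc2" },
  { key: "view", id: "doc1", value: "doc1" }
];

console.log(intersectByWord(sampleRows, ["couchdb", "view"])); // [ 'doc1' ]
```

This pushes the AND work onto the client, so it only makes sense when each word matches a manageable number of documents.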
(If you were serious about this, you'd want to do a better job of sanitizing words, and to also convert the list of words into a set of words.)
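As a sketch of what that cleanup might look like: the helper below lowercases the text, treats anything that isn't an ASCII letter or digit as a separator, and returns each distinct word once. The name uniqueWords is made up, and the ASCII-only rule is an assumption that would need loosening for non-English text.

```javascript
// Hypothetical tokenizer: lowercase, split on runs of non-alphanumeric
// characters, and return each distinct word once, in first-seen order.
function uniqueWords(text) {
  var tokens = String(text).toLowerCase().split(/[^a-z0-9]+/);
  var seen = {};
  var result = [];
  for (var i = 0; i < tokens.length; i++) {
    var t = tokens[i];
    // Skip empty tokens (from leading/trailing separators) and repeats.
    if (t === "" || Object.prototype.hasOwnProperty.call(seen, t)) continue;
    seen[t] = true;
    result.push(t);
  }
  return result;
}

console.log(uniqueWords("Hello, hello world!  Hello?")); // [ 'hello', 'world' ]
```

Inside the view function you would loop over uniqueWords(doc.title + " " + doc.desc) and emit each word, so a document that repeats a word only produces one row for it.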
Approach two: horrifying, but works. For some values of works.
The key for a CouchDB view doesn't have to be just a string; it can be any valid JSON value. So, for example, you might think of your search along these lines (not URI encoded for readability):
http://localhost:5984/mydb/_view/_search/lookup?key="couchdb view"
But if you structured your keys differently, you could also think of it along these lines:
http://localhost:5984/mydb/_view/_search/lookup?key=["couchdb","view"]
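That is to say, you can specify the key as a JavaScript array rather than a JavaScript string. So, if we could create an index that contained an array for each possible combination of words in each document, then we could perform full-text search.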
Wait, why are you closing the browser. Stop. Damnit. It'll be quick. Really fast lookup times. And who cares if the big O notation for both space and speed is horrifying? It's just a one-time cost. A tremendously large one-time cost--yes--but hey, it's kind of novel nonetheless. And if you're only indexing a very small amount of text (just titles, or titles and tags for example), then this may actually work for you.
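To put a number on "tremendously large": if the index holds one key per non-empty combination of a document's distinct words, a document with n distinct words contributes 2^n - 1 keys. A quick back-of-the-envelope sketch (combinationCount is just a name I made up):

```javascript
// Number of non-empty word combinations for a document with n distinct words.
function combinationCount(n) {
  return Math.pow(2, n) - 1;
}

[5, 10, 20].forEach(function (n) {
  console.log(n + " words -> " + combinationCount(n) + " keys");
});
// 5 words -> 31 keys
// 10 words -> 1023 keys
// 20 words -> 1048575 keys
```

That is per document, which is why this only has a chance of being tolerable for very short fields.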
Here is what the view function looks like:
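    function(doc) {
      // permutation func by Jonas Raoni Soares Silva
      // (returns every non-empty subset of v: 2^l - 1 arrays for l items)
      var permute = function(v, m) {
        for (var j, l = v.length, i = (1 << l) - 1, r = new Array(i); i;)
          for (r[--i] = [], j = l; j; i + 1 & 1 << --j && (r[i].push(m ? j : v[j])));
        return r;
      };

      var txt = doc.title;
      // replace() returns a new string, so the result has to be assigned.
      txt = txt.replace(/[!.,;]+/g, "");
      var raw_words = txt.split(" ");

      var words = {};
      for (var i = 0; i < raw_words.length; i++) {
        var word = raw_words[i];
        if (word == "") continue;
        if (!words[word]) {
          words[word] = 1;
        } else {
          words[word]++;
        }
      }

      var word_set = [];
      for (var w in words) {
        word_set.push(w);
      }

      var permutations = permute(word_set, 0);
      for (var k = 0; k < permutations.length; k++) {
        emit(permutations[k], doc._id);
      }
    }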
Let's just start out by saying, yes, this actually works. I tested it. It does work. And let's follow that with a caveat: if the value of txt is more than 4 or 5 words, then it will sometimes trip the 5 second limitation on map functions. (I tried recompiling the code with the delay moved from 5 to 50 seconds, but the change didn't seem to stick for whatever reason. Also, it just shouldn't take five seconds to perform the above code. It is inherently inefficient, but it shouldn't be that slow. I think using the Python or Common Lisp view server might alleviate the issues as well. I may try that a bit later.)
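For what it's worth, the permute helper in the view produces every non-empty subset of the word list rather than every ordering, and it fills each subset from the last word backwards. A more readable sketch of the same idea follows. It also sorts the words first, so that whoever builds the query key knows what order to put the words in. The name wordSubsets and the sorting step are my own additions here, not something the view above does.

```javascript
// Hypothetical, more readable stand-in for permute(): returns every
// non-empty subset of the given words, each subset in sorted order.
function wordSubsets(words) {
  var sorted = words.slice().sort();
  var total = (1 << sorted.length) - 1; // number of non-empty subsets
  var subsets = [];
  for (var mask = 1; mask <= total; mask++) {
    var subset = [];
    for (var j = 0; j < sorted.length; j++) {
      // Bit j of the mask decides whether word j is in this subset.
      if (mask & (1 << j)) subset.push(sorted[j]);
    }
    subsets.push(subset);
  }
  return subsets;
}

console.log(JSON.stringify(wordSubsets(["view", "couchdb"])));
// [["couchdb"],["view"],["couchdb","view"]]
```

With sorted subsets, a lookup for two words becomes a single exact-key query as long as the client sorts its words the same way before building the key. The bit-mask trick caps the input at around 30 words, which is far beyond where the index size stops being reasonable anyway.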
If you wanted all documents with a permutation that contained only NFL, then you would go to this URI:
http://127.0.0.1:5984/bossv/_view/_search/word?key=["NFL"]
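If you wanted all documents with a permutation that starts with NFL, then you could do this instead:
http://127.0.0.1:5984/bossv/_view/_search/word?startkey=["NFL"]&endkey={}
(Strictly speaking, an endkey of {} sorts after every array key, so this range also returns keys whose first word sorts after NFL. Using endkey=["NFL",{}] would limit it to keys that begin with NFL.)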
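The URLs above are shown unencoded for readability; in practice the JSON keys need to be URI encoded. Here is a small sketch of building them, assuming every parameter value should be sent as JSON. The helper name viewUrl is made up for this example.

```javascript
// Hypothetical helper: JSON-encode each parameter value, then URI-encode it.
function viewUrl(base, params) {
  var parts = [];
  for (var name in params) {
    if (Object.prototype.hasOwnProperty.call(params, name)) {
      parts.push(name + "=" + encodeURIComponent(JSON.stringify(params[name])));
    }
  }
  return base + "?" + parts.join("&");
}

var searchView = "http://127.0.0.1:5984/bossv/_view/_search/word";

console.log(viewUrl(searchView, { key: ["NFL"] }));
// http://127.0.0.1:5984/bossv/_view/_search/word?key=%5B%22NFL%22%5D

console.log(viewUrl(searchView, { startkey: ["NFL"], endkey: {} }));
// http://127.0.0.1:5984/bossv/_view/_search/word?startkey=%5B%22NFL%22%5D&endkey=%7B%7D
```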
Finally, I'll briefly mention that you could use an adaptation of this technique to make it possible to retrieve all blog entries that have an arbitrary combination of tags, which is a complex query that a relational database can't easily build an index for, but--using the above technique--CouchDB easily can. So, although this example of creating a permuted index is pretty ridiculous, I think the technique could be genuinely useful in some situations.
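A sketch of what that tag adaptation might look like, assuming each blog entry has a tags array of unique strings (the field name, the function name tagCombinations, and the sample document are all invented for this example). The emit stub at the top only exists so the snippet can run outside CouchDB; in a real view you would drop it and use the body of tagCombinations as the map function.

```javascript
// Stand-in for CouchDB's emit() so the sketch can run outside the view server.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Hypothetical map function: one row per non-empty combination of an
// entry's tags, with the tags in each key sorted.
function tagCombinations(doc) {
  if (!doc.tags || !doc.tags.length) return;
  var tags = doc.tags.slice().sort();
  for (var mask = 1; mask < (1 << tags.length); mask++) {
    var combo = [];
    for (var j = 0; j < tags.length; j++) {
      if (mask & (1 << j)) combo.push(tags[j]);
    }
    emit(combo, doc._id);
  }
}

tagCombinations({ _id: "post-1", tags: ["python", "couchdb", "search"] });
console.log(rows.length); // 7
```

Finding every entry tagged with both couchdb and search is then one exact lookup with key=["couchdb","search"], provided the client sorts the tags the same way. Since entries rarely have more than a handful of tags, the 2^n - 1 blowup stays small here.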