So there’s a tradeoff here, between the speed with which a search engine incorporates new documents, and the speed with which it responds to queries. Confusion about this makes me grumpy when people talk about blog search and pings (the implication being that if old-school websearch engines listened more to pings, then they would be up-to-the-second fresh).
Traditional information-retrieval engines assume they have all the relevant documents before they build an index. And at the heart of the index are what are called “postings” — mappings from terms to the lists of IDs of the documents that contain them. A postings file might look like this:
aardvark – 1,10,13
anteater – 1,8,49,74
ants – 1,8,10,12,74,89
but compressed and optimized to support one basic operation: scanning the postings to find the document IDs that contain all the terms from the query, i.e. intersecting the lists (so scanning for “aardvark” and “ants” should very quickly return 1,10).
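That intersection is just a linear merge over two sorted lists. Here’s a minimal sketch (the function name and the in-memory dict are illustrative, not how a real engine stores postings):

```python
def intersect(a, b):
    """Merge-intersect two sorted lists of doc IDs."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

# The postings from the example above, as a toy in-memory index.
postings = {
    "aardvark": [1, 10, 13],
    "anteater": [1, 8, 49, 74],
    "ants": [1, 8, 10, 12, 74, 89],
}

print(intersect(postings["aardvark"], postings["ants"]))  # [1, 10]
```

Because both lists are sorted, the query cost is linear in the length of the shorter list with galloping tricks, and never worse than the sum of the two lengths — which is why postings are so fast for AND queries.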
Now there are many tweaks on top of this basic postings framework, but as described, postings are 1) brutally efficient at supporting AND queries, and 2) a real pain to update. Part of the problem is that postings are usually compressed. For example, lists of doc IDs are usually delta-encoded — instead of storing the absolute numbers (1,10,13), you store the gaps between them (1,9,3). This makes the numbers smaller on average, so you can represent them more compactly, using some kind of variable-length encoding.
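The two steps — gaps, then a variable-length byte encoding — can be sketched like this (a toy version; real engines use more elaborate schemes, and the high-bit-terminator convention here is just one common choice):

```python
def gaps(doc_ids):
    """Delta-encode a sorted list of doc IDs into gaps."""
    prev = 0
    out = []
    for d in doc_ids:
        out.append(d - prev)
        prev = d
    return out

def vbyte_encode(n):
    """Variable-byte encoding: 7 bits per byte,
    high bit set on the final byte of each number."""
    chunks = []
    while True:
        chunks.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    chunks.reverse()
    chunks[-1] |= 0x80
    return bytes(chunks)

ids = [1, 10, 13]
print(gaps(ids))  # [1, 9, 3]
encoded = b"".join(vbyte_encode(g) for g in gaps(ids))
print(len(encoded))  # 3 bytes, since every gap fits in 7 bits
```

Small gaps fit in a single byte, so a dense term’s postings list shrinks dramatically — but notice the encoding only works because the gaps are computed over the *whole* sorted list, which is exactly what makes in-place updates painful.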
But if you have a new term, or a new document? Good luck. Sure, you can take that new information, and merge it with your existing index to produce a new index. But that means a reformat of the file, and writing out a lot of new bytes. You’re not going to want to do that every time a new document arrives.
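In the uncompressed toy-dict world the merge looks trivial, which is exactly what makes it misleading — here’s a sketch (names are mine, not from any real engine):

```python
def merge_indexes(main, delta):
    """Merge a small 'delta' index of new documents into the main
    index, producing a brand-new index. In a real engine every
    touched postings list must also be decompressed, re-gapped,
    re-encoded, and rewritten to disk."""
    merged = {term: list(ids) for term, ids in main.items()}
    for term, ids in delta.items():
        merged[term] = sorted(set(merged.get(term, [])) | set(ids))
    return merged

main = {"ants": [1, 8, 10]}
delta = {"ants": [95], "beetle": [95]}
print(merge_indexes(main, delta))
# {'ants': [1, 8, 10, 95], 'beetle': [95]}
```

The in-memory merge is cheap; the expense on a real system is all in that comment — re-compressing and rewriting the on-disk postings, which is why engines batch updates rather than doing this per document.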
Now, aren’t there blogsearch engines out there that claim very quick incorporation of new documents? Yes there are, and I basically buy that at least some of them operate as advertised. But then you have to ask how quickly they serve queries. Which would you rather have: information that’s a few minutes old but served in under a second, or information that’s up-to-the-second fresh but takes 30 seconds to query? Depends on your needs, of course, but anywhere close to 30 seconds and I think it’s an unsatisfactory user experience.
When responses to queries are really slow, one is tempted by a terrible suspicion: could there maybe be … SQL under that hood? Database systems (RDBMSs) that support SQL are built under the assumption that there will be a mixture of read queries and write queries, and (more importantly) they have a very general model that can support arbitrary mixes of those. If you want to seriously scale a search engine, you cannot afford to build it on top of an RDBMS.