Here’s a more technical (and much longer!) post than usual – some thoughts about web search relevance, which may help explain why I ended up at Powerset.
Relevance Improvements (large and small)
I’ve been working in web search since 2000 (Excite, Inktomi, Yahoo, and now Powerset). During that time, I’ve seen lots of new techniques that made small but significant improvements in relevance for the user, including several different ways to use machine learning. During that entire time, though, I’ve only seen two new techniques that, by themselves, made a large difference, a step-function kind of change. They are:
So what do inbound links get you? A popularity measure and summary text. Each link gives you 1) a vote for the document, and 2) a snippet of link text, which is often a very concise summary of the target page. If many people “vote” for a page and use a particular word or phrase to summarize it, then using those words or phrases for retrieval may give much better results than anything you would find based on the documents themselves. For example, the query web encyclopedia on Google gives you an excellent first result of http://en.wikipedia.org, no doubt because so very many people link to wikipedia and use those terms.
What does proximity get you? For one thing, preferring terms that are close together will help you catch names – for the query George Bush, you are more likely to find stories about one of the American presidents than stories about a man named George who went out into the bush. But this is the simple case, and could be solved with phrase searches, where you insist that the terms must occur right next to each other to count as a match at all. Why is it important that terms are close in the document, even if they’re not adjacent?
Why Proximity Matters
Proximity matters because terms that are close to each other in the text are more likely to be closely connected in the meaning structure of the text.
Take a query like Obama Afghanistan, for example. While it’s possible that the querier is independently interested in the two topics (Obama and Afghanistan), it’s much more likely that there’s some relationship sought: statements by Obama about Afghanistan, a question about whether Obama has been to Afghanistan, etc.
I tried this query on Google, and got result snippets that included these:
But if you look at what’s happening in Afghanistan now, you are seeing the Taliban resurgent, you are seeing al-Qaida strengthen itself,” Obama said. …
MODERATOR: Senator Obama, Afghanistan has just conducted the first elections in its 5000-year history. They appear to have gone very well–at least, …
In both these examples (where the terms have a distance of 14 and 1, respectively (ignoring punctuation)), the proximate terms actually seem to be related, which is probably a good thing. In the first case, we have a quote on Afghanistan attributed to Barack Obama himself; in the second we have a question about Afghanistan that is being addressed to Senator Obama.
This bet of relatedness seems dodgier if the terms are separated by, say, hundreds of words. Or take the case of another result from this same search:
Updated April, 2007. Map of Afghanistan In PDF format … Obama aides won’t say how much money the solicitation raised and would only say that “thousands” …
If you look at the underlying document (going to the cached version in this case) it turns out that the distance represented by that ellipsis is large – 79 words separate the closest occurrence of “Afghanistan” and “Obama”. “Afghanistan” occurs only once, in a sidebar link to a map, which sidebar decorates an unrelated news article about an Obama fundraiser. Not very proximate, and a result that’s correspondingly disappointing in the unrelatedness of the words.
So imagine that we’ve decided that proximity is a good thing – that results are more likely to be relevant when the query terms are close together. Now, how exactly should we score results for proximity? Our job here is not to do every part of search ranking – it’s just to figure out what score to give a query-document pair that captures the proximity aspect, and rewards high-proximity documents in the right way.
The simplest scoring scheme just to assign a score that is the distance between the terms. If you see “Obama” and “Afghanistan” right next to each other, return a score of 1; if you see “Obama [..78 words..] Afghanistan”, return a score of 79. Simple, no? (Of course, we haven’t said anything about how low a score has to be to be considered a good score, but maybe we can let someone else sort that out when all the ranking factors get combined.) The main thing is that a low score is better than a high score. Now, one complexity is that this pair terms may occur multiple times in a document, so which distances do we care about? Um, let’s just take the minimum distance – the distance between the two instances that are the closest.
Now, let’s begin to open the worm-can: what about three-term queries? Your job is to take a three-term query and a document, and return a single score that is a proximity reward. The problem here isn’t that we lack ideas – the problem is that there are a whole lot of ways to generalize from 2 to 3 here that seem equally likely to be good. The generalized score could be:
Do you have any intuition about which of these proposals is more likely to capture “relatedness”? I don’t. So the right approach is probably to hack up a number of scoring methods and see which gives the best results.
When Proximity Fails
We haven’t seen the worst of it yet. Ponder this snippet from the same Google query I issued earlier:
Taliban Comeback in Afghanistan?; Obama vs. Clinton Fallout; Iraqis Not Welcome in United States?; What Should Be Done about Iraqi Refugees?; …
Oh boy. What’s our two-term proximity score? If we ignore puncuation as before, then we have a score of 1 (the best possible) meaning that the terms are adjacent. But if proximity is a proxy for relatedness, it is failing. Clicking through to the document, you see that this snippet is a collection of unrelated headlines, separated by semicolons.
Does this mean that we shouldn’t have ignored punctuation in the first place? Maybe distance between terms should take intervening punctuation into account. We had a really good result (#9) where terms were only separated by commas, and a really bad result (#7) where the separators were a question-mark and a semicolon. Those do seem like stronger separators – maybe we need some kind of discounting scheme for punctuation? Certainly terms occurring in separate sentences should be counted as further apart than terms in the same sentence, even if the number of words between them are the same….
Proximity is a hack
At this point I hope you are getting that hackish feeling. There’s a fundamental phenomenon (something we’re calling “relatedness”) to which a simple measure (“proximity”) initially seemed like a good approximation, which might only need a little bit of refinement. As we refine it, though, we’re feeling the need to add epicycles to epicycles, which is never a good sign.
Admittedly, I’m being a bit faux-naive about the process of constructing a proximity function. You don’t have to cover all the corner cases one by one, and cover each one exactly. You can pull up really large document sets and get statistical answers about how much different factors should impinge (and know, in some average sense, how important an intervening comma is relative to a semicolon), and get closer approximations to, um, whatever it is you’re trying to approximate. So what are we trying to approximate with proximity, anyway?
What Proximity Approximates
I’ve alluded to “relatedness”, vaguely, but haven’t gone much further than that. But I’d like to argue that it’s often true that terms in a query have some (explicit or implied) linguistic or semantic relationship between them, and that a good document-match for such queries is likely to be one that has the same relationship between those terms.
In cases where the linguistic/semantic relationships are made somewhat explicit in the query (e.g, stars smaller than the Sun) you want to retrieve documents where the same relationships are present, or at least respected in some way. In cases where the relationship is hard to divine or is inherently ambiguous (as with our old friend Obama Afghanistan) you want to retrieve only documents where there is some linguistic or semantic relationship between the terms, if only to increase your odds of hitting the right one.
So that’s what proximity really does for search engines – it’s a crude unrelatedness filter. Words that are really really far apart in a document are likely not to be participating in the right relationship, because it’s hard for them to have any linguistic or semantic connection at all over that distance. Proximity increases your odds of getting lucky enough to find your terms in some relationship, which in turn increases your odds of getting the terms in the right relationship. Chancy business, this web search!
Proximity-busters and structure
All of the above assumes that at base your document is a set of words in sequence, and that our base proximity function, however tweaked, is a function on distance in that sequence. But it’s a commonplace in linguistics that structural connections are not that obviously connected to distances in the surface string. In examples like these:
John kissed Mary
John, who was feeling very dapper indeed in his checked sport jacket and regimental tie, kissed Mary.
John is just as close to Mary when separated by one word as by fifteen words. And in general, at the syntactic level it’s pretty easy to stuff sentences with subordinate clauses and parenthetical digressions that can artificially stretch out surface-string proximity without changing the relationships of interest at all. This seems like bad news for pure-proximity measures. And going in the other direction, it’s easy to find discourse markers that, while very short, materially disrupt the relatedness of the terms on either side of them. Try these:
Everyone knows this – so what?
Nothing I said in the last section would be a surprise to a linguist, needless to say – the field is all about the relationship between the surface string and the deep structure. But there are a couple of takeaways of interest:
So, can you do it?
To recap: proximity is both a wonderfully powerful relevance feature, and a total hack. It helps enormously, but it’s not what you really want, it’s just sorta somewhat correlated with what you really want. What you need for what you really want is the underlying structure of all that web content: the real syntactic structure of the sentences, how the sentences connect to each other, how the facts relate, and (maybe) how the discourse flows and the topics connect. We’ve squeezed all the juice we can out of webpages considered as word-vectors; now it’s time to parse this stuff and get at the real structure.
Can that be done? A couple of years ago I would have said no, but I hadn’t seen the PARC natural language technology then, and didn’t know that an effort this concerted and well-funded was on the way. Now, do I think that Powerset will do it? I still don’t know, frankly – there’s so much more to do to make it real and debugged and scaled the way it needs to be. But it’s clear to me that the next big thing in web search is either this or something a whole lot like this, and I think we have the best shot of anyone. And that’s why I’m at Powerset.