Proximity is a hack

June 25, 2007

Here’s a more technical (and much longer!) post than usual – some thoughts about web search relevance, which may help explain why I ended up at Powerset.

Relevance Improvements (large and small)

I’ve been working in web search since 2000 (Excite, Inktomi, Yahoo, and now Powerset). During that time, I’ve seen lots of new techniques that made small but significant improvements in relevance for the user, including several different ways to use machine learning. During that entire time, though, I’ve only seen two new techniques that, by themselves, made a large difference, a step-function kind of change. They are:

  • Use of inbound links. You know: PageRank and all that. Possibly more importantly, it’s also about indexing the text of the link itself (anchortext) along with the target page. (Using inbound links (aka backlinks) is expensive because you have to do a big inversion and sort to find all the links that point to every page.)
  • Term proximity. For multi-term queries, you prefer documents that have the terms occurring close together in the text. (This is slightly expensive (or seemed expensive in 2001) because you have to record the positions of all words in a document along with the words themselves. Documents aren’t just bags of words anymore, they’re vectors of words.)

So what do inbound links get you? A popularity measure and summary text. Each link gives you 1) a vote for the document, and 2) a snippet of link text, which is often a very concise summary of the target page. If many people “vote” for a page and use a particular word or phrase to summarize it, then using those words or phrases for retrieval may give much better results than anything you would find based on the documents themselves. For example, the query web encyclopedia on Google gives you an excellent first result of http://en.wikipedia.org, no doubt because so very many people link to wikipedia and use those terms.

    What does proximity get you? For one thing, preferring terms that are close together will help you catch names – for the query George Bush, you are more likely to find stories about one of the American presidents than stories about a man named George who went out into the bush. But this is the simple case, and could be solved with phrase searches, where you insist that the terms must occur right next to each other to count as a match at all. Why is it important that terms are close in the document, even if they’re not adjacent?

    Why Proximity Matters

    Proximity matters because terms that are close to each other in the text are more likely to be closely connected in the meaning structure of the text.

    Take a query like Obama Afghanistan, for example. While it’s possible that the querier is independently interested in the two topics (Obama and Afghanistan), it’s much more likely that there’s some relationship sought: statements by Obama about Afghanistan, a question about whether Obama has been to Afghanistan, etc.

    I tried this query on Google, and got result snippets that included these:
    Result #6:

    But if you look at what’s happening in Afghanistan now, you are seeing the Taliban resurgent, you are seeing al-Qaida strengthen itself,” Obama said.

    Result #9:

    MODERATOR: Senator Obama, Afghanistan has just conducted the first elections in its 5000-year history. They appear to have gone very well–at least, …

    In both these examples (where the terms have a distance of 14 and 1, respectively (ignoring punctuation)), the proximate terms actually seem to be related, which is probably a good thing. In the first case, we have a quote on Afghanistan attributed to Barack Obama himself; in the second we have a question about Afghanistan that is being addressed to Senator Obama.

This bet on relatedness seems dodgier if the terms are separated by, say, hundreds of words. Or take the case of another result from this same search:

    Result #12:

    Updated April, 2007. Map of Afghanistan In PDF format … Obama aides won’t say how much money the solicitation raised and would only say that “thousands” …

If you look at the underlying document (going to the cached version in this case) it turns out that the distance represented by that ellipsis is large – 79 words separate the closest occurrences of “Afghanistan” and “Obama”. “Afghanistan” occurs only once, in a sidebar link to a map, which sidebar decorates an unrelated news article about an Obama fundraiser. Not very proximate, and a correspondingly disappointing result: the two words just aren’t related here.

    Implementing proximity

    So imagine that we’ve decided that proximity is a good thing – that results are more likely to be relevant when the query terms are close together. Now, how exactly should we score results for proximity? Our job here is not to do every part of search ranking – it’s just to figure out what score to give a query-document pair that captures the proximity aspect, and rewards high-proximity documents in the right way.

The simplest scoring scheme is just to assign a score equal to the distance between the terms. If you see “Obama” and “Afghanistan” right next to each other, return a score of 1; if you see “Obama [..78 words..] Afghanistan”, return a score of 79. Simple, no? (Of course, we haven’t said anything about how low a score has to be to be considered a good score, but maybe we can let someone else sort that out when all the ranking factors get combined.) The main thing is that a low score is better than a high score. Now, one complexity is that this pair of terms may occur multiple times in a document, so which distances do we care about? Um, let’s just take the minimum distance – the distance between the two instances that are the closest.
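
In code, that minimum-distance scheme is only a few lines. Here’s a minimal sketch (the function name is mine, and I’m assuming we already have a positional index that hands us a sorted list of positions for each term in the document):

```python
def min_pair_distance(pos_a, pos_b):
    """Smallest distance between any occurrence of term A and term B.

    pos_a, pos_b: sorted word positions for each term, as a positional
    index would store them. Adjacent terms score 1; "Obama [..78
    words..] Afghanistan" scores 79. Returns None if a term is absent.
    Runs in O(len(pos_a) + len(pos_b)) with one merge-style pass.
    """
    if not pos_a or not pos_b:
        return None
    best, i, j = None, 0, 0
    while i < len(pos_a) and j < len(pos_b):
        d = abs(pos_a[i] - pos_b[j])
        if best is None or d < best:
            best = d
        # advance whichever pointer sits at the smaller position,
        # since that's the only move that can shrink the gap
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return best

# "Senator Obama, Afghanistan has just conducted..." -> adjacent, score 1
print(min_pair_distance([13, 52], [14, 200]))  # -> 1
```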

    Now, let’s begin to open the worm-can: what about three-term queries? Your job is to take a three-term query and a document, and return a single score that is a proximity reward. The problem here isn’t that we lack ideas – the problem is that there are a whole lot of ways to generalize from 2 to 3 here that seem equally likely to be good. The generalized score could be:

  • The minimum of the three two-term proximity scores
  • The average of the three two-term proximity scores
  • The length of the shortest span in the document that includes all three terms

Do you have any intuition about which of these proposals is more likely to capture “relatedness”? I don’t. So the right approach is probably to hack up a number of scoring methods (all three are sketched below) and see which gives the best results.
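
For concreteness, here’s a minimal sketch of all three candidates. It reuses the hypothetical min_pair_distance from the earlier sketch, and assumes every query term occurs at least once in the document:

```python
from itertools import combinations

# positions: one sorted position list per query term, straight from the
# positional index, e.g. [[5, 40], [6, 90], [70]] for a 3-term query.

def pairwise_min(positions):
    # Candidate 1: the single smallest two-term distance
    return min(min_pair_distance(a, b) for a, b in combinations(positions, 2))

def pairwise_avg(positions):
    # Candidate 2: the average of the two-term distances
    scores = [min_pair_distance(a, b) for a, b in combinations(positions, 2)]
    return sum(scores) / len(scores)

def shortest_span(positions):
    # Candidate 3: length of the shortest stretch of the document that
    # contains all the terms -- the classic sliding window over the
    # merged, sorted occurrence list.
    events = sorted((p, t) for t, plist in enumerate(positions) for p in plist)
    need, counts, left, best = len(positions), {}, 0, None
    for pos_r, term_r in events:
        counts[term_r] = counts.get(term_r, 0) + 1
        while len(counts) == need:            # window still covers every term
            pos_l, term_l = events[left]
            span = pos_r - pos_l + 1
            if best is None or span < best:
                best = span
            counts[term_l] -= 1
            if counts[term_l] == 0:
                del counts[term_l]
            left += 1
    return best
```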

    When Proximity Fails

    We haven’t seen the worst of it yet. Ponder this snippet from the same Google query I issued earlier:

    Result #7:

    Taliban Comeback in Afghanistan?; Obama vs. Clinton Fallout; Iraqis Not Welcome in United States?; What Should Be Done about Iraqi Refugees?; …

Oh boy. What’s our two-term proximity score? If we ignore punctuation as before, then we have a score of 1 (the best possible), meaning that the terms are adjacent. But if proximity is a proxy for relatedness, it is failing. Clicking through to the document, you see that this snippet is a collection of unrelated headlines, separated by semicolons.

Does this mean that we shouldn’t have ignored punctuation in the first place? Maybe distance between terms should take intervening punctuation into account. We had a really good result (#9) where the terms were separated only by a comma, and a really bad result (#7) where the separators were a question mark and a semicolon. Those do seem like stronger separators – maybe we need some kind of discounting scheme for punctuation? Certainly terms occurring in separate sentences should be counted as further apart than terms in the same sentence, even if the number of words between them is the same…
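
One way to cash that out, purely as an illustration: charge extra “distance” for every punctuation mark between the terms, with strong separators costing more than weak ones. The penalty values below are invented; as I’ll argue in a moment, you’d really want to fit them from data:

```python
import re

# Invented penalties, for illustration only: a comma nudges terms a
# little further apart; semicolons and sentence-enders push them a lot.
PUNCT_PENALTY = {",": 2, ":": 5, ";": 10, ".": 15, "?": 15, "!": 15}

def punct_aware_distance(text, term_a, term_b):
    """Minimum distance between term_a and term_b, counted in words
    plus a penalty for each intervening punctuation mark. A sketch."""
    tokens = re.findall(r"\w+|[,;:.?!]", text.lower())
    word_idx = 0      # position counting words only
    penalties = 0     # running total of punctuation penalties so far
    last = {}         # term -> (word_idx, penalties) at latest occurrence
    best = None
    for tok in tokens:
        if tok in PUNCT_PENALTY:
            penalties += PUNCT_PENALTY[tok]
            continue
        word_idx += 1
        for term, other in ((term_a, term_b), (term_b, term_a)):
            if tok == term and other in last:
                w, p = last[other]
                d = (word_idx - w) + (penalties - p)
                if best is None or d < best:
                    best = d
        if tok in (term_a, term_b):
            last[tok] = (word_idx, penalties)
    return best

# Result #7: adjacent by word count, but "?" and ";" intervene,
# so the score becomes 1 + 15 + 10 = 26 instead of 1.
print(punct_aware_distance(
    "Taliban Comeback in Afghanistan?; Obama vs. Clinton Fallout",
    "obama", "afghanistan"))  # -> 26
```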

    Proximity is a hack

    At this point I hope you are getting that hackish feeling. There’s a fundamental phenomenon (something we’re calling “relatedness”) to which a simple measure (“proximity”) initially seemed like a good approximation, which might only need a little bit of refinement. As we refine it, though, we’re feeling the need to add epicycles to epicycles, which is never a good sign.

Admittedly, I’m being a bit faux-naive about the process of constructing a proximity function. You don’t have to cover all the corner cases one by one, and cover each one exactly. You can pull up really large document sets and get statistical answers about how much different factors should impinge (and know, in some average sense, how important an intervening comma is relative to a semicolon), and get closer approximations to, um, whatever it is you’re trying to approximate.
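
To show the shape of that statistical tuning – and only the shape; the features, numbers, and model below are hypothetical stand-ins, with the four rows loosely echoing the four snippets above – you could hand the raw ingredients of the score to a simple learner and let it assign the relative penalties:

```python
from sklearn.linear_model import LogisticRegression

# One row per (query, document) pair: raw ingredients of a proximity
# score, with a human relevance judgment as the label. The rows loosely
# echo results #9, #6, #7, and #12 from the Obama Afghanistan query;
# the numbers are illustrative, not measured.
X = [
    # [word_gap, commas, semicolons, sentence_breaks] between the terms
    [1,  1, 0, 0],   # "Senator Obama, Afghanistan has just..."  relevant
    [14, 3, 0, 0],   # "...Afghanistan now ... Obama said"       relevant
    [1,  0, 1, 1],   # "Afghanistan?; Obama vs. Clinton..."      not relevant
    [79, 4, 0, 6],   # sidebar map link + fundraiser article     not relevant
]
y = [1, 1, 0, 0]

model = LogisticRegression().fit(X, y)
# The learned coefficients say how much each separator "costs" relative
# to a plain word of distance (more negative = pushes toward irrelevant).
print(dict(zip(["gap", "comma", "semicolon", "sentence"], model.coef_[0])))
```

With real judgment sets, this is how you’d learn – in that average sense – how a comma compares to a semicolon. So what are we trying to approximate with proximity, anyway?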

    What Proximity Approximates

I’ve alluded to “relatedness”, vaguely, but haven’t gone much further than that. I’d like to argue that the terms in a query often have some (explicit or implied) linguistic or semantic relationship between them, and that a good document-match for such queries is likely to be one that has the same relationship between those terms.

In cases where the linguistic/semantic relationships are made somewhat explicit in the query (e.g., stars smaller than the Sun) you want to retrieve documents where the same relationships are present, or at least respected in some way. In cases where the relationship is hard to divine or is inherently ambiguous (as with our old friend Obama Afghanistan) you want to retrieve only documents where there is some linguistic or semantic relationship between the terms, if only to increase your odds of hitting the right one.

    So that’s what proximity really does for search engines – it’s a crude unrelatedness filter. Words that are really really far apart in a document are likely not to be participating in the right relationship, because it’s hard for them to have any linguistic or semantic connection at all over that distance. Proximity increases your odds of getting lucky enough to find your terms in some relationship, which in turn increases your odds of getting the terms in the right relationship. Chancy business, this web search!

    Proximity-busters and structure

    All of the above assumes that at base your document is a set of words in sequence, and that our base proximity function, however tweaked, is a function on distance in that sequence. But it’s a commonplace in linguistics that structural connections are not that obviously connected to distances in the surface string. In examples like these:

    John kissed Mary

    John, who was feeling very dapper indeed in his checked sport jacket and regimental tie, kissed Mary.

    John is just as close to Mary when separated by one word as by fifteen words. And in general, at the syntactic level it’s pretty easy to stuff sentences with subordinate clauses and parenthetical digressions that can artificially stretch out surface-string proximity without changing the relationships of interest at all. This seems like bad news for pure-proximity measures. And going in the other direction, it’s easy to find discourse markers that, while very short, materially disrupt the relatedness of the terms on either side of them. Try these:

  • “Unrelatedly, ….”
  • “Anyway, …”
  • “In other news, …”
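
Going back to John and Mary: a dependency parse makes the point measurable. In both sentences John is the subject of kissed and Mary is its object, so they sit two hops apart in the tree no matter how much surface padding intervenes. A sketch with a modern off-the-shelf parser (spaCy here, purely as an illustration – this assumes the en_core_web_sm model is installed, and that the parser gets the attachments right):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def tree_distance(doc, word_a, word_b):
    """Edges on the dependency-tree path between two tokens: hops from
    each token up to their lowest common ancestor."""
    tok_a = next(t for t in doc if t.text == word_a)
    tok_b = next(t for t in doc if t.text == word_b)

    def path_to_root(tok):
        path = [tok]
        while tok.head is not tok:   # spaCy roots are their own head
            tok = tok.head
            path.append(tok)
        return path

    pa, pb = path_to_root(tok_a), path_to_root(tok_b)
    shared = set(pa) & set(pb)
    return (next(i for i, t in enumerate(pa) if t in shared) +
            next(i for i, t in enumerate(pb) if t in shared))

plain = nlp("John kissed Mary.")
padded = nlp("John, who was feeling very dapper indeed in his checked "
             "sport jacket and regimental tie, kissed Mary.")

# Surface distance is 1 word vs. 15 words; tree distance should be 2
# (John <- kissed -> Mary) in both cases, parser willing.
print(tree_distance(plain, "John", "Mary"))
print(tree_distance(padded, "John", "Mary"))
```

The discourse markers above are the mirror image: nearly zero surface distance, but no syntactic path at all between the statements they separate.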

    Everyone knows this – so what?

    Nothing I said in the last section would be a surprise to a linguist, needless to say – the field is all about the relationship between the surface string and the deep structure. But there are a couple of takeaways of interest:

  • Despite their hackishness and inherent limitations, proximity measures are genuinely important for the relevance of web search – that important, widely deployed, and lucrative technology. What does this indicate about how important it might be to do the deeper document analysis for which proximity is a stand-in?
  • Term relationships in the document are important even without structure in the query. Remember the argument that proximity is really helping the first stage in a two-stage guess: 1) maximize the chances that the query terms are related in the document in some way, so that 2) we maximize the chances that they are related in the way the user intended. The best situation of all for natural-language search is to get enough structure from the query that you match up the structure of the document and query exactly. Failing that (as in Obama Afghanistan), though, at least we can try to do a better job than proximity on step #1.

So, can you do it?

    To recap: proximity is both a wonderfully powerful relevance feature, and a total hack. It helps enormously, but it’s not what you really want, it’s just sorta somewhat correlated with what you really want. What you need for what you really want is the underlying structure of all that web content: the real syntactic structure of the sentences, how the sentences connect to each other, how the facts relate, and (maybe) how the discourse flows and the topics connect. We’ve squeezed all the juice we can out of webpages considered as word-vectors; now it’s time to parse this stuff and get at the real structure.

    Can that be done? A couple of years ago I would have said no, but I hadn’t seen the PARC natural language technology then, and didn’t know that an effort this concerted and well-funded was on the way. Now, do I think that Powerset will do it? I still don’t know, frankly – there’s so much more to do to make it real and debugged and scaled the way it needs to be. But it’s clear to me that the next big thing in web search is either this or something a whole lot like this, and I think we have the best shot of anyone. And that’s why I’m at Powerset. 🙂


    Working at Powerset

    June 8, 2007

    Whew. In December I blogged that I was leaving Yahoo! for a new job, and had every intention of following up with news about the new job and the new company just as soon as I started. But the next time I look up I find that it’s … June(!). I guess new jobs have a way of doing that to you.

    So what is the new company? Powerset, which is applying natural-language technology (originally developed at [formerly Xerox] PARC) to web search. And what does that mean? It means that, rather than indexing the words on webpages, the first thing we do is parse the sentences into the kinds of syntax trees that you might see in a grammar or linguistics class, complete with noun phrases, verb phrases, prepositional phrases and so on. And that’s just the first thing – once we’ve figured out the syntactic structure, we’re extracting every bit of semantic meaning we can to match against any query you might want to enter. And if we can do all this correctly, we should be able to separate semantic wheat from chaff ever so much more efficiently than regular search engines, and find things for you in ways that have never been possible before.

If all this sounds computationally expensive to try to apply to the entire web, well … it is. There’s a big double bet here on a couple of decades-long trends: that NLP technology keeps getting ever faster and more mature, and that Moore’s Law keeps making computation ever cheaper – and that the historic moment when NLP meets Moore’s Law in the middle, at web-scale search, is … 2007. Or maybe 2008. 😉

So what am I doing? Nothing to do with webspam, right now – we aspire to be a spam target, of course, but that time has not yet come. I’m directing an engineering group that does several things, including the metrics and relevance-testing program that tells us how we’re doing beyond the cool demo examples. Among other things, this means that if anyone is in a mood to start shooting the messengers, then I might be the first casualty. But luckily it seems that even at the highest levels people understand that the only way you can possibly figure out how you’re doing (and how to improve) is to be ruthlessly blind and cruelly random in testing and sampling. So far so good. I haven’t even been wearing the Kevlar to work lately.

    Now, Powerset has been all over the press, including multiple New York Times articles. The founders are not shy people, and they’re happy to explain to as many reporters as possible exactly why Powerset must remain in stealth mode. 🙂 And there has been a lot of blog chatter (and even incisive blog discussion) about Powerset’s ambitions and prospects. I’ll cover some of the controversy in a later post. For now, though, let me say that I’m amused by seeing the following simultaneous critiques of Powerset: 1) what Powerset is trying to do is so hard that it couldn’t be done in a million years, and 2) Powerset is lame because it hasn’t launched already. Uh…. either one of these could be true, but surely they’re not both true at once? Pick at most one, please.


    The Influentials Speaker Series takes it to the next level

    October 30, 2006

    I blogged previously about the Influentials Speaker Series that Yahoo! puts on for its employees. I was very very impressed by the influentiality of our previous speakers, most notably Tom Cruise, who is always a strong influence for me whenever I play loud music at home alone. But I am proud to say that the influentialisticness of the average speaker has holistically increased by quantum leaps, with the announcement that our next speaker will be:

    Deepak Chopra.

    (I report, you decide.)


    The beauty of a fire alarm in the springtime

    May 10, 2006

    If you happened to be walking by Yahoo!’s Santa Clara campus in the last couple of weeks, you might have seen a few hundred geeks (me included) flushed out of their dens, and blinking in the unfamiliar sunshine.


    Bonus Tom Cruise-related question

    March 20, 2006

    In what way is the Mission Impossible theme music unusual? (Hint: it would be especially obvious in a written score.)


    The Influential Tom Cruise

    March 20, 2006

    I’ll admit that I’m a little bit embarrassed by the latest speaker to be invited to Yahoo!’s campus. The embarrassment is compounded by the name of the speaker series: The Influentials.

    Just as a reality check, let’s take a quick spin through the original reasons for fame of three of our Influential speakers:

    Tom Brokaw: reading stuff on TV that other people wrote. And having a funny voice.
    Arnold Schwarzenegger: reciting lines in the movies that other people wrote. And bodybuilding.
    Tom Cruise: reciting lines in movies that other people wrote. And dancing in his underwear.

    Sure, Tom Brokaw was also a journalist and wrote a book, and Arnold is honest-to-god the Governor of the very state where we have our headquarters. But it would be a mistake, I think, to try to front that this series is about inviting prominent journalists, politicians, and Scientologists to share their views …. it’s about inviting people whose faces are very familiar because we’ve seen them frequently on screens of various sizes.

Just as a counterweight, though, I have to say that the newly expanded Yahoo! Research group is generally kicking ass, and as part of that has a speaker series of their own. And guess who the most recent speaker was? Christos Papadimitriou. (I know what you’re thinking: not _that_ Christos Papadimitriou? Yes, _that_ Christos Papadimitriou. Combinatorial Optimization Man himself.) And I really wanted to see his talk, but you know what? The talk was at 10am, and I got there at about 10:02, with the result that the very large room was not only standing room only, but that special kind of standing-room-only where anywhere you try to stand you’re going to be rudely blocking the view of a standing person who staked out a standing-room spot before you did. So I left… But there must have been a good 500 people there. And then there’s the TechDev speaker series, which has brought in Jeff Barr, John Battelle, Eric Brewer, Lawrence Lessig, Mark Pauline (of Survival Research Labs)… So overall, I think we’re doing OK with speakers at various levels of fame and substance.

    It could be that my problem with Cruise/Brokaw/Schwarzenegger, when it comes down to it, is just with the name of the series. Call it the Familiar Faces Celebrity Speaker Series, and maybe it would be just fine. Maybe even kind of fun. I don’t agree with the Governator’s politics, and am not sure that he is an Influential, despite his current office… but was I there? Was I as much of a Treo Paparazzo as the next guy? I’ll never tell.


    Why I can never leave Yahoo!

    October 12, 2004

    A true story: soon after Yahoo’s purchase of Inktomi, a hundred of us Inktomites took a day trip to Sunnyvale for a mass orientation. In addition to the usual HR informational stuff that any new hire would get, there were good inspirational talks about Yahoo’s commitment to search, which we were beginning to believe. By late afternoon the mood was cheerful, even festive.

    At that point, an extremely chipper and cheerful woman from HR took the mike, and after talking for a little while, said that we should all come stand at the front of the room. A hundred geeks stepped forward, suspecting some kind of team-building exercise, but still in pretty good humor about it all. Then she said “The executive team thought it would be _really_ _fun_ if we all sang a _song_ together to celebrate your joining Yahoo!”. (A couple of us became slightly nervous at that point.) Then she said “We thought it would be _especially_ fun if we sang “We Are the World”, since we’re all going to be working together!” (Looks of genuine alarm are shared, but no one is willing to unilaterally make a bolt for it.)

So: we begin to sing. Not just sing but _sway_ back and forth (under direction), arms around each other’s shoulders. And as we sing, I remember that “fun” is a very similar word to “funny”. And I remember that there is an important distinction to make about funny things: laughing with vs. laughing at. And then, horrifyingly, I remember that the whole day’s proceedings are being _videotaped_. But by then it’s too late. (And yes, there were replays later, and great hilarity about what they’d managed to get the Inktomi guys to do.)

    I don’t want to leave Y! anyway — the work is fascinating, the reach is large, the problems are hard, the people are great. But if I did want to leave … I know they still have that videotape somewhere.