Before I post again, I thought I should add a little separator post, to note that this blog went away for about three years. Yes, I was working at a startup, and was kinda busy for part of that time, but that’s not why – I was just bored with blogging and had nothing to say. Fast forward to 2011.
Here’s a more technical (and much longer!) post than usual – some thoughts about web search relevance, which may help explain why I ended up at Powerset.
Relevance Improvements (large and small)
I’ve been working in web search since 2000 (Excite, Inktomi, Yahoo, and now Powerset). During that time, I’ve seen lots of new techniques that made small but significant improvements in relevance for the user, including several different ways to use machine learning. During that entire time, though, I’ve only seen two new techniques that, by themselves, made a large difference, a step-function kind of change. They are: 1) the use of inbound links, and 2) the use of term proximity.
So what do inbound links get you? A popularity measure and summary text. Each link gives you 1) a vote for the document, and 2) a snippet of link text, which is often a very concise summary of the target page. If many people “vote” for a page and use a particular word or phrase to summarize it, then using those words or phrases for retrieval may give much better results than anything you would find based on the documents themselves. For example, the query web encyclopedia on Google gives you an excellent first result of http://en.wikipedia.org, no doubt because so very many people link to Wikipedia and use those terms.
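To make the “votes plus summary text” idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the link data is invented, the names build_anchor_index and anchor_score are hypothetical, and real engines weight links in far more elaborate ways than counting matching words.

```python
from collections import Counter, defaultdict

# Hypothetical link data: (source page, anchor text, target URL).
links = [
    ("a.example", "web encyclopedia", "http://en.wikipedia.org"),
    ("b.example", "free encyclopedia", "http://en.wikipedia.org"),
    ("c.example", "web encyclopedia", "http://en.wikipedia.org"),
    ("d.example", "encyclopedia", "http://other.example"),
]

def build_anchor_index(links):
    """Map each target URL to a count of the anchor-text words pointing at it."""
    index = defaultdict(Counter)
    for _source, anchor_text, target in links:
        index[target].update(anchor_text.lower().split())
    return index

def anchor_score(index, query):
    """Rank targets by how many anchor-word 'votes' match the query terms."""
    terms = query.lower().split()
    scores = {}
    for target, votes in index.items():
        score = sum(votes[t] for t in terms)  # a Counter returns 0 for unseen words
        if score:
            scores[target] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = build_anchor_index(links)
ranking = anchor_score(index, "web encyclopedia")
```

Note that the target page’s own text never enters into it: the ranking comes entirely from what other people chose to call the page.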
What does proximity get you? For one thing, preferring terms that are close together will help you catch names – for the query George Bush, you are more likely to find stories about one of the American presidents than stories about a man named George who went out into the bush. But this is the simple case, and could be solved with phrase searches, where you insist that the terms must occur right next to each other to count as a match at all. Why is it important that terms are close in the document, even if they’re not adjacent?
Why Proximity Matters
Proximity matters because terms that are close to each other in the text are more likely to be closely connected in the meaning structure of the text.
Take a query like Obama Afghanistan, for example. While it’s possible that the querier is independently interested in the two topics (Obama and Afghanistan), it’s much more likely that there’s some relationship sought: statements by Obama about Afghanistan, a question about whether Obama has been to Afghanistan, etc.
I tried this query on Google, and got result snippets that included these:
But if you look at what’s happening in Afghanistan now, you are seeing the Taliban resurgent, you are seeing al-Qaida strengthen itself,” Obama said. …
MODERATOR: Senator Obama, Afghanistan has just conducted the first elections in its 5000-year history. They appear to have gone very well–at least, …
In both these examples (where the terms are at distances of 14 and 1, respectively, ignoring punctuation), the proximate terms actually seem to be related, which is probably a good thing. In the first case, we have a quote on Afghanistan attributed to Barack Obama himself; in the second we have a question about Afghanistan that is being addressed to Senator Obama.
This bet of relatedness seems dodgier if the terms are separated by, say, hundreds of words. Or take the case of another result from this same search:
Updated April, 2007. Map of Afghanistan In PDF format … Obama aides won’t say how much money the solicitation raised and would only say that “thousands” …
If you look at the underlying document (going to the cached version in this case) it turns out that the distance represented by that ellipsis is large – the closest occurrences of “Afghanistan” and “Obama” are 79 words apart. “Afghanistan” occurs only once, in a sidebar link to a map, which sidebar decorates an unrelated news article about an Obama fundraiser. Not very proximate, and a result that’s correspondingly disappointing in the unrelatedness of the words.
So imagine that we’ve decided that proximity is a good thing – that results are more likely to be relevant when the query terms are close together. Now, how exactly should we score results for proximity? Our job here is not to do every part of search ranking – it’s just to figure out what score to give a query-document pair that captures the proximity aspect, and rewards high-proximity documents in the right way.
The simplest scoring scheme is just to assign a score that is the distance between the terms. If you see “Obama” and “Afghanistan” right next to each other, return a score of 1; if you see “Obama [..78 words..] Afghanistan”, return a score of 79. Simple, no? (Of course, we haven’t said anything about how low a score has to be to be considered a good score, but maybe we can let someone else sort that out when all the ranking factors get combined.) The main thing is that a low score is better than a high score. Now, one complexity is that this pair of terms may occur multiple times in a document, so which distances do we care about? Um, let’s just take the minimum distance – the distance between the two instances that are the closest.
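Here is roughly what that minimum-distance score looks like as a Python sketch. The tokenizer is a deliberately crude assumption of mine (lowercase, drop punctuation), and the function names are hypothetical; the only thing taken from the discussion above is the rule itself: distance in word positions, minimum over all occurrences, lower is better.

```python
import re

def tokenize(text):
    """Lowercase word tokens; punctuation is simply dropped."""
    return re.findall(r"[a-z0-9'’-]+", text.lower())

def min_distance(tokens, term_a, term_b):
    """Smallest gap in word positions between any occurrences of two distinct terms.

    Returns None if either term is missing from the document.
    """
    best = None
    last_a = last_b = None
    for i, tok in enumerate(tokens):
        if tok == term_a:
            last_a = i
        elif tok == term_b:
            last_b = i
        else:
            continue
        # The closest pair ending here uses the most recent sighting of the other term.
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best

snippet = ("But if you look at what's happening in Afghanistan now, you are "
           "seeing the Taliban resurgent, you are seeing al-Qaida strengthen "
           "itself, Obama said.")
score = min_distance(tokenize(snippet), "afghanistan", "obama")
```

One pass over the document is enough, because the closest pair that ends at any given position always involves the most recent occurrence of the other term.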
Now, let’s begin to open the worm-can: what about three-term queries? Your job is to take a three-term query and a document, and return a single score that is a proximity reward. The problem here isn’t that we lack ideas – the problem is that there are a whole lot of ways to generalize from 2 to 3 here that seem equally likely to be good. The generalized score could be:
Do you have any intuition about which of these proposals is more likely to capture “relatedness”? I don’t. So the right approach is probably to hack up a number of scoring methods and see which gives the best results.
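In that hack-it-up spirit, here is a Python sketch of three candidate generalizations one might try. To be clear, these are my own illustrative picks (sum of pairwise minimum distances, the largest pairwise minimum distance, and the span of the smallest window containing every term), with hypothetical function names, and nothing here says which of them works best.

```python
from itertools import combinations

def positions(tokens, term):
    return [i for i, tok in enumerate(tokens) if tok == term]

def pair_min(tokens, a, b):
    """Minimum distance between any occurrence of a and any occurrence of b."""
    pa, pb = positions(tokens, a), positions(tokens, b)
    if not pa or not pb:
        return None
    return min(abs(i - j) for i in pa for j in pb)

def sum_of_pairs(tokens, terms):
    """Candidate 1: add up the minimum distances of every pair of terms."""
    gaps = [pair_min(tokens, a, b) for a, b in combinations(terms, 2)]
    return None if None in gaps else sum(gaps)

def max_of_pairs(tokens, terms):
    """Candidate 2: the worst (largest) pairwise minimum distance."""
    gaps = [pair_min(tokens, a, b) for a, b in combinations(terms, 2)]
    return None if None in gaps else max(gaps)

def smallest_window(tokens, terms):
    """Candidate 3: span of the shortest stretch of text containing every term."""
    wanted = set(terms)
    last_seen = {}
    best = None
    for i, tok in enumerate(tokens):
        if tok in wanted:
            last_seen[tok] = i
            if len(last_seen) == len(wanted):
                span = i - min(last_seen.values())
                if best is None or span < best:
                    best = span
    return best

doc = "obama spoke about troops in afghanistan while the taliban regrouped".split()
terms = ["obama", "afghanistan", "taliban"]
scores = (sum_of_pairs(doc, terms), max_of_pairs(doc, terms), smallest_window(doc, terms))
```

All three collapse to the plain minimum distance when there are only two terms, which is part of why it’s hard to choose among them from first principles.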
When Proximity Fails
We haven’t seen the worst of it yet. Ponder this snippet from the same Google query I issued earlier:
Taliban Comeback in Afghanistan?; Obama vs. Clinton Fallout; Iraqis Not Welcome in United States?; What Should Be Done about Iraqi Refugees?; …
Oh boy. What’s our two-term proximity score? If we ignore punctuation as before, then we have a score of 1 (the best possible), meaning that the terms are adjacent. But if proximity is a proxy for relatedness, it is failing. Clicking through to the document, you see that this snippet is a collection of unrelated headlines, separated by semicolons.
Does this mean that we shouldn’t have ignored punctuation in the first place? Maybe distance between terms should take intervening punctuation into account. We had a really good result (#9) where terms were only separated by commas, and a really bad result (#7) where the separators were a question-mark and a semicolon. Those do seem like stronger separators – maybe we need some kind of discounting scheme for punctuation? Certainly terms occurring in separate sentences should be counted as further apart than terms in the same sentence, even if the number of words between them is the same….
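A punctuation discounting scheme might look something like the Python sketch below. The penalty numbers are completely made up (not tuned on any data), the function names are hypothetical, and the tokenizer is naive enough to treat the period in “vs.” as a sentence break; it’s only meant to show the shape of the idea, where each intervening punctuation mark adds extra “distance” on top of the word count.

```python
import re

# Hypothetical penalties: invented numbers, only to show the mechanism.
PENALTY = {",": 0, ":": 2, ";": 5, ".": 10, "?": 10, "!": 10}

def tokenize_with_punct(text):
    """Split into lowercase words and single punctuation marks, in order."""
    return re.findall(r"[a-z0-9'’-]+|[,;:.?!]", text.lower())

def punct_aware_distance(text, term_a, term_b):
    """Minimum over occurrence pairs of word gap plus penalties for punctuation between."""
    tokens = tokenize_with_punct(text)
    # Running totals, so any stretch of text can be costed by subtraction.
    words_so_far, penalty_so_far = [], []
    words = penalty = 0
    for tok in tokens:
        if tok in PENALTY:
            penalty += PENALTY[tok]
        else:
            words += 1
        words_so_far.append(words)
        penalty_so_far.append(penalty)
    best = None
    idx_a = [i for i, t in enumerate(tokens) if t == term_a]
    idx_b = [i for i, t in enumerate(tokens) if t == term_b]
    for i in idx_a:
        for j in idx_b:
            lo, hi = min(i, j), max(i, j)
            cost = (words_so_far[hi] - words_so_far[lo]) + (penalty_so_far[hi] - penalty_so_far[lo])
            if best is None or cost < best:
                best = cost
    return best

good = punct_aware_distance("Senator Obama, Afghanistan has just conducted",
                            "obama", "afghanistan")
bad = punct_aware_distance("Taliban Comeback in Afghanistan?; Obama vs. Clinton Fallout",
                           "obama", "afghanistan")
```

With these invented weights the comma-separated pair keeps its score of 1, while the pair split by a question mark and a semicolon is pushed out to 16, even though both pairs are adjacent in raw word count.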
Proximity is a hack
At this point I hope you are getting that hackish feeling. There’s a fundamental phenomenon (something we’re calling “relatedness”) to which a simple measure (“proximity”) initially seemed like a good approximation, which might only need a little bit of refinement. As we refine it, though, we’re feeling the need to add epicycles to epicycles, which is never a good sign.
Admittedly, I’m being a bit faux-naive about the process of constructing a proximity function. You don’t have to cover all the corner cases one by one, and cover each one exactly. You can pull up really large document sets and get statistical answers about how much different factors should impinge (and know, in some average sense, how important an intervening comma is relative to a semicolon), and get closer approximations to, um, whatever it is you’re trying to approximate. So what are we trying to approximate with proximity, anyway?
What Proximity Approximates
I’ve alluded to “relatedness”, vaguely, but haven’t gone much further than that. But I’d like to argue that it’s often true that terms in a query have some (explicit or implied) linguistic or semantic relationship between them, and that a good document-match for such queries is likely to be one that has the same relationship between those terms.
In cases where the linguistic/semantic relationships are made somewhat explicit in the query (e.g., stars smaller than the Sun) you want to retrieve documents where the same relationships are present, or at least respected in some way. In cases where the relationship is hard to divine or is inherently ambiguous (as with our old friend Obama Afghanistan) you want to retrieve only documents where there is some linguistic or semantic relationship between the terms, if only to increase your odds of hitting the right one.
So that’s what proximity really does for search engines – it’s a crude unrelatedness filter. Words that are really really far apart in a document are likely not to be participating in the right relationship, because it’s hard for them to have any linguistic or semantic connection at all over that distance. Proximity increases your odds of getting lucky enough to find your terms in some relationship, which in turn increases your odds of getting the terms in the right relationship. Chancy business, this web search!
Proximity-busters and structure
All of the above assumes that at base your document is a set of words in sequence, and that our base proximity function, however tweaked, is a function on distance in that sequence. But it’s a commonplace in linguistics that structural connections are not that obviously connected to distances in the surface string. In examples like these:
John kissed Mary
John, who was feeling very dapper indeed in his checked sport jacket and regimental tie, kissed Mary.
John is just as close to Mary when separated by one word as by fifteen words. And in general, at the syntactic level it’s pretty easy to stuff sentences with subordinate clauses and parenthetical digressions that can artificially stretch out surface-string proximity without changing the relationships of interest at all. This seems like bad news for pure-proximity measures. And going in the other direction, it’s easy to find discourse markers that, while very short, materially disrupt the relatedness of the terms on either side of them. Try these:
Everyone knows this – so what?
Nothing I said in the last section would be a surprise to a linguist, needless to say – the field is all about the relationship between the surface string and the deep structure. But there are a couple of takeaways of interest:
So, can you do it?
To recap: proximity is both a wonderfully powerful relevance feature, and a total hack. It helps enormously, but it’s not what you really want, it’s just sorta somewhat correlated with what you really want. What you need for what you really want is the underlying structure of all that web content: the real syntactic structure of the sentences, how the sentences connect to each other, how the facts relate, and (maybe) how the discourse flows and the topics connect. We’ve squeezed all the juice we can out of webpages considered as word-vectors; now it’s time to parse this stuff and get at the real structure.
Can that be done? A couple of years ago I would have said no, but I hadn’t seen the PARC natural language technology then, and didn’t know that an effort this concerted and well-funded was on the way. Now, do I think that Powerset will do it? I still don’t know, frankly – there’s so much more to do to make it real and debugged and scaled the way it needs to be. But it’s clear to me that the next big thing in web search is either this or something a whole lot like this, and I think we have the best shot of anyone. And that’s why I’m at Powerset. 🙂
Whew. In December I blogged that I was leaving Yahoo! for a new job, and had every intention of following up with news about the new job and the new company just as soon as I started. But the next time I look up I find that it’s … June(!). I guess new jobs have a way of doing that to you.
So what is the new company? Powerset, which is applying natural-language technology (originally developed at [formerly Xerox] PARC) to web search. And what does that mean? It means that, rather than indexing the words on webpages, the first thing we do is parse the sentences into the kinds of syntax trees that you might see in a grammar or linguistics class, complete with noun phrases, verb phrases, prepositional phrases and so on. And that’s just the first thing – once we’ve figured out the syntactic structure, we’re extracting every bit of semantic meaning we can to match against any query you might want to enter. And if we can do all this correctly, we should be able to separate semantic wheat from chaff ever so much more efficiently than regular search engines, and find things for you in ways that have never been possible before.
If all this sounds computationally expensive to try to apply to the entire web, well … it is. There’s a big double bet here, on a couple of decades-long trends: that NLP technology keeps getting ever faster and more mature, and that Moore’s Law continues making computation ever cheaper, and that the historic moment when NLP meets Moore’s Law in the middle at web-scale search is …. 2007. Or maybe 2008. 😉
So what am I doing? Nothing to do with webspam, right now – we aspire to be a spam target, of course, but that time has not yet come. I’m directing an engineering group that does several things, including the metrics and relevance-testing program that tells us how we’re doing beyond the cool demo examples. Among other things, this means that if anyone is in a mood to start shooting the messengers, then I might be the first casualty. But luckily it seems that even at the highest levels people understand that the only way you can possibly figure out how you’re doing (and how to improve) is to be ruthlessly blind and cruelly random in testing and sampling. So far so good. I haven’t even been wearing the Kevlar to work lately.
Now, Powerset has been all over the press, including multiple New York Times articles. The founders are not shy people, and they’re happy to explain to as many reporters as possible exactly why Powerset must remain in stealth mode. 🙂 And there has been a lot of blog chatter (and even incisive blog discussion) about Powerset’s ambitions and prospects. I’ll cover some of the controversy in a later post. For now, though, let me say that I’m amused by seeing the following simultaneous critiques of Powerset: 1) what Powerset is trying to do is so hard that it couldn’t be done in a million years, and 2) Powerset is lame because it hasn’t launched already. Uh…. either one of these could be true, but surely they’re not both true at once? Pick at most one, please.
There was a sudden rumbling noise, and mild shaking lasting a couple of seconds, max. As it was happening, at least two different co-workers called out “Earthquake!”, cheerfully.
Immediately I went to the usgs.gov site to make sure – and it was this one.
I was chuffed – finally! You see, I moved to the Bay Area in 2000, and never once in my life had I experienced an earthquake, or at least one that I knew was for real. I had missed minor ones in various ways: 1) in New York for the weekend, 2) out driving at the time, and really didn’t notice anything (in retrospect, did it seem a bit bumpy at one point?), and worst of all, 3) just plain slept through one. Every so often, on the 7th floor of a Yahoo! building, the building would seem to shake, and I checked that same site …. but no, it was just the building. 🙂
So I can check that one off now. (And yes, I know – be careful what you wish for. Yes. 🙂 )
1) YHOO will remain an independent entity, and will not merge with either AOL or MSFT
2) Yahoo! Search will close most of the search monetization gap with Google
3) In May 2007, journalistic coverage of Yahoo! will be more positive than negative
4) At the end of 2007, the ranking of search engines by share of U.S. searches will be the same as it is today: 1) Google, 2) Yahoo!, 3) MSN (Live), 4) Ask.
5) Despite #4, at least one general web search engine will emerge during the year that is a quality disruptor – acknowledged privately by the major engines to be better than one or more of the majors, and a danger to the current engines, whether or not it is acquired in 2007 or remains independent.
6) In December 2007, Amazon’s move into the webservices business will be seen as a good move by the techbiz press.
7) On December 31, 2007, myspace.com’s three-month-average Alexa traffic rank will be greater than 10.
8) Renkoo.com will take off, and become the 2nd most-popular social-event-arranging service, behind only evite.com. If it is not acquired before then, its three-month-average Alexa traffic rank will be 2000 or below on 12/31/07. [Disclaimer: I have a stake in Renkoo, by marriage.]
1) The front runner and presumptive Presidential nominee for the Democratic party in December 2007 will be Barack Obama
2) The front runner and presumptive Presidential nominee for the Republican party in December 2007 will be John McCain
3) In December 2007, climate change will be a top-5 issue in polls of likely voters in the ’08 presidential elections.
4) On December 31, 2007 there will be fewer U.S. troops deployed to Iraq than there are today (Jan 1, 2007).
1) Michigan will beat Ohio State in their 2007 football matchup.
For the last couple of years, sometime around New Year’s Day, I have written down goals for the coming year. I’m calling them goals, not resolutions, because I think that there’s a difference, and that goals work better, at least for me.
Here’s an example of what I don’t like in a resolution. Let’s say that for any number of reasons (health, life-experience, eco-guilt), you’ve decided that riding your bike is to be encouraged over driving. And so you make the following resolution:
“I will bike to work every Wednesday, without fail!”
Now, what’s wrong with this resolution?
1) It’s very very likely to fail. Rain, vacation, illness – any of those could blow it for you.
2) It’s binary – you either fail or you don’t.
3) On M, T, Th, Fr, Sa, Su, there is absolutely nothing you can do to work on this.
A very likely outcome is that one rainy Wednesday in February you have blown your resolution, and now what’s your guide for the rest of the year, other than feeling vaguely crappy about having blown it?
Here’s a goal I like better:
“I will bike to work 50 times this year!”
How is this different from the resolution version?
1) It’s hard to blow completely until very late in the year. All you can do is fall behind.
2) It’s actually possible to exceed the goal. You could end up doing it 55 times, and feel especially good.
3) You can actually do something about this almost any workday.
4) You can get ahead, temporarily.
5) You can get partial credit. Biking 40 times is better than biking 0 times, even if it’s not hitting the target.
Sure, there’s the “danger” that you will bike twice a week in the first half of the year, and then be done, and stop biking, and get fat. 🙂 But let’s be real here – what are the chances? And as downsides go, how bad is that one?
Even better, I think, is
“I will put N miles on the bike this year!”
Now, you can’t goalize every resolution like this. There are things you have to resolve always to do, or never to do. You can’t say “I will be not-mean to my family on at least 200 days this year!”, or set very low targets for the numbers of murders you will commit per month. But for active, intermittent, achievement-oriented activities I think goals are better.
I wish I could close by saying that when I switched to listing goals rather than resolutions I started to nail them all, but looking back on 06, hmm, it’s a bit of a mixed bag. This year, though, I resolve that I’ll hit _every_ goal. 🙂
OK 2006, see ya later. I can’t say that I’m sorry, or that I’ll miss you too much.
Professionally it was actually challenging, fun, and (at the end) tumultuous in not a bad way, and 2006 sends me out somewhat crisply, with me wrapping up my former job in December and starting a new one in January. (I’ll post about that tomorrow. 🙂 ) Personally, though, 2006 was pure disaster area. Maybe I’ll see it differently someday, and maybe I’ll appreciate it for the belated personal growth it forced on me, but I’m not feeling that yet.
I think, though, that I start 2007 with one real advantage compared to previous years. This year I *know* that I don’t know what’s going to happen(!).