It’s always nice when the borderline between valuable content and webspam content is clear-cut – in that case, the goal of the search engine is straightforwardly to keep the spam out of search results. Unfortunately, some quality issues are a continuum, from the best of the web to the worst.
One of these issues is what to do about “aggregators” – sites and pages that live only to display arrangements of links and bits of content drawn from other sources. The range is continuous, from very tuned and sophisticated content-clustering and layout engines to the worst kinds of scraper spam.
For high-quality aggregation, try Google News, or ScienceBlogs (which JP turned me onto recently). Google News shows a clustering of news stories that is famously untouched by human hands. I don’t know for sure that ScienceBlogs is entirely an aggregation, without any new content, but it looks that way.
Search engines usually want to show content that provides unique value to users – collecting up a bunch of content found elsewhere seems to violate that. On the other hand, are these high-quality sites? Absolutely. And can pure compilation or aggregation add value? Well, I-am-not-a-lawyer, but apparently copyright law gives at least some thin and guarded support for compilation copyrights. And if humans can get credit for assemblage, I think we should extend the courtesy to algos too.
So some high-quality aggregation sites do belong in a websearch index. With that said, you might not always want them to come out on top in a search. If a query matches some snippet of a story on ScienceBlogs, then probably the original story itself would be a better match – but let’s rely on differential relevance algorithms to sort that out.
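One way to picture that kind of differential relevance is a re-ranking pass that demotes pages flagged as aggregations whenever they compete with an original source for the same query. This is only an illustrative sketch, not any engine's actual algorithm; the `is_aggregator` flag and the 0.5 penalty factor are invented for the example:

```python
# Hypothetical re-ranking sketch: demote pages flagged as aggregators
# so that original sources matching the same query rank above them.
# The is_aggregator flag and the 0.5 penalty are invented for illustration.

def rerank(results, penalty=0.5):
    """results: list of (url, score, is_aggregator) tuples."""
    adjusted = [
        (url, score * penalty if is_agg else score)
        for url, score, is_agg in results
    ]
    # Sort by adjusted score, highest first, and return just the URLs.
    return [url for url, _ in sorted(adjusted, key=lambda r: -r[1])]

results = [
    ("scienceblogs.example/snippet", 0.9, True),   # aggregated copy
    ("original-blog.example/story", 0.8, False),   # original source
]
print(rerank(results))  # the original source now outranks the aggregator
```

The point of the sketch is just that the aggregator page need not be excluded from the index; it only needs to lose the tie to the page it aggregated.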
At the other end of the spectrum, check out this portion of a doc I found while doing an ego search on a websearch engine:
If I am interested in all things Converse (and I am, I am!) then I should really be interested in this doc … but I’m not interested. There’s no discernible cleverness in the grouping, and no detectable relevance sorting being applied to the search results. This doesn’t belong in a websearch index, as it’s hard to imagine any querier being satisfied.
Other variants of this kind of spam technique cross the line between sampling/aggregation and outright content theft. Imagine that you get home one night to find a stranger leaving your house with a sack containing your TV, cell phone, and jewelry. You might misunderstand, until we explain that he’s actually an _aggregator_ – he’s just _aggregating_ your belongings. Yeah, that’s it.
As an in-between case ask yourself this: if you’re doing a websearch (on Google, Yahoo!, MSN, …) do you want any of the results to be … search-result pages themselves (from Google, Yahoo!, MSN)? That is, if you search for “snorklewacker” on MSN web search, and you click on result #4, do you want to find yourself looking at a websearch results page for “snorklewacker” on Yahoo! Search, which in turn has (as result #3) the Google search results page for “snorklewacker”? (It’s easy to construct links that trigger specific searches and embed them in pages that will be crawled, so this could happen if search engines didn’t take steps to prevent it, such as excluding their own results pages via robots.txt.)
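To see how easy the link-construction part is: a search-results page for a given query is usually just a URL with the query in a parameter, so anyone can mint such links and plant them on crawlable pages. The host and path below are made up for illustration; real engines differ in the details:

```python
from urllib.parse import urlencode

# Build a crawlable link that triggers a specific search on some
# hypothetical engine. Any page containing this link hands the
# crawler a ready-made search-results URL.
query = {"q": "snorklewacker"}
link = "https://search.example.com/search?" + urlencode(query)
print(link)  # https://search.example.com/search?q=snorklewacker
```

This is why engines commonly shield their own results pages from crawlers with a robots.txt rule along the lines of `Disallow: /search` (the exact path varies by engine).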
For one thing this seems like a dangerous recursion that threatens to blow the stack of the very Internet itself. (It’s for reasons like this that I generally wear protective goggles when searching at home – safety first.) But mainly, it just doesn’t seem to be getting anywhere. Search engines are aggregators themselves – on-demand aggregators that show you a new page of results based on the terms you typed. What’s the point of fobbing you off on another term-based aggregator?
Blog aggregators, search engines, tag pages – all of these are fine things as starting points, but the potential for tail-chasing is pretty high if they all point to each other. I say the bar for inclusion ought to be pretty high (though as always it’s just MHO, and not to be confused with my employer’s O at all).