“Spam” to most people means the tasty canned meat delicacy. (Yes, I mean “delicacy” unironically. 🙂 ) The secondary meaning is unwanted bulk commercial email. A distant third is search-engine spam or webspam – what my group at Yahoo! Search combats. That’s any kind of trickery by webmasters or search-engine optimizers to artificially increase their ranking in search results.
The first question I get asked when people hear the web/email spam distinction is whether the same anti-spam technologies work for both. The answer is: not as much as you’d think. Sure, there might be some underlying techniques that are similar (machine learning, graph analysis, text processing, and whatnot). But the power relationships and trickery dynamics are really different when you get down to cases. I’ll just give one example: misspellings.
Everyone gets bizarrely spelled email spam, wanting to tell you all about, say, the benefits of \/1Aggr/_\. Why are they spelling things strangely? Because they’re assuming some kind of Bayesian filter, between them and the user, that is essentially a weighted dictionary of spam/non-spam terms. If they concoct a spammy term that’s not in the dictionary they’ll evade any penalty, yet still be understood.
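To make the "weighted dictionary" idea concrete, here is a toy sketch in Python. It is not any real filter's implementation: the weights, the threshold, and the function names `spam_score` and `is_spam` are all invented for illustration. It only shows why a term that is missing from the dictionary contributes nothing to the score.

```python
# Hypothetical per-token weights: positive means spammy, negative means benign.
# These numbers are made up for illustration, not learned from real mail.
WEIGHTS = {"viagra": 4.0, "benefits": 0.5, "free": 1.5, "meeting": -2.0}


def spam_score(text, weights=WEIGHTS, unknown=0.0):
    """Sum the weights of the whitespace-separated tokens in `text`.

    Tokens that are not in the dictionary get the neutral `unknown` weight,
    which is exactly the gap an obfuscated spelling slips through.
    """
    return sum(weights.get(token, unknown) for token in text.lower().split())


def is_spam(text, threshold=3.0):
    """Flag the message when its score reaches the (assumed) threshold."""
    return spam_score(text) >= threshold


plain = "benefits of viagra"
obfuscated = "benefits of \\/1Aggr/_\\"  # the mangled spelling from the text

print(is_spam(plain))       # True: "viagra" is in the dictionary
print(is_spam(obfuscated))  # False: the mangled token is unknown, so it adds 0
```

A human reader still decodes the mangled token at a glance, but to this kind of dictionary it is just an unknown word.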
Web spammers want to do anything they can to be retrieved, but they’re starting with some kind of query from the user. So what happens to the webspammer who embeds \/1Aggr/_\ in their webpage? Nothing much, because no one’s going to type that as a query…. Not that webspammers don’t misspell – some turn misspelling into a business model. But they typically squat as close as they can in edit distance to the right spelling, because they want to match misspellings that people might actually type (like “Viaggra”).
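The "close in edit distance" point can be illustrated with the standard Levenshtein distance, which counts single-character insertions, deletions, and substitutions. This is a minimal sketch; the helper name `edit_distance` is just a label chosen here, and nothing in it reflects how any search engine actually matches queries.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


target = "viagra"
squatted = "viaggra"          # a typo a person might plausibly type
mangled = "\\/1aggr/_\\"      # the email-style obfuscation, lowercased

print(edit_distance(target, squatted))  # 1: a single inserted "g"
print(edit_distance(target, mangled) > edit_distance(target, squatted))  # True
```

The squatted spelling sits one edit away from the real word, so it can match a plausible typo in a query. The email-style obfuscation is much further away, and no one types it into a search box.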
Email spammers have an infrastructural channel right to the user, and the only thing that stops them is a filter in the middle. For webspammers, the channel is the search engine, and in some sense users and the engines both have to cooperate to get that result returned.
There are lots of other differences, but the only other one I’ll mention is cycle time. The webspam world is fast-moving, but the crawling and indexing process is still measured in days for most docs at most major engines. Email spam is brutally fast in its evolution, with new tricks and new countermeasures turning around in minutes. Man, am I glad I’m not in _that_ business. 🙂