We had the AIRWEB anti-webspam workshop Thursday before last, in connection with the SIGIR conference. In theory this was organized by Brian Davison, Marc Najork, and myself, but this is really Brian’s baby, and he did most of the work.
Overall, I was pleased. Unlike the first AIRWEB, the paper reviewing process was “competitive”, meaning that there were enough paper submissions that we had to reject some of them. Our early fears that we would be crammed into an un-air-conditioned classroom at the University of Washington were unfounded – we were crammed into a pleasantly air-conditioned classroom. It was actually a fine size for the 50 or so people who registered – all of us sitting in those elementary-school-style all-in-one desks with the writing surface bolted onto the chair. Kind of takes you back (way way back), as well as giving you a real test of how the years add girth.
Attendees were a mix of academics and industry people, with the industrial folks divided between research-lab types and fighting-spam-in-the-field types, from Yahoo!, Google, Microsoft Research, Technorati, Ask.com, and a lot of others. A high point was Jan Pedersen’s overview of sponsored search (aka search ads) – slides here (PDF). I also saw a couple of techniques that we had definitely never thought of, and that are worth giving a try at Y!.
Late in the day, we had a panel on blogspam, where I subbed for Andrew Tomkins. It was sort of cute – after some intro statements, the six of us pulled our tiny age-inappropriate desks up to the front, and sat to await questions. It felt like an informal spelling bee. Most of my opening remarks were about the explosive and surprising adoption of the nofollow standard for marking untrusted links, and the extent to which it doesn’t help.
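For readers who haven’t run into it: nofollow is just a `rel="nofollow"` attribute on a link, telling engines not to pass reputation through it. A minimal sketch of how a blog platform might apply it to links in untrusted content (say, comments) – this is my own illustrative regex hack, not any platform’s actual implementation, and a real system would use a proper HTML sanitizer:

```python
import re

def add_nofollow(html: str) -> str:
    """Tag every <a> in untrusted HTML (e.g. a blog comment) with
    rel="nofollow". Naive regex sketch, not a real HTML parser."""
    def rewrite(match: re.Match) -> str:
        tag = match.group(0)
        if "rel=" in tag:  # leave tags that already declare a rel
            return tag
        return tag[:-1] + ' rel="nofollow">'
    return re.sub(r"<a\b[^>]*>", rewrite, html)

comment = 'Nice post! <a href="http://spam.example/pills">cheap pills</a>'
print(add_nofollow(comment))
```

The point of the panel discussion, of course, was that spammers don’t care whether the link passes reputation – the comment still sits on your page either way.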
And then, as always, the question came up: why don’t the major engines share data, and in particular, blacklists of nasty webspammers? Oh boy. Natalie Glance in particular has been an advocate of this, and at an earlier Spam Summit presented a clever voting scheme that engines might use to combine such data. I always feel both like a grumpus and a corporate tool when I say this, but I can think of two good reasons why it’s just never going to happen, and one pretty good reason why you might not even want it to. (I’m going to leave it there, but comments are welcome.)