Recap – Pubcon Las Vegas 2006

November 19, 2006

Just returned on Friday from Pubcon Las Vegas. Vegas is the place to be if you are drinking and gambling … and if you’re not (as I wasn’t), it’s all kind of odd. I had the Scrooge-like realization on the return flight that it could have been Salt Lake City for all I cared. 🙂

It was nice to see Brett Tabke (WebmasterWorld) and Danny Sullivan (SearchEngineWatch) on stage together, since they have run parallel and competitive sites and conferences for many years (with Danny moving on to something new next year). It was cool that Sullivan showed up, and I’m sure it helped attendance.

Of the keynote speeches that I saw (Guy Kawasaki, John Battelle, Danny Sullivan), Danny’s was my favorite. A lot of it was historical review of the search industry (with Danny having seen it all), but he also argued interestingly that the diversification of search companies into other kinds of advertising has had an obscuring effect on the measurable success of core, intent-based search. Danny also got in a couple of good shots at Y!: an earnings joke (fair enough), and a screenshot full of spammy results (I recuse myself from comment on that one). Battelle’s talk was interesting, and he is a great public speaker. He did, however, explore the risky edge of using the bully keynote pulpit for sales purposes, by devoting a lot of his speech to case studies from his own Federated Media venture.

The panels I was on seemed to go well. Frankly, I wasn’t sure that what I presented at the Site Structure for Crawlability panel was news to anyone, but the Duplicate Content panel seemed to generate a lot of discussion. Brian White from Google gets the good sport award for his starring understudy role on the crawlability panel. 🙂

The Yahoo! Publisher Network Party was fun, fabulous & glamorous, and also seemed, um, expensive. (Sorry, that’s just the stingy and bitter shareholder in me talking. 🙂 ) It’s a marketing question I guess (and therefore way beyond my expertise), but I’m interested in the impact on your customer base of admitting 100 to the exclusive party, and turning 500 away. If you’re one of the 500, does it make you think that the host is a really cool company that you should try even harder to impress? Or does it just make you mad?

The biggest news, of course, was the Open Sitemaps announcement (here’s the Y! blog post and the Google blog version). The idea of engines reading a common sitemap format from webmasters is so cool and sensible that you know it must have taken a lot of work behind the scenes to get these large companies to come together on it. Props to the Google Webmaster Central people, and to Priyank Garg, who made it happen from the Y! side.
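For readers who have not seen one, here is a rough sketch of what a file in the shared sitemap format can look like, generated with Python's standard library. The namespace and the `urlset`, `url`, `loc` and `lastmod` element names follow the published sitemaps.org 0.9 schema; the function name `build_sitemap`, the URL and the date are made up for illustration, and this is not how any engine or webmaster tool necessarily produces its files.

```python
import xml.etree.ElementTree as ET

# Namespace published for the shared sitemap protocol (version 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, last_modified_date) pairs.

    build_sitemap is a hypothetical helper written for this example.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc          # required: the page URL
        ET.SubElement(url, "lastmod").text = lastmod  # optional: YYYY-MM-DD
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Illustrative data only.
    print(build_sitemap([("http://www.example.com/", "2006-11-16")]))
```

The point of the announcement is that a webmaster can write one such file and have every participating engine read it.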

What did I miss? Unfortunately, quite a bit. I really wanted to see the Interactive Site Review session (always fun). But while 1000 people watched that session in a brightly lit room, I was the only person in a similar room next door, sitting in the 35th row or so, talking on my cell phone in the dark. 🙂 And the conference was basically over for me on Thursday night, as I had to leave earlier than I’d planned. Among other things, this meant that I missed the Pub day on Friday. This is the second time I’ve been to Pubcon, and both times I’ve missed the Pub. Can I really say that I’ve been to Pubcon yet?

The word “spam”

November 18, 2006

At SEO conferences (like Pubcon Las Vegas) I often introduce myself by saying that, among other things, my group fights search-engine spam. But I’ve noticed that whenever I say the S-word to SEOs there’s a little flinch, an awkward little pause. Not as though I’ve done something unforgivable, but as though I’d … farted in church or something. Kind of like the way that people who don’t really like strong language react when someone curses.

This is a cultural mismatch, because I can tell you that within Y! Search we use the word pretty freely. 🙂

Anyway, saying “black-hat SEO” instead doesn’t seem to produce the same reaction, so that’s what I’ll say going forward. Yeah, that’s it.

In Las Vegas for PubCon (WebmasterWorld)

November 14, 2006

I’m taking a few days to see the presos and meet the folks at Webmaster World in Vegas. I’ll be representing Y! Search on two panels:

Site Structure for Crawlability

Duplicate Content Issues

If you’re at the show, drop by and say hello – or leave a comment – or send me a note at tim underscore converse at yahoo dot com.

Search engine optimization (SEO) from black to white

October 31, 2006

In one of my favorite Saturday Night Live skits, “Tom Ridge” explains the U.S. terrorist-threat color codes:

Tonight, I’m proud to unveil my agency’s new weapon in the War on Terror: the Homeland Security advisory system. It’s a simple five level system, which uses color codes to indicate varying levels of terrorist threat. The lowest level of threat is condition OFF-WHITE, followed by CREAM, PUTTY, BONE and finally NATURAL. It is essential that every American learns to recognize and distinguish these colors! Failure to do so could cost you your life. For those who may have questions, an excellent guide will be found on page 74 of the spring J. Crew catalogue.

Now, what precisely do these threat levels indicate? Condition OFF-WHITE, the lowest level, indicates a huge risk of terrorist attack. Next highest, condition CREAM: an immense risk of terrorist attack. Condition PUTTY: an enormous risk of terrorist attack. Condition BONE: a gigantic risk of terrorist attack. And finally, the most serious, condition NATURAL: an enormous risk of terrorist attack.

Here’s my attempt to give SEOs more than just two or three colors.

Background: A naive (non-SEO) webmaster or content producer simply makes a site, without a thought or a care to the world of search engines. Or if there’s a thought it’s a thought of hopeful trust: if I make a useful interesting site on topic X, then the search engine will figure that out and deliver users who care about X to my site. SEOs and SEO-aware content creators construct sites instead with an eye to how search engines work, and make content that is designed to be retrieved. The white-hat/black-hat continuum is about the extent to which SEOs are working with search engines or against them. Black-hat SEOs are also known as search-engine spammers.

Dark inky black: The SEO’s (or in this case the spammer’s) interests are totally divergent from those of both the engines and the users – the SEO wants to trick the search engine into handing over users who are ripe to be tricked themselves into a situation of malicious harm. For example, the SEO might name his domain just one typo-character away from a famous domain name, then install spyware on the computer of any user careless enough to visit, or attempt to impersonate a major portal’s login page to collect logins and passwords.

Charcoal: The SEO tries to trick the engine into showing the user something totally unrelated to the query, and possibly offensive, but doesn’t actually commit any illegal or fraudulent acts within five seconds of the first user click. Example: a (heinous) pornspammer who stuffs the page with irrelevant non-porn keywords targeting innocent queries, maybe via invisible text. 99.9% of searchers will be searching for something else and will be put off; 0.1% will be searching for something else, but will, um, flexibly and opportunistically reorient their interests.

Dark gray: The SEO collects (aka steals) random text from other sites, and uses it to create thousands (or millions) of pages targeting particular queries. The pages have nothing original of value, but do have ads.

Slate gray: The SEO creates thousands (or millions) of pages, all of which point (by linkage, or framing, or redirection) to the same content, which might actually be interesting to the searcher.

Gray: The SEO reads the guidelines of search engines, and tries to juice up their sites just enough to fly under the radar on all dimensions – artificial linkfarms that remain small, automatic content duplication that is arguably not too abusive, etc. The goal is to get as much referral traffic as possible, without too much reference to whether it is interested traffic.

Light gray: The SEO creates “original” content in bulk the old-fashioned way, thinking first of all of search engine rules, secondly of duplicate detection algorithms, and lastly of whether the text makes sense to human beings and is something anyone would ever want to read. Then the SEO experiments with all the parameters (keyword density, internal linkage) trying to move up for the queries of interest.

Off-white: The SEO ensures crawlability of the site, restructures it if necessary for size of pages and internal linkage, and then injects terms to specifically target the important keywords and queries. He doesn’t create linkfarms, but friends and allies are importuned to link with specific text and phrases.

White: The SEO starts (if lucky) with a site full of content you can’t find anywhere else, and that answers a need that searchers actually have. Then the SEO makes sure the site is crawlable, and that titles and internal links make sense and are descriptive. Then the SEO thinks hard about the queries that really should pull up this content, and tries to discover if the right terms are present. Then (the hard, artful part), he or she rewrites content with a dual consciousness of the infovorous human reader and the termnivorous spider, making sure that the most important terms and phrases for the spider are present (in all their forms) and forefronted for the spider, without degrading the quality for the reader.

Luminescent pearly white: This would be a case where the SEO designs a site to show up for relevant queries and _not_ to show up for irrelevant queries. Do luminescent SEOs exist? Well, Jon Udell is one anyway.

SEO book review: ABC of SEO (George)

September 3, 2006

I never know what to make of books organized in this pseudo-glossary style, where the only organization is alphabetical-by-topic-title. It seems like an abdication of the author’s responsibility to impose meaningful structure. It also makes the reviewer’s task slightly artificial – no doubt the encouraged mode of reading is dipping and sampling (perhaps with odd free moments in the smallest room of your house), but of course as a reviewer I felt obligated to read it cover-to-cover. So how will my experience match up with yours?

With that said, The ABC of SEO is surprisingly meaty, and rewards cover-to-cover reading too. It’s really a set of small essays, each of which is organized nicely on its own, and is surprisingly dense with technical info. It becomes increasingly clear that the glossary use-case is a fiction (would you really turn to this book for definitions of terms like “Competition”?), but it supports browsing through the table of contents for the topics you care about. Before the alphabetically-organized portion, George gives a good and balanced overview of the search-engine ecosystem, and the roles that engines, publishers, SEOs, and advertisers play in it.

The book really shines in the longer in-depth technical entries, where George explains details of crawlers, webservers, log files, and so on. Entries I thought were particularly strong or compelling (in alphabetical order, naturally 🙂): Altavista, Anchor Text, Banning, Black Hat SEO, Content Targeted Advertising, In-Bound Links, Keywords, Misspellings, Robots and Spiders, and (especially) Traffic Analysis.

I like the point that George makes in several places in the text: no contract for services has been signed between SEOs (or webmasters) and the engines. Webmasters have no obligation to abide by rules set up by the search engines; engines have no obligation to include or rank a given site, and are entirely within their rights to exclude sites if they feel it improves user experience. I like this because it clarifies a debate that often gets a little hysterical and/or moralistic in both directions.

The book has a copyright of 2005, which (in this fast-moving domain) dates it slightly – it’s clear from the text, for example, that MSN was just launching its own in-house web search engine at the time of writing (with some entries written before, some after). Also, in general, it has to be said that the book focuses much more on Google than on the other major engines, including a lot of focus on PageRank itself. Although there are entries for both Y! Search and MSN, you’ll find nothing particularly detailed there on aspects of those engines not shared by Google.

George also cautions against some practices that might confuse SE crawlers, without naming specific engines that might be confused. He’s right in general that _some_ engines have had problems with these constructs, especially in the past. But it’s worth clarifying that Yahoo!’s crawler in particular has no problem with the following:

o Framesets. Whether or not framing is good web design, the crawler can snarf up the framed components into one bundle, without problem.

o Dynamic URLs. In general, it’s good to minimize the number of arguments after the ‘?’ in the URL, particularly when the arguments don’t affect the content and cause the same content to have many URLs (as with session IDs). But the crawler does not have any inherent problem with such URLs.

o Invalid HTML. No engine that I know of will discard your pages just because the HTML is badly formed (e.g. has start tags without corresponding end tags).
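
To make the dynamic-URL point above concrete, here is a minimal sketch of how a webmaster might collapse URLs that differ only in a session ID into one canonical form. The parameter names in `SESSION_PARAMS` and the function `canonicalize` are assumptions invented for this example; it does not describe how any engine's crawler actually handles such URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query arguments that do not affect page content.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def canonicalize(url):
    """Drop session-style arguments and sort the rest, so that the same
    content is always reachable under a single URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    kept.sort()  # a stable argument order avoids ?a=1&b=2 vs ?b=2&a=1
    # The fragment is dropped as well, since it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    print(canonicalize("http://example.com/page?sid=abc123&id=7"))
```

Linking internally to the canonical form only is one way to keep the number of distinct URLs per page down.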

Overall, The ABC of SEO gets an enthusiastic thumbs-up for detail, accuracy, and sensible advice.


August 24, 2006

It’s always nice when the borderline between valuable content and webspam content is clear-cut – in that case, the goal of the search engine is straightforwardly to keep the spam out of search results. Unfortunately, some quality issues are a continuum, from the best of the web to the worst.

One of these issues is what to do about “aggregators” – sites and pages that live only to display arrangements of links and bits of content drawn from other sources. The range is continuous, from very tuned and sophisticated content-clustering and layout engines to the worst kinds of scraper spam.

For high-quality aggregation, try Google News, or ScienceBlogs (which JP turned me on to recently). Google News shows a clustering of news stories that is famously untouched by human hands. I don’t know for sure that ScienceBlogs is entirely an aggregation, without any new content, but it looks that way.

Search engines usually want to show content that provides unique value to users – collecting up a bunch of content found elsewhere seems to violate that. On the other hand, are these high-quality sites? Absolutely. And can pure compilation or aggregation add value? Well, I-am-not-a-lawyer, but apparently copyright law gives at least some thin and guarded support for compilation copyrights. And if humans can get credit for assemblage, I think we should extend the courtesy to algos too.

So some high-quality aggregation sites do belong in a websearch index. With that said, you might not always want them to come out on top in a search. If a query matches some snippet of a story on ScienceBlogs, then probably the original story itself would be a better match – but let’s rely on differential relevance algorithms to sort that out.

At the other end of the spectrum, check out this portion of a doc I found while doing an ego search on a websearch engine:

If I am interested in all things Converse (and I am, I am!) then I should really be interested in this doc …. but I’m not interested. There’s no discernible cleverness in the grouping, and no detectable relevance sorting being applied to the search results. This doesn’t belong in a websearch index, as it’s hard to imagine any querier being satisfied.

Other variants of this kind of spam technique cross the line between sampling/aggregation and outright content theft. Imagine that you get home one night to find a stranger leaving your house with a sack containing your TV, cell phone, jewelry. You might misunderstand, until we explain that he’s actually an _aggregator_ – he’s just _aggregating_ your belongings. Yeah, that’s it.

As an in-between case, ask yourself this: if you’re doing a websearch (on Google, Yahoo!, MSN, …) do you want any of the results to be … search-result pages themselves (from Google, Yahoo!, MSN)? That is, if you search for “snorklewacker” on MSN web search, and you click on result #4, do you want to find yourself looking at a websearch results page for “snorklewacker” on Yahoo! Search, which in turn has (as result #3) the Google search results page for “snorklewacker”? (It’s easy to construct links that create specific searches and embed them in pages that will be crawled, so this could happen if search engines didn’t take steps to prevent it, for example by excluding their own result pages via robots.txt.)
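
As a rough illustration of the robots.txt mechanism mentioned above, the sketch below uses Python's standard `urllib.robotparser` to check whether a well-behaved crawler would be allowed to fetch a search-results URL. The robots.txt content, the host name and the `/search` path are invented for the example and are not taken from any real engine's file.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that keeps all crawlers out of result pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawlable(robots_txt, url, agent="*"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on a hypothetical search site.
    for url in (
        "http://search.example.com/search?p=snorklewacker",
        "http://search.example.com/about",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

With a rule like this in place, a crawler that honors robots.txt skips the results pages, so they do not end up in another engine's index.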

For one thing this seems like a dangerous recursion that threatens to blow the stack of the very Internet itself. (It’s for reasons like this that I generally wear protective goggles when searching at home – safety first.) But mainly, it just doesn’t seem to be getting anywhere. Search engines are aggregators themselves – on-demand aggregators that show you a new page of results based on the terms you typed. What’s the point of fobbing you off on another term-based aggregator?

Blog aggregators, search engines, tag pages – all of these are fine things as starting points, but the potential for tail-chasing is pretty high if they all point to each other. I say the bar for inclusion ought to be pretty high (though as always it’s just MHO, and not to be confused with my employer’s O at all).

SEO book reviews

August 20, 2006

I’m planning to start a short book review series on this blog, focused on SEO (Search Engine Optimization) books. The idea would be to review the books from a search engine perspective – what makes sense, what seems crazy/dangerous, etc. I’ll also focus on Y!-specific info and advice to the extent I can.

I’m going to start with The ABC of SEO, by David George, not _only_ because it’s short, really. 🙂 If there are any other SEO titles that people have liked and/or would like to see reviewed, please leave a note in the comments.