Hotel wifi – there oughta be a law!

October 7, 2006

I’m your typical only-slightly-left-of-center liberal, in that while I believe we need govt regulation and consumer-protection laws, I also believe market forces will auto-correct a lot of lame consumer offerings. Except when I personally feel jerked around – then it’s martial law time.

Case in point: hotel wifi. What kind of internet access do you have? we ask brightly. Wireless internet in all the rooms! they say brightly. Then you arrive, and begin to dimly realize that the real situation is some small underpowered wireless router hidden behind the boilers in the sub-basement of your 10-story hotel. And as long as they have one of those, somewhere, that’s powered up some of the time, they can say that, can’t they?

Clearly it’s past time for a Federal Bureau of Wireless Quality Inspection, complete with unsmiling G-man trenchcoat types impersonating laptop-toting guests, and their black-jumpsuited tech experts in wardrive vans monitoring signal strength to the microbar. Just think of the restaurateur’s fear of the Health Inspector (in the U.S.) or the Guide Michelin reviewer (in France) and you’ll see where I think we need to go with this.


Proper uses of bugzilla

November 24, 2005

What makes for a good bug (report)? The usual responses might refer to the need for supporting information like software version, operating system, error messages, backtraces, etc. While I agree, my question is more basic: what’s a good problem to file as a bug?

I think it makes sense to work backward from the possible outcomes that your bug system provides. My system (a version of bugzilla) has these:

FIXED – A fix for this bug is checked into the tree and tested.
INVALID – The problem described is not a bug
WONTFIX – The problem described is a bug which will never be fixed.
LATER – The problem described is a bug which will not be fixed in this version of the product.
REMIND – The problem described is a bug which will probably not be fixed in this version of the product, but might still be.
DUPLICATE – The problem is a duplicate of an existing bug. [..]
WORKSFORME – All attempts at reproducing this bug were futile [..]
PUSHED – [..] Another form of FIXED. In the development context, FIXED means that the change has been committed.

If these are the possible outcomes, then one possible definition of a good bug is something that:
1) Might ultimately be resolved into one and only one outcome (otherwise, why file it in this system?), and
2) (at the time of filing) is not clearly destined for one and only one outcome (otherwise, why ask for an investigation?)

So say we have the following bug:

Bug 666666: Peace on Earth and good will toward men.

Hmm. Currently, it’s definitely not FIXED, not PUSHED, and not INVALID. WONTFIX? A bit pessimistic that, in the long term. DUPLICATE? Undoubtedly, but let’s assume we’re dealing with the first one. WORKSFORME? Not at all.

What’s left? LATER, and REMIND. Now, these are actually good responses. Yeah, that’s it: we’re going to get to that one LATER, and if by any chance we forget, then do REMIND us. But you get the feeling that these categories are a bit weaselly, and in fact may have been added _because_ people weren’t filing good bugs. So let’s set them aside, and decide that bug 666666 is still open.

OK, if a good bug is one that might be resolved into one of the non-weaselly categories, then what would it mean to close bug 666666? I guess that there would be no disturbances of the peace (at all) and no failures of good will (at all). You see the problem. Even if we were very very pleased indeed with progress on this front (harmony in the Middle East, etc.), we couldn’t really close it.

So this one we’re likely to leave open for a while. But this is discouraging, isn’t it? Every day you come in to work, and you see your bug list (“Good will toward men”, “Cancer cure”, “Affordable anti-gravity device”), and it’s really hard to see which one you’re going to knock off before lunch. And then there are the ones that seem achievable, but annoyingly vague (“Less spam in search results”). Where to begin?

So it isn’t just that some bugs fit the paradigm less well than others – it’s that bugs that don’t fit the paradigm begin to kind of pollute the workflow and drag down morale. The problem isn’t with quality control on the code; it’s with quality control on the _bugs_. Bugs need to be fully-baked, crisply-defined, bite-sized, toasty little nuggets of unitary progress to be useful.

So it seems like resolutions should include more filer feedback beyond the somewhat surly and uninformative alternatives of WONTFIX and LATER. How about these:

BREAKITDOWN – This bug seems to be multiple bugs in one — refile individually
SCALEITBACK – This bug is breathtaking in its ambition and scope — try a version for humans
DEFINESUCCESS – This bug doesn’t seem to have an endpoint. Refile with a description of what it would mean to close the bug

(The last of these (DEFINESUCCESS) might also be a good entry field for all bugs filed, even if freetext and optional: describe what it would mean to you (and only to you) if this bug were fixed. What would a test look like?)

Search engines – speed and freshness

September 26, 2005

So there’s a tradeoff here, between the speed with which a search engine incorporates new documents, and the speed with which it responds to queries. Confusion about this makes me grumpy when people talk about blog search and pings (the implication being that if old-school websearch engines listened more to pings, then they would be up-to-the-second fresh).

Traditional information-retrieval engines assume they have all the relevant documents before they build an index. And at the heart of the index are what are called “postings”: mappings of terms to the lists of IDs of the documents that contain them. A postings file might look like this:

aardvark – 1,10,13
anteater – 1,8,49,74
ants – 1,8,10,12,74,89

but compressed and optimized to support one basic operation: scanning the postings to find the document IDs that contain every term in the query, i.e. the intersection of the terms’ lists (scanning for “aardvark” and “ants” should very quickly return: 1,10).
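To make that basic operation concrete, here is a minimal sketch in Python of an AND query over the sample postings above. It works on plain uncompressed lists, and the function names (`intersect`, `and_query`) are mine, made up for illustration; a real engine would do this over a compressed on-disk format.

```python
def intersect(a, b):
    """Return the doc IDs present in both sorted postings lists."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return out


def and_query(terms, postings):
    """Intersect the postings of every query term, shortest list first."""
    lists = sorted((postings.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


# The sample postings from the text.
postings = {
    "aardvark": [1, 10, 13],
    "anteater": [1, 8, 49, 74],
    "ants": [1, 8, 10, 12, 74, 89],
}

print(and_query(["aardvark", "ants"], postings))  # [1, 10]
```

Because every list is sorted by doc ID, each intersection is a single forward pass over both lists, which is where the speed comes from.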

Now there are many tweaks on top of this basic postings framework, but as described, postings are 1) brutally efficient at supporting AND queries, and 2) a real pain to update. Part of the problem is that postings are usually compressed. For example, lists of doc ids are usually represented using delta (gap) encoding: instead of storing the absolute numbers (1,10,13), you store the increments (1,9,3). This makes the numbers smaller on average, so you can represent them more compactly, using some kind of variable-length encoding.
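Here is a rough sketch of that compression step in Python: store the increments, then pack each one into as few bytes as possible. The byte format is a generic base-128 varint that I picked for illustration, not the format of any particular engine, and all the function names are invented.

```python
def to_gaps(doc_ids):
    """Turn sorted absolute doc IDs into increments: [1, 10, 13] -> [1, 9, 3]."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Rebuild the absolute doc IDs by summing the increments."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def varint_encode(numbers):
    """Pack non-negative ints 7 bits per byte; the high bit means 'more follows'."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def varint_decode(data):
    """Inverse of varint_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers


gaps = to_gaps([1, 10, 13])
print(gaps)                      # [1, 9, 3]
print(len(varint_encode(gaps)))  # 3
```

Notice what this does to updates: every stored number depends on the one before it, so you can't poke a new doc ID into the middle of a list without re-encoding what follows.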

But if you have a new term, or a new document? Good luck. Sure, you can take that new information, and merge it with your existing index to produce a new index. But that means a reformat of the file, and writing out a lot of new bytes. You’re not going to want to do that every time a new document arrives.
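A toy version of that merge, again in Python with in-memory dictionaries standing in for the index file. The new document (ID 90) and the term “antelope” are invented for the example, as is the `merge_indexes` name.

```python
def merge_indexes(old, new):
    """Build a brand-new index holding the postings from both inputs."""
    merged = {}
    for term in sorted(set(old) | set(new)):
        ids = set(old.get(term, [])) | set(new.get(term, []))
        merged[term] = sorted(ids)
    return merged


old_index = {
    "aardvark": [1, 10, 13],
    "anteater": [1, 8, 49, 74],
    "ants": [1, 8, 10, 12, 74, 89],
}

# Hypothetical new document 90, containing one known term and one new term.
new_docs = {"antelope": [90], "ants": [90]}

merged = merge_indexes(old_index, new_docs)
print(list(merged))    # ['aardvark', 'anteater', 'antelope', 'ants']
print(merged["ants"])  # [1, 8, 10, 12, 74, 89, 90]
```

In a dictionary this is cheap. In a packed file sorted by term, “antelope” lands between “anteater” and “ants”, so everything after it has to be written out again, and that is for a single new document.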

Now, aren’t there blogsearch engines out there that claim very quick incorporation of new documents? Yes there are, and I basically buy that at least some of them operate as advertised. But then you have to ask how quickly they serve queries. Which would you rather have: information that’s a number of minutes old and served within less than a second, or info that’s very fresh that takes 30 seconds to query? Depends on your needs, of course, but anywhere close to 30 seconds and I think it’s an unsatisfactory user experience.

When responses to queries are really slow, one is tempted by a terrible suspicion: could there maybe be … SQL under that hood? Database systems (RDBMSs) that support SQL are built under an assumption that there will be a mixture of read queries and write queries and (more importantly) have a very general model that can support arbitrary mixes of those. If you want to seriously scale a search engine, you cannot afford to build it on top of an RDBMS.

Some incorrect captcha answers

April 2, 2005

I’ve modified this weblog software to make comment submitters solve captchas (in this case, distorted images of 6-letter words) to prove that they are human. I also log both the attempted answer and whether the submission attempt succeeded. I looked at the logs for the last week, and all the answers that _were_ 6-letter words were correct. About 100x as many, though, were not 6-letter words at all. Some sample answers:


Nice tries, but unfortunately not entirely correct. What this must mean, though, is that someone out there is running form-filling software that parses the form and stuffs a default string into all available slots that the form offers, even when (as with my test answer slot) it’s not found on anyone else’s installation…


February 21, 2005

Several months ago I threatened to make the captcha generation I use on this site available as a web service. I have finally done that. I was able to do this in part because CommerceNet was kind enough to donate and host the server that it runs on. Thanks CommerceNet!

In addition to providing the kind of distorted text images that you see here when you add a comment, I have a couple of pure-text captcha types as well. The motivation here was pure accessibility-guilt, and the captchas are extremely lame and crackable in comparison to the image-based ones. Devising automatically-generated pure-text captchas that are difficult for programs to solve looks like a hard problem; in particular, any text you generate with a tree structure can be correspondingly parsed. I’m hoping that someone at this workshop has advanced the state of that art.

I also added a new type of image-obfuscation that looks like this:

Here’s a sampling of the captcha types that the service supports so far.

Animated fugues

December 5, 2004

Here’s a cool use of Shockwave (and also the fugue I’m wrestling with learning to play). Don’t miss the animations in the diagram. The whole set is here.

Captchas and accessibility

October 18, 2004

Here’s a demo of the captcha generation I have so far — the script talks to a mini web service to get the data. I’ll add more options to the web service and then make it available. [Update: the ‘demo’ is now just the way I screen comments on this blog.] Interestingly, although I have fond feelings for the gd image library, nothing I needed for doing nice image distortion turned out to be there, at least in the PHP-bundled version — I had to do it at the pixel level.
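For a sense of what “at the pixel level” can mean, here is a small sketch of one common distortion, a sine-wave shear that slides each row of the image sideways. This is not the code behind this site (that is PHP on top of gd); it is plain Python over a list-of-lists bitmap, and the function name and parameters are made up for illustration.

```python
import math


def wave_distort(pixels, amplitude=2.0, wavelength=8.0, background=0):
    """Shift each row horizontally by a sine of its y position."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    out = [[background] * width for _ in range(height)]
    for y in range(height):
        shift = int(round(amplitude * math.sin(2 * math.pi * y / wavelength)))
        for x in range(width):
            src = x - shift  # which source pixel ends up at column x
            if 0 <= src < width:
                out[y][x] = pixels[y][src]
    return out


# A tiny 8x10 bitmap with a vertical bar in column 3.
bitmap = [[1 if x == 3 else 0 for x in range(10)] for _ in range(8)]
for row in wave_distort(bitmap):
    print("".join("#" if p else "." for p in row))
```

The straight bar comes out as a wiggle. Randomizing the amplitude and wavelength per image is the obvious next step.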

John’s comment led me to think more about accessibility. As I said in a comment, if an individual running a site decides that inaccessibility to bots is more important for that site than accessibility for some humans, I’m not going to stand in judgement (someone else will do that for me, I’m sure). But I guess that if you’re writing code that multiple people might use for bot-screening, it would be irresponsible not to include some alternative to images. So I’m thinking of adding an “alt text” captcha, most likely some kind of MadLibs-style description of a number. (Instead of blocking the visually-impaired, it would block people who can’t do any arithmetic in their heads — bug or feature?) The difficult part is in coming up with obfuscation that couldn’t be easily reversed by parsing it.
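As a strawman, here is roughly what such a text-only captcha could look like in Python. The templates, the word list, and the function names are all invented for this sketch; nothing here is the eventual web service.

```python
import random

WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

# Each template pairs a phrase with the arithmetic it describes.
TEMPLATES = [
    ("the sum of {a} and {b}", lambda a, b: a + b),
    ("{a} multiplied by {b}", lambda a, b: a * b),
    ("{a} more than twice {b}", lambda a, b: a + 2 * b),
]


def make_text_captcha(rng):
    """Return (question, expected_answer) for a spelled-out arithmetic puzzle."""
    a = rng.randrange(1, 10)
    b = rng.randrange(1, 10)
    phrase, compute = rng.choice(TEMPLATES)
    question = "Type, in digits, " + phrase.format(a=WORDS[a], b=WORDS[b]) + "."
    return question, str(compute(a, b))


def check_answer(expected, submitted):
    """Accept the submission if it matches once surrounding spaces are removed."""
    return submitted.strip() == expected


question, answer = make_text_captcha(random.Random())
print(question)
print(check_answer(answer, answer))  # True
```

It also shows the weakness: three templates and ten number words are about twenty lines of parsing for a motivated bot, which is exactly the obfuscation problem.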

If captcha generation becomes more widespread, I wonder if open source would help confer some abuse resistance. I mean, if MSFT used a particular style of captcha for single-signon, there’d be a lot of incentive to defeat it. But imagine lots of captcha servers, each run by someone who likes to mess around occasionally with the obfuscating code… is it ever going to be worth trying to beat them all?