One of the less visible improvements made to Javablogs over the New Year was an extensive refactoring of the component that retrieves and parses RSS feeds. Aside from the important task of... well... making it work reliably, we made it do a few useful things, like updating feed and site URLs automatically when they move.
Anyway, we encountered one problem with a certain blog host that will remain nameless. Actually, we encountered two problems, but the fact that half the time it would hang indefinitely when asked to serve a feed isn't relevant to this rant.
The problem lay with what happened when you tried to retrieve the RSS feed of a blog that no longer exists. This unnamed blog host redirects you to a nice HTML page that tells you the blog isn't there any more. From a human usability perspective, this is fine.
From the point of view of a bot, though, it sucks.
The RSS reader expects to find an XML document at the given URL. Instead it gets HTML back, which causes the poor little XML parser to shit itself. Sure, we could check the Content-Type header, but hardly anybody serves RSS with a meaningful content-type anyway, so that would just cause more problems. Regardless, it's still a fundamental misrepresentation of the real error: the document isn't the wrong type, it's not there any more.
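If you want to watch the failure for yourself, here's a minimal sketch (a hypothetical class, not the actual Javablogs fetcher) of what happens when you feed a typical tag-soup "this blog is gone" page to a strict XML parser:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;

public class ParseGoneFeed {
    public static void main(String[] args) throws Exception {
        // A typical "this blog is gone" page: tag-soup HTML, not well-formed XML.
        String html = "<html><body><p>This blog is no longer available.</body></html>";
        // An XML parser has zero tolerance for that unclosed <p>...
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        // ...so we never get here: parse() dies with a SAXParseException.
    }
}
```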
HTTP already has a perfectly good way to tell any agent a resource isn't there any more: the 410 response code. 410 means "Gone". The resource used to be there. It isn't any more. We don't know where else it might be. Deal. If our bot got a 410, it would know categorically that it shouldn't bother to check that RSS feed again, because it's gone for good.
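Sending one is trivial, too. Here's a hedged sketch (a hypothetical servlet, not anything the unnamed host actually runs) of what serving a dead blog's feed URL could look like:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical sketch: what a blog host *could* serve at a deleted blog's feed URL.
public class GoneFeedServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // 410 Gone: it was here, it isn't now, and it isn't coming back.
        response.sendError(HttpServletResponse.SC_GONE,
                "This blog has been deleted. Its feed is not coming back.");
    }
}
```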
Well, it could, but it wouldn't, because we never bothered to program that bit. Nobody actually uses the 410 response code. But even the somewhat less accurate "404 Not Found" response would be more useful than sending us back a redirect to a deceptive "200 OK" and then dumping the wrong document in our laps[1].
HTTP has a bunch of useful response codes. It's even got a few that nobody's worked out how to use yet, like "402 Payment Required". As web application developers, we should be familiar with the codes and use them when they are appropriate. It makes our applications more friendly to any clients that may want to visit them, whether they have human eyes or not.
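On the client side, the codes are just as easy to consume. A sketch of how a feed-fetching bot might branch on them (hypothetical, and simplified well past what a real fetcher needs):

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch of a feed-fetching bot branching on the response code.
public class FeedChecker {
    static void checkFeed(String feedUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(feedUrl).openConnection();
        conn.setInstanceFollowRedirects(false); // redirects to HTML pages are the enemy
        switch (conn.getResponseCode()) {
            case HttpURLConnection.HTTP_OK:         // 200: parse the XML as usual
                break;
            case HttpURLConnection.HTTP_MOVED_PERM: // 301: update the stored feed URL
                String newLocation = conn.getHeaderField("Location");
                break;
            case HttpURLConnection.HTTP_GONE:       // 410: drop the feed, never ask again
                break;
            case HttpURLConnection.HTTP_NOT_FOUND:  // 404: back off, but keep trying
                break;
            default:                                // anything else: a transient failure
                break;
        }
    }
}
```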
The only caveat relates to Internet Explorer. IE is traditionally very unfriendly to any page that is served up as an error. It assumes that any error page below a certain length (512 bytes, by most accounts) can't possibly contain enough information to explain to the user what really went wrong. If it finds a too-short error page, it will replace it with the standard Internet Explorer error page, which still completely fails to explain to the user what really went wrong, but uses many more words to do it.
If you find you're getting that problem with IE, just pad your page with a few hundred characters of <!-- HTML comment --> until your own page appears once more.
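If you're generating those error pages from a servlet, the padding is a one-line loop. A hedged sketch (hypothetical helper; the 512-byte figure is IE's commonly reported threshold):

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServletResponse;

// Hypothetical helper: bulk an error page out past IE's "friendly error"
// threshold so the browser shows our page rather than its own.
public class PaddedErrors {
    static void sendPaddedError(HttpServletResponse response, int code, String message)
            throws IOException {
        response.setStatus(code);
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<html><body><p>" + message + "</p>");
        // Invisible padding: each comment is ~50 bytes, so 20 of them clears
        // the reported 512-byte limit with room to spare.
        for (int i = 0; i < 20; i++) {
            out.println("<!-- padding so IE doesn't hijack this error page -->");
        }
        out.println("</body></html>");
    }
}
```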
[1] The bot doesn't stop checking 404'd feeds entirely. The web being what it is, pages vanish temporarily all the time. It does, however, check them less often the longer they're gone.
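For the curious, that "less often" schedule is the usual doubling backoff. A sketch of the idea (illustrative numbers, not the real Javablogs configuration):

```java
// Illustrative sketch of the footnote's behaviour: the longer a feed has
// been missing, the longer we wait before checking it again.
public class RetrySchedule {
    static final long BASE_MS = 60L * 60 * 1000;          // start at one hour
    static final long MAX_MS  = 7L * 24 * 60 * 60 * 1000; // cap at one week

    static long nextCheckDelay(int consecutiveFailures) {
        // Double the wait for every consecutive miss, up to the cap.
        long delay = BASE_MS << Math.min(consecutiveFailures, 10);
        return Math.min(delay, MAX_MS);
    }
}
```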