Valid HTML Woes

by Charles Miller on December 31, 2003

SGML (and, by extension, HTML) can be rather annoying to deal with. I am beginning to understand why everyone was so happy when XML turned up and started to supplant it. For example, did you know that the following is a completely valid HTML document?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<head>
   <title>I am a fish</title>
                                                                                
<p>Blah

Run it through the W3C validator if you don't believe me. That page will earn you a nice "Valid HTML4" badge. The closing </head> and </p> tags, and the <html> and <body> tags in their entirety, are all implied by the positioning of the other elements in the document. When you run the document through an SGML parser, it should insert all the ‘missing’ bits for you, so when you look at it through the parser, it magically becomes this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
  <head>
     <title>I am a fish</title>
  </head>
  <body>
    <p>Blah</p>
  </body>
</html>
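
To make the inference a little more concrete, here is a toy Python sketch of how the missing tags could be filled in. It is emphatically not a real SGML parser: the class ImpliedTagFiller, the function fill_implied_tags and the HEAD_ONLY table are all names I made up for illustration, the rules cover only a tiny hand-picked corner of the HTML 4.01 content model, and attributes are simply dropped.

```python
from html.parser import HTMLParser

# Assumption: a simplified list of elements that belong in <head>.
HEAD_ONLY = {"title", "meta", "link", "style", "script", "base"}


class ImpliedTagFiller(HTMLParser):
    """Toy tag inference: emits a flat list of tags and text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, outermost first
        self.out = []    # normalised output events

    def _open(self, tag):
        self.stack.append(tag)
        self.out.append(f"<{tag}>")

    def _close_through(self, tag):
        # Close open elements, innermost first, up to and including `tag`.
        while self.stack:
            closed = self.stack.pop()
            self.out.append(f"</{closed}>")
            if closed == tag:
                break

    def _ensure_body(self):
        # Body content implies </head> (if open) and <body>.
        if "body" not in self.stack:
            while self.stack and self.stack[-1] != "html":
                self.out.append(f"</{self.stack.pop()}>")
            self._open("body")

    def handle_starttag(self, tag, attrs):
        if "html" not in self.stack:
            self._open("html")  # any content implies <html>
        if tag == "html":
            return
        if tag == "head":
            if "head" not in self.stack:
                self._open("head")
        elif tag == "body":
            self._ensure_body()
        elif tag in HEAD_ONLY and "body" not in self.stack:
            if "head" not in self.stack:
                self._open("head")  # head content implies <head>
            self._open(tag)
        else:
            self._ensure_body()
            if tag == "p" and "p" in self.stack:
                self._close_through("p")  # a new <p> implies </p>
            self._open(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if "html" not in self.stack:
            self._open("html")
        if not (self.stack and self.stack[-1] in HEAD_ONLY):
            self._ensure_body()
        self.out.append(text)

    def finish(self):
        # End of input implies a close tag for everything still open.
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")


def fill_implied_tags(document):
    filler = ImpliedTagFiller()
    filler.feed(document)
    filler.close()
    filler.finish()
    return filler.out
```

Fed the fish document, fill_implied_tags should yield the same sequence of tags as the fully spelled-out version. A real parser derives all of this from the DTD rather than from hard-coded special cases like the ones above.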

A more relevant annoyance is the handling of <foo />. In XML (and in the weird hybrid XHTML that people are encouraged to write to be backwards compatible with modern browsers), <foo /> is assumed to be equivalent to the empty element <foo></foo>. Unfortunately, this only works because web browsers aren't SGML parsers. In SGML, the slash itself ends the start tag (it is a shorthand from SGML's SHORTTAG feature), so the > that follows is just text, and <foo /> is considered equivalent to <foo >>.
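
The two readings can be put side by side in a few lines of Python. The XML half uses the standard library's xml.etree.ElementTree. The SGML half is only a mimic: sgml_reading is a hypothetical helper that rewrites the text with a regular expression to show what an SGML parser would effectively see, and it does no actual SGML parsing.

```python
import re
import xml.etree.ElementTree as ET


def xml_reading(markup):
    """Parse the markup as XML and serialise it back to a string."""
    return ET.tostring(ET.fromstring(markup), encoding="unicode")


# Hypothetical helper: matches an XML-style self-closing tag such as <foo />.
_SELF_CLOSING = re.compile(r"<([A-Za-z][^<>]*?)/>")


def sgml_reading(markup):
    """Mimic the SGML reading: the tag ends early and a literal '>' is left over."""
    return _SELF_CLOSING.sub(r"<\1>>", markup)


# To an XML parser, both spellings are the same empty element.
same_in_xml = xml_reading("<foo />") == xml_reading("<foo></foo>")

# Under the SGML reading, the tag is followed by a stray greater-than sign.
stray = sgml_reading("<p>Blah<br />more</p>")
```

Here same_in_xml is True, and stray comes out as <p>Blah<br >>more</p>, with the extra > sitting in the text.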

This would be OK if it were just XHTML documents that contained the <foo/> notation. XHTML is required to be valid XML, so you can recognise it and run it through a straight XML parser instead. However, it's started to creep into regular HTML documents as well, through cut-and-paste page writing, developers' finger-macros and guru incantations adopted half-understood by willingly ignorant acolytes. So if you pass one of those through an SGML parser, you'll end up with all sorts of extraneous greater-than signs lying around the place.
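
One way to at least find the offenders before handing a page to an SGML parser is sketched below in Python. It leans on the standard library's html.parser, whose HTMLParser calls handle_startendtag for tags written in the <foo /> form. The names SelfClosingFinder, is_xhtml and stray_self_closing_tags are my own inventions, and treating any doctype that mentions "XHTML" as XHTML is a crude assumption rather than a proper check.

```python
import re
from html.parser import HTMLParser


class SelfClosingFinder(HTMLParser):
    """Collects the names of tags written in the XML-style <foo /> form."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_startendtag(self, tag, attrs):
        self.found.append(tag)


def is_xhtml(document):
    # Assumption: a doctype that mentions XHTML means the page is XHTML.
    match = re.search(r"<!DOCTYPE[^>]*>", document, re.IGNORECASE)
    return bool(match) and "XHTML" in match.group(0).upper()


def stray_self_closing_tags(document):
    """Return self-closing tags found in a document that is not XHTML."""
    if is_xhtml(document):
        return []  # legitimate there; use an XML parser instead
    finder = SelfClosingFinder()
    finder.feed(document)
    finder.close()
    return finder.found
```

For a page with an HTML 4.01 doctype containing <br /> and <img src="a.png" />, stray_self_closing_tags returns the tag names br and img. What to do with them afterwards (rewrite them, warn, or give up) is a separate decision.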

This is far too much to think about on New Year's Eve. Maybe I should go somewhere, get drunk and watch some fireworks.

(Random note. My site doesn't validate. Nor do I expect that it ever will. There's a certain effort/reward ratio involved in maintaining a valid site, and for me, the reward simply doesn't even approach the required effort. Especially considering the work that would be required in going back through over a thousand old entries and giving them valid, semantically appropriate markup.)
