SGML (and, by extension, HTML) can be rather annoying to deal with. I am beginning to understand why everyone was so happy when XML turned up and started to supplant it. For example, did you know that the following is a completely valid HTML document?
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<head>
<title>I am a fish</title>
<p>Blah
Run it through the W3C validator if you don't believe me. That page will earn you a nice "Valid HTML4" badge. The closing tags for <head> and <p>, along with the <html> and <body> tags in their entirety, are all implied by the position of the other elements in the document. When you run the document through an SGML parser, it should insert all the ‘missing’ bits for you, so seen through the parser it magically becomes this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<title>I am a fish</title>
</head>
<body>
<p>Blah</p>
</body>
</html>
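To see that the ‘missing’ bits really are absent from the source rather than hidden somewhere, here is a minimal Python sketch. It assumes the standard library's html.parser is an acceptable stand-in for a plain tokenizer: it reports the tags that literally appear and makes no attempt to infer implied ones. The names MINIMAL_DOC, TagLogger and literal_tags are made up for this illustration.

```python
from html.parser import HTMLParser

# The five-line document from above, exactly as written.
MINIMAL_DOC = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<head>
<title>I am a fish</title>
<p>Blah
"""


class TagLogger(HTMLParser):
    """Records only the tags that literally appear in the source."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


def literal_tags(markup):
    """Return (start_tags, end_tags) found in the markup, in document order."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.start_tags, parser.end_tags


starts, ends = literal_tags(MINIMAL_DOC)
print("start tags:", starts)  # ['head', 'title', 'p']
print("end tags:  ", ends)    # ['title']
```

Three start tags and one end tag are all the source contains. Everything else in the expanded version (<html>, </head>, <body>, </p> and so on) has to be supplied by a parser that knows the DTD.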
A more relevant annoyance is the handling of <foo />. In XML (and in the weird hybrid XHTML that people are encouraged to write to be backwards compatible with modern browsers), <foo /> is assumed to be equivalent to the empty element <foo></foo>. Unfortunately, this only works because web browsers aren't SGML parsers. In SGML, <foo /> is considered equivalent to <foo >>: the slash ends the tag early, and the trailing > is left over as ordinary text.
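To make the SGML reading concrete, here is a toy Python sketch that rewrites XML-style empty tags the way the paragraph above describes. It is a regex illustration only, not an SGML parser, and it assumes no > or < characters appear inside attribute values. The names SELF_CLOSING and sgml_reading are made up for the example.

```python
import re

# Matches a start tag that ends in "/>", such as <br /> or <img src="x.png"/>.
# Toy pattern: it does not cope with "<" or ">" inside quoted attribute values.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^<>]*?)/>")


def sgml_reading(markup: str) -> str:
    """Rewrite each <foo /> as <foo >>, mimicking the SGML interpretation."""
    return SELF_CLOSING.sub(r"<\1\2>>", markup)


print(sgml_reading("<p>Line one<br />Line two</p>"))
# <p>Line one<br >>Line two</p>
```

The stray > after <br > in the output is exactly the extraneous greater-than sign a strict SGML parser would hand back as text.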
This would be OK if it were just XHTML documents that contained the <foo /> notation. XHTML is required to be valid XML, so you can recognise it and run it through a straight XML parser instead. However, it's started to creep into regular HTML documents as well, through cut-and-paste page writing, developers' finger-macros and guru incantations adopted half-understood by wilfully ignorant acolytes. So if you pass one of those through an SGML parser, you'll end up with all sorts of extraneous greater-than signs lying around the place.
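One way to do the "recognise it and use an XML parser" step is simply to try parsing as XML and fall back when that fails. The sketch below assumes Python's xml.etree.ElementTree is a good enough well-formedness check. The function names and the "xml"/"sgml" labels are hypothetical, and a real pipeline would probably also look at the DOCTYPE and the Content-Type header.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def pick_parser(markup: str) -> str:
    """Heuristic: hand well-formed documents to an XML parser, the rest to SGML."""
    return "xml" if is_well_formed_xml(markup) else "sgml"


print(pick_parser("<p>one<br />two</p>"))  # xml
print(pick_parser("<p>one<br>two"))        # sgml
```

This is only a heuristic. An ordinary HTML page that happens to be well-formed will be routed to the XML parser, and HTML entities such as &nbsp; are not predefined in XML, so an otherwise tidy XHTML page that uses them may be rejected by this check.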
This is far too much to think about on New Year's Eve. Maybe I should go somewhere, get drunk and watch some fireworks.
(Random note. My site doesn't validate. Nor do I expect that it ever will. There's a certain effort/reward ratio involved in maintaining a valid site, and for me, the reward simply doesn't even approach the required effort. Especially considering the work that would be required in going back through over a thousand old entries and giving them valid, semantically appropriate markup.)
Eh, I wouldn't worry about this stuff too much. From a practical standpoint, valid markup is really only good for one thing -- tackling cross-browser display issues. Writing valid code doesn't eliminate cross-browser issues, but it reduces the problem set considerably. And that's about it. For most people, almost-valid markup is good enough. No need to obsess over every little deprecated attribute and unescaped ampersand, unless you're an anal-retentive type (like me).
(The sole exception would be sites that actually *need* to send perfect XHTML across the wire, because they're embedding some other XML content, such as MathML or inline SVG. But I can count the number of people who are doing that on the fingers of one hand.)
The looseness of HTML is a feature. It's part of what makes HTML so easy for eight-year-olds and eighty-year-olds. But it's actually very easy to transform invalid HTML into valid HTML; just use Tidy. http://tidy.sourceforge.net
In my thesis "How to cope with incorrect HTML", I validated 2.5 million HTML pages (XHTML was not considered at the time) taken from http://dmoz.org. Only 0.71% of those pages were valid, and the thesis describes in some detail the different errors the pages had. You can jump directly to this chapter via http://elsewhat.com/thesis/pages/?nr=81 , which shows each page of the thesis as a separate image.
The implication of all this invalid HTML is that it is very difficult to develop new browsers, as they have to reproduce every piece of error-fixing that IE has implemented. This is one of the reasons that Opera and Mozilla have struggled to display every page on the net.