Error Handling vs Error Recovery

by Charles Miller on July 14, 2003

Part two of the “Lessons Learned When My Blog Died” trilogy:

Lesson Two: Handling errors is not enough.

An error is handled if the program is able to recognise that something unexpected has occurred, and trigger some alternative but explicit execution path as a result. For example, if you are loading data from a corrupt file and your program does not handle errors, the corrupt data in the file will propagate through your program unexpectedly, leaving people stuck with names like ‚ƒ„…†‡ˆ‰Š.

Or another example: Java performs implicit bounds-checking on arrays, which lets it automatically detect an attempt to stuff too much data into too small a space. By throwing an exception, Java handles that error condition, at least at a low level. C, on the other hand, does not. Where a C program does not explicitly handle the buffer overrun case, the error can cause unpredictable events elsewhere in the program, leading to the classic stack-smashing security exploit.
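A minimal sketch of that low-level handling in Java. The class and method names here are made up for illustration; the only real behaviour on show is the JVM's own index check.

```java
class BoundsCheckDemo {
    // Tries to write one byte at `index` into a buffer of `size` bytes
    // and reports what happened.
    static String writeAt(int size, int index) {
        byte[] buffer = new byte[size];
        try {
            buffer[index] = 42; // the JVM checks the index on every array access
            return "written";
        } catch (ArrayIndexOutOfBoundsException e) {
            // The overrun was noticed before any neighbouring memory was touched.
            return "rejected";
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAt(8, 3)); // prints "written"
        System.out.println(writeAt(8, 8)); // prints "rejected"
    }
}
```

The equivalent unchecked write in C would compile and run, and what happens next depends on whatever happened to live just past the end of the buffer.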

Handling an error means:

  1. Noticing that something unexpected has happened
  2. Triggering some alternative logic to return the program to a predictable state.

That “predictable state” could be the program printing out a cryptic error message like “Can't use an undefined value as a SCALAR reference at lib/MT/ObjectDriver/ line 354.” and then dying. It doesn't matter: the error is “handled” if you do something predictable with it.
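As a rough sketch of those two steps, assume a hypothetical one-line comment format of "author|text". Neither the format nor the names below come from Movable Type; they only illustrate noticing a problem and leaving by an explicit path.

```java
class HandledParse {
    // Hypothetical record format, for illustration only: "author|text".
    static String[] parseComment(String line) {
        int sep = line.indexOf('|');
        if (sep <= 0) {
            // Step 1: notice that the input is not what we expected.
            // Step 2: leave by an explicit, predictable path instead of
            // letting the garbage flow on through the program.
            throw new IllegalArgumentException("corrupt comment record: " + line);
        }
        return new String[] { line.substring(0, sep), line.substring(sep + 1) };
    }

    // The "predictable state" here is nothing more than an error message.
    static String describe(String line) {
        try {
            String[] comment = parseComment(line);
            return comment[0] + " said: " + comment[1];
        } catch (IllegalArgumentException e) {
            return "Error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("alice|hello")); // prints "alice said: hello"
        System.out.println(describe("garbage"));     // prints "Error: corrupt comment record: garbage"
    }
}
```

The error is handled in both senses of the list above, but nothing here tries to get the user what they asked for.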

Obviously, handling an error is not enough. One should also attempt to recover from an error. The error quoted above is happening when I log in to MT, because it tries to list the five most recent comments, and my comments db was corrupted when I ran out of disk space. MT handles that error by throwing up an error screen, and inviting me to do something else that isn't broken. The error is handled, but not recovered from.

To recover from an error, you must first handle the error, but in the new path of execution in the error handler, you must:

  1. Recognise the cause of the error
  2. Take steps, if possible, to prevent the error from happening again
  3. Take steps, if possible, to continue the user-requested action by working around the error condition

Sometimes, it's possible to recover gracefully from an error. In the case of a corrupt comment in my comment database (I can still post new comments and retrieve old ones), MT should recognise that the corrupt entry really isn't going anywhere on its own, and have some strategy to deal with it.
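One possible strategy, sketched under the same made-up assumptions (an "author|text" record format and an in-memory list standing in for the comments db): skip the corrupt entry, set it aside so it can be repaired or excluded later, and carry on listing the recent comments the user asked for. This is not how MT works; it is just the three recovery steps in miniature.

```java
import java.util.ArrayList;
import java.util.List;

class RecoveringCommentList {
    // Corrupt records set aside for later repair (kept in memory only in this sketch).
    final List<String> quarantined = new ArrayList<>();

    // Hypothetical validity check: "author|text" with a non-empty author.
    static boolean isValid(String record) {
        return record != null && record.indexOf('|') > 0;
    }

    // `records` is ordered oldest first; returns up to `limit` valid records, newest first.
    List<String> mostRecent(List<String> records, int limit) {
        List<String> result = new ArrayList<>();
        for (int i = records.size() - 1; i >= 0 && result.size() < limit; i--) {
            String record = records.get(i);
            if (isValid(record)) {
                result.add(record);
            } else {
                // 1. Recognise the cause: this particular record is corrupt.
                // 2. Keep it from biting again: remember it for repair or exclusion.
                // 3. Continue the requested action: keep scanning older records.
                quarantined.add(record);
            }
        }
        return result;
    }
}
```

Called with four stored records of which one is garbage, and a limit of three, this returns the three good ones and leaves the bad one in `quarantined`, rather than showing an error screen.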

Files get corrupted all the time, especially heavily used ones. I once lost an entire Windows 2000 installation because a single 10MB file became corrupt. Needless to say, I wasn't happy about it.

Obviously, recovering is a lot more difficult than just handling. To handle something, you just need to know that something bad happened. To recover, you need to know just what the bad thing was, and think of ways to get around it. That's a lot more work, and in the face of things that aren't expected to happen often (like files getting corrupted), the need to get a feature written seems far more important than handling every little thing that might go wrong.

Design can play a significant part in aiding (or hindering) error recovery. The standard Java error-handling policy of “just throw an exception in the air and hope somebody deals with it” is often particularly deficient in this regard. With every layer of encapsulation the error passes up through, you lose more of your ability to deal with the issue at the lower level.
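A small sketch of the difference, with hypothetical names throughout. In the first loader the exception sails up to a caller who only learns that something, somewhere, wasn't a number. In the second, the layer that still knows which field was bad gets to decide what to do about it.

```java
import java.util.ArrayList;
import java.util.List;

class LayeredRecovery {
    static int parseCount(String raw) {
        return Integer.parseInt(raw.trim()); // throws NumberFormatException on bad input
    }

    // Throw-and-hope: one bad field aborts the whole load, and by the time
    // the exception reaches the caller the good fields are gone too.
    static List<Integer> loadAllOrNothing(List<String> fields) {
        List<Integer> out = new ArrayList<>();
        for (String field : fields) {
            out.add(parseCount(field));
        }
        return out;
    }

    // Local recovery: the bad field is dealt with where the context still exists.
    // Substituting `fallback` is an assumption of this sketch, not a general rule.
    static List<Integer> loadWithLocalRecovery(List<String> fields, int fallback) {
        List<Integer> out = new ArrayList<>();
        for (String field : fields) {
            try {
                out.add(parseCount(field));
            } catch (NumberFormatException e) {
                out.add(fallback);
            }
        }
        return out;
    }
}
```

Whether a fallback value is acceptable depends entirely on the data; the point is only that the choice can be made at the level where "which field?" is still answerable.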

Design can also make error recovery harder. Staying with the subject of file formats, if you place all your data in a single file with variable-length records and no recognisable record separator, you're really quite dead if you get a single bad entry. Similarly, if you're using a transaction log to keep anything in sync, each log entry relies on the previous entries all having been performed. So if a corruption appears halfway through the transaction log, anything after that corruption will be rendered completely useless.
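For contrast, here is a sketch of a format that can be salvaged. It is entirely hypothetical: one record per line, each prefixed with a CRC32 checksum of its payload, and payloads are assumed never to contain a newline. The newline gives a reader somewhere to resynchronise, and the checksum tells it which records to trust.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

class SalvageableLog {
    static String checksum(String payload) {
        CRC32 crc = new CRC32();
        crc.update(payload.getBytes(StandardCharsets.UTF_8));
        return Long.toHexString(crc.getValue());
    }

    // Hypothetical on-disk form of one record: "<crc32 hex>:<payload>\n".
    static String encode(String payload) {
        return checksum(payload) + ":" + payload + "\n";
    }

    // Returns every payload whose checksum still matches, skipping damaged lines.
    static List<String> salvage(String fileContents) {
        List<String> good = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            int sep = line.indexOf(':');
            if (sep < 0) {
                continue; // no checksum field at all: skip to the next separator
            }
            String payload = line.substring(sep + 1);
            if (line.substring(0, sep).equals(checksum(payload))) {
                good.add(payload);
            }
        }
        return good;
    }
}
```

This only helps when records are independent of one another. It does nothing for the transaction log case, where a surviving entry is still useless if the entries it depends on were lost.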

So, the moral of today's story: don't just think about how to handle an error; also put thought into how you might recover from it.

Note: I'm using a pretty ancient version of Movable Type here, mostly because I'm scared of upgrading. Any or all of these issues may have been fixed in later versions. I'm not trying to criticise the product, I'm just trying to make a few general points and make some good of an annoyingly broken weekend.
