Databases, Part One

My distrust of Relational Databases comes from a time before I was a full-time programmer. I was a part-time programmer, part-time sysadmin. I wanted to install an SNMP client for Linux. The problem? The client wanted to store data somewhere, and the writer of the client had decided to store it in MySQL. This was a desktop application, a pretty small one for that matter, and the author had decided I needed to install an RDBMS just to store a few hundred Kb of data. Ever since, I've taken to examining very carefully every time I just assume that the answer to the question “Where do I store my data?” is a relational database.

RDBMS are slow. They're massive generalisation, and thus will waste an awful lot of cycles trying to cater for possibilities that you'll never encounter in your application. RDBMS require specialist expertise to tune effectively. DBA's are a particular breed of wizard, with an incredible store of arcane knowledge. RDBMS are large applications that require additional maintenance. You suddenly have to wonder about the bugs and security holes of a whole new big application. When a new major version of the database comes out, and you want to upgrade, you have to dump and restore all your data. If you upgrade your Linux distribution, you might find your data has been left behind entirely.

I feel pretty confident saying that a significant number of people who are using a Relational Database, shouldn't be. They shouldn't be for one of the following reasons:

They're not storing enough data.
They don't need to make arbitrary relational queries.
They don't need ACID transactions.

As an illustration, I'm going to pick on one application: the QDB (Quotes Database) at bash.org, which I go to occasionally, and which is often painfully slow. As a disclaimer; I know that the site uses MySQL for data storage because it tells me it does. I don't know what caching the application does internally, it could be they're using a really smart architecture and the database isn't the bottleneck. It certainly appears to be, though. Quite often, I get an error message telling me it timed out connecting to the database, which seems to point to the DB as being the point of failure.

Looking at the statistics on QDB, there's fewer than 15,000 quotes in the database. Very generously figuring an average quote at 1k, that's 15Mb of memory. That's nothing. That's room for the database to grow by a factor of twenty and still fit in the RAM of a low-powered server. And think how much cheaper RAM will be by the time the database hits 300,000 quotes.

The select operations that need to be performed on the quotes are really simple. The only ones seem to be sorting by rating, random selection, and text search. The first two can be handled incredibly efficiently by in-memory data-structures. Even if you're really strapped for memory and can't hold all the data there, the application could store the quotes in something simple like Berkeley DB, keep sorted indexes in memory, maintain an LRU cache, and cheat the random function by keeping a “random pool” of quotes in memory that just cycles every so often. Remember, you can serve the same “random” page to as many people as you like, so long as you never give it to the same person twice.

The only complex query in this application is a full-text search, and there are specialist full-text search packages available that don't rely on an RDBMS. Java hackers should seriously consider Lucene for this purpose, for example. Remember, RDBMS's are slow, so the same full-text technology will be faster outside an RDBMS than inside.

One of the big benefits of a real database is ACID transactions. Most applications don't need them. If you're storing financial data, ACID is a necessity. If you're running a web application, it's overkill, and once again it's overhead that your application will suffer from. MySQL became the most popular Open Source database long before it had transactional support, because it was fast. Why was it so fast? Well one might guess, partly because it didn't support transactions. And nobody but a few reviewers at Ars Technica really noticed for years.

Sometimes, you'll need an RDBMS. You're storing massive amounts of data. ACIDity is something you just can't do without. You want to be able to do arbitrary relational queries on your data, or access your data in so many different ways that you'd basically have to reinvent the relational model just to implement them. If you find yourself in this position, go for it.

Sometimes, you'll want an RDBMS. You need to store data somewhere, you're familiar with SQL and the database API, you don't want to worry about coming up with your own data model, dealing with concurrency issues, or messing around with different ways to safely persist your data to permanent storage. This is when you should take a good, long look at yourself. This sort of decision-making will come back and bite you in the long run.

The Fishbowl

Databases, Part One