You may or may not have noticed that Javablogs has been going through another of its periodic bouts of low uptime. It hasn't been nearly as bad this time as in previous episodes, though, mostly because the Contegix guys have been doing such a great job of monitoring the server and kicking it when it goes down.
After a week of totally failing to locate the performance drain, I managed to catch the site just as it was falling into a deep hole. 'top' and 'vmstat' both told me that we weren't using a third of the available memory, and the CPU load rarely hit double-digit percentages. All of the resources seemed to be sinking into I/O.
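For the record, this is roughly the kind of thing I was staring at (a sketch, not a transcript; the exact output varies between distributions):

```
# Report activity every five seconds. The telling columns are si/so
# (pages swapped in/out) and bi/bo (blocks in/out): lots of bi/bo and
# si/so alongside a mostly-idle CPU means the box is I/O-bound.
vmstat 5

# How physical RAM is split between programs, buffers and disk cache:
free -m
```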
So I did what you always do when you're lost in I/O, but too lazy to pinpoint it exactly (rough versions of the commands are sketched after the list):
- Increased Postgres' shared memory setting
- Added an index or two where we were still doing occasional full-table scans on the blog_entry table
- Cached a few expensive DB queries that nobody will notice aren't performed live
- Disabled swap on the server (`swapoff -a`)
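For the curious, the first, second and fourth steps look roughly like this. The specific values and names (the shared_buffers figure, the database, the indexed column) are illustrative, not what I actually typed:

```
# 1. Give Postgres more shared memory in postgresql.conf, then restart.
#    The figure is in 8kB buffers; you may also need to raise the
#    kernel's shmmax to match.
#        shared_buffers = 8192            # ~64MB

# 2. Index columns that were still forcing full scans of blog_entry.
#    'javablogs' and 'posted_date' are illustrative names:
psql javablogs -c "CREATE INDEX blog_entry_posted_idx ON blog_entry (posted_date);"

# 4. Turn off every swap device the system knows about:
swapoff -a
```

(The third step, caching expensive queries, happens inside the application rather than at the command line.)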
OK, maybe the last one isn't something you always do, but I had good reason to try it. Before I go any further, I should say that I am not a professional sysadmin, nor am I a kernel guru. I could be getting this entirely wrong. It seems to be working for me, though, so I thought I'd share.
A naïve implementation of virtual memory will fill up the physical RAM first, maybe leaving a little space for disk caching. Once there is no more RAM left, it will reluctantly start paging unused blocks of memory out onto disk. Advanced VM managers, on the other hand, recognise that you get the best performance through a balancing act between disk cache and virtual memory.
This is why, on a modern Linux box, you find yourself using significant amounts of swap long before you've run out of memory. If a page of virtual memory is accessed less often than a particular block on the disk, you get better performance caching the disk block than you do keeping the memory page in RAM. Thus, it's possible to get better performance on a system with swap than on one that holds everything in physical RAM, purely because swap gives the OS more breathing-room to do clever things with caches.
I couldn't help but think, though, that Linux was getting it wrong in Javablogs' case. Almost half of the Java process was paged out, as were substantial amounts of the Postgres connection processes. At any moment, there was a lot of stuff moving in and out of swap, even though physical RAM was divided 30/70 in favour of disk cache.
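You don't need anything fancy to see this, either. Comparing a process's total size against its resident set gives a crude picture of how much of it isn't in RAM (the pid below is a placeholder):

```
# vsz is the total virtual size, rss the part resident in RAM. The
# difference roughly approximates what's paged out -- crude, since vsz
# also counts memory-mapped files, but enough to spot "almost half".
ps -o pid,vsz,rss,comm -p 12345
```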
Paging out bits of the Java VM is a pretty bad idea. Java does a lot of traversing, compacting, slicing and dicing its heap to collect garbage and maintain performance, and if it has to start pulling chunks of that off disk, your whole application's performance is going to suffer.
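For context, the heap the collector churns over is bounded on the JVM's command line (the flags are standard; the sizes and jar name here are made up for illustration):

```
# If -Xmx promises more heap than you can afford to keep resident, the
# collector's full passes over that heap will drag pages in from swap.
java -Xms256m -Xmx512m -jar javablogs.jar
```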
Similarly, I'd granted more memory to the Postgres connections specifically because I wanted to give Postgres more room to cache things in fast RAM. Having Linux in turn push parts of that back onto disk was defeating the purpose entirely.
Ultimately, I realised that I trusted the two main processes on the box to regulate their own memory better than I trusted Linux to do it for them.
So I turned off swap. And everything seems to be running pretty smoothly. There are still bursts of I/O as the database does something complicated, but in general, the box is back to barely working up a sweat.
Of course, the other thing turning off swap does is remove the breathing-room you get if you run out of physical memory. So I did some back-of-the-envelope calculations. Typically, the box Javablogs is on is using about a third of its physical RAM for programs and data, with the OS holding on to the remaining two thirds as disk cache. Even if Javablogs were pushed up to using its configured maximum heap size, the memory usage would still only go up to 60-65%.
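Spelled out with illustrative numbers (the real box's RAM isn't quoted here, so assume 2GB, with an extra ~600MB of headroom up to the JVM's configured maximum):

```
# current programs+data: ~1/3 of 2048MB = ~680MB
# worst case: (680 + 600) / 2048
echo $(( (680 + 600) * 100 / 2048 ))%     # => 62%
```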
Like I said to Matthew Porter at the time: "The box may die hard. But only if I'm very, very wrong."
The 2.6 Linux kernel has a "swappiness" sysctl that allows the sysadmin to fine-tune the balance between swap and cache. It would have been interesting to play with that, but for the fact we're not running 2.6.
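For anyone who is on 2.6, the knob looks like this (0 means avoid swapping program memory, 100 means swap aggressively in favour of disk cache; the default is 60):

```
cat /proc/sys/vm/swappiness
sysctl -w vm.swappiness=10      # or set vm.swappiness in /etc/sysctl.conf
```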
It would be even cooler if you could do some kind of per-process "swap-niceness". In the same way that 'nice' raises or lowers a process's priority in the scheduling queue, "swap-nice" could raise or lower a process's priority for having bits of it swapped out. Maybe such a thing exists already. I couldn't find it.