December 2007


24
Dec

API Archaeology

  • 5:22 PM

Occasionally in Java, you come across an API that makes you sit up and go "What were they thinking?" Take, for example, the code to list all the threads in the current ThreadGroup. Rather than the obvious method, one that returns a list (or array) of threads, the signature looks like this:

int enumerate(Thread[] list)

You pass an empty array to the method, which will be filled with Thread objects. The method then returns the number of threads it placed in the array. If the array is not long enough to accept all the threads, the overflow will be silently discarded.

To initialise the array, you must rely on ThreadGroup#activeCount, which only returns an approximation of the number of threads that enumerate might return.
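
For illustration, here is a minimal sketch of the dance this forces on the caller. The class and method names (ThreadLister, listThreads) are made up for this example; only ThreadGroup#activeCount and ThreadGroup#enumerate are real API. Because activeCount is only an estimate, the sketch treats a completely full array as a sign that some threads may have been discarded, and retries with a bigger one.

```java
import java.util.Arrays;

public class ThreadLister {
    /** Returns a snapshot of the active threads in the group and its subgroups. */
    public static Thread[] listThreads(ThreadGroup group) {
        // activeCount() is only an estimate, so start with some headroom.
        int capacity = Math.max(group.activeCount() * 2, 8);
        while (true) {
            Thread[] buffer = new Thread[capacity];
            int copied = group.enumerate(buffer);
            // If the buffer was not filled, nothing can have been silently dropped.
            if (copied < capacity) {
                return Arrays.copyOf(buffer, copied);
            }
            // A full buffer may mean the overflow was discarded: grow and retry.
            capacity *= 2;
        }
    }

    public static void main(String[] args) {
        ThreadGroup current = Thread.currentThread().getThreadGroup();
        for (Thread t : listThreads(current)) {
            System.out.println(t.getName());
        }
    }
}
```

Even then the result is only a snapshot: threads can start or die between the call and the moment you look at the array.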

If you're looking to avoid memory leaks in a non-garbage-collected environment, then it makes perfect sense for an API to push responsibility for memory management back up to its caller, and to gracefully handle whichever buffer size it's given to fill. When you've got pervasive GC, it just looks (and is) clumsy.

So the obvious answer to "What were they thinking?", of course, is "They were thinking like C programmers".

Which in turn leads one to suspect that this particular API has been around since before Java was called Java.

Merry Christmas

  • 3:37 PM

Merry Christmas.

Responsiveness

We have a joke around the office, when a new hire is setting up their development environment for the first time. "Now we go to lunch while Maven downloads the Internet."

Software should be responsive. When you tell it to do something it should do it. Take a pristine Maven installation and tell it to build a Java application -- i.e. perform its most obvious, primary function -- and Maven will first say "Wait a moment, I have to download three dozen different components before I can even start doing what you asked me to do."

"Ah", you say. "That's how maven works. It's a modular system, you're just using it to build a Java app."

To which I reply: "Then don't give me a spoon and tell me to use it to cut steak."

The fact that all these components are downloaded separately and in serial means the overhead of not including them is far greater than if they'd put them all (even plus a bunch more you might not need) in the core installation. Worse, though, this happens not when you're installing Maven, but when you're trying to use it to do something else.

This process doesn't stop at installation. Far too often you'll run maven and it will gleefully traipse off through various repositories doing its own internal housecleaning before actually performing the build. It's like walking up to a shop assistant and asking him to help you find a book, only to watch him dust shelves for fifteen minutes first.

Reliability

Consider two pieces of software. One uses maven (and 1–n artifact repositories) to manage its dependencies, the other keeps all its dependencies in source-control. How many potential points of failure are involved in checking out and building each product?

Repeatability

Builds must be repeatable. If you check out a particular version of your code and build it with particular versions of your tools, you should get a product that is binary-identical each time. (Modulo things like compiled-in build dates, obviously)

Maven seems to try as hard as it can to prevent this. Files go missing from public Maven repositories and suddenly a whole swathe of historical versions of open source projects can't be built without hacking. ibiblio reorganises its directory layout and chaos ensues. Imagine what happens in ten years' time, when Maven has been superseded by some new tool, public Maven repository maintenance is an afterthought, and you desperately need to patch some legacy Java app.

For well-resourced projects, the solution is to maintain your own repository and ensure all your dependencies will be available from it, for all time, but even that won't help you if you suddenly need the compile-time dependencies of a project you previously only used as a binary.

(A number of responses to this blog post have assumed that all my problems could be solved by a locally maintained repository, and/or a repository proxy. We have both. They help to some extent, but they in no way solve the problem completely. All they are is a band-aid over the fundamental issue, and once again, additional potential points of failure.)

Most open-source projects just assume their dependencies will continue to exist in "the cloud" for eternity.

Plugins are the worst culprit. Since the core of maven exists solely to download an Internet's-worth of plugins to do the heavy lifting, and maven has a nasty habit of upgrading those plugins without any user-prompting whatsoever, builds can be crippled by some well-meaning committer "fixing" some piece of functionality. I'm told this has been fixed recently (or will be fixed soon) but versions should not be a moving target. v. 2.0.1 of your build tool should be v. 2.0.1 of your build tool. Forever.

Case Study: Dependency Management

One of the big strengths of Maven 2 is supposed to be the way it manages dependencies, including transitive dependencies. So if jar A requires jar B which requires jar C, Maven will sort all this out for you.
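
To make "sort all this out" concrete, here is a toy sketch of transitive resolution as a breadth-first walk over declared dependencies. It is not Maven's actual algorithm (it ignores versions, scopes and conflict mediation entirely), and the class name TransitiveDeps and the artifact names A, B and C are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TransitiveDeps {
    /** Returns every artifact reachable from root through declared dependencies, excluding root. */
    public static Set<String> resolve(String root, Map<String, List<String>> declared) {
        Set<String> resolved = new LinkedHashSet<String>();
        Deque<String> pending = new ArrayDeque<String>();
        pending.add(root);
        while (!pending.isEmpty()) {
            String artifact = pending.poll();
            List<String> direct = declared.get(artifact);
            if (direct == null) {
                continue; // no declared dependencies for this artifact
            }
            for (String dep : direct) {
                // add() returns false for artifacts already seen, which also guards against cycles
                if (!dep.equals(root) && resolved.add(dep)) {
                    pending.add(dep);
                }
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, List<String>> declared = new HashMap<String, List<String>>();
        declared.put("A", Arrays.asList("B")); // jar A requires jar B
        declared.put("B", Arrays.asList("C")); // jar B requires jar C
        System.out.println(resolve("A", declared)); // prints [B, C]
    }
}
```

The walk itself is cheap. The expense in the real thing is that every lookup in the dependency map is a pom that may have to be fetched from a remote repository.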

Tracking down dependencies and sorting out their transitive relationships is a tricky task, but it's a tricky task you only ever have to do when you modify your dependencies. Maven, on the other hand, wants to do this job every time you build, which adds a huge responsiveness overhead, as the "pom" definition files of each dependency must be retrieved and analysed alongside their jars.

Dependencies may live in a number of different repositories, and these repositories are out of the control of the user, especially in the case of maven-built open-source projects that almost universally rely on the public ibiblio, apache and Codehaus repositories. This impacts both reliability, as all these repositories must be available, and repeatability, as changes to the repositories may have catastrophic effects on the build.

Reliability problems also creep in because Maven, forced to do dependency resolution in each build, must hide a lot of what it's doing from view lest it overwhelm the user even more. Conflicting transitive dependencies are resolved implicitly, and you have to make a concerted effort (with clumsy tools) to manually find out what was going on.

Paradoxically, by trying to make dependency management easy, Maven makes it incredibly hard. It becomes dangerously easy for a project to accumulate dependency cruft (at best unnecessary, at worst conflicting) and excruciatingly painful to remove it.

Conclusion

Maven 2 performs a difficult task, and there are a lot of moving parts — plugins, proxies, repositories — between typing mvn install on the command-line and getting a working system. But there has to be something fundamentally wrong with any tool that, whenever I use it, seems to have at least a 50% chance of completely fucking up my day.

SCENE: Somewhere deep in the bowels of 1 
Infinite Loop, Cupertino

        DEVELOPER

    Boss. The mailing-list has exploded 
    again. They're asking when Java 6 is 
    going to come out. This is really 
    starting to get out of hand.

        MANAGER

    Tell them we can't comment on future 
    releases, and get back to fixing 1.5 
    bugs.

        DEVELOPER

    I can't! They're not buying it any 
    more!

        MANAGER

    Well, release another 1.5 update 
    and make sure there's lots of stuff 
    in the release-notes to reassure 
    them that we're working on something.

        DEVELOPER

    We just did that. But because we 
    called it "update 6" it got all their 
    hopes up.

        MANAGER

    Damn. What's the status of Java 6?

        DEVELOPER

    There's a lot of work left to do. We 
    need more reso...

        MANAGER

    I know. We need more resources. I'll 
    just go upstairs and convince the
    brass that Java 6 is more important 
    than the iPhone SDK. Then I'll ride
    back down here on my FLYING PIG.

    So what is the status of Java 6? Can 
    we put out a new DP in the next twelve 
    hours?

        DEVELOPER

    We've fixed a couple of bugs. It 
    works on Leopard now! But I'm 
    not sure...

        MANAGER

    Does it compile?

      
        DEVELOPER

    ...yes

        MANAGER

    Does the installer work?

        DEVELOPER

    ...yes, but...

        MANAGER

    But what? It compiles, it installs, 
    let's bang it up on ADC today and 
    be done with it!

        DEVELOPER

    Well... I installed the alpha 
    yesterday. The next morning my 
    Mac wouldn't boot, my hard drive 
    was overwritten with random 
    bytes, and my cat was dead.

    ...come to think of it, that last one 
    might be a coincidence.

        MANAGER

    That, my friend, is what disclaimers 
    are for.

From a discussion on an internal Atlassian blog about Java web server security:

Tomcat... iz NOT vulnerable!

(Created with lolcat builder)

Oh, and one more (Fisheye is Atlassian's magical source-code repository viewer):

Fisheye cat... iz watching ur commitz!

java.lang.OutOfMemoryError: unable to create new native thread

This error means you have allocated too much heap, and you need to reduce your -Xmx setting. A similar symptom is this:

Caused by: java.lang.OutOfMemoryError
      at java.util.zip.ZipFile.open(Native Method)
      at java.util.zip.ZipFile.<init>(ZipFile.java:204)

The amount of memory that can be addressed by a process is dependent on the operating system that process is running under. The absolute maximum for 32-bit applications is 4GB, as this is the most memory you can address using a 32-bit pointer. On Windows systems the theoretical limit is 2GB, but don't try allocating even that much. There are various hacky ways to address more memory, but that's another story.

Say someone starts up a JVM with an -Xmx setting that reserves 3.5GB for the application heap. That means there is at most 512MB of addressable memory left outside the heap for that process. Even before the application starts up, some of that memory will be used to load the JVM itself: link in shared libraries, mmap rt.jar and so on. This leaves precious little memory available for anything else.

Starting a new thread (the first error) requires the JVM to allocate space outside the heap for that thread's stack. Opening a zip file (at least before Java 1.6) memory-maps the entire file, leading to the perverse circumstance that the maximum size of zip file you can open before getting an OutOfMemoryError shrinks as the amount of heap you've allocated grows.
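
The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope calculation, not a measurement: the class name AddressBudget is invented, the 512KB stack size is an assumed figure chosen for illustration (real defaults vary by platform and by -Xss), and the result is an upper bound because it ignores the space taken by the JVM itself, shared libraries and mapped files.

```java
public class AddressBudget {
    public static final long MB = 1024L * 1024L;
    public static final long GB = 1024L * MB;

    /** Address space left for everything that is not Java heap. */
    public static long bytesOutsideHeap(long addressableBytes, long maxHeapBytes) {
        return addressableBytes - maxHeapBytes;
    }

    /** Upper bound on the number of thread stacks that fit in the space outside the heap. */
    public static long maxThreadStacks(long bytesOutsideHeap, long stackBytes) {
        return bytesOutsideHeap / stackBytes;
    }

    public static void main(String[] args) {
        long addressable = 4 * GB;  // the most a 32-bit pointer can address
        long heap = 3584 * MB;      // a 3.5GB -Xmx setting
        long outside = bytesOutsideHeap(addressable, heap);
        System.out.println(outside / MB + "MB outside the heap"); // prints 512MB outside the heap

        long assumedStack = 512 * 1024; // hypothetical 512KB per thread stack
        System.out.println("at most " + maxThreadStacks(outside, assumedStack) + " thread stacks"); // prints at most 1024 thread stacks
    }
}
```

In practice the number is much lower, since the same 512MB also has to hold the JVM, its libraries and any memory-mapped zip files.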

And people wonder why I'm so picky about the garbage-collection question in tech interviews...