September 2003


29
Sep

With the help of some links from Les Orchard, I have managed to enable the extended desktop on my iMac. It turns out that the hardware is perfectly capable, but it was disabled in the firmware because this was before the release of the G5, and Apple wanted to leave people some reasons to spend the extra money on a Powermac.

The effect is remarkable to say the least. Thanks, Les.

Sadly, I doubt I'll manage to hack in a third monitor, which is a pity, because it'd look so cool.

Let down...

  • 3:46 PM

I realised this morning that a large amount of my apartment is being wasted by the space-guzzling 19-inch CRT monitor that's hooked to my Windows box. I haven't used my Windows box for more than a few minutes at a time since... er... March? It would make far more sense to throw out the CRT monitor, get an external 17-inch LCD display and hook it up so I could share it between the Windows box (in those rare moments I use it), and the computer that gets most of my attention, my iMac.

(Switching to LCD would be necessary for aesthetic reasons. Putting the iMac next to that gigantic CRT screen would just look totally wrong.)

Except according to the guy in the shop, and confirmed by a quick Google search, the external monitor port on the iMac only supports mirroring desktops, not extending them. I'd assumed it would be like the Powerbook, and be able to do both.

Damnit. For the first time since buying the iMac in February, I have a reason to be disappointed in it. This is the first thing about the machine that hasn't "just worked" for me. Oh well, I guess seven months is a pretty long honeymoon.

A quick newbie's guide to this exciting new coding paradigm:

  1. A dysfunctional program consists of a collection of dysfunctions.
  2. A dysfunction works on a number of arguments.
  3. Dysfunctions are not called explicitly. They are invoked as soon as their requisite arguments exist.
  4. If there are no arguments, a system-level dysfunction named "awkward silence" will attempt to invent some.
  5. Dysfunctions do not produce any results or return values; they work entirely by mutating their arguments.
  6. The majority of dysfunctions contain at least one infinite loop, and thus never end. Well-written dysfunctions create this loop as a constant byplay between two or more arguments.
  7. During a dysfunction, arguments may be mutated into new arguments, or may spawn additional arguments.
  8. All dysfunctions in a program are considered to be running in parallel. As arguments cannot be locked exclusively by a single dysfunction, the exact interactions of dysfunctions cannot be predicted. This is a feature.
  9. In the event an argument is no longer needed, it can be discarded. The Garbage Collector will place all discarded arguments in a pool. If the program runs short of arguments, old arguments will be randomly revived to keep it going.
  10. Short of external intervention (e.g. the Unix SIGKILL), a dysfunctional program will never end.

While no dedicated Dysfunctional Programming language currently exists, Larry Wall was recently caught wandering down a corridor rubbing his hands together and muttering "I'll get them! I'll get them all!"

On Monday evening, I did a bit of Ruby hacking. On Tuesday morning, I arrived at work and my first task involved iterating through a list, and doing something to each of its elements. My fingers were already typing list.each, and I had to wrench my brain from my Ruby mindset back to the Java Way.

Some time later in the day, I read Philip Greenspun's much-linked Java-is-an-SUV flame.

Problem: I have a list of objects. I want to create another list containing the ‘id’ property of those objects.

Solutions:1

Ruby
list.map { |i| i.id }
Perl
map { $_->id } @list;
Python
[x.id for x in list]
Common Lisp (Corrected by Andreas)
(mapcar #'id list)
Smalltalk (from James)
list collect: [:each | each id]
OGNL (from Jason)
list.{id}
Java
List ids = new ArrayList();
for (Iterator i = list.iterator(); i.hasNext(); ) {
    ids.add(new Long(((Thingy)i.next()).getId()));
}
Java w/ 1.5-style Generics/For Loop/Autoboxing
List<Long> ids = new ArrayList<Long>();
for (Thingy x : list) ids.add(x.getId());
Java w/ Commons-Collections (from Chris)
Collection ids = CollectionUtils.collect(
     list, new Transformer() {
            public Object transform( Object thingy ) {
                return ((Thingy)thingy).getId();
            }
        });

1 I was too lazy to test that they work, but the syntax is close enough. Bug reports to /dev/null.

I was thinking last night that I haven't really bought many CDs lately. I guess it's partly because (IMHO) the music scene is going through one of those downturns during which it doesn't produce much that I actually like. Or perhaps it's just that I'm finally getting older and my ability to adopt new music has atrophied: in twenty years I'll still be listening to my old Nine Inch Nails CDs, muttering "In my day, music meant something!"

Popular music has just about reached the point it was at in the late 80s: so over-manufactured that it's due to be torn apart and recycled a decade later as embarrassing retro. Except I can't think what will tear it apart this time: the last two times it was some variant on Punk, and Punk is as much a packaged commodity these days as any other genre.

Then, during my lunch break, I went down to the local record store to pick up the new A Perfect Circle CD. Except I didn't, because the damn thing was copy controlled, and thus no use to me at all.

"Ah", I thought. "Now I remember why I haven't been buying as many CDs lately."

NAT and Security

  • 12:31 PM

What is NAT?

NAT is 'Network Address Translation', and is the solution to the problem: "I have one IP address, and I want to share it between 'n' hosts".

With a NAT'd network, the gateway (which hosts the network's one publicly routable IP address) divides the world into "public" and "private" address spaces. When a host on the private network tries to connect to the outside world, the NAT gateway silently rewrites the packet so that it appears to come from some port on the gateway, hiding the internal address. When a reply comes back from the public Internet, the gateway reverses its mapping, sending the packet to its true, hidden destination.

Connectionless protocols such as UDP are rewritten on a "best guess" basis (if they are handled at all).
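In code terms, all the gateway is really doing is keeping a translation table. A toy sketch of the idea in Python (with invented private addresses and ports, and none of the packet-level work a real gateway does):

```python
# Toy NAT gateway: maps (private_ip, private_port) pairs onto ports on
# the gateway's single public address. Purely illustrative; the
# addresses below are made-up examples.

PUBLIC_IP = ""

class NatGateway:
    def __init__(self):
        self.next_port = 40000
        self.out_map = {}   # (private_ip, private_port) -> public_port
        self.in_map = {}    # public_port -> (private_ip, private_port)

    def outbound(self, src_ip, src_port):
        """Rewrite an outgoing packet's source to the public address."""
        key = (src_ip, src_port)
        if key not in self.out_map:
            self.out_map[key] = self.next_port
            self.in_map[self.next_port] = key
            self.next_port += 1
        return (PUBLIC_IP, self.out_map[key])

    def inbound(self, dst_port):
        """Reverse the mapping for a reply; unsolicited packets get None."""
        return self.in_map.get(dst_port)

nat = NatGateway()
print(nat.outbound("", 5123))  # ('', 40000)
print(nat.inbound(40000))               # ('', 5123)
print(nat.inbound(40001))               # None: nobody inside asked for this
```

Note how the last case falls out of the table for free: a packet nobody inside solicited has no mapping, so it has nowhere to go. That's the "firewall" behaviour discussed below.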

Is NAT a Firewall?

Technically, yes. A NAT box is a special-case of a stateful firewall. Hosts inside the firewall can send packets out and establish connections. Once a connection is initiated from inside the network, data can flow freely until that connection is closed. Hosts outside the firewall, however, are unable to initiate connections to hosts inside unless a tunnel is specifically provided by the NAT box's administrator.

As such, it's important to note that you get almost no security from NAT that you can't get with a halfway-decent stateful firewall. Setting up such a firewall to give you identical protection without the address translation would take all of sixty seconds. The only benefit you get is that because NAT'd internal addresses come from the non-routable IP address space, the Internet might protect you from some obscure, creative exploitation of a bug in your firewall. But it probably won't.

On the other hand, NAT is very inflexible. It is possible to allow services within the network to be accessed from the outside by creating a tunnel so that when somebody connects to, say, port 80 of the NAT gateway, they are silently redirected to a webserver on an internal address. However, because this completely uses up port 80 on the gateway, you can't just add a second webserver later. Similarly, if you have 20 people in the private network all of whom want to use a network client that listens for connections on port 3324, all but one of them will be out of luck.

What's Wrong with NAT?

The downsides of NAT can be quite subtle. Essentially, what it creates is a two-tier1 Internet: one group of people who can both establish and receive connections, and another group who can only establish them. (Thinking of it as a read-write or read-only Internet connection is a useful metaphor, even if it's completely inaccurate)

Don't get me wrong. Blocking the ability for the outside world to establish connections to your private network is particularly useful, and something most firewall administrators would approve of in the general case. It's just that NAT is really bad at handling exceptions. Trying to manage the protocols you want to work through the NAT box becomes impossible the moment you want them to apply to more than one host.

A few years ago, when I had a NAT'd network, I would turn on Napster just so I could be amused at all the people trying (and failing) to access my ripped CD collection. I could connect to the directory, but nobody could connect back to me to get the songs. Similarly today, back behind NAT again, I have endless problems with IM clients and file-transfers.

Creators of new Internet protocols end up with two choices: either exclude the (significant, and constantly growing) population of NAT users, or over-complicate the protocol by having it try connecting in both directions in the hope that one will allow the link to take place. The prevalence of NAT gives us the choice between sacrificing simplicity or sacrificing users.

Some protocols, such as KaZaA's new voice-chat application, Skype, even go as far as having the connection piggyback on a non-NAT'd intermediary. While this is clever, it's still a work-around to the underlying problem, not a solution.

NAT is also used as an excuse for hanging on to IPv4. “We're not running out of IP addresses”, goes the cry. “We have enough to last us until 2020!” Certainly we do, at our current level of usage. But if there's no scarcity, why is it so hard for Joe Average to get a block of them for his home network off the cable provider? The idea of there being a scarcity of numbers is, frankly, ridiculous2. We justify the continuation of the address-space status quo through the wide availability of NAT as a 'cheap alternative'.

Resist NAT

As you may have gathered, I'm not a big fan of Network Address Translation. Its security benefits are minimal compared to a similarly configured firewall, and its disadvantages are legion. It's a tool that's blocking the free flow of data, segmenting the Internet and bloating protocols, while at the same time being misused as an excuse not to improve the infrastructure surrounding it. This is just a bad thing.

1 Actually, the Internet has been developing tiers all over the place. There's also the class distinction of static vs dynamic IP addresses: another thing that should3 go away if we had a larger address space.
2 Dear Internet-at-large. It's spelled 'ridiculous', not 'rediculous'. Also, while I'm at it, the opposite of 'win' is 'lose'. It's not 'loose'.
3 I say "should", because service providers have a significant interest in maintaining the dynamic-IP system, many even cycling IP addresses on otherwise always-on connections like cable modems, so they can create an artificial price differential between home and business accounts.

Procrastination

  • 12:06 PM

It's interesting what you can go without when you're too lazy to do anything about it.

Probably the most glaring of my many character flaws is my ability to procrastinate pretty much forever. Over the years, this has led to me putting up with situations that most people would just do something about, because I was always going to "do something about it tomorrow".

Aside from the usual things: getting behind with the rent, having my phone/Internet cut off for a weekend, spending an evening in the dark waiting for my (previously unpaid-for) electricity to be reconnected and so on; over the last few years I have endured:

  1. two months without television
  2. six months (half of it Summer) without a fridge
  3. six months with the electrical circuit that governs my apartment's lighting not working
  4. one year without hot water

You'd think I would learn.

I've been having one or two problems with my web-hosting provider. This time six months ago, I was lavishing praise upon them for being very reliable, but not long after that things started getting worse. Aside from a general increase in unreliability, there were a number of pretty significant problems.

  1. I discovered, the hard way, that their disk quotas were hard limits.
  2. The machine I was being hosted on was very slow for a week, which meant half the time my weblog wouldn't rebuild properly when I posted.
  3. Their solution to this problem was to move me to another machine... without letting me know the IP address or timing in advance. Which meant the morning after the switch, I woke up to find myself stuck with a severely broken DNS record.
  4. For several weeks, I could only sporadically access my own webpage from home. The problem turned out to be Qwest over-enthusiastically blocking ICMP, which killed Path MTU Discovery. As far as I can tell, they only worked this out after I politely suggested it might be the problem.
  5. Only days after they'd recovered from this long and significant outage they sent me an invoice.

There's bad timing, and there's bad timing. So yes, I think it's time to move on. I'm in the process of setting up MT on the new provider, and taking the opportunity to fix a few of the things that are more egregiously wrong with my blog. Hopefully I can drop the TTL down and get the DNS to switch over more smoothly this time, except for Javablogs, which will, of course, continue caching the old IP until it is next rebooted, regardless of TTL.

Anyway, today my blog was unreachable from about 9am, and continued to be so all day. At about 6:30pm, I gave their support number a ring, ready to finally tell them just how pissed off I was at the service, outages, and general pile of annoyance I had experienced.

Well, I was about to. Then the lady on the other end of the line told me about the fire that completely shut down their co-lo. Ouch. I feel a bit guilty now.

I recently bought a keyring-sized 256Mb USB flash drive. These things are really cheap right now, and it quite handily solves the "data I want both on my desktop and laptop, whether I have net access or not" problem. I could use my iPod, I suppose, but the keyring is just that bit more convenient.

Alan mentioned just after I bought it that one common use for such gadgets is as a boot disk: you stick one of the smaller Knoppix distributions on it, and then you can plug it into any PC that is able to boot from a USB drive. Voila, Linux in your pocket.

This, of course, got me thinking. Right now, the keyrings seem to max out around 512Mb. This means in a few years, you'll be able to get affordable several-gigabyte keyrings. This opens all sorts of possibilities as far as data-transfer goes. You'll essentially be able to carry around not only your data, but your software with you wherever you go. Think of going into a net.cafe, and instead of sitting down at a strange computer, you're sitting down at your own computer, set up exactly how you want it.

It solves all the annoying problems related to "roaming profiles", too.

We were told that the network was going to solve this problem for us, but the network is a bumpy thing. It's not always there when you need it, and when it is there it's full of impediments to the free-flow of data like firewalls or backhoes. Sneaker-net is still a lot more efficient in many cases.

It doesn't surprise me that Intel have been having ideas along the same lines, although their idea is more iPod-sized than keyring-sized. Intel are looking at the "Personal Server", a portable hard-drive/low-powered server/WiFi combination that (ignoring all the standard sci-fi marketing dithering about walking past things and having them recognise you) finally separates the "computer as the thing you sit in front of and type on" from the "computer as the thing that holds all your information".

I also love the assumption in that article that in the future we'll all be walking around with video cameras mounted on our shoulders, recording everything that happens to us. Does that mean we'll also one day be recording ourselves watching ourselves watching a recording of ourselves?

Mind Your Step

  • 8:58 AM

On the wide concrete stairs, they've painted “Mind Your Step” in big yellow letters at regular intervals.

These signs recently appeared at Milson's Point station. I was going to rant about them, but any comment seems superfluous.

Finally, Apple have updated the 15-inch Powerbook, giving it the same features as the 12- and 17-inch models, and finally replacing the Titanium (with its easily-chipped paint) with brushed Aluminium.

It occurred to me, as I was sitting on IRC watching the news filter across various networks, through some people who deliberately stayed up through the night to be awake while Steve Jobs gave his keynote in Paris, that Apple is really the only hardware company in the world that can get this level of interest in what are essentially incremental upgrades.

Then it occurred to me that I was being just as fan-boyish as everyone else. Grrr. It's at least six months before I can afford to replace my Powerbook. Damn you, Apple, for making such cool things.

(In other news, they have a new Bluetooth Keyboard and Mouse (yes, Alan, a one-button mouse), Panther still has no firm release date beyond "before the end of the year", and iChat AV will interoperate with Windows videoconferencing software soon)

One reason Microsoft Internet Explorer annoys me is the typo feature. When you mistype a domain, rather than give you back an error message, it redirects you to MSN's search site. I don't like this for two reasons: firstly, it adds significantly to the time it takes to just correct the typo and load the right page. Secondly, it makes an annoying assumption about which search engine I might want to use. (Hint: it's not MSN)

This is one reason I try not to use MSIE. Mozilla throws up an error message when it can't find a domain, and makes it very easy for me to choose Google as my default search-engine.

Verisign, it seems, have the trump-card. By putting a wildcard DNS on '.net' and '.com', they are redirecting every single domain typo to their own search page. I can't even begin to describe how much this whole idea annoys me.

It's disreputable. I've always considered typo-squatting--the practice of registering domains that are similar to popular sites so as to get hits from typos--to be a pretty underhand tactic: something you'd expect from the second-hand car salesman school of marketing. Now Verisign are planning to typo-squat probably half the Internet.

It's technically reprehensible. It's breaking the DNS. In one fell swoop it removes the technical distinction between an unregistered domain and a registered domain. It's part of this stupid assumption that the whole Internet is just the World Wide Web with a few unimportant bits bolted on the side. So obviously it's OK to break a fundamental feature of the DNS just so that one company can exploit a few more web users.

It's vulnerable to cross-site scripting.

It's an abuse of monopoly. If a web browser or an operating system plays this sort of trick, you can stop using it, just as I avoid MSIE. You can't avoid the DNS,1 and you can't just choose to go with some provider of the .com domain who isn't a scum-sucking bottom-feeder.2

The body that should slap Verisign down won't, of course. Verisign should be the caretaker of .com; they shouldn't own the whole namespace. Verisign are abusing the fact that they've been put in charge of a significant public resource, with too few checks on what they are permitted to do with it.

I'm just going to blackhole sitefinder.verisign.com.

Update: this BIND 8 patch allegedly fixes the issue (I haven't tested it) by checking for the IP address that the wildcard resolves to.

This Linux LD_PRELOAD patch (again, allegedly) intercepts calls like gethostbyname() and substitutes a 'domain not found' response for the IP address of the Verisign server.

Update: Overheard: "Verisign: We put the * in .com"

1 Yes, I'm aware of the existence of alternative TLD registries. Wake me when they are relevant to the real world.
2 There are alternative registrars, but Verisign still own all your base.

Yesterday, I spent most of the day with Alan and David hacking on a Robocode-style game in Python. It was a good excuse to get my hands dirty with the language, and get some idea of what it's like to code in it.

(I am completely un-apologetically a Ruby hacker, which means that every time I try to write a Python app, I end up writing it in Ruby instead because that's what I'm more familiar with, and I can't really see any particular advantage in switching.)

Anyway, as part of the game, there are some pretty simple mathematics that need to be handled: coordinate geometry for the map, and predicting the movements of ships based on their thrust and speed of rotation and acceleration.

I sat staring at the screen for a very long time. It seems that ten years after leaving school (and five since I last studied mathematics in any meaningful way) my mathematical skills have atrophied to the point where I'd forgotten how the hell sine and cosine really worked.

Needless to say, this annoys me. I'll probably go out and buy one of those Physics for Games Developers books next week and re-learn a bunch of this stuff, but the very concept that my mathematics knowledge has atrophied back to grade 8 level just really, really shits me.
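For the record, the stuff I had to re-derive is nothing more exotic than decomposing a heading and a speed into x and y velocities with cosine and sine. A reminder-to-self sketch in Python (the function and its conventions are my own invention, not anything from our game):

```python
import math

def step(x, y, heading_deg, speed, dt=1.0):
    """Advance a ship one tick. Heading is in degrees anticlockwise
    from the positive x-axis; speed is in units per tick."""
    heading = math.radians(heading_deg)
    return (x + speed * math.cos(heading) * dt,
            y + speed * math.sin(heading) * dt)

print(step(0.0, 0.0, 0.0, 10.0))   # (10.0, 0.0): due 'east'
print(step(0.0, 0.0, 90.0, 10.0))  # (~0.0, 10.0): due 'north', give or
                                   # take floating-point rounding
```

Acceleration from thrust is the same trick applied to velocity instead of position. Grade 8 stuff, like I said.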

Cedric thinks he's found 64,000 bugs in the Java Character class. He ran the following test against all the available chars, and found there were about 64,000 characters that were not lower-case after you've called toLowerCase() on them:

if (! Character.isLowerCase(Character.toLowerCase(i)))

If you want to really trip out, try this test:

if (Character.isUpperCase(Character.toLowerCase(i)))

On my Mac that still returns 63 matches. There are 63 characters that are still upper-case after you've called toLowerCase() on them!

This is not a bug. The Javadoc for the Character Class explains that the toLowerCase() method does not necessarily return a lower-case letter. It returns either a lower-case letter, or the original letter if it has no lower-case equivalent.

In addition, the definition of upper-case in the same Javadoc includes three classes of characters: those with lower-case equivalents, and those marked either "Capital Letter" or "Capital Ligature" in the Unicode spec. Thus, you can toLowerCase() a character, and have it still be upper-case, because it is one of the Unicode characters that is upper-case, but has no lower-case equivalent to convert to.

What this does mean is that if your application makes an assumption as to the alphabet its users are going to be using (in this case, the latin alphabet assumption that all alphabetic characters will be either upper- or lower-case), you're going to fall over when someone starts using a different alphabet.
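The same distinction exists in any Unicode-aware language, not just Java. For instance, in Python, using ℂ (U+2102, DOUBLE-STRUCK CAPITAL C): it's classified as an upper-case letter, but has no lower-case equivalent to convert to.

```python
import unicodedata

c = "\u2102"  # ℂ, DOUBLE-STRUCK CAPITAL C
print(unicodedata.category(c))  # 'Lu': an upper-case letter
print(c.lower() == c)           # True: no lower-case form to map to
print(c.lower().isupper())      # True: still upper-case after lower()
```

Exactly Cedric's "bug", in one character.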

The Mac OS X application ‘StickyBrain’ challenges you with an alert-box when you attempt to delete a sticky note, asking you if that's what you really want to do. Perhaps its one redeeming feature is the option to never see the dialog again.

When designing a user interface, it is always better to beg forgiveness than to ask permission. Whenever you throw up an alert like the above from StickyBrain, you are far more likely to be delaying something the user means to do, than be protecting them from a mistake. Worse, your protective alert-box quickly becomes useless as the user either disables it (using the supplied option), or just learns to click through automatically without thinking.

Instead, your application should quietly do what it's told, but provide a way for the user to recover from mistakes. Make your actions undo-able. Make your deletions recoverable. Even actions that are not reversible can be quietly delayed in case the user changes their mind (think of the way email clients delay deletion).

You'd think everyone knew this by now, but the existence of an irreversible delete and an intrusive warning alert in a program that is currently being promoted on Apple's .Mac service shows that obviously it's not as widely known as I thought.

I've visited this subject before, but it's worth repeating. In my experience, the most useless feature in any public bug-tracking system is the ability to vote on bugs.

As I mentioned in the linked article, users of bug databases aren't really representative of the user population at large, and the bugs that get large numbers of votes tend to be the "vocal niche-market" issues that stay unfixed for a long time because they're just not a priority for the people writing the software.

For commercial products, or Open Source products with corporate backing, the features that the paid developers work on will be dictated by the people doing the paying. While votes in the bug database might be some factor in their decisions, it's likely to be a very, very minor one. You're much better off making your requests through the sales channel: ask them what the best way is to get your bug fixed or feature in.

Volunteers for Open Source projects will work first on problems that affect them directly, and then on problems that don't necessarily affect them, but that their pride won't let them leave unfixed. Votes are unlikely to change this schedule. Calls from users to have a bug fixed that none of the developers think is important will be met with replies of "if you think it's that important, we'll be happy to accept a patch".

After my dream of last week, I decided to watch the movie again today. I'll probably read the book again some time soon, too. As a disclaimer, Fight Club is probably my favourite movie of the last decade.

I was also reminded of a bunch of fan-boy writeups I had read a while back on everything2.

Dear world. Fight Club was not a grand endorsement of nihilism. It was not there to reassure you that it was OK to hate the world and your life. Fight Club was a satire!

A self-delusional man with a fractured personality spouts glib philosophy, and gathers to himself a band of incredibly stupid people to help overthrow the society that they feel they are the victims of. People band together in a pact of mutual self-destruction, and every time they assert their individuality, they're really just subsuming themselves into another fad that will rid them of the need to think for themselves. The whole thing is like that moment in Life of Brian.

Brian: You're all individuals!
Crowd: Yes! We're all individuals!
Brian: You've all got to think for yourselves!
Crowd: Yes! We've all got to think for ourselves!
Crowd: Tell us more!

In the movie, this is even more pronounced. For Christ's sake, we have Brad Pitt telling us we've been lied to by society because we'll never be movie stars, Brad Pitt telling us how we're being sold men on billboards. If you can't see the massive degree of tongue-in-cheek going on there, you need a pretty big reality-check.

As one of the slightly more perceptive everything2 commenters notes

At first, the more times I watched Fight Club, the more this line bothered me. Ed Norton's character had just gotten everything he wanted: freedom from his job, revenge on his boss. "We now had corporate sponsorship, and this is how Tyler and I were able to have Fight Club every night of the week." How triumphant! Everything he wanted. That which did not matter, at last sliding.

We pan to him lurking amidst the throng of grunting, cheering men, surrounded by sweat.

"I am Jack's wasted life."

The self-aggrandizing construct that the Narrator has developed to compensate for his dissatisfaction with life, his nihilist philosophy, his cadre of brainless space-monkeys, all are naught. The Narrator's moment of clarity in that scene is that he is achieving everything he thought he wanted: the camaraderie and adrenaline of Fight Club, freedom from his dead-end job, liberation from the possessions that were weighing him down... and it's all "Jack's wasted life". He's just sinking further, finding bottom, wallowing deeper in his victim's mentality.

In the end, the narrator's painful redemption comes through the only real human connection in the entire story. In a suitably twisted way, he has to defeat Tyler so he can reconnect with the world, even as it blows up to that cool Pixies track in the background.

For several years now, my Slashdot (no, honest, I don't read slashdot!) signature has been:

The more I learn about the Internet, the more amazed I am that it works at all.

It's important to maintain that sense of wonder. The net is a pretty amazing thing, held together by a bunch of people who are probably massively unappreciated in their attention to the plumbing.

For the past... well, at least six months, probably more, some company has been chalking their URL onto the pavements around Sydney. I've seen them in the CBD, North Sydney, Newtown, Kings Cross, and there are probably lots of other places too. They're not laying down works of art, just chalking a few lines:

example.net
...pithy phrase here...

I'd say it's quite significant that I, someone who spends more of his life than he will comfortably admit to online, have never felt even remotely like visiting the URL in question. The same goes for the plethora of URLs that appear in print and television ads. Never bothered with them.

Most web-browsing is purposeful. Even when we're doing what looks like random browsing, we're doing it purposefully: starting at some point we know has interesting links and branching out from there. The web is such an enormous place, there's little incentive for me to visit some site just because I saw the URL on the pavement.

URLs are cheap. Give me a reason to want to go there.

eviction televises acquiring humbler pompous excresence bangor teleprocessing cousins plush pollock admirations microword abner excavating powderpuff postorder armata braveness scorers european ideology humerus portrays accrues schizophrenic exasperate cotyledons plot borderland $RANDOM IZE adopters savoring hours excavations postpone brakes playwright eutectic hospitalized poorly adorns termwise crisis expirations seaman scowls adrenaline militia scoria

That's a portion of the content of a spam I just received. The subject-line of the spam was "credulity", and the body was an HTML file containing two long lists of words such as the above (styled to be invisible in a web browser) surrounding an inline image, which I assume would have contained the spam's real payload, if I had viewed it.

As described in Paul Graham's A Plan for Spam, Bayesian spam filtering determines the probability that a particular email is spam by examining the words used in the email: words that are used often in spam but rarely in normal email (like "viagra") push the score up, words that are rarely used in spam but common in normal email push the score down.
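For the curious, the heart of the approach is just combining per-word spam probabilities with Bayes' rule. A rough sketch in Python (not Graham's actual algorithm, which also weights by corpus counts and only considers the fifteen most "interesting" words in a mail):

```python
def spam_probability(word_probs):
    """Combine per-word P(spam | word) estimates into a score for the
    whole mail, naively treating the words as independent of one
    another. word_probs: a list of probabilities between 0 and 1."""
    p_spam, p_ham = 1.0, 1.0
    for p in word_probs:
        p_spam *= p          # likelihood of seeing these words in spam
        p_ham *= (1.0 - p)   # likelihood of seeing them in normal mail
    return p_spam / (p_spam + p_ham)

# Hypothetical per-word probabilities, for illustration.
print(spam_probability([0.99, 0.95, 0.90]))  # close to 1: very spammy
print(spam_probability([0.99, 0.10, 0.05]))  # below 0.5: 'viagra' gets
                                             # outvoted by innocent words
```

The second case is exactly the weakness this spam is aiming at: pad the mail with enough strongly "innocent" words and they can outvote the incriminating ones.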

This email was obviously written with such filters in mind. It's seeded with around 150 random, long words of the kind that you'd not expect to find in a spam, but that are varied enough that it's quite likely that at least some of them will be in your filter as high indicators of non-spamness.

Even worse, marking such emails as spam will increase the probability of false positives in the future. If you receive a lot of these mails, certain rare words will be associated very highly with spam by your filter. Then, when you get an innocent-seeming email from a friend that happens to contain the words "schizophrenic pompous playwright", that will be enough to get it black-holed.

Amusingly enough, this email would likely be shot down by a filter because of the method it uses to hide the words in the HTML file. It's only a first attempt, though. Spammers will get better at it. So long as the spammer's dictionary is big enough, and they regularly rotate their words, this could be an effective technique to weaken Bayesian filtering on two fronts: increasing both its false-negatives and false-positives.

One problem (one that Graham admits to) is that Bayesian classification assumes that the elements of the object being classified (in this case, the words of the email) are independent of each other. This is a troublesome assumption to make about language: because each word is considered in isolation, the filter can't tell that a list of random words doesn't remotely resemble the patterns we would normally recognise as the language of a legitimate email.

On the other hand, if we had a more sophisticated linguistic filter, it wouldn't be hard for spammers to come up with a program that generated random, but grammatically correct sentences.

It's a war of escalation.

Norman Richards has already blogged about the "Ethics of Decompilation" thread currently clogging up the Apple java-dev mailing-list. A programmer wrote to the list saying that he had decompiled some code to see how it worked, and wondered what the list-members thought were the ethics of the situation.

The list exploded, much of it outrage, and most of it completely failing to understand where copyright law ends and the owner's rights begin.

Copyright does not prohibit reverse-engineering, except (thanks to the DMCA) where the thing being reverse-engineered is a copy-prevention mechanism. This is because copyright law is all about the making and distributing of copies. Under regular property law, once you have bought something, you're perfectly within your rights to take it apart and see how it works.

Of course, most software doesn't subject itself to such things. Thanks to a massive land-grab early in our industry's existence, back before anyone really knew what "software" was, nobody actually buys software. Instead, you license the use of it under ludicrously draconian terms. Anyone who reads an EULA closely (and nobody really does) and looks at the masses of restrictions and arbitrary termination clauses ends up wondering whether they've bought anything substantial at all.

Hence what we normally call "IP" in the non-Free software industry is really nothing to do with IP law, and everything to do with contract law. It is these contracts that usually prohibit reverse-engineering. Some countries actually have laws that limit contracts so they can't prevent reverse-engineering. Your mileage may vary.

Regardless, reverse-engineering is not "theft", as theft implies a breach of property law. At worst, it's "breach of contract". That, however, is only the law, which is orthogonal to ethics. Is decompiling a program to see how it works unethical?

Novels are, mostly, copyright material. Every author, however, has in his1 time read an enormous number of books, and incorporated their "source code" into his body of knowledge. When that author comes to write another book, he can't help but use the things he's learned about the craft of book-writing from reading the copyrighted works of others.

Does that mean any author must keep careful track of everyone he's read so that he can apportion royalties? Of course not. So long as he's not lifting the words directly from another book, we recognise that it's neither theft, nor un-ethical. Even "homage" is acceptable. My domain, pastiche.org, is named after the technique of creating art deliberately in the style of another artist, one of the common techniques of the post-modernist era2.

English departments at universities spend inordinate amounts of time examining the techniques that went into writing copyrighted works, and are even allowed to write their own books on the subject! And it's generally accepted that doing this benefits the art of writing.

My brother is a playwright. He has a couple of books on screenwriting on his shelf. Each is full of examples of various story-telling techniques that have been used in movies (which are copyrighted). This is perfectly legitimate, and further, if they include snippets of movie "source-code" (scripts) in the book as examples, that's considered fair-use for the purpose of education.

Frankly, I think developers should be encouraged to decompile code they find interesting. So long as they use that knowledge to learn the underlying techniques, and don't then do a cut-and-paste job into their own code, they're not doing anything unethical.

Ultimately, I think that the big hue and cry over decompilation (and the related fetish for bytecode obfuscators) is a case of "methinks the lady doth protest too much." The fact is, most code doesn't do anything particularly interesting or original. Software is the result of an investment of time. It may contain some nifty techniques here or there, but overall it's not going to damage the value of the product if those techniques are publicly known.

For 99% of the code out there, the effort required to decompile it and then understand what you've just decompiled is greater than that which was required to write it in the first place.

There are a few, very few, exceptions. If you've come up with something that is truly valuable and innovative, copyright is not the right tool. Copyrights protect works, not ideas. Patents protect ideas, and the first premise of a patent is that you publish the idea, not hide it. Then again, the sad state of software patents is fodder for a lot more abuse than praise.

If I had my way, all software would come with source. Some forward-thinking companies do this already. Many programming environments pretty much force you to provide the source anyway. Having the source allows you more freedom to customise the product, fix it if it's going wrong, and offers you some slight protection if it ever becomes abandonware. Short of reaching this utopia, I'm going to hang on to my decompiler until you pry it from my cold, dead hands.

Dan would later learn that there was a time when anyone could have debugging tools. There were even free debugging tools available on CD or downloadable over the net. But ordinary users started using them to bypass copyright monitors, and eventually a judge ruled that this had become their principal use in actual practice. This meant they were illegal; the debuggers' developers were sent to prison. -- Richard Stallman: The Right to Read.

1 Yes, I'm using the masculine pronoun. It's just convenient.
2 I chose the domain-name when I was at university, after having the word hammered into my brain in every lecture of a modern literature course.

I was in the pub last night playing in the pool competition. At some point during the game, my opponent mentioned something about Bronski Beat (after a few beers, you're lucky I remember this much), and my father and I both got stuck trying to remember who the Hell the lead singer was. I could remember he was also in the Communards, but I couldn't for the life of me remember his name.

"Aha!" I thought, and whipped out my trusty WAP phone. "I'll look it up on Google!" I figured this was a highly impressive example of how cool technology has become: having the world's biggest store of information literally at my fingertips, even when I'm down in the pub.

So I sat and wrestled with the terrible GPRS reception: waiting to establish my connection, and then sitting through the slow, grinding wait for Google to load. While I was doing this, my father walked up to me and said "Jimmy Somerville". He'd phoned his partner and asked her.

Social networks 1, data networks nil.

One Sun Way?

  • 6:50 PM

One of the disadvantages that is often cited when comparing the Microsoft .Net vision to the Java vision is the fragmented nature of the Java vision. With .Net, there really is only one "product" you can use, which makes purchasing decisions a lot, lot simplier. -- Lee Walton, on the possibility of Sun ditching NetBeans for Eclipse.

Famously, Microsoft's address is also its philosophy: "One Microsoft Way". Bill Gates' vision1, almost completely realised these days, is everyone running the same software: his software. As we have seen lately, monocultures are a dangerous thing. Where a monoculture is weak, that weakness is amplified because everyone suffers from it. On top of that, reliance on a single vendor leaves you subject to the whims of that vendor. If their goals diverge from yours, you have nowhere left to turn.

"Fragmented" is a spin word. It has powerful negative connotations, implying that something that was once whole has broken into pieces. I would like to replace it with a better word: "competitive". The Java landscape isn't fragmented because we have more than one IDE. The Java landscape is competitive because our IDE vendors know that if they don't keep up, we have alternatives to switch to.

Don't like Eclipse? Buy IDEA. Is your particular application slow in Sun's VM? IBM has one you could try instead. Want an application server? Writing a web application? Persisting data? Parsing XML? We have a plethora of different options, all of which want to be better than the others, all pushing each other to improve, and cross-pollinating ideas amongst themselves.

There is a downside to this: it's harder to keep up with what's new, and you're less likely to be able to grab developers off the street with skills in exactly the tools you are using. But on any project of a few months or more, any toolkit education developers might have needed becomes a pretty small fraction of your total cost.

There are two ways to produce something people want to use. The first is to have ideas, try them, see how the market reacts, refine them and repeat. The second is to sit back, wait until everyone else is done having their ideas, and then produce an amalgam of what has worked for everyone else. The first requires a competitive environment. The second requires someone else having a competitive environment from which you can cherry-pick the best ideas.

So here's to fragmentation, and here's hoping we never have to put up with One Sun Way.

1 "I want to have a computer on every desk and in every home, all running Microsoft software." -- Bill Gates. I think he said it in the late 70's, but I can't get an official date for the quotation off the net. It's an amusing reflection on Microsoft's antitrust concerns that all official Microsoft records seem to have removed the last four words from the quote.

Chemical Burn

  • 11:26 AM

I had the weirdest dream the other night.

There was a lot more of it, but this is the bit that stuck.

The gun-barrel was being held to my forehead. I could feel the cold metal pushing into my skin, pushing my head back. Vivid doesn't quite describe it. I was gibbering incoherently, I think the words "no.. please no.." were in there somewhere amongst the frightened noises. And I still said "no.. please no.." as the trigger was pulled. I heard the clicking of the hammer in an empty chamber, again and again, praying that the next one didn't hold a bullet.

At precisely the same time, I was somebody else. I was standing, holding a gun to another person's head. But that person was myself. And the gun was a plastic toy. As I pulled the trigger, it made that cheap plastic clicking sound that toy guns make, and I thought "Hey, this is fun".

And this wasn't revisiting the same situation from two different angles, and it wasn't really holding a gun to my own head. I was two people, at the same time, in the one dream.

Maybe I should change my name to "Tyler".

I would like to personally thank all the cat-lovers (even those without cats) for making Post Pictures of Your Cat To Javablogs Day such an enormous success. Join us same time next year, when we post candid snapshots of our favourite celebrity elbows!

Honour Roll (with Javablogs click-throughs as at midnight)

Alan Green (75), Matt Quail (39), Lee Walton (42), Danny Ayers (23), Cameron Purdy (39), Bob McWhirter (67), Carlos Villela (17), and me (106)

Special Mention (Cats not posted to Javablogs)

Sam Stevens (there were one or two others, but they didn't post a trackback...)

Symptom: when you're trying to access a webserver, you can connect fine, send the HTTP request fine, but then the client waits forever for a reply. Interestingly enough, if you upload a really small (Say, 10 byte) file to the server, you can retrieve it without fail.

Possible Cause: something is blocking Path MTU Discovery.

What is Path MTU Discovery?

Every hop on the Internet has an MTU, or Maximum Transmission Unit. This is the maximum size that IP packets sent over that link are allowed to be. The MTU for something like ethernet will be quite high (usually 1500 bytes), but other transmission media might run more efficiently with smaller packets.

If a router receives a packet that is larger than the MTU of the hop it needs to send it over, the only way it can send the packet is to break it into fragments. The problem with fragmented packets is that they're rather inefficient. One host has to break them up, another host has to knit them back together, and you end up transmitting far more packets than you'd need to if they were just the right size in the first place.

Path MTU Discovery is a way of calculating the largest packet that will traverse a particular path between hosts. The algorithm is simple. Hosts send the largest packets they can, but with the "Don't Fragment" bit set. That way, if the packet turns out to be too big, routers don't just break it up and keep going. Instead they drop the packet, and send back an ICMP Destination Unreachable (Datagram Too Big) message, which tells the originating host the largest MTU the next hop will allow. On receiving this ICMP, the originating host creates a new MTU for that specific destination (known as the Path MTU) at the lower value. Then it resends all the lost data.
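The algorithm is simple enough to sketch in a few lines. Here's a toy simulation (the function and the hop MTUs are made up for illustration; real PMTU Discovery happens inside the kernel's IP stack, driven by actual ICMP messages):

```python
def discover_path_mtu(hop_mtus, initial_mtu=1500):
    """Toy simulation of Path MTU Discovery over a route whose
    hops have the given per-link MTUs."""
    pmtu = initial_mtu
    while True:
        for hop_mtu in hop_mtus:
            # A packet of size pmtu, sent with "Don't Fragment" set.
            if pmtu > hop_mtu:
                # Too big for this hop: the router drops the packet and
                # sends ICMP Datagram Too Big, reporting its next-hop MTU.
                pmtu = hop_mtu
                break  # resend the lost data at the new, smaller size
        else:
            # The packet traversed every hop unfragmented:
            # pmtu is now the Path MTU for this destination.
            return pmtu

# Ethernet at both ends, but a smaller link in the middle:
print(discover_path_mtu([1500, 576, 1500]))  # 576
```

Each "Datagram Too Big" response ratchets the Path MTU down until a packet makes it all the way through, which is the whole trick.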

It's all pretty simple, really.

So What Goes Wrong?

Sometimes, overzealous firewall administrators decide that ICMP is a bad thing, and block it. This is fair enough on the surface: ICMP can be used both as a convenient flooding tool and a way to map networks. The thing is, you have to be careful which ICMP you block. If, specifically, you block the Datagram Too Big ICMP, then any attempt at Path MTU Discovery will fail quietly: packets will be dropped on the floor, and the request to re-send a smaller packet will never get back to the originating host.

You end up with a really weird error condition that tends to drop in and out as routes change, and is very hard to track down unless you know exactly what you're looking for, because by definition all of the evidence of the problem is being either blocked, or dropped on the floor.

My Cat Rocks

  • 5:22 PM

Cassie is a large, spoiled black cat, bearing a strange resemblance to a handbag.

This is Cassie. She's around 12 years old now, which I guess means at least this isn't kitty porn. Cassie is helping me celebrate post pictures of your cat to Javablogs day, my rather infantile gesture towards people who worry about whether Javablogs is on-topic or not.

My mother has a large black handbag. Out of the corner of your eye, you really can't tell the difference between the cat and the handbag.

After I abandoned Cassie and went to live on my own, she started pining for me, developed a nervous complaint, and had to have cortisone injections (although the hippy Fremantle vet did try aromatherapy first). I'd like to say I have this effect on women, but that would be stretching the truth a little.

So you've had a really great technical idea that's going to change the world? Here's a very simple sanity-check.

First, you're going to need some use-cases. Think of the precise situations in which your idea is going to be used. Each use-case should feature a user in a specific situation, using your new technology to reach a plausible goal. (Note: plausible goal. I'm assuming here that you've already determined that this use case is something that people will actually want to do, provided there is a convenient way to do it. If it isn't, stop and think of another one.)

Except... use-cases aren't supposed to dictate the form of the solution. Delete your technology from the story, and ask yourself the big question. "If my new technology didn't exist, how would the user go about reaching their goal?"

Now:

  • What are the advantages to the user of doing it your way? (This is important. To the user. Not to you, not to your paying clients if they're not the user.)
  • What level of adoption would you need before your way was sufficiently better than the other way for people to want to switch?
  • How will you manage to bootstrap your way to that level of adoption?

Make sure you closely examine your "advantages" for hidden assumptions. If one of your advantages is that a user might not know how to do it the existing way, how are they going to learn to do it your way?

If you can't confidently say your way is better, or you can't see a way of bootstrapping your technology to the point that its users see solid returns, then you are a solution in search of a problem. It may be a clever and elegant solution, but without a problem to solve, it's just going to sit on the shelf as an idle curiosity, gathering dust.

Note, I'm not talking about business models. Finding out how to profit from the technology is the next step, and one I'm totally unqualified to pursue. I'm just looking for the answer of whether the technology's existence actually makes the world a better place.

Here's an example: Sun's JXTA P2P architecture.

  • Use case: A developer wants to create a P2P application.
  • Alternative solution: The developer comes up with a new P2P architecture from scratch, or builds their application on another architecture such as Gnutella.
  • Advantage to the user: JXTA allows the developer to write their application on top of an existing P2P framework, saving them time and duplicated effort.
  • Necessary level of adoption: The architecture must be considered proven in the real-world, and thus trusted as a basis for new applications. Preferably, some widely-adopted P2P brand should run on top of JXTA.
  • Bootstrapping: Evangelise JXTA to the developer community, until such applications exist.

The fourth point seems a bit based on blind hope, doesn't it? You can't know the framework is worthwhile unless someone creates a large-scale, successful application. People won't want to risk a large-scale application on an unproved framework.

Of course, the killer app of P2P is file-sharing, and Sun couldn't really get into that arena themselves without causing some serious Marketing headaches. :)

Sometimes, it's not so much of a problem with frameworks, because they're written for the developers to use themselves. How about WebWork? (I'm not a WW developer, so the 'we' here actually refers to people other than myself.)

  • Use case: a developer wants to write a web application in Java
  • Alternative: use Struts
  • Advantage to user: we find Struts annoying
  • Necessary level of adoption: very low. So long as it's better than Struts, it's already made our lives easier
  • Bootstrapping: we use it ourselves and tell our friends

Sometimes, the test can fail very early. A lot of the dot-com flops could have done with this sort of self-examination.

  • Use case: A user wants to find a particular company on the web
  • Alternative: They try the obvious URLs, copy the URL from advertising material, or failing that they look it up on Google
  • Advantage to user: Erm... er...

I loathe CSS with a passion.

Correction. I loathe the fact that every web browser supports a different, incompatible subset of CSS2. W3C standards were supposed to save us from having to test pages in every single browser under the sun, but we're travelling at high speed in the opposite direction. We can blame Internet Explorer for getting the box model completely wrong, but even the more well-behaved browsers such as Safari and Mozilla don't support the whole standard, and have significant incompatibilities where they do.

CSS is great so long as you stick to a small number of heavily tested recipes. Stick with those and you're fine. Try to do something stupid like, say, build your own layout from first principles, and even if you spend the requisite day testing in multiple browsers and tweaking around the minor bugs, you'll still probably end up completely screwed because you've ended up relying on some property that one of the major browsers just doesn't support. Bastards.

(This post is the result of me banging my head against the fact that the only way to put liquid-layout block elements of variable height side-by-side works wonderfully in Safari, but isn't supported in Mozilla. Any relation it bears to the opinion I hold of CSS at other times is, frankly, purely coincidental.)