Power Laws in Self-Organising Collaboratively Filtered Decentralized Social Networks over the Semantic Web. With a side-order of Inversion of Control/Dependency Injection Bayesian Classification.
And yes, I'd like fries with that.
<Charles> My mother just rang to tell me she can't have lunch with me tomorrow, because she has to have lunch with a Hungarian dwarf.
<Candi> Surreal, meet Charles. Charles.. Surreal. I'm sure you two know each other.
<Charles> We seem to encounter each other regularly.
<Candi> I've noticed this.
Here's a funny story.
About six years after I wrote this blog post, detailing an idea I had for managing persistent login cookies, a post that was linked to from all over the place, implemented in a couple of high profile libraries and that still gets me referrers (mostly these days from Stack Overflow. Hi!); after six years of being embarrassed that the products I worked on used a "remember me" mechanism I felt was demonstrably inferior, I finally got around to implementing the algorithm in real code.
It didn't work. It quite spectacularly didn't work. Because of concurrency.
I think there's a lesson here about the difference between theory and practice, and the hubris of a young blogger naming something "Best Practice" that he hadn't even tried yet.
And that concurrency is, and always will be, a bastard.
Persistent login cookies are the cookies that are stored with your browser when you click the "remember me" button on the login form. I would like to be able to say that such cookies are obsolete, and we have a better way of handling user logins, but they aren't, and we don't.
The following recipe for persistent cookies requires no crypto more powerful than a good random number generator.
- Cookies are vulnerable. Between common browser cookie-theft vulnerabilities and cross-site scripting attacks, we must accept that cookies are not safe
- Persistent login cookies are on their own sufficient authentication to access a website. They are the equivalent of both a valid username and password rolled into one
- Users reuse passwords. Hence, any login cookie from which you can recover the user's password holds significantly more potential for harm than one from which you can not
- Binding persistent cookies to a particular IP address makes them not particularly persistent in many common cases
- A user may wish to have persistent cookies on multiple web browsers on different machines simultaneously
The cookie should consist of the user's username, followed by a separator character, followed by some large random number (128 bits seems mind-bogglingly large enough to be acceptable). The server keeps a table of number->username associations, which is looked up to verify the validity of the cookie. If the cookie supplies a random number and username that are mapped to each other in the table, the login is accepted.
At any time, a username may be mapped to several such numbers. Also, while incredibly unlikely, it does not matter if two usernames are mapped to the same random number.
A persistent cookie is good for a single login. When authentication is confirmed, the random number used to log in is invalidated and a brand new cookie assigned. Standard session-management handles the credentials for the life of the session, so the newly assigned cookie will not be checked until the next session (at which point it, too, will be invalidated after use).
The server need not make the effort of deliberately trying to avoid re-assigning random numbers that have been used before: the chance of it happening is so low that even if it did, nobody would know to make use of it.
When a user logs out through some deliberate logout function, their current cookie number is also invalidated. The user also has an option somewhere to clear all persistent logins being remembered by the system, just in case.
Periodically, the database is purged of associations older than a certain time-period (three months, perhaps: the size of the table would be far more an issue than any possibilities of collision in a 128 bit random space).
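A minimal sketch of the recipe so far, in Python. The class name, the method names, the ":" separator and the in-memory dictionary are all assumptions made for illustration; a real site would keep the table in its database. The random number comes from the standard library's secrets module.

```python
import secrets
import time

MAX_AGE_SECONDS = 90 * 24 * 60 * 60  # "three months, perhaps"
SEPARATOR = ":"  # assumed separator character


class PersistentLoginStore:
    """In-memory stand-in for the server's number->username table."""

    def __init__(self):
        # Keyed on (username, number), so one user can hold several numbers
        # and two users sharing a number would do no harm.
        self._issued = {}

    def issue(self, username, now=None):
        """Create a new association and return the cookie value for it."""
        number = secrets.token_hex(16)  # 128 random bits, hex encoded
        self._issued[(username, number)] = time.time() if now is None else now
        return f"{username}{SEPARATOR}{number}"

    def login(self, cookie):
        """Return (username, fresh_cookie), or None if the cookie is not valid.

        A cookie is good for a single login: the number is removed from the
        table as it is checked, and a brand new cookie is handed back.
        """
        # The number is hex, so splitting on the last separator is safe.
        username, sep, number = cookie.rpartition(SEPARATOR)
        if not sep or self._issued.pop((username, number), None) is None:
            return None
        return username, self.issue(username)

    def logout(self, cookie):
        """Invalidate the number carried by the current cookie."""
        username, _, number = cookie.rpartition(SEPARATOR)
        self._issued.pop((username, number), None)

    def clear_all(self, username):
        """The 'forget every remembered login' option."""
        for key in [k for k in self._issued if k[0] == username]:
            del self._issued[key]

    def purge(self, now=None):
        """Drop associations older than the cut-off."""
        now = time.time() if now is None else now
        for key, issued_at in list(self._issued.items()):
            if now - issued_at > MAX_AGE_SECONDS:
                del self._issued[key]
```

The session itself is left to whatever session management the site already has; this sketch only covers the table lookups.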
The following user functions must not be reachable through a cookie-based login, but only through the typing of a valid password:
- Changing the user's password
- Changing the user's email address (especially if email-based password recovery is used)
- Any access to the user's address, payment details or financial information
- Any ability to make a purchase
If the login cookie is compromised, the attacker has access to the common functions of the site as that user. This is inevitable whatever the cookie contains. However, the attacker can not:
- Access sensitive user information
- Spend the user's money
- Recover the user's password and try it on other sites
- Prevent the user from receiving notifications from the site of things that may have been done in their name
- Share the stolen login with others
The mutating nature of the cookie also provides a much smaller window of opportunity for an attacker to exploit a stolen cookie, and means the attacker must be far more careful they don't end up with a useless set of credentials.
Update: Barry Jaspan suggests an addition to the protocol that would further reduce the window of opportunity for stolen cookies: if a cookie that has been known to be used before (and thus invalidated) is presented, treat it as evidence of an attack and invalidate all saved logins for that user.
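Barry's addition could be sketched roughly like this in Python. The class and method names are invented, and remembering used numbers in a per-user set is just one assumed way of recognising a cookie that has been used before.

```python
import secrets


class TheftAwareLoginStore:
    """Sketch: single-use login numbers, with replay treated as theft."""

    def __init__(self):
        self._valid = {}  # username -> numbers that may still be used
        self._used = {}   # username -> numbers that have already been used

    def issue(self, username):
        number = secrets.token_hex(16)  # 128 random bits
        self._valid.setdefault(username, set()).add(number)
        return number

    def login(self, username, number):
        """Return a fresh number on success, or None if the login is refused."""
        valid = self._valid.get(username, set())
        if number in valid:
            valid.remove(number)
            self._used.setdefault(username, set()).add(number)
            return self.issue(username)
        if number in self._used.get(username, set()):
            # A used number has come back: assume the cookie was stolen
            # and invalidate every saved login for this user.
            self._valid.pop(username, None)
        return None
```

In this sketch, two legitimate requests that present the same cookie at nearly the same moment look exactly like a replay, so the second one would wipe out the user's saved logins.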
Well, it's been exactly a month since Beta 1 was released, so today we finally got the Cannon-Brookes Seal of Approval for the release of... Confluence Beta 3! (product page) (documentation site) (public sandbox)
Beta 2 escaped last Friday, but didn't survive long. It was a mercy killing.
I hope you'll excuse me selling out my principles here and using my blog as advertising space, but lately I've spent a significant proportion of my days working on this thing, and I think it's pretty cool.
I was given the task of writing the release notes, and it turns out that in the last month -- well, less considering the interruptions of Christmas, New Year, and the week or so we all spent working on Javablogs -- we resolved something like 95 issues, 32 of which were either new features or enhancements.
Go us. Mad props to Ara, Armond, Dave, Mike and Ross. I am not worthy.
Every bottle of James Squire Original Amber Ale has a little story about the beer printed on the label. This one is all about how often the original James Squire got laid. I suppose it's fitting: maybe someone will be inspired by label and beer, and use it as an excuse to wake up the next morning and wonder how the hell they got drunk enough for that to look attractive (a question generally shared by the other party). However, it's slightly less appropriate to me, having spent this Sunday night in my apartment with a six-pack of the aforementioned, and the X-Box.
Which is an interesting contrast to spending Friday night being rained on in an open-air cinema, watching Grease. I have this theory that every member of my generation (being, as I am, on the fading edge of Generation X, but not quite into the rise of Generation Y) knows this movie off by heart, whether they want to or not.
I didn't join in the singing. Although if I'd had more beer then instead of now, I probably would have.
Anyway, on with the Unit-testing stuff.
I don't write my tests first. I know very few people who do. It's awkward, fragile and unintuitive. For those who enjoy it and find that it's none of those things, great! At best, I'll do a mixture of the two, and have something vaguely functional, then write a testcase and make sure it tests all the conditions I'd like it to pass under even though I know that it'll fail.
I like to write my tests first, often to the point of writing the test, and then using the IDE shortcuts to create the classes and methods I'm testing. Not writing tests first is generally a sign I'm getting lazy or complacent. I find that writing a test first focuses my mind on the precise problem I'm trying to solve, and gives me a nice indication of when I've solved it. In a way, it keeps me honest.
I also like the feeling of certainty. You write something that deliberately doesn't work. You make it work. You are certain that the state of your code has changed. If you write a test that works, change some code and the test still works, have you really done anything?
Worse, if you have some code, and then write a test against it that works, are you sure the test is really testing what you want it to?
I also find that the need for code to be testable colours the way I do design. I'd much rather write code that is easily testable, because then I spend less time writing (and running) tests. Experience has taught me that as soon as you have tests that rely on the container being up to run, you're quite likely to slow down by an order of magnitude. So I try to avoid writing tests that involve a container.
I wouldn't, however, say I do test-driven design. Tests are one of the factors that influence my design, but they certainly aren't some overwhelming force that drives everything else.
Tool Boy (TB) gave me a rant about unit tests saving time. I asked the sorry sod, "if they save so much time then how come you're so late on all of your work?" He talked all about the bugs that the unit tests would find... The truth is that we're not held accountable for that too much. There are always too many parts, so its never clear why the system is unstable... Get your stuff done and shut up about your ficticious unit tests. Time to deployment is job #1, cost is #2, quality is for people not working in IT.
This is what finally convinced me that Saunders is indeed satire. Quality is cost, is time to deployment. I've seen attempts to separate them, and they've all ended in tears.
As requirements complexity increases, test case code becomes a duplicate of the code being tested
This is the big testing danger-sign to watch out for, as I mentioned in a previous rant about mock objects. It's the reason I continue to fight a rear-guard action against the direct testing of private methods. Tests should be stimulus-response. I poke the object this way, it gives me back this result. I poke the object this other way, and the resulting state of the application model is this. Moving in so close that you're testing each individual neural pulse, rather than the aggregate response, leads to tests that are either meaningless, unmaintainable, or both.
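To make the stimulus-response idea concrete, here is a small Python sketch using the standard unittest module. The Account class and everything in it are invented for illustration. The tests only poke the public methods and look at the result or the resulting state; the private _record helper is never called directly, so it can be rewritten without touching a test.

```python
import unittest


class Account:
    """Hypothetical class under test."""

    def __init__(self):
        self._entries = []

    def deposit(self, amount):
        self._record(amount)

    def withdraw(self, amount):
        if amount > self.balance():
            raise ValueError("insufficient funds")
        self._record(-amount)

    def balance(self):
        return sum(self._entries)

    def _record(self, amount):
        # Private detail: the tests below never mention it.
        self._entries.append(amount)


class AccountTest(unittest.TestCase):
    def test_deposit_then_withdraw_leaves_the_difference(self):
        account = Account()
        account.deposit(100)
        account.withdraw(30)
        self.assertEqual(account.balance(), 70)

    def test_overdraft_is_refused_and_changes_nothing(self):
        account = Account()
        account.deposit(50)
        with self.assertRaises(ValueError):
            account.withdraw(80)
        self.assertEqual(account.balance(), 50)
```

A test that instead asserted on the contents of _entries after each call would be a copy of the implementation, which is the duplication the quote warns about.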
One of the less visible improvements that was made to Javablogs over the New Year was an extensive refactoring of the component that retrieves and parses RSS feeds. Aside from the important task of... well... making it work reliably, we made it do a few useful things like update feed and site URLs automatically when they move.
Anyway, we encountered one problem with a certain blog host that will remain nameless. Actually, we encountered two problems, but the fact that half the time it would hang indefinitely when asked to serve a feed isn't relevant to this rant.
The problem lay with what happened when you tried to retrieve the RSS feed of a blog that no longer exists. This unnamed blog host redirects you to a nice HTML page that tells you the blog isn't there any more. From a human usability perspective, this is fine.
From the point of view of a bot, though, it sucks.
The RSS reader expects to find an XML document at the given URL. Instead it gets HTML back, which causes the poor little XML parser to shit itself. Sure, we could check the Content-type, but nobody gives a meaningful content-type for RSS anyway so that would just cause more problems. Regardless, it's still a fundamental misrepresentation of the real error: the document isn't the wrong type, it's not there any more.
HTTP already has a perfectly good way to tell any agent a resource isn't there any more. It's called the 410 response-code. 410 means "Gone". The resource used to be there. It isn't any more. We don't know where else it might be. Deal. If our bot got a 410 response-code, it would know categorically that it shouldn't bother to check that RSS feed again, because it's gone for good.
Well it could, but it wouldn't because we never bothered to program that bit. Nobody actually uses the 410 response-code. But even the somewhat less accurate "404 Not Found" response would be more useful than sending us back a redirect to a deceptive "200 OK" and then dumping the wrong document in our laps[1].
HTTP has a bunch of useful response codes. It's even got a few that nobody's worked out how to use yet, like "402 Payment Required". As web application developers, we should be familiar with the codes and use them when they are appropriate. It makes our applications more friendly to any clients that may want to visit them, whether they have human eyes or not.
The only caveat is related to Internet Explorer. IE is traditionally very unfriendly to any page that is served up as an error. It assumes that any error page below a certain length can't possibly contain enough information to explain to the user what really went wrong. If it finds a too-short error page, it will replace it with the standard Internet Explorer error page, which still completely fails to explain to the user what really went wrong, but uses many more words to do it.
If you find you're getting that problem with IE, just pad your page with a few hundred characters of <!-- HTML comment --> until your own page appears once more.
1 The bot doesn't stop checking 404'd feeds entirely. The web being what it is, pages vanish temporarily all the time. It does, however, check for them less often the longer they're not there.
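For what it's worth, the bot-side policy described in this post can be sketched in a few lines of Python. This is not the Javablogs code: the function name, the one-hour base interval, the doubling back-off and the one-week ceiling are all assumptions chosen for illustration.

```python
from datetime import timedelta

BASE_INTERVAL = timedelta(hours=1)  # assumed normal polling interval
MAX_INTERVAL = timedelta(days=7)    # assumed ceiling for the back-off


def next_check(status, consecutive_misses):
    """Return how long to wait before polling a feed again, or None to stop.

    status is the HTTP response code from the last fetch, and
    consecutive_misses counts the 404s seen in a row before this one.
    """
    if status == 410:
        # Gone: the resource is not coming back, so stop asking.
        return None
    if status == 404:
        # Not Found may be temporary: keep checking, but less and less often.
        return min(BASE_INTERVAL * 2 ** (consecutive_misses + 1), MAX_INTERVAL)
    return BASE_INTERVAL
```

None of this helps when the host answers with a redirect to a "200 OK" HTML page, because the status code the bot sees no longer says anything about the feed.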
People sometimes wonder why there are so many sharp edges in the average PC case. adri wonders at the inability to insert PCI cards without sustaining injuries.
There's a secret, dark reason for all this.
The computer wants your blood.
This isn't just a couple of lazy PC manufacturers not caring enough to smooth the edges in their cases. The simple fact is that getting a computer to work is a dark rite, and the components may very well need your blood to bring them to life.
The notable exception, of course, is Apple computers. Apple seem to go to great lengths with the Powermac to make the case easily accessible, and far less dangerous.
This is because by the time you've bought one, Apple already own your soul.
As all Extreme Bridge Builders know, it is unrealistic to expect bridge builders to learn their chosen vocation to this level of proficiency.
In 480BC, Xerxes, King of Persia, spanned the Hellespont with over 600 ships, creating a gigantic pontoon bridge over which he could invade Greece. That might give you some idea of just how long the human race has been building bridges. We're still in our first century of programming, and we're still not very good at it. We're nailing planks together and seeing what happens.
Those areas of Computer Science that can be proved mathematically have a head-start, because we've been doing maths for a long time as well. Unfortunately, most of what we do is Psychology, not maths. Once the algorithms have been worked out by the mathematically inclined, there's only so many times they need to be written. The rest is twisting our brains, and the brains of our co-workers, into the right shape to tell the computer what to do and not get too much wrong on the way: all the time trying to interface with the bizarre world of management theory. One redeeming feature of management consultants: unlike us, at least they don't pretend what they do is a science.
Science is about measuring things. Once you can measure something, you can change a few factors, measure it again, and try to work out why it's changing. Do that for long enough, you'll be able to model the whole bridge without building it. Right now, though, we can't even get two practitioners to agree on metrics for measuring developer productivity, code quality, or even the success of an entire software project:
"The bridge sank into the river! Half our troops drowned!"
"Yes, but if we'd built it to be unsinkable, it would have taken three times as long and the Greeks would have been ready for us!"
From that perspective, the "scientific method" of Software Engineering, at least as it is practiced at the coal-face of development[1], is as follows:
- Gather anecdotal evidence from your experience, and the experience of people whose opinions are like yours.
- Come up with a mental model that accounts for around 80% of your anecdotal evidence.
- Come up with plausible reasons to ignore the remaining 20%.
- Extrapolate your mental model until it applies generally. You will find carefully-chosen metaphors especially useful at this point.
You can then evangelise your theory, in competition with the hundreds of other theories being thrown around by your colleagues in the field.
These methods apply both to development methodologies (like Agile Development™, and whatever catchy word we can come up with to describe stuff that isn't Agile™), and to trends in programming tools and techniques like patterns, IOC, MVC, object-orientation, structured programming, and so on.
Proof is largely impossible. There are just too many variables to isolate in horizontal surveys, conducting an experiment using real programmers in a plausible task is too expensive, and nobody has the time anyway.
But boy, don't we enjoy pretending we have all the answers? I know I do: this blog is full of blanket pronouncements on what are the Right and Wrong ways to do things: some of them contradictory. In that way, I'm a microcosm of the programming world.
Object domain models are a good thing. Except, of course, we all know that object orientation has failed. Well, no, we're just perversely ignoring Smalltalk in favour of inferior object systems. Although really, pervasive OO is just a bad substitute for a real LISP environment.
Watch the writings of programmers for long enough, and you won't be able to code without doing something that's considered harmful, but that's preached as gospel truth by others. You can read an oft-revered article like worse-is-better, without knowing that even its author changes his mind about every couple of years.
And whatever you do, don't mention Postel's Law!
All these ideas fight in the bizarre landscape of the computing market. It's like watching evolution at work: being forced to realise that Darwinism is a statistical process that doesn't apply to individual species. You have to have faith the general trend is for the better despite the fact that the most efficient carnivore can have a bad run of luck and die out, while some completely unremarkable scavenger can find itself in a lucky niche and plod along forever.
Except this is evolution played at maximum fast-forward, with an ice-age every couple of years and meteorites hitting the planet constantly from every angle.
Sometimes, very rarely, some idea survives long enough and is generally applicable enough that it is no longer challenged. The list is particularly short: maybe Fred Brooks' "The Mythical Man-Month" qualifies, the book being, once again, anecdotal: Brooks' personal experiences on projects coalesced into a book.
So what do we do?
Well, the obvious answer is to always be critical of both new ideas and accepted wisdom. How well does our own anecdotal experience meet with others? Are the reasons to ignore the contradictory evidence really that convincing? What are the risks involved?
That said, you're never going to find certainty, because there is none. Worse, ignoring anything that you're not completely sure about is the equivalent of stagnation. If you wait until everyone else is doing it successfully, you'll always be behind the game. So that means you have to keep your eyes open for good-sounding ideas, and you have to take some risks.
But still, choose your battles wisely, and make sure you have an escape-plan if things go pear-shaped.
If we keep doing this for long enough, we might end up as good at writing software as the Persians were at building bridges.
1 Being a two-time University drop-out, I'm not sure how this is done in academia, but I suspect from memories of Pascal-evangelism from my first-year lecturers that they do pretty much the same thing, just a lot slower.
note: After hitting 'save' on this post, and getting to the 'assign secondary categories' stage in MT, I realised that where this really belonged was in an as-yet-non-existent 'rambling aimlessly' category. A category which, as you should see below, I now have created.
Everyone has, at some point, encountered the notion that opinions can be neither true nor false. Which is true to some extent, in my opinion.
An opinion is a statement of belief. The only truth that can be drawn from an opinion is that the person stating it holds the belief. You can't infer from the opinion that there is any fact behind it. Thus, you can honestly hold the opinion that "X is true", even when X is, in fact, false.
The problem is, though, that having half-heard and not understood this concept, people get it into their heads that because an opinion can neither be true nor false, this means they're allowed to hold any opinion unchallenged. "It's just my opinion", they say. "Opinions can't be false, they just are!"
"Don't hassle me with your... facts!"
While an opinion may not be false, it can be irrational.
Opinions generally represent some objective truth that can be true or false. That objective truth can be argued. And if you are left with an opinion that you have no justification for holding, then while you can continue to hold that opinion as long as you want, you are no longer rationally holding that opinion.
I last ran into this arguing over a beer with a couple of strangers about the Apollo moon landing. Now on one side of this argument, we have a bunch of plausible-sounding objections, none of which bear up to close scrutiny, and all of which have been thoroughly countered. On the other side of the conspiracy theory, you have to hold the belief that of the hundreds of people who would have had to be involved in the hoax, most of whom were committed scientists rather than the usual gang of black-clad government agents, and who are now getting to the "setting the record straight" kind of age, none have actually come forward and admitted they made it up.
So after going through the usual gang of objections (the stars and shadows, and so on) and extracting grudging admissions that the chance of a conspiracy that big and with that many people involved staying secret so long is incredibly low, we ended up back at the same old place. "It's just my opinion that it never happened. Opinions can't be false, right?"
People believe they have some right, some moral imperative to hold any stupid opinion they want. I'm sure in the USA, it's even Constitutionally protected. You have the God-given right to hold any damn opinion you want, even if it runs counter to every single fact you've encountered.
Where's my God-given right not to be subjected to stupidity?
I've started getting lots of spam with similar-looking subject-lines. Luckily, this makes them easy for me to filter out even when Mail.app misses them:
- Re: JHAL, dropping the ladle
- Re: ZOQRLR, somewhere far away
- Re: XEBK, where's the devil
- Re: GIWOVHU, wishes! for success!
- Re: HVRV, last prisoners were
The recovering Star Trek TNG nerd in me is waiting for:
- Re: SOKATH, His eyes open!
- Re: SHAKA, When the walls fell
and, of course
- Re: DARMOK, and Jalad at Tanagra.
(Yes, I had to look the exact phrases up. I'm not that much of a Star Trek nerd)
This particular episode of TNG stuck in my mind because the premise always annoyed me. Supposedly, the Universal Translator couldn't understand what this race were saying because they spoke entirely in metaphor. Huh? Since when has any language been anything but a metaphor? No word is its subject, words are all indirect references: metaphors. Ceci n'est pas une pipe and all that.
In the Livejournal Macosx community, one user noted some interesting behaviour in Mac OS X. When you get to the bottom of what's going on, it's an interesting insight into the way a couple of unrelated design decisions can turn around to produce unexpected behaviour that's only really predictable in hindsight. There's a bit of a lesson in here.
The user in question was going through O'Reilly's Mac OS X Hacks book, and tried out the whoami command, which is there to tell you who you are. It produces the following output:
    gnosis:~ cmiller$ whoami
    cmiller
However, he also discovered that the undocumented command whoamI (with an upper-case I) gives much more interesting output:
    gnosis:~ cmiller$ whoamI
    uid=501(cmiller) gid=501(cmiller) groups=501(cmiller), 79(appserverusr), 80(admin), 81(appserveradm)
Furthermore, this enhanced command only seemed to be available in the bash shell. In tcsh, it was only available if you addressed it by its full path.
    [gnosis:~] cmiller% whoamI
    tcsh: whoamI: Command not found.
    [gnosis:~] cmiller% /usr/bin/whoamI
    uid=501(cmiller) gid=501(cmiller) groups=501(cmiller), 79(appserverusr), 80(admin), 81(appserveradm)
The explanation is pretty simple, if a bit long-winded.
- The HFS+ filesystem under OS X is case-preserving, but not case-sensitive. Thus, whoami and whoamI end up addressing exactly the same file.
- A user familiar with Unix will recognise the output of whoamI to be identical to that of the id command. Deeper investigation shows that the whoami, id and groups commands are all hard-links to the same binary, which checks the value of argv to see what it should be running as. Since this is a Unix program, and thus is case-sensitive, it doesn't recognise that whoamI is a legitimate way to call it, and falls back on its default behaviour: id.
- The tcsh shell maintains an internal hash of the programs that are on your default search-path. Often, if you're messing with the contents of your path, you need to call the internal shell-command rehash to have it rebuild the hash. Once again, tcsh is a Unix program, and thus assumes case-sensitivity. whoamI isn't on its search-path, and thus it's not found unless you specify explicitly where to find the file.
- bash, on the other hand, either doesn't maintain such a hash, or doesn't trust it. It's quite happy to ask the filesystem if whoamI exists, and run it for you.
So there it is. A series of rational design-decisions in four unconnected components combines to produce unpredictable results. So where's the lesson in all this for programmers?
Each component aside from bash is making an assumption about the behaviour of another component. The filesystem, by definition, is the final arbiter of whether two filenames are identical or not. On the other hand, tcsh and the whoami/id/groups binary each believe that they already know how the filesystem functions, and replicate little bits of its behaviour internally as optimisations and shortcuts.
So when the behaviour of the filesystem changes from the Unix default of being case-sensitive to the OS X default of just being case-preserving, it causes unpredictable behaviour in those applications.
It's really just a practical example of the value of Once and Only Once. A system is both more robust and more flexible if each question has an authoritative answer from only one place.
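The whole interaction can be mimicked with a toy model in Python. Nothing here is the real code of HFS+, tcsh, bash or the id binary: every name is made up, and the command outputs are replaced by short labels. It is only meant to show two components each keeping their own private idea of what a filename is.

```python
class CaseInsensitiveDir:
    """Toy stand-in for a case-preserving, case-insensitive directory."""

    def __init__(self, names):
        self._by_folded = {name.lower(): name for name in names}

    def exists(self, name):
        return name.lower() in self._by_folded


def multicall_binary(invoked_as):
    """One program behind three names, dispatching on an exact name match."""
    if invoked_as == "whoami":
        return "username only"
    if invoked_as == "groups":
        return "group list"
    # Any name it does not recognise, "whoamI" included, gets the default.
    return "full id line"


usr_bin = CaseInsensitiveDir(["whoami", "id", "groups"])
command_hash = {"whoami", "id", "groups"}  # the shell's own case-sensitive copy


def hashing_shell(command):
    """tcsh-like: trusts its cached list instead of asking the filesystem."""
    if command not in command_hash:
        return f"{command}: Command not found."
    return multicall_binary(command)


def asking_shell(command):
    """bash-like: lets the filesystem decide whether the name exists."""
    if not usr_bin.exists(command):
        return f"{command}: command not found"
    return multicall_binary(command)
```

With these definitions, asking_shell("whoamI") finds the file and gets the default behaviour back, while hashing_shell("whoamI") reports that the command is not found. Only the directory object is answering the question "are these two names the same?" from the authoritative place.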
As some of you may have noticed, I'm messing with Apache's mod_rewrite so I can make my site unavailable to various spammers, annoying bots and (I must admit to my main motivation) the worst of the anonymous-comment lusers who have started infesting the Java blogosphere.
I mean seriously. Anonymous comment-flames? "Attack-blogs" written specifically to insult one person? Don't you think we could progress just a little beyond the Slashdot mentality? For fuck's sake people, grow up.
I have nothing against anonymity per se, and I personally loathe the "We won't talk to you if you don't join our club!" sites that want you to sign up before you can leave a comment. I'm just rather saddened by the fact that anonymity is inevitably used as an excuse by a certain proportion of the population to run around acting like five-year olds who have discovered that saying "bum" is hilarious, and want to share that with as much of the world as often as possible.
Anyway, the reason some of you may have noticed that is because my prototype ban-manager script didn't know how to deal with an empty banlist, went slightly insane and banned the whole world.
It should all be fixed now.
On the way to work today, I realised quite unexpectedly that Blame it on the Boogie by the Jackson Five and The Milkshake by the Village People are exactly the same song. I haven't checked, but I'm pretty sure they're even in the same key.
Aren't you glad you don't have my brain?
Addendum: In my opinion, the funniest line in King Missile's Detachable Penis is: “He wanted twenty-two bucks, but I talked him down to seventeen.” For some reason, that's the line that pushes the absurdity over the edge.
I find it annoying that in order to get my Mac to check my spelling against an English dictionary instead of an American English dictionary, I must select from the following list:
That's right. American English has been promoted to just "English". Sure, I know this is an American program, but why not be consistent and label it "American English"?
For those who aren't familiar with this particular bit of history: when Noah Webster was first compiling his American dictionary, he decided that it was a great opportunity to impose his theory of simplified spelling on the public. Some of his ideas stuck in future editions (such as the changing of most –our and –re words to end in –or and –er respectively), others he later relented on (such as spelling 'determine' without its final 'e', or replacing 'crowd' with 'croud').
As such, I find it rather offensive that on my computer, Noah Webster's whim has become plain "English" whereas its parent, the English language, has been ghettoised into "British English".
On a related note, when working on some code yesterday, I was forced into a spelling corner. The CSS standard spells 'colour' the American way: 'color'. If I'm writing a method to customise CSS files, should I spell it the way that is natural to both me and my colleagues, or should I go with the way it's spelled inside CSS?
After some internal debate, and a short rant to a cow-orker, I decided it was probably best to go with 'color' for the sake of external consistency. An hour later, of course, I was doing a global search-and-replace for everywhere my fingers had typed 'colour' for me without being asked.
I've said before that every Java project ends up writing (or using) a StringUtil class: a bunch of static methods to do things to Strings that Sun's class doesn't cater for.
Today, I found myself wanting to do a pretty basic String operation that wasn't on the main class, so I sent IDEA off hunting... and there were seven classes in my Classpath called either StringUtil or StringUtils, all of them from different projects.
So I wrote the method myself. Finding the one I should have been using amongst that lot was just too much effort :)
Sometimes I think there should be another level of class visibility: "only visible to classes loaded from the same location (jar, file classpath base, URL codebase)". That would make it easier to navigate a library: all the glue classes, helpers, utils and impls would be safely stowed away, and you'd just be browsing the real library interface.
Then, I think how much this could be abused by overzealous information hiders to make it impossible to do anything with a library that wasn't explicitly foreseen by its author, and I realise it's a bad idea after all.