Yesterday this account of a serious vulnerability in most major Java application servers crossed my Twitter feed a few times. The description, while thorough, is written in security researcher, so since it’s an important thing for developers to understand, I thought I would rewrite the important bits in developer.
What is the immediate bug?
A custom deserialization method in Apache commons-collections contains reflection logic that can be manipulated to execute arbitrary code. Because of the way Java serialization works, this means that any application that accepts untrusted data to deserialize, and that has commons-collections in its classpath, can be exploited to run arbitrary code.
The immediate fix is to patch commons-collections so that it does not contain the exploitable code, a process made more difficult by just how many different libraries and applications use how many different versions of commons.
The immediate fix is also utterly insufficient. It’s like finding your first XSS bug in a program that has never cared about XSS before, patching it, and then thinking “Phew, I’m safe.”
So what is the real problem?
The problem, described in the talk the exploit was first raised in — Marshalling Pickles — is that arbitrary object deserialization (or marshalling, or un-pickling, whatever your language calls it) is inherently unsafe, and should never be performed on untrusted data.
This is in no way unique to Java. Any language that allows the “un-pickling” of arbitrary object types can fall victim to this class of vulnerability. For example, the same issue with YAML was used as a vector to exploit Ruby on Rails.
The way this kind of serialization works, the serialization format describes the objects that it contains, and the raw data that needs to be pushed into those objects. Because this happens at read time, before the surrounding program gets a chance to verify these are actually the objects it is looking for, a stream of serialized objects can cause the environment to load any object that is serializable, and populate it with any data that is valid for that object.
This means that if there is any object reachable from your runtime that declares itself serializable and could be fooled into doing something bad by malicious data, then it can be exploited through deserialization. This is a mind-bogglingly enormous amount of potentially vulnerable and mostly un-audited code.
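To make the mechanics concrete, here's a minimal, defanged sketch. The class and field names are my invention, not the real commons-collections gadget (which abuses reflection in InvokerTransformer), but the shape is the same: a custom readObject runs before the application gets any say in the matter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical gadget class, invented for illustration. Any
// Serializable class shaped like this on the classpath will do.
class CacheRefresher implements Serializable {
    private static final long serialVersionUID = 1L;
    static String lastCommand; // records what readObject "executed"

    String command;

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // Runs at *read* time, with attacker-controlled field data.
        // A real gadget would call Runtime.getRuntime().exec(command).
        lastCommand = command;
    }
}

public class GadgetDemo {
    public static void main(String[] args) throws Exception {
        // The "attacker" crafts a serialized payload...
        CacheRefresher payload = new CacheRefresher();
        payload.command = "do-something-evil";

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(payload);
        }

        // ...and the "victim" deserializes it. Even if the victim then
        // checks the type and throws the object away, readObject has
        // already run.
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            Object o = in.readObject();
            System.out.println("Deserialized a " + o.getClass().getName()
                    + "; readObject already ran: " + CacheRefresher.lastCommand);
        }
    }
}
```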
Deserialization vulnerabilities are a class of bug like XSS or SQL Injection. It just takes one careless bit of code to ruin your day, and far too many people writing that code aren’t even aware of the problem. Combine this with the fact that the code being exploited could be hiding inside any of the probably millions of third-party classes in your application, and you’re in for a bad time.
Your best fix is just not to risk it in the first place. Don’t deserialize untrusted data.
The mitigation for this class of vulnerability is to reduce the surface area available to attack. If only a limited number of objects can be reached from deserialization, those objects can be carefully audited to make sure they're safe, and adding a new random library to your system won't unexpectedly make you vulnerable. For example, Python's YAML implementation provides a safe_load method that limits object deserialization to a small set of known objects, essentially reducing it to a JSON-like format.
Your best bet in Java is not to use Java serialization unless you absolutely trust whoever is producing the data. If you really want to use serialization, you can restrict the objects available to be deserialized by overriding the resolveClass method on ObjectInputStream. This way you can ensure only objects you have verified are safe will be populated during deserialization.
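A minimal sketch of what that looks like. ObjectInputStream and resolveClass are real JDK API; the allow-list contents here are arbitrary placeholders you'd replace with the classes your application actually expects.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// "Look-ahead" deserialization: reject any class not on an explicit
// allow-list, before the object is ever instantiated.
class SafeObjectInputStream extends ObjectInputStream {
    // Placeholder allow-list; a real one would name your own classes.
    private static final Set<String> ALLOWED = new HashSet<>(
            Arrays.asList("java.lang.String", "java.util.HashMap"));

    SafeObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        // Called with the class named in the stream, before any of
        // that class's deserialization code gets to run.
        if (!ALLOWED.contains(desc.getName())) {
            throw new InvalidClassException(
                    "Unauthorized deserialization attempt", desc.getName());
        }
        return super.resolveClass(desc);
    }
}
```

(Newer JVMs also offer a built-in version of this idea in the form of serialization filters, but the resolveClass override works everywhere.)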
Or just don't use serialization for data transfer. Nine times out of ten, tightly coupling your wire format with your object model isn’t something future maintainers of your system are going to thank you for.
Edited November 9 to add the reference to the developerWorks Look-Ahead Deserialization article, after it was pointed out to me by a couple of different people.
My friends on Facebook are generally a tech-literate and cynical bunch, so the ratio of people who fell for the recent spate of “Repost this legalese to regain control of your content” chain-mail hoaxes vs the people who have posted sarcastic reactions to it is about one to twelve.
And that bugs me.
We (the tech industry, but more broadly society) have created these Internet agoras. To members, these sites are vital means of maintaining contact with friends and loved ones, of not feeling left out of important parts of their lives. But the same people will grasp at the most tenuous of straws if it gives them a slight hope that they might claw back some sense of ownership, safety and control.
Every time a social media site changes its defaults, loosens its privacy settings or tightens its licensing, we tend to take lack of action by its members as tacit acceptance that privacy and ownership just don't matter. Hoaxes like this tell us otherwise. People feel trapped and helpless in a complex, baffling system. They want a way to assert control over their online lives, and they don't understand why it's not as simple and obvious as saying “I wrote this. I took these photos. They are mine.”
1. Write first draft.
2. Publish.
3. Find a dozen things wrong with published post, frantically fix them before too many people read the article.
4. GOTO 3
Number of post-publication edits for this post: 4
Remember back in 2003 when blogging was going to take over the world? When we were writing odes to blogging, building popular tools to map the blogosphere, actually using the word blogosphere with a mostly straight face, and wringing our hands over every new entrant in the field and every Google index update?
Sure, the component parts of blogging are everywhere now. The Internet is drowning in self-publishing, link-sharing, articles scrolling by in reverse-chronological order. It's no coincidence that the most popular CMS on the public Internet, by a pretty ridiculous margin, is a blogging platform.
But somewhere around a decade ago, the soul of blogging died. The heterogeneous community using syndication technologies to create collaboratively-filtered networks of trust and attention between personally-curated websites, forming spontaneous micro-communities in the negative space between them? That’s the thing we were all saying would take over the world, and instead “blogging” dwindled back to being a feature of corporate websites, a format for online journalism, and a hobby of techies who like running their own web pages.
Going back over fourteen years of my own blog history was an interesting lesson in how this blog changed over the years. There are entire classes of post that filled the pages of this site in 2002, but that were not to be seen five years later. Some of those changes were due to me changing behind the blog. Many were due to the Internet changing around it.
So what happened to blogging?
Digg stole its community.
And then reddit and Hacker News, but Digg did it first.
Kuro5hin demanded users share substantial things they wrote themselves; everything else was “Mindless Link Propagation”. Digg took MLP and changed the shape of the Internet with it.
In doing so, Digg created a devoted platform for one of the core activities, and most common entry-points of blogging: holding conversations about things written elsewhere. Their platform was far easier to get involved in, far easier to set up, and solved that one big question of blogging newbies: “How do I get anyone to even read what I’m writing?” with centralisation and gamification.
Bloggers didn't jump ship for Digg, but equally Digg didn't contribute to blogging. Visitors from aggregation sites notoriously never looked deeper into the sites they were visiting than the single article that was linked, and the burst of syndication subscribers a blogger would normally get if one of the hubs of their community linked to them just never came from aggregation sites.
Bloggers did, however, find themselves having to take part in these communities. At first because more often than not aggregators were where the conversation was happening about the things they were writing, and writing about. Later, because they’re where readers come from. For many people trying to make money writing on the Internet today, links from reddit are how you survive.
For their part, aggregation site users tend to hold bloggers in the lowest of low esteem, even when linking to them. Blogging is narcissistic. Who are they to remain aloof from the community like that, to share links and posts on their own website instead of contributing them to the centralised collective?
It is this sense of community that even turned some aggregators into creators, beyond the surfacing of links or crowdsourced comments about them. Like “Ask Slashdot” before it, some of the most popular communities on reddit are built around user-contributed posts. Overall, though, links still rule the site.
Users of aggregators tend to reserve their greatest vitriol for sites that aggregate or republish things from their website, whether it be something that was original to the site, or even if it’s just a link they found “first”. For sites built around monetising other sites’ labour, aggregator users get mighty tetchy when the same thing is done to them.
Twitter stole its small-talk.
Bloggers might not have jumped ship for aggregators, but they dove into Twitter head first.
It takes a lot of time and inspiration to write a long-form article, so most blogs filled the gaps between with links, funny pictures they had found around the Internet, short pithy commentary, snippets of conversation, interesting quotes, jokes, and in one case from a blogger now worth more money than you can count, an enthusiastic two sentence review of the porn site “Bang Bus”.
With Twitter you could do that on your phone, have it pushed to your friends/subscribers in real time, and have the same done back to you with equal ease. It wasn't even a competition.
Twitter still has the “How do I get people to notice me?” problem, and later developed the even more disturbing “How do I get people to stop noticing me?” problem, but that didn't stop it sucking the remaining air out of the blogosphere in the course of surprisingly few months.
What about Facebook, Instagram, Pinterest and the like? Well, from my perspective they weren't so much the successors to blogging as they were the successors to Livejournal.
Tumblr stole its future.
A curmudgeon might say I should also file Tumblr under “successors to Livejournal”, but I disagree. Tumblr sites tend far less towards being amorphous personal diaries aimed square at the author’s existing social network, and far more towards expressing the author’s interests in public, and joining the larger community that arises around them.
From one perspective, Tumblr is blogging. At today’s count they host 244 million blogs making a total of 81 million posts per day. That’s about four posts per year for every human being on Earth. Users can contribute their own posts, but just as importantly they can reblog and comment, forming spontaneous, distributed communities of interest around (and in the spaces between) the things they share from others.
From another perspective, Tumblr stole blogging. The syndication and sharing tools, the communities built within Tumblr, everything stops dead at the website's border. The tools seem almost contemptuous of the web as it exists outside Tumblr. To quote JWZ:
[Tumblr pioneered] showing the entire thread of attributions by default, and emphasizing the first and last -- but stopping cold at the walls of the Tumblr garden. To link to an actual creator, you have to take an extra step, so nobody bothers.
These may seem like small glitches, but the aggregate effect is huge. They’re what makes the “Tumblr Community” a real thing people talk about in a way you'd never hear about, say, people who happen to host their sites with Wordpress.
Centralisation and lock-in won.
In the end, the distributed, do-it-yourself web was just too hard. Not just for newcomers facing a mountainous barrier to entry, but even for incumbents looking to shave a few sources of frustration from their day. Just ask anyone who excitedly built RSS/Atom syndication into their product in the early 2000s, only to deprecate the feature gradually into the power-user margin over the ensuing decade.
In every case, a closed, proprietary system took some ingredient of the self-publishing crack bloggers discovered in the early 2000s and distilled it into a product that was easier to use, and that people were willing to adopt even though it meant losing the freedom of openness, interoperability and owning your own words.
Leaving behind a landscape of those for whom that sacrifice was not commercially attractive, and those of us who are just sufficiently set in our ways that the idea of not running our own website feels alien.
Ask me ten years ago, and I'd say a blog entry, once published, should remain that way. Oh wait, I actually did say that:
I try never to delete anything substantive. Attempting to un-say something by deleting it is really just a case of hiding the evidence. I'd much rather correct myself out in the open than pretend I was never wrong in the first place.
The reasons not to delete come down to:
- Not wanting to break the web by 404-ing a page
- Wanting to be honest about what you’ve said in public
- Keeping a record of who you were at some moment in time.
The counter-arguments are:
- The web was designed to break. And anyway, the stuff worth deleting is usually the stuff nobody’s linking to.
- Just how long does a mea culpa have to stand before it becomes self-indulgent?
- Unless you’re noteworthy and dead, or a celebrity and alive, the audience for your years-old personal diaries is particularly limited.
- Publishing on the web isn’t just something you do, and then have done. It’s an ongoing process. A website isn’t just a collection of pages, it’s a work that is both always complete, and always evolving. And every work can do with the occasional read-through with red pen in hand.
That last point is the most compelling one. I was publishing a website full of things that, however apt they were at the time to the audience they were published for, just aren’t worth reading today.
So to cut a long story short, last weekend I un-published about 700 of the previously 1800 posts on this blog; things that were no longer correct, things that were no longer relevant, things that were no longer interesting even as moments in time, and things that I no longer feel comfortable being associated with. I don't think anything that was removed will be particularly missed, and as a whole the blog is a better experience for readers without them.
The weirdest thing about deleting 700 blog posts is realising you had 1800 to start with. Although to be fair, 1750 of them were Cure lyrics drunk-posted to Livejournal.
Under the hood
It's a testament to the resilience of Moveable Type that in the eleven years since I first installed it to run this blog, I've upgraded it exactly twice. If I’d tried that with the competition, I doubt I’d have had nearly as smooth a ride.
Moveable Type got me through multiple front-page appearances on Digg, reddit, Hacker News and Daring Fireball without a hitch, or at least would have if I hadn't turned out to be woefully incompetent at configuring Apache for the simple task of serving static files.
But as they say, all good things must come to end. Preferably with Q showing up in a time travel episode.
I replaced Moveable Type with a couple of scripts that publish a static site from a git repo, fully aware that I’m doing this at least five years after it became trendy. The site should look mostly identical, except comments and trackbacks haven't been migrated. They’re in the repo, but I'm inclined to let them stay there.
Look, bad things happen to people in fiction just like bad things happen in real life. And at least the people in fiction aren't real so it didn't really happen to them.
I get that.
And you can have great entertainment where bad things happen to bad people, or bad things happen to good people, or bad things happen to indifferent people who just happened to be in the wrong place at the wrong time.
I get that too.
But at some point you find yourself sitting on a couch watching a drawn-out scene where a child is burned alive screaming over and over for her parents to save her, and you think “Why the fuck am I still watching this show?”
Bad things happen in real life. Bad things have happened throughout history. So what, I'm watching television. If I wanted to experience the reality of a brutal, lawless campaign for supremacy between tribal warlords, there are plenty of places in the world I could go to see that today. I wouldn't survive very long, but at least I'd get what I deserved for my attempt at misery tourism.
Bad things happen in good drama, too. But drama comes with a contract. The bad things are there because they are contributing to something greater. Something that can let you learn, or understand, or experience something you otherwise wouldn't have; leading you out the other side glad that you put yourself through the ordeal, albeit sometimes begrudgingly.
To refresh our memories, here's how George R. R. Martin explained the Red Wedding:
I killed Ned in the first book and it shocked a lot of people. I killed Ned because everybody thinks he's the hero and that, sure, he's going to get into trouble, but then he'll somehow get out of it. The next predictable thing is to think his eldest son is going to rise up and avenge his father. And everybody is going to expect that. So immediately [killing Robb] became the next thing I had to do.
There are increasingly flimsy justifications for the horrors of Game of Thrones. They motivate character A. Or they open up space for character B. But in the end it's obvious that it's really about providing the now-mandated quota of shock, and giving the writers some hipster cred for subverting fantasy tropes.
I did not enjoy watching Sansa Stark’s rape. I did not enjoy watching Shireen Baratheon burned at the stake.
If that's what you want to watch TV for, go for it. But I'm out.
Seen on Twitter:
Either and Promises/Futures are useful and I’ll use them next time they’re appropriate. But outside Haskell does their monad-ness matter?
All code below is written in some made-up Java-like syntax, and inevitably contains bugs/typos. I'm also saying "point/flatMap" instead of "pure/return/bind" because that's my audience. I also use "is a" with reckless abandon. Any correspondence with anything that is either programmatically or mathematically useful is coincidental.
What is a monad? A refresher.
A monad is something that implements "point" and "flatMap" correctly.
I just made a mathematician scream in pain, but bear with me on this one. Most definitions of monads in programming start with the stuff they can do—sequence computations, thread state through a purely functional program, allow functional IO. This is like explaining the Rubik's Cube by working backwards from how to solve one.
A monad is something that implements "point" and "flatMap" correctly.
So if this thing implements point and flatMap correctly, why do I care it's a monad?
Because "correctly" is defined by the monad laws.
- If you put something in a monad with point, that's what comes out in flatMap. point(a).flatMap(f) === f(a)
- If you pass flatMap a function that just points the same value into another monad instance, nothing happens. m.flatMap(a -> point(a)) === m
- You can compose multiple flatMaps into a single function without changing their behaviour. m.flatMap(f).flatMap(g) === m.flatMap(a -> f(a).flatMap(g))
If you don't understand these laws, you don't understand what flatMap does. If you understand these laws, you already understand what a monad is. Saying "Foo implements flatMap correctly" is the same as saying "Foo is a monad", except you're using eighteen extra characters to avoid the five that scare you.
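You can check the laws directly against a real flatMap. Here are the same three laws, written out for Java's Optional (where point is Optional.of):

```java
import java.util.Optional;
import java.util.function.Function;

public class MonadLaws {
    public static void main(String[] args) {
        Function<Integer, Optional<Integer>> f = a -> Optional.of(a + 1);
        Function<Integer, Optional<Integer>> g = a -> Optional.of(a * 2);
        Optional<Integer> m = Optional.of(10);

        // Law 1: point(a).flatMap(f) === f(a)
        assert Optional.of(3).flatMap(f).equals(f.apply(3));

        // Law 2: m.flatMap(a -> point(a)) === m
        assert m.flatMap(Optional::of).equals(m);

        // Law 3: m.flatMap(f).flatMap(g) === m.flatMap(a -> f(a).flatMap(g))
        assert m.flatMap(f).flatMap(g)
                .equals(m.flatMap(a -> f.apply(a).flatMap(g)));

        System.out.println("All three laws hold for Optional");
    }
}
```

(Run with assertions enabled, i.e. `java -ea MonadLaws`.)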
Because being a monad gives you stuff for free.
If you have something with a working point and flatMap (i.e. a monad), then you know that at least one correct implementation of map() is map(f) = flatMap(a -> point(f(a))), because the monad laws don't allow that function to do anything else.
You also get join(), which flattens out nested monads: join(m) = m.flatMap(a -> a) will turn Some(Some(3)) into Some(3).
You get sequence(), which takes a list of monads of A, and returns you a monad of a list of A's: sequence(l) = l.foldRight(point(List()))((m, ml) -> m.flatMap(x -> ml.flatMap(y -> point(x :: y)))) will turn [Future(x), Future(y)] into Future([x, y]).
And so on.
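For instance, here are join and sequence built from nothing but of and flatMap, using Java's Optional again. The helper names mirror the text; they're not JDK methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class DerivedOps {
    // join: flatten a nested monad. Some(Some(3)) -> Some(3)
    static <A> Optional<A> join(Optional<Optional<A>> m) {
        return m.flatMap(a -> a);
    }

    // sequence: turn a list of monads into a monad of a list.
    // [Some(1), Some(2)] -> Some([1, 2]); any None makes it all None.
    static <A> Optional<List<A>> sequence(List<Optional<A>> l) {
        Optional<List<A>> acc = Optional.of(new ArrayList<A>());
        for (Optional<A> m : l) {
            acc = acc.flatMap(xs -> m.flatMap(x -> {
                List<A> copy = new ArrayList<>(xs);
                copy.add(x);
                return Optional.of(copy);
            }));
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(join(Optional.of(Optional.of(3))));
        // prints Optional[3]
        System.out.println(sequence(
                Arrays.asList(Optional.of(1), Optional.of(2))));
        // prints Optional[[1, 2]]
        System.out.println(sequence(
                Arrays.asList(Optional.of(1), Optional.<Integer>empty())));
        // prints Optional.empty
    }
}
```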
Knowing that Either is a monad means knowing that all the tools that work on a monad will work on Either. And when you learn that Future is a monad too, all the things you learned that worked on Either because it's a monad, you'll know will work on Future too.
Because how do you know it implements flatMap correctly?
If something has a flatMap() but doesn't obey the monad laws, its users no longer get the assurance that any of the things you'd normally do with flatMap() (like the functions above) will work.
There are plenty of law-breaking implementations of flatMap out there, possibly because people shy away from the M-word. Calling things what they are (is a monad, isn't a monad) gives us a vocabulary to explain why one of these things is not like the other. If you're implementing a flatMap() or its equivalent, you'd better understand what it means to be a monad or you'll be lying to the consumers of your API.
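As a made-up example of a law-breaker (this Counted class is entirely my invention): a flatMap that tries to count how many operations were chained, but forgets the operations performed inside the function it was passed, silently violates the associativity law. The values come out the same; the counts don't.

```java
import java.util.function.Function;

// A deliberately law-breaking "monad": it counts flatMap calls, but
// discards any counts accumulated inside the function it was passed.
class Counted<A> {
    final A value;
    final int ops;

    Counted(A value, int ops) {
        this.value = value;
        this.ops = ops;
    }

    static <A> Counted<A> point(A a) {
        return new Counted<>(a, 0);
    }

    <B> Counted<B> flatMap(Function<A, Counted<B>> f) {
        Counted<B> result = f.apply(value);
        // BUG: should be this.ops + result.ops + 1. Dropping result.ops
        // breaks associativity.
        return new Counted<>(result.value, this.ops + 1);
    }

    public static void main(String[] args) {
        Function<Integer, Counted<Integer>> f = a -> Counted.point(a + 1);
        Function<Integer, Counted<Integer>> g = a -> Counted.point(a * 2);
        Counted<Integer> m = Counted.point(10);

        Counted<Integer> left = m.flatMap(f).flatMap(g);
        Counted<Integer> right = m.flatMap(a -> f.apply(a).flatMap(g));

        // Same value either way (22), but 2 ops on the left vs 1 on
        // the right: law 3 is broken, and any code that relies on
        // refactoring between those two shapes will misbehave.
        System.out.println(left.value + " after " + left.ops + " ops");
        System.out.println(right.value + " after " + right.ops + " ops");
    }
}
```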
But Monad is an opaque term of art!
So, kind of like "Scrum", "ORM" or "Thread"?
Or, for that matter, "Object"?
As developers, we do a better job when we understand the abstractions we're working with, how they function, and how they can be reused in different contexts.
Think of the most obvious monads that have started showing up in every language¹ over the last few years: List, Future, Option, Either. They feel similar, but what do they all have in common? Option and Either kind of do similar things, but not really. An Option is kind of like a zero-or-one element list, but not really. And even though Option and Either are kind of similar, and Option and List are kind of similar, that doesn't make Either and List similar in the same way at all! And a Future, well, er…
The thing they have in common is they're monads.
¹ Well, most languages. After finding great branding success with Goroutines, Go's developers realised they had to do everything possible to block any proposed enhancement of the type system that would allow the introduction of "Gonads".
The weekend after I posted this article about the pitfalls hiding in a simple shell command-line, I had some free time and decided it might be fun to see what the same functionality would look like in Haskell.
My Haskell experience is limited to tutorials, book exercises, and reading other people’s code, so I wanted to go a little bit off the paved road for once.
I didn't expect to just knock out a working program in an unfamiliar language in minutes, and I didn't. Nonetheless I was pretty impressed by how, once I had muddled through understanding the pieces I was working with, they joined together in a pleasingly logical way. (Looking at the code a month later I can get what the various bits do—the types are informative enough for that—but it would take some unravelling for me to remember how.)
Then I ran my resulting program on a real directory full of photos and it instantly died, because the exif parser was built on lazy I/O, and my script was running out of available file-handles before it was getting around to closing them.
Leaky abstractions 1, Charles 0.
This is the kind of post that has a small but annoyingly non-zero chance of setting off somebody's "wrong on the Internet" buzzer. Please don't. It's an anecdote of how I encountered an unshaven yak on a lazy Sunday afternoon, as developers are wont to do, nothing more.
Twitter is a crowded bar. Most of the people who are there, are there to relax, hang out and chat with the friends or colleagues they showed up with.
Twitter is a crowded bar. You're having conversations in a public place. You are surrounded by people having conversations within your earshot. But none of that makes what you are doing or what they are doing "public debate" by any reasonable definition.
Twitter is a crowded bar, not an alternate universe, so try not to say anything that would get you in trouble if it was repeated outside the bar.
Twitter is a crowded bar. Sometimes you run into people you know. And if they're cool with the friends you went to the bar with, everything will be cool.
Twitter is a crowded bar. Some of the people there are being paid to promote a product. Some are there to network. Some are there because they're famous and it's a "place to be seen". You can hang around them if that's your thing, but they're at the bar because everybody else makes it worth them being there, not the other way around.
Twitter is a crowded bar. People can go to bars to meet strangers, to meet new people in a new town, or because they know it's where other people "like them" hang out. But it's never anyone else's responsibility to make sure a stranger feels welcome with them personally, or with their social circle.
Twitter is a crowded bar. If you hear something interesting from a nearby conversation there's nothing stopping you introducing yourself and joining in. But again, it's your job to make sure you're not an unwelcome interruption.
Twitter is a crowded bar. If you're that person who always ends up cornering a stranger who doesn't particularly want to talk to you in a one-on-one conversation, you're not fooling anyone.
Twitter is a crowded bar. If someone says "I didn't come here to argue with you about that", even if it's something they brought up in the first place, you should probably find another conversation. There's plenty of them around.
Twitter is a crowded bar. If you go there to pick fights with strangers, you're an asshole.
Twitter is a crowded bar, but they've kind of skimped on hiring bouncers.
The most educational part of this recent reiteration of the “your software should be like Unix pipes” trope isn’t that it shows how Unix command line tools are actually rather complicated, and can easily turn into baroque magical invocations. Although that's certainly true. The man-page for ‘find’ is 3,700 words. The manual for grep is a comparatively light 1,600 words, but that's because the 3,000 word explanation of regular expressions is in a different file.
The most educational part is the addendum:
Edited to use find's -print0 and xargs -0 to properly handle spaces in file names.
Firstly, this is a really dangerous class of bug. Unsafe handling of spaces in filenames is the kind of shell scripting mistake that will eventually end up deleting half the files on your computer when you just wanted to prune a directory.
It’s no accident that “The day I accidentally rm -rf /’d as root, but recovered because I still had an emacs process running in another terminal” is the archetypal Unix admin war-story.
Secondly, this is the kind of bug that appears as an emergent behaviour of component-based systems. Every component in the pipeline is working entirely correctly, in the sense that they're all performing exactly the operation they were instructed to perform. The bug comes from the way the pieces have been joined together.
Joining simple components together doesn't guarantee you simplicity. Hook a machine that does three things to a machine that does three things, and you've got a bigger machine that does nine things. Any one of those nine paths could conceal a bug that doesn't live in either component, but in the assumptions made when those components are joined together.
The Unix pipe model, where complex operations are composed out of single-function pieces that consume one stream of bytes and emit another, is magically simple. Every component speaks the same language—bytes in, bytes out—and thus every component is compatible with every other. Components can be developed against a uniform, simple set of input and output APIs. Complex things like flow control are handled for you: pipes buffer those bytes, so if you send too fast your writes will eventually block until the next component is free to receive.
At this point I must defer to Jamie Zawinski:
…the decades-old Unix “pipe” model is just plain dumb, because it forces you to think of everything as serializable text, even things that are fundamentally not text, or that are not sensibly serializable.
For a program that produces or consumes a list of items, the problem of how that list is communicated doesn't go away by saying “everything is a stream of bytes”. All that happens is that each program producing or consuming lists has to pick a delimiter, and hope that the other program in the chain doesn't pick a different delimiter and delete all your files.
And then there are the assumptions about how a stream of bytes might map to text that are rooted in the 1970s. Or the way programs that want to support pretty-printing to the terminal must do so by silently varying their output based on the identity of the stream they are writing to.
Simplicity is prerequisite for reliability. — Edsger Dijkstra, How do we tell truths that might hurt?
The Unix pipe model is actually a great example of how a complex system can be made to look simple by pushing complexity downstream, and how doing so can give you a very narrowly defined kind of simplicity at the expense of reliability—the simplicity of a system that mostly does the right thing most of the time.
The New Jersey guy said that the Unix solution was right because the design philosophy of Unix was simplicity and that the right thing was too complex. Besides, programmers could easily insert this extra test and loop. The MIT guy pointed out that the implementation was simple but the interface to the functionality was complex. — Richard Gabriel, The Rise of Worse Is Better
If you only have to worry about mostly doing the right thing most of the time, your components can be simpler because they can pretend edge-cases don't exist or don't matter. For users, the default “happy path” can be simpler because they don't have to cater to those edge-cases until they happen, at which point they either remember to insert that extra test for the unhappy path, or are left cleaning up the mess afterwards. And if things do screw up, it’s easy to blame yourself because you forgot you needed a -print0 in there.
There is an obvious analogy to programming language type systems, or pure functions vs side-effects here. Feel free to print out this blog post and scribble one in the margins.