The most educational part of this recent reiteration of the “your software should be like Unix pipes” trope isn’t that it shows how Unix command line tools are actually rather complicated, and can easily turn into baroque magical invocations. Although that's certainly true. The man-page for ‘find’ is 3,700 words. The manual for grep is a comparatively light 1,600 words, but that's because the 3,000 word explanation of regular expressions is in a different file.
The most educational part is the addendum:
Update: added -print0 to find and -0 to xargs to properly handle spaces in file names.
Firstly, this is a really dangerous class of bug. Unsafe handling of spaces in filenames is the kind of shell scripting mistake that will eventually end up deleting half the files on your computer when you just wanted to prune a directory.
It’s no accident that “The day I accidentally rm -rf /’d as root, but recovered because I still had an emacs process running in another terminal.” is the archetypal Unix admin war-story.
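The failure that update fixes is easy to reproduce without risking any real files. What follows is a minimal sketch, assuming a find and xargs that support -print0 and -0 (the GNU and BSD versions both do). The scratch directory and the file name are invented for the illustration, and echo stands in for whatever destructive command you actually had in mind.

```shell
# Scratch directory and file name are hypothetical, made up for this sketch.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch "my notes.txt"

# Naive pipeline: xargs splits its input on whitespace, so the single file
# arrives as two separate arguments, "./my" and "notes.txt".
naive_count=$(find . -name '*.txt' | xargs -n 1 echo | wc -l | tr -d ' ')

# NUL-delimited pipeline: the whole file name arrives as one argument.
safe_count=$(find . -name '*.txt' -print0 | xargs -0 -n 1 echo | wc -l | tr -d ' ')

echo "naive: $naive_count argument(s), safe: $safe_count argument(s)"

# Clean up the scratch directory.
cd "$OLDPWD" || exit 1
rm -r -- "$scratch"
```

Swap echo for rm and the naive version goes looking for two files that were never in the listing. If either of those names happens to exist, it is gone.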
Secondly, this is the kind of bug that appears as an emergent behaviour of component-based systems. Every component in the pipeline is working entirely correctly, in the sense that they're all performing exactly the operation they were instructed to perform. The bug comes from the way the pieces have been joined together.
Joining simple components together doesn't guarantee you simplicity. Hook a machine that does three things to a machine that does three things, and you've got a bigger machine that does nine things. Any one of those nine paths could conceal a bug that doesn't live in either component, but in the assumptions made when those components are joined together.
The Unix pipe model, where complex operations are composed out of single-function pieces that consume one stream of bytes and emit another, is magically simple. Every component speaks the same language (bytes in, bytes out), and thus every component is compatible with every other. Components can be written against one simple, uniform input and output API. Complex things like flow control are handled for you: the kernel buffers those bytes in the pipe, so if you send too fast your writes will eventually block until the next component is free to receive.
At this point I must defer to Jamie Zawinski:
…the decades-old Unix “pipe” model is just plain dumb, because it forces you to think of everything as serializable text, even things that are fundamentally not text, or that are not sensibly serializable.
For a program that produces or consumes a list of items, the problem of how that list is communicated doesn't go away by saying “everything is a stream of bytes”. All that happens is that each program producing or consuming lists has to pick a delimiter, and hope that the other program in the chain doesn't pick a different delimiter and delete all your files.
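The delimiter problem is not limited to spaces. As a rough sketch (again assuming a find that supports -print0, and with a deliberately awkward, made-up file name), here is one file being counted as two items by a consumer that assumes one name per line:

```shell
# Scratch directory and file name are hypothetical, made up for this sketch.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# One file, whose name contains a newline.
touch "$(printf 'first\nsecond.txt')"

# A consumer that assumes "one name per line" sees two items.
lines=$(find . -type f | wc -l | tr -d ' ')

# A consumer that counts NUL terminators sees one.
nuls=$(find . -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')

echo "newline-delimited: $lines item(s), NUL-delimited: $nuls item(s)"

# Clean up the scratch directory.
cd "$OLDPWD" || exit 1
rm -r -- "$scratch"
```

NUL only works as a delimiter because it is one of the two bytes that cannot appear in a Unix file name. That is a convention both ends have to know about, not something the byte stream tells you.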
And then there are the assumptions about how a stream of bytes might map to text that are rooted in the 1970s. Or the way programs that want to support pretty-printing to the terminal must do so by silently varying their output based on the identity of the stream they are writing to.
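The terminal-sniffing trick looks roughly like this in shell. This is only a sketch: show_status is a hypothetical function invented for the example, while [ -t 1 ] is the standard test for whether file descriptor 1 (standard output) is a terminal.

```shell
# show_status is a hypothetical function, invented for this sketch.
show_status() {
  if [ -t 1 ]; then
    # Standard output is a terminal: decorate with ANSI colour codes.
    printf '\033[32mok\033[0m\n'
  else
    # Standard output is a pipe or a file: emit plain text.
    printf 'ok\n'
  fi
}

# Inside a command substitution, standard output is a pipe, so the
# plain branch runs even if the script was started from a terminal.
piped=$(show_status)
printf 'captured: %s\n' "$piped"
```

The same program, given the same input, produces different bytes depending on who is listening, and nothing in the pipeline records that it happened.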
Simplicity is prerequisite for reliability. — Edsger Dijkstra, How do we tell truths that might hurt?
The Unix pipe model is actually a great example of how a complex system can be made to look simple by pushing complexity downstream, and how doing so can give you a very narrowly defined kind of simplicity at the expense of reliability—the simplicity of a system that mostly does the right thing most of the time.
The New Jersey guy said that the Unix solution was right because the design philosophy of Unix was simplicity and that the right thing was too complex. Besides, programmers could easily insert this extra test and loop. The MIT guy pointed out that the implementation was simple but the interface to the functionality was complex. — Richard Gabriel, The Rise of Worse Is Better
If you only have to worry about mostly doing the right thing most of the time, your components can be simpler because they can pretend edge-cases don't exist or don't matter. For users, the default “happy path” can be simpler because they don’t have to cater to those edge-cases except when they happen and they either remember to insert that extra test for the unhappy path, or are left cleaning up the mess afterwards. And if things do screw up, it’s easy to blame yourself because you forgot you needed a -print0 in there.
There is an obvious analogy to programming language type systems, or pure functions vs side-effects here. Feel free to print out this blog post and scribble one in the margins.