Measure twice
“Measure twice, cut once.” I can’t recall exactly where I first heard that: perhaps in a school carpentry lesson? For some reason I’m now thinking it was a physics lesson instead, but no matter. What is true is that I recently discovered that it applies to software engineering as much as to carpentry.
Here’s the background: this blog is generated as a set of static files, by transforming an input tree into an output tree. The input tree contains a mixture of static resources (images, etc) alongside ‘posts’, text files containing metadata and Markdown content. The output tree contains exactly the same files, except that the posts are rendered to HTML, and there are some additional generated files like the front page and Atom feed.
There are tools to do this kind of thing now (we use Jekyll for Noda Time’s web site, for example), but I wrote my own (and then recently rewrote it in Python)¹. My version does three things: it scans the input and output trees, works out what the output tree should look like, then makes it so. It’s basically `make` (with a persistent cache and content-hash-based change detection) plus a dumb `rsync`, and it’s not particularly complex.
For my most-recent post, I needed to add support for math rendering, which I did by conditionally sourcing a copy of MathJax from their CDN. So far, so good, but then I wanted to be able to proof the post while I was on a plane, so I decided to switch to a local copy of MathJax instead.
Problem: a local install of MathJax contains nearly 30,000 files, and my no-change build time shot up to about twelve seconds.
But, you know, optimisation opportunity! I carried out some basic profiling and figured out that the 12 seconds I was seeing was taken up with:
- a little over five seconds scanning the input and output trees;
- about four seconds working out what to change; and
- about three seconds writing out the persistent cache.
The second of those I couldn’t see any obvious way to improve, but the first and last surprised me.
The input and output tree scanning is done by using `os.walk()` to walk the tree, and `os.stat()` to grab each file’s mtime and size (which I use as validators for cache entries).
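For concreteness, the scan is roughly this shape (a minimal sketch, not the tool’s actual code; `scan_tree` is a name made up for this example):

```python
import os

def scan_tree(root):
    """Walk root, recording each file's mtime and size.

    Those two values are the cache validators mentioned above.
    """
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)  # one stat(2) call per file
            entries[os.path.relpath(path, root)] = (st.st_mtime, st.st_size)
    return entries
```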
Clearly that was an inefficient way to do it: I’m calling stat(2) about 30,000 times, when I should be reading that information in the same call as the one that reads the directory, right? Except that there’s no such call: the Linux VFS assumes that a file’s metadata is separate from the directory entry²; this isn’t DOS.
Perhaps I was thrashing the filesystem cache, then? Maybe I should be sorting (or not sorting) the directory entries, or stat-ing all the files before I recursed into subdirectories? Nope; doesn’t make a difference.
Well, I guess we’re blocking on I/O then. After all, `git` doesn’t take long to scan a tree like this, so it must be doing something clever; I should do that. Ah, but `git` is multi-threaded, isn’t it?³ I’ll bet that’s how it can be fast: it’s overlapping the I/O operations so that it can make progress without stalling for I/O.
So I wrote a parallel directory scanner in Python, trying out both the `multiprocessing` and `threading` libraries. How long did each implementation take? About five seconds, same as before. (And raise your hand if that came as a surprise.)
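For what it’s worth, the threaded attempt was along these lines (a sketch using `concurrent.futures` for brevity, rather than the raw `threading` module):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def scan_tree_threaded(root, workers=8):
    """Stat files from a thread pool, hoping to overlap the I/O waits."""
    paths = [os.path.join(dirpath, name)
             for dirpath, _dirnames, filenames in os.walk(root)
             for name in filenames]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(os.stat, paths)))
```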
The next thing I tried was replicating the scan in C, just to double-check that `readdir`-and-`stat` was a workable approach. I can’t recall the times, but it was pretty quick, so Python’s at fault, right?
Wrong. It’s always your fault.
I realised then that I’d never tried anything outside my tool itself, and ported the bare C code I had to Python. It took exactly the same amount of time. (At which point I remembered that Mercurial, which I actually use for this blog, is largely written in Python; that should have been a clue that it wasn’t likely to be Python in the first place.)
So finally I started taking a look at what my code was actually doing with the directory entries. For each one, it created a `File` object with a bunch of properties (name, size, etc), along with a cache entry to store things like the file’s content hash and the inputs that produced each output.
Now the objects themselves I couldn’t do much about, but the cache entry creation code was interesting: it first generated a cache key from the name, mtime, and size (as a `dict`), then added an empty entry to my global cache `dict` using that key. The global cache was later to be persisted as a JSON object (on which, more later), and so I had to convert the entry’s cache key to something hashable first.
And how to generate that hashable key? Well, it turned out that as I already had a general-purpose serialiser around, I’d made the decision to reuse the JSON encoder as a way to generate a string key from my cache key `dict` (because it’s not like performance matters, after all). Once I’d replaced that with something simpler, tree scans dropped from 5.2s to 2.9s. Success!
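In outline, the change was something like this (a hypothetical reconstruction, not the actual code):

```python
import json

# Before: serialise the key dict through the general-purpose JSON
# encoder, once per file.
def cache_key_slow(name, mtime, size):
    return json.dumps({'name': name, 'mtime': mtime, 'size': size},
                      sort_keys=True)

# After: "something simpler", formatting the three values directly.
def cache_key_fast(name, mtime, size):
    return '%s|%s|%s' % (name, mtime, size)
```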
I’d also noticed something odd while I was hacking about: when I removed some of my `DEBUG`-level logging statements, things sped up a bit, even though I was only running at `INFO` level. I briefly considered that perhaps Python’s logging was just slow, then decided to take another look at how I was setting up the logging in the first place:
```python
logging.basicConfig(
    level=logging.DEBUG,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))
logging.getLogger().handlers[0].setLevel(
    logging.DEBUG if args.verbose else logging.INFO)
```
Python’s logging is similar to Java’s, so this creates a logger that logs everything, then sets the default (console) log handler to only display messages at `INFO` level and above. Oops.
I’d stolen the code from another project where I’d had an additional always-on `DEBUG` handler that wrote to a file, but here I was just wasting time generating log records I’d throw away later. I changed the logging to set the level of the root logger instead, and sped things up by another second. More success!
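The fix is roughly this (a sketch, not the exact code; setting the level on the root logger means `logger.debug()` calls bail out before a record is even created):

```python
logging.basicConfig(
    level=logging.DEBUG if args.verbose else logging.INFO,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))
```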
Finally, I decided to take a look at the way I was writing out my cache. This is a fairly large in-memory `dict` mapping string keys to `dict` values for each file I knew about. I’d known that Python’s `json` module wasn’t likely to be particularly fast, but almost three seconds to write an 11MB file still seemed pretty slow to me.
I wasn’t actually writing it directly, though; the output was quite big and repetitive, so I was compressing it using `gzip` first:
```python
with gzip.open(self._cache_filename, 'wb') as cache_fh:
    json.dump(non_empty_values, cache_fh, sort_keys=True,
              default=_json_encode_datetime)
```
I noticed that if I removed the compression entirely, the time to write the cache dropped from about 2900ms to about 800ms, but by this point I was assuming that everything was my fault instead, so I decided to measure separately the time taken to generate the JSON output and the time taken to write it to the file.
To my surprise, when I split the two up (using `json.dumps()` to produce a string instead of `json.dump()`), the total time dropped to just 900ms. I have no real idea why this happens, but I suspect that something is either flushing each write to disk early, or that calls from Python code to Python’s native-code zlib module are expensive (or `json.dump()` is slow).
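The split version looks something like this (a sketch based on the description above, not the exact code):

```python
# Serialise to a single string first, then hand gzip one large write.
data = json.dumps(non_empty_values, sort_keys=True,
                  default=_json_encode_datetime)
with gzip.open(self._cache_filename, 'wb') as cache_fh:
    cache_fh.write(data)
```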
In total, that brought my no-change runtime down from twelve seconds to just about six.
So, in summary, once I realised that I should actually measure where my time was spent rather than guessing what it might be spent on, I was able to reduce my runtime by about half, which is quite a big deal in an edit-compile-run cycle. That took it from “I’m bored waiting; maybe read Slashdot instead” to something tolerable.
And so success, and scene.
But.
But there’s actually a larger lesson to learn here, too (and in fact I very nearly titled this post “The `\sqrt` of all evil” in light of that): I didn’t need to do any of this at all.
A few weeks after the above, I realised that I could solve my local editing problem an entirely different way. I moved MathJax out of the content tree entirely, and now (on local runs) just drop a symlink into the output tree after the tool is done. So if you take a look at the page as it serves now, you’ll see I’m back to sourcing MathJax via their CDN.
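In code terms, the post-build step is tiny (a sketch; both paths here are hypothetical):

```python
import os

mathjax_link = os.path.join(output_root, 'MathJax')
if not os.path.islink(mathjax_link):
    os.symlink('/path/to/local/MathJax', mathjax_link)
```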
This means that I’m back down to O(200) input files rather than O(30k), and my no-change builds now take 30ms. It was a fun journey, but I’m not entirely sure that cutting 45ms from a 75ms run was worth all the effort I put in…
1. Why? Mostly because I can: it’s fun. But also because it gives me an excuse to play around: originally with C and SQLite, and more recently with Python, a language I don’t use much otherwise. I could also say that static blog generators weren’t as common in 2006, but to be honest I don’t think I bothered looking.
2. This is a good thing: hard links would be pretty tricky otherwise.
3. Nope. `git` is multi-threaded when packing repositories, but as far as I’m aware that’s the only time it is.