Measure twice
“Measure twice, cut once.” I can’t recall exactly where I first heard that: perhaps in a school carpentry lesson? For some reason I’m now thinking it was a physics lesson instead, but no matter. What is true is that I recently discovered that it applies to software engineering as much as to carpentry.
Here’s the background: this blog is generated as a set of static files, by transforming an input tree into an output tree. The input tree contains a mixture of static resources (images, etc) alongside ‘posts’, text files containing metadata and Markdown content. The output tree contains exactly the same files, except that the posts are rendered to HTML, and there are some additional generated files like the front page and Atom feed.
There are tools to do this kind of thing now (we use Jekyll for Noda Time’s web site, for example), but I wrote my own (and then recently rewrote it in Python)¹. My version does three things: it scans the input and output trees, works out what the output tree should look like, then makes it so. It’s basically `make` (with a persistent cache and content-hash-based change detection) plus a dumb `rsync`, and it’s not particularly complex.
For my most-recent post, I needed to add support for math rendering, which I did by conditionally sourcing a copy of MathJax from their CDN. So far, so good, but then I wanted to be able to proof the post while I was on a plane, so I decided to switch to a local copy of MathJax instead.
Problem: a local install of MathJax contains nearly 30,000 files, and my no-change build time shot up to about twelve seconds.
But, you know, optimisation opportunity! I carried out some basic profiling and figured out that the 12 seconds I was seeing was taken up with:
- a little over five seconds scanning the input and output trees;
- about four seconds working out what to change; and
- about three seconds writing out the persistent cache.
The second of those I couldn’t see any obvious way to improve, but the first and last surprised me.
The input and output tree scanning is done by using `os.walk()` to walk the tree, and `os.stat()` to grab each file’s mtime and size (which I use as validators for cache entries).
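For concreteness, the scan is roughly this shape (a minimal sketch, not the tool’s actual code; `scan_tree` is a name made up for this example):

```python
import os

def scan_tree(root):
    """Walk root, recording each file's mtime and size.

    Those two values are the cache validators mentioned above.
    """
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)  # one stat(2) call per file
            entries[os.path.relpath(path, root)] = (st.st_mtime, st.st_size)
    return entries
```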
Clearly that was an inefficient way to do it: I’m calling stat(2) about 30,000 times, when I should be reading that information in the same call as the one that reads the directory, right? Except that there’s no such call: the Linux VFS assumes that a file’s metadata is separate from the directory entry²; this isn’t DOS.
Perhaps I was thrashing the filesystem cache, then? Maybe I should be sorting (or not sorting) the directory entries, or stat-ing all the files before I recursed into subdirectories? Nope; doesn’t make a difference.
Well, I guess we’re blocking on I/O then. After all, `git` doesn’t take long to scan a tree like this, so it must be doing something clever; I should do that. Ah, but `git` is multi-threaded, isn’t it?³ I’ll bet that’s how it can be fast: it’s overlapping the I/O operations so that it can make progress without stalling for I/O.
So I wrote a parallel directory scanner in Python, trying out both the `multiprocessing` and `threading` libraries. How long did each implementation take? About five seconds, same as before. (And raise your hand if that came as a surprise.)
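For what it’s worth, the threaded attempt was along these lines (a sketch using `concurrent.futures` for brevity, rather than the raw `threading` module):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def scan_tree_threaded(root, workers=8):
    """Stat files from a thread pool, hoping to overlap the I/O waits."""
    paths = [os.path.join(dirpath, name)
             for dirpath, _dirnames, filenames in os.walk(root)
             for name in filenames]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(os.stat, paths)))
```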
The next thing I tried was replicating the scan in C, just to double-check that `readdir`-and-`stat` was a workable approach. I can’t recall the times, but it was pretty quick, so Python’s at fault, right?
Wrong. It’s always your fault.
I realised then that I’d never tried anything outside my tool itself, and ported the bare C code I had to Python. It took exactly the same amount of time. (At which point I remembered that Mercurial, which I actually use for this blog, is largely written in Python; that should have been a clue that it wasn’t likely to be Python in the first place.)
So finally I started taking a look at what my code was actually doing with the directory entries. For each one, it created a `File` object with a bunch of properties (name, size, etc), along with a cache entry to store things like the file’s content hash and the inputs that produced each output.
Now the objects themselves I couldn’t do much about, but the cache entry creation code was interesting: it first generated a cache key from the name, mtime, and size (as a `dict`), then added an empty entry to my global cache `dict` using that key. The global cache was later to be persisted as a JSON object (on which, more later), and so I had to convert the entry’s cache key to something hashable first.
And how to generate that hashable key? Well, it turned out that as I already had a general-purpose serialiser around, I’d made the decision to reuse the JSON encoder as a way to generate a string key from my cache key `dict` (because it’s not like performance matters, after all). Once I’d replaced that with something simpler, tree scans dropped from 5.2s to 2.9s. Success!
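In outline, the change was something like this (a hypothetical reconstruction, not the actual code):

```python
import json

# Before: serialise the key dict through the general-purpose JSON
# encoder, once per file.
def cache_key_slow(name, mtime, size):
    return json.dumps({'name': name, 'mtime': mtime, 'size': size},
                      sort_keys=True)

# After: "something simpler", formatting the three values directly.
def cache_key_fast(name, mtime, size):
    return '%s|%s|%s' % (name, mtime, size)
```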
I’d also noticed something odd while I was hacking about: when I removed some of my `DEBUG`-level logging statements, things sped up a bit, even though I was only running at `INFO` level. I briefly considered that perhaps Python’s logging was just slow, then decided to take another look at how I was setting up the logging in the first place:
```python
logging.basicConfig(
    level=logging.DEBUG,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))
logging.getLogger().handlers[0].setLevel(
    logging.DEBUG if args.verbose else logging.INFO)
```
Python’s logging is similar to Java’s, so this creates a logger that logs everything, then sets the default (console) log handler to only display messages at `INFO` level and above. Oops.
I’d stolen the code from another project where I’d had an additional always-on `DEBUG` handler that wrote to a file, but here I was just wasting time generating log records I’d throw away later. I changed the logging to set the level of the root logger instead, and sped things up by another second. More success!
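The fix is roughly this (a sketch, not the exact code; setting the level on the root logger means `logger.debug()` calls bail out before a record is even created):

```python
logging.basicConfig(
    level=logging.DEBUG if args.verbose else logging.INFO,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))
```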
Finally, I decided to take a look at the way I was writing out my cache. This is a fairly large in-memory `dict` mapping string keys to `dict` values for each file I knew about. I’d known that Python’s `json` module wasn’t likely to be particularly fast, but almost three seconds to write an 11MB file still seemed pretty slow to me.
I wasn’t actually writing it directly, though; the output was quite big and repetitive, so I was compressing it using `gzip` first:
```python
with gzip.open(self._cache_filename, 'wb') as cache_fh:
    json.dump(non_empty_values, cache_fh, sort_keys=True,
              default=_json_encode_datetime)
```
I noticed that if I removed the compression entirely, the time to write the cache dropped from about 2900ms to about 800ms, but by this point I was assuming that everything was my fault instead, so I decided to measure separately the time taken to generate the JSON output and the time taken to write it to the file.
To my surprise, when I split the two up (using `json.dumps()` to produce a string instead of `json.dump()`), the total time dropped to just 900ms. I have no real idea why this happens, but I suspect that something is either flushing each write to disk early, or that calls from Python code to Python’s native-code zlib module are expensive (or `json.dump()` is slow).
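The split version looks something like this (a sketch based on the description above, not the exact code):

```python
# Serialise to a single string first, then hand gzip one large write.
data = json.dumps(non_empty_values, sort_keys=True,
                  default=_json_encode_datetime)
with gzip.open(self._cache_filename, 'wb') as cache_fh:
    cache_fh.write(data)
```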
In total, that brought my no-change runtime down from twelve seconds to just about six.
So, in summary, once I realised that I should actually measure where my time was spent rather than guessing what it might be spent on, I was able to reduce my runtime by about half, which is quite a big deal in an edit-compile-run cycle. That took it from “I’m bored waiting; maybe read Slashdot instead” to something tolerable.
And so success, and scene.
But.
But there’s actually a larger lesson to learn here, too (and in fact I very nearly titled this post “The `\sqrt` of all evil” in light of that): I didn’t need to do any of this at all.
A few weeks after the above, I realised that I could solve my local editing problem an entirely different way. I moved MathJax out of the content tree entirely, and now (on local runs) just drop a symlink into the output tree after the tool is done. So if you take a look at the page as it serves now, you’ll see I’m back to sourcing MathJax via their CDN.
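In code terms, the post-build step is tiny (a sketch; both paths here are hypothetical):

```python
import os

mathjax_link = os.path.join(output_root, 'MathJax')
if not os.path.islink(mathjax_link):
    os.symlink('/path/to/local/MathJax', mathjax_link)
```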
This means that I’m back down to O(200) input files rather than O(30k), and my no-change builds now take 30ms. It was a fun journey, but I’m not entirely sure that cutting 45ms from a 75ms run was worth all the effort I put in…
1. Why? Mostly because I can: it’s fun. But also because it gives me an excuse to play around: originally with C and SQLite, and more recently with Python, a language I don’t use much otherwise. I could also say that static blog generators weren’t as common in 2006, but to be honest I don’t think I bothered looking.
2. This is a good thing: hard links would be pretty tricky otherwise.
3. Nope. `git` is multi-threaded when packing repositories, but as far as I’m aware that’s the only time it is.