Pangrammatic performance

This is the third in a series of posts about searching for pangrammatic windows in a corpus of documents. I’ve previously talked about what I found in the Gutenberg corpus and the code I used, and ended the last post with a question: can we make it faster?

Well, that’s what this post is about.

Where were we?

I’ve expanded the contents of the Project Gutenberg 2010 DVD image into the current directory, and I’m compiling and running pangram.c as follows:

$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out

real    2m56.795s
user    0m57.155s
sys     0m6.667s

In other words, it currently takes about three minutes. We’re going to try to reduce that. Some facts and figures:

My laptop has a Core i5 CPU, reporting a total of four processors (one physical CPU with two physical cores and two HT cores).
My disk is an SSD. I’m using whole-disk encryption via dm-crypt, and then LVM2 over the top of that, so I have a bunch of different devices that all relate to my root partition.
I’m running Linux 3.13.11 (and, for what it’s worth, using gcc 4.6.3).
We’re searching a little over 32,000 files totalling 11.6GB of text.

Repeatable reads

Before we do anything, we really need to make sure we can get repeatable measurements. The first few experiments I tried ended up with nonsensical results — it turns out that “all the Gutenberg text” is less than the total memory on my laptop (16GB), so all the runs after the first just read from the filesystem cache.

Fixing that is easy: we ask the kernel to drop the cache before we run a test:

# echo 3 > /proc/sys/vm/drop_caches

or, since we’re probably not running as root,

$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

/proc/sys/vm/drop_caches is documented in Documentation/sysctl/vm.txt; the command above will drop both the file content cache and the inode cache (containing directory contents).

One other source of variation I ran into was caused by how long find took to run (and note that it runs concurrently with xargs): most of the time it completed quickly, but in a few situations it was (inconsistently) delayed by what else was going on, causing the whole search to take much longer than usual. This was also easy to avoid: we capture the list of files in advance:

$ find -name '*.txt' > filelist
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ time <filelist xargs -n 100 ./pangram > pangram.out

real    2m58.310s
user    0m55.787s
sys     0m6.630s

Change the algorithm

So where to start? As I see it, there are at least three things we can try:

Change the algorithm to reduce the work we need to do.
Change the implementation to do that work more efficiently.
Make better use of the resources we have.

Let’s take a look at the algorithm first.

I’m not going to repeat the whole thing here (see the previous post for that), but in summary: we read through each file until we’ve seen enough letters that we might have found a pangram, then scan backwards until we find one, or until we hit a limit (I used 200 bytes); we then resume scanning forwards from where we left off.

Can we improve on this? Perhaps. We clearly need to visit all the letters at least once, but — as suggested to me by Jesse Sheidlower, who wrote the PangramTweets Twitter bot that kicked this all off — we can avoid backtracking during the initial search if we keep some additional state¹.

Does this help? Unfortunately, not really: it takes exactly the same amount of time as the backtracking version.

Perhaps that’s not too surprising, though. It’s fairly clear that the problem should be I/O-bound, and so — unless the backtracking causes additional I/O (which appears not to be the case) — we should see if we can perhaps spend less time waiting on the I/O we have.

Change the implementation

As it stands, the total cost of our I/O is pretty much unavoidable: we need to read each file completely into RAM.

We could reduce the overall I/O by changing what we read². For example, we could:

Read the original zipfiles directly, trading I/O for CPU.
Combine the separate text files into a small number of large files, removing many small reads (and making better use of read-ahead, which can’t help with the first read from a file).
Perform the reads upfront by copying the files to a tmpfs.

However, these fundamentally change the problem we’re trying to solve, not just the way we’re solving it, and so I’m going to stick with what I have for now.

So far, we’re using mmap() to read the file. This gives us a memory range into which the kernel will read the file’s contents as-needed, using some amount of asynchronous read-ahead. If we try to read a page that hasn’t been read from disk yet, we’ll block³.

At least in my case, using mmap() to map the file on-demand isn’t much different to using read() to read the whole file in one go, which is itself a bit of a surprise: the former’s asynchronous read-ahead should mean that we can get started more quickly. I ran across an email about mmap() on the linux-kernel mailing list where Linus explains that mmap() is actually quite expensive to start with. In any case, most of our files are in the range 256–512KB⁴, and so perhaps there’s just not a lot of read-ahead to do.

One thing we could try is reducing the time we spend waiting for I/O by providing hints to the kernel about our usage of an area of memory or a file. For example, to hint that we’re about to read a buffer sequentially, we can write madvise(buf, len, MADV_SEQUENTIAL).

In theory, this should allow us to optimise the file I/O based on our usage. In practice (at least in my case), it turns out that these are actually pessimisations.

While we have several different ways to hint to the kernel, as far as I can see, they boil down to just two choices: whether or not we need the data immediately, and what the access pattern is for the data in memory.

If we need the data “now” (MADV_WILLNEED, MAP_POPULATE for posix_fadvise(), etc), then the kernel will issue a synchronous read there-and-then, returning once the file’s data is in the page cache. This can be no faster than issuing a blocking read() for the whole file, and — I assume due to the overhead of mmap() — actually ends up a bit slower in practice.

Otherwise, the access pattern is one of “normal”, “random”, or “sequential”. “Normal” is what we get by default; it triggers some amount of read-ahead.

“Random” (MADV_RANDOM, etc) is straightforward: it switches off read-ahead, so every new page access causes a single-page blocking read. This is terrible for performance, as you might expect — in our case, it roughly doubles the runtime.

“Sequential” (MADV_SEQUENTIAL) is less well-defined. It’s not completely clear to me what it does in practice — the posix_fadvise() man page says that it doubles the read-ahead window (on Linux), but the kernel source implies that it might be a bit more complex that that — but in any case, it seems to have a small negative effect overall.

madvise() and friends provide just one way to tune I/O scheduling, but they are the easiest to use. We could also look at overlapped or threaded I/O, but that’s significantly more complex — and perhaps there’s an easier way to improve our utilization anyway.

Make better use of the resources we have

In this case, I’m talking about disk and CPU utilization. While some of the disk reads will occur while we’re searching the buffer we’ve already read into memory, most won’t, and so the CPU should often be waiting for an I/O operation to complete.

It’d be nice if we could get a bit more detail about how we’re doing than simple wallclock time, so I’m going to measure the current CPU and disk utilization using iostat.

iostat is a whole-machine profiler, so it’s probably a good idea not to have much else going on at the time (though that said, I didn’t see too much impact from the copy of Chrome I had running). Alternatively, we could look into per-process monitoring via iotop or pidstat, or an event/tracing approach like SystemTap or iosnoop⁵. (Incidentally, Brendan Gregg’s Linux Performance page is a great resource for finding out more about Linux performance tools.)

Running iostat — with -N to print meaningful names for the device mapper devices, and -x to print extended disk statistics — produces something like the following (rather wide, sorry) output:

$ iostat -N -x
Linux 3.13.0-30-generic (malcolmr)      06/09/14        _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.28    0.07    0.77    0.11    0.00   97.77

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.17     0.04   11.92    4.62   239.54    25.25    32.03     0.01    0.33    0.29    0.43   0.17   0.28
sda5_crypt        0.00     0.00   11.81    4.66   238.68    25.24    32.05     0.04    2.43    0.91    6.29   0.29   0.48
sysvg-root        0.00     0.00   11.65    4.48   238.02    25.24    32.65     0.04    2.48    0.92    6.54   0.30   0.48
sysvg-swap_1      0.00     0.00    0.12    0.00     0.47     0.00     8.00     0.00    0.15    0.15    0.00   0.15   0.00

$

It’s important to note that if you run iostat this way, you actually get a running average since boot, which isn’t very useful at all. What I chose to do instead was to start the run and then execute iostat -N -x 30 3, which outputs three reports separated by 30 seconds. The first is the average-since-boot, which we can ignore, but the other two are averages over the 30 seconds since the prior report.

Having two good reports allows us to check how variable the numbers we’re seeing are (in my case, fairly reliable). Here’s the kind of output I got.

First, we start a run:

$ time <filelist xargs -n 100 ./pangram > pangram.out

and then concurrently run iostat:

$ iostat -N -x 30 3
[...]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.19    0.00   19.54   15.99    0.00   56.28

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.00  655.57    2.37 68680.93    23.33   208.85     0.26    0.39    0.39    0.39   0.35  22.93
sda5_crypt        0.00     0.00  654.43    2.37 68666.40    23.33   209.16     1.19    1.82    1.82    0.73   1.31  86.00
sysvg-root        0.00     0.00  654.43    2.17 68666.40    23.33   209.23     1.19    1.82    1.82    0.80   1.31  86.05
sysvg-swap_1      0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

The bottom line is that the CPU is idle most of the time (%iowait is “idle but there are runnable processes waiting for I/O”), and the disk is also idle some of the time (%util, on the far right, which is the proportion of the time that there was at least one outstanding I/O operation to the device).

Note that %util of 100% does not mean that the device cannot take any more requests, just that there was at least one request pending for the device at all times. On the other hand, a device with %util less than 100% (as we have here) is definitely under-utilized.

I’m not entirely sure what the discrepancy between the reported %util for sda5_crypt and sda (the dm-crypt volume and raw disk) is due to, but I wasn’t able to get the latter above 50%. It looks to me like perhaps dm-crypt is unwilling to forward more than one request at a time to the raw device: avgqu-sz (the average request queue depth, including active requests) for sda never got to 0.9, no matter how much load I added⁶.

Now that we have evidence that both the CPU and disk really are underutilized, how do we improve things? Well, the easiest way is to run over more than one file at a time:

$ time <filelist xargs -n 100 -P 2 ./pangram > pangram.out

real    1m37.855s
user    0m51.309s
sys     0m5.786s

Well, that’s much better already. The options to xargs tell it to run pangram with at most 100 files, and execute two copies in parallel. (It’s important to limit the number of files per invocation, otherwise xargs will just pass everything to a single copy.)

At this point, we’re keeping the disk (sda5_crypt) a lot busier: %util is up from 86% to nearly 99%, and rkB/s (the read throughput) is up from 69MB/s to 122MB/s. In fact, we can continue to increase the number of concurrent processes to reach a peak of about 149MB/s:

`rkB/s` for different values of xargs -P

You’ll see that it tops out somewhere around P=8. I’m not sure how to explain the drop in read throughput from around P=32: it seems to correspond to the point at which %idle drops to zero (i.e. there’s a task waiting on I/O at all times), but I don’t see why that would necessarily cause things to run more slowly (and all the metrics apart from r/s and rkB/s are linear in the amount of load).

Whatever the reason, we’re done here: we’ve reduced the runtime of this task from three minutes to about 1m20s, a little over half the time it took originally.

Summary

While this was something of an artificial example (we really didn’t need to run this more than once, after all), and while some of the above is surely specific to my setup, I hope it was an interesting exercise.

If I had to tl;dr the above, it would be:

When running performance tests, have a metric to optimise, and make sure you can produce it reliably. Try to remove any sources of variation that you can’t avoid.
Focus first on the algorithm you’re using and the number of expensive operations you’re doing: moving from O(N²) to O(N log N) or removing an RPC or disk read is almost certainly better that any amount of micro-optimisation.
Consider whether you can change the problem to make it easier to do things more efficiently.
If things are still slow, dig into the details until you understand what’s going on.

In many cases, it isn’t necessary to spend the time on performance tuning, but when it is, the above is probably a good roadmap. Typically, you’d only bother when you’re doing something like handling a user request, where the latency is a target by itself, or when you’re running something repeatedly, perhaps because it’s a core library function, or perhaps because you’re processing a lot of data.

Briefly: we track the byte- and letter-offset in the file at which we most-recently saw each letter, and also the minimum letter-offset over all letters (which will change infrequently, thanks to the non-uniform letter distribution); when that minimum letter-offset changes, we have a new pangram, and immediately know the number of letters it contains. This is O(N) in the number of letters in the file (for the search; we still have to backtrack to find the text to output), and does have one other significant advantage: it avoids any artificial limit on the number of non-letter bytes (whitespace, etc) that can appear within a sequence. ↩
I haven’t actually tested any of these, by the way, so they may not actually help, but they all sound reasonable. ↩
These two cases can be distinguished via /proc/vmstat: read-ahead reads are counted as page faults (pgfault), while synchronous (blocking) reads are counted as major page faults (pgmajfault). ↩
The filesize distribution of Gutenberg texts is actually pretty close to a log-normal distribution centred around 2¹⁸. ↩
Oddly, Linux doesn’t appear to have a good solution for iostat-like accounting for cgroups (though pidstat comes close) — or if it does, I couldn’t find it. ↩
Perhaps this is a red herring, but I would have expected that passing those requests down to the raw disk device could only help (and I note that Intel use a queue depth of 32 when measuring SSD performance). As it stands, the maximum read throughput I can get from the raw device (via dm-crypt) is about 150MB/s, a little under half of the real-world sequential read performance I see in reviews. ↩