This is the third in a series of posts about searching for pangrammatic windows in a corpus of documents. I’ve previously talked about what I found in the Gutenberg corpus and the code I used, and ended the last post with a question: can we make it faster?
Well, that’s what this post is about.
Where were we?
$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out

real    2m56.795s
user    0m57.155s
sys     0m6.667s
In other words, it currently takes about three minutes. We’re going to try to reduce that. Some facts and figures:
- My laptop has a Core i5 CPU, reporting a total of four processors (one physical CPU with two physical cores and two HT cores).
- My disk is an SSD. I’m using whole-disk encryption via dm-crypt, and then LVM2 over the top of that, so I have a bunch of different devices that all relate to my root partition.
- I’m running Linux 3.13.11 (and, for what it’s worth, using gcc 4.6.3).
- We’re searching a little over 32,000 files totalling 11.6GB of text.
Before we do anything, we really need to make sure we can get repeatable measurements. The first few experiments I tried ended up with nonsensical results — it turns out that “all the Gutenberg text” is less than the total memory on my laptop (16GB), so all the runs after the first just read from the filesystem cache.
Fixing that is easy: we ask the kernel to drop the cache before we run a test:
# echo 3 > /proc/sys/vm/drop_caches
or, since we’re probably not running as root,
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
/proc/sys/vm/drop_caches is documented in
Documentation/sysctl/vm.txt; the command above will drop both the
file content cache and the inode cache (containing directory contents).
One other source of variation I ran into was caused by how long find took
to run (and note that it runs concurrently with xargs): most of the time
it completed quickly, but in a few situations it was (inconsistently)
delayed by whatever else was going on, causing the whole search to take much
longer than usual. This was also easy to avoid: we capture the list of
files in advance:
$ find -name '*.txt' > filelist
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ time <filelist xargs -n 100 ./pangram > pangram.out

real    2m58.310s
user    0m55.787s
sys     0m6.630s
Change the algorithm
So where to start? As I see it, there are at least three things we can try:
- Change the algorithm to reduce the work we need to do.
- Change the implementation to do that work more efficiently.
- Make better use of the resources we have.
Let’s take a look at the algorithm first.
I’m not going to repeat the whole thing here (see the previous post for that), but in summary: we read through each file until we’ve seen enough letters that we might have found a pangram, then scan backwards until we find one, or until we hit a limit (I used 200 bytes); we then resume scanning forwards from where we left off.
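As a rough sketch of that loop (simplified, and not the actual pangram.c; the names and details here are mine), it looks something like this:

#include <ctype.h>
#include <stdio.h>

#ifndef MAX_PANGRAM
#define MAX_PANGRAM 200               /* give up backtracking after this many bytes */
#endif

#define ALL_LETTERS 0x3ffffffu        /* bitmask with one bit per letter a-z */

/* Scan buf, printing each pangrammatic window found (at most MAX_PANGRAM
 * bytes long), using the forward-scan-then-backtrack approach. */
static void scan(const char *buf, size_t len)
{
    unsigned seen = 0;                /* letters seen since the last reset */

    for (size_t i = 0; i < len; i++) {
        if (!isalpha((unsigned char)buf[i]))
            continue;
        seen |= 1u << (tolower((unsigned char)buf[i]) - 'a');
        if (seen != ALL_LETTERS)
            continue;

        /* We might have a pangram ending here: scan backwards, accumulating
         * letters, until the window covers a-z or exceeds MAX_PANGRAM bytes. */
        unsigned have = 0;
        size_t start = i + 1;         /* i.e. "not found" */
        for (size_t j = i + 1; j-- > 0 && i + 1 - j <= MAX_PANGRAM; ) {
            if (isalpha((unsigned char)buf[j]))
                have |= 1u << (tolower((unsigned char)buf[j]) - 'a');
            if (have == ALL_LETTERS) {
                start = j;
                break;
            }
        }
        if (start <= i)
            printf("%zu: %.*s\n", start, (int)(i + 1 - start), buf + start);

        seen = 0;                     /* resume scanning forwards from here */
    }
}

int main(void)
{
    /* A simple fixed-size buffer for the sketch; most Gutenberg texts are
     * well under this size. */
    static char buf[16 << 20];
    size_t len = fread(buf, 1, sizeof buf, stdin);
    scan(buf, len);
    return 0;
}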
Can we improve on this? Perhaps. We clearly need to visit all the letters at least once, but — as suggested to me by Jesse Sheidlower, who wrote the PangramTweets Twitter bot that kicked this all off — we can avoid backtracking during the initial search if we keep some additional state1.
Does this help? Unfortunately, not really: it takes exactly the same amount of time as the backtracking version.
Perhaps that’s not too surprising, though. It’s fairly clear that the problem should be I/O-bound, and so — unless the backtracking causes additional I/O (which appears not to be the case) — we should see if we can perhaps spend less time waiting on the I/O we have.
Change the implementation
As it stands, the total cost of our I/O is pretty much unavoidable: we need to read each file completely into RAM.
We could reduce the overall I/O by changing what we read2. For example, we could:
- Read the original zipfiles directly, trading I/O for CPU.
- Combine the separate text files into a small number of large files, removing many small reads (and making better use of read-ahead, which can’t help with the first read from a file).
- Perform the reads upfront by copying the files to a tmpfs.
However, these fundamentally change the problem we’re trying to solve, not just the way we’re solving it, and so I’m going to stick with what I have for now.
So far, we’re using
mmap() to read the file. This gives us a memory range
into which the kernel will read the file’s contents as-needed, using some
amount of asynchronous read-ahead. If we try to read a page that hasn’t
been read from disk yet, we’ll block3.
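Concretely, the per-file setup looks roughly like this (a sketch rather than the real code; process_file() and process_buffer() are stand-in names, and error handling is minimal):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void process_buffer(const char *buf, size_t len);   /* the pangram search */

static int process_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    /* Map the whole file read-only; pages are faulted in (with some
     * read-ahead) as the search touches them. */
    char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return -1;
    }

    process_buffer(buf, st.st_size);

    munmap(buf, st.st_size);
    close(fd);
    return 0;
}

In the real program, something like this runs once for each filename that xargs passes on the command line.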
At least in my case, using
mmap() to map the file on-demand isn’t much
different to using
read() to read the whole file in one go, which is
itself a bit of a surprise: the former’s asynchronous read-ahead should mean
that we can get started more quickly. I ran across an email about mmap()
on the linux-kernel mailing list where Linus explains that
mmap() is actually quite expensive to start with. In any case, most of our
files are in the range 256–512KB4, and so perhaps there’s just
not a lot of read-ahead to do.
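For comparison, the read-the-whole-file-in-one-go approach looks roughly like this (same caveats as before; the loop is there because read() is allowed to return short counts):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

void process_buffer(const char *buf, size_t len);   /* the pangram search */

static int process_file_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    char *buf = malloc(st.st_size);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    /* Pull the whole file into memory before searching, retrying until we
     * have everything (or hit EOF/an error). */
    size_t got = 0;
    while (got < (size_t)st.st_size) {
        ssize_t n = read(fd, buf + got, st.st_size - got);
        if (n <= 0)
            break;
        got += n;
    }

    process_buffer(buf, got);

    free(buf);
    close(fd);
    return 0;
}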
One thing we could try is reducing the time we spend waiting for I/O by
providing hints to the kernel about our usage of an area of memory or a
file. For example, to hint that we’re about to read a buffer sequentially,
we can write
madvise(buf, len, MADV_SEQUENTIAL).
In theory, this should allow us to optimise the file I/O based on our usage. In practice (at least in my case), it turns out that these are actually pessimisations.
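For reference, issuing those hints looks roughly like the following, where buf, len, and fd are the values from the mmap() sketch above (and, as with the other snippets, this is a sketch rather than the code that produced the numbers here):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hint that we'll read the mapping (and the underlying file) sequentially;
 * return values are ignored for brevity. */
static void hint_sequential(char *buf, size_t len, int fd)
{
    madvise(buf, len, MADV_SEQUENTIAL);              /* hint on the mapping */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  /* same idea, on the fd */
}

/* The "need the data now" variants instead ask the kernel to start reading
 * the given range immediately. */
static void hint_willneed(char *buf, size_t len, int fd)
{
    madvise(buf, len, MADV_WILLNEED);
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
}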
While we have several different ways to hint to the kernel, as far as I can see, they boil down to just two choices: whether or not we need the data immediately, and what the access pattern is for the data in memory.
If we need the data “now” (
posix_fadvise(), etc), then the kernel will issue a synchronous read
there-and-then, returning once the file’s data is in the page cache. This
can be no faster than issuing a blocking
read() for the whole file, and
— I assume due to the overhead of
mmap() — actually ends up a bit
slower in practice.
Otherwise, the access pattern is one of “normal”, “random”, or “sequential”. “Normal” is what we get by default; it triggers some amount of read-ahead.
“Random” (MADV_RANDOM, etc) is straightforward: it switches off
read-ahead, so every new page access causes a single-page blocking read.
This is terrible for performance, as you might expect — in our case, it
roughly doubles the runtime.
“Sequential” (MADV_SEQUENTIAL) is less well-defined. It’s not completely
clear to me what it does in practice — the
posix_fadvise() man page says
that it doubles the read-ahead window (on Linux), but the kernel source
implies that it might be a bit more complex than that — but in any case,
it seems to have a small negative effect overall.
madvise() and friends provide just one way to tune I/O scheduling, but
they are the easiest to use. We could also look at overlapped or threaded
I/O, but that’s significantly more complex — and perhaps there’s an easier
way to improve our utilization anyway.
Make better use of the resources we have
In this case, I’m talking about disk and CPU utilization. While some of the disk reads will occur while we’re searching the buffer we’ve already read into memory, most won’t, and so the CPU should often be waiting for an I/O operation to complete.
It’d be nice if we could get a bit more detail about how we’re doing than
simple wallclock time, so I’m going to measure the current CPU and disk
utilization using iostat.
iostat is a whole-machine profiler, so it’s probably a good idea not to
have much else going on at the time (though that said, I didn’t see too much
impact from the copy of Chrome I had running). Alternatively, we could look
into per-process monitoring via iotop or pidstat, or an
event/tracing approach like SystemTap or
iosnoop5. (Incidentally, Brendan Gregg’s Linux
Performance page is a great resource for finding out more about
Linux performance tools.)
iostat — with -N to print meaningful names for the device mapper
volumes, and -x to print extended disk statistics — produces something
like the following (rather wide, sorry) output:
$ iostat -N -x
Linux 3.13.0-30-generic (malcolmr)    06/09/14    _x86_64_    (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.28    0.07    0.77    0.11    0.00   97.77

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.17     0.04   11.92    4.62   239.54    25.25    32.03     0.01    0.33    0.29    0.43   0.17   0.28
sda5_crypt        0.00     0.00   11.81    4.66   238.68    25.24    32.05     0.04    2.43    0.91    6.29   0.29   0.48
sysvg-root        0.00     0.00   11.65    4.48   238.02    25.24    32.65     0.04    2.48    0.92    6.54   0.30   0.48
sysvg-swap_1      0.00     0.00    0.12    0.00     0.47     0.00     8.00     0.00    0.15    0.15    0.00   0.15   0.00

$
It’s important to note that if you run
iostat this way, you actually get a
running average since boot, which isn’t very useful at all. What I chose to
do instead was to start the run and then execute
iostat -N -x 30 3, which
outputs three reports separated by 30 seconds. The first is the
average-since-boot, which we can ignore, but the other two are averages over
the 30 seconds since the prior report.
Having two good reports allows us to check how variable the numbers we’re seeing are (in my case, fairly reliable). Here’s the kind of output I got.
First, we start a run:
$ time <filelist xargs -n 100 ./pangram > pangram.out
and then concurrently run
$ iostat -N -x 30 3
[...]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.19    0.00   19.54   15.99    0.00   56.28

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.00  655.57    2.37 68680.93    23.33   208.85     0.26    0.39    0.39    0.39   0.35  22.93
sda5_crypt        0.00     0.00  654.43    2.37 68666.40    23.33   209.16     1.19    1.82    1.82    0.73   1.31  86.00
sysvg-root        0.00     0.00  654.43    2.17 68666.40    23.33   209.23     1.19    1.82    1.82    0.80   1.31  86.05
sysvg-swap_1      0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
The bottom line is that the CPU is idle most of the time (
%iowait is “idle
but there are runnable processes waiting for I/O”), and the disk is also
idle some of the time (
%util, on the far right, which is the proportion of
the time that there was at least one outstanding I/O operation to the
device). Note that a %util of 100% does not mean that the device cannot take any
more requests, just that there was at least one request pending for the
device at all times. On the other hand, a device with
%util less than
100% (as we have here) is definitely under-utilized.
I’m not entirely sure what the discrepancy between the reported %util for
sda5_crypt and sda (the dm-crypt volume and raw disk) is due to, but I
wasn’t able to get the latter above 50%. It looks to me like perhaps
dm-crypt is unwilling to forward more than one request at a time to the raw
disk: avgqu-sz (the average request queue depth, including active
requests) for sda never got to 0.9, no matter how much load I generated.
Now that we have evidence that both the CPU and disk really are underutilized, how do we improve things? Well, the easiest way is to run over more than one file at a time:
$ time <filelist xargs -n 100 -P 2 ./pangram > pangram.out

real    1m37.855s
user    0m51.309s
sys     0m5.786s
Well, that’s much better already. The options to
xargs tell it to run
pangram with at most 100 files, and execute two copies in parallel. (It’s
important to limit the number of files per invocation, otherwise xargs
will just pass everything to a single copy.)
At this point, we’re keeping the disk (
sda5_crypt) a lot busier:
%util is up from 86% to nearly 99%, and
rkB/s (the read throughput) is up from
69MB/s to 122MB/s. In fact, we can continue to increase the number of
concurrent processes to reach a peak of about 149MB/s:
You’ll see that it tops out somewhere around P=8. I’m not sure how to
explain the drop in read throughput from around P=32: it seems to correspond
to the point at which
%idle drops to zero (i.e. there’s a task waiting on
I/O at all times), but I don’t see why that would necessarily cause things
to run more slowly (and all of the other metrics appear roughly
linear in the amount of load).
Whatever the reason, we’re done here: we’ve reduced the runtime of this task from three minutes to about 1m20s, a little under half the time it took originally.
While this was something of an artificial example (we really didn’t need to run this more than once, after all), and while some of the above is surely specific to my setup, I hope it was an interesting exercise.
If I had to tl;dr the above, it would be:
- When running performance tests, have a metric to optimise, and make sure you can produce it reliably. Try to remove any sources of variation that you can’t avoid.
- Focus first on the algorithm you’re using and the number of expensive operations you’re doing: moving from O(N^2) to O(N log N) or removing an RPC or disk read is almost certainly better than any amount of micro-optimisation.
- Consider whether you can change the problem to make it easier to do things more efficiently.
- If things are still slow, dig into the details until you understand what’s going on.
In many cases, it isn’t necessary to spend the time on performance tuning, but when it is, the above is probably a good roadmap. Typically, you’d only bother when you’re doing something like handling a user request, where the latency is a target by itself, or when you’re running something repeatedly, perhaps because it’s a core library function, or perhaps because you’re processing a lot of data.
Briefly: we track the byte- and letter-offset in the file at which we most-recently saw each letter, and also the minimum letter-offset over all letters (which will change infrequently, thanks to the non-uniform letter distribution); when that minimum letter-offset changes, we have a new pangram, and immediately know the number of letters it contains. This is O(N) in the number of letters in the file (for the search; we still have to backtrack to find the text to output), and does have one other significant advantage: it avoids any artificial limit on the number of non-letter bytes (whitespace, etc) that can appear within a sequence. ↩
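As a concrete sketch of that bookkeeping (the structure and names here are mine, and simplified relative to the real thing), the search loop becomes:

#include <ctype.h>
#include <stdio.h>

#ifndef MAX_PANGRAM
#define MAX_PANGRAM 200              /* report windows of at most this many bytes */
#endif

struct occurrence {
    long byte;                       /* byte offset of the most recent occurrence */
    long letter;                     /* letter offset: how many letters preceded it */
};

/* Scan buf without backtracking: track where each letter was last seen and
 * report a new pangrammatic window whenever the oldest "last seen" moves. */
static void scan(const char *buf, long len)
{
    struct occurrence last[26];
    for (int i = 0; i < 26; i++)
        last[i].byte = last[i].letter = -1;

    long nletters = 0;               /* letter offset of the next letter we read */
    int min_idx = -1;                /* letter seen least recently; -1 until all 26 seen */

    for (long pos = 0; pos < len; pos++) {
        if (!isalpha((unsigned char)buf[pos]))
            continue;

        int c = tolower((unsigned char)buf[pos]) - 'a';
        int was_min = (c == min_idx);
        last[c].byte = pos;
        last[c].letter = nletters++;

        /* The minimum letter-offset only moves when the letter holding it
         * (or a letter we hadn't yet seen at all) just turned up. */
        if (min_idx >= 0 && !was_min)
            continue;

        min_idx = 0;
        for (int i = 1; i < 26; i++)
            if (last[i].letter < last[min_idx].letter)
                min_idx = i;
        if (last[min_idx].letter < 0) {
            min_idx = -1;            /* some letter is still unseen */
            continue;
        }

        /* New pangram: it runs from the oldest letter's byte offset to here,
         * and its letter count falls straight out of the two offsets. */
        long start = last[min_idx].byte;
        long bytes = pos - start + 1;
        long letters = nletters - last[min_idx].letter;
        if (bytes <= MAX_PANGRAM)
            printf("%ld (%ld letters): %.*s\n", start, letters, (int)bytes, buf + start);
    }
}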
I haven’t actually tested any of these, by the way, so they may not actually help, but they all sound reasonable. ↩
These two cases can be distinguished via
/proc/vmstat: read-ahead reads are counted as page faults (
pgfault), while synchronous (blocking) reads are counted as major page faults (pgmajfault). ↩
The filesize distribution of Gutenberg texts is actually pretty close to a log-normal distribution centred around 2^18 bytes. ↩
Perhaps this is a red herring, but I would have expected that passing those requests down to the raw disk device could only help (and I note that Intel use a queue depth of 32 when measuring SSD performance). As it stands, the maximum read throughput I can get from the raw device (via dm-crypt) is about 150MB/s, a little under half of the real-world sequential read performance I see in reviews. ↩