In the previous post, I talked about finding pangrammatic windows in a corpus of text files from Project Gutenberg (in particular, the 2010 DVD image). Here I’m going to talk a bit about the implementation I used.
I think the problem itself is quite interesting. Restated, it’s “search a given text for all substrings that contain all the letters of the alphabet, and that do not themselves contain another matching substring” (the latter, because given “fooabc…xyzbar” we only want to emit “abc…xyz”).
I can imagine asking that kind of question in an interview¹. If you enjoy that kind of thing, you might want to go and think about how you’d solve it yourself.
Back already? When I brought up the idea at work, one sensible suggestion (thanks, Christian!) was to keep going until we’d seen each character at least once, keeping track by setting a bit per letter. Once we’d seen all the letters once, we could scan backwards to work out whether we’d found the end of a pangrammatic sequence or not.
Since the frequency of some letters (Q, Z) is very low in English text, we’d expect to only have to scan backwards occasionally. We’d also limit the size of that backward scan to avoid examining O(N^2) characters.
The only other wrinkle is what to do when we scan backwards: if we see all the letters (and we can use the same mechanism as before to track which we’ve seen), then we immediately know that we have a pangrammatic window, so we can output it. Otherwise we keep going for some maximum number of characters — I used 200 — and then give up.
What then? After a match covering offsets [a, b], we can’t forget about everything and jump back to offset b+1, as we might be looking at a string like “zaabc…xyz” (where we’d want to emit “zaabc…xy” and then the shorter “abc…xyz”). It’s always safe to restart at offset a+1, but we can do better: we can keep the set of letters we’ve seen (i.e. all of them) and remove the character at the start of the matched substring (“z”, in this case), which by definition must have only occurred once, and then continue from offset b+1.
In the much more likely case that we don’t see a pangrammatic sequence, we also continue the search at b+1, with the seen set covering what we’ve seen in the range (b-window size, b]. (Note that if we knew that the character at the start of the window had only appeared once, then we could remove it as before, but in general, we can’t.)
tl;dr, where’s the code?
pangram.c. Compile and run using something like:

$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ ./pangram file1 file2 file3
The compile flags just define the maximum window size, MAX_PANGRAM (to 200 bytes, the figure I chose in the end), and enable optimisations (which I was surprised to see make a noticeable difference to the runtime).
The implementation maps to the algorithm I described above:
- We mmap() the contents of each file in turn into memory, then invoke pangram().
- pangram() walks through the file byte-by-byte, calling seen_all() to update the letters we’ve seen in seen.
- When seen_all() returns true, we call try_scan_backwards() to check whether we have a pangram in the last MAX_PANGRAM bytes, and also to update seen with the new set of letters that are actually within that window (as described above).
- If we find a pangram, output_pangram() prints the file and contents to stdout.
I’m fairly happy with the result. It’s not the best code I’ve written, but it’s not too bad.
Loading the whole file into memory in one go isn’t particularly great (we only really need a sliding window of MAX_PANGRAM bytes, so we’re wasting a lot of memory), but it makes the code much simpler, and memory pressure isn’t something I need to worry about here. The largest file I’m dealing with is 43MB (Webster’s Unabridged Dictionary, pgwht04.txt), and my laptop has 16GB of RAM, so there’s no reason to try anything cleverer: simple, and it works.
How do we actually go about running this over all the texts? I’d previously loopback-mounted the ISO image and unzipped everything into a directory (though some of the zipfiles contained directories themselves), but that still gave me…
$ find -name '*.txt' | wc -l
32473
… just over 32,000 files to consider (totalling about 11.6GB of text). I’d decided to do this simply: I didn’t try to filter out non-English texts (assuming, correctly as it turned out, that foreign-language text was unlikely to show up in the results anyway), and for the same reason, I also didn’t bother dealing with different file encodings (as the files use a mixture of at least UTF-8 and ISO-8859-1).
From the directory containing the unzipped texts, we can run a search by using something simple like this:
$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out

real    2m56.795s
user    0m57.155s
sys     0m6.667s
I pass -n 100 so that xargs will run our program with 100 files at a time, rather than all 32,000. That’ll be important later, though initially I was a little worried about command-line length limits, probably unnecessarily².
The resulting output file contains about 42,000 results, each with the letter count (which is what matters, not the byte count) followed by the filename and text, so we can easily find the shortest sequences:
$ sort -n pangram.out | head -n3
26 ./10742.txt: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
26 ./10742.txt: B C D E F G H I J K L M N O P Q R S T U V W X Y Z & a
26 ./10742.txt: C D E F G H I J K L M N O P Q R S T U V W X Y Z & a b
Okay, it needs a bit of manual review to weed out the nonsense, but it’s good enough.
The only thing I’m not entirely sure about here is the safety of combining the results from stdout if I run more than one copy of pangram at a time (spoilers!). Well, rather: I’m pretty sure it’s not safe, but it appears to work in practice. Mostly.
The results go to stdout, which I’d thought was line-buffered. However, without an explicit fflush(stdout) after the printf() (whose output always finishes with a newline anyway), a small fraction of the output is lost when I concurrently append to a single output file: I’m missing some lines (a few hundred in 42,000 or so), and I get the ends of a few others.
With fflush(stdout), I seem to get the right results again, unless I spawn a large number of concurrent processes (say, 300), so I’m guessing there’s a race somewhere that I’m occasionally losing. The reason that I’m a little confused is that I expected this either to work fine, because writes of less than PIPE_BUF bytes (512 by POSIX; in practice, at least 4KB) are atomic, or, if that didn’t apply in this situation, to interleave the results completely.
Make it go faster?
Three minutes is a bit long to wait; can we make it run faster?
Yes, we can. But that’s a post for another day.
¹ Note to interview candidates: I will not actually be asking that question. Do not revise that question. (Or do. I’m a footnote, not a cop.) ↩
² Definitely unnecessarily: I learned more recently that xargs automatically caps command-line lengths according to the maximum size (with a lower-bound on that cap of 128KB); see xargs --show-limits. ↩