Pangrams in C
In the previous post, I talked about finding pangrammatic windows in a corpus of text files from Project Gutenberg (in particular, the 2010 DVD image). Here I’m going to talk a bit about the implementation I used.
I think the problem itself is quite interesting. Restated, it’s “search a given text for all substrings that contain all the letters of the alphabet, and that do not themselves contain another matching substring” (the latter, because given “fooabc…xyzbar” we only want to emit “abc…xyz”).
I can imagine asking that kind of question in an interview1. If you enjoy that kind of thing, you might want to go and think about how you’d solve it yourself.
Back already? When I brought up the idea at work, one sensible suggestion (thanks, Christian!) was to keep going until we’d seen each character at least once, keeping track by setting a bit per letter. Once we’d seen all the letters once, we could scan backwards to work out whether we’d found the end of a pangrammatic sequence or not.
Since the frequency of some letters (Q, Z) is very low in English text, we’d expect to only have to scan backwards occasionally. We’d also limit the size of that backward scan to avoid examining O(N^2) characters.
The only other wrinkle is what to do when we scan backwards: if we see all the letters (and we can use the same mechanism as before to track which we’ve seen), then we immediately know that we have a pangrammatic window, so we can output it. Otherwise we keep going for some maximum number of characters — I used 200 — and then give up.
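In C, that forward tracking plus the bounded backward scan might look something like this. This is a minimal sketch, not the linked pangram.c: the names seen_all and try_scan_backwards match the implementation described below, but the signatures and details here are my own assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define ALL_LETTERS ((1u << 26) - 1)   /* one bit per letter, all set */
#define MAX_PANGRAM 200                /* backward-scan limit in bytes */

/* Fold byte c into the seen-set and report whether every letter of the
   alphabet has now been seen at least once. */
static int seen_all(uint32_t *seen, int c)
{
    if (isalpha((unsigned char)c))
        *seen |= 1u << (tolower((unsigned char)c) - 'a');
    return *seen == ALL_LETTERS;
}

/* Scan backwards from offset `end` (inclusive) for at most MAX_PANGRAM
   bytes.  Returns 1 and stores the window's start offset in *start if
   the scanned range contains every letter; returns 0 otherwise.  Either
   way, *seen is left holding exactly the letters inside the scanned
   range, ready for the search to continue. */
static int try_scan_backwards(const char *buf, size_t end,
                              size_t *start, uint32_t *seen)
{
    size_t limit = end + 1 > MAX_PANGRAM ? end + 1 - MAX_PANGRAM : 0;
    uint32_t s = 0;
    for (size_t i = end + 1; i > limit; ) {
        i--;
        if (seen_all(&s, buf[i])) {
            *start = i;
            *seen = s;
            return 1;
        }
    }
    *seen = s;
    return 0;
}
```

Reusing seen_all() for the backward scan keeps the two directions consistent: it's the same bitmask trick either way.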
What then? After a match covering offsets [a, b], we can’t forget about everything and jump back to offset b+1, as we might be looking at a string like “zaabc…xyz” (where we’d want to emit “zaabc…xy” and then the shorter “abc…xyz”). It’s always safe to restart at offset a+1, but we can do better: we can keep the set of letters we’ve seen (i.e. all of them) and remove the character at the start of the matched substring (“z”, in this case), which by definition must have only occurred once, and then continue from offset b+1.
In the much more likely case that we don’t see a pangrammatic sequence, we also continue the search at b+1, with the seen set covering what we’ve seen in the range (b-window size, b]. (Note that if we knew that the character at the start of the window had only appeared once, then we could remove it as before, but in general, we can’t.)
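As a sketch of that restart rule for the match case (my own illustration, using the same one-bit-per-letter representation; the helper name is hypothetical): after emitting a window covering [a, b], the letter at offset a must have occurred exactly once within it, so we clear just that letter's bit and resume at b + 1.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* After a pangrammatic window [a, b], the letter at offset a occurred
   exactly once in it (otherwise a shorter window would have matched),
   so drop just that letter from the seen-set before resuming at b + 1. */
static uint32_t drop_first_letter(uint32_t seen, int first)
{
    return seen & ~(1u << (tolower((unsigned char)first) - 'a'));
}
```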
tl;dr, where’s the code?
Download pangram.c. Compile and run using something like:
$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ ./pangram file1 file2 file3
The compile flags just define the maximum window size, MAX_PANGRAM (to 200 bytes, the figure I chose in the end), and enable optimisations (which I was surprised to see make a noticeable difference to the runtime).
The implementation maps to the algorithm I described above:
- main() simply uses mmap() to read the contents of each file in turn into memory, then invokes pangram().
- pangram() walks through the file byte-by-byte, calling seen_all() to update the letters we’ve seen in seen.
- When seen_all() returns true, we call try_scan_backwards() to check whether we have a pangram in the last MAX_PANGRAM bytes, and also to update seen with the new set of letters that are actually within that window (as described above).
- If we find a pangram, output_pangram() prints the file and contents to stdout.
I’m fairly happy with the result. It’s not the best code I’ve written, but it’s not too bad.
Loading the whole file into memory in one go isn’t particularly great (we only really need a sliding window of MAX_PANGRAM bytes, so we’re wasting a lot of memory), but it makes the code much simpler, and memory pressure isn’t something I need to worry about here. The largest file I’m dealing with is 43MB (Webster’s Unabridged Dictionary, pgwht04.txt), and my laptop has 16GB of RAM, so there’s no reason to try anything cleverer: mmap() is simple, and it works.
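Mapping a whole file this way is a standard pattern; a sketch of how it might look (map_file is my own name and wrapper, not taken from the linked source):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only into memory.  Returns NULL on error (or
   for an empty file, which mmap() rejects).  The mapping stays valid
   after the descriptor is closed. */
static const char *map_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    *len = (size_t)st.st_size;
    const char *p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* closing the fd does not unmap the region */
    return p == MAP_FAILED ? NULL : p;
}
```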
How do we actually go about running this over all the texts? I’d previously loopback-mounted the ISO image and unzipped everything into a directory (though some of the zipfiles contained directories themselves), but that still gave me…
$ find -name '*.txt' | wc -l
32473
… just over 32,000 files to consider (totalling about 11.6GB of text). I’d decided to do this simply: I didn’t try to filter out non-English texts (assuming, correctly as it turned out, that foreign-language text was unlikely to show up in the results anyway), and for the same reason, I also didn’t bother dealing with different file encodings (as the files use a mixture of at least UTF-8 and ISO-8859-1).
From the directory containing the unzipped texts, we can run a search by using something simple like this:
$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out
real 2m56.795s
user 0m57.155s
sys 0m6.667s
I’ve used -n 100 so that xargs will run our program with 100 files at a time, rather than all 32,000. That’ll be important later, though initially I was a little worried about command-line length limits, probably unnecessarily2.
The resulting output file contains about 42,000 results, each with the letter count (which is what matters, not the byte count) followed by the filename and text, so we can easily find the shortest sequences:
$ sort -n pangram.out | head -n3
26 ./10742.txt: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
26 ./10742.txt: B C D E F G H I J K L M N O P Q R S T U V W X Y Z & a
26 ./10742.txt: C D E F G H I J K L M N O P Q R S T U V W X Y Z & a b
Okay, it needs a bit of manual review to weed out the nonsense, but it’s good enough.
The only thing I’m not entirely sure about here is the safety of combining the results from stdout if I run more than one copy of pangram at a time (spoilers!). Well, rather: I’m pretty sure it’s not safe, but it appears to work in practice. Mostly.
We printf() to stdout, which I’d thought was line-buffered. However, without an explicit fflush(stdout) after the printf() (which output always finishes with a newline anyway), a small fraction of the output is lost when I concurrently append to a single output file: I’m missing some lines (a few hundred in 42,000 or so), and I get the ends of a few others. With fflush(stdout), I seem to get the right results again, unless I spawn a large number of concurrent processes (say, 300), so I’m guessing there’s a race somewhere that I’m occasionally losing. The reason that I’m a little confused is that I expected this to either work fine, because writes of less than PIPE_BUF bytes (512 by POSIX; in practice, at least 4KB) are atomic, or, if that didn’t apply in this situation, to interleave the results completely.
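For what it’s worth, one way to sidestep stdio buffering entirely (my own workaround, not what pangram.c does) is to format each whole result line into a buffer and emit it with a single write(2) call, so a line can’t be split by the stdio layer. The line format below mirrors the output shown above; the function name is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Format one result line and emit it with a single write(2) call,
   bypassing stdio's buffer so the whole line goes out at once.
   Returns the number of bytes written, or -1 if the line won't fit. */
static int emit_result(int fd, int nletters, const char *file,
                       const char *text)
{
    char buf[512];   /* <= PIPE_BUF, so a single pipe write is atomic */
    int n = snprintf(buf, sizeof buf, "%d %s: %s\n", nletters, file, text);
    if (n < 0 || (size_t)n >= sizeof buf)
        return -1;   /* line too long for one atomic write */
    return (int)write(fd, buf, (size_t)n);
}
```

With the output file opened O_APPEND, each write also lands atomically at the current end of file, which should avoid the partial-line problem when several processes share one output.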
Make it go faster?
Three minutes is a bit long to wait; can we make it run faster?
Yes, we can. But that’s a post for another day.
- Note to interview candidates: I will not actually be asking that question. Do not revise that question. (Or do. I’m a footnote, not a cop.) ↩
- Definitely unnecessarily: I learned more recently that xargs automatically caps command-line lengths at the system maximum (with a lower bound on that cap of 128KB); see xargs --show-limits. ↩