by Malcolm Rowe

Pangrams in C

In the previous post, I talked about finding pangrammatic windows in a corpus of text files from Project Gutenberg (in particular, the 2010 DVD image). Here I’m going to talk a bit about the implementation I used.

I think the problem itself is quite interesting. Restated, it’s “search a given text for all substrings that contain all the letters of the alphabet, and that do not themselves contain another matching substring” (the latter, because given “fooabc…xyzbar” we only want to emit “abc…xyz”).

I can imagine asking that kind of question in an interview1. If you enjoy that kind of thing, you might want to go and think about how you’d solve it yourself.

Back already? When I brought up the idea at work, one sensible suggestion (thanks, Christian!) was to keep going until we’d seen each character at least once, keeping track by setting a bit per letter. Once we’d seen all the letters once, we could scan backwards to work out whether we’d found the end of a pangrammatic sequence or not.

Since the frequency of some letters (Q, Z) is very low in English text, we’d expect to only have to scan backwards occasionally. We’d also limit the size of that backward scan to avoid examining O(N^2) characters.

The only other wrinkle is what to do when we scan backwards: if we see all the letters (and we can use the same mechanism as before to track which we’ve seen), then we immediately know that we have a pangrammatic window, so we can output it. Otherwise we keep going for some maximum number of characters — I used 200 — and then give up.

What then? After a match covering offsets [a, b], we can’t forget about everything and jump back to offset b+1, as we might be looking at a string like “zaabc…xyz” (where we’d want to emit “zaabc…xy” and then the shorter “abc…xyz”). It’s always safe to restart at offset a+1, but we can do better: we can keep the set of letters we’ve seen (i.e. all of them) and remove the character at the start of the matched substring (“z”, in this case), which by definition must have only occurred once, and then continue from offset b+1.

In the much more likely case that we don’t see a pangrammatic sequence, we also continue the search at b+1, with the seen set covering what we’ve seen in the range (b-window size, b]. (Note that if we knew that the character at the start of the window had only appeared once, then we could remove it as before, but in general, we can’t.)

tl;dr, where’s the code?

Download pangram.c. Compile and run using something like:

$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ ./pangram file1 file2 file3

The compile flags just define the maximum window size, MAX_PANGRAM (to 200 bytes, the figure I chose in the end), and enable optimisations (which I was surprised to see make a noticeable difference to the runtime).

The implementation maps to the algorithm I described above:

I’m fairly happy with the result. It’s not the best code I’ve written, but it’s not too bad.

Loading the whole file into memory in one go isn’t particularly great (we only really need a sliding window of MAX_PANGRAM bytes, so we’re wasting a lot of memory), but it makes the code much simpler, and memory pressure isn’t something I need to worry about here. The largest file I’m dealing with is 43MB (Webster’s Unabridged Dictionary, pgwht04.txt), and my laptop has 16GB of RAM, so there’s no reason to try anything cleverer: mmap() is simple, and it works.

How do we actually go about running this over all the texts? I’d previously loopback-mounted the ISO image and unzipped everything into a directory (though some of the zipfiles contained directories themselves), but that still gave me…

$ find -name '*.txt' | wc -l

… just over 32,000 files to consider (totalling about 11.6GB of text). I’d decided to do this simply: I didn’t try to filter out non-English texts (assuming, correctly as it turned out, that foreign-language text was unlikely to show up in the results anyway), and for the same reason, I also didn’t bother dealing with different file encodings (as the files use a mixture of at least UTF-8 and ISO-8859-1).

From the directory containing the unzipped texts, we can run a search by using something simple like this:

$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out

real    2m56.795s
user    0m57.155s
sys     0m6.667s

I’ve used -n 100 so that xargs will run our program with 100 files at a time, rather than all 32,000. That’ll be important later, though initially I was a little worried about command-line length limits, probably unnecessarily2.

The resulting output file contains about 42,000 results, each with the letter count (which is what matters, not the byte count) followed by the filename and text, so we can easily find the shortest sequences:

$ sort -n pangram.out | head -n3
26 ./10742.txt: A B C D E F G H I  J K L M N O P Q R  S T U V W X Y Z
26 ./10742.txt: B C D E F G H I  J K L M N O P Q R  S T U V W X Y Z &    a
26 ./10742.txt: C D E F G H I  J K L M N O P Q R  S T U V W X Y Z &    a b

Okay, it needs a bit of manual review to weed out the nonsense, but it’s good enough.

The only thing I’m not entirely sure about here is the safety of combining the results from stdout if I run more than one copy of pangram at a time (spoilers!). Well, rather: I’m pretty sure it’s not safe, but it appears to work in practice. Mostly.

We printf() to stdout, which I’d thought was line-buffered. However, without an explicit fflush(stdout) after the printf() (which output always finishes with a newline anyway), a small fraction of the output is lost when I concurrently append to a single output file: I’m missing some lines (a few hundred in 42,000 or so), and I get the ends of a few others.

With fflush(stdout), I seem to get the right results again, unless I spawn a large number of concurrent processes (say, 300), so I’m guessing there’s a race somewhere that I’m occasionally losing. The reason that I’m a little confused is that I expected this to either work fine, because writes of less than PIPE_BUF bytes (512 by POSIX; in practice, at least 4KB) are atomic — or if that didn’t apply in this situation, I’d expected it to interleave the results completely.

Make it go faster?

Three minutes is a bit long to wait; can we make it run faster?

Yes, we can. But that’s a post for another day.

  1. Note to interview candidates: I will not actually be asking that question. Do not revise that question. (Or do. I’m a footnote, not a cop.) 

  2. Definitely unnecessarily: I learned more recently that xargs automatically caps command-line lengths according to the maximum size (with a lower-bound on that cap of 128KB); see xargs --show-limits