Over on Language Log, there’s a post about pangrammatic windows, and a bot that searches Twitter posts for them. A pangrammatic window is a pangram (a piece of text using all the letters in the English alphabet) that occurs within otherwise naturally occurring text.
For example, the shortest known natural sequence is 42 letters, from Piers Anthony’s Cube Route, discovered in an article in Word Ways:
I thought it might be interesting to work out how you’d go about searching a given text for pangrammatic windows. A short chat at work and some quick hacking later, and I had a simple proof-of-concept, but no data to run against.
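For the curious, here is a rough sketch of one way such a search could work. To be clear, this is not the proof-of-concept mentioned above, just a plain sliding-window approach. The function name `shortest_pangram_window` is made up for illustration, and I'm assuming that only the ASCII letters a to z count and that a window's length is measured in letters, ignoring spaces and punctuation.

```python
from collections import Counter


def shortest_pangram_window(text):
    """Return (letter_count, substring) for the shortest pangrammatic
    window in text, or None if the text never uses all 26 letters.

    Hypothetical helper: length is counted in ASCII letters only.
    """
    # Keep only ASCII letters, remembering each one's position in the text.
    letters = [
        (i, ch.lower())
        for i, ch in enumerate(text)
        if ch.isascii() and ch.isalpha()
    ]

    counts = Counter()   # letter -> occurrences inside the current window
    distinct = 0         # how many different letters the window holds
    best = None          # (letter_count, start_index, end_index)
    left = 0

    for right, (_, ch) in enumerate(letters):
        counts[ch] += 1
        if counts[ch] == 1:
            distinct += 1

        # While the window is a pangram, record it and shrink from the left.
        while distinct == 26:
            length = right - left + 1
            if best is None or length < best[0]:
                best = (length, letters[left][0], letters[right][0])
            gone = letters[left][1]
            counts[gone] -= 1
            if counts[gone] == 0:
                distinct -= 1
            left += 1

    if best is None:
        return None
    length, start, end = best
    return length, text[start:end + 1]


sample = "xx The quick brown fox jumps over the lazy dog yy"
print(shortest_pangram_window(sample))
# (32, 'quick brown fox jumps over the lazy dog')
```

Each letter enters and leaves the window at most once, so the scan is linear in the length of the text. Note that the leading “The” gets trimmed in the sample, because the second “the” already supplies those letters.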
That was easily solved by downloading the Project Gutenberg April 2010 DVD image[1] and unzipping everything within. That gave me 11.6GB of text files, ranging in size from 336 bytes (one of the chapters of Moby Dick) to a single 43MB file comprising Webster’s Unabridged Dictionary.
I’ll post about the technical side separately, but suffice it to say that this search doesn’t exactly tax a modern PC: my laptop has enough RAM to load all of the Gutenberg text into memory, and even from cold, it takes only 80 seconds to search through it all.
So what did I find? Well, firstly, several thousand occurrences of “the alphabet”. In retrospect, that probably should have been obvious.
I did find another 42-letter sequence, but I don’t think it can really count, as it occurs in a passage that is itself discussing pangrams: De Morgan (the mathematician), while snarking about numerology, writes about trying to construct a meaningful sentence using all the letters save ‘v’ and ‘j’ exactly once:
The shortest sequence that seems to fit within the rules is the following 53-letter sequence, from The Life of Charles Dickens:
However, both this and a similar 56-letter sequence (“Köckeritz! Where is the king?”) in Napoleon and the Queen of Prussia still seem somewhat unnatural to me, since they depend upon proper names to work (and, to be fair, the same is true of the Piers Anthony quote as well).
Given that, I think the contender for the shortest truly “natural” pangrammatic window in the Gutenberg corpus is the following 57-letter sequence, from Andre Norton’s YA-esque Civil War adventure, Ride Proud, Rebel!:
Funnily enough, one thing that I did expect to find, but didn’t, was any of the common pangrams. In fact, the word “pangram” does not appear (with that meaning) in the Gutenberg corpus at all! The closest I got were two near-misses: “the quick, brown fox jumped over the lazy dog” and “the swift brown fox jumps over the lazy dog”, the former of which is, I think, a misquote (the latter isn’t, as it’s called out in the text as an almost-pangram).
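If you want to see exactly why those two fall short, a few lines of Python will do it. This is just an illustrative check; `missing_letters` is a name I've made up, and it only considers the ASCII letters a to z.

```python
import string


def missing_letters(text):
    """Return the sorted list of letters a-z that never appear in text."""
    present = {ch for ch in text.lower() if ch in string.ascii_lowercase}
    return sorted(set(string.ascii_lowercase) - present)


print(missing_letters("the quick, brown fox jumped over the lazy dog"))
# ['s']
print(missing_letters("the swift brown fox jumps over the lazy dog"))
# ['c', 'k', 'q']
```

So “jumped” costs the first one its only ‘s’, while swapping “quick” for “swift” loses three letters at once.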
That’s it for this post. I also have a separate post that goes into a little detail about the code itself.
[1] Hey, 14-year-old me? Remember when you spent over an hour on the phone to download 150KB of BBS software on a 300 baud connection? I just took about the same time to download 8.4GB, and I have enough space to store an uncompressed copy too. The future rocks! But while we’re here: could you buy some Apple stock during 2002? Thanks!