Back in June I wrote about a quick hack to search the Project Gutenberg text for pangrammatic windows; I then wrote a bit more about the implementation and used it as an example for performance tuning in Linux.
What I didn’t mention (because I’m only just now getting around to writing it up) is that I also ran the same analysis against the web.
Just to remind you what I’m talking about, pangrammatic windows are pangrams — pieces of text using all the letters of the (English) alphabet — that occur as substrings of otherwise naturally-occurring text. For example, the shortest known sequence in a published book is 42 letters, from Piers Anthony’s Cube Route:
Obviously, sequences such as “The quick brown fox jumps over the lazy dog” (35 letters) are shorter, but they aren’t naturally occurring, so don’t count for these purposes.
So, back in June, I decided to use some of my 20% time (and some weekends) to run a search to find some of the shortest pangrammatic windows on the web, using Google’s web index in much the same way as I’d earlier run a local search over the Gutenberg corpus of documents, except more so.
Searching the web
Even though I don’t work on Google’s web search itself, I knew that we had the ability to run analyses over the web at scale: Ian Hickson did something similar back in 2005 to produce a Web Authoring Statistics report on HTML structure on the web. The main difference was that I was hoping to do it for much more trivial reasons.
I was happy to find that doing this kind of analysis was pretty easy. The code in question is neither interesting nor open source, but let’s just note that, for search engines, the problems of ‘Do X for all web documents’ and ‘Extract the text from this web page’ are already fairly comprehensively solved.
I’d already restricted what I was looking at to English-language documents, and (for hopefully obvious reasons) those documents that weren’t filtered by SafeSearch, so that left me with only the easy bit to solve: writing a matcher that would allow me to run a large-scale grep over the web.
I started by writing something simple using the non-backtracking algorithm that Jesse Sheidlower had suggested¹. This simply emitted one result for every unique pangrammatic window shorter than a certain number of letters (I think I started at 45 or so).
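The post doesn’t show the actual matcher, but the non-backtracking idea can be sketched as a sliding window over just the alphabetic characters, tracking per-letter counts and shrinking from the left whenever all 26 letters are covered. This is my own minimal Python reconstruction (the function name and details are mine, not the original code):

```python
from collections import Counter

def pangrammatic_windows(text, max_letters=45):
    """Yield substrings of text whose letters cover a-z, using at most
    max_letters alphabetic characters (non-letters don't count)."""
    # (position, lowercased letter) for alphabetic characters only.
    letters = [(i, c.lower()) for i, c in enumerate(text) if c.isalpha()]
    counts = Counter()
    left = 0
    for right, (pos, ch) in enumerate(letters):
        counts[ch] += 1
        # While the window covers the whole alphabet, report it (if short
        # enough) and shrink from the left; no backtracking is needed.
        while len(counts) == 26:
            if right - left + 1 <= max_letters:
                yield text[letters[left][0]:pos + 1]
            left_ch = letters[left][1]
            counts[left_ch] -= 1
            if counts[left_ch] == 0:
                del counts[left_ch]
            left += 1
```

Because each character enters and leaves the window at most once, this is linear in the document length, regardless of how much punctuation or whitespace sits inside a window.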
To work out whether two different windows were equivalent, I normalised the window text by removing all non-alphabetic characters (apart from interior single quotes) and collapsing all runs of whitespace to a single space. In that way, “Fix Mr. Gluck’s hazy TV, PDQ” and “Fix Mr Gluck’s hazy TV,\nPDQ” (where \n is a newline) would both be normalised to “fix mr gluck’s hazy tv pdq”, and I would pick just one to report.
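A normaliser along these lines is easy to sketch in Python; this is my own guess at the behaviour described above (lowercase, keep interior apostrophes, turn everything else non-alphabetic into whitespace, then collapse):

```python
import re

def normalise(window):
    """Normalise a window: lowercase, keep only letters and interior
    single quotes, collapse all runs of whitespace to one space."""
    s = window.lower().replace("\u2019", "'")  # fold curly apostrophes
    # Replace anything that isn't a letter, a quote, or whitespace.
    s = re.sub(r"[^a-z'\s]", " ", s)
    # Drop quotes that aren't between two letters (i.e. not interior).
    s = re.sub(r"(?<![a-z])'|'(?![a-z])", "", s)
    return " ".join(s.split())
```

With this, both variants of the Gluck sentence normalise to the same string, so duplicate windows can be deduplicated by comparing their normalised forms.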
That’s when I ran into something of a problem. When I’d run a simple search for short pangrammatic windows over the Gutenberg text, I’d had to skip through a thousand or so occurrences of ‘the alphabet’ and variations before getting to any real-world text. That clearly wasn’t going to scale up to the web.
To clean up these nonsense results, I started with some blacklisting: I’d already discarded entire documents based on a few regular expressions in order to exclude those documents that were specifically talking about pangrams², so I tried adding another blacklist to remove individual results that contained ‘impossible’ words.
For example, if the normalised window contains the substring “qrs” (ignoring spaces), it can’t possibly be part of an English word: no word contains “qrs”, and none ends “qr” or starts “rs”, so there is no subdivision that would be valid³. This was very successful at removing a large proportion of the results that were variations on ‘the alphabet’.
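The check itself is a one-liner: strip the spaces (and apostrophes) from the normalised window and look for blacklisted runs. The function name and the starting blacklist here are my own illustration:

```python
def has_impossible_run(normalised, bad_runs=("qrs",)):
    """True if any blacklisted letter run appears in the window once
    spaces and apostrophes are ignored. The bad_runs list started small
    and, as the post describes, kept needing additions."""
    letters_only = normalised.replace(" ", "").replace("'", "")
    return any(run in letters_only for run in bad_runs)
```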
However, it’s not good enough. I still needed to add “qwerty” and “ytrewq” (“qwerty” reversed) and “azertyu” (French keyboard layout; and note that “azerty” wouldn’t be valid, since some words do end in “azer”) and then “ytreza” and… clearly this isn’t going to scale either.
The internet follows rule 34 for misspellings, it seems: I’m fairly confident that even something as simple as the alphabet has been misspelled in almost every possible way.
I needed a better way to sort the real text from the nonsense text.
I thought about trying to do something clever — like trying to train a classifier to recognise English words — but then I realised that I could do something dumb instead, which is almost always a better approach.
I globbed together a few sources to make a large (100K words or so) dictionary that looked like it contained mostly plausible English words, removed a few words that were valid but problematic (“BC”, “def”, “wert”, etc), and wrote something that would compute a score based on the number of known words in the normalised window.
For example, if the input was “a b c d e … z”, we’d have 26 ‘words’, of which two (“a” and “i”) would be considered known words⁴, and so we’d give it a score of 2/26, or 0.077.
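The known-word score is just the fraction of whitespace-separated tokens found in the dictionary; a minimal sketch (names are mine):

```python
def word_score(window, known_words):
    """Fraction of tokens in a normalised window that are known words."""
    tokens = window.split()
    return sum(1 for t in tokens if t in known_words) / len(tokens)
```

For the spelled-out alphabet, with only “a” and “i” known, this gives 2/26 ≈ 0.077 as in the worked example above.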
I knew that I wouldn’t want to set a minimum score of 1.0 (eliminating results that had any unknown words), both because I’d seen from the Gutenberg examples how common proper names were, and also because the nature of reporting a sub-sequence meant that I’d often be selecting a partial word for the first and last words in the window. However, playing around with the threshold showed that it was filtering out nonsense results pretty well, but that I still had a slightly different problem to solve.
While I’d managed to filter out windows with a high proportion of nonsense words, the four-word sequence “in the end. abc…xyz” still manages a reasonable score of 0.75 by that metric, since only the last of the four ‘words’ is an unknown word.
To fix this problem, I put together an alternative score that was computed from the number of letters covered by each known word. That works well for inputs such as the above (2 + 3 + 3 letters in known words, out of 34 letters total, for a score of 8/34, or 0.235).
I couldn’t just replace the known-word score with the by-letter score by itself, though: the letter coverage scorer gives low scores to windows with a long, but truncated, first or last word, and it doesn’t give low scores to windows containing a large number of short nonsense words (things like “wafting zephyrs vex bold p p p p p q jim”).
So I did what any reasonable person would do in that situation: I multiplied both scores together. This sounds ridiculously naïve (and I’m sure that anyone who actually does language processing professionally can tell me why this whole approach is idiotic), but it seems to have worked out reasonably well.
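Putting the two metrics together, a self-contained sketch of the by-letter score and the multiplied combination might look like this (again my own reconstruction, including an inline restatement of the by-word score so the snippet stands alone):

```python
def letter_score(window, known_words):
    """Fraction of letters (by count) that fall inside known words."""
    tokens = window.split()
    return sum(len(t) for t in tokens if t in known_words) / sum(len(t) for t in tokens)

def combined_score(window, known_words):
    """Product of the by-word and by-letter scores: a window must do
    well on both metrics to score highly overall."""
    tokens = window.split()
    by_word = sum(1 for t in tokens if t in known_words) / len(tokens)
    return by_word * letter_score(window, known_words)
```

On the “in the end. abc…xyz” example, the by-word score is 3/4 and the by-letter score is 8/34, so the combined score drops to about 0.18 — well below a plausible real-text window, which is exactly the behaviour the multiplication is after.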
In case you’re interested, the distributions of scores (for windows of an acceptable length) ended up looking like this:
That approach is not without its problems, though: based on some initial trials, I decided that the final run would discard any result with a score lower than 0.55. I later spotted that this would also have discarded both examples quoted in the Wikipedia article, though it does accept all the examples I found in the Gutenberg text:
- Wikipedia: [from Xanth,” Cube said quickly. “Just visiting Phaze. W] (score: 0.492)
- Wikipedia: [view of The Yards: Mark Wahlberg, Joaquin Phoenix, Charliz] (score: 0.530)
- Gutenberg: [joyment of Fonblanque gave new zest; and when I expressed to Dick] (score: 0.566)
- Gutenberg: [xhausting blanket of mid-July heat which pressed to squeeze all the v] (score: 0.698)
- Gutenberg: [joined words: he gave me Phiz, styx, wrong, buck, flame, q] (score: 0.721)
In any case, I also ended up discarding any window with more than 38 letters, which eliminated all of the above anyway.
I read through a lot of results, about four thousand in all. By hand, and mostly in a coach to and from Wales. I may have missed something.
First, some brief observations about the things that weren’t good results (but that still scored highly enough that I had to look at them):
- It became pretty obvious very quickly that anything with a URL related to “keyboard practice”, “homework”, or “font” wasn’t going to be a natural sequence. In fact, font samplers were by far the biggest cause of problematic (non-natural) sequences that I would have liked to ignore.
- Lists of products for sale were the second-biggest problem: the combination of product codes (“ADX XLM1”) with strings like “free shipping” led to a fairly large number of pangrams.
- Sequences containing Twitter short URLs (t.co) also showed up a fair amount, probably because they’re short enough that they didn’t affect the score too much.
- Surprisingly, results from YouTube watch pages showed up multiple times, apparently caused by the related videos list shown next to each video, which contains entries of the form “TITLE by AUTHOR NNN views”, and where (in these results) the author is commonly named something like “XquerbleX”.
- Honourable mentions go out to the 20 consonant poetry and New Sentences for the Testing of Typewriters pages. I hate you both.
Without further ado, the three best results I found (in reverse order).
In third place, a post on the “CrackBerry Forums”, which falls down slightly for using what turns out to be some very convenient product names, but wins for the story:
Second place goes to another forum post. The pangrammatic window here is from a list of words rather than a portion of a sentence, but I did learn the names of some dance styles:
Both of those were 38 letters, but the clear winner on both length and content is the following 36 letter pangrammatic window, from a review of the film Magnolia:
I’m pretty impressed by this result: it’s only one letter longer than “The quick brown fox…”, and while that’s not the shortest possible pangram by far, it is one of the more coherent ones.
As for me, I think I’m probably done with pangrams now. Although… I’ve been spending so much time lately reading examples of pangrams that I have started to wonder how I could get myself a sphinx. Preferably one made from black quartz…
The reason to pick this algorithm wasn’t performance, but because it has the advantage of being insensitive to the number of non-alphabetic characters within a window, unlike the fixed-size sliding window used by the algorithm I’d previously used to search the Gutenberg text. ↩
I kept this blacklist in place later on, though I don’t think it actually had much effect. For posterity, the list of (case-insensitive) regexps I ended up using was: ‘pangram’, ‘quick.? brown (fox|dogs?) jump’, ‘silk pyjamas exchanged for blue quartz’, ‘DJs flock by when MTV’, ‘five boxing wizards jump’, ‘Pack my box with five dozen’, ‘love my big sphinx of quartz’, and ‘Grumpy wizards make toxic brew’. Only one of which is actually a regexp. ↩
Though it turns out that /usr/share/dict/words on my Ubuntu laptop seems to think that “rs” is itself a word. I suspect I just decided that it was wrong. ↩
/usr/share/dict/words seems to consider every letter of the alphabet to be a valid word. The dictionary I used didn’t, though you could probably make the case that it should perhaps have included “O” (and maybe even the txtspeak “U”). ↩