farblog

by Malcolm Rowe

Noda Time 1.3.0

Noda Time 1.3.0 came out today1, bringing a healthy mix of new features and bug fixes for all your date and time handling needs. Unlike with previous releases, the improvements in Noda Time 1.3 don’t really have a single theme: they add a handful of features and tidy up some loose ends on the road to 2.0 (on which more below).

So in no particular order…

Again, see the User Guide and 1.3.0 release notes for more information about all of the above.

You can get Noda Time 1.3.0 from the NuGet repository as usual (core, testing, JSON support packages), or from the links on the Noda Time home page.

Onward to 2.0

Meanwhile, development has started on Noda Time 2.0. Noda Time 2.0 will not be binary-compatible with Noda Time 1.x, but it will be mostly source-compatible: we don’t plan to make completely gratuitous changes.

Among other things, Noda Time 2.0 is likely to contain:

We don’t expect to have a release of Noda Time 2.0 until next year, so we may well make some additional releases in the 1.3.x series between now and then, but in general we’ll be focussing on 2.0. If you’re interested in helping out, come and talk to us on the mailing list.


  1. And once again, I’m going to plagiarise this post for the official Noda Time blog post. 

Pangrams in C

In the previous post, I talked about finding pangrammatic windows in a corpus of text files from Project Gutenberg (in particular, the 2010 DVD image). Here I’m going to talk a bit about the implementation I used.

I think the problem itself is quite interesting. Restated, it’s “search a given text for all substrings that contain all the letters of the alphabet, and that do not themselves contain another matching substring” (the latter, because given “fooabc…xyzbar” we only want to emit “abc…xyz”).

I can imagine asking that kind of question in an interview1. If you enjoy that kind of thing, you might want to go and think about how you’d solve it yourself.


Back already? When I brought up the idea at work, one sensible suggestion (thanks, Christian!) was to keep going until we’d seen each character at least once, keeping track by setting a bit per letter. Once we’d seen all the letters once, we could scan backwards to work out whether we’d found the end of a pangrammatic sequence or not.

Since the frequency of some letters (Q, Z) is very low in English text, we’d expect to only have to scan backwards occasionally. We’d also limit the size of that backward scan to avoid examining O(N^2) characters.

The only other wrinkle is what to do when we scan backwards: if we see all the letters (and we can use the same mechanism as before to track which we’ve seen), then we immediately know that we have a pangrammatic window, so we can output it. Otherwise we keep going for some maximum number of characters — I used 200 — and then give up.

What then? After a match covering offsets [a, b], we can’t forget about everything and jump back to offset b+1, as we might be looking at a string like “zaabc…xyz” (where we’d want to emit “zaabc…xy” and then the shorter “abc…xyz”). It’s always safe to restart at offset a+1, but we can do better: we can keep the set of letters we’ve seen (i.e. all of them) and remove the character at the start of the matched substring (“z”, in this case), which by definition must have only occurred once, and then continue from offset b+1.

In the much more likely case that we don’t see a pangrammatic sequence, we also continue the search at b+1, with the seen set covering what we’ve seen in the range (b-window size, b]. (Note that if we knew that the character at the start of the window had only appeared once, then we could remove it as before, but in general, we can’t.)
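To make that concrete, here's a minimal sketch of the scan: one bit per letter, a backward scan when the set fills up, and the restart rules from the last two paragraphs. This is an illustration of the approach only, not the real pangram.c linked below (it skips the file handling and the letter-count/filename output):

/* Minimal sketch of the pangrammatic-window scan (illustration only). */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define ALL_LETTERS ((1u << 26) - 1u)
#ifndef MAX_PANGRAM
#define MAX_PANGRAM 200            /* matches -DMAX_PANGRAM=200 below */
#endif

static unsigned letter_bit(unsigned char ch)
{
    ch = tolower(ch);
    return (ch >= 'a' && ch <= 'z') ? 1u << (ch - 'a') : 0u;
}

/* Print every pangrammatic window in text[0..len). */
static void find_pangrams(const char *text, size_t len)
{
    unsigned seen = 0;

    for (size_t b = 0; b < len; b++) {
        seen |= letter_bit(text[b]);
        if (seen != ALL_LETTERS)
            continue;

        /* Every letter has appeared since the last restart: scan backwards
         * (at most MAX_PANGRAM bytes) for a window ending at offset b. */
        unsigned back = 0;
        size_t limit = (b + 1 < MAX_PANGRAM) ? b + 1 : MAX_PANGRAM;
        size_t a = b + 1;                     /* window start, if found */

        for (size_t i = 0; i < limit; i++) {
            back |= letter_bit(text[b - i]);
            if (back == ALL_LETTERS) {
                a = b - i;
                break;
            }
        }

        if (a <= b) {
            /* Found [a, b]: print it, drop the letter at a (which occurs
             * exactly once in the window), and continue from b+1. */
            printf("%.*s\n", (int)(b - a + 1), text + a);
            seen &= ~letter_bit(text[a]);
        } else {
            /* No window within MAX_PANGRAM bytes: keep only the letters
             * seen in the last MAX_PANGRAM bytes and continue from b+1. */
            seen = back;
        }
    }
}

int main(void)
{
    /* Prints "saw the quick brown fox jump over the lazy dog". */
    const char *demo = "We saw the quick brown fox jump over the lazy dog again.";
    find_pangrams(demo, strlen(demo));
    return 0;
}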

tl;dr, where’s the code?

Download pangram.c. Compile and run using something like:

$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ ./pangram file1 file2 file3

The compile flags just define the maximum window size, MAX_PANGRAM (to 200 bytes, the figure I chose in the end), and enable optimisations (which I was surprised to see make a noticeable difference to the runtime).

The implementation maps to the algorithm I described above:

I’m fairly happy with the result. It’s not the best code I’ve written, but it’s not too bad.

Loading the whole file into memory in one go isn’t particularly great (we only really need a sliding window of MAX_PANGRAM bytes, so we’re wasting a lot of memory), but it makes the code much simpler, and memory pressure isn’t something I need to worry about here. The largest file I’m dealing with is 43MB (Webster’s Unabridged Dictionary, pgwht04.txt), and my laptop has 16GB of RAM, so there’s no reason to try anything cleverer: mmap() is simple, and it works.
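For the curious, the mapping itself only takes a handful of lines. This is a sketch of the idea rather than the exact code in pangram.c:

/* Map a whole file read-only and hand back a pointer to it (or NULL). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const char *map_file(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return NULL;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror(path);
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                     /* the mapping outlives the descriptor */
    if (p == MAP_FAILED) {
        perror(path);
        return NULL;
    }

    *len = (size_t)st.st_size;
    return p;                      /* munmap(p, *len) when done */
}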

How do we actually go about running this over all the texts? I’d previously loopback-mounted the ISO image and unzipped everything into a directory (though some of the zipfiles contained directories themselves), but that still gave me…

$ find -name '*.txt' | wc -l
32473

… just over 32,000 files to consider (totalling about 11.6GB of text). I’d decided to do this simply: I didn’t try to filter out non-English texts (assuming, correctly as it turned out, that foreign-language text was unlikely to show up in the results anyway), and for the same reason, I also didn’t bother dealing with different file encodings (as the files use a mixture of at least UTF-8 and ISO-8859-1).

From the directory containing the unzipped texts, we can run a search by using something simple like this:

$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out

real    2m56.795s
user    0m57.155s
sys     0m6.667s

I’ve used -n 100 so that xargs will run our program with 100 files at a time, rather than all 32,000. That’ll be important later, though initially I was a little worried about command-line length limits, probably unnecessarily2.

The resulting output file contains about 42,000 results, each with the letter count (which is what matters, not the byte count) followed by the filename and text, so we can easily find the shortest sequences:

$ sort -n pangram.out | head -n3
26 ./10742.txt: A B C D E F G H I  J K L M N O P Q R  S T U V W X Y Z
26 ./10742.txt: B C D E F G H I  J K L M N O P Q R  S T U V W X Y Z &    a
26 ./10742.txt: C D E F G H I  J K L M N O P Q R  S T U V W X Y Z &    a b

Okay, it needs a bit of manual review to weed out the nonsense, but it’s good enough.

The only thing I’m not entirely sure about here is the safety of combining the results from stdout if I run more than one copy of pangram at a time (spoilers!). Well, rather: I’m pretty sure it’s not safe, but it appears to work in practice. Mostly.

We printf() to stdout, which I’d thought was line-buffered (in fact, stdout is only line-buffered when it’s attached to a terminal; redirected to a file, it’s fully buffered). Without an explicit fflush(stdout) after the printf() (whose output always ends with a newline anyway), a small fraction of the output is lost when I concurrently append to a single output file: I’m missing some lines (a few hundred in 42,000 or so), and I get the ends of a few others.

With fflush(stdout), I seem to get the right results again, unless I spawn a large number of concurrent processes (say, 300), so I’m guessing there’s a race somewhere that I’m occasionally losing. The reason that I’m a little confused is that I expected this to either work fine, because writes of less than PIPE_BUF bytes (512 by POSIX; in practice, at least 4KB) are atomic — or if that didn’t apply in this situation, I’d expected it to interleave the results completely.
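For reference, the per-result output path amounts to something like this (a sketch, not the exact code in pangram.c):

#include <stdio.h>

/* Sketch: emit one result line and flush it straight away; stdout is fully
 * buffered once it's redirected to a file, so without the fflush() the
 * output only reaches the file in large chunks. */
static void emit_result(const char *filename, int letters,
                        const char *window, int window_len)
{
    printf("%d %s: %.*s\n", letters, filename, window_len, window);
    fflush(stdout);
}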

Make it go faster?

Three minutes is a bit long to wait; can we make it run faster?

Yes, we can. But that’s a post for another day.


  1. Note to interview candidates: I will not actually be asking that question. Do not revise that question. (Or do. I’m a footnote, not a cop.) 

  2. Definitely unnecessarily: I learned more recently that xargs automatically caps command-line lengths at the system maximum (with a lower bound on that cap of 128KB); see xargs --show-limits.

Pangrammatic windows

Over on Language Log, there’s a post about pangrammatic windows, and a bot that searches Twitter posts for them. Pangrammatic windows are pangrams — a piece of text using all the letters in the (English) alphabet — that occur within otherwise naturally-occurring text.

For example, the shortest known natural sequence is 42 letters, from Piers Anthony’s Cube Route, discovered in an article in Word Ways:

“We are all from Xanth,” Cube said quickly. “Just visiting Phaze. We just want to find the dragon.”
Piers Anthony’s Cube Route (pangrammatic window highlighted)

I thought it might be interesting to work out how you’d go about searching a given text for pangrammatic windows. A short chat at work and some quick hacking later, and I had a simple proof-of-concept, but no data to run against.

That was easily solved by downloading the Project Gutenberg April 2010 DVD image1 and unzipping everything within. That gave me 11.6GB of text files, ranging in size from 336 bytes (one of the chapters of Moby Dick) to a single 43MB file comprising Webster’s Unabridged Dictionary.

I’ll post about the technical side separately, but suffice to say that this search doesn’t exactly tax a modern PC: my laptop has enough RAM to load all of the Gutenberg text into memory, and even from cold, it takes only 80 seconds to search through it all.

So what did I find? Well, firstly, several thousand occurrences of “the alphabet”. In retrospect, that probably should have been obvious.

I did find another 42-letter sequence, but I don’t think it can really count, as it occurs during a discussion of pangrams: De Morgan (the mathematician), while snarking about numerology, writes about trying to construct a meaningful sentence using all the letters save ‘v’ and ‘j’ exactly once:

There is a kind of Cabbala Alphabetica which the investigators of the numerals in words would do well to take up: it is the formation of sentences which contain all the letters of the alphabet, and each only once. No one has done it with v and j treated as consonants; but you and I can do it. Dr. Whewell and I amused ourselves, some years ago, with attempts. He could not make sense, though he joined words: he gave me Phiz, styx, wrong, buck, flame, quid.
Augustus De Morgan, A Budget of Paradoxes

The shortest sequence that seems to fit within the rules is the following 53-letter sequence, from The Life of Charles Dickens:

[…] there was a second reading to which the presence and enjoyment of Fonblanque gave new zest; and when I expressed to Dickens […]
John Forster, The Life of Charles Dickens

However, this, and a similar 56-letter sequence (“Köckeritz! Where is the king?”) in Napoleon and the Queen of Prussia both still seem somewhat unnatural to me, since they depend upon proper names to work (and to be fair, the same is true of the Piers Anthony quote as well).

Given that, I think the contender for the shortest truly “natural” pangrammatic window in the Gutenberg corpus is the following 57-letter sequence, from Andre Norton’s YA-esque civil war adventure, Ride Proud, Rebel!:

They had turned off the road, which was now filled with men, horses, men, artillery, and men, all slogging purposefully forward. They composed an army roused out before daylight, on the move toward another army holed in behind a breastworks and waiting. And over all, the exhausting blanket of mid-July heat which pressed to squeeze all the vital juices out of both man and animal.
Andre Alice Norton, Ride Proud, Rebel!, 1961

Funnily enough, one thing that I did expect to find, but didn’t, was any of the common example pangrams — in fact, the word “pangram” does not appear (with that meaning) in the Gutenberg corpus at all! The closest I got were the two near-misses: “the quick, brown fox jumped over the lazy dog” and “the swift brown fox jumps over the lazy dog”, the former of which is, I think, a misquote (the latter isn’t, as it’s called out in the text as an almost-pangram).

That’s it for this post. I also have a separate post that goes into a little detail about the code itself.


  1. Hey, 14-year-old me? Remember when you spent over an hour on the phone to download 150KB of BBS software on a 300 baud connection? I just took about the same time to download 8.4GB, and I have enough space to store an uncompressed copy too. The future rocks! But while we’re here: could you buy some Apple stock during 2002? Thanks! 

Small-scale Compute Engine

Google Compute Engine is Google’s “run a virtual machine on Google infrastructure” product. It’s broadly similar to Amazon’s EC2, in that you get an unmanaged (Linux) virtual machine that you can run pretty much anything on, one difference being that it seems to be aimed at larger workloads: 16-core machines with hundreds of GB of RAM, 5TB disks, that kind of thing.

While I’d been meaning to look at it for a while, I didn’t think I had any reason to use it; I certainly don’t have any workloads of the scale people seem to be talking about. A short while ago, a friend at work mentioned he was using it to run a private Minecraft server, which seemed pretty small to me, so I thought perhaps I’d take another look.

It turns out that Compute Engine is just as suited to small-scale workloads as large ones, and while you do have to pay to use it, it works out to be pretty inexpensive. Having spent a little time with it now, I figured it was time to document what I found out.

Boring disclaimer time first, though: I don’t work on Compute Engine, so this isn’t anything official, just some guy on the internet. Also, in the interests of full disclosure: I’m getting an employee discount on the cost of using Compute Engine (though it’s cheap enough that I’d be happy paying full price anyway). With that in mind…

Stalkers and readers with good memories will recall that I started proxying this site via Google’s PageSpeed Service a little over two years ago. PageSpeed Service is a reverse proxy running in Google’s data centres that applies various performance rewrites to the original content (minifying CSS, and so on), and it does a pretty good job overall. As an additional benefit, it’s a (short TTL) caching proxy, so nobody needs to depend directly on the copy of Apache running at the end of a DSL pipe on my server at home.

However, I’ve always been slightly bothered by the fact that that dependency still exists. There’s the usual “home network isn’t very reliable” problem1, but rather more importantly, that server’s on my home network, and given the choice, I’d rather not have it running a public copy of Apache as well as everything else.

As it happens, I’m going to need to reinstall that server soon anyway, so I figured it might be a good time to see whether Compute Engine was a good fit to run a simple low-traffic Apache server like the one that serves this site (spoiler: yes).

I was hoping that I’d have something clever to say about what I needed to do to set it up, but in truth the Compute Engine quickstart is almost embarrassingly easy, and ends up with a running copy of Apache, not far from where I needed to be.

One thing I did decide to do while experimenting was to script the whole install, so that a single script creates the virtual machine (the “instance”), installs everything I need, and sets up Apache to serve this site. Partly2 this was to make sure I recorded what I’d done, and partly so that I could experiment and reset to a clean state when I messed things up.

That may have been a bit excessive for a simple installation, but it does mean that I now have a well-documented process, and one I can go into in some detail below.

First, create your instance

With Compute Engine, the first thing you need to do (assuming you’ve completed the setup in the quickstart) is to create an instance, which is what Compute Engine calls a persistent virtual machine. My script ended up using something like the following, which creates an instance together with a new persistent disk to boot from:

$ gcutil --project=farblog addinstance www \
         --machine_type=f1-micro \
         --zone=us-central1-b \
         --image=debian-7 \
         --metadata_from_file=startup-script:startup.sh \
         --authorized_ssh_keys=myuser:myuser.pub \
         --external_ip_address=8.35.193.150 \
         --wait_until_running

gcutil is the command-line tool from the Google Cloud SDK that allows you to configure and control everything related to Compute Engine (other tools in the SDK cover App Engine, Cloud Storage, and so on, but I didn’t need to use any of those).

Taking it from the top, --project specifies the Google Developers Console project ID (these projects are just a way to group different APIs for billing and so on; in this case, I’m only using Compute Engine). You can also ask gcutil to remember the project (or any other flag value) so that you don’t need to keep repeating it.

addinstance is the command to add a new instance, and www is my (unimaginative) instance name. Everything after this point is optional: the tool will prompt for the zone, machine type, and image, and use sensible defaults for everything else.

The machine type comes next: f1-micro is the smallest machine type available, with about 600MB RAM and a CPU reservation suitable to occasional “bursty” (rather than continuous) workloads. That probably wouldn’t work for a server under load, but it seems to be absolutely fine for one like mine, with a request rate measured in seconds between requests, rather than the other way around.

Next is the zone (us-central1-b), where the machine I’m using will be physically located. This currently boils down to a choice of a few different locations in the US and Europe (at the time of writing, four different zones across two regions named us-central1 and europe-west1). As with Amazon, the European regions are slightly more expensive (by about 10%) than the US ones, so I’m using a zone in a US region.

While Compute Engine was in limited preview, the choice of zone within a region was a bit more important, as different zones had different maintenance schedules, and a maintenance event would shut down the whole zone for about two weeks, requiring you to bring up another instance somewhere else. However, the US zones no longer have downtime for scheduled maintenance: when some part of the zone needs to be taken offline, the instances affected will be migrated to other physical machines transparently (i.e. without a reboot).

This is pretty awesome, and really makes it possible to run a set-and-forget service like a web server without any complexity (or cost) involved in, for example, setting up load balancing across multiple instances.

After the zone, I’ve specified the disk image that will be used to initialise a new persistent root disk (which will be named the same as the instance). Alternatively, rather than creating a new disk from an image, I could have told the instance to mount an existing disk as the root disk (in either read/write or read-only mode, though a given disk can only be mounted read/write by one instance at any time).

The image really is just a raw disk image, and it appears it can contain pretty much anything that can run as an x86-64 KVM guest, though all the documentation and tools currently assume you’ll be running some Linux distribution, so you may find it a little challenging to run something else (though plenty of people seem to be).

For convenience, Google provides links to images with recent versions of Debian and CentOS (with RHEL and SUSE available as “premium” options), and above I’m using the latest stable version of Debian Wheezy (debian-7, which is actually a partial match for something like projects/debian-cloud/global/images/debian-7-wheezy-v20131120).

Continuing with the options, --metadata_from_file and --authorized_ssh_keys are two ways to specify key/value metadata associated with the instance. In this case, the first option sets the metadata value with the key startup-script to the contents of the file startup.sh, while the second sets the metadata value with the key sshKeys to a list of users and public keys that can be used to log into the instance (here, myuser is the username, and myuser.pub is the SSH public key file).

Both of these are specific to the instance (though it’s also possible to inherit metadata set at the project level), and can be queried — along with a host of other default metadata values — from the instance using a simple HTTP request that returns either a text string or JSON payload.

I’m not going to go into metadata in any detail other than the above built-ins, but it looks to be pretty powerful if you need any kind of per-instance customisation.
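For illustration, here’s how an instance might read back the startup-script value set above, using the current v1 metadata API (the exact path and required header depend on the API version, so treat this as a sketch rather than gospel):

$ curl -H "Metadata-Flavor: Google" \
    http://metadata.google.internal/computeMetadata/v1/instance/attributes/startup-script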

The startup-script metadata value is used to store the contents of a script run (as root) after your instance boots. In my case, I’m just using this to set the hostname of my instance, which was otherwise unset3; that in turn makes a bunch of tools throw warnings. I found the easiest way to fix this was to specify a startup script containing just hostname www.

The sshKeys metadata value is used to store a list of users and SSH public keys. This is read by a daemon (installed as /usr/share/google/google_daemon/manage_accounts.py) that ensures that each listed user continues to exist (with a home directory, .ssh/authorized_keys containing the specified key, etc), and also ensures that each listed user is a sudoer (present in /etc/sudoers)4.

Note that you don’t need to specify any of this at all. By default, Compute Engine creates a user account on your instance with a name set to your local login name, and creates a new ssh keypair that it drops into ~/.ssh/google_compute_engine{,.pub} on the machine you created the instance from. You can then simply use gcutil ssh instance-name to ssh into the instance.

This is helpful when you’re getting started, but it does mean that if you want to ssh from anywhere else, you either need to copy those keys around, or do something like the above to tell Compute Engine to accept a given public key. Since I wanted to be able to ssh programmatically from machines that didn’t necessarily have gcutil installed, I found it simpler to just create an ssh keypair manually, specify it as above, and use standard ssh to ssh to the instance.
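Concretely, that might look something like the following (with myuser and the address being the placeholders from the addinstance command above):

$ ssh-keygen -t rsa -f myuser        # writes myuser and myuser.pub
$ ssh -i myuser myuser@8.35.193.150  # after passing myuser:myuser.pub to --authorized_ssh_keys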

--external_ip_address allows you to choose a “static” external IP address (one you’ve previously reserved). Otherwise, the instance is assigned5 an ephemeral external IP address chosen when the instance is created. This is reclaimed if you delete the instance, so you probably don’t want to rely on ephemeral IP addresses as the target of e.g. a DNS record.

However, you can promote an ephemeral address that’s assigned to an instance so that it becomes a static address, and there’s no charge for (in-use) static addresses, so there’s no problem if you start using an ephemeral address and later want to keep it. (Strictly speaking, external IP addresses are actually optional, as all of your instances on the same “network” can talk to each other using internal addresses, but this isn’t something simple installations are likely to use, I wouldn’t have thought.)

Compute Engine doesn’t currently have support for IPv6, oddly, though there’s a message right at the top of the networking documentation saying that IPv6 is an “important future direction”, so hopefully that’s just temporary. (EC2, for what it’s worth, doesn’t support IPv6 on their instances either, though their load balancers do, so you can use a load balancer as a [costly] way to get IPv6 accessibility.)

Finally (phew!), --wait_until_running means that gcutil won’t return until the instance has actually started booting (typically about 25 seconds: you can add a brand-new instance and be ssh’d into a shell in less than a minute). Note that the machine won’t have any user accounts until the initial boot has finished, so if you’re scripting this you’ll need to spin a bit until ssh starts working.
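That spinning might look something like the following sketch (again using the placeholder key and address from earlier):

$ until ssh -i myuser -o ConnectTimeout=5 \
        -o StrictHostKeyChecking=no myuser@8.35.193.150 true; do
    sleep 5
  done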

Configure your instance

I did need to spend a fair amount of time working out how to configure the instance once it existed, but that was mostly because I’m not too familiar with Debian.

There isn’t a great deal to say about this part (and obviously it’ll depend upon what you’re doing), but in my case I simply ran sudo apt-get install to get the packages I needed (apache2 and mercurial, and a few others like less and vim), downloaded and installed mod_pagespeed, the Apache module that does the same thing as PageSpeed Service, built my site, and set up Apache to serve it.
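The package side of that was nothing more exciting than something like:

$ sudo apt-get update
$ sudo apt-get install -y apache2 mercurial less vim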

There are still two things I’m not quite happy with:

Cost

I’d estimate the monthly charge for my single instance to be around $15 + VAT (before the discount I’m getting). If I have the numbers right, that’s about what I’m paying for electricity to run my (not very efficient) server at home at present.

That price is dominated by the cost of the machine, which from the pricing documentation is currently $0.019/hour; the disk and network cost (for me) is going to end up at significantly less than a dollar a month.
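As a rough sanity check (my arithmetic, not a figure from the pricing documentation):

$0.019/hour × 24 hours × ~30.4 days ≈ $13.90/month for the instance itself

which, with well under a dollar of disk and network on top, is consistent with the estimate above (before VAT).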

I mention VAT above because something that’s not currently clear from the pricing documentation is that individuals in the EU pay VAT on top of the quoted prices (likewise true for EC2). Businesses are liable for VAT too, but they’re responsible for working out what to pay themselves, and do so separately.

One other aspect of the tax situation is a bit surprising (for individuals again, not businesses): VAT for Compute Engine is charged at the Irish VAT rate (23%), because when you’re in the EU, you’re paying Google Ireland. (This is in contrast to Amazon, who charge the UK rate even though you’re doing business with Amazon Inc.; tax is complicated.) Admittedly, the difference on the bill above is less than 30p/month, but it still took a little bit of time to figure out what was going on.

tl;dr

Despite all the talk of “big data” and large-scale data processing, is Compute Engine a viable option for running small-scale jobs like a simple static web server? Absolutely.

And while it’s easy to get started, it also looks like it scales naturally: I haven’t looked at load balancing (or protocol forwarding) in any detail, but everything else I’ve read about seems quite powerful and easy to start using incrementally.

From the management side, I’m impressed by the focus on scriptability: gcutil itself is fine as far as it goes, but the underlying Compute Engine API is documented in terms of REST and JSON, and the Developers Console goes out of its way to provide links to show you the REST results for what it’s doing (as it just uses the REST API under the hood). There are also a ton of client libraries available (gcutil is written against the Python API, for example), and support from third-party management tools like Scalr.

I still don’t think that I personally have any reason to use Compute Engine for large-scale processing, but I’m quite happy using it to serve this content to you.


  1. Funny story: I wasn’t sure whether to mention reliability, since it’s actually been pretty good. Then a few hours after writing the first draft of this post, my router fell over, and it was half a day before I could get access to restart it. So there’s that. 

  2. There was another reason I originally wanted to script the installation: the preview version of Compute Engine that I started out using supported non-persistent (“scratch”) machine-local disks that were zero-cost. Initially I was considering whether I could get the machine configured in a way that it could boot from a clean disk image and set itself up from scratch on startup. It turned out to be a little more complicated than made sense, so I switched to persistent disks, but kept the script (and then the 1.0 release of Compute Engine came along and did away with scratch disks anyway). 

  3. It turns out this was caused by a bug in the way my Developers Console project was set up, many years ago; it doesn’t happen in the general case, and it’ll be fixed if I recreate my instance. 

  4. This is actually a bit of a pain. I’m using this to create a service user that is used both for the initial install and website content updates, but I could probably do with separating out the two roles and creating the non-privileged user manually. 

  5. Not actually assigned, but kinda: the instance itself only ever has an RFC 1918 address assigned to eth0 (by default, drawn from 10.240.0.0/16, though you can customise even that). Instead, it’s the “network” — which implements NAT6 between the outside world and the instance — that holds the external-IP-to-internal-IP mapping. The networking documentation covers this in extensive detail. 

  6. I think even the NAT aspect is optional: Protocol forwarding (just announced today, as I write this) appears to allow you to attach multiple external IP addresses directly to a single instance, presumably as additional addresses on eth0.