farblog

by Malcolm Rowe

Small-scale Compute Engine

Google Compute Engine is Google’s “run a virtual machine on Google infrastructure” product. It’s broadly similar to Amazon’s EC2, in that you get an unmanaged (Linux) virtual machine that you can run pretty much anything on, one difference being that it seems to be aimed at larger workloads: 16-core machines with hundreds of GB of RAM, 5TB disks, that kind of thing.

While I’d been meaning to look at it for a while, I didn’t think I had any reason to use it; I certainly don’t have any workloads of the scale people seem to be talking about. A short while ago, a friend at work mentioned he was using it to run a private Minecraft server, which seemed pretty small to me, so I thought perhaps I’d take another look.

It turns out that Compute Engine is just as suited to small-scale workloads as large ones, and while you do have to pay to use it, it works out to be pretty inexpensive. Having spent a little time with it now, I figured it was time to document what I found out.

Boring disclaimer time first, though: I don’t work on Compute Engine, so this isn’t anything official, just some guy on the internet. Also, in the interests of full disclosure: I’m getting an employee discount on the cost of using Compute Engine (though it’s cheap enough that I’d be happy paying full price anyway). With that in mind…

Stalkers and readers with good memories will recall that I started proxying this site via Google’s PageSpeed Service a little over two years ago. PageSpeed Service is a reverse proxy running in Google’s data centres that applies various performance rewrites to the original content (minifying CSS, and so on), and it does a pretty good job overall. As an additional benefit, it’s a (short TTL) caching proxy, so nobody needs to depend directly on the copy of Apache running at the end of a DSL pipe on my server at home.

However, I’ve always been slightly bothered by the fact that that dependency still exists. There’s the usual “home network isn’t very reliable” problem1, but rather more importantly, that server’s on my home network, and given the choice, I’d rather not have it running a public copy of Apache as well as everything else.

Anyway, it turns out that I’m going to need to reinstall that server in a bit, so I figured that it might be a good time to see whether Compute Engine was a good fit to run a simple low-traffic Apache server like the one that serves this site (spoiler: yes).

I was hoping that I’d have something clever to say about what I needed to do to set it up, but in truth the Compute Engine quickstart is almost embarrassingly easy, and ends up with a running copy of Apache, not far from where I needed to be.

One thing I did decide to do while experimenting was to script the whole install, so that a single script creates the virtual machine (the “instance”), installs everything I need, and sets up Apache to serve this site. Partly2 this was to make sure I recorded what I’d done, and partly so that I could experiment and reset to a clean state when I messed things up.

That may have been a bit excessive for a simple installation, but it does mean that I now have good documentation of what I did, which I can go into in some detail.

First, create your instance

With Compute Engine, the first thing you need to do (assuming you’ve completed the setup in the quickstart) is to create an instance, which is what Compute Engine calls a persistent virtual machine. My script ended up using something like the following, which creates an instance together with a new persistent disk to boot from:

$ gcutil --project=farblog addinstance www \
    --machine_type=f1-micro \
    --zone=us-central1-b \
    --image=debian-7 \
    --metadata_from_file=startup-script:startup.sh \
    --authorized_ssh_keys=myuser:myuser.pub \
    --external_ip_address=8.35.193.150 \
    --wait_until_running

gcutil is the command-line tool from the Google Cloud SDK that allows you to configure and control everything related to Compute Engine (other tools in the SDK cover App Engine, Cloud Storage, and so on, but I didn’t need to use any of those).

Taking it from the top, --project specifies the Google Developers Console project ID (these projects are just a way to group different APIs for billing and so on; in this case, I’m only using Compute Engine). You can also ask gcutil to remember the project (or any other flag value) so that you don’t need to keep repeating it.

addinstance is the command to add a new instance, and www is my (unimaginative) instance name. Everything after this point is optional: the tool will prompt for the zone, machine type, and image, and use sensible defaults for everything else.

The machine type comes next: f1-micro is the smallest machine type available, with about 600MB RAM and a CPU reservation suited to occasional “bursty” (rather than continuous) workloads. That probably wouldn’t work for a server under load, but it seems to be absolutely fine for one like mine, with a request rate measured in seconds between requests, rather than the other way around.

Next is the zone (us-central1-b), where the machine I’m using will be physically located. This currently boils down to a choice of a few different locations in the US and Europe (at the time of writing, four different zones across two regions named us-central1 and europe-west1). As with Amazon, the European regions are slightly more expensive (by about 10%) than the US ones, so I’m using a zone in a US region.

While Compute Engine was in limited preview, the choice of zone within a region was a bit more important, as different zones had different maintenance schedules, and a maintenance event would shut down the whole zone for about two weeks, requiring you to bring up another instance somewhere else. However, the US zones no longer have downtime for scheduled maintenance: when some part of the zone needs to be taken offline, the instances affected will be migrated to other physical machines transparently (i.e. without a reboot).

This is pretty awesome, and really makes it possible to run a set-and-forget service like a web server without any complexity (or cost) involved in, for example, setting up load balancing across multiple instances.

After the zone, I’ve specified the disk image that will be used to initialise a new persistent root disk (which will be named the same as the instance). Alternatively, rather than creating a new disk from an image, I could have told the instance to mount an existing disk as the root disk (in either read/write or read-only mode, though a given disk can only be mounted read/write by one instance at any time).

The image really is just a raw disk image, and it appears it can contain pretty much anything that can run as an x86-64 KVM guest, though all the documentation and tools currently assume you’ll be running some Linux distribution, so you may find it a little challenging to run something else (though plenty of people seem to be).

For convenience, Google provides links to images with recent versions of Debian and CentOS (with RHEL and SUSE available as “premium” options), and above I’m using the latest stable version of Debian Wheezy (debian-7, which is actually a partial match for something like projects/debian-cloud/global/images/debian-7-wheezy-v20131120).

Continuing with the options, --metadata_from_file and --authorized_ssh_keys are two ways to specify key/value metadata associated with the instance. In this case, the first option sets the metadata value with the key startup-script to the contents of the file startup.sh, while the second sets the metadata value with the key sshKeys to a list of users and public keys that can be used to log into the instance (here, myuser is the username, and myuser.pub is the SSH public key file).

Both of these are specific to the instance (though it’s also possible to inherit metadata set at the project level), and can be queried — along with a host of other default metadata values — from the instance using a simple HTTP request that returns either a text string or JSON payload.
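
For example, from inside the instance, the startup-script value set above can be read with a plain HTTP GET against the metadata server. This is just a sketch using curl against the current v1 endpoint (which expects a Metadata-Flavor header); check the metadata documentation for the details:

$ curl -H 'Metadata-Flavor: Google' \
    http://metadata.google.internal/computeMetadata/v1/instance/attributes/startup-script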

I’m not going to go into metadata in any detail other than the above built-ins, but it looks to be pretty powerful if you need any kind of per-instance customisation.

The startup-script metadata value is used to store the contents of a script run (as root) after your instance boots. In my case, I’m just using this to set the hostname of my instance, which was otherwise unset3, which in turn makes a bunch of tools throw warnings. I found the easiest way to fix this was to specify a startup script containing just hostname www.

The sshKeys metadata value is used to store a list of users and SSH public keys. This is read by a daemon (installed as /usr/share/google/google_daemon/manage_accounts.py) that ensures that each listed user continues to exist (with a home directory, .ssh/authorized_keys containing the specified key, etc), and also ensures that each listed user is a sudoer (present in /etc/sudoers)4.

Note that you don’t need to specify any of this at all. By default, Compute Engine creates a user account on your instance with a name set to your local login name, and creates a new ssh keypair that it drops into ~/.ssh/google_compute_engine{,.pub} on the machine you created the instance from. You can then simply use gcutil ssh instance-name to ssh into the instance.

This is helpful when you’re getting started, but it does mean that if you want to ssh from anywhere else, you either need to copy those keys around, or do something like the above to tell Compute Engine to accept a given public key. Since I wanted to be able to ssh programmatically from machines that didn’t necessarily have gcutil installed, I found it simpler to just create an ssh keypair manually, specify it as above, and use standard ssh to ssh to the instance.

--external_ip_address allows you to choose a “static” external IP address (from one you’ve previously reserved). Otherwise, the instance is assigned5 an ephemeral external IP address chosen when the instance is created. This is reclaimed if you delete the instance, so you probably don’t want to rely on ephemeral IP addresses as the target of e.g. a DNS record.

However, you can promote an ephemeral address that’s assigned to an instance so that it becomes a static address, and there’s no charge for (in-use) static addresses, so there’s no problem if you start using an ephemeral address and later want to keep it. (Strictly speaking, external IP addresses are actually optional, as all of your instances on the same “network” can talk to each other using internal addresses, but this isn’t something simple installations are likely to use, I wouldn’t have thought.)

Compute Engine doesn’t currently have support for IPv6, oddly, though there’s a message right at the top of the networking documentation saying that IPv6 is an “important future direction”, so hopefully that’s just temporary. (EC2, for what it’s worth, doesn’t support IPv6 on their instances either, though their load balancers do, so you can use a load balancer as a [costly] way to get IPv6 accessibility.)

Finally (phew!), --wait_until_running won’t return until the instance has actually started booting (typically about 25 seconds; you can add a brand new instance and be ssh’d into a shell in less than a minute). Note that the machine won’t have any user accounts until the initial boot has finished, so if you’re scripting this you’ll need to spin a bit until ssh starts working.
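
Something like the following is enough for the “spin a bit” part (a sketch, reusing the user and address from the gcutil invocation above; the exact ssh options are a matter of taste):

$ until ssh -o ConnectTimeout=5 myuser@8.35.193.150 true; do sleep 5; done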

Configure your instance

I did need to spend a fair amount of time working out how to configure the instance once it existed, but that was mostly because I’m not too familiar with Debian.

There isn’t a great deal to say about this part (and obviously it’ll depend upon what you’re doing), but in my case I simply ran sudo apt-get install to get the packages I needed (apache2 and mercurial, and a few others like less and vim), downloaded and installed mod_pagespeed, the Apache module that does the same thing as PageSpeed Service, built my site, and set up Apache to serve it.

There are still two things I’m not quite happy with:

Cost

I’d estimate the monthly charge for my single instance to be around $15 + VAT (before the discount I’m getting). If I have the numbers right, that’s about what I’m paying for electricity to run my (not very efficient) server at home at present.

That price is dominated by the cost of the machine, which from the pricing documentation is currently $0.019/hour; the disk and network cost (for me) is going to end up at significantly less than a dollar a month.

I mention VAT above because something that’s not currently clear from the pricing documentation is that individuals in the EU pay VAT on top of the quoted prices (likewise true for EC2). Businesses are liable for VAT too, but they’re responsible for working out what to pay themselves, and do so separately.

One other aspect of the tax situation is a bit surprising (for individuals again, not businesses): VAT for Compute Engine is charged at the Irish VAT rate (23%), because when you’re in the EU, you’re paying Google Ireland. (This is in contrast to Amazon, who charge the UK rate even though you’re doing business with Amazon Inc. - tax is complicated.) Admittedly, the difference on the bill above is less than 30p/month, but it still took a little bit of time to figure out what was going on.

tl;dr

Despite all the talk of “big data” and large-scale data processing, is Compute Engine a viable option for running small-scale jobs like a simple static web server? Absolutely.

And while it’s easy to get started, it also looks like it scales naturally: I haven’t looked at load balancing (or protocol forwarding) in any detail, but everything else I’ve read about seems quite powerful and easy to start using incrementally.

From the management side, I’m impressed by the focus on scriptability: gcutil itself is fine as far as it goes, but the underlying Compute Engine API is documented in terms of REST and JSON, and the Developers Console goes out of its way to provide links to show you the REST results for what it’s doing (as it just uses the REST API under the hood). There are also a ton of client libraries available (gcutil is written against the Python API, for example), and support from third-party management tools like Scalr.

I still don’t think that I personally have any reason to use Compute Engine for large-scale processing, but I’m quite happy using it to serve this content to you.


  1. Funny story: I wasn’t sure whether to mention reliability, since it’s actually been pretty good. Then a few hours after writing the first draft of this post, my router fell over, and it was half a day before I could get access to restart it. So there’s that. 

  2. There was another reason I originally wanted to script the installation: the preview version of Compute Engine that I started out using supported non-persistent (“scratch”) machine-local disks that were zero-cost. Initially I was considering whether I could get the machine configured in a way that it could boot from a clean disk image and set itself up from scratch on startup. It turned out to be a little more complicated than made sense, so I switched to persistent disks, but kept the script (and then the 1.0 release of Compute Engine came along and did away with scratch disks anyway). 

  3. It turns out this was caused by a bug in the way my Developers Console project was set up, many years ago; it doesn’t happen in the general case, and it’ll be fixed if I recreate my instance. 

  4. This is actually a bit of a pain. I’m using this to create a service user that is used both for the initial install and website content updates, but I could probably do with separating out the two roles and creating the non-privileged user manually. 

  5. Not actually assigned, but kinda: the instance itself only ever has an RFC 1918 address assigned to eth0 (by default, drawn from 10.240.0.0/16, though you can customise even that). Instead, it’s the “network” — which implements NAT6 between the outside world and the instance — that holds the external-IP-to-internal-IP mapping. The networking documentation covers this in extensive detail. 

  6. I think even the NAT aspect is optional: Protocol forwarding (just announced today, as I write this) appears to allow you to attach multiple external IP addresses directly to a single instance, presumably as additional addresses on eth0. 

10 January 2014

Noda Time 1.2.0

Noda Time 1.2.0 finally came out last week, and since I promised I’d write a post about it, here’s a post about it — which I’ve also just partially self-plagiarised in order to make a post for the Noda Time blog, so apologies if you’ve read some of this already. I promise there’s new content below as well.

While the changes in Noda Time 1.1 were around making a Portable Class Library version and filling in the gaps from the first release, Noda Time 1.2 is all about serialization1 and text formatting.

On the serialization side, Noda Time now supports XML and binary serialization natively, and comes with an optional assembly (and NuGet package) to handle JSON serialization (using Json.NET).

On the text formatting side, Noda Time 1.2 now properly supports formatting and parsing of the Duration, OffsetDateTime, and ZonedDateTime types.

We also fixed a few bugs, and added some more convenience methods — Interval.Contains() and ZonedDateTime.Calendar, among others — in response to requests we received from people using the library2.

Finally, it apparently wouldn’t be a proper Noda Time major release without fixing another spelling mistake in our API: we replaced Period.Millseconds in 1.1, but managed not to spot that we’d also misspelled Era.AnnoMartyrm, the era used in the Coptic calendar. That’s fixed in 1.2, and I think (hope) that we’re done now.

There’s more information about all of the above in the comprehensive serialization section of the user guide, the pattern documentation for the Duration, OffsetDateTime, and ZonedDateTime types, and the 1.2.0 release notes.

You can pick up Noda Time 1.2.0 from the NuGet repository as usual, or from the links on the Noda Time home page.

That’s the summary, anyway. Below, I’m going to go into a bit more detail about XML and JSON serialization, and what kind of things you can do with the new text support.

XML serialization

Using XML serialization is pretty straightforward, and mostly works as you’d expect. Here’s a complete example demonstrating XML serialization of a Noda Time property:

using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;
using NodaTime;

public class Person
{
    public string Name { get; set; }
    public LocalDate BirthDate { get; set; }
}

static class Program
{
    static void Main(string[] args)
    {
        var person = new Person {
            Name = "David",
            BirthDate = new LocalDate(1979, 3, 22)
        };

        var x = new XmlSerializer(person.GetType());
        var namespaces = new XmlSerializerNamespaces(
            new XmlQualifiedName[] { new XmlQualifiedName("", "urn:") });

        var output = new StringWriter();
        x.Serialize(output, person, namespaces);
        Console.WriteLine(output);
    }
}

As you can see, there’s nothing special here, and the output is also as you’d expect:

<?xml version="1.0" encoding="utf-8"?>
<Person>
<Name>David</Name>
<BirthDate>1979-03-22</BirthDate>
</Person>

There are a couple of caveats to be aware of regarding XML serialization, though, most notably that the Period type requires special handling. Period is an immutable reference type, which XmlSerializer doesn’t really support, and so you’ll need to serialize via a proxy PeriodBuilder property instead.

The other notable issue (which also applies to binary serialization) is that .NET doesn’t provide any way to provide contextual configuration, and so when deserializing a ZonedDateTime, we need a way to find out which time zone provider to use.

By default, we’ll use the TZDB provider, but if you’re using the BCL provider (or any custom provider), you’ll need to set a static property:

DateTimeZoneProviders.Serialization = DateTimeZoneProviders.Bcl;

The serialization section in the user guide has more details about both of these issues.

There are also two other limitations of XmlSerializer that aren’t specific to Noda Time, but are good to know about if you’re just getting started:

JSON serialization

Noda Time’s JSON serialization makes use of Json.NET, which means that to use it, you’ll need to add references to both the Json.NET assembly (Newtonsoft.Json.dll) and the Noda Time support assembly (NodaTime.Serialization.JsonNet.dll).

The only setup you need to do in code is to inform Json.NET how to serialize Noda Time’s types (and again, which time zone provider to use). This can either be done by hand, or via a ConfigureForNodaTime extension method. Again, the user guide has all the details.

Once that’s done, using the serializer is straightforward:

using System;
using System.IO;
using Newtonsoft.Json;
using NodaTime;
using NodaTime.Serialization.JsonNet;

internal class Person
{
    public string Name { get; set; }
    public LocalDate BirthDate { get; set; }
}

static class Program
{
    static void Main(string[] args)
    {
        var person = new Person {
            Name = "David",
            BirthDate = new LocalDate(1979, 3, 22)
        };

        var json = new JsonSerializer();
        json.ConfigureForNodaTime(DateTimeZoneProviders.Tzdb);

        var output = new StringWriter();
        json.Serialize(output, person);
        Console.WriteLine(output);
    }
}

Output:

{"Name":"David","BirthDate":"1979-03-22"}

The Json.NET serializer is significantly more configurable than the .NET XML serializer. The Json.NET documentation is probably a good place to start if you’re interested in doing that.

Better text support

Noda Time 1.2 adds parsing and formatting for the Duration, OffsetDateTime, and ZonedDateTime types, which previously only had placeholder ToString() implementations. Given a series of assignments like the following:

var paris = DateTimeZoneProviders.Tzdb["Europe/Paris"];

ZonedDateTime zdt = SystemClock.Instance.Now.InZone(paris);
OffsetDateTime odt = zdt.ToOffsetDateTime();
Duration duration = Duration.FromSeconds(12345);

the result of calling ToString() on each of the zdt, odt, and duration variables would produce something like the following in 1.1:

Local: 26/11/2013 19:35:28 Offset: +01 Zone: Europe/Paris
2013-11-26T19:35:28.00081+01
Duration: 123450000000 ticks

In 1.2, these types use a standard pattern by default instead: the general invariant pattern (‘G’) for ZonedDateTime and OffsetDateTime, and the round-trip pattern (‘o’) for Duration:

2013-11-26T19:35:28 Europe/Paris (+01)
2013-11-26T19:35:28+01
0:03:25:45

More usefully, we can now use custom patterns:

var pattern = ZonedDateTimePattern.CreateWithInvariantCulture(
"dd/MM/yyyy' 'HH:mm:ss' ('z')'", null);
Console.WriteLine(pattern.Format(zdt));

which will print “26/11/2013 19:35:28 (Europe/Paris)”.

The null above is an optional time zone provider. If not specified, as shown above, the resulting pattern can only be used for formatting, and not for parsing3. This is why the standard patterns are format-only: they don’t have a time zone provider.

If you do specify a time zone provider, however, you can parse your custom format just fine:

var pattern = ZonedDateTimePattern.CreateWithInvariantCulture(
"dd/MM/yyyy' 'HH:mm:ss' ('z')'", DateTimeZoneProviders.Tzdb);
var zdt = pattern.Parse("26/11/2013 19:35:28 (Europe/Paris)").Value;
Console.WriteLine(zdt);

which prints “2013-11-26T19:35:28 Europe/Paris (+01)”, as you would expect.

As well as formatting the time zone ID (the “z” specified in the format string above), you can also format the time zone abbreviation (using “x”), which given the above input would produce “CET”, for Central European Time.

Now, if you’ve seen Jon’s “Humanity: Epic fail” talk — or watched his recent presentation at DevDay Kraków, which covers some of the same content — then you’ll already know that time zone abbreviations aren’t unique. For that reason, if you include a time zone abbreviation when creating a ZonedDateTimePattern, the pattern will also be format-only.

In addition to the time zone identifiers, both ZonedDateTime and OffsetDateTime patterns accept a format specifier for the offset in effect. This uses a slightly unusual format, as Offset can be formatted independently: it’s “o<…>”, where the “…” is an Offset pattern specifier. For example:

var pattern = ZonedDateTimePattern.CreateWithInvariantCulture(
"dd/MM/yyyy' 'HH:mm:ss' ('z o<+HH:mm>')'", null);
Console.WriteLine(pattern.Format(zdt));

which will unsurprisingly print “26/11/2013 19:35:28 (Europe/Paris +01:00)”.

For OffsetDateTime, the offset is a core part of the type, while for ZonedDateTime, it allows for the disambiguation of otherwise-ambiguous local times (as typically seen during a daylight saving transition).

If the offset is not included, the default behaviour for ambiguous times is to consider the input invalid. However, this can also be customised by providing the pattern with a custom resolver.

Finally, to Duration. Duration formatting is a bit more interesting, because we allow you to choose the granularity of reporting. For our duration above, of 12,345 seconds, the round-trip pattern shows the number of days, hours, minutes, seconds, and milliseconds (if non-zero), as “0:03:25:45”.

We can also format just the hours and minutes:

var pattern = DurationPattern.CreateWithInvariantCulture("HH:mm");
var s = pattern.Format(duration);
Console.WriteLine(s);

which prints “03:25” — or we can choose to format just the minutes and seconds:

var pattern = DurationPattern.CreateWithInvariantCulture("M:ss");
var s = pattern.Format(duration);
Console.WriteLine(s);

which does not print “25:45”, but instead prints “205:45”, reporting the total number of minutes and a ‘partial’ number of seconds. Had we instead used “mm:ss” as the pattern, we would indeed have seen the former result; the case of the format specifier determines whether a total or partial value is used.

Once again, there’s more information on all of the above in the relevant sections of the user guide.


  1. or serialisation. I apologise in advance for the spelling, but the term turns up in code all the time (e.g. ISerializable), and I find it makes for awkward reading to mix and match the two. 

  2. There’s definitely a balance to be had between the Pythonesque “only one way to do it” maxim and providing so many convenience methods that they cloud the basic concepts, and I think for 1.0 we definitely tended towards the former — which isn’t that bad: it’s easy to expand an API, but hard to reduce it. Some things that were a little awkward should be easier with 1.2, though. 

  3. The error message you’ll see is “UnparsableValueException: This pattern is only capable of formatting, not parsing.” 

27 November 2013

Measure twice

“Measure twice, cut once.” I can’t recall exactly where I was when I first heard that: perhaps a school carpentry lesson? Or for some reason I’m now thinking it was a physics lesson instead, but no matter. What is true is that I recently discovered that it applies to software engineering as much as carpentry.

Here’s the background: this blog is generated as a set of static files, by transforming an input tree into an output tree. The input tree contains a mixture of static resources (images, etc) alongside ‘posts’, text files containing metadata and Markdown content. The output tree contains exactly the same files, except that the posts are rendered to HTML, and there are some additional generated files like the front page and Atom feed.

There are tools to do this kind of thing now (we use Jekyll for Noda Time’s web site, for example), but I wrote my own (and then recently rewrote it in Python)1. My version does three things: it scans the input and output trees, works out what the output tree should look like, then makes it so. It’s basically make (with a persistent cache and content-hash-based change detection) plus a dumb rsync, and it’s not particularly complex.

For my most-recent post, I needed to add support for math rendering, which I did by conditionally sourcing a copy of MathJax from their CDN. So far, so good, but then I wanted to be able to proof the post while I was on a plane, so I decided to switch to a local copy of MathJax instead.

Problem: a local install of MathJax contains nearly 30,000 files, and my no-change edit-render-preview cycle shot from 75ms to just over 12 seconds. Most of the time I write posts in vim, and only proof the post when I’m done editing, but since this was a math-heavy post (and my LaTeX is… rusty), I was having a swordfight every few minutes.

But, you know, optimisation opportunity! I carried out some basic profiling and figured out that the 12 seconds I was seeing was taken up with:

  1. a little over five seconds scanning the input and output trees;
  2. about four seconds working about what to change; and
  3. about three seconds writing out the persistent cache.

The second of those I couldn’t see any obvious way to improve, but the first and last surprised me.

The input and output tree scanning is done by using os.walk() to walk the tree, and os.stat() to grab each file’s mtime and size (which I use as validators for cache entries).
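
In outline, the scan looks something like this (a simplified sketch rather than the tool’s actual code; the names are illustrative):

import os

def scan_tree(root):
    """Map each file's path (relative to root) to its (mtime, size)."""
    entries = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            entries[os.path.relpath(path, root)] = (st.st_mtime, st.st_size)
    return entries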

Clearly that was an inefficient way to do it: I’m calling stat(2) about 30,000 times, when I should be reading that information in the same call as the one that reads the directory, right? Except that there’s no such call: the Linux VFS assumes that a file’s metadata is separate from the directory entry2; this isn’t DOS.

Perhaps I was thrashing the filesystem cache, then? Maybe I should be sorting (or not sorting) the directory entries, or stat-ing all the files before I recursed into subdirectories? Nope; doesn’t make a difference.

Well, I guess we’re blocking on I/O then. After all, git doesn’t take long to scan a tree like this, so it must be doing something clever; I should do that. Ah, but git is multi-threaded, isn’t it?3 I’ll bet that’s how it can be fast: it’s overlapping the I/O operations so that it can make progress without stalling for I/O.

So I wrote a parallel directory scanner in Python, trying out both the multiprocessing and threading libraries. How long did each implementation take? About five seconds, same as before. (And raise your hand if that came as a surprise.)

The next thing I tried was replicating the scan in C, just to double-check that readdir-and-stat was a workable approach. I can’t recall the times, but it was pretty quick, so Python’s at fault, right?

Wrong. It’s always your fault.

I realised then that I’d never tried anything outside my tool itself, and ported the bare C code I had to Python. It took exactly the same amount of time. (At which point I remembered that Mercurial, which I actually use for this blog, is largely written in Python; that should have been a clue that it wasn’t likely to be Python in the first place.)

So finally I started taking a look at what my code was actually doing with the directory entries. For each one, it created a File object with a bunch of properties (name, size, etc), along with a cache entry to store things like the file’s content hash and the inputs that produced each output.

Now the objects themselves I couldn’t do much about, but the cache entry creation code was interesting: it first generated a cache key from the name, mtime, and size (as a dict), then added an empty entry to my global cache dict using that key. The global cache was later to be persisted as a JSON object (on which, more later), and so I had to convert the entry’s cache key to something hashable first.

And how to generate that hashable key? Well, it turned out that as I already had a general-purpose serialiser around, I’d made the decision to reuse the JSON encoder as a way to generate a string key from my cache key dict (because it’s not like performance matters, after all). Once I’d replaced that with something simpler, tree scans dropped from 5.2s to 2.9s. Success!
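
In other words, the change amounted to something like this (a sketch with loose variables; the real code builds the key from a File object):

import json

# Before: round-trip the key dict through the JSON encoder just to get a
# hashable string.
def cache_key_slow(name, mtime, size):
    return json.dumps({'name': name, 'mtime': mtime, 'size': size},
                      sort_keys=True)

# After: build a plain string directly. A tuple would work equally well.
def cache_key_fast(name, mtime, size):
    return '%s|%s|%s' % (name, mtime, size)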

I’d also noticed something odd while I was hacking about: when I removed some of my DEBUG level logging statements, things sped up a bit, even though I was only running at INFO level. I briefly considered that perhaps Python’s logging was just slow, then decided to take another look at how I was setting up the logging in the first place:

logging.basicConfig(
    level=logging.DEBUG,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))
logging.getLogger().handlers[0].setLevel(
    logging.DEBUG if args.verbose else logging.INFO)

Python’s logging is similar to Java’s, so this creates a logger that logs everything, then sets the default (console) log handler to only display messages at INFO level and above. Oops.

I’d stolen the code from another project where I’d had an additional always-on DEBUG handler that wrote to a file, but here I was just wasting time formatting log records I’d throw away later. I changed the logging to set the level of the root logger instead, and sped things up by another second. More success!
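
The fixed version looks something like this (a sketch; args.verbose comes from the usual argparse handling elsewhere):

import logging

# Set the level on the root logger itself, rather than formatting everything
# at DEBUG and only filtering at the handler.
logging.basicConfig(
    level=logging.DEBUG if args.verbose else logging.INFO,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))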

Finally, I decided to take a look at the way I was writing out my cache. This is a fairly large in-memory dict mapping string keys to dict values for each file I knew about. I’d known that Python’s json module wasn’t likely to be particularly fast, but almost three seconds to write an 11MB file still seemed pretty slow to me.

I wasn’t actually writing it directly, though; the output was quite big and repetitive, so I was compressing it using gzip first:

with gzip.open(self._cache_filename, 'wb') as cache_fh:
    json.dump(non_empty_values, cache_fh, sort_keys=True,
              default=_json_encode_datetime)

I noticed that if I removed the compression entirely, the time to write the cache dropped from about 2900ms to about 800ms, but by this point I was assuming that everything was my fault instead, so I decided to measure the time taken to separately generate the JSON output and write to the file.

To my surprise, when I split up the two (using json.dumps() to produce a string instead of json.dump()), the total time dropped to just 900ms. I have no real idea why this happens, but I suspect either that something is flushing each write to disk early, or that calls from Python code to Python’s native-code zlib module are expensive (or that json.dump() is slow).
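
Concretely, the faster version separates serialisation from compression, something like this (again a sketch, using the same names as the earlier snippet):

# Serialise to a single string first, then compress and write it in one go.
data = json.dumps(non_empty_values, sort_keys=True,
                  default=_json_encode_datetime)
with gzip.open(self._cache_filename, 'wb') as cache_fh:
    cache_fh.write(data)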

In total, that brought my no-change runtime down from twelve seconds to just about six.

So, in summary, once I realised that I should actually measure where my time was spent rather than guessing what it might be spent on, I was able to reduce my runtime by about half, quite a big deal in an edit-compile-run cycle. It took it from “I’m bored waiting; maybe read Slashdot instead” to something that was tolerable.

And so success, and scene.

But.


But there’s actually a larger lesson to learn here, too (and in fact I very nearly titled this post “The \sqrt of all evil” in light of that): I didn’t need to do any of this at all.

A few weeks after the above, I realised that I could solve my local editing problem an entirely different way. I moved MathJax out of the content tree entirely, and now (on local runs) just drop a symlink into the output tree after the tool is done. So if you take a look at the page as it serves now, you’ll see I’m back to sourcing MathJax via their CDN.

This means that I’m back down to O(200) input files rather than O(30k), and my no-change builds now take 30ms. It was a fun journey, but I’m not entirely sure that cutting 45ms from a 75ms run was worth all the effort I put in…


  1. Why? Mostly because I can: it’s fun. But also because it gives me an excuse to play around: originally with C and SQLite, and more recently with Python, a language I don’t use much otherwise. I could also say that static blog generators weren’t as common in 2006, but to be honest I don’t think I bothered looking. 

  3. Nope. git is multi-threaded when packing repositories, but as far as I’m aware that’s the only time it is. 

15 November 2013

A maximally-dense encoding for n-choose-k

Last month1, John Regehr asked:

Random question: 32 choose 8 gives less than 11e6 possibilities whereas 24 bits has more than 16e6 choices of value. It seems like there must be a standard fast trick for encoding the choices in the bits. Anyone know it?
John Regehr, Google+ post

As it happens, this is something I’ve occasionally wondered about, too.

The short answer is that there is a standard (though not particularly fast) way to do this called the “Combinatorial Number System”. As usual, Wikipedia has the details, though in this case I found their explanations a little hard to follow, so I’m going to go through it here myself as well.

First, let’s back up for a bit and make sure we know what we’re talking about.

If you wanted a way to compactly encode any combination of — let’s say — 32 choices, your best bet (assuming a uniform distribution) would be to use a 32-element bit-vector, with one bit set per choice made.

If we were implementing that in C++, we could do something like this:

uint32_t encode(const std::set<int>& choices) {
  uint32_t result = 0;
  for (int choice : choices) {
    assert(0 <= choice && choice < 32);
    result |= 1u << choice;  // unsigned literal: shifting a plain int left by 31 would overflow
  }
  return result;
}

This pattern is common enough that in practice you’d be more likely to use the bit-vector directly, rather than materialise anything like an explicit set of choice numbers. But while it’s a good solution for the general case, what if your problem involved some fixed number of choices instead of an arbitrary number?

For the purposes of the rest of this discussion, let’s pretend we’re going to make exactly four choices out of the 32 possible. That gives us about 36,000 different possible combinations to encode, which we should be able to fit into a 16-bit value.

Actually, for almost all purposes, there’s still nothing really wrong with the implementation above, even though it is using twice as many bits as needed — unless perhaps you had a very large number of combinations to store (and are willing to trade simplicity for space). However, as we’ll see later, the solution to this problem has a few other uses as well. Plus, I think it’s an interesting topic in itself.

If you just want to read about how the combinatorial number system itself works, feel free to skip the next section, as I’m going to briefly take a look at a ‘better’ (though still non-optimal) alternative to the above.

20 is less than 32

As an improvement to the above one-bit-per-choice implementation, we could simply encode the number of each choice directly using four groups of five bits each:

uint32_t encode(const std::set<int>& choices) {
  assert(choices.size() == 4);
  uint32_t result = 0;
  for (int choice : choices) {
    assert(0 <= choice && choice < 32);
    result = (result << 5) | choice;
  }
  return result;
}

This is still fairly simple, and at 20 bits output, it’s more compact than a 32-element bit-vector, but it’s also larger than our optimal result of (approximately) 16 bits.

We can actually see why this is: the encoding we’ve chosen is too expressive for what we actually need to represent2. In this case, there are two ways in which our encoding distinguishes between inputs that could be encoded identically.

The biggest waste occurs because we’re encoding an ordering. For example, \( \lbrace 1, 2, 3, 4 \rbrace \) and \( \lbrace 4, 3, 2, 1 \rbrace \) have different encodings under our scheme, yet really represent the same two combinations of choices.

The second (and much more minor) inefficiency comes from the ability to encode ‘impossible’ combinations like \( \lbrace 4, 4, 4, 4 \rbrace \). Optimally, choices after the first would be encoded using a slightly smaller number of bits, as we have fewer valid choices to choose from by then.

In this case, we can actually quantify precisely the degree to which this encoding is sub-optimal: being able to represent an ordering means that we have \( 4! = 24 \) times too many encodings for the ‘same’ input, while allowing duplicates means that we have \( 32^4 \div ( 32 \cdot 31 \cdot 30 \cdot 29 ) \approxeq 1.2 \) times too many (i.e. the number of choices we can encode, divided by the number we should be able to encode). Combining the two factors gives us the difference between an optimal encoding and this one.
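
As a sanity check, dividing out both factors from the \( 32^4 = 2^{20} \) values that the 20-bit encoding can represent does indeed land us back at the optimal count:

\[ {32^4 \over 4! \cdot \left( 32^4 \div ( 32 \cdot 31 \cdot 30 \cdot 29 ) \right)} = {32 \cdot 31 \cdot 30 \cdot 29 \over 4!} = \binom {32} 4 = 35960 \]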

It’s interesting to think about how we might improve on the above: perhaps we could canonicalise the input by ordering the choices by number, say, then rely on that fact somehow: if we knew that the choices were in decreasing order, for example, it’s clear to see that we could identify the combination \( \lbrace 3, 2, 1, 0 \rbrace \) entirely from the information that the first choice is number 3.

But let’s move on to something that is optimal.

and 16 is less than 20

And so we arrive at the “combinatorial number system”. This system describes a bijection between any combination of (a fixed number of) \(k\) choices and the natural numbers.

Interestingly, this means that this scheme does not depend upon knowing the number of things you’re choosing from: you can convert freely between the two representations just by knowing how many choices you need to make.

First, a brief refresher on binomials. The number of ways we can choose \(k\) things from \(n\) is the binomial coefficient, which can be defined recursively:

\[ \eqalign{ \binom n 0 &= \binom n n = 1 \cr \binom n k &= \binom {n-1} {k-1} + \binom {n-1} k } \]

… or, in a way that’s simpler to compute directly — as long as the numbers involved are small — in terms of factorials:

\[ \binom n k = {n! \over k!(n - k)!} \]

For example, we can compute that there are exactly 35,960 different ways to choose four things from a set of 32:

\[ \binom {32} 4 = {32! \over 4!(32 - 4)!} = 35960 \]

Note that in some cases (for example, the one immediately below), we may also need to define:

\[ \binom n k = 0 \text { when } k \gt n \]

That is, it’s not possible to choose more things from a set than were originally present.

But that’s enough about binomials. The combinatorial number system, then, is defined as follows:

Given a combination of choices \( \lbrace C_k, C_{k-1}, \dots, C_1 \rbrace \) from \(n\) elements such that \( n \gt C_k \gt C_{k-1} \gt \dots \gt C_1 \ge 0 \), we compute a value \(N\) that encodes these choices as:

\[ N = \binom {C_k} k + \binom {C_{k-1}} {k-1} + \dots + \binom {C_2} {2} + \binom {C_1} {1} \]

Somewhat surprisingly, this produces a unique value \( N \) such that:

\[ 0 \le N \lt \binom n k \]

And since the number of possible values of \( N \) is equal to the number of combinations we can make, every \( N \) maps to a valid (and different) combination.

\( N \) will be zero when all the smallest-numbered choices are made (i.e. when \( C_k = k - 1 \) and so \( C_1 = 0 \)), and will reach the maximum value with a combination containing the largest-numbered choices.
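
As a quick worked example, the combination \( \lbrace 6, 4, 2, 1 \rbrace \) (with \( k = 4 \)) encodes as:

\[ N = \binom 6 4 + \binom 4 3 + \binom 2 2 + \binom 1 1 = 15 + 4 + 1 + 1 = 21 \]

while the smallest combination, \( \lbrace 3, 2, 1, 0 \rbrace \), gives \( 0 + 0 + 0 + 0 = 0 \), as promised above.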

We could implement this encoding using something like the following:

uint32_t binom(int n, int k) {
  assert(0 <= k);
  assert(0 <= n);
  if (k > n) return 0;
  if (n == k) return 1;
  if (k == 0) return 1;
  return binom(n-1, k-1) + binom(n-1, k);
}

uint32_t encode(const std::set<int>& choices) {
  std::set<int, std::greater<int>> choices_sorted(
      choices.begin(), choices.end());
  int k = choices.size();

  uint32_t result = 0;
  for (int choice : choices_sorted) {
    result += binom(choice, k--);
  }
  return result;
}

… though in reality, we’d choose a much more efficient way to calculate binomial coefficients, since the current implementation ends up calling binom() a number of times proportional to the resulting \(N\)!
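
As a sketch of the kind of thing I mean (not intended to be bulletproof), we could instead build the coefficient up one factor at a time; each intermediate value is itself a binomial coefficient, so the division at each step is exact:

#include <cassert>
#include <cstdint>

uint32_t binom(int n, int k) {
  assert(0 <= k);
  assert(0 <= n);
  if (k > n) return 0;
  uint64_t result = 1;
  for (int i = 1; i <= k; i++) {
    // After this step, result == C(n - k + i, i), which is always an integer.
    result = result * (n - k + i) / i;
  }
  return static_cast<uint32_t>(result);
}

(The 64-bit intermediate just gives a little headroom; for larger inputs you’d want to think about overflow properly.)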

Decoding can operate using a greedy algorithm that first identifies the greatest-used choice number, then successively removes the terms we added previously:

std::set<int> decode(uint32_t N, int k) {
  int choice = k - 1;
  while (binom(choice, k) < N) {
    choice++;
  }

  std::set<int> result;
  for (; choice >= 0; choice--) {
    if (binom(choice, k) <= N) {
      N -= binom(choice, k--);
      result.insert(choice);
    }
  }
  return result;
}

We could also choose to remove the initial loop and just start choice at the greatest possible choice number, if we knew it in advance.

Number systems, how do they work?

As to how all this works, consider the list produced for successive \(N\).

For \(k = 4\), the enumeration of the combinations begins:

\[ \displaylines{ \lbrace 3, 2, 1, 0 \rbrace \cr \lbrace 4, 2, 1, 0 \rbrace \cr \lbrace 4, 3, 1, 0 \rbrace \cr \lbrace 4, 3, 2, 0 \rbrace \cr \lbrace 4, 3, 2, 1 \rbrace \cr \lbrace 5, 2, 1, 0 \rbrace \cr \vdots } \]

As you can see, this list is in order. More specifically: it’s in lexicographic order. This isn’t by coincidence, but is actually a direct result of the way we construct the equation above. Let’s do that.

First, construct an (infinite) list of all possible \(k\)-combinations, with the choices that form each individual combination in descending order, as above. Sort this list in lexicographic order, as if the choices were digits in some number system, again as shown above.

Pick any entry in that sorted list. We’re going to count the number of entries that precede our chosen one.

To do so, for each ‘digit’ (choice) in our chosen entry, count all valid combinations of the subsequence that extends from that digit to the end of the entry, while only making use of the choices with numbers smaller than the currently-chosen one. Sum those counts, and that’s the number of preceding entries.

That sounds much more complex than it really is, but what we’re doing is equivalent to, in the normal decimal number system, saying something like: “the count of numbers less than 567 is equal to the count of numbers 0xx–4xx, plus the count of numbers 50x–55x, plus the count of numbers 560–566”. Except in this case, the ‘digits’ in our numbers are strictly decreasing.

I skipped a step. How do we find the number of combinations for each subsequence? That’s actually easy: if the choice at the start of our subsequence currently has the choice number \(C_i\) and the subsequence is of length \(i\), then the number of lexicographically-smaller combinations of that subsequence, \(N_i\), is the number of assignments we can make of \(C_i\) different choices3 to the \(i\) different positions in the subsequence.

Or alternatively,

\[ N_i = \binom {C_i} i \]

and so:

\[ \eqalign{ N &= \sum_{i=1}^k N_i \cr &= \binom {C_k} k + \binom {C_{k-1}} {k-1} + \dots + \binom {C_2} {2} + \binom {C_1} {1} } \]

\(N\), of course, is both the count of entries preceding the chosen entry in our sorted list, and the index that we assign to that entry.

Random

Okay, so perhaps it’s not that common to need to encode a combination as an integer value, but there’s another way to use this to be aware of here: if you pick a random \(N\) and ‘decode’ it, you end up with a uniformly-chosen random combination. That’s something that I have wanted to do on occasion, and it’s not immediately clear how you’d do it efficiently otherwise.
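
A sketch of how that might look, reusing binom() and decode() from above (and assuming the counts involved fit comfortably in 32 bits):

#include <random>
#include <set>

// Returns a uniformly-chosen k-element subset of {0, ..., n-1}, by decoding
// a uniformly-random index in [0, C(n, k)).
std::set<int> random_combination(std::mt19937& rng, int n, int k) {
  std::uniform_int_distribution<uint32_t> index(0, binom(n, k) - 1);
  return decode(index(rng), k);
}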

(For completeness, a third usage that Wikipedia mentions is in being able to construct an array indexed by \(N\) in order to associate some information with each possible \(k\)-combination. I can see what they’re saying, but I can’t see many cases where this might actually be something that you need to do.)

Open questions

A simple bit-vector is optimal for representing any number of choices \( [0, n] \) made from \(n\) items. The combinatorial number system is optimal for representing a fixed number of choices \(k\).

What’s an optimal way to represent a bounded number of choices \( [0, k] \)? Or more generally, any arbitrarily-bounded number of choices?


  1. I’m a slow typist. 

  2. The other possibility is that the encoding may be wasteful in a straightforward information-theoretic sense, akin to using six bits to represent “a number from 0 to 37” rather than ~5.25. While this is strictly a superset of the over-expressiveness problem mentioned in the main text, it seems useful to differentiate the two, and consider the semantics expressed by the encoded representation we’ve decided upon. 

  3. That is, the count of those choice numbers that are smaller than \(C_i\), since choice numbers start from zero. 

7 November 2013