Noda Time 3.0.0 came out yesterday1, bringing a shiny new parcel of date- and time-related functionality.
What’s new in 3.0? Firstly, there’s a couple of things in 3.0 that just plain make it easier to use Noda Time:
Nullable reference types. The API now correctly uses the nullable reference types introduced in C# 8.0 to document when a method or property may accept or return a null value. For example, IDateTimeZoneProvider.GetZoneOrNull(id) now declares its return value as DateTimeZone?, while the similar indexer (which cannot return null) instead returns DateTimeZone. Nullability was previously noted in our documentation, but now (with appropriate compiler support) you can opt in to warnings that indicate where you might be accidentally passing a null somewhere you shouldn’t.
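In code, the distinction looks something like this (a minimal sketch; the zone IDs are just examples):

IDateTimeZoneProvider provider = DateTimeZoneProviders.Tzdb;
DateTimeZone? maybeZone = provider.GetZoneOrNull("Not/AZone");   // nullable: returns null for an unknown ID
DateTimeZone zone = provider["Europe/London"];                   // non-nullable: throws for an unknown ID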
A plethora of API improvements. For example, we now have a YearMonth type that can represent a value like “May 2020”; TzdbDateTimeZoneSource now provides explicit dictionaries mapping between TZDB and Windows time zone IDs; and DateAdjusters.AddPeriod() creates a date adjuster that can be used to add a Period to dates, along with many other improvements. As always, see the version history and API changes page for full details.
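For instance, the new YearMonth type and the period-based date adjuster can be used something like this (a sketch; the exact signatures shown are assumptions based on the descriptions above):

YearMonth may2020 = new YearMonth(2020, 5);
var addMonth = DateAdjusters.AddPeriod(Period.FromMonths(1));
LocalDate nextMonth = new LocalDate(2020, 5, 12).With(addMonth);  // 2020-06-12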
A single library version. Previous versions of Noda Time were slightly fragmented when it came to supporting different framework versions. For example, Noda Time 1.x was specific to the .NET Framework, and later added a Portable Class Library version that was missing a few key functions, while Noda Time 2.x again provided a separate .NET Standard version that differed slightly from the ‘full’ version. As of Noda Time 3.0, we have just one library version, providing the same functionality on all platforms.
Better support for other frameworks. Most core types are now annotated with TypeConverter and XmlSchemaProvider attributes. Type converters are used in various frameworks to convert one type into another (typically, to or from a string) — for example, ASP.NET will use type converters to convert query string parameters into typed values — while the XML schema attributes make it possible to build an XML schema programmatically for web services that make use of Noda Time types.
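As a rough illustration of what the type converters enable (a sketch, assuming the LocalDate converter parses the ISO-8601 date format):

using System.ComponentModel;
using NodaTime;

var converter = TypeDescriptor.GetConverter(typeof(LocalDate));
var date = (LocalDate)converter.ConvertFromInvariantString("2020-03-14");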
Although not as significant as the changes from Noda Time 1.x to 2.x, performance is still a key concern for Noda Time.
In 3.0.0, we’ve managed to eke out a little more performance for some common operations: finding the earlier of two LocalDate values now takes somewhere between 40–60% of the time it did in Noda Time 2.x, while parsing text strings as LocalTime and LocalDate values using common (ISO-like) patterns should also be a little faster, taking around 90% of the time it did in Noda Time 2.x.
The change from Noda Time 2.x to 3.0 is not as big as the one from Noda Time 1.x to 2.0, but there are still some small incompatibilities to watch out for.
The migration document details everything that we’re aware of, but there are two points worth calling out explicitly:
Noda Time 3.x has (slightly) greater system requirements than Noda Time 2.x. While Noda Time 2.x required either .NET Framework 4.5+ or .NET Core 1.0+, Noda Time 3.x requires “netstandard2.0”; that is, .NET Framework 4.7.2+ or .NET Core 2.0+.
.NET binary serialization is no longer supported. While .NET Core 2.0 added some support for binary serialization, binary serialization has many known deficiencies, and other serialization frameworks are now generally preferred. Accordingly, we have removed support for binary serialization entirely from Noda Time 3.x.
Noda Time still natively supports .NET XML serialization for all core types, and we also provide official libraries for serializing using JSON (1, 2) and Google’s protobuf.
In general, though, we expect that most projects using Noda Time 2.x should be able to replace it with Noda Time 3.0.0 transparently.
You can get Noda Time 3.0.0 from the NuGet repository as usual (core and testing packages), or from the links on the Noda Time home page.
Note that the serialization packages were decoupled from the main release during the 2.x releases, and so (for example) there is no new version of NodaTime.Serialization.JsonNet; the current version of that library will work just fine with Noda Time 3.0.0.
Good question. While Noda Time is fairly mature as a library, we do have a few areas we’d like to explore for the future: making use of Span<T> in text parsing, and providing a little more information from CLDR sources (stable timezone IDs, for example). If you’re interested in helping out, come and talk to us on the mailing list.
And once again, I’m going to copy/paste this to produce the official Noda Time blog post. (The evidence suggests that this is the only way I’ll get any content on my personal site, after all.) ↩
[Insert obligatory “well, it’s been a while since I’ve written anything for this blog” paragraph here.]
With 2018 finally complete, I thought it might be fun to take a quick look at the books I read last year. All of these are from my Goodreads profile, though I tend not to write reviews for individual books there.
Goodreads has a “reading challenge” each year wherein you can set a target number of books to read. In 2016, I hit my target of 34 books, albeit only by cramming both the SRE book and The Calendar of the Roman Republic (long story) on the last day of that year. Buoyed by success, I increased it to 38 books for 2017… and then got distracted by life and fell a bit short.
So, for 2018, I kept the same target as for 2017, and tried to not get distracted. A few weeks ago, I’d got a little bit ahead of that — woohoo me! — and decided it might be fun to put together a short review of each. So here are all the books I read in 2018, in (roughly) chronological order.
Robots vs. Fairies, various authors
Starting off 2018, an anthology of short stories: some about robots, and
some about fairies. Definitely mixed, with a few really good ones, and
a few that are… not so good (John Scalzi’s comes to mind as one of the
latter, surprisingly).
The Fifth Season (The Broken Earth, #1), N.K.
Jemisin
So, this I definitely liked. It has a great premise, post-apocalyptic —
or maybe just apocalyptic, given the intro — fantasy, good characters,
and good worldbuilding; it won the 2016 best novel Hugo, and yet… I
haven’t picked up the series again.
I’m not sure exactly why: perhaps because of the writing style (it’s present tense, partly in second person), perhaps because I was irritated by the way the writer withheld some key information the characters knew, or perhaps because of the incomplete ending. It’s possible that the sequels are brilliant, but I haven’t got around to finding out yet. Possibly in 2019.
Dark State (Empire Games, #2), Charles
Stross
Continuing Stross’ reboot of the Merchant Princes series, a
multiple-alternate-timeline spy/techno-thriller. Stross groks politics and
economics (and technology), so this is actually a pretty good alt-history
analysis as well as being a lot of fun. (Although if we could stop heading
towards the dystopian timeline in real life, that’d be great, thanks.)
The Night Masquerade (Binti, #3), Nnedi
Okorafor
Quoting from the one Goodreads review I did write: I was looking forward
to this offering a conclusion to the series. Well, in some ways it does do
that, and in some — quite important — ways, it doesn’t. I think I’d
have been better just appreciating the great world-building here rather
than the plot.
Beneath the Sugar Sky (Wayward Children, #3),
Seanan McGuire
So, what if fairy tales were real? What happens when they’re over? That’s
the premise of this series — in much the same way as Stross’
Equoid asks what it might be like if unicorns
were real (spoilers: sharp horns, so blood, mostly).
This book is almost-standalone, with some of the children from earlier books going on portal-hopping adventures of their own. I liked this one a lot more than the second book in the series, which had a different focus, and was a bit more serious. Also, I’ve just realised that book #4 (In an Absent Dream) is out next week!
The Fox’s Tower and Other Tales, Yoon Ha
Lee
A collection of flash fiction from Yoon Ha Lee, who’s also written some
excellently weird science fiction and interactive fiction. Like
Robots vs. Fairies above, I thought this was somewhat
hit-and-miss.
The stories I enjoyed more tended to be those heavy on imagery and light on ‘plot’ (such plot as is possible with flash fiction), though The Stone-Hearted Soldier was an excellent inclusion, and an exception to that rule (but also one of the longer stories).
An Unkindness of Ghosts, Rivers Solomon
A dystopian space opera set around a study of oppression and segregation
aboard a generation spaceship. The protagonists are incredibly varied and
interesting characters, though the bad guys are unfortunately cardboard.
I remember this being something I wanted to keep reading (if challenging in parts), but I can’t actually remember any of the plot at this point. Minor issues notwithstanding, I definitely enjoyed this.
The Arcadia Project series, #1–3
(Borderline,
Phantom Pains,
Impostor Syndrome),
Mishell Baker
From one set of neuroatypical characters to another. No spaceships here,
but an urban fantasy/mystery that posits a link between fey and Hollywood
celebrity. The whole series is great, the characters are believable and
well-rounded (and self-sabotaging and dysfunctional). I was worried that I
wouldn’t be that interested in a Los Angeles movie-town setting, but the
characters and story won me over.
This series ties with Smoke and Iron (below) as my favourite read of 2018. Recommended.
The Gone World, Tom Sweterlitsch
So, apparently I liked this enough to give it 4/5 on Goodreads, but I
can’t actually remember anything about it. It looks like it’s a time
travel/murder mystery/apocalypse story? Perhaps I should re-read it.
Sleeping Giants (Themis Files, #1), Sylvain
Neuvel
Told via the medium of interviews and news clippings, in the style of
World War Z, this is the story of how the discovery of a
giant robot hand plays out politically. There is some sci-fi here, but
mostly it’s the politics from Arrival that takes centre
stage.
This was alright, but again, I’ve not picked up the next in the series. The journal/interview format makes it hard to get much in the way of interaction between characters, and the story seemed more interested in the politics than in the sci-fi/mystery aspect (which is fine, just not what I was looking for).
Storytelling with Data: A Data Visualization Guide for Business
Professionals, Cole Nussbaumer Knaflic
I think this is what you’d get if you boiled down Tufte’s The
Visual Display of Quantitative Information into
practical advice and case studies, thirty years later. Definitely useful
and interesting, even though this isn’t something I need to do on a
regular basis professionally.
The Red Rising series, #1–4
(Red Rising,
Golden Son,
Morning Star,
Iron Gold),
Pierce Brown
Dystopian sci-fi. The blurb says “Ender’s Game meets The Hunger Games”,
and I suppose that’s about right: the protagonist takes on the elite by
infiltrating them and subverting them from within, only this time we’re
talking about Mars, and later an entire solar system.
I enjoyed the first few books in the series, but somewhere around the third or fourth I started to get a bit tired of the diffusion of the story to uninteresting point-of-view characters, and also of the continuous faux-Roman melodramatics.
The first book is definitely good by itself, and maybe I’ll pick the series up again at some point.
Kindred, Octavia E. Butler
This is also sci-fi, or maybe fantasy1, but is probably simpler to
think of as historical fiction. A modern progressive black woman is
transported to early 19th century Maryland, deep in the
antebellum American South.
With the caveat that “modern” here means the 1970s (the book being published in 1979), this is a fascinating story — if deeply unsettling at times — about how culture shapes behaviour, and how social hierarchies and systems can be justified and propagated by those within the system.
The Pliocene Exile / Galactic Milieu series
(The Many-Coloured Land,
The Golden Torc,
The Nonborn King,
The Adversary;
Intervention;
Jack the Bodiless,
Diamond Mask,
Magnificat),
Julian May
An easy re-read. Julian May’s epic galaxy- and time-spanning series starts
with a fantastic premise: as Earth has joined a galactic federation of
sorts, and as humanity has begun to evolve psionic powers, a misfit group
of disaffected/adventurous travellers escapes into exile via a one-way
time wormhole that deposits them in France, in the Pliocene epoch, 6
million years ago2.
Without spoiling too much, the story shifts very quickly from science fiction to something closer to high fantasy (for the first series, at least; the second is in a more contemporary time period, and is more ‘regular’ sci-fi). Weaving mythology and an epic story, this is well worth the time to read.
A Rag, a Bone and a Hank of Hair, Nicholas
Fisk
This YA dystopia was fun to read when I was a lot younger (it was
published in 1980; I probably read it sometime in the mid-1980s, along
with a lot of other Nicholas Fisk), but it hasn’t really held up that
well. The motivation behind the plot falls apart a bit on any analysis,
and some of the technology is a bit dated now (explanations about
miniature tape recorders, that kind of thing).
However, I do still like how the protagonist learns to interact with the other characters (both modern, and not-so-modern), and how their attitude changes over the course of the story, and I do still appreciate the swerve away from hard sci-fi that happens partway through. It’s flawed, but it’s still a classic.
The Lady Astronaut series, #1–2, plus the initial novelette
(The Lady Astronaut of Mars,
The Calculating Stars,
The Fated Sky),
Mary Robinette Kowal
Alt-history in which the author bootstraps the space race a decade early
via a meteorite-shaped forcing function. Post-steampunk, but
pre-electronic-computer; the author describes it as “punchcard punk”.
This is Hidden Figures meets Apollo 13, with a
strong focus on the racial and gender discrimination of the
1950s3.
(The novelette was published first — winning the 2014 Hugo for best novelette — but is set some thirty or so years after the novels. I read it first, but you could easily read it after: it’s not directly connected to the novels.)
The novels suffer very slightly from telling two separate stories: one is a humanity-against-the-elements story (Apollo 13 or The Martian), while the other is a documentary about 1950s cultural attitudes. Both are interesting stories, but I found it a little frustrating when the story would focus tightly on the protagonist to the exclusion of the wider global impact (pun most definitely intended).
However, overall this is definitely worth reading.
The Labyrinth Index (Laundry Files, #9),
Charles Stross
Well, we’re past the Lovecraftian singularity at this point, and it’s all
about surviving while the transhumans play. One of whom happens to be
inhabiting the Prime Minister at present, and who has opinions about
foreign policy.
Mhari, who we met in her current incarnation in The Rhesus Chart a while back, is presently attempting to stay alive while said elder god is playing eleven-dimensional chess nearby. Meanwhile, the US appears to have collectively forgotten that the executive branch exists…
I liked this a lot. Mhari was interesting without being annoying, as I worried she might be (she was in some of the earlier books; deliberately so in order to annoy Bob, I think). Otherwise, this was pretty much exactly as I expected at this point in the series: a lot of fun.
Revenant Gun (The Machineries of Empire, #3),
Yoon Ha Lee
Yoon Ha Lee’s conclusion to the series, about a 400-year-old immortal general and crazy
magic that works because of a shared consensual reality. It’s military
sci-fi, kinda?
I can’t really discuss this without spoilers, but while it did more hand-holding than earlier books in the series, it still featured a lot of creative worldbuilding.
Lies Sleeping (Rivers of London, #7), Ben
Aaronovitch
Like The Labyrinth Index above, by the time you get this far
into a series, you pretty much know what to expect: in this case, a fun
police procedural with magic and geeky in-jokes.
However, I did find it a bit hard to follow what was going on with the plot here, which seemed to be both a bit muddled and to reach back over the whole of the series. (I’ve also not read the associated graphic novels, which might have helped, though they’re not supposed to be necessary prerequisites.)
Side-note: an interesting article about intersectionality in the Rivers of London series.
A Canticle For Leibowitz, Walter M. Miller
Jr.
A classic (1959) post-apocalyptic sci-fi tale published during a high
point in Cold War tensions. In the far aftermath of nuclear war, society
struggles to drag itself out of a new dark age, and to rediscover and
protect old knowledge. This is three distinct stories — originally
published as such — separated by time (centuries), and vaguely connected
by place.
This unapologetically puts forward a Christian (specifically, Catholic) viewpoint, with the church to some extent a main character. It has some ironic humour, but also serious comment about ethics and human nature. With one exception near the end, I didn’t find it to be too preachy.
It made a big impact at the time, but is it actually a good story nowadays? Well, meh. I found it thought-provoking (and somewhat depressing) in turns, but I can’t actually say that the story is much more than a framework for the author’s viewpoints. Largely unsatisfying, and probably more important for the historical context now.
Smoke and Iron (The Great Library, #4), Rachel
Caine
Okay, this is just brilliant. Along with The Arcadia Project series
(above), this was easily one of my favourite reads of 2018.
So, why? Well, it’s got good worldbuilding, a fast-paced (and fun) plot, great characters and character development, and good writing.
The plot itself starts immediately after Ash and Quill, so talking about the plot directly would spoil the earlier books. In general, though, this series is a YA alt-history/fantasy in which the Great Library (of Alexandria) has become a ruthless worldwide power, tightly controlling both the dissemination of information and also the source for some of the magic/alchemy that’s available in this world.
On the writing: one section in particular has the viewpoint character magically hypnotized into believing that they’re someone else, and the author shifts the (tight third-person) text to match that impersonated character, having the viewpoint character not just act as another, but having the prose notice (and the character comment on internally) an entirely different set of things appropriate for the character they were impersonating. Subtle, but I liked it.
The Murderbot Diaries series
(All Systems Red,
Artificial Condition,
Rogue Protocol,
Exit Strategy),
Martha Wells
Nom nom nom. These were great. I inhaled the whole series all in one go.
Murderbot is a fairly apathetic and introverted humanform security droid that just wants to be left alone to watch sci-fi soap operas, but stupid humans keep doing stupid things that stop it from doing so, or worse, are trying to interact with it rather than let it stand in a corner by itself (to watch soap operas again, probably).
This is a series of four novellas written with Murderbot narrating, and it’s delightful. They are short, so each has a fairly straightforward plot, but it’s great fun nonetheless.
Ra, Sam Hughes
On the one hand, Ra is excellent: it’s a hard sci-fi novel (novella?)
with some really well thought-through worldbuilding. To some extent, it
puts me in mind of Snow Crash. (It also has some
really nice in-jokes, which I don’t think I can reference without being
spoilery.)
It was published in chapters on Sam Hughes’ blog (at qntm.org/ra, where you can read it for free), and there are also a few EPUB versions, some of which you can choose to pay for.
So as a self-published story, it’s really rather good. Unfortunately, on the other hand, I think it could also do with some quite significant editing, as there seem to be two almost completely different stories here, and while they’re linked, the story switches at one point from something grounded (like Snow Crash) to something incomprehensible by Greg Egan, and while both are good, I don’t think they fit well together.
To sum up: I managed to read 40 books last year, almost all of which were fiction, mostly urban fantasy and sci-fi, to nobody’s surprise. (I also started and failed to finish a bunch of non-fiction books).
I think I did a better job of picking books with diverse protagonists this time round, and while most of the books I read were published in the last few years (40% were published in 2018), I managed to also seek out a few older ones (Kindred, for example, I’m really glad I got round to reading).
Onward to 2019!
I’d have called it sci-fi purely because it has time-travel, but I ran across an interview with Butler in which she points out, “Kindred is fantasy. I mean literally, it is fantasy. There’s no science in Kindred.” She has a point. ↩
… though from what I can tell, 6 Ma is squarely in the Miocene epoch, not the Pliocene. In A Pliocene Companion, Word of God resolves this by stating that, in-universe, the Pliocene is considered to start around 11 Ma (not 5.6 or 5.33 Ma, as in our reality). ↩
And to a large extent, discrimination that’s still present today: there’s a line where our heroine says that “people would ignore what I said until [my husband] repeated it”, which sounds familiar enough. ↩
I have invalidated
the assumptions
that your code
depended upon
Forgive me
they were so well hidden
and so fragile
— Reid McKenzie, Twitter
I recently ran across the fact that it’s possible to make the Java runtime deadlock while initialising a class — and that this behaviour is even mandated by the Java Language Specification.
Here’s a Java 7 program that demonstrates the problem:
public class Program {
    public static void main(String args[]) {
        new Thread(new Runnable() {
            @Override public void run() {
                A.initMe();
            }
        }).start();
        B.initMe();
    }

    private static class A {
        private static final B b = new B();
        static void initMe() {}
    }

    private static class B {
        private static final A a = new A();
        static void initMe() {}
    }
}
In addition to demonstrating that lambdas are a good idea (all that boilerplate to start a thread!), this also shows how cycles during class initialisation can lead to a deadlock. Here’s what happens when you run it1:
$ javac Program.java
$ java Program
That is, it hangs.
In Java, classes are loaded at some arbitrary point before use, but are only initialised — running the static {} blocks and static field initialisers — at defined points2. One of these points is just before a static method is invoked, and so the two calls to A.initMe() and B.initMe() above will both trigger initialisation for the respective classes.
In this case, each class contains a static field that instantiates an instance of the other class. Instantiating the other class requires that that class is initialised, and so what we end up with is that each class’s initialisation is blocked waiting for the initialisation of the other class to complete.
If you trigger a thread dump at this point — by sending a SIGQUIT or hitting Ctrl-\ (or Ctrl-Break on Windows) — then you’ll see something like this:
Full thread dump OpenJDK 64-Bit Server VM (24.79-b02 mixed mode):
"Thread-0" prio=10 tid=0x00007efd50105000 nid=0x51db in Object.wait() [0x00007efd3f168000]
java.lang.Thread.State: RUNNABLE
at Program$A.<clinit>(Program.java:13)
at Program$1.run(Program.java:5)
at java.lang.Thread.run(Thread.java:745)
"main" prio=10 tid=0x00007efd5000a000 nid=0x51ca in Object.wait() [0x00007efd59d45000]
java.lang.Thread.State: RUNNABLE
at Program$B.<clinit>(Program.java:18)
at Program.main(Program.java:9)
[...]
Interestingly, you can see that while both threads are executing an implicit Object.wait(), they’re listed as RUNNABLE rather than WAITING, and there’s no output from the deadlock detector. I suspect that the reason for both of these is that the details of class initialisation changed in Java 7: in Java 6, the runtime would attempt to lock the monitor owned by each Class instance for the duration of the initialisation, while in Java 7, attempting to initialise a class that’s already being initialised by another thread just requires that the caller be blocked in some undefined fashion until that initialisation completes.
There are other ways to trigger the same problem, too. Here’s another problematic snippet:
public class Foo {
public static final Foo EMPTY = new EmptyFoo();
}
public class EmptyFoo extends Foo {}
Here we have Foo, and EmptyFoo, a special — presumably empty, in some fashion — version of Foo. EmptyFoo is usable directly, but it’s also available as Foo.EMPTY.
The problem here is that initialising EmptyFoo requires us to initialise the superclass, and initialising Foo requires initialisation of EmptyFoo for the static field. This would be fine in one thread, but if two threads attempt to initialise the two classes separately, deadlock results.
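A hypothetical driver class (in the same Java 7 style as the first example, and not part of the original snippet) shows how that can happen:

public class Main {
    public static void main(String[] args) {
        // This thread initialises EmptyFoo, which first needs its superclass Foo...
        new Thread(new Runnable() {
            @Override public void run() {
                new EmptyFoo();
            }
        }).start();
        // ...while the main thread initialises Foo, whose static field needs EmptyFoo.
        Foo empty = Foo.EMPTY;
    }
}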
Cyclic dependencies between classes have always been problematic in both Java and C#, as references to non-constant static fields in classes that are already being initialised see uninitialised (Java) or default (C#) values. However, normally the initialisation does complete; here, it doesn’t, and here the dependencies are simply between the classes, not between their data members.
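As an illustration of that single-threaded case (a hypothetical pair of classes, not from the post):

class X { static int value = Y.value + 1; }
class Y { static int value = X.value + 1; }
// If X is initialised first, Y's initialiser reads X.value while it is still 0,
// so Y.value becomes 1 and X.value then becomes 2; initialise Y first and the
// results swap. Initialisation completes, but one class has seen a default value.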
Unfortunately, I don’t know of any convenient way to detect these cycles in Java: OpenJDK provides -XX:+TraceClassInitialization, which I suspect might be useful, but it’s only available in debug builds of the OpenJDK JRE3, and I haven’t been able to confirm exactly what it shows.
And for what it’s worth, I’m not aware of a better solution for detecting cycles in C# either. For Noda Time, we used a custom cycle detector for a while; it spotted some bugs resulting from reading default values, but it was too brittle and invasive (it required modifying each class), and so we removed it before 1.0.
I suppose that if we assume that class initialisation occurs atomically and on multiple threads, then this kind of problem is bound to come up4. Perhaps what’s surprising is that these languages do allow the use of partially-initialised classes in the single-threaded case?
If videos are your thing, the folks at Webucator have turned this post into a video as part of their free (registration required) Java Solutions from the Web course. They also offer a series of paid Java Fundamentals classes covering a variety of topics.
Or at least, what happens when I run it, on a multiprocessor Debian machine running OpenJDK 7u79. I don’t think the versions are particularly important — this behaviour seems to be present in all Java versions — though I am a little surprised that I didn’t need to add any additional synchronisation or delays. ↩
A similar situation exists in C# for classes with static constructors (for classes without, the runtime is allowed much more latitude as to when the type is initialised). ↩
You can trace class loading with -XX:TraceClassLoadingPreorder and -XX:TraceClassLoading, but this doesn’t tell you when class initialisation happens. ↩
He says, with a sample size of one. I haven’t managed to confirm what C# does, for example, and C++ avoids this problem by replacing it with a much larger one, the “static initialisation order fiasco”. ↩
(This is a quick post for search-engine fodder, since I didn’t manage to find anything relevant myself.)
If you’re using pip install --isolated to install Python packages and find that it fails with an error like the following:
Complete output from command python setup.py egg_info:
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: option --no-user-cfg not recognized
… then you might have run into an incompatibility between pip and Python versions 3.0–3.3.
pip version 6.0 added an isolated mode (activated by the --isolated flag) that avoids looking at per-user configuration (pip.conf and environment variables). Running pip in isolated mode also passes the --no-user-cfg flag to Python’s distutils to disable reading the per-user ~/.pydistutils.cfg. But that flag isn’t available in Python versions 3.0–3.3, causing the error above.
I ran into this because I recently migrated the Python code that generates this site to run under Python 3.x. I’m using a virtualenv setup, so once I had everything working under both Python versions, I was reasonably confident that I could switch ‘production’ (i.e. the Compute Engine instance that serves this site) to Python 3 and discard the 2.x-compatibility code.
Good thing I tested it out first, since it didn’t even install.
It turns out that --no-user-cfg was added in Python 2.7, but wasn’t ported to 3.x until 3.42.
I worked around this by just omitting the --isolated flag for Python versions [3.0, 3.4) — though since I don’t actually have any system config files in practice, I probably could have set PIP_CONFIG_FILE=/dev/null instead (which has the effect of ignoring all config files).
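As a sketch of that version-dependent workaround (the version test is mine, and “somepackage” is a placeholder):

# Skip --isolated on Python 3.0-3.3, where distutils doesn't understand --no-user-cfg.
if python -c 'import sys; sys.exit(0 if (3, 0) <= sys.version_info < (3, 4) else 1)'; then
    pip install somepackage
else
    pip install --isolated somepackage
fi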
I’m not the first person to have noticed that virtualenv isn’t actually hermetic. Though some of that rant is out of date now (Python wheel files provide prebuilt binaries), and some isn’t relevant to the way I’m using virtualenv/pip, it’s definitely true that the dependency on the system Python libraries is the main reason I’d look to something more like Docker or Vagrant for deployment were I doing this professionally.
So did I finally manage to switch to Python 3.x after that? Not even close:
Python 3.x didn’t gain the ability to (redundantly) use the u'foo' syntax for Unicode strings until 3.3, and some of my dependencies use that syntax. So I’m waiting until I can switch to Debian 8 on Compute Engine3, at which point I can cleanly assume Python 3.4 or later.
This is a rant for another day, but it looks like virtualenv monkeypatches pip, which monkeypatches setuptools, which either monkeypatches or builds upon distutils. Debugging through this edifice of patched abstractions is… not easy. ↩
It’s a bit more complex than that: Python 3.0 and 3.1 were released first, then the feature was implemented in both 2.7 and 3.2, but then distutils as a whole was rolled back to its 3.1 state before 3.2 was released. That rollback was reverted for Python 3.4. ↩
I can apt-get dist-upgrade from the Debian 7 image just fine, but it’s a bit slow and hacky, so I’d rather wait for official images. (I also need to fix some custom mail-related configuration that appears to have broken under Debian 8.) ↩
It’s been a while since the last Noda Time release, and while we’re still working towards 2.0, we’ve been collecting a few bug fixes that can’t really wait. So last Friday1, we released Noda Time 1.3.1.
Noda Time 1.3.1 updates the built-in version of TZDB from 2014e to 2015a, and fixes a few minor bugs, two of which were triggered by recent data changes.
Since it’s been a while since the previous release, it may be worth pointing out that new Noda Time releases are not the only way to get new time zone data: applications can choose to load an external version of the time zone database rather than use the embedded version, and so use up-to-date time zone data with any version of the Noda Time assemblies.
If you’re in a hurry, you can get Noda Time 1.3.1 from the NuGet repository (core, testing, JSON support packages), or from the links on the Noda Time home page. The rest of this post talks about the changes in 1.3.1 in a bit more detail.
In the middle of 2009, Bangladesh started observing permanent daylight saving time, as an energy-saving measure. This was abandoned at the end of that year, and the country went back to permanent standard time.
Until recently, that transition back to standard time was actually recorded as happening a minute too early, at 23:59 on December 31st. TZDB 2014g fixed this by changing the transition time to “24:00” — that is, midnight at the end of the last day of the year.
Noda Time could already handle transitions at the end of the day, but would incorrectly ignore this particular transition because it occurred ‘after’ 2009. That’s now fixed, and Noda Time 1.3.1 returns the correct offset for Asia/Dhaka when using data from TZDB 2014g or later.
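A quick way to see the effect of the fix (a sketch; any instant after the end of 2009 would do):

var dhaka = DateTimeZoneProviders.Tzdb["Asia/Dhaka"];
var offset = dhaka.GetUtcOffset(Instant.FromUtc(2010, 6, 1, 0, 0));
// With TZDB 2014g or later and Noda Time 1.3.1, this is +06 rather than +07.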
In October 2014, most of Russia switched from permanent daylight saving time to permanent standard time, effectively moving local time back one hour. These changes were included in TZDB 2014f.
For people using the BCL provider instead of the TZDB provider (and using Windows), Microsoft delivered a hotfix in September 2014. However, our BCL provider depends upon the .NET framework’s TimeZoneInfo class, and the .NET framework — unlike TZDB — is unable to represent historical changes to the ‘base’ offset of a time zone (as happened here).
The result is that Noda Time (and other applications using TimeZoneInfo in .NET 4.5.3 and earlier) incorrectly compute the offset for dates before October 26th, 2014.
A future update of the .NET framework should correct this limitation, but without a corresponding change in Noda Time, the extra information wouldn’t be used; Noda Time 1.3.1 prepares for this change, and will use the correct offset for historical dates when TimeZoneInfo does.
The time zones returned by the BCL provider have long had a limitation in the way time zone equality was implemented: a BCL time zone was considered equal to itself, and unequal to a time zone returned by a different provider, but attempting to compare two different BCL time zone instances for equality always threw a NotImplementedException. This was particularly annoying for ZonedDateTime, as its equality is defined in terms of the contained DateTimeZone.
This was documented, but we always considered it a bug, as it wasn’t possible to predict whether testing for equality would throw an exception. Noda Time 1.3.1 fixes this by implementing equality in terms of the underlying TimeZoneInfo: BCL time zones are considered equal if they wrap the same underlying TimeZoneInfo instance.
Note that innate time zone equality is not really well defined in general, and is something we’re planning to reconsider for Noda Time 2.0. Rather than rely on DateTimeZone.Equals(), we’d recommend that applications that want to compare time zones for equality use ZoneEqualityComparer to specify how two time zones should be compared.
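For example (a sketch; zone1 and zone2 stand in for whichever two zones you want to compare):

var interval = new Interval(Instant.FromUtc(2000, 1, 1, 0, 0), Instant.FromUtc(2020, 1, 1, 0, 0));
var comparer = ZoneEqualityComparer.ForInterval(interval);
bool equivalent = comparer.Equals(zone1, zone2);  // equal behaviour over that interval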
There are a handful of other smaller fixes in 1.3.1: the NodaTime assembly correctly declares a dependency on System.Xml, so you won’t have to; the NuGet packages now work with ASP.NET’s kpm tool, and declare support for Xamarin’s Xamarin.iOS (for building iOS applications using C#) in addition to Xamarin.Android, which was already listed; and we’ve fixed a few reported documentation issues along the way.
As usual, see the User Guide and 1.3.1 release notes for more information about all of the above.
Work is still continuing on 2.0 along the lines described in our 1.3.0 release post, and we’re also planning a 1.4 release to act as a bridge between 1.x and 2.0. This will deprecate members that we plan to remove in 2.0 and introduce the replacements where feasible.
Release late on Friday afternoon? What could go wrong? Apart from running out of time to write a blog post, I mean. ↩
Some years back, I posted a graph showing the growth of Subversion’s codebase over time, and I thought it might be fun to do the same with Noda Time. The Subversion graph shows the typical pattern of linear growth over time, so I was expecting to see the same thing with Noda Time. I didn’t1.
Noda Time’s repository is a lot simpler than Subversion’s (it’s also at least an order of magnitude smaller), so it wasn’t that difficult to come up with a measure of code size: I just counted the lines in the .cs files under src/NodaTime/ (for the production code) and src/NodaTime.Test/ (for the test code).
I decided to exclude comments and blank lines this time round, because I wanted to know about the functional code, not whether we’d expanded our documentation. As it turns out, the proportion of comments has stayed about the same over time, but that ratio is very different for the production code and test code: comments and blank lines make up approximately 50% of the production code, but only about 20–25% of the test code.
Here’s the graph. It’s not exactly up-and-to-the-right, more… wibbly-wobbly-timey-wimey.
There are some things that aren’t surprising: during the pre-1.0 betas (the first two unlabelled points) we actively pruned code that we didn’t want to commit to for 1.x2, so the codebase shrinks until we release 1.0. After that, we added a bunch of functionality that we’d been deferring, along with a new compiled TZDB file format for the PCL implementation. So the codebase grows again for 1.1.
But then with 1.2, it shrinks. From what I can see, this is mostly due to an internal rewrite that removed the concept of calendar ‘fields’ (which had come along with the original mechanical port from Joda Time). This seems to counterbalance the fact that at the same time we added support for serialization3 and did a bunch of work on parsing and formatting.
1.3 sees an increase brought on by more features (new calendars and APIs), but then 2.0 (at least so far) sees an initial drop, a steady increase due to new features, and (just last month) another significant drop.
The first decrease for 2.0 came about immediately, as we removed code that was deprecated in 1.x (particularly, the handling for 1.0’s non-PCL-compatible compiled TZDB format). Somewhat surprisingly, this doesn’t come with a corresponding decrease in our test code size, which has otherwise been (roughly speaking) proportional in size to the production code (itself no real surprise, as most of our tests are unit tests). It turns out that the majority of this code was only covered by an integration test, so there wasn’t much test code to remove.
The second drop is more interesting: it’s all down to new features in C# 6. For example, in Noda Time 1.3, Instant has Equals() and GetHashCode() methods that are written as follows:
public override bool Equals(object obj)
{
    if (obj is Instant)
    {
        return Equals((Instant)obj);
    }
    return false;
}

public override int GetHashCode()
{
    return Ticks.GetHashCode();
}
In Noda Time 2.0, the same methods are written using expression-bodied members, in two lines (I’ve wrapped the first line here):
public override bool Equals(object obj) =>
obj is Instant && Equals((Instant)obj);
public override int GetHashCode() => duration.GetHashCode();
That’s the same functionality, just written in a terser syntax. I think it’s also clearer: the former reads more like a procedural recipe to me; the latter, a definition.
Likewise, ZoneRecurrence.ToString() uses expression-bodied members and string interpolation to turn this:
public override string ToString()
{
    var builder = new StringBuilder();
    builder.Append(Name);
    builder.Append(" ").Append(Savings);
    builder.Append(" ").Append(YearOffset);
    builder.Append(" [").Append(fromYear).Append("-").Append(toYear).Append("]");
    return builder.ToString();
}
into this:
public override string ToString() =>
$"{Name} {Savings} {YearOffset} [{FromYear}-{ToYear}]";
There’s no real decrease in test code size though: most of the C# 6 features are really only useful for production code.
All in all, Noda Time’s current production code is within 200 lines of where it was back in 1.0.0-beta1, which isn’t something I would have been able to predict. Also, while we don’t quite have more test code than production code yet, it’s interesting to note that we’re only about a hundred lines short.
Does any of this actually matter? Well, no, not really. Mostly, it was a fun little exercise in plotting some graphs.
It did remind me that we have certainly simplified the codebase along the way — removing undesirable APIs before 1.0 and removing concepts (like fields) that were an unnecessary abstraction — and those are definitely good things for the codebase.
And it’s also interesting to see how effective the syntactic sugar in C# 6 is in reducing line counts, but the removal of unnecessary text also improves readability, and it’s that readability that’s the key part here, rather than the number of lines of code that results.
But mostly I just like the graphs.
Or, if you prefer BuzzFeed-style headlines, “You won’t believe what happened to this codebase!”. ↩
To get to 1.0, we removed at least: a verbose parsing API that tried to squish the Noda Time and BCL parsing models together, an in-code type-dependency graph checker, and a very confusingly-broken CultureInfo replacement. ↩
I’m not counting the size of the NodaTime.Serialization.JsonNet package here at all (nor the NodaTime.Testing support package), so this serialization support just refers to the built-in XML and binary serialization. ↩
If you have a privileged process that needs to invoke a less-trusted child process, one easy way to reduce what the child is able to do is to run it under a separate user account and use ssh to handle the delegation.
This is pretty simple stuff, but as I’ve just wasted a day trying to achieve the same thing in a much more complicated way, I’m writing it up now to make sure that I don’t forget about it again.
(Note that this is about implementing privilege separation using ssh, not about how ssh itself implements privilege separation; if you came here for that, see the paper Preventing Privilege Escalation by Niels Provos et al.)
In my case, I’ve been migrating my home server to a new less unhappy machine, and one of the things I thought I’d clean up was how push-to-deploy works for this site, which is stored in a Mercurial repository.
What used to happen was that I’d push from wherever I was editing, over ssh, to a repository in my home directory on my home server, then a changegroup hook would update the working copy (hg up) to include whatever I’d just pushed, and run a script (from the repository) to deploy to my webserver. The hook script that runs sends stdout back to me, so I also get to see what happened.
(This may sound a bit convoluted, but I’m not always able to deploy directly from where I’m editing to the webserver. This also has the nice property that I can’t accidentally push an old version live by running from the wrong place, since history is serialised through a single repository.)
The two main problems here are that pushing to the repository has the surprising side-effect of updating the working copy in my home directory (and so falls apart if I accidentally leave uncommitted changes lying around), and that the hook script runs as the user who owns the repository (i.e. me), which is largely unnecessary.
For entirely separate reasons, I’ve recently needed to set up shared Mercurial hosting (which I found to be fairly simple, using mercurial-server), so I now have various repositories owned by a single hg user.
I don’t want to run the (untrusted) push-to-deploy scripts directly as that shared user, because they’d then have write access to all repositories on the server. (This doesn’t matter so much for my repositories, since only I can write to them, and it’s my machine anyway, but it will for some of the others.)
In other words, I want a way to allow one privileged process (the Mercurial server-side process running as the hg user) to invoke another (a push-to-deploy script) in such a way that the child process doesn’t retain the first process’s privileges.
There are lots of ways to achieve this, but one of the simplest is to run the two processes under different user accounts, then either find a way to communicate between two always-running processes (named pipes or shared memory, for example), or for one to invoke the other directly.
The latter is more appropriate in this case, and while the obvious way for a (non-root) user to run a process as another is via sudo, the policy specification for that (in /etc/sudoers) is… complicated. Happily, there’s a simpler way that only requires editing configuration files owned by the two users in question: ssh.
The setup is fairly easy: I’ve created a separate user that will run the push-to-deploy script (hg-blog), generated a password-less keypair for the calling (hg) user, and added the public key (with from= and command= options) to /home/hg-blog/.ssh/authorized_keys.
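The resulting authorized_keys entry looks something like this (the restriction options are standard ones; the source addresses, script path and key are placeholders):

from="127.0.0.1,::1",command="/home/hg-blog/bin/push-to-deploy",no-pty,no-port-forwarding ssh-rsa AAAA... hg@homeserver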
Now the Mercurial server-side process can trigger the push script simply by creating a $REPOS/.hg/hgrc containing:
[hooks]
changegroup.autopush = ssh hg-blog@localhost
This automatically runs the command I specified in the target user’s authorized_keys, so I don’t even have to worry about listing it here1.
In conclusion, ssh is a pretty good tool for creating a simple privilege separation between two processes. It’s ubiquitous, and doesn’t require root to do anything special, and while the case I’m using it for here involves two processes on the same machine, there’s actually no reason that they couldn’t be on different machines.
The ‘right’ answer may well be to run each of these as Docker containers, completely isolating them from each other. I’m not at that point yet, and in the meantime, hopefully by writing this up I won’t forget about it the next time I need to do something similar!
In this case, adding a command restriction doesn’t protect against a malicious caller, since the command that’s run immediately turns around and fetches the next script to run from that same caller. It does protect against someone else obtaining the (password-less by necessity) keypair, I suppose, though the main reason is the one listed above: it means that ‘what to do when something changes’ is specified entirely in one place. ↩
This month, I decided to do something about the way this site rendered on mobile devices. Now that it works reasonably well, I thought it might be interesting to talk about what I needed to change — which, as it turned out, wasn’t that much.
First off, here’s what things used to look like on a Nexus 6 (using my pangrammatic performance post as an example).
Double-tapping on a paragraph zooms to fit the text to the viewport, which produces something that’s fairly readable, but you can still scroll left and right into dead space.
As well as making it a pain to just scroll vertically, this also caused other problems, like the way that double-tapping on bulleted lists (which have indented left margins) would zoom the viewport such that it cropped the left edge of the main content area.
This is all pretty terrible, of course, and about par for the course for mobile browsers.
So what’s going on here? Well, for legacy reasons, mobile browsers typically default to rendering a faux-desktop view, first by setting the viewport so that it can contain content with a fixed “fallback” width (usually around 1000px), and then by fiddling with text sizes to make things more readable.
This behaviour can be overridden fairly easily using the (de facto standard, but not particularly well defined) meta viewport construct. For example, this is what I needed to include to revert to a more sensible behaviour:
<meta name=viewport
content="width=device-width, initial-scale=1">
The two clauses have separate and complementary effects:
- width=device-width sets the viewport width to the real screen width rather than the fallback width.
- initial-scale=1 both sets an initial 1:1 zoom level, and also maintains that zoom level during device rotation (rather than maintaining the viewport width, as is apparently done by some devices). Importantly, the user isn’t restricted from zooming in further.
(This is all explained in rather more detail in the Google Developers document that I linked to above1.)
In practice, I’d recommend just taking the snippet above as a cargo-cultable incantation that switches off the weird faux-desktop rendering and fits the content to the screen.
So, after I’ve added the above, I’m done? Not quite.
Our viewport still needs to be scrolled horizontally to reach some of the content, which is far from ideal, and we’ve no longer got any left-hand margin at all. All in all, it’s pretty hard to read our content even though it’s now zoomed in.
It’s probably worth taking a step back to look at the layout we’re using.
The overall page structure here is pretty trivial, roughly:
body {
max-width: 600px;
margin: 0 auto;
}
This centres the <body> in the viewport, allowing it to expand up to 600px wide.
We can fix the disappearing margins with body { padding: 0 1em; } (which only has an effect if the body would otherwise be flush to the viewport edges), and while we’re here, we might as well change that max-width: 600px to something based on ems (I went for max-width: 38em).
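Putting that together, the body rule ends up as something like this (a sketch using the values described above):

body {
  max-width: 38em;
  margin: 0 auto;
  padding: 0 1em;
}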
Most of the content of <body> is text in paragraphs; that’s fine. The two immediate problems are code snippets (in <pre> blocks), and images.
Right away we can see a problem: the images have a declared width and height, and aren’t going to adapt if the width of the <body> element changes.
The code snippets have a related problem: <pre> text won’t reflow, and the default CSS overflow behaviour allows block-level content to overflow its content box, expanding the viewport’s canvas and reintroducing horizontal scrolling2.
We can fix the code snippets fairly easily by enabling horizontal scrollbars for the snippets where needed:
pre {
overflow: auto;
overflow-y: hidden;
}
This uses overflow, a CSS 2.1 property, to ensure that content is clipped to the content box, adding scrollbars if needed. It then uses overflow-y, a CSS3 property, to remove any vertical scrollbars, leaving us with only the horizontal scrollbars (or none). If the overflow-y property isn’t supported (and in practice it is), the browser will still render something reasonable.
That doesn’t help with the images, of course. The term you’ll want to search for is “responsive images”, but what we’re actually going to do is size the image so that it fits within the space available3.
One easy way to do this is to simply replace:
<img src="kittens" width="400" height="300">
with
<img src="myimage" style="width: 100%">
and, broadly speaking, that’s what I’m now doing4. Note that you do need to drop the height property (and so might as well drop width too), otherwise you’ll have an image with a variable width and fixed height (which doesn’t work so well, as you might imagine).
There are some caveats with older versions of Internet Explorer (aren’t there always?) but in my case I’ve decided that I’m only interested in supporting IE9 and above5, so these don’t apply.
But wait a sec: we declared the image’s dimensions in the first place so that the browser could reserve space for the image, rather than reflowing the page as it downloaded them. Does this mean that we need to abandon that property?
Maybe. Somewhat surprisingly, there isn’t any way (yet6) to declare the aspect ratio (or, equivalently, original size) of an image while also allowing it to be resized to fit a container. However, all’s not lost: for common image aspect ratios, we can adopt a technique documented by Anders Andersen where we prevent reflow by pre-sizing a container to a given aspect ratio.
The tl;dr is that we use something like the following markup instead:
<div class="ratio-16-9">
<img src="myimage" style="width: 100%">
</div>
We then pre-size the containing div using the CSS rule padding-bottom: 56.25% (9/16 = 0.5625; CSS percentages refer to the container’s width), and position the image over the div using absolute positioning, taking it out of the flow.
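The supporting CSS looks roughly like this (a sketch following Anders Andersen’s technique, using the class name from the markup above):

.ratio-16-9 {
  position: relative;
  height: 0;
  padding-bottom: 56.25%; /* 9/16 of the container's width */
}
.ratio-16-9 img {
  position: absolute;
  top: 0;
  left: 0;
  width: 100%;
}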
This works, but there are some caveats: it only works for images with common aspect ratios, of course (4:3 and 16:9 are pretty common, but existing images might have any aspect ratio), and, as written, it only works for images that are sized to 100% of the container’s width (though you could handle fixed smaller sizes as well, if desired).
In my case, I elected to make all images sized to 100% of the viewport width (which works well, mostly), and applied the reflow-avoidance workaround only to those images with 16:9 or 4:3 aspect ratios, leaving the others to size-on-demand.
I did notice some surprising rounding differences on Chrome that led me to reduce that 56.25% of padding to 56.2% (which may truncate the image by a pixel or two; better than allowing the background to show through, though). I suspect this may be because Chrome allows HTML elements in general to have fractional CSS sizes, while it appears to restrict images to integral pixel sizes.
This gave me pretty good results, but I also took the opportunity to make a few other changes to make things work a little better:
- Some images are now served at multiple resolutions, using the <img src="..." srcset="..."> syntax. In this case, small-screen devices (e.g. the iPhone 3G, if it supported the syntax) get a 320×182 image, desktop browsers get a 750×422 image, and hi-res devices like the Nexus 6 get a 1440×810 image. It’s not quite working completely right yet, but it looks promising.
- One earlier post embedded its graphs as an <object> with an image content and fallback <table> content, which just about worked (and looked great in Lynx!), but which wasn’t easy to adjust to fit to the new approach. It doesn’t look like the new HTML5 image features (<img srcset>, and <picture>, which I’m not using) have any support for rendering arbitrary HTML in place of an image, which is a bit of a shame, but probably a reasonable trade-off.
It’s worth noting that a lot of these changes also improved the site on desktop browsers. That’s not really surprising: “mobile-friendly” is more about adaptability than a particular class of device.
So there you have it: for a good mobile site, you may only have to a) add a meta viewport tag, and b) size your content (particularly images) to adapt to the changing viewport width.
Here are some resources (some of which I mentioned above) that I found useful:
- What meta viewport actually does.
- <picture> and <img srcset>.
Somewhat surprisingly, this is the best reference I’ve found for what the meta viewport tag actually does. ↩
In theory, the same problem can occur for other elements; for example, an unbreakable URL in running text can cause a <p> element to overflow. In practice, though, that’s not something that I’ve found worth handling. ↩
There is more to responsive images than just resizing. For example, you can serve completely different images to different devices using media queries (so-called “art direction”). However, that’s way more complicated than what I needed. ↩
You can alternatively use max-width if you only want to shrink images wider than their container; I also wanted to enlarge the smaller ones. ↩
Why only IE9? It’s available on everything going back to Windows Vista, and it’s the first version to support SVG natively and a bunch of CSS properties that I’m using (::pseudo-elements, not(), box-shadow, to name a few). Windows XP users could well have trouble connecting to this server in the first place anyway, due to the SSL configuration I’m using, so requiring IE9/Vista doesn’t seem too unreasonable. ↩
From what I’m led to believe, this is being actively worked on. ↩
This is a machine running on the end of an ADSL line. It’s not a very happy machine:
$ uptime
11:28:01 up 781 days, 1:39, 1 user, load average: 2.01, 2.03, 2.05
It’s actually idle, so why is the load average above 2.0? Because there’s an unkillable mdadm process stuck in a D state, and a second mount process that’s permanently runnable.
So why haven’t I just rebooted it (and better still, upgraded it: obviously it’s running an old kernel)? Because I’m not entirely convinced it’ll start up again: the disks were acting a bit suspiciously, and lately the PSU fan has been making a bit of a racket as well.
Unfortunately, it’s also a machine that’s accumulated infrastructure that I care about: DNS, Apache, and so on. The data is safely backed up off-machine, but if I just tear it down, a bunch of things will be broken while I’m rebuilding it. So instead, I’ve been trying to decommission it piece-by-piece.
I’ve also got a bit bored running all my own infrastructure, so some of those moving parts have been put onto dedicated consumer hardware (getting the router to handle internal DNS and DHCP, getting a Synology NAS for Samba, etc), and I’ve moved some others onto a hosted VM, so that I don’t have to worry about the hardware: that copy of Apache has been (mostly) obsoleted by moving this site to Google Compute Engine last January, for example.
But there’s still a few things that I’m depending upon this machine for.
Until recently, one was as the primary DNS server for farside.org.uk
.
I was using a free secondary DNS service from BuddyNS: they provide secondary servers that I listed as the domain’s nameservers, and those did regular zone transfers from my server, which remained the source of truth.
That was pretty convenient, and BuddyNS have been pretty great (the free tier is good for up to 300K queries per month, of which I was using about 70-100K), but they only provide secondary DNS, so I went looking for another solution.
I’m sure that there are many other DNS providers around, but since I’m
hosting www.farside.org.uk
on Google Compute Engine, I decided to try out
Google Cloud DNS, which provides a simple primary DNS service,
available via anycast over both IPv4 and IPv6 (that
arrangement seems to be fairly standard for DNS providers nowadays).
This one’s not free, but it is pretty cheap: US$0.20/month per domain, plus US$0.40/month per million queries. For me, that should work out to less than $3/year1.
Otherwise, it seems to be broadly similar to other DNS providers. You can make updates via a JSON/REST API, and API client libraries and a basic command-line client are provided. They do only support a predefined set of resource record types, though I suspect that’s not a problem for most people2.
I actually switched a few weeks ago, but until very recently the programmatic REST API was the only way to make changes, so this wasn’t really a product I’d want to recommend: technically, it worked, but editing a JSON document by hand to send via the command-line client was… suboptimal.
Fortunately, there’s now an editor embedded in the Google Developers Console, so you can also make changes interactively.
Overall, I’m happy enough with the switch: it seems to work well, and didn’t take much effort (once I’d remembered to quote my TXT strings properly, ahem).
I did make one or two changes to the domain at the same time, most notably
removing the A record for farside.org.uk
itself (which had originally been
present for direct mail delivery, years ago). This does mean that
http://farside.org.uk/
will no longer resolve3, but that
hopefully shouldn’t cause any real problems.
Full disclosure: I’m currently getting an employee discount, so I’ll be paying less than that. ↩
I did have to drop an RP RR as a result of this, though I wasn’t actually using it for anything. ↩
Previously, this would end up at the aforementioned machine and be redirected by that copy of Apache to www.farside.org.uk
, which runs elsewhere. ↩
Back in June I wrote about a quick hack to search the Project Gutenberg text for pangrammatic windows; I then wrote a bit more about the implementation and used it as an example for performance tuning in Linux.
What I didn’t mention (because I’m only just now getting around to writing it up) is that I also ran the same analysis against the web.
Just to remind you what I’m talking about, pangrammatic windows are pangrams — a piece of text using all the letters in the (English) alphabet — that occur as substrings of otherwise naturally-occurring text. For example, the shortest known sequence in a published book is 42 letters, from Piers Anthony’s Cube Route:
Obviously, sequences such as “The quick brown fox jumps over the lazy dog” (35 letters) are shorter, but they aren’t naturally occurring, so don’t count for these purposes.
So, back in June, I decided to use some of my 20% time (and some weekends) to run a search to find some of the shortest pangrammatic windows on the web, using Google’s web index in much the same way as I’d earlier run a local search over the Gutenberg corpus of documents, except more so.
Even though I don’t work on Google’s web search itself, I knew that we had the ability to run analyses over the web at scale: Ian Hickson did something similar back in 2005 to produce a Web Authoring Statistics report on HTML structure on the web. The main difference was that I was hoping to do it for much more trivial reasons.
I was happy to find that doing this kind of analysis was pretty easy. The code in question is neither interesting nor open source, but let’s just note that, for search engines, the problems of ‘Do X for all web documents’ and ‘Extract the text from this web page’ are already fairly comprehensively solved.
I’d already restricted what I was looking at to English-language documents,
and (for hopefully obvious reasons) those documents that weren’t filtered by
SafeSearch, so that left me with only the easy bit to solve: writing a
matcher that would allow me to run a large-scale grep
over the web.
I started by writing something simple using the non-backtracking algorithm that Jesse Sheidlower had suggested1. This simply emitted one result for every unique pangrammatic window shorter than a certain number of letters (I think I started at 45 or so).
To work out whether two different windows were equivalent, I normalised the
window text by removing all non-alphabetic characters (apart from interior
single quotes) and collapsing all runs of whitespace to a single space. In
that way, “Fix Mr. Gluck’s hazy TV, PDQ” and “Fix Mr Gluck’s hazy TV,\n
‘PDQ’” (where \n
is a newline) would both be normalised to “fix mr gluck’s
hazy tv pdq”, and I would pick just one to report.
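That normalisation step is simple enough to sketch in C. This isn’t the code I actually ran (which isn’t public); it’s just a minimal illustration of the rule above, and it assumes plain ASCII input (so a straight ' rather than a curly quote):
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Normalise a window in place: keep letters (lower-cased) and interior
 * single quotes, and collapse everything else to a single space. */
static void normalise(char *s)
{
    size_t out = 0;
    int pending_space = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (isalpha(c)) {
            if (pending_space && out > 0)
                s[out++] = ' ';
            pending_space = 0;
            s[out++] = (char)tolower(c);
        } else if (c == '\'' && !pending_space && out > 0 &&
                   isalpha((unsigned char)s[out - 1]) &&
                   isalpha((unsigned char)s[i + 1])) {
            s[out++] = '\'';   /* interior quote, as in "gluck's" */
        } else {
            pending_space = 1;
        }
    }
    s[out] = '\0';
}

int main(void)
{
    char window[] = "Fix Mr. Gluck's hazy TV, 'PDQ'";
    normalise(window);
    puts(window);   /* prints: fix mr gluck's hazy tv pdq */
    return 0;
}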
That’s when I ran into something of a problem. When I’d run a simple search for short pangrammatic windows over the Gutenberg text, I’d had to skip through a thousand or so occurrences of ‘the alphabet’ and variations before getting to any real-world text. That clearly wasn’t going to scale up to the web.
To clean up these nonsense results, I started with some blacklisting: I’d already discarded entire documents based on a few regular expressions in order to exclude those documents that were specifically talking about pangrams2, so I tried adding another blacklist to remove individual results that contained ‘impossible’ words.
For example, if the normalised window contains the substring “qrs” (ignoring spaces), it can’t possibly be part of an English word: no word contains “qrs”, and none ends “qr” or starts “rs”, so there is no subdivision that would be valid3. This is very successful at removing a large proportion of the results that were variations on ‘the alphabet’.
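A sketch of that kind of check, with a deliberately tiny (and hypothetical) blacklist; the real list would need many more entries, as the next paragraph shows:
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Letter sequences that can't appear in English text, even across a word
 * boundary. (A hypothetical starting point only.) */
static const char *impossible[] = { "qrs", NULL };

/* Return true if the normalised window, with spaces removed, contains any
 * impossible sequence. */
static bool looks_like_nonsense(const char *window)
{
    char letters[1024];
    size_t n = 0;
    for (const char *p = window; *p != '\0' && n + 1 < sizeof letters; p++)
        if (isalpha((unsigned char)*p))
            letters[n++] = (char)tolower((unsigned char)*p);
    letters[n] = '\0';
    for (int i = 0; impossible[i] != NULL; i++)
        if (strstr(letters, impossible[i]) != NULL)
            return true;
    return false;
}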
However, it’s not good enough. I still needed to add “qwerty” and “ytrewq” (“qwerty” reversed) and “azertyu” (French keyboard layout; and note that “azerty” wouldn’t be valid, since some words do end in “azer”) and then “ytreza” and… clearly this isn’t going to scale either.
The internet follows rule 34 for misspellings, it seems: I’m fairly confident that even something as simple as the alphabet has been misspelled in almost every possible way.
I needed a better way to sort the real text from the nonsense text.
I thought about trying to do something clever — like trying to train a classifier to recognise English words — but then I realised that I could do something dumb instead, which is almost always a better approach.
I globbed together a few sources to make a large (100K words or so) dictionary that looked like it contained mostly plausible English words, removed a few words that were valid but problematic (“BC”, “def”, “wert”, etc), and wrote something that would compute a score based on the number of known words in the normalised window.
For example, if the input was “a b c d e … z”, we’d have 26 ‘words’, of which two (“a” and “i”) would be considered known words4, and so we’d give it a score of 2/26, or 0.077.
I knew that I wouldn’t want to set a minimum score of 1.0 (eliminating results that had any unknown words), both because I’d seen from the Gutenberg examples how common proper names were, and also because the nature of reporting a sub-sequence meant that I’d often be selecting a partial word for the first and last words in the window. However, playing around with the threshold showed that it was filtering out nonsense results pretty well, but that I still had a slightly different problem to solve.
While I’d managed to filter out windows with a high proportion of nonsense words, the four-word sequence “in the end. abc…xyz” still manages a reasonable score of 0.75 by that metric, since only the last of the four ‘words’ is an unknown word.
To fix this problem, I put together an alternative score that was computed from the number of letters covered by each known word. That works well for inputs such as the above (2 + 3 + 3 letters in known words, out of 34 letters total, for a score of 8/34, or 0.235).
I couldn’t just replace the known-word score with the by-letter score by itself, though: the letter coverage scorer gives low scores to windows with a long, but truncated, first or last word, and it doesn’t give low scores to windows containing a large number of short nonsense words (things like “wafting zephyrs vex bold p p p p p q jim”).
So I did what any reasonable person would do in that situation: I multiplied both scores together. This sounds ridiculously naïve (and I’m sure that anyone who actually does language processing professionally can tell me why this whole approach is idiotic), but it seems to have worked out reasonably well.
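As a sketch of what that combination looks like (with a hypothetical is_known_word() standing in for the 100K-word dictionary; the numbers in the comments match the worked examples above):
#include <stdbool.h>
#include <string.h>

/* Hypothetical lookup into the ~100K-word dictionary described above. */
bool is_known_word(const char *word);

/* Score a normalised window: the fraction of words that are known,
 * multiplied by the fraction of letters that fall within known words.
 * (Any interior quote is counted as part of the word, which is close
 * enough for these purposes.) */
static double window_score(char *window)
{
    int words = 0, known_words = 0;
    int letters = 0, known_letters = 0;
    for (char *w = strtok(window, " "); w != NULL; w = strtok(NULL, " ")) {
        int len = (int)strlen(w);
        words++;
        letters += len;
        if (is_known_word(w)) {
            known_words++;
            known_letters += len;
        }
    }
    if (words == 0 || letters == 0)
        return 0.0;
    double by_word = (double)known_words / words;       /* 3/4 = 0.75 for the "in the end" + alphabet example */
    double by_letter = (double)known_letters / letters; /* 8/34 = 0.235 for the same window */
    return by_word * by_letter;
}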
In case you’re interested, the distributions of scores (for windows of an acceptable length) ended up looking like this:
That approach is not without its problems, though: based on some initial trials, I decided that the final run would discard any result with a score lower than 0.55. I later spotted that this would also have discarded both examples quoted in the Wikipedia article, though it does accept all the examples I found in the Gutenberg text:
In any case, I also ended up discarding any window with more than 38 letters, which eliminated all of the above anyway.
I read through a lot of results, about four thousand in all. By hand, and mostly in a coach to and from Wales. I may have missed something.
First, some brief observations about the things that weren’t good results (but that still scored highly enough that I had to look at them):
Without further ado, the three best results I found (in reverse order).
In third place, a post on the “CrackBerry Forums”, which falls down slightly for using what turns out to be some very convenient product names, but wins for the story:
Second place goes to another forum post. The pangrammatic window here is from a list of words rather than a portion of a sentence, but I did learn the names of some dance styles:
Both of those were 38 letters, but the clear winner on both length and content is the following 36 letter pangrammatic window, from a review of the film Magnolia:
I’m pretty impressed by this result: it’s only one letter longer than “The quick brown fox…”, and while that’s not the shortest possible pangram by far, it is one of the more coherent ones.
As for me, I think I’m probably done with pangrams now. Although… I’ve been spending so much time lately reading examples of pangrams that I have started to wonder how I could get myself a sphinx. Preferably one made from black quartz…
The reason for picking this algorithm wasn’t performance, but rather that it has the advantage of being insensitive to the number of non-alphabetic characters within a window, unlike the fixed-size sliding window used by the algorithm I’d previously used to search the Gutenberg text. ↩
I kept this blacklist in place later on, though I don’t think it actually had much effect. For posterity, the list of (non-case-sensitive) regexps I ended up using was: ‘pangram’, ‘quick.? brown (fox|dogs?) jump’, ‘silk pyjamas exchanged for blue quartz’, ‘DJs flock by when MTV’, ‘five boxing wizards jump’, ‘Pack my box with five dozen’, ‘love my big sphinx of quartz’, and ‘Grumpy wizards make toxic brew’. Only one of which is actually a regexp. ↩
Though it turns out that /usr/share/dict/words
on my Ubuntu laptop seems to think that “rs” is itself a word. I suspect I just decided that it was wrong. ↩
Again, Ubuntu’s /usr/share/dict/words
seems to consider every letter of the alphabet to be a valid word. The dictionary I used didn’t, though you could probably make the case that it should perhaps have included “O” (and maybe even the txtspeak “U”). ↩
This is the third in a series of posts about searching for pangrammatic windows in a corpus of documents. I’ve previously talked about what I found in the Gutenberg corpus and the code I used, and ended the last post with a question: can we make it faster?
Well, that’s what this post is about.
I’ve expanded the contents of the Project Gutenberg 2010 DVD image
into the current directory, and I’m compiling and running
pangram.c
as follows:
$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out
real 2m56.795s
user 0m57.155s
sys 0m6.667s
In other words, it currently takes about three minutes. We’re going to try to reduce that. Some facts and figures:
Before we do anything, we really need to make sure we can get repeatable measurements. The first few experiments I tried ended up with nonsensical results — it turns out that “all the Gutenberg text” is less than the total memory on my laptop (16GB), so all the runs after the first just read from the filesystem cache.
Fixing that is easy: we ask the kernel to drop the cache before we run a test:
# echo 3 > /proc/sys/vm/drop_caches
or, since we’re probably not running as root,
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
/proc/sys/vm/drop_caches
is documented in
Documentation/sysctl/vm.txt
; the command above will drop both the
file content cache and the inode cache (containing directory contents).
One other source of variation I ran into was caused by how long find
took
to run (and note that it runs concurrently with xargs
): most of the time
it completed quickly, but in a few situations it was (inconsistently)
delayed by what else was going on, causing the whole search to take much
longer than usual. This was also easy to avoid: we capture the list of
files in advance:
$ find -name '*.txt' > filelist
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ time <filelist xargs -n 100 ./pangram > pangram.out
real 2m58.310s
user 0m55.787s
sys 0m6.630s
So where to start? As I see it, there are at least three things we can try:
Let’s take a look at the algorithm first.
I’m not going to repeat the whole thing here (see the previous post for that), but in summary: we read through each file until we’ve seen enough letters that we might have found a pangram, then scan backwards until we find one, or until we hit a limit (I used 200 bytes); we then resume scanning forwards from where we left off.
Can we improve on this? Perhaps. We clearly need to visit all the letters at least once, but — as suggested to me by Jesse Sheidlower, who wrote the PangramTweets Twitter bot that kicked this all off — we can avoid backtracking during the initial search if we keep some additional state1.
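In case it’s useful, here’s roughly what that looks like. This is my reconstruction from the description in the footnote rather than the exact code I benchmarked, and it only reports window lengths (you still need to backtrack if you want to print the text):
#include <ctype.h>
#include <stdio.h>
#include <stddef.h>

/* Non-backtracking scan: track the letter-offset of the most recent
 * occurrence of each letter; once every letter has been seen, the minimum
 * of those offsets is the start of the smallest window ending here. */
static void scan(const char *buf, size_t len)
{
    long last[26];
    for (int i = 0; i < 26; i++)
        last[i] = -1;
    long letter_index = 0;
    long oldest = -1;   /* previous minimum, so we only report changes */

    for (size_t i = 0; i < len; i++) {
        if (!isalpha((unsigned char)buf[i]))
            continue;
        last[tolower((unsigned char)buf[i]) - 'a'] = letter_index;

        /* The real version only needs to recompute this when the letter we
         * just saw was the previous oldest one; a linear scan keeps the
         * sketch short. */
        long min = last[0];
        for (int j = 1; j < 26; j++)
            if (last[j] < min)
                min = last[j];

        if (min >= 0 && min != oldest) {
            printf("%ld-letter window ending at byte offset %zu\n",
                   letter_index - min + 1, i);
            oldest = min;
        }
        letter_index++;
    }
}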
Does this help? Unfortunately, not really: it takes exactly the same amount of time as the backtracking version.
Perhaps that’s not too surprising, though. It’s fairly clear that the problem should be I/O-bound, and so — unless the backtracking causes additional I/O (which appears not to be the case) — we should see if we can perhaps spend less time waiting on the I/O we have.
As it stands, the total cost of our I/O is pretty much unavoidable: we need to read each file completely into RAM.
We could reduce the overall I/O by changing what we read2. For example, we could:
However, these fundamentally change the problem we’re trying to solve, not just the way we’re solving it, and so I’m going to stick with what I have for now.
So far, we’re using mmap()
to read the file. This gives us a memory range
into which the kernel will read the file’s contents as-needed, using some
amount of asynchronous read-ahead. If we try to read a page that hasn’t
been read from disk yet, we’ll block3.
At least in my case, using mmap()
to map the file on-demand isn’t much
different to using read()
to read the whole file in one go, which is
itself a bit of a surprise: the former’s asynchronous read-ahead should mean
that we can get started more quickly. I ran across an email about mmap()
on the linux-kernel mailing list where Linus explains that
mmap()
is actually quite expensive to start with. In any case, most of our
files are in the range 256–512KB4, and so perhaps there’s just
not a lot of read-ahead to do.
One thing we could try is reducing the time we spend waiting for I/O by
providing hints to the kernel about our usage of an area of memory or a
file. For example, to hint that we’re about to read a buffer sequentially,
we can write madvise(buf, len, MADV_SEQUENTIAL)
.
In theory, this should allow us to optimise the file I/O based on our usage. In practice (at least in my case), it turns out that these are actually pessimisations.
While we have several different ways to hint to the kernel, as far as I can see, they boil down to just two choices: whether or not we need the data immediately, and what the access pattern is for the data in memory.
If we need the data “now” (MADV_WILLNEED for madvise(), MAP_POPULATE for mmap(), POSIX_FADV_WILLNEED for posix_fadvise(), etc), then the kernel will issue a synchronous read
there-and-then, returning once the file’s data is in the page cache. This
can be no faster than issuing a blocking read()
for the whole file, and
— I assume due to the overhead of mmap()
— actually ends up a bit
slower in practice.
Otherwise, the access pattern is one of “normal”, “random”, or “sequential”. “Normal” is what we get by default; it triggers some amount of read-ahead.
“Random” (MADV_RANDOM
, etc) is straightforward: it switches off
read-ahead, so every new page access causes a single-page blocking read.
This is terrible for performance, as you might expect — in our case, it
roughly doubles the runtime.
“Sequential” (MADV_SEQUENTIAL
) is less well-defined. It’s not completely
clear to me what it does in practice — the posix_fadvise()
man page says
that it doubles the read-ahead window (on Linux), but the kernel source
implies that it might be a bit more complex than that — but in any case,
it seems to have a small negative effect overall.
madvise()
and friends provide just one way to tune I/O scheduling, but
they are the easiest to use. We could also look at overlapped or threaded
I/O, but that’s significantly more complex — and perhaps there’s an easier
way to improve our utilization anyway.
In this case, I’m talking about disk and CPU utilization. While some of the disk reads will occur while we’re searching the buffer we’ve already read into memory, most won’t, and so the CPU should often be waiting for an I/O operation to complete.
It’d be nice if we could get a bit more detail about how we’re doing than
simple wallclock time, so I’m going to measure the current CPU and disk
utilization using iostat
.
iostat
is a whole-machine profiler, so it’s probably a good idea not to
have much else going on at the time (though that said, I didn’t see too much
impact from the copy of Chrome I had running). Alternatively, we could look
into per-process monitoring via iotop or pidstat, or an
event/tracing approach like SystemTap or
iosnoop5. (Incidentally, Brendan Gregg’s Linux
Performance page is a great resource for finding out more about
Linux performance tools.)
Running iostat
— with -N
to print meaningful names for the device mapper
devices, and -x
to print extended disk statistics — produces something
like the following (rather wide, sorry) output:
$ iostat -N -x
Linux 3.13.0-30-generic (malcolmr) 06/09/14 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.28 0.07 0.77 0.11 0.00 97.77
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.17 0.04 11.92 4.62 239.54 25.25 32.03 0.01 0.33 0.29 0.43 0.17 0.28
sda5_crypt 0.00 0.00 11.81 4.66 238.68 25.24 32.05 0.04 2.43 0.91 6.29 0.29 0.48
sysvg-root 0.00 0.00 11.65 4.48 238.02 25.24 32.65 0.04 2.48 0.92 6.54 0.30 0.48
sysvg-swap_1 0.00 0.00 0.12 0.00 0.47 0.00 8.00 0.00 0.15 0.15 0.00 0.15 0.00
$
It’s important to note that if you run iostat
this way, you actually get a
running average since boot, which isn’t very useful at all. What I chose to
do instead was to start the run and then execute iostat -N -x 30 3
, which
outputs three reports separated by 30 seconds. The first is the
average-since-boot, which we can ignore, but the other two are averages over
the 30 seconds since the prior report.
Having two good reports allows us to check how variable the numbers we’re seeing are (in my case, fairly reliable). Here’s the kind of output I got.
First, we start a run:
$ time <filelist xargs -n 100 ./pangram > pangram.out
and then concurrently run iostat
:
$ iostat -N -x 30 3
[...]
avg-cpu: %user %nice %system %iowait %steal %idle
8.19 0.00 19.54 15.99 0.00 56.28
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.20 0.00 655.57 2.37 68680.93 23.33 208.85 0.26 0.39 0.39 0.39 0.35 22.93
sda5_crypt 0.00 0.00 654.43 2.37 68666.40 23.33 209.16 1.19 1.82 1.82 0.73 1.31 86.00
sysvg-root 0.00 0.00 654.43 2.17 68666.40 23.33 209.23 1.19 1.82 1.82 0.80 1.31 86.05
sysvg-swap_1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The bottom line is that the CPU is idle most of the time (%iowait
is “idle
but there are runnable processes waiting for I/O”), and the disk is also
idle some of the time (%util
, on the far right, which is the proportion of
the time that there was at least one outstanding I/O operation to the
device).
Note that %util
of 100% does not mean that the device cannot take any
more requests, just that there was at least one request pending for the
device at all times. On the other hand, a device with %util
less than
100% (as we have here) is definitely under-utilized.
I’m not entirely sure what the discrepancy between the reported %util
for
sda5_crypt
and sda
(the dm-crypt volume and raw disk) is due to, but I
wasn’t able to get the latter above 50%. It looks to me like perhaps
dm-crypt is unwilling to forward more than one request at a time to the raw
device: avgqu-sz
(the average request queue depth, including active
requests) for sda
never got to 0.9, no matter how much load I
added6.
Now that we have evidence that both the CPU and disk really are underutilized, how do we improve things? Well, the easiest way is to run over more than one file at a time:
$ time <filelist xargs -n 100 -P 2 ./pangram > pangram.out
real 1m37.855s
user 0m51.309s
sys 0m5.786s
Well, that’s much better already. The options to xargs
tell it to run
pangram
with at most 100 files, and execute two copies in parallel. (It’s
important to limit the number of files per invocation, otherwise xargs
will just pass everything to a single copy.)
At this point, we’re keeping the disk (sda5_crypt
) a lot busier: %util
is up from 86% to nearly 99%, and rkB/s
(the read throughput) is up from
69MB/s to 122MB/s. In fact, we can continue to increase the number of
concurrent processes to reach a peak of about 149MB/s:
You’ll see that it tops out somewhere around P=8. I’m not sure how to
explain the drop in read throughput from around P=32: it seems to correspond
to the point at which %idle
drops to zero (i.e. there’s a task waiting on
I/O at all times), but I don’t see why that would necessarily cause things
to run more slowly (and all the metrics apart from r/s
and rkB/s
are
linear in the amount of load).
Whatever the reason, we’re done here: we’ve reduced the runtime of this task from three minutes to about 1m20s, less than half the time it took originally.
While this was something of an artificial example (we really didn’t need to run this more than once, after all), and while some of the above is surely specific to my setup, I hope it was an interesting exercise.
If I had to tl;dr the above, it would be:
In many cases, it isn’t necessary to spend the time on performance tuning, but when it is, the above is probably a good roadmap. Typically, you’d only bother when you’re doing something like handling a user request, where the latency is a target by itself, or when you’re running something repeatedly, perhaps because it’s a core library function, or perhaps because you’re processing a lot of data.
Briefly: we track the byte- and letter-offset in the file at which we most-recently saw each letter, and also the minimum letter-offset over all letters (which will change infrequently, thanks to the non-uniform letter distribution); when that minimum letter-offset changes, we have a new pangram, and immediately know the number of letters it contains. This is O(N) in the number of letters in the file (for the search; we still have to backtrack to find the text to output), and does have one other significant advantage: it avoids any artificial limit on the number of non-letter bytes (whitespace, etc) that can appear within a sequence. ↩
I haven’t actually tested any of these, by the way, so they may not actually help, but they all sound reasonable. ↩
These two cases can be distinguished via /proc/vmstat
: read-ahead reads are counted as page faults (pgfault
), while synchronous (blocking) reads are counted as major page faults (pgmajfault
). ↩
The filesize distribution of Gutenberg texts is actually pretty close to a log-normal distribution centred around 2^18 bytes (about 256KB). ↩
Oddly, Linux doesn’t appear to have a good solution for iostat
-like accounting for cgroups (though pidstat comes close) — or if it does, I couldn’t find it. ↩
Perhaps this is a red herring, but I would have expected that passing those requests down to the raw disk device could only help (and I note that Intel use a queue depth of 32 when measuring SSD performance). As it stands, the maximum read throughput I can get from the raw device (via dm-crypt) is about 150MB/s, a little under half of the real-world sequential read performance I see in reviews. ↩
Inspired by Google’s recent decision to boost the ranking of HTTPS sites, and because it’s something I’ve been meaning to do for a while (and also because it’s generally the right thing to do), I’ve just moved this blog to serve via HTTPS.
I pretty much just walked through this set of instructions from Eric Mill, using the SSL configuration from Mozilla’s OpSec team (seriously, don’t try to do this bit yourself: the folks at Mozilla know what they’re doing). All told, it only took a couple of hours.
Like Eric, I also got my free certificate from StartSSL; they seem reasonable enough at the moment, and I can always change later if I feel like it.
Other than needing to switch to a protocol-relative URL for Google Web Fonts, the site worked first time (though it helps that it’s fairly simple: all the odd stuff got left behind when I split the serving of this blog to a Google Compute Engine instance).
However, unlike Tim, I didn’t keep the HTTP version of the
site around: all http://
URLs now result in a 301 to the HTTPS
equivalent1. I haven’t yet enabled HSTS to pin the site to
HTTPS, but I’ll probably do so in a week or so, once I’ve checked to see if
any problems turn up.
I’m also not entirely concerned about backward-compatibility with old clients (I used the Non-Backward Compatible Ciphersuite list, for example). I was originally planning to only enable TLS 1.2, but it turns out that I do still care about some older clients (no, not Windows XP): GoogleBot and pre-KitKat versions of Android (presumably the Android browser rather than Chrome-on-Android), which only support TLS 1.02. In the end, I only ended up disabling SSL2 and SSL3.
Once I’d tested the site, the only thing I needed to do was to register the HTTPS URL in Google Webmaster Tools, and update a few incoming redirects to avoid long redirect chains.
I also found the following sites useful:
In summary: for many sites, enabling HTTPS is pretty trivial. If you’re making a new site, consider making it HTTPS-only.
Except for robots.txt
, which serves the content directly. I’m not sure if that’s actually important, but it seemed like robots might not want to follow redirects to fetch robots.txt
, even if they would for the other content. ↩
In addition, the version of curl
I have on my desktop only supports TLS 1.1, so I would have at least wanted to enable that. ↩
Noda Time 1.3.0 came out today1, bringing a healthy mix of new features and bug fixes for all your date and time handling needs. Unlike with previous releases, the improvements in Noda Time 1.3 don’t really have a single theme: they add a handful of features and tidy up some loose ends on the road to 2.0 (on which more below).
So in no particular order…
Noda Time 1.3 adds support for the Persian (Solar Hijri) calendar, and experimental support for the Hebrew calendar. Support for the latter is “experimental” because we are not entirely convinced that calculations around leap years work as people would expect, and because there is currently no support for parsing and formatting month names. See the calendars page in the user guide for more details.
Speaking of parsing and formatting, both should be significantly faster in 1.3.0. Parse failures should also be much easier to diagnose, as errors now indicate which part of the input failed to match the relevant part of the pattern.
The desktop build of Noda Time should now be usable from partially-trusted
contexts (such as ASP.NET shared hosting), as it is now marked with the
AllowPartiallyTrustedCallers
attribute.
Finally, we also fixed a small number of minor bugs, added annotations for
ReSharper users, and added a few more convenience methods —
ZonedDateTime.IsDaylightSavingTime()
and OffsetDateTime.WithOffset()
,
for example — in response to user requests. There’s also a new option to
make the JSON serializer use a string representation for Interval
.
Again, see the User Guide and 1.3.0 release notes for more information about all of the above.
You can get Noda Time 1.3.0 from the NuGet repository as usual (core, testing, JSON support packages), or from the links on the Noda Time home page.
Meanwhile, development has started on Noda Time 2.0. Noda Time 2.0 will not be binary-compatible with Noda Time 1.x, but it will be mostly source-compatible: we don’t plan to make completely gratuitous changes.
Among other things, Noda Time 2.0 is likely to contain a change in the resolution of Instant and Duration from ticks to nanoseconds.
We don’t expect to have a release of Noda Time 2.0 until next year, so we may well make some additional releases in the 1.3.x series between now and then, but in general we’ll be focussing on 2.0. If you’re interested in helping out, come and talk to us on the mailing list.
And once again, I’m going to plagiarise this post for the official Noda Time blog post. ↩
In the previous post, I talked about finding pangrammatic windows in a corpus of text files from Project Gutenberg (in particular, the 2010 DVD image). Here I’m going to talk a bit about the implementation I used.
I think the problem itself is quite interesting. Restated, it’s “search a given text for all substrings that contain all the letters of the alphabet, and that do not themselves contain another matching substring” (the latter, because given “fooabc…xyzbar” we only want to emit “abc…xyz”).
I can imagine asking that kind of question in an interview1. If you enjoy that kind of thing, you might want to go and think about how you’d solve it yourself.
Back already? When I brought up the idea at work, one sensible suggestion (thanks, Christian!) was to keep going until we’d seen each character at least once, keeping track by setting a bit per letter. Once we’d seen all the letters once, we could scan backwards to work out whether we’d found the end of a pangrammatic sequence or not.
Since the frequency of some letters (Q, Z) is very low in English text, we’d expect to only have to scan backwards occasionally. We’d also limit the size of that backward scan to avoid examining O(N^2) characters.
The only other wrinkle is what to do when we scan backwards: if we see all the letters (and we can use the same mechanism as before to track which we’ve seen), then we immediately know that we have a pangrammatic window, so we can output it. Otherwise we keep going for some maximum number of characters — I used 200 — and then give up.
What then? After a match covering offsets [a, b], we can’t forget about everything and jump back to offset b+1, as we might be looking at a string like “zaabc…xyz” (where we’d want to emit “zaabc…xy” and then the shorter “abc…xyz”). It’s always safe to restart at offset a+1, but we can do better: we can keep the set of letters we’ve seen (i.e. all of them) and remove the character at the start of the matched substring (“z”, in this case), which by definition must have only occurred once, and then continue from offset b+1.
In the much more likely case that we don’t see a pangrammatic sequence, we also continue the search at b+1, with the seen set covering what we’ve seen in the range (b-window size, b]. (Note that if we knew that the character at the start of the window had only appeared once, then we could remove it as before, but in general, we can’t.)
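The “have we seen every letter yet?” bookkeeping is just a 26-bit set. As a rough sketch (the real seen_all() in pangram.c below may differ in detail, and the backward scan and restart logic follow the description above):
#include <ctype.h>
#include <stdint.h>

#define ALL_LETTERS ((1u << 26) - 1)   /* one bit per letter of the alphabet */

/* Add one byte to the set of letters seen so far, and report whether every
 * letter has now been seen at least once. */
static int seen_all(uint32_t *seen, unsigned char c)
{
    if (isalpha(c))
        *seen |= 1u << (tolower(c) - 'a');
    return *seen == ALL_LETTERS;
}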
Download pangram.c
. Compile and run using something
like:
$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ ./pangram file1 file2 file3
The compile flags just define the maximum window size, MAX_PANGRAM
(to 200
bytes, the figure I chose in the end), and enable optimisations (which I was
surprised to see make a noticeable difference to the runtime).
The implementation maps to the algorithm I described above: main() simply uses mmap() to read the contents of each file in turn into memory, then invokes pangram(). pangram() walks through the file byte-by-byte, calling seen_all() to update the letters we’ve seen in seen; when seen_all() returns true, we call try_scan_backwards() to check whether we have a pangram in the last MAX_PANGRAM bytes, and also to update seen with the new set of letters that are actually within that window (as described above). Finally, output_pangram() prints the file and contents to stdout.
I’m fairly happy with the result. It’s not the best code I’ve written, but it’s not too bad.
Loading the whole file into memory in one go isn’t particularly great (we
only really need a sliding window of MAX_PANGRAM
bytes, so we’re wasting a
lot of memory), but it makes the code much simpler, and memory pressure
isn’t something I need to worry about here. The largest file I’m dealing
with is 43MB (Webster’s Unabridged Dictionary, pgwht04.txt
), and my laptop
has 16GB of RAM, so there’s no reason to try anything cleverer: mmap()
is
simple, and it works.
How do we actually go about running this over all the texts? I’d previously loopback-mounted the ISO image and unzipped everything into a directory (though some of the zipfiles contained directories themselves), but that still gave me…
$ find -name '*.txt' | wc -l
32473
… just over 32,000 files to consider (totalling about 11.6GB of text). I’d decided to do this simply: I didn’t try to filter out non-English texts (assuming, correctly as it turned out, that foreign-language text was unlikely to show up in the results anyway), and for the same reason, I also didn’t bother dealing with different file encodings (as the files use a mixture of at least UTF-8 and ISO-8859-1).
From the directory containing the unzipped texts, we can run a search by using something simple like this:
$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out
real 2m56.795s
user 0m57.155s
sys 0m6.667s
I’ve used -n 100
so that xargs
will run our program with 100 files at a
time, rather than all 32,000. That’ll be important later, though initially
I was a little worried about command-line length limits, probably
unnecessarily2.
The resulting output file contains about 42,000 results, each with the letter count (which is what matters, not the byte count) followed by the filename and text, so we can easily find the shortest sequences:
$ sort -n pangram.out | head -n3
26 ./10742.txt: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
26 ./10742.txt: B C D E F G H I J K L M N O P Q R S T U V W X Y Z & a
26 ./10742.txt: C D E F G H I J K L M N O P Q R S T U V W X Y Z & a b
Okay, it needs a bit of manual review to weed out the nonsense, but it’s good enough.
The only thing I’m not entirely sure about here is the safety of combining
the results from stdout
if I run more than one copy of pangram
at a time
(spoilers!). Well, rather: I’m pretty sure it’s not safe, but it appears
to work in practice. Mostly.
We printf()
to stdout
, which I’d thought was line-buffered. However,
without an explicit fflush(stdout)
after the printf()
(which output
always finishes with a newline anyway), a small fraction of the output is
lost when I concurrently append to a single output file: I’m missing some
lines (a few hundred in 42,000 or so), and I get the ends of a few others.
With fflush(stdout)
, I seem to get the right results again, unless I
spawn a large number of concurrent processes (say, 300), so I’m guessing
there’s a race somewhere that I’m occasionally losing. The reason that I’m
a little confused is that I expected this to either work fine, because
writes of less than PIPE_BUF
bytes (512 by POSIX; in practice, at least
4KB) are atomic — or if that didn’t apply in this situation, I’d expected
it to interleave the results completely.
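If I did want to lean on that guarantee deliberately, one option (not what pangram.c does) would be to format each result into a buffer and hand it to the kernel in a single write(), so that a complete line reaches the output in one call; for pipes, at least, POSIX promises that writes under PIPE_BUF won’t interleave:
#include <stdio.h>
#include <unistd.h>

/* Emit one result line ("<letters> <file>: <text>") with a single write(),
 * so the whole line is passed to the kernel in one call. */
static void emit_result(int letters, const char *filename, const char *text)
{
    char line[4096];
    int n = snprintf(line, sizeof line, "%d %s: %s\n", letters, filename, text);
    if (n > 0 && (size_t)n < sizeof line)
        (void)write(STDOUT_FILENO, line, (size_t)n);
}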
Three minutes is a bit long to wait; can we make it run faster?
Yes, we can. But that’s a post for another day.
Note to interview candidates: I will not actually be asking that question. Do not revise that question. (Or do. I’m a footnote, not a cop.) ↩
Definitely unnecessarily: I learned more recently that xargs automatically caps command-line lengths according to the maximum size (with a lower-bound on that cap of 128KB); see xargs --show-limits
. ↩
Over on Language Log, there’s a post about pangrammatic windows, and a bot that searches Twitter posts for them. Pangrammatic windows are pangrams — a piece of text using all the letters in the (English) alphabet — that occur within otherwise naturally-occurring text.
For example, the shortest known natural sequence is 42 letters, from Piers Anthony’s Cube Route, discovered in an article in Word Ways:
I thought it might be interesting to work out how you’d go about searching a given text for pangrammatic windows. A short chat at work and some quick hacking later, and I had a simple proof-of-concept, but no data to run against.
That was easily solved by downloading the Project Gutenberg April 2010 DVD image1 and unzipping everything within. That gave me 11.6GB of text files, ranging in size from 336 bytes (one of the chapters of Moby Dick) to a single 43MB file comprising Webster’s Unabridged Dictionary.
I’ll post about the technical side separately, but suffice to say that this search doesn’t exactly tax a modern PC: my laptop has enough RAM to load all of the Gutenberg text into memory, and even from cold, it takes only 80 seconds to search through it all.
So what did I find? Well, firstly, several thousand occurrences of “the alphabet”. In retrospect, that probably should have been obvious.
I did find another 42-letter sequence, but I don’t think it can really count, as it occurs during a discussion of pangrams itself: De Morgan (the mathematician), while snarking about numerology, writes about trying to construct a meaningful sentence using all the letters save ‘v’ and ‘j’ exactly once:
The shortest sequence that seems to fit within the rules is the following 53-letter sequence, from The Life of Charles Dickens:
However, this, and a similar 56-letter sequence (“Köckeritz! Where is the king?”) in Napoleon and the Queen of Prussia both still seem somewhat unnatural to me, since they depend upon proper names to work (and to be fair, the same is true of the Piers Anthony quote as well).
Given that, I think the contender for the shortest truly “natural” pangrammatic window in the Gutenberg corpus is the following 57-letter sequence, from Andre Norton’s YA-esque civil war adventure, Ride Proud, Rebel!:
Funnily enough, one thing that I did expect to find, but didn’t, was any of the common example pangrams — in fact, the word “pangram” does not appear (with that meaning) in the Gutenberg corpus at all! The closest I got were the two near-misses: “the quick, brown fox jumped over the lazy dog” and “the swift brown fox jumps over the lazy dog”, the former of which is, I think, a misquote (the latter isn’t, as it’s called out in the text as an almost-pangram).
That’s it for this post. I also have a separate post that goes into a little detail about the code itself.
Hey, 14-year-old me? Remember when you spent over an hour on the phone to download 150KB of BBS software on a 300 baud connection? I just took about the same time to download 8.4GB, and I have enough space to store an uncompressed copy too. The future rocks! But while we’re here: could you buy some Apple stock during 2002? Thanks! ↩
Google Compute Engine is Google’s “run a virtual machine on Google infrastructure” product. It’s broadly similar to Amazon’s EC2, in that you get an unmanaged (Linux) virtual machine that you can run pretty much anything on, one difference being that it seems to be aimed at larger workloads: 16-core machines with hundreds of GB of RAM, 5TB disks, that kind of thing.
While I’d been meaning to look at it for a while, I didn’t think I had any reason to use it; I certainly don’t have any workloads of the scale people seem to be talking about. A short while ago, a friend at work mentioned he was using it to run a private Minecraft server, which seemed pretty small to me, so I thought perhaps I’d take another look.
It turns out that Compute Engine is just as suited to small-scale workloads as large ones, and while you do have to pay to use it, it works out to be pretty inexpensive. Having spent a little time with it now, I figured it was time to document what I found out.
Boring disclaimer time first, though: I don’t work on Compute Engine, so this isn’t anything official, just some guy on the internet. Also, in the interests of full disclosure: I’m getting an employee discount on the cost of using Compute Engine (though it’s cheap enough that I’d be happy paying full price anyway). With that in mind…
Stalkers and readers with good memories will recall that I started proxying this site via Google’s PageSpeed Service a little over two years ago. PageSpeed Service is a reverse proxy running in Google’s data centres that applies various performance rewrites to the original content (minifying CSS, and so on), and it does a pretty good job overall. As an additional benefit, it’s a (short TTL) caching proxy, so nobody needs depend directly on the copy of Apache running at the end of a DSL pipe on my server at home.
However, I’ve always been slightly bothered by the fact that that dependency still exists. There’s the usual “home network isn’t very reliable” problem1, but rather more importantly, that server’s on my home network, and given the choice, I’d rather not have it running a public copy of Apache as well as everything else.
Anyway, it turns out that I’m going to need to reinstall that server in a bit anyway, so I figured that it might be a good time to see whether Compute Engine was a good fit to run a simple low-traffic Apache server like the one that serves this site (spoiler: yes).
I was hoping that I’d have something clever to say about what I needed to do to set it up, but in truth the Compute Engine quickstart is almost embarrassingly easy, and ends up with a running copy of Apache, not far from where I needed to be.
One thing I did decide to do while experimenting was to script the whole install, so that a single script creates the virtual machine (the “instance”), installs everything I need, and sets up Apache to serve this site. Partly2 this was to make sure I recorded what I’d done, and partly so that I could experiment and reset to a clean state when I messed things up.
That may have been a bit excessive for a simple installation, but it does mean that I now have good documentation that I can go into some detail about.
With Compute Engine, the first thing you need to do (assuming you’ve completed the setup in the quickstart) is to create an instance, which is what Compute Engine calls a persistent virtual machine. My script ended up using something like the following, which creates an instance together with a new persistent disk to boot from:
$ gcutil --project=farblog addinstance www \
--machine_type=f1-micro \
--zone=us-central1-b \
--image=debian-7 \
--metadata_from_file=startup-script:startup.sh \
--authorized_ssh_keys=myuser:myuser.pub \
--external_ip_address=8.35.193.150 \
--wait_until_running
gcutil
is the command-line tool from the Google Cloud SDK that
allows you to configure and control everything related to Compute Engine
(other tools in the SDK cover App Engine, Cloud Storage, and so on, but I
didn’t need to use any of those).
Taking it from the top, --project
specifies the Google Developers
Console project ID (these projects are just a way to group
different APIs for billing and so on; in this case, I’m only using Compute
Engine). You can also ask gcutil
to remember the project (or any other
flag value) so that you don’t need to keep repeating it.
addinstance
is the command to add a new instance, and www
is my
(unimaginative) instance name. Everything after this point is optional: the
tool will prompt for the zone, machine type, and image, and use sensible
defaults for everything else.
The machine type comes next: f1-micro
is the smallest machine
type available, with about 600MB RAM and a CPU
reservation suitable to occasional “bursty” (rather than continuous)
workloads. That probably wouldn’t work for a server under load, but it
seems to be absolutely fine for one like mine, with a request rate measured
in seconds between requests, rather than the other way around.
Next is the zone (us-central1-b
), where the machine I’m using will be
physically located. This currently boils down to the choice of few
different locations in the US and Europe (at the time of writing, four
different zones across two regions named us-central1
and europe-west1
).
As with Amazon, the European regions are slightly more expensive (by about
10%) than the US ones, so I’m using a zone in a US region.
While Compute Engine was in limited preview, the choice of zone within a region was a bit more important, as different zones had different maintenance schedules, and a maintenance event would shut down the whole zone for about two weeks, requiring you to bring up another instance somewhere else. However, the US zones no longer have downtime for scheduled maintenance: when some part of the zone needs to be taken offline, the instances affected will be migrated to other physical machines transparently (i.e. without a reboot).
This is pretty awesome, and really makes it possible to run a set-and-forget service like a web server without any complexity (or cost) involved in, for example, setting up load balancing across multiple instances.
After the zone, I’ve specified the disk image that will be used to initialise a new persistent root disk (which will be named the same as the instance). Alternatively, rather than creating a new disk from an image, I could have told the instance to mount an existing disk as the root disk (in either read/write or read-only mode, though a given disk can only be mounted read/write by one instance at any time).
The image really is just a raw disk image, and it appears it can contain pretty much anything that can run as an x86-64 KVM guest, though all the documentation and tools currently assume you’ll be running some Linux distribution, so you may find it a little challenging to run something else (though plenty of people seem to be).
For convenience, Google provides links to images with recent versions of
Debian and CentOS (with RHEL and SUSE available as “premium” options), and
above I’m using the latest stable version of Debian Wheezy (debian-7
,
which is actually a partial match for something like
projects/debian-cloud/global/images/debian-7-wheezy-v20131120
).
Continuing with the options, --metadata_from_file and --authorized_ssh_keys both set instance metadata: the first sets the metadata value with the key startup-script to the contents of the file startup.sh, while the second sets the metadata value with the key sshKeys to a list of users and public keys that can be used to log into the instance (here, myuser is the username, and myuser.pub is the SSH public key file).
Both of these are specific to the instance (though it’s also possible to inherit metadata set at the project level), and can be queried — along with a host of other default metadata values — from the instance using a simple HTTP request that returns either a text string or JSON payload.
I’m not going to go into metadata in any detail other than the above.
The startup-script
metadata value is used to store the contents of a
script run (as root) after your instance boots. In my
case, I’m just using this to set the hostname of my instance, which was
otherwise unset3, which in turn makes a bunch of tools throw
warnings. I found the easiest way to fix this was to specify a startup
script containing just hostname www
.
The sshKeys
metadata value is used to store a list of users and SSH public
keys. This is read by a daemon (installed as
/usr/share/google/google_daemon/manage_accounts.py
) that ensures that each
listed user continues to exist (with a home directory,
.ssh/authorized_keys
containing the specified key, etc), and also
ensures that each listed user is a sudoer (present in
/etc/sudoers
)4.
Note that you don’t need to specify any of this at all. By default,
Compute Engine creates a user account on your instance with a name set to
your local login name, and creates a new ssh keypair that it drops into
~/.ssh/google_compute_engine{,.pub}
on the machine you created the
instance from. You can then simply use gcutil ssh instance-name
to ssh
into the instance.
This is helpful when you’re getting started, but it does mean that if you
want to ssh from anywhere else, you either need to copy those keys around,
or do something like the above to tell Compute Engine to accept a given
public key. Since I wanted to be able to ssh programmatically from machines
that didn’t necessarily have gcutil
installed, I found it simpler to just
create an ssh keypair manually, specify it as above, and use standard
ssh to ssh to the instance.
--external_ip_address
allows you to choose a “static” external IP address
(from one you’ve previously reserved). Otherwise, the instance is
assigned5 an ephemeral external IP address chosen when the
instance is created. This is reclaimed if you delete the instance, so
you probably don’t want to rely on ephemeral IP addresses as the target of
e.g. a DNS record.
However, you can promote an ephemeral address that’s assigned to an instance so that it becomes a static address, and there’s no charge for (in-use) static addresses, so there’s no problem if you start using an ephemeral address and later want to keep it. (Strictly speaking, external IP addresses are actually optional, as all of your instances on the same “network” can talk to each other using internal addresses, but this isn’t something simple installations are likely to use, I wouldn’t have thought.)
Compute Engine doesn’t currently have support for IPv6, oddly, though there’s a message right at the top of the networking documentation saying that IPv6 is an “important future direction”, so hopefully that’s just temporary. (EC2, for what it’s worth, doesn’t support IPv6 on their instances either, though their load balancers do, so you can use a load balancer as a [costly] way to get IPv6 accessibility.)
Finally (phew!), --wait_until_running
won’t return until the instance has
actually started booting (typically about 25 seconds; you can add a brand
new instance and be ssh’d into a shell in less than a minute.) Note that
the machine won’t have any user accounts until the initial boot has
finished, so if you’re scripting this you’ll need to spin a bit until ssh
starts working.
I did need to spend a fair amount of time working out how to configure the instance once it existed, but that was mostly because I’m not too familiar with Debian.
There isn’t a great deal to say about this part (and obviously it’ll depend
upon what you’re doing), but in my case I simply ran sudo apt-get install
to get the packages I needed (apache2
and mercurial
, and a few others
like less
and vim
), downloaded and installed
mod_pagespeed
, the Apache module that does the same thing
as PageSpeed Service, built my site, and set up Apache to serve it.
There are still two things I’m not quite happy with: the unattended-upgrades Debian package, which I believe I’ve now configured to apply security updates correctly, but I don’t fully understand what options I have here, or whether I have in fact configured it correctly; and using fetchmail (on another machine) in ODMR mode to inject mail to an existing account.
I’d estimate the monthly charge for my single instance to be around $15 + VAT (before the discount I’m getting). If I have the numbers right, that’s about what I’m paying for electricity to run my (not very efficient) server at home at present.
That price is dominated by the cost of the machine, which from the pricing documentation is currently $0.019/hour; the disk and network cost (for me) is going to end up at significantly less than a dollar a month.
I mention VAT above because something that’s not currently clear from the pricing documentation is that individuals in the EU pay VAT on top of the quoted prices (likewise true for EC2). Businesses are liable for VAT too, but they’re responsible for working out what to pay themselves, and do so separately.
One other aspect of the tax situation is a bit surprising (for individuals again, not businesses): VAT for Compute Engine is charged at the Irish VAT rate (23%), because when you’re in the EU, you’re paying Google Ireland. (This is in contrast to Amazon, who charge the UK rate even though you’re doing business with Amazon Inc. - tax is complicated.) Admittedly, the difference on the bill above is less than 30p/month, but it still took a little bit of time to figure out what was going on.
Despite all the talk of “big data” and large-scale data processing, is Compute Engine a viable option for running small-scale jobs like a simple static web server? Absolutely.
And while it’s easy to get started, it also looks like it scales naturally: I haven’t looked at load balancing (or protocol forwarding) in any detail, but everything else I’ve read about seems quite powerful and easy to start using incrementally.
From the management side, I’m impressed by the focus on scriptability:
gcutil
itself is fine as far as it goes, but the underlying Compute Engine
API is documented in terms of REST and JSON, and the Developers
Console goes out of its way to provide links to show you the REST
results for what it’s doing (as it just uses the REST API under the hood).
There are also a ton of client libraries
available (gcutil
is written against the Python API, for example), and
support from third-party management tools like
Scalr.
I still don’t think that I personally have any reason to use Compute Engine for large-scale processing, but I’m quite happy using it to serve this content to you.
Funny story: I wasn’t sure whether to mention reliability, since it’s actually been pretty good. Then a few hours after writing the first draft of this post, my router fell over, and it was half a day before I could get access to restart it. So there’s that. ↩
There was another reason I originally wanted to script the installation: the preview version of Compute Engine that I started out using supported non-persistent (“scratch”) machine-local disks that were zero-cost. Initially I was considering whether I could get the machine configured in a way that it could boot from a clean disk image and set itself up from scratch on startup. It turned out to be a little more complicated than made sense, so I switched to persistent disks, but kept the script (and then the 1.0 release of Compute Engine came along and did away with scratch disks anyway). ↩
It turns out this was caused by a bug in the way my Developers Console project was set up, many years ago; it doesn’t happen in the general case, and it’ll be fixed if I recreate my instance. ↩
This is actually a bit of a pain. I’m using this to create a service user that is used both for the initial install and website content updates, but I could probably do with separating out the two roles and creating the non-privileged user manually. ↩
Not actually assigned, but kinda: the instance itself only ever has an RFC 1918 address assigned to eth0
(by default, drawn from 10.240.0.0/16, though you can customise even that). Instead, it’s the “network” — which implements NAT6 between the outside world and the instance — that holds the external-IP-to-internal-IP mapping. The networking documentation covers this in extensive detail. ↩
I think even the NAT aspect is optional: Protocol forwarding (just announced today, as I write this) appears to allow you to attach multiple external IP addresses directly to a single instance, presumably as additional addresses on eth0
. ↩
Noda Time 1.2.0 finally came out last week, and since I promised I’d write a post about it, here’s a post about it — which I’ve also just partially self-plagiarised in order to make a post for the Noda Time blog, so apologies if you’ve read some of this already. I promise there’s new content below as well.
While the changes in Noda Time 1.1 were around making a Portable Class Library version and filling in the gaps from the first release, Noda Time 1.2 is all about serialization1 and text formatting.
On the serialization side, Noda Time now supports XML and binary serialization natively, and comes with an optional assembly (and NuGet package) to handle JSON serialization (using Json.NET).
On the text formatting side, Noda Time 1.2 now properly supports formatting
and parsing of the Duration
, OffsetDateTime
, and ZonedDateTime
types.
We also fixed a few bugs, and added some more convenience methods —
Interval.Contains()
and ZonedDateTime.Calendar
, among others — in
response to requests we received from people using the
library2.
Finally, it apparently wouldn’t be a proper Noda Time major release without
fixing another spelling mistake in our API: we replaced Period.Millseconds
in 1.1, but managed not to spot that we’d also misspelled Era.AnnoMartyrm
,
the era used in the Coptic calendar. That’s fixed in 1.2, and I think
(hope) that we’re done now.
There’s more information about all of the above in the comprehensive
serialization section of the user guide, the pattern
documentation for the Duration
,
OffsetDateTime
, and
ZonedDateTime
types, and the 1.2.0 release notes.
You can pick up Noda Time 1.2.0 from the NuGet repository as usual, or from the links on the Noda Time home page.
That’s the summary, anyway. Below, I’m going to go into a bit more detail about XML and JSON serialization, and what kind of things you can do with the new text support.
Using XML serialization is pretty straightforward, and mostly works as you’d expect. Here’s a complete example demonstrating XML serialization of a Noda Time property:
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;
using NodaTime;
public class Person
{
    public string Name { get; set; }
    public LocalDate BirthDate { get; set; }
}

static class Program
{
    static void Main(string[] args)
    {
        var person = new Person {
            Name = "David",
            BirthDate = new LocalDate(1979, 3, 22)
        };
        var x = new XmlSerializer(person.GetType());
        var namespaces = new XmlSerializerNamespaces(
            new XmlQualifiedName[] { new XmlQualifiedName("", "urn:") });
        var output = new StringWriter();
        x.Serialize(output, person, namespaces);
        Console.WriteLine(output);
    }
}
As you can see, there’s nothing special here, and the output is also as you’d expect:
<?xml version="1.0" encoding="utf-8"?>
<Person>
<Name>David</Name>
<BirthDate>1979-03-22</BirthDate>
</Person>
There are a couple of caveats to be aware of regarding XML serialization,
though, most notably that the Period
type requires special handling.
Period
is an immutable reference type, which XmlSerializer
doesn’t
really support, and so you’ll need to serialize via a proxy PeriodBuilder
property instead.
The other notable issue (which also applies to binary serialization) is that
.NET doesn’t provide any way to provide contextual configuration, and so
when deserializing a ZonedDateTime
, we need a way to find out which time
zone provider to use.
By default, we’ll use the TZDB provider, but if you’re using the BCL provider (or any custom provider), you’ll need to set a static property:
DateTimeZoneProviders.Serialization = DateTimeZoneProviders.Bcl;
The serialization section in the user guide has more details about both of these issues.
There are also two other limitations of XmlSerializer
that aren’t specific
to Noda Time, but are good to know about if you’re just getting started:
- Types that implement IXmlSerializable (as the Noda Time types do) can only be serialized as elements, and so annotating your properties with the XmlAttribute attribute won’t work (it appears that .NET will throw an exception, while Mono will instead do something strange).
- Types that don’t implement IXmlSerializable (and have no writable public properties) are silently serialized as empty elements and deserialized to their default values. This is unlikely to be what you want, and it’s what will happen if you accidentally run using a pre-1.2 Noda Time assembly.

Noda Time’s JSON serialization makes use of Json.NET, which means that
to use it, you’ll need to add references to both the Json.NET assembly
(Newtonsoft.Json.dll
) and the Noda Time support assembly
(NodaTime.Serialization.JsonNet.dll
).
The only setup you need to do in code is to inform Json.NET how to serialize
Noda Time’s types (and again, which time zone provider to use). This can
either be done by hand, or via a ConfigureForNodaTime
extension
method. Again, the user guide has
all the details.
Once that’s done, using the serializer is straightforward:
using System;
using System.IO;
using Newtonsoft.Json;
using NodaTime;
using NodaTime.Serialization.JsonNet;
internal class Person
{
    public string Name { get; set; }
    public LocalDate BirthDate { get; set; }
}

static class Program
{
    static void Main(string[] args)
    {
        var person = new Person {
            Name = "David",
            BirthDate = new LocalDate(1979, 3, 22)
        };
        var json = new JsonSerializer();
        json.ConfigureForNodaTime(DateTimeZoneProviders.Tzdb);
        var output = new StringWriter();
        json.Serialize(output, person);
        Console.WriteLine(output);
    }
}
Output:
{"Name":"David","BirthDate":"1979-03-22"}
The Json.NET serializer is significantly more configurable than the .NET XML serializer; the Json.NET documentation is probably a good place to start if you’re interested in doing that.
Noda Time 1.2 adds parsing and formatting for the Duration
,
OffsetDateTime
, and ZonedDateTime
types, which previously only had
placeholder ToString()
implementations. Given a series of assignments
like the following:
var paris = DateTimeZoneProviders.Tzdb["Europe/Paris"];
ZonedDateTime zdt = SystemClock.Instance.Now.InZone(paris);
OffsetDateTime odt = zdt.ToOffsetDateTime();
Duration duration = Duration.FromSeconds(12345);
the result of calling ToString()
on each of the zdt
, odt
, and
duration
variables would produce something like the following in 1.1:
Local: 26/11/2013 19:35:28 Offset: +01 Zone: Europe/Paris
2013-11-26T19:35:28.00081+01
Duration: 123450000000 ticks
In 1.2, these types use a standard pattern by default instead: the general
invariant pattern (‘G’), for ZonedDateTime
and OffsetDateTime
, and the
round-trip pattern (‘o’) for Duration
:
2013-11-26T19:35:28 Europe/Paris (+01)
2013-11-26T19:35:28+01
0:03:25:45
More usefully, we can now use custom patterns:
var pattern = ZonedDateTimePattern.CreateWithInvariantCulture(
"dd/MM/yyyy' 'HH:mm:ss' ('z')'", null);
Console.WriteLine(pattern.Format(zdt));
which will print “26/11/2013 19:35:28 (Europe/Paris)
”.
The null
above is an optional time zone provider. If not specified, as
shown above, the resulting pattern can only be used for formatting, and not
for parsing3. This is why the standard patterns are format-only:
they don’t have a time zone provider.
If you do specify a time zone provider, however, you can parse your custom format just fine:
var pattern = ZonedDateTimePattern.CreateWithInvariantCulture(
"dd/MM/yyyy' 'HH:mm:ss' ('z')'", DateTimeZoneProviders.Tzdb);
var zdt = pattern.Parse("26/11/2013 19:35:28 (Europe/Paris)").Value;
Console.WriteLine(zdt);
which prints “2013-11-26T19:35:28 Europe/Paris (+01)
”, as you would expect.
As well as formatting the time zone ID (the “z” specified in the format string above), you can also format the time zone abbreviation (using “x”), which given the above input would produce “CET”, for Central European Time.
Now, if you’ve seen Jon’s “Humanity: Epic fail” talk — or watched
his recent presentation at DevDay Kraków, which covers some of the
same content — then you’ll already know that time zone abbreviations
aren’t unique. For that reason, if you include a time zone abbreviation
when creating a ZonedDateTimePattern
, the pattern will also be
format-only.
In addition to the time zone identifiers, both ZonedDateTime
and
OffsetDateTime
patterns accept a format specifier for the offset in
effect. This uses a slightly unusual format, as Offset
can be formatted
independently: it’s “o<…>”, where the “…” is an Offset
pattern
specifier. For example:
var pattern = ZonedDateTimePattern.CreateWithInvariantCulture(
"dd/MM/yyyy' 'HH:mm:ss' ('z o<+HH:mm>')'", null);
Console.WriteLine(pattern.Format(zdt));
which will unsurprisingly print “26/11/2013 19:35:28 (Europe/Paris
+01:00)
”.
For OffsetDateTime
, the offset is a core part of the type, while for
ZonedDateTime
, it allows for the disambiguation of otherwise-ambiguous
local times (as typically seen during a daylight saving transition).
If the offset is not included, the default behaviour for ambiguous times is to consider the input invalid. However, this can also be customised by providing the pattern with a custom resolver.
Finally, to Duration
. Duration formatting is a bit more interesting,
because we allow you to choose the granularity of reporting. For our
duration above, of 12,345 seconds, the round-trip pattern shows the number
of days, hours, minutes, seconds, and milliseconds (if non-zero), as
“0:03:25:45
”.
We can also format just the hours and minutes:
var pattern = DurationPattern.CreateWithInvariantCulture("HH:mm");
var s = pattern.Format(duration);
Console.WriteLine(s);
which prints “03:25
” — or we can choose to format just the minutes and
seconds:
var pattern = DurationPattern.CreateWithInvariantCulture("M:ss");
var s = pattern.Format(duration);
Console.WriteLine(s);
which does not print “25:45
”, but instead prints “205:45
”, reporting
the total number of minutes and a ‘partial’ number of seconds. Had we
instead used “mm:ss
” as the pattern, we would indeed have seen the former
result; the case of the format specifier determines whether a total or
partial value is used.
Once again, there’s more information on all of the above in the relevant sections of the user guide.
or serialisation. I apologise in advance for the spelling, but the term turns up in code all the time (e.g. ISerializable
), and I find it makes for awkward reading to mix and match the two. ↩
There’s definitely a balance to be had between the Pythonesque “only one way to do it” maxim and providing so many convenience methods that they cloud the basic concepts, and I think for 1.0 we definitely tended towards the former — which isn’t that bad: it’s easy to expand an API, but hard to reduce it. Some things that were a little awkward should be easier with 1.2, though. ↩
The error message you’ll see is “UnparsableValueException: This pattern is only capable of formatting, not parsing.” ↩
“Measure twice, cut once.” I can’t recall exactly where I was when I first heard that: perhaps a school carpentry lesson? Or for some reason I’m now thinking it was a physics lesson instead, but no matter. What is true is that I recently discovered that it applies to software engineering as much as carpentry.
Here’s the background: this blog is generated as a set of static files, by transforming an input tree into an output tree. The input tree contains a mixture of static resources (images, etc) alongside ‘posts’, text files containing metadata and Markdown content. The output tree contains exactly the same files, except that the posts are rendered to HTML, and there are some additional generated files like the front page and Atom feed.
There are tools to do this kind of thing now (we use
Jekyll for Noda Time’s web site, for example), but I
wrote my own (and then recently rewrote it in Python)1. My version
does three things: it scans the input and output trees, works out what the
output tree should look like, then makes it so. It’s basically make
(with
a persistent cache and content-hash-based change detection) plus a dumb
rsync
, and it’s not particularly complex.
For my most-recent post, I needed to add support for math rendering, which I did by conditionally sourcing a copy of MathJax from their CDN. So far, so good, but then I wanted to be able to proof the post while I was on a plane, so I decided to switch to a local copy of MathJax instead.
Problem: a local install of MathJax contains nearly 30,000 files, and with all of those in the input tree, my no-change runs now took about twelve seconds.
But, you know, optimisation opportunity! I carried out some basic profiling and figured out roughly where the 12 seconds I was seeing was going: scanning the input and output trees and writing out my persistent cache accounted for a large share of it.
The remainder I couldn’t see any obvious way to improve, but the scanning and cache-writing times surprised me.
The input and output tree scanning is done by using os.walk()
to walk the
tree, and os.stat()
to grab each file’s mtime and size (which I use as
validators for cache entries).
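In outline, the scan amounts to something like the following (a simplified sketch rather than the real code, with invented names):

import os

def scan_tree(root):
    """Walk `root`, recording (mtime, size) for every file under it."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)  # one stat(2) call per file
            entries[os.path.relpath(path, root)] = (st.st_mtime, st.st_size)
    return entries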
Clearly that was an inefficient way to do it: I’m calling stat(2) about 30,000 times, when I should be reading that information in the same call as the one that reads the directory, right? Except that there’s no such call: the Linux VFS assumes that a file’s metadata is separate from the directory entry2; this isn’t DOS.
Perhaps I was thrashing the filesystem cache, then? Maybe I should be sorting (or not sorting) the directory entries, or stat-ing all the files before I recursed into subdirectories? Nope; doesn’t make a difference.
Well, I guess we’re blocking on I/O then. After all, git
doesn’t take
long to scan a tree like this, so it must be doing something clever; I
should do that. Ah, but git
is multi-threaded, isn’t it?3
I’ll bet that’s how it can be fast: it’s overlapping the I/O operations so
that it can make progress without stalling for I/O.
So I wrote a parallel directory scanner in Python, trying out both the
multiprocessing
and threading
libraries. How long did each
implementation take? About five seconds, same as before. (And raise your
hand if that came as a surprise.)
The next thing I tried was replicating the scan in C, just to double-check
that readdir
-and-stat
was a workable approach. I can’t recall the
times, but it was pretty quick, so Python’s at fault, right?
Wrong. It’s always your fault.
I realised then that I’d never tried anything outside my tool itself, and ported the bare C code I had to Python. It took exactly the same amount of time. (At which point I remembered that Mercurial, which I actually use for this blog, is largely written in Python; that should have been a clue that it wasn’t likely to be Python in the first place.)
So finally I started taking a look at what my code was actually doing with
the directory entries. For each one, it created a File
object with a
bunch of properties (name, size, etc), along with a cache entry to store
things like the file’s content hash and the inputs that produced each
output.
Now the objects themselves I couldn’t do much about, but the cache entry
creation code was interesting: it first generated a cache key from the name,
mtime, and size (as a dict
), then added an empty entry to my global cache
dict
using that key. The global cache was later to be persisted as a JSON
object (on which, more later), and so I had to convert the entry’s cache key
to something hashable first.
And how to generate that hashable key? Well, it turned out that as I
already had a general-purpose serialiser around, I’d made the decision to
reuse the JSON encoder as a way to generate a string key from my cache key
dict
(because it’s not like performance matters, after all). Once I’d
replaced that with something simpler, tree scans dropped from 5.2s to 2.9s.
Success!
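To give a flavour of the change (again a sketch, not the actual code): the key only needs to be deterministic and usable as a dictionary key, so a simple formatted string does the job, and is far cheaper than running the whole dict through the JSON encoder.

import json

def cache_key_slow(name, mtime, size):
    # What I was effectively doing: serialise a dict just to get a string key.
    return json.dumps({'name': name, 'mtime': mtime, 'size': size},
                      sort_keys=True)

def cache_key_fast(name, mtime, size):
    # A plain formatted string is deterministic, hashable, and still a valid
    # JSON object key when the cache is persisted later.
    return '%s|%s|%s' % (name, mtime, size)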
I’d also noticed something odd while I was hacking about: when I removed
some of my DEBUG
level logging statements, things sped up a bit, even
though I was only running at INFO
level. I briefly considered that
perhaps Python’s logging was just slow, then decided to take another look at
how I was setting up the logging in the first place:
logging.basicConfig(
    level=logging.DEBUG,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))
logging.getLogger().handlers[0].setLevel(
    logging.DEBUG if args.verbose else logging.INFO)
Python’s logging is similar to Java’s, so this creates a logger that logs
everything, then sets the default (console) log handler to only display
messages at INFO
level and above. Oops.
I’d stolen the code from another project where I’d had an additional
always-on DEBUG
handler that wrote to a file, but here I was just wasting
time formatting log records I’d throw away later. I changed the logging to
set the level of the root logger instead, and sped things up by another
second. More success!
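The fix amounts to setting the level on the root logger itself, rather than leaving it at DEBUG and filtering in the handler; something along these lines (with args.verbose as in the original snippet):

import logging

logging.basicConfig(
    # Setting the level here (on the root logger) means DEBUG records are
    # rejected before they're ever formatted, rather than being formatted
    # and then thrown away by the handler.
    level=logging.DEBUG if args.verbose else logging.INFO,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))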
Finally, I decided to take a look at the way I was writing out my cache.
This is a fairly large in-memory dict
mapping string keys to dict
values
for each file I knew about. I’d known that Python’s json
module wasn’t
likely to be particularly fast, but almost three seconds to write an 11MB
file still seemed pretty slow to me.
I wasn’t actually writing it directly, though; the output was quite big and
repetitive, so I was compressing it using gzip
first:
with gzip.open(self._cache_filename, 'wb') as cache_fh:
    json.dump(non_empty_values, cache_fh, sort_keys=True,
              default=_json_encode_datetime)
I noticed that if I removed the compression entirely, the time to write the cache dropped from about 2900ms to about 800ms, but by this point I was assuming that everything was my fault instead, so I decided to measure the time taken to separately generate the JSON output and write to the file.
To my surprise, when I split up the two (using json.dumps()
to produce a
string instead of json.dump()
), the total time dropped to just 900ms. I
have no real idea why this happens, but I suspect that something is either
flushing each write to disk early or that calls from Python code to Python’s
native-code zlib module are expensive (or json.dump()
is slow).
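The version I ended up with is roughly the following (a sketch, reusing the names from the snippet above): build the JSON text in memory first, then hand it to the compressed stream in a single write.

with gzip.open(self._cache_filename, 'wb') as cache_fh:
    # Serialise to a string first and write it in one go, rather than
    # letting json.dump() issue many small writes to the gzip stream.
    data = json.dumps(non_empty_values, sort_keys=True,
                      default=_json_encode_datetime)
    cache_fh.write(data.encode('utf-8'))  # dumps() output is ASCII by default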
In total, that brought my no-change runtime down from twelve seconds to just about six.
So, in summary, once I realised that I should actually measure where my time was spent rather than guessing what it might be spent on, I was able to reduce my runtime by about half, quite a big deal in an edit-compile-run cycle. It took it from “I’m bored waiting; maybe read Slashdot instead” to something that was tolerable.
And so success, and scene.
But.
But there’s actually a larger lesson to learn here, too (and in fact I very
nearly titled this post “The \sqrt
of all evil” in light of that): I
didn’t need to do any of this at all.
A few weeks after the above, I realised that I could solve my local editing problem an entirely different way. I moved MathJax out of the content tree entirely, and now (on local runs) just drop a symlink into the output tree after the tool is done. So if you take a look at the page as it serves now, you’ll see I’m back to sourcing MathJax via their CDN.
This means that I’m back down to O(200) input files rather than O(30k), and my no-change builds now take 30ms. It was a fun journey, but I’m not entirely sure that cutting 45ms from a 75ms run was worth all the effort I put in…
Why? Mostly because I can: it’s fun. But also because it gives me an excuse to play around: originally with C and SQLite, and more recently with Python, a language I don’t use much otherwise. I could also say that static blog generators weren’t as common in 2006, but to be honest I don’t think I bothered looking. ↩
This is a good thing: hard links would be pretty tricky otherwise. ↩
Nope. git
is multi-threaded when packing repositories, but as far as I’m aware that’s the only time it is. ↩
Last month1, John Regehr asked (roughly) whether there’s a standard way to encode a combination of a fixed number of choices as an integer.
As it happens, this is something I’ve occasionally wondered about, too.
The short answer is that there is a standard (though not particularly fast) way to do this called the “Combinatorial Number System”. As usual, Wikipedia has the details, though in this case I found their explanations a little hard to follow, so I’m going to go through it here myself as well.
First, let’s back up for a bit and make sure we know what we’re talking about.
If you wanted a way to compactly encode any combination of — let’s say — 32 choices, your best bet (assuming a uniform distribution) would be to use a 32-element bit-vector, with one bit set per choice made.
If we were implementing that in C++, we could do something like this:
uint32_t encode(const std::set<int>& choices) {
  uint32_t result = 0;
  for (int choice : choices) {
    assert(0 <= choice && choice < 32);
    result |= 1 << choice;
  }
  return result;
}
This pattern is common enough that in practice you’d be more likely to use
the bit-vector directly, rather than materialise anything like an explicit
set
of choice numbers. But while it’s a good solution for the general
case, what if your problem involved some fixed number of choices instead
of an arbitrary number?
For the purposes of the rest of this discussion, let’s pretend we’re going to make exactly four choices out of the 32 possible. That gives us about 36,000 different possible combinations to encode, which we should be able to fit into a 16-bit value.
Actually, for almost all purposes, there’s still nothing really wrong with the implementation above, even though it is using twice as many bits as needed — unless perhaps you had a very large number of combinations to store (and are willing to trade simplicity for space). However, as we’ll see later, the solution to this problem has a few other uses as well. Plus, I think it’s an interesting topic in itself.
If you just want to read about how the combinatorial number system itself works, feel free to skip the next section, as I’m going to briefly take a look at a ‘better’ (though still non-optimal) alternative to the above.
As an improvement to the above one-bit-per-choice implementation, we could simply encode the number of each choice directly using four groups of five bits each:
uint32_t encode(const std::set<int>& choices) {
  assert(choices.size() == 4);
  uint32_t result = 0;
  for (int choice : choices) {
    assert(0 <= choice && choice < 32);
    result = (result << 5) | choice;
  }
  return result;
}
This is still fairly simple, and at 20 bits of output, it’s more compact than the 32-bit bit-vector; but it’s still not optimal, given that we said earlier that 16 bits should be enough.
We can actually see why this is: the encoding we’ve chosen is too expressive for what we actually need to represent2. In this case, there are two ways in which our encoding distinguishes between inputs that could be encoded identically.
The biggest waste occurs because we’re encoding an ordering. For example, \( \lbrace 1, 2, 3, 4 \rbrace \) and \( \lbrace 4, 3, 2, 1 \rbrace \) have different encodings under our scheme, yet really represent the same two combinations of choices.
The second (and much more minor) inefficiency comes from the ability to encode ‘impossible’ combinations like \( \lbrace 4, 4, 4, 4 \rbrace \). Optimally, choices after the first would be encoded using a slightly smaller number of bits, as we have fewer valid choices to choose from by then.
In this case, we can actually quantify precisely the degree to which this encoding is sub-optimal: being able to represent an ordering means that we have \( 4! = 24 \) times too many encodings for the ‘same’ input, while allowing duplicates means that we have \( 32^4 \div ( 32 \cdot 31 \cdot 30 \cdot 29 ) \approxeq 1.2 \) times too many (i.e. the number of choices we can encode, divided by the number we should be able to encode). Combining the two factors gives us the difference between an optimal encoding and this one.
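To put concrete numbers on that, using the figures above:

\[ {32^4 \over \binom {32} 4} = {1048576 \over 35960} \approx 29.2 \approx 24 \times 1.2 \]

In other words, the five-bits-per-choice encoding carries about \( \log_2 29.2 \approx 4.9 \) bits of redundancy, which squares with using 20 bits where roughly 15.1 would do.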
It’s interesting to think about how we might improve on the above: perhaps we could canonicalise the input by ordering the choices by number, say, then rely on that fact somehow: if we knew that the choices were in decreasing order, for example, it’s clear to see that we could identify the combination \( \lbrace 3, 2, 1, 0 \rbrace \) entirely from the information that the first choice is number 3.
But let’s move on to something that is optimal.
And so we arrive at the “combinatorial number system”. This system describes a bijection between any combination of (a fixed number of) \(k\) choices and the natural numbers.
Interestingly, this means that this scheme does not depend upon knowing the number of things you’re choosing from: you can convert freely between the two representations just by knowing how many choices you need to make.
First, a brief refresher on binomials. The number of ways we can choose \(k\) things from \(n\) is the binomial coefficient, which can be defined recursively:
\[ \eqalign{ \binom n 0 &= \binom n n = 1 \cr \binom n k &= \binom {n-1} {k-1} + \binom {n-1} k } \]
… or, in a way that’s simpler to compute directly — as long as the numbers involved are small — in terms of factorials:
\[ \binom n k = {n! \over k!(n - k)!} \]
For example, we can compute that there are exactly 35,960 different ways to choose four things from a set of 32:
\[ \binom {32} 4 = {32! \over 4!(32 - 4)!} = 35960 \]
Note that in some cases (for example, the one immediately below), we may also need to define:
\[ \binom n k = 0 \text { when } k \gt n \]
That is, it’s not possible to choose more things from a set than were originally present.
But that’s enough about binomials. The combinatorial number system, then, is defined as follows:
Given a combination of choices \( \lbrace C_k, C_{k-1}, \dots, C_1 \rbrace \) from \(n\) elements such that \( n \gt C_k \gt C_{k-1} \gt \dots \gt C_1 \ge 0 \), we compute a value \(N\) that encodes these choices as:
\[ N = \binom {C_k} k + \binom {C_{k-1}} {k-1} + \dots + \binom {C_2} {2} + \binom {C_1} {1} \]
Somewhat surprisingly, this produces a unique value \( N \) such that:
\[ 0 \le N \lt \binom n k \]
And since the number of possible values of \( N \) is equal to the number of combinations we can make, every \( N \) maps to a valid (and different) combination.
\( N \) will be zero when all the smallest-numbered choices are made (i.e. when \( C_k = k - 1 \) and so \( C_1 = 0 \)), and will reach the maximum value with a combination containing the largest-numbered choices.
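As a quick worked example (the combination is picked purely for illustration), encoding the four choices \( \lbrace 7, 4, 2, 0 \rbrace \) gives:

\[ N = \binom 7 4 + \binom 4 3 + \binom 2 2 + \binom 0 1 = 35 + 4 + 1 + 0 = 40 \]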
We could implement this encoding using something like the following:
uint32_t binom(int n, int k) {
  assert(0 <= k);
  assert(0 <= n);
  if (k > n) return 0;
  if (n == k) return 1;
  if (k == 0) return 1;
  return binom(n-1, k-1) + binom(n-1, k);
}

uint32_t encode(const std::set<int>& choices) {
  std::set<int, std::greater<int>> choices_sorted(
      choices.begin(), choices.end());
  int k = choices.size();
  uint32_t result = 0;
  for (int choice : choices_sorted) {
    result += binom(choice, k--);
  }
  return result;
}
… though in reality, we’d choose a much more efficient way to calculate binomial coefficients, since the recursive implementation above ends up calling binom() a number of times proportional to the resulting value of \(N\).
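One standard improvement, for example, is to compute each coefficient iteratively from the multiplicative form, multiplying and then dividing at each step so that every intermediate result stays an integer:

\[ \binom n k = \prod_{i=1}^{k} {n - k + i \over i} \]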
Decoding can operate using a greedy algorithm that first identifies the greatest-used choice number, then successively removes the terms we added previously:
std::set<int> decode(uint32_t N, int k) {
  int choice = k - 1;
  while (binom(choice, k) < N) {
    choice++;
  }
  std::set<int> result;
  for (; choice >= 0; choice--) {
    if (binom(choice, k) <= N) {
      N -= binom(choice, k--);
      result.insert(choice);
    }
  }
  return result;
}
We could also choose to remove the initial loop and just start choice
at
the greatest possible choice number, if we knew it in advance.
As to how all this works, consider the list produced for successive \(N\).
For \(k = 4\), the enumeration of the combinations begins:
\[ \displaylines{ \lbrace 3, 2, 1, 0 \rbrace \cr \lbrace 4, 2, 1, 0 \rbrace \cr \lbrace 4, 3, 1, 0 \rbrace \cr \lbrace 4, 3, 2, 0 \rbrace \cr \lbrace 4, 3, 2, 1 \rbrace \cr \lbrace 5, 2, 1, 0 \rbrace \cr \vdots } \]
As you can see, this list is in order. More specifically: it’s in lexicographic order. This isn’t by coincidence, but is actually a direct result of the way we construct the equation above. Let’s do that.
First, construct an (infinite) list of all possible \(k\)-combinations, with the choices that form each individual combination in descending order, as above. Sort this list in lexicographic order, as if the choices were digits in some number system, again as shown above.
Pick any entry in that sorted list. We’re going to count the number of entries that precede our chosen one.
To do so, for each ‘digit’ (choice) in our chosen entry, count all valid combinations of the subsequence that extends from that digit to the end of the entry, while only making use of the choices with numbers smaller than the currently-chosen one. Sum those counts, and that’s the number of preceding entries.
That sounds much more complex than it really is, but what we’re doing is equivalent to, in the normal decimal number system, saying something like: “the count of numbers less than 567 is equal to the count of numbers 0xx–4xx, plus the count of numbers 50x–55x, plus the count of numbers 560–566”. Except in this case, the ‘digits’ in our numbers are strictly decreasing.
I skipped a step. How do we find the number of combinations for each subsequence? That’s actually easy: if the choice at the start of our subsequence currently has the choice number \(C_i\) and the subsequence is of length \(i\), then the number of lexicographically-smaller combinations of that subsequence, \(N_i\), is the number of assignments we can make of \(C_i\) different choices3 to the \(i\) different positions in the subsequence.
Or alternatively,
\[ N_i = \binom {C_i} i \]
and so:
\[ \eqalign{ N &= \sum_{i=1}^k N_i \cr &= \binom {C_k} k + \binom {C_{k-1}} {k-1} + \dots + \binom {C_2} {2} + \binom {C_1} {1} } \]
\(N\), of course, is both the count of entries preceding the chosen entry in our sorted list, and the index that we assign to that entry.
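As a sanity check against the enumeration above, \( \lbrace 5, 2, 1, 0 \rbrace \) appears sixth in the list (index 5), and indeed:

\[ \binom 5 4 + \binom 2 3 + \binom 1 2 + \binom 0 1 = 5 + 0 + 0 + 0 = 5 \]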
Okay, so perhaps it’s not that common to need to encode a combination as an integer value, but there’s another way to use this to be aware of here: if you pick a random \(N\) and ‘decode’ it, you end up with a uniformly-chosen random combination. That’s something that I have wanted to do on occasion, and it’s not immediately clear how you’d do it efficiently otherwise.
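Here’s a self-contained sketch of that idea (in Python rather than the C++ above, and using math.comb for the binomial coefficients, so it needs Python 3.8+); decode() follows the same greedy algorithm as before:

import math
import random

def decode(N, k):
    """Greedily decode N into the k-combination it represents (largest first)."""
    result = []
    # Find an upper bound on the largest choice number used.
    choice = k - 1
    while math.comb(choice, k) <= N:
        choice += 1
    for i in range(k, 0, -1):   # i = k, k-1, ..., 1 choices still to make
        # The largest 'choice' with C(choice, i) <= N is the next element.
        while math.comb(choice, i) > N:
            choice -= 1
        N -= math.comb(choice, i)
        result.append(choice)
    return result

def random_combination(n, k):
    """Return a uniformly-chosen k-combination of {0, ..., n-1}."""
    N = random.randrange(math.comb(n, k))  # 0 <= N < C(n, k)
    return decode(N, k)

print(random_combination(32, 4))  # e.g. [27, 14, 9, 2]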
(For completeness, a third usage that Wikipedia mentions is in being able to construct an array indexed by \(N\) in order to associate some information with each possible \(k\)-combination. I can see what they’re saying, but I can’t see many cases where this might actually be something that you need to do.)
A simple bit-vector is optimal for representing any number of choices \( [0, n] \) made from \(n\) items. The combinatorial number system is optimal for representing a fixed number of choices \(k\).
What’s an optimal way to represent a bounded number of choices \( [0, k] \)? Or more generally, any arbitrarily-bounded number of choices?
I’m a slow typist. ↩
The other possibility is that the encoding may be wasteful in a straightforward information-theoretic sense, akin to using six bits to represent “a number from 0 to 37” rather than ~5.25. While this is strictly a superset of the over-expressiveness problem mentioned in the main text, it seems useful to differentiate the two, and consider the semantics expressed by the encoded representation we’ve decided upon. ↩
That is, the count of those choice numbers that are smaller than \(C_i\), since choice numbers start from zero. ↩