Noda Time 3.0.0 came out yesterday1, bringing a shiny new parcel of date- and time-related functionality.
What’s new in 3.0? Firstly, there’s a couple of things in 3.0 that just plain make it easier to use Noda Time:
Nullable reference types. The API now correctly uses the nullable reference types introduced in C# 8.0 to document when a method or property may accept or return a null value. For example, IDateTimeZoneProvider.GetZoneOrNull(id) now declares its return value as DateTimeZone?, while the similar indexer (which cannot return null) instead returns DateTimeZone. Nullability was previously noted in our documentation, but now (with appropriate compiler support) you can opt in to warnings that indicate where you might be accidentally passing a null somewhere you shouldn’t.
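In code, the distinction looks something like this (a minimal sketch; the zone IDs are just examples):

IDateTimeZoneProvider provider = DateTimeZoneProviders.Tzdb;
DateTimeZone? maybeZone = provider.GetZoneOrNull("Not/AZone");   // nullable: returns null for an unknown ID
DateTimeZone zone = provider["Europe/London"];                   // non-nullable: throws for an unknown ID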
A plethora of API improvements. For example, we now have a YearMonth type that can represent a value like “May 2020”; TzdbDateTimeZoneSource now provides explicit dictionaries mapping between TZDB and Windows time zone IDs; and DateAdjusters.AddPeriod() creates a date adjuster that can be used to add a Period to dates, along with many other improvements. As always, see the version history and API changes page for full details.
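For instance, the new YearMonth type and the period-based date adjuster can be used something like this (a sketch; the exact signatures shown are assumptions based on the descriptions above):

YearMonth may2020 = new YearMonth(2020, 5);
var addMonth = DateAdjusters.AddPeriod(Period.FromMonths(1));
LocalDate nextMonth = new LocalDate(2020, 5, 12).With(addMonth);  // 2020-06-12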
A single library version. Previous versions of Noda Time were slightly fragmented when it came to supporting different framework versions. For example, Noda Time 1.x was specific to the .NET Framework, and later added a Portable Class Library version that was missing a few key functions, while Noda Time 2.x again provided a separate .NET Standard version that differed slightly from the ‘full’ version. As of Noda Time 3.0, we have just one library version, providing the same functionality on all platforms.
Better support for other frameworks. Most core types are now annotated with TypeConverter and XmlSchemaProvider attributes. Type converters are used in various frameworks to convert one type into another (typically, to or from a string) — for example, ASP.NET will use type converters to convert query string parameters into typed values — while the XML schema attributes make it possible to build an XML schema programmatically for web services that make use of Noda Time types.
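As a rough illustration of what the type converters enable (a sketch, assuming the LocalDate converter parses the ISO-8601 date format):

using System.ComponentModel;
using NodaTime;

var converter = TypeDescriptor.GetConverter(typeof(LocalDate));
var date = (LocalDate)converter.ConvertFromInvariantString("2020-03-14");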
Although not as significant as the changes from Noda Time 1.x to 2.x, performance is still a key concern for Noda Time.
In 3.0.0, we’ve managed to eke out a little more performance for some common operations: finding the earlier of two LocalDate values now takes somewhere between 40–60% of the time it did in Noda Time 2.x, while parsing text strings as LocalTime and LocalDate values using common (ISO-like) patterns should also be a little faster, taking around 90% of the time it did in Noda Time 2.x.
The change from Noda Time 2.x to 3.0 is not as big as the one from Noda Time 1.x to 2.0, but there are still some small incompatibilities to watch out for.
The migration document details everything that we’re aware of, but there are two points worth calling out explicitly:
Noda Time 3.x has (slightly) greater system requirements than Noda Time 2.x. While Noda Time 2.x required either .NET Framework 4.5+ or .NET Core 1.0+, Noda Time 3.x requires “netstandard2.0”; that is, .NET Framework 4.7.2+ or .NET Core 2.0+.
.NET binary serialization is no longer supported. While .NET Core 2.0 added some support for binary serialization, binary serialization has many known deficiencies, and other serialization frameworks are now generally preferred. Accordingly, we have removed support for binary serialization entirely from Noda Time 3.x.
Noda Time still natively supports .NET XML serialization for all core types, and we also provide official libraries for serializing using JSON (1, 2) and Google’s protobuf.
In general, though, we expect that most projects using Noda Time 2.x should be able to replace it with Noda Time 3.0.0 transparently.
You can get Noda Time 3.0.0 from the NuGet repository as usual (core and testing packages), or from the links on the Noda Time home page.
Note that the serialization packages were decoupled from the main release during the 2.x releases, and so (for example) there is no new version of NodaTime.Serialization.JsonNet; the current version of that library will work just fine with Noda Time 3.0.0.
Good question. While Noda Time is fairly mature as a library, we do have a few areas we’d like to explore for the future: making use of Span<T> in text parsing, and providing a little more information from CLDR sources (stable timezone IDs, for example). If you’re interested in helping out, come and talk to us on the mailing list.
And once again, I’m going to copy/paste this to produce the official Noda Time blog post. (The evidence suggests that this is the only way I’ll get any content on my personal site, after all.) ↩
[Insert obligatory “well, it’s been a while since I’ve written anything for this blog” paragraph here.]
With 2018 finally complete, I thought it might be fun to take a quick look at the books I read last year. All of these are from my Goodreads profile, though I tend not to write reviews for individual books there.
Goodreads has a “reading challenge” each year wherein you can set a target number of books to read. In 2016, I hit my target of 34 books, albeit only by cramming both the SRE book and The Calendar of the Roman Republic (long story) on the last day of that year. Buoyed by success, I increased it to 38 books for 2017… and then got distracted by life and fell a bit short.
So, for 2018, I kept the same target as for 2017, and tried to not get distracted. A few weeks ago, I’d got a little bit ahead of that — woohoo me! — and decided it might be fun to put together a short review of each. So here are all the books I read in 2018, in (roughly) chronological order.
Robots vs. Fairies, various authors
Starting off 2018, an anthology of short stories: some about robots, and
some about fairies. Definitely mixed, with a few really good ones, and
a few that are… not so good (John Scalzi’s comes to mind as one of the
latter, surprisingly).
The Fifth Season (The Broken Earth, #1), N.K.
Jemisin
So, this I definitely liked. It has a great premise, post-apocalyptic —
or maybe just apocalyptic, given the intro — fantasy, good characters,
and good worldbuilding; it won the 2016 best novel Hugo, and yet… I
haven’t picked up the series again.
I’m not sure exactly why: perhaps because of the writing style (it’s present tense, partly in second person), perhaps because I was irritated by the way the writer withheld some key information the characters knew, or perhaps because of the incomplete ending. It’s possible that the sequels are brilliant, but I haven’t got around to finding out yet. Possibly in 2019.
Dark State (Empire Games, #2), Charles
Stross
Continuing Stross’ reboot of the Merchant Princes series, a
multiple-alternate-timeline spy/techno-thriller. Stross groks politics and
economics (and technology), so this is actually a pretty good alt-history
analysis as well as being a lot of fun. (Although if we could stop heading
towards the dystopian timeline in real life, that’d be great, thanks.)
The Night Masquerade (Binti, #3), Nnedi
Okorafor
Quoting from the one Goodreads review I did write: I was looking forward
to this offering a conclusion to the series. Well, in some ways it does do
that, and in some — quite important — ways, it doesn’t. I think I’d
have been better just appreciating the great world-building here rather
than the plot.
Beneath the Sugar Sky (Wayward Children, #3),
Seanan McGuire
So, what if fairy tales were real? What happens when they’re over? That’s
the premise of this series — in much the same way as Stross’
Equoid asks what it might be like if unicorns
were real (spoilers: sharp horns, so blood, mostly).
This book is almost-standalone, with some of the children from earlier books going on portal-hopping adventures of their own. I liked this one a lot more than the second book in the series, which had a different focus, and was a bit more serious. Also, I’ve just realised that book #4 (In an Absent Dream) is out next week!
The Fox’s Tower and Other Tales, Yoon Ha
Lee
A collection of flash fiction from Yoon Ha Lee, who’s also written some
excellently weird science fiction and interactive fiction. Like
Robots vs. Fairies above, I thought this was somewhat
hit-and-miss.
The stories I enjoyed more tended to be those heavy on imagery and light on ‘plot’ (such plot as is possible with flash fiction), though The Stone-Hearted Soldier was an excellent inclusion, and an exception to that rule (but also one of the longer stories).
An Unkindness of Ghosts, Rivers Solomon
A dystopian space opera set around a study of oppression and segregation
aboard a generation spaceship. The protagonists are incredibly varied and
interesting characters, though the bad guys are unfortunately cardboard.
I remember this being something I wanted to keep reading (if challenging in parts), but I can’t actually remember any of the plot at this point. Minor issues notwithstanding, I definitely enjoyed this.
The Arcadia Project series, #1–3
(Borderline,
Phantom Pains,
Impostor Syndrome),
Mishell Baker
From one set of neuroatypical characters to another. No spaceships here,
but an urban fantasy/mystery that posits a link between fey and Hollywood
celebrity. The whole series is great, the characters are believable and
well-rounded (and self-sabotaging and dysfunctional). I was worried that I
wouldn’t be that interested in a Los Angeles movie-town setting, but the
characters and story won me over.
This series ties with Smoke and Iron (below) as my favourite read of 2018. Recommended.
The Gone World, Tom Sweterlitsch
So, apparently I liked this enough to give it 4/5 on Goodreads, but I
can’t actually remember anything about it. It looks like it’s a time
travel/murder mystery/apocalypse story? Perhaps I should re-read it.
Sleeping Giants (Themis Files, #1), Sylvain
Neuvel
Told via the medium of interviews and news clippings, in the style of
World War Z, this is the story of how the discovery of a
giant robot hand plays out politically. There is some sci-fi here, but
mostly it’s the politics from Arrival that takes centre
stage.
This was alright, but again, I’ve not picked up the next in the series. The journal/interview format makes it hard to get much in the way of interaction between characters, and the story seemed more interested in the politics than in the sci-fi/mystery aspect (which is fine, just not what I was looking for).
Storytelling with Data: A Data Visualization Guide for Business
Professionals, Cole Nussbaumer Knaflic
I think this is what you’d get if you boiled down Tufte’s The
Visual Display of Quantitative Information into
practical advice and case studies, thirty years later. Definitely useful
and interesting, even though this isn’t something I need to do on a
regular basis professionally.
The Red Rising series, #1–4
(Red Rising,
Golden Son,
Morning Star,
Iron Gold),
Pierce Brown
Dystopian sci-fi. The blurb says “Ender’s Game meets The Hunger Games”,
and I suppose that’s about right: the protagonist takes on the elite by
infiltrating them and subverting them from within, only this time we’re
talking about Mars, and later an entire solar system.
I enjoyed the first few books in the series, but somewhere around the third or fourth I started to get a bit tired of the diffusion of the story to uninteresting point-of-view characters, and also of the continuous faux-Roman melodramatics.
The first book is definitely good by itself, and maybe I’ll pick the series up again at some point.
Kindred, Octavia E. Butler
This is also sci-fi, or maybe fantasy1, but is probably simpler to
think of as historical fiction. A modern progressive black woman is
transported to early 19th century Maryland, deep in the
antebellum American South.
With the caveat that “modern” here means the 1970s (the book being published in 1979), this is a fascinating story — if deeply unsettling at times — about how culture shapes behaviour, and how social hierarchies and systems can be justified and propagated by those within the system.
The Pliocene Exile / Galactic Milieu series
(The Many-Coloured Land,
The Golden Torc,
The Nonborn King,
The Adversary;
Intervention;
Jack the Bodiless,
Diamond Mask,
Magnificat),
Julian May
An easy re-read. Julian May’s epic galaxy- and time-spanning series starts
with a fantastic premise: as Earth has joined a galactic federation of
sorts, and as humanity has begun to evolve psionic powers, a misfit group
of disaffected/adventurous travellers escapes into exile via a one-way
time wormhole that deposits them in France, in the Pliocene epoch, 6
million years ago2.
Without spoiling too much, the story shifts very quickly from science fiction to something closer to high fantasy (for the first series, at least; the second is in a more contemporary time period, and is more ‘regular’ sci-fi). Weaving mythology and an epic story, this is well worth the time to read.
A Rag, a Bone and a Hank of Hair, Nicholas
Fisk
This YA dystopia was fun to read when I was a lot younger (it was
published in 1980; I probably read it sometime in the mid-1980s, along
with a lot of other Nicholas Fisk), but it hasn’t really held up that
well. The motivation behind the plot falls apart a bit on any analysis,
and some of the technology is a bit dated now (explanations about
miniature tape recorders, that kind of thing).
However, I do still like how the protagonist learns to interact with the other characters (both modern, and not-so-modern), and how their attitude changes over the course of the story, and I do still appreciate the swerve away from hard sci-fi that happens partway through. It’s flawed, but it’s still a classic.
The Lady Astronaut series, #1–2, plus the initial novelette
(The Lady Astronaut of Mars,
The Calculating Stars,
The Fated Sky),
Mary Robinette Kowal
Alt-history in which the author bootstraps the space race a decade early
via a meteorite-shaped forcing function. Post-steampunk, but
pre-electronic-computer; the author describes it as “punchcard punk”.
This is Hidden Figures meets Apollo 13, with a
strong focus on the racial and gender discrimination of the
1950s3.
(The novelette was published first — winning the 2014 Hugo for best novelette — but is set some thirty or so years after the novels. I read it first, but you could easily read it after: it’s not directly connected to the novels.)
The novels suffer very slightly from telling two separate stories: one is a humanity-against-the-elements story (Apollo 13 or The Martian), while the other is a documentary about 1950s cultural attitudes. Both are interesting stories, but I found it a little frustrating when the story would focus tightly on the protagonist to the exclusion of the wider global impact (pun most definitely intended).
However, overall this is definitely worth reading.
The Labyrinth Index (Laundry Files, #9),
Charles Stross
Well, we’re past the Lovecraftian singularity at this point, and it’s all
about surviving while the transhumans play. One of whom happens to be
inhabiting the Prime Minister at present, and who has opinions about
foreign policy.
Mhari, who we met in her current incarnation in The Rhesus Chart a while back, is presently attempting to stay alive while said elder god is playing eleven-dimensional chess nearby. Meanwhile, the US appears to have collectively forgotten that the executive branch exists…
I liked this a lot. Mhari was interesting without being annoying, as I worried she might be (she was in some of the earlier books; deliberately so in order to annoy Bob, I think). Otherwise, this was pretty much exactly as I expected at this point in the series: a lot of fun.
Revenant Gun (The Machineries of Empire, #3),
Yoon Ha Lee
Yoon Ha Lee’s conclusion to the series, about a 400-year-old immortal general and crazy
magic that works because of a shared consensual reality. It’s military
sci-fi, kinda?
I can’t really discuss this without spoilers, but while it did more hand-holding than earlier books in the series, it still featured a lot of creative worldbuilding.
Lies Sleeping (Rivers of London, #7), Ben
Aaronovitch
Like The Labyrinth Index above, by the time you get this far
into a series, you pretty much know what to expect: in this case, a fun
police procedural with magic and geeky in-jokes.
However, I did find it a bit hard to follow what was going on with the plot here, which seemed to be both a bit muddled and to reach back over the whole of the series. (I’ve also not read the associated graphic novels, which might have helped, though they’re not supposed to be necessary prerequisites.)
Side-note: an interesting article about intersectionality in the Rivers of London series.
A Canticle For Leibowitz, Walter M. Miller
Jr.
A classic (1959) post-apocalyptic sci-fi tale published during a high
point in Cold War tensions. In the far aftermath of nuclear war, society
struggles to drag itself out of a new dark age, and to rediscover and
protect old knowledge. This is three distinct stories — originally
published as such — separated by time (centuries), and vaguely connected
by place.
This unapologetically puts forward a Christian (specifically, Catholic) viewpoint, with the church to some extent a main character. It has some ironic humour, but also serious comment about ethics and human nature. With one exception near the end, I didn’t find it to be too preachy.
It made a big impact at the time, but is it actually a good story nowadays? Well, meh. I found it thought-provoking (and somewhat depressing) in turns, but I can’t actually say that the story is much more than a framework for the author’s viewpoints. Largely unsatisfying, and probably more important for the historical context now.
Smoke and Iron (The Great Library, #4), Rachel
Caine
Okay, this is just brilliant. Along with The Arcadia Project series
(above), this was easily one of my favourite reads of 2018.
So, why? Well, it’s got good worldbuilding, a fast-paced (and fun) plot, great characters and character development, and good writing.
The plot itself starts immediately after Ash and Quill, so talking about the plot directly would spoil the earlier books. In general, though, this series is a YA alt-history/fantasy in which the Great Library (of Alexandria) has become a ruthless worldwide power, tightly controlling both the dissemination of information and also the source for some of the magic/alchemy that’s available in this world.
On the writing: one section in particular has the viewpoint character magically hypnotized into believing that they’re someone else, and the author shifts the (tight third-person) text to match that impersonated character, having the viewpoint character not just act as another, but having the prose notice (and the character comment on internally) an entirely different set of things appropriate for the character they were impersonating. Subtle, but I liked it.
The Murderbot Diaries series
(All Systems Red,
Artificial Condition,
Rogue Protocol,
Exit Strategy),
Martha Wells
Nom nom nom. These were great. I inhaled the whole series all in one go.
Murderbot is a fairly apathetic and introverted humanform security droid that just wants to be left alone to watch sci-fi soap operas, but stupid humans keep doing stupid things that stop it from doing so, or worse, are trying to interact with it rather than let it stand in a corner by itself (to watch soap operas again, probably).
This is a series of four novellas written with Murderbot narrating, and it’s delightful. They are short, so each has a fairly straightforward plot, but it’s great fun nonetheless.
Ra, Sam Hughes
On the one hand, Ra is excellent: it’s a hard sci-fi novel (novella?)
with some really well thought-through worldbuilding. To some extent, it
puts me in mind of Snow Crash. (It also has some
really nice in-jokes, which I don’t think I can reference without being
spoilery.)
It was published in chapters on Sam Hughes’ blog (at qntm.org/ra, where you can read it for free), and there are also a few EPUB versions, some of which you can choose to pay for.
So as a self-published story, it’s really rather good. Unfortunately, on the other hand, I think it could also do with some quite significant editing, as there seem to be two almost completely different stories here, and while they’re linked, the story switches at one point from something grounded (like Snow Crash) to something incomprehensible by Greg Egan, and while both are good, I don’t think they fit well together.
To sum up: I managed to read 40 books last year, almost all of which were fiction, mostly urban fantasy and sci-fi, to nobody’s surprise. (I also started and failed to finish a bunch of non-fiction books).
I think I did a better job of picking books with diverse protagonists this time round, and while most of the books I read were published in the last few years (40% were published in 2018), I managed to also seek out a few older ones (Kindred, for example, I’m really glad I got round to reading).
Onward to 2019!
I’d have called it sci-fi purely because it has time-travel, but I ran across an interview with Butler in which she points out, “Kindred is fantasy. I mean literally, it is fantasy. There’s no science in Kindred.” She has a point. ↩
… though from what I can tell, 6 Ma is squarely in the Miocene epoch, not the Pliocene. In A Pliocene Companion, Word of God resolves this by stating that, in-universe, the Pliocene is considered to start around 11 Ma (not 5.6 or 5.33 Ma, as in our reality). ↩
And to a large extent, discrimination that’s still present today: there’s a line where our heroine says that “people would ignore what I said until [my husband] repeated it”, which sounds familiar enough. ↩
I have invalidated
the assumptions
that your code
depended upon
Forgive me
they were so well hidden
and so fragile
— Reid McKenzie, Twitter
I recently ran across the fact that it’s possible to make the Java runtime deadlock while initialising a class — and that this behaviour is even mandated by the Java Language Specification.
Here’s a Java 7 program that demonstrates the problem:
public class Program {
    public static void main(String args[]) {
        new Thread(new Runnable() {
            @Override public void run() {
                A.initMe();
            }
        }).start();
        B.initMe();
    }

    private static class A {
        private static final B b = new B();
        static void initMe() {}
    }

    private static class B {
        private static final A a = new A();
        static void initMe() {}
    }
}
In addition to demonstrating that lambdas are a good idea (all that boilerplate to start a thread!), this also shows how cycles during class initialisation can lead to a deadlock. Here’s what happens when you run it1:
$ javac Program.java
$ java Program
That is, it hangs.
In Java, classes are loaded at some arbitrary point before use, but are only initialised — running the static {} blocks and static field initialisers — at defined points2. One of these points is just before a static method is invoked, and so the two calls to A.initMe() and B.initMe() above will both trigger initialisation for the respective classes.
In this case, each class contains a static field that instantiates an instance of the other class. Instantiating the other class requires that that class is initialised, and so what we end up with is that each class’s initialisation is blocked waiting for the initialisation of the other class to complete.
If you trigger a thread dump at this point — by sending a SIGQUIT or hitting Ctrl-\ (or Ctrl-Break on Windows) — then you’ll see something like this:
Full thread dump OpenJDK 64-Bit Server VM (24.79-b02 mixed mode):
"Thread-0" prio=10 tid=0x00007efd50105000 nid=0x51db in Object.wait() [0x00007efd3f168000]
java.lang.Thread.State: RUNNABLE
at Program$A.<clinit>(Program.java:13)
at Program$1.run(Program.java:5)
at java.lang.Thread.run(Thread.java:745)
"main" prio=10 tid=0x00007efd5000a000 nid=0x51ca in Object.wait() [0x00007efd59d45000]
java.lang.Thread.State: RUNNABLE
at Program$B.<clinit>(Program.java:18)
at Program.main(Program.java:9)
[...]
Interestingly, you can see that while both threads are executing an implicit Object.wait(), they’re listed as RUNNABLE rather than WAITING, and there’s no output from the deadlock detector. I suspect that the reason for both of these is that the details of class initialisation changed in Java 7: in Java 6, the runtime would attempt to lock the monitor owned by each Class instance for the duration of the initialisation, while in Java 7, attempting to initialise a class that’s already being initialised by another thread just requires that the caller be blocked in some undefined fashion until that initialisation completes.
There are other ways to trigger the same problem, too. Here’s another problematic snippet:
public class Foo {
public static final Foo EMPTY = new EmptyFoo();
}
public class EmptyFoo extends Foo {}
Here we have Foo, and EmptyFoo, a special — presumably empty, in some fashion — version of Foo. EmptyFoo is usable directly, but it’s also available as Foo.EMPTY.
The problem here is that initialising EmptyFoo requires us to initialise the superclass, and initialising Foo requires initialisation of EmptyFoo for the static field. This would be fine in one thread, but if two threads attempt to initialise the two classes separately, deadlock results.
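A hypothetical driver class (in the same Java 7 style as the first example, and not part of the original snippet) shows how that can happen:

public class Main {
    public static void main(String[] args) {
        // This thread initialises EmptyFoo, which first needs its superclass Foo...
        new Thread(new Runnable() {
            @Override public void run() {
                new EmptyFoo();
            }
        }).start();
        // ...while the main thread initialises Foo, whose static field needs EmptyFoo.
        Foo empty = Foo.EMPTY;
    }
}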
Cyclic dependencies between classes have always been problematic in both Java and C#, as references to non-constant static fields in classes that are already being initialised see uninitialised (Java) or default (C#) values. However, normally the initialisation does complete; here, it doesn’t, and here the dependencies are simply between the classes, not between their data members.
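As an illustration of that single-threaded case (a hypothetical pair of classes, not from the post):

class X { static int value = Y.value + 1; }
class Y { static int value = X.value + 1; }
// If X is initialised first, Y's initialiser reads X.value while it is still 0,
// so Y.value becomes 1 and X.value then becomes 2; initialise Y first and the
// results swap. Initialisation completes, but one class has seen a default value.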
Unfortunately, I don’t know of any convenient way to detect these cycles in Java: OpenJDK provides -XX:+TraceClassInitialization, which I suspect might be useful, but it’s only available in debug builds of the OpenJDK JRE3, and I haven’t been able to confirm exactly what it shows.
And for what it’s worth, I’m not aware of a better solution for detecting cycles in C# either. For Noda Time, we used a custom cycle detector for a while; it spotted some bugs resulting from reading default values, but it was too brittle and invasive (it required modifying each class), and so we removed it before 1.0.
I suppose that if we assume that class initialisation occurs atomically and on multiple threads, then this kind of problem is bound to come up4. Perhaps what’s surprising is that these languages do allow the use of partially-initialised classes in the single-threaded case?
If videos are your thing, the folks at Webucator have turned this post into a video as part of their free (registration required) Java Solutions from the Web course. They also offer a series of paid Java Fundamentals classes covering a variety of topics.
Or at least, what happens when I run it, on a multiprocessor Debian machine running OpenJDK 7u79. I don’t think the versions are particularly important — this behaviour seems to be present in all Java versions — though I am a little surprised that I didn’t need to add any additional synchronisation or delays. ↩
A similar situation exists in C# for classes with static constructors (for classes without, the runtime is allowed much more latitude as to when the type is initialised). ↩
You can trace class loading with -XX:TraceClassLoadingPreorder and -XX:TraceClassLoading, but this doesn’t tell you when class initialisation happens. ↩
He says, with a sample size of one. I haven’t managed to confirm what C# does, for example, and C++ avoids this problem by replacing it with a much larger one, the “static initialisation order fiasco”. ↩
(This is a quick post for search-engine fodder, since I didn’t manage to find anything relevant myself.)
If you’re using pip install --isolated to install Python packages and find that it fails with an error like the following:
Complete output from command python setup.py egg_info:
usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
or: -c --help [cmd1 cmd2 ...]
or: -c --help-commands
or: -c cmd --help
error: option --no-user-cfg not recognized
… then you might have run into an incompatibility between pip and Python versions 3.0–3.3.
pip version 6.0 added an isolated mode (activated by the --isolated flag) that avoids looking at per-user configuration (pip.conf and environment variables). Running pip in isolated mode also passes the --no-user-cfg flag to Python’s distutils to disable reading the per-user ~/.pydistutils.cfg. But that flag isn’t available in Python versions 3.0–3.3, causing the error above.
I ran into this because I recently migrated the Python code that generates this site to run under Python 3.x. I’m using a virtualenv setup, so once I had everything working under both Python versions, I was reasonably confident that I could switch ‘production’ (i.e. the Compute Engine instance that serves this site) to Python 3 and discard the 2.x-compatibility code.
Good thing I tested it out first, since it didn’t even install.
It turns out that --no-user-cfg was added in Python 2.7, but wasn’t ported to 3.x until 3.42.
I worked around this by just omitting the --isolated flag for Python versions [3.0, 3.4) — though since I don’t actually have any system config files in practice, I probably could have set PIP_CONFIG_FILE=/dev/null instead (which has the effect of ignoring all config files).
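As a sketch of that version-dependent workaround (the version test is mine, and “somepackage” is a placeholder):

# Skip --isolated on Python 3.0-3.3, where distutils doesn't understand --no-user-cfg.
if python -c 'import sys; sys.exit(0 if (3, 0) <= sys.version_info < (3, 4) else 1)'; then
    pip install somepackage
else
    pip install --isolated somepackage
fi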
I’m not the first person to have noticed that virtualenv isn’t actually hermetic. Though some of that rant is out of date now (Python wheel files provide prebuilt binaries), and some isn’t relevant to the way I’m using virtualenv/pip, it’s definitely true that the dependency on the system Python libraries is the main reason I’d look to something more like Docker or Vagrant for deployment were I doing this professionally.
So did I finally manage to switch to Python 3.x after that? Not even close:
Python 3.x didn’t gain the ability to (redundantly) use the u'foo' syntax for Unicode strings until 3.3, and some of my dependencies use that syntax. So I’m waiting until I can switch to Debian 8 on Compute Engine3, at which point I can cleanly assume Python 3.4 or later.
This is a rant for another day, but it looks like virtualenv monkeypatches pip, which monkeypatches setuptools, which either monkeypatches or builds upon distutils. Debugging through this edifice of patched abstractions is… not easy. ↩
It’s a bit more complex than that: Python 3.0 and 3.1 were released first, then the feature was implemented in both 2.7 and 3.2, but then distutils as a whole was rolled back to its 3.1 state before 3.2 was released. That rollback was reverted for Python 3.4. ↩
I can apt-get dist-upgrade from the Debian 7 image just fine, but it’s a bit slow and hacky, so I’d rather wait for official images. (I also need to fix some custom mail-related configuration that appears to have broken under Debian 8.) ↩
It’s been a while since the last Noda Time release, and while we’re still working towards 2.0, we’ve been collecting a few bug fixes that can’t really wait. So last Friday1, we released Noda Time 1.3.1.
Noda Time 1.3.1 updates the built-in version of TZDB from 2014e to 2015a, and fixes a few minor bugs, two of which were triggered by recent data changes.
Since it’s been a while since the previous release, it may be worth pointing out that new Noda Time releases are not the only way to get new time zone data: applications can choose to load an external version of the time zone database rather than use the embedded version, and so use up-to-date time zone data with any version of the Noda Time assemblies.
If you’re in a hurry, you can get Noda Time 1.3.1 from the NuGet repository (core, testing, JSON support packages), or from the links on the Noda Time home page. The rest of this post talks about the changes in 1.3.1 in a bit more detail.
In the middle of 2009, Bangladesh started observing permanent daylight saving time, as an energy-saving measure. This was abandoned at the end of that year, and the country went back to permanent standard time.
Until recently, that transition back to standard time was actually recorded as happening a minute too early, at 23:59 on December 31st. TZDB 2014g fixed this by changing the transition time to “24:00” — that is, midnight at the end of the last day of the year.
Noda Time could already handle transitions at the end of the day, but would incorrectly ignore this particular transition because it occurred ‘after’ 2009. That’s now fixed, and Noda Time 1.3.1 returns the correct offset for Asia/Dhaka when using data from TZDB 2014g or later.
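A quick way to see the effect of the fix (a sketch; any instant after the end of 2009 would do):

var dhaka = DateTimeZoneProviders.Tzdb["Asia/Dhaka"];
var offset = dhaka.GetUtcOffset(Instant.FromUtc(2010, 6, 1, 0, 0));
// With TZDB 2014g or later and Noda Time 1.3.1, this is +06 rather than +07.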
In October 2014, most of Russia switched from permanent daylight saving time to permanent standard time, effectively moving local time back one hour. These changes were included in TZDB 2014f.
For people using the BCL provider instead of the TZDB provider (and using Windows), Microsoft delivered a hotfix in September 2014. However, our BCL provider depends upon the .NET framework’s TimeZoneInfo class, and the .NET framework — unlike TZDB — is unable to represent historical changes to the ‘base’ offset of a time zone (as happened here).
The result is that Noda Time (and other applications using TimeZoneInfo in .NET 4.5.3 and earlier) incorrectly compute the offset for dates before October 26th, 2014.
A future update of the .NET framework should correct this limitation, but without a corresponding change in Noda Time, the extra information wouldn’t be used; Noda Time 1.3.1 prepares for this change, and will use the correct offset for historical dates when TimeZoneInfo does.
The time zones returned by the BCL provider have long had a limitation in the way time zone equality was implemented: a BCL time zone was considered equal to itself, and unequal to a time zone returned by a different provider, but attempting to compare two different BCL time zone instances for equality always threw a NotImplementedException. This was particularly annoying for ZonedDateTime, as its equality is defined in terms of the contained DateTimeZone.
This was documented, but we always considered it a bug, as it wasn’t possible to predict whether testing for equality would throw an exception. Noda Time 1.3.1 fixes this by implementing equality in terms of the underlying TimeZoneInfo: BCL time zones are considered equal if they wrap the same underlying TimeZoneInfo instance.
Note that innate time zone equality is not really well defined in general, and is something we’re planning to reconsider for Noda Time 2.0. Rather than rely on DateTimeZone.Equals(), we’d recommend that applications that want to compare time zones for equality use ZoneEqualityComparer to specify how two time zones should be compared.
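For example (a sketch; zone1 and zone2 stand in for whichever two zones you want to compare):

var interval = new Interval(Instant.FromUtc(2000, 1, 1, 0, 0), Instant.FromUtc(2020, 1, 1, 0, 0));
var comparer = ZoneEqualityComparer.ForInterval(interval);
bool equivalent = comparer.Equals(zone1, zone2);  // equal behaviour over that interval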
There are a handful of other smaller fixes in 1.3.1: the NodaTime assembly correctly declares a dependency on System.Xml, so you won’t have to; the NuGet packages now work with ASP.NET’s kpm tool, and declare support for Xamarin’s Xamarin.iOS (for building iOS applications using C#) in addition to Xamarin.Android, which was already listed; and we’ve fixed a few reported documentation issues along the way.
As usual, see the User Guide and 1.3.1 release notes for more information about all of the above.
Work is still continuing on 2.0 along the lines described in our 1.3.0 release post, and we’re also planning a 1.4 release to act as a bridge between 1.x and 2.0. This will deprecate members that we plan to remove in 2.0 and introduce the replacements where feasible.
Release late on Friday afternoon? What could go wrong? Apart from running out of time to write a blog post, I mean. ↩
Some years back, I posted a graph showing the growth of Subversion’s codebase over time, and I thought it might be fun to do the same with Noda Time. The Subversion graph shows the typical pattern of linear growth over time, so I was expecting to see the same thing with Noda Time. I didn’t1.
Noda Time’s repository is a lot simpler than Subversion’s (it’s also at least an order of magnitude smaller), so it wasn’t that difficult to come up with a measure of code size: I just counted the lines in the .cs files under src/NodaTime/ (for the production code) and src/NodaTime.Test/ (for the test code).
I decided to exclude comments and blank lines this time round, because I wanted to know about the functional code, not whether we’d expanded our documentation. As it turns out, the proportion of comments has stayed about the same over time, but that ratio is very different for the production code and test code: comments and blank lines make up approximately 50% of the production code, but only about 20–25% of the test code.
Here’s the graph. It’s not exactly up-and-to-the-right, more… wibbly-wobbly-timey-wimey.
There are some things that aren’t surprising: during the pre-1.0 betas (the first two unlabelled points) we actively pruned code that we didn’t want to commit to for 1.x2, so the codebase shrinks until we release 1.0. After that, we added a bunch of functionality that we’d been deferring, along with a new compiled TZDB file format for the PCL implementation. So the codebase grows again for 1.1.
But then with 1.2, it shrinks. From what I can see, this is mostly due to an internal rewrite that removed the concept of calendar ‘fields’ (which had come along with the original mechanical port from Joda Time). This seems to counterbalance the fact that at the same time we added support for serialization3 and did a bunch of work on parsing and formatting.
1.3 sees an increase brought on by more features (new calendars and APIs), but then 2.0 (at least so far) sees an initial drop, a steady increase due to new features, and (just last month) another significant drop.
The first decrease for 2.0 came about immediately, as we removed code that was deprecated in 1.x (particularly, the handling for 1.0’s non-PCL-compatible compiled TZDB format). Somewhat surprisingly, this doesn’t come with a corresponding decrease in our test code size, which has otherwise been (roughly speaking) proportional in size to the production code (itself no real surprise, as most of our tests are unit tests). It turns out that the majority of this code was only covered by an integration test, so there wasn’t much test code to remove.
The second drop is more interesting: it’s all down to new features in C# 6. For example, in Noda Time 1.3, Instant has Equals() and GetHashCode() methods that are written as follows:
public override bool Equals(object obj)
{
    if (obj is Instant)
    {
        return Equals((Instant)obj);
    }
    return false;
}

public override int GetHashCode()
{
    return Ticks.GetHashCode();
}
In Noda Time 2.0, the same methods are written using expression-bodied members, in two lines (I’ve wrapped the first line here):
public override bool Equals(object obj) =>
obj is Instant && Equals((Instant)obj);
public override int GetHashCode() => duration.GetHashCode();
That’s the same functionality, just written in a terser syntax. I think it’s also clearer: the former reads more like a procedural recipe to me; the latter, a definition.
Likewise, ZoneRecurrence.ToString() uses expression-bodied members and string interpolation to turn this:
public override string ToString()
{
    var builder = new StringBuilder();
    builder.Append(Name);
    builder.Append(" ").Append(Savings);
    builder.Append(" ").Append(YearOffset);
    builder.Append(" [").Append(fromYear).Append("-").Append(toYear).Append("]");
    return builder.ToString();
}
into this:
public override string ToString() =>
$"{Name} {Savings} {YearOffset} [{FromYear}-{ToYear}]";
There’s no real decrease in test code size though: most of the C# 6 features are really only useful for production code.
All in all, Noda Time’s current production code is within 200 lines of where it was back in 1.0.0-beta1, which isn’t something I would have been able to predict. Also, while we don’t quite have more test code than production code yet, it’s interesting to note that we’re only about a hundred lines short.
Does any of this actually matter? Well, no, not really. Mostly, it was a fun little exercise in plotting some graphs.
It did remind me that we have certainly simplified the codebase along the way — removing undesirable APIs before 1.0 and removing concepts (like fields) that were an unnecessary abstraction — and those are definitely good things for the codebase.
And it’s also interesting to see how effective the syntactic sugar in C# 6 is in reducing line counts, but the removal of unnecessary text also improves readability, and it’s that readability that’s the key part here, rather than the number of lines of code that results.
But mostly I just like the graphs.
Or, if you prefer BuzzFeed-style headlines, “You won’t believe what happened to this codebase!”. ↩
To get to 1.0, we removed at least: a verbose parsing API that tried to squish the Noda Time and BCL parsing models together, an in-code type-dependency graph checker, and a very confusingly-broken CultureInfo replacement. ↩
I’m not counting the size of the NodaTime.Serialization.JsonNet package here at all (nor the NodaTime.Testing support package), so this serialization support just refers to the built-in XML and binary serialization. ↩
If you have a privileged process that needs to invoke a less-trusted child process, one easy way to reduce what the child is able to do is to run it under a separate user account and use ssh to handle the delegation.
This is pretty simple stuff, but as I’ve just wasted a day trying to achieve the same thing in a much more complicated way, I’m writing it up now to make sure that I don’t forget about it again.
(Note that this is about implementing privilege separation using ssh, not about how ssh itself implements privilege separation; if you came here for that, see the paper Preventing Privilege Escalation by Niels Provos et al.)
In my case, I’ve been migrating my home server to a new less unhappy machine, and one of the things I thought I’d clean up was how push-to-deploy works for this site, which is stored in a Mercurial repository.
What used to happen was that I’d push from wherever I was editing, over ssh, to a repository in my home directory on my home server, then a changegroup hook would update the working copy (hg up) to include whatever I’d just pushed, and run a script (from the repository) to deploy to my webserver. The hook script that runs sends stdout back to me, so I also get to see what happened.
(This may sound a bit convoluted, but I’m not always able to deploy directly from where I’m editing to the webserver. This also has the nice property that I can’t accidentally push an old version live by running from the wrong place, since history is serialised through a single repository.)
The two main problems here are that pushing to the repository has the surprising side-effect of updating the working copy in my home directory (and so falls apart if I accidentally leave uncommitted changes lying around), and that the hook script runs as the user who owns the repository (i.e. me), which is largely unnecessary.
For entirely separate reasons, I’ve recently needed to set up shared Mercurial hosting (which I found to be fairly simple, using mercurial-server), so I now have various repositories owned by a single hg user.
I don’t want to run the (untrusted) push-to-deploy scripts directly as that shared user, because they’d then have write access to all repositories on the server. (This doesn’t matter so much for my repositories, since only I can write to them, and it’s my machine anyway, but it will for some of the others.)
In other words, I want a way to allow one privileged process (the Mercurial server-side process running as the hg user) to invoke another (a push-to-deploy script) in such a way that the child process doesn’t retain the first process’s privileges.
There are lots of ways to achieve this, but one of the simplest is to run the two processes under different user accounts, then either find a way to communicate between two always-running processes (named pipes or shared memory, for example), or for one to invoke the other directly.
The latter is more appropriate in this case, and while the obvious way for a (non-root) user to run a process as another is via sudo, the policy specification for that (in /etc/sudoers) is… complicated. Happily, there’s a simpler way that only requires editing configuration files owned by the two users in question: ssh.
The setup is fairly easy: I’ve created a separate user that will run the push-to-deploy script (hg-blog), generated a password-less keypair for the calling (hg) user, and added the public key (with from= and command= options) to /home/hg-blog/.ssh/authorized_keys.
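The resulting authorized_keys entry looks something like this (the restriction options are standard ones; the source addresses, script path and key are placeholders):

from="127.0.0.1,::1",command="/home/hg-blog/bin/push-to-deploy",no-pty,no-port-forwarding ssh-rsa AAAA... hg@homeserver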
Now the Mercurial server-side process can trigger the push script simply by creating a $REPOS/.hg/hgrc containing:
[hooks]
changegroup.autopush = ssh hg-blog@localhost
This automatically runs the command I specified in the target user’s authorized_keys, so I don’t even have to worry about listing it here1.
In conclusion, ssh is a pretty good tool for creating a simple privilege separation between two processes. It’s ubiquitous, and doesn’t require root to do anything special, and while the case I’m using it for here involves two processes on the same machine, there’s actually no reason that they couldn’t be on different machines.
The ‘right’ answer may well be to run each of these as Docker containers, completely isolating them from each other. I’m not at that point yet, and in the meantime, hopefully by writing this up I won’t forget about it the next time I need to do something similar!
In this case, adding a command restriction doesn’t protect against a malicious caller, since the command that’s run immediately turns around and fetches the next script to run from that same caller. It does protect against someone else obtaining the (password-less by necessity) keypair, I suppose, though the main reason is the one listed above: it means that ‘what to do when something changes’ is specified entirely in one place. ↩
This month, I decided to do something about the way this site rendered on mobile devices. Now that it works reasonably well, I thought it might be interesting to talk about what I needed to change — which, as it turned out, wasn’t that much.
First off, here’s what things used to look like on a Nexus 6 (using my pangrammatic performance post as an example).
Double-tapping on a paragraph zooms to fit the text to the viewport, which produces something that’s fairly readable, but you can still scroll left and right into dead space.
As well as making it a pain to just scroll vertically, this also caused other problems, like the way that double-tapping on bulleted lists (which have indented left margins) would zoom the viewport such that it cropped the left edge of the main content area.
This is all pretty terrible, of course, and about par for the course for mobile browsers.
So what’s going on here? Well, for legacy reasons, mobile browsers typically default to rendering a faux-desktop view, first by setting the viewport so that it can contain content with a fixed “fallback” width (usually around 1000px), and then by fiddling with text sizes to make things more readable.
This behaviour can be overridden fairly easily using the (de facto standard, but not particularly well defined) meta viewport construct. For example, this is what I needed to include to revert to a more sensible behaviour:
<meta name=viewport
content="width=device-width, initial-scale=1">
The two clauses have separate and complementary effects:
- width=device-width sets the viewport width to the real screen width rather than the fallback width.
- initial-scale=1 both sets an initial 1:1 zoom level, and also maintains that zoom level during device rotation (rather than maintaining the viewport width, as is apparently done by some devices). Importantly, the user isn’t restricted from zooming in further.
(This is all explained in rather more detail in the Google Developers document that I linked to above1.)
In practice, I’d recommend just taking the snippet above as a cargo-cultable incantation that switches off the weird faux-desktop rendering and fits the content to the screen.
So, after I’ve added the above, I’m done? Not quite.
Our viewport still needs to be scrolled horizontally to reach some of the content, which is far from ideal, and we’ve no longer got any left-hand margin at all. All in all, it’s pretty hard to read our content even though it’s now zoomed in.
It’s probably worth taking a step back to look at the layout we’re using.
The overall page structure here is pretty trivial, roughly:
body {
max-width: 600px;
margin: 0 auto;
}
This centres the <body> in the viewport, allowing it to expand up to 600px wide.
We can fix the disappearing margins with body { padding: 0 1em; } (which only has an effect if the body would otherwise be flush to the viewport edges), and while we’re here, we might as well change that max-width: 600px to something based on ems (I went for max-width: 38em).
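Putting that together, the body rule ends up as something like this (a sketch using the values described above):

body {
  max-width: 38em;
  margin: 0 auto;
  padding: 0 1em;
}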
Most of the content of <body> is text in paragraphs; that’s fine. The two immediate problems are code snippets (in <pre> blocks), and images.
Right away we can see a problem: the images have a declared width and height, and aren’t going to adapt if the width of the <body> element changes.
The code snippets have a related problem: <pre> text won’t reflow, and the default CSS overflow behaviour allows block-level content to overflow its content box, expanding the viewport’s canvas and reintroducing horizontal scrolling2.
We can fix the code snippets fairly easily by enabling horizontal scrollbars for the snippets where needed:
pre {
overflow: auto;
overflow-y: hidden;
}
This uses overflow, a CSS 2.1 property, to ensure that content is clipped to the content box, adding scrollbars if needed. It then uses overflow-y, a CSS3 property, to remove any vertical scrollbars, leaving us with only the horizontal scrollbars (or none). If the overflow-y property isn’t supported (and in practice it is), the browser will still render something reasonable.
That doesn’t help with the images, of course. The term you’ll want to search for is “responsive images”, but what we’re actually going to do is size the image so that it fits within the space available3.
One easy way to do this is to simply replace:
<img src="kittens" width="400" height="300">
with
<img src="myimage" style="width: 100%">
and, broadly speaking, that’s what I’m now doing4. Note that you do need to drop the height property (and so might as well drop width too), otherwise you’ll have an image with a variable width and fixed height (which doesn’t work so well, as you might imagine).
There are some caveats with older versions of Internet Explorer (aren’t there always?) but in my case I’ve decided that I’m only interested in supporting IE9 and above5, so these don’t apply.
But wait a sec: we declared the image’s dimensions in the first place so that the browser could reserve space for the image, rather than reflowing the page as it downloaded them. Does this mean that we need to abandon that property?
Maybe. Somewhat surprisingly, there isn’t any way (yet6) to declare the aspect ratio (or, equivalently, original size) of an image while also allowing it to be resized to fit a container. However, all’s not lost: for common image aspect ratios, we can adopt a technique documented by Anders Andersen where we prevent reflow by pre-sizing a container to a given aspect ratio.
The tl;dr is that we use something like the following markup instead:
<div class="ratio-16-9">
<img src="myimage" style="width: 100%">
</div>
We then pre-size the containing div using the CSS rule padding-bottom: 56.25% (9/16 = 0.5625; CSS percentages refer to the container’s width), and position the image over the div using absolute positioning, taking it out of the flow.
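The supporting CSS looks roughly like this (a sketch following Anders Andersen’s technique, using the class name from the markup above):

.ratio-16-9 {
  position: relative;
  height: 0;
  padding-bottom: 56.25%; /* 9/16 of the container's width */
}
.ratio-16-9 img {
  position: absolute;
  top: 0;
  left: 0;
  width: 100%;
}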
This works, but there are some caveats: it only works for images with common aspect ratios, of course (4:3 and 16:9 are pretty common, but existing images might have any aspect ratio), and, as written, it only works for images that are sized to 100% of the container’s width (though you could handle fixed smaller sizes as well, if desired).
In my case, I elected to make all images sized to 100% of the viewport width (which works well, mostly), and applied the reflow-avoidance workaround only to those images with 16:9 or 4:3 aspect ratios, leaving the others to size-on-demand.
I did notice some surprising rounding differences on Chrome that led me to reduce that 56.25% of padding to 56.2% (which may truncate the image by a pixel or two; better than allowing the background to show through, though). I suspect this may be because Chrome allows HTML elements in general to have fractional CSS sizes, while it appears to restrict images to integral pixel sizes.
This gave me pretty good results, but I also took the opportunity to make a few other changes to make things work a little better:
- Some images are now served at multiple resolutions, using the <img src="..." srcset="..."> syntax. In this case, small-screen devices (e.g. the iPhone 3G, if it supported the syntax) get a 320×182 image, desktop browsers get a 750×422 image, and hi-res devices like the Nexus 6 get a 1440×810 image. It’s not quite working completely right yet, but it looks promising.
- One earlier post embedded its graphs as an <object> with an image content and fallback <table> content, which just about worked (and looked great in Lynx!), but which wasn’t easy to adjust to fit to the new approach. It doesn’t look like the new HTML5 image features (<img srcset>, and <picture>, which I’m not using) have any support for rendering arbitrary HTML in place of an image, which is a bit of a shame, but probably a reasonable trade-off.
It’s worth noting that a lot of these changes also improved the site on desktop browsers. That’s not really surprising: “mobile-friendly” is more about adaptability than a particular class of device.
So there you have it: for a good mobile site, you may only have to a) add a meta viewport tag, and b) size your content (particularly images) to adapt to the changing viewport width.
Here are some resources (some of which I mentioned above) that I found useful:
- What meta viewport actually does.
- <picture> and <img srcset>.
Somewhat surprisingly, this is the best reference I’ve found for what the meta viewport tag actually does. ↩
In theory, the same problem can occur for other elements; for example, an unbreakable URL in running text can cause a <p> element to overflow. In practice, though, that’s not something that I’ve found worth handling. ↩
There is more to responsive images than just resizing. For example, you can serve completely different images to different devices using media queries (so-called “art direction”). However, that’s way more complicated than what I needed. ↩
You can alternatively use max-width if you only want to shrink images wider than their container; I also wanted to enlarge the smaller ones. ↩
Why only IE9? It’s available on everything going back to Windows Vista, and it’s the first version to support SVG natively and a bunch of CSS properties that I’m using (::pseudo-elements, not(), box-shadow, to name a few). Windows XP users could well have trouble connecting to this server in the first place anyway, due to the SSL configuration I’m using, so requiring IE9/Vista doesn’t seem too unreasonable. ↩
From what I’m led to believe, this is being actively worked on. ↩
This is a machine running on the end of an ADSL line. It’s not a very happy machine:
$ uptime
11:28:01 up 781 days, 1:39, 1 user, load average: 2.01, 2.03, 2.05
It’s actually idle, so why is the load average above 2.0? Because there’s an unkillable mdadm process stuck in a D state, and a second mount process that’s permanently runnable.
So why haven’t I just rebooted it (and better still, upgraded it: obviously it’s running an old kernel)? Because I’m not entirely convinced it’ll start up again: the disks were acting a bit suspiciously, and lately the PSU fan has been making a bit of a racket as well.
Unfortunately, it’s also a machine that’s accumulated infrastructure that I care about: DNS, Apache, and so on. The data is safely backed up off-machine, but if I just tear it down, a bunch of things will be broken while I’m rebuilding it. So instead, I’ve been trying to decommission it piece-by-piece.
I’ve also got a bit bored running all my own infrastructure, so some of those moving parts have been put onto dedicated consumer hardware (getting the router to handle internal DNS and DHCP, getting a Synology NAS for Samba, etc), and I’ve moved some others onto a hosted VM, so that I don’t have to worry about the hardware: that copy of Apache has been (mostly) obsoleted by moving this site to Google Compute Engine last January, for example.
But there’s still a few things that I’m depending upon this machine for.
Until recently, one was as the primary DNS server for farside.org.uk
.
I was using a free secondary DNS service from BuddyNS: they provide secondary servers that I listed as the domain’s nameservers, and those did regular zone transfers from my server, which remained the source of truth.
That was pretty convenient, and BuddyNS have been pretty great (the free tier is good for up to 300K queries per month, of which I was using about 70-100K), but they only provide secondary DNS, so I went looking for another solution.
I’m sure that there are many other DNS providers around, but since I’m
hosting www.farside.org.uk
on Google Compute Engine, I decided to try out
Google Cloud DNS, which provides a simple primary DNS service,
available via anycast over both IPv4 and IPv6 (that
arrangement seems to be fairly standard for DNS providers nowadays).
This one’s not free, but it is pretty cheap: US$0.20/month per domain, plus US$0.40/month per million queries. For me, that should work out to less than $3/year1.
Otherwise, it seems to be broadly similar to other DNS providers. You can make updates via a JSON/REST API, and API client libraries and a basic command-line client are provided. They do only support a predefined set of resource record types, though I suspect that’s not a problem for most people2.
I actually switched a few weeks ago, but until very recently the programmatic REST API was the only way to make changes, so this wasn’t really a product I’d want to recommend: technically, it worked, but editing a JSON document by hand to send via the command-line client was… suboptimal.
Fortunately, there’s now an editor embedded in the Google Developers Console, so you can also make changes interactively.
Overall, I’m happy enough with the switch: it seems to work well, and didn’t take much effort (once I’d remembered to quote my TXT strings properly, ahem).
I did make one or two changes to the domain at the same time, most notably
removing the A record for farside.org.uk
itself (which had originally been
present for direct mail delivery, years ago). This does mean that
http://farside.org.uk/
will no longer resolve3, but that
hopefully shouldn’t cause any real problems.
Full disclosure: I’m currently getting an employee discount, so I’ll be paying less than that. ↩
I did have to drop an RP RR as a result of this, though I wasn’t actually using it for anything. ↩
Previously, this would end up at the aforementioned machine and be redirected by that copy of Apache to www.farside.org.uk
, which runs elsewhere. ↩
Back in June I wrote about a quick hack to search the Project Gutenberg text for pangrammatic windows; I then wrote a bit more about the implementation and used it as an example for performance tuning in Linux.
What I didn’t mention (because I’m only just now getting around to writing it up) is that I also ran the same analysis against the web.
Just to remind you what I’m talking about, pangrammatic windows are pangrams — a piece of text using all the letters in the (English) alphabet — that occur as substrings of otherwise naturally-occurring text. For example, the shortest known sequence in a published book is 42 letters, from Piers Anthony’s Cube Route:
Obviously, sequences such as “The quick brown fox jumps over the lazy dog” (35 letters) are shorter, but they aren’t naturally occurring, so don’t count for these purposes.
So, back in June, I decided to use some of my 20% time (and some weekends) to run a search to find some of the shortest pangrammatic windows on the web, using Google’s web index in much the same way as I’d earlier run a local search over the Gutenberg corpus of documents, except more so.
Even though I don’t work on Google’s web search itself, I knew that we had the ability to run analyses over the web at scale: Ian Hickson did something similar back in 2005 to produce a Web Authoring Statistics report on HTML structure on the web. The main difference was that I was hoping to do it for much more trivial reasons.
I was happy to find that doing this kind of analysis was pretty easy. The code in question is neither interesting nor open source, but let’s just note that, for search engines, the problems of ‘Do X for all web documents’ and ‘Extract the text from this web page’ are already fairly comprehensively solved.
I’d already restricted what I was looking at to English-language documents,
and (for hopefully obvious reasons) those documents that weren’t filtered by
SafeSearch, so that left me with only the easy bit to solve: writing a
matcher that would allow me to run a large-scale grep
over the web.
I started by writing something simple using the non-backtracking algorithm that Jesse Sheidlower had suggested1. This simply emitted one result for every unique pangrammatic window shorter than a certain number of letters (I think I started at 45 or so).
To work out whether two different windows were equivalent, I normalised the
window text by removing all non-alphabetic characters (apart from interior
single quotes) and collapsing all runs of whitespace to a single space. In
that way, “Fix Mr. Gluck’s hazy TV, PDQ” and “Fix Mr Gluck’s hazy TV,\n
‘PDQ’” (where \n
is a newline) would both be normalised to “fix mr gluck’s
hazy tv pdq”, and I would pick just one to report.
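That normalisation step is simple enough to sketch in C. This isn’t the code I actually ran (which isn’t public); it’s just a minimal illustration of the rule above, and it assumes plain ASCII input (so a straight ' rather than a curly quote):
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Normalise a window in place: keep letters (lower-cased) and interior
 * single quotes, and collapse everything else to a single space. */
static void normalise(char *s)
{
    size_t out = 0;
    int pending_space = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (isalpha(c)) {
            if (pending_space && out > 0)
                s[out++] = ' ';
            pending_space = 0;
            s[out++] = (char)tolower(c);
        } else if (c == '\'' && !pending_space && out > 0 &&
                   isalpha((unsigned char)s[out - 1]) &&
                   isalpha((unsigned char)s[i + 1])) {
            s[out++] = '\'';   /* interior quote, as in "gluck's" */
        } else {
            pending_space = 1;
        }
    }
    s[out] = '\0';
}

int main(void)
{
    char window[] = "Fix Mr. Gluck's hazy TV, 'PDQ'";
    normalise(window);
    puts(window);   /* prints: fix mr gluck's hazy tv pdq */
    return 0;
}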
That’s when I ran into something of a problem. When I’d run a simple search for short pangrammatic windows over the Gutenberg text, I’d had to skip through a thousand or so occurrences of ‘the alphabet’ and variations before getting to any real-world text. That clearly wasn’t going to scale up to the web.
To clean up these nonsense results, I started with some blacklisting: I’d already discarded entire documents based on a few regular expressions in order to exclude those documents that were specifically talking about pangrams2, so I tried adding another blacklist to remove individual results that contained ‘impossible’ words.
For example, if the normalised window contains the substring “qrs” (ignoring spaces), it can’t possibly be part of an English word: no word contains “qrs”, and none ends “qr” or starts “rs”, so there is no subdivision that would be valid3. This is very successful at removing a large proportion of the results that were variations on ‘the alphabet’.
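A sketch of that kind of check, with a deliberately tiny (and hypothetical) blacklist; the real list would need many more entries, as the next paragraph shows:
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Letter sequences that can't appear in English text, even across a word
 * boundary. (A hypothetical starting point only.) */
static const char *impossible[] = { "qrs", NULL };

/* Return true if the normalised window, with spaces removed, contains any
 * impossible sequence. */
static bool looks_like_nonsense(const char *window)
{
    char letters[1024];
    size_t n = 0;
    for (const char *p = window; *p != '\0' && n + 1 < sizeof letters; p++)
        if (isalpha((unsigned char)*p))
            letters[n++] = (char)tolower((unsigned char)*p);
    letters[n] = '\0';
    for (int i = 0; impossible[i] != NULL; i++)
        if (strstr(letters, impossible[i]) != NULL)
            return true;
    return false;
}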
However, it’s not good enough. I still needed to add “qwerty” and “ytrewq” (“qwerty” reversed) and “azertyu” (French keyboard layout; and note that “azerty” wouldn’t be valid, since some words do end in “azer”) and then “ytreza” and… clearly this isn’t going to scale either.
The internet follows rule 34 for misspellings, it seems: I’m fairly confident that even something as simple as the alphabet has been misspelled in almost every possible way.
I needed a better way to sort the real text from the nonsense text.
I thought about trying to do something clever — like trying to train a classifier to recognise English words — but then I realised that I could do something dumb instead, which is almost always a better approach.
I globbed together a few sources to make a large (100K words or so) dictionary that looked like it contained mostly plausible English words, removed a few words that were valid but problematic (“BC”, “def”, “wert”, etc), and wrote something that would compute a score based on the number of known words in the normalised window.
For example, if the input was “a b c d e … z”, we’d have 26 ‘words’, of which two (“a” and “i”) would be considered known words4, and so we’d give it a score of 2/26, or 0.077.
I knew that I wouldn’t want to set a minimum score of 1.0 (eliminating results that had any unknown words), both because I’d seen from the Gutenberg examples how common proper names were, and also because the nature of reporting a sub-sequence meant that I’d often be selecting a partial word for the first and last words in the window. However, playing around with the threshold showed that it was filtering out nonsense results pretty well, but that I still had a slightly different problem to solve.
While I’d managed to filter out windows with a high proportion of nonsense words, the four-word sequence “in the end. abc…xyz” still manages a reasonable score of 0.75 by that metric, since only the last of the four ‘words’ is an unknown word.
To fix this problem, I put together an alternative score that was computed from the number of letters covered by each known word. That works well for inputs such as the above (2 + 3 + 3 letters in known words, out of 34 letters total, for a score of 8/34, or 0.235).
I couldn’t just replace the known-word score with the by-letter score by itself, though: the letter coverage scorer gives low scores to windows with a long, but truncated, first or last word, and it doesn’t give low scores to windows containing a large number of short nonsense words (things like “wafting zephyrs vex bold p p p p p q jim”).
So I did what any reasonable person would do in that situation: I multiplied both scores together. This sounds ridiculously naïve (and I’m sure that anyone who actually does language processing professionally can tell me why this whole approach is idiotic), but it seems to have worked out reasonably well.
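As a sketch of what that combination looks like (with a hypothetical is_known_word() standing in for the 100K-word dictionary; the numbers in the comments match the worked examples above):
#include <stdbool.h>
#include <string.h>

/* Hypothetical lookup into the ~100K-word dictionary described above. */
bool is_known_word(const char *word);

/* Score a normalised window: the fraction of words that are known,
 * multiplied by the fraction of letters that fall within known words.
 * (Any interior quote is counted as part of the word, which is close
 * enough for these purposes.) */
static double window_score(char *window)
{
    int words = 0, known_words = 0;
    int letters = 0, known_letters = 0;
    for (char *w = strtok(window, " "); w != NULL; w = strtok(NULL, " ")) {
        int len = (int)strlen(w);
        words++;
        letters += len;
        if (is_known_word(w)) {
            known_words++;
            known_letters += len;
        }
    }
    if (words == 0 || letters == 0)
        return 0.0;
    double by_word = (double)known_words / words;       /* 3/4 = 0.75 for the "in the end" + alphabet example */
    double by_letter = (double)known_letters / letters; /* 8/34 = 0.235 for the same window */
    return by_word * by_letter;
}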
In case you’re interested, the distributions of scores (for windows of an acceptable length) ended up looking like this:
That approach is not without its problems, though: based on some initial trials, I decided that the final run would discard any result with a score lower than 0.55. I later spotted that this would also have discarded both examples quoted in the Wikipedia article, though it does accept all the examples I found in the Gutenberg text:
In any case, I also ended up discarding any window with more than 38 letters, which eliminated all of the above anyway.
I read through a lot of results, about four thousand in all. By hand, and mostly in a coach to and from Wales. I may have missed something.
First, some brief observations about the things that weren’t good results (but that still scored highly enough that I had to look at them):
Without further ado, the three best results I found (in reverse order).
In third place, a post on the “CrackBerry Forums”, which falls down slightly for using what turns out to be some very convenient product names, but wins for the story:
Second place goes to another forum post. The pangrammatic window here is from a list of words rather than a portion of a sentence, but I did learn the names of some dance styles:
Both of those were 38 letters, but the clear winner on both length and content is the following 36 letter pangrammatic window, from a review of the film Magnolia:
I’m pretty impressed by this result: it’s only one letter longer than “The quick brown fox…”, and while that’s not the shortest possible pangram by far, it is one of the more coherent ones.
As for me, I think I’m probably done with pangrams now. Although… I’ve been spending so much time lately reading examples of pangrams that I have started to wonder how I could get myself a sphinx. Preferably one made from black quartz…
The reason for picking this algorithm wasn’t performance, but rather that it has the advantage of being insensitive to the number of non-alphabetic characters within a window, unlike the fixed-size sliding window used by the algorithm I’d previously used to search the Gutenberg text. ↩
I kept this blacklist in place later on, though I don’t think it actually had much effect. For posterity, the list of (non-case-sensitive) regexps I ended up using was: ‘pangram’, ‘quick.? brown (fox|dogs?) jump’, ‘silk pyjamas exchanged for blue quartz’, ‘DJs flock by when MTV’, ‘five boxing wizards jump’, ‘Pack my box with five dozen’, ‘love my big sphinx of quartz’, and ‘Grumpy wizards make toxic brew’. Only one of which is actually a regexp. ↩
Though it turns out that /usr/share/dict/words
on my Ubuntu laptop seems to think that “rs” is itself a word. I suspect I just decided that it was wrong. ↩
Again, Ubuntu’s /usr/share/dict/words
seems to consider every letter of the alphabet to be a valid word. The dictionary I used didn’t, though you could probably make the case that it should perhaps have included “O” (and maybe even the txtspeak “U”). ↩
This is the third in a series of posts about searching for pangrammatic windows in a corpus of documents. I’ve previously talked about what I found in the Gutenberg corpus and the code I used, and ended the last post with a question: can we make it faster?
Well, that’s what this post is about.
I’ve expanded the contents of the Project Gutenberg 2010 DVD image
into the current directory, and I’m compiling and running
pangram.c
as follows:
$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out
real 2m56.795s
user 0m57.155s
sys 0m6.667s
In other words, it currently takes about three minutes. We’re going to try to reduce that. Some facts and figures:
Before we do anything, we really need to make sure we can get repeatable measurements. The first few experiments I tried ended up with nonsensical results — it turns out that “all the Gutenberg text” is less than the total memory on my laptop (16GB), so all the runs after the first just read from the filesystem cache.
Fixing that is easy: we ask the kernel to drop the cache before we run a test:
# echo 3 > /proc/sys/vm/drop_caches
or, since we’re probably not running as root,
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
/proc/sys/vm/drop_caches
is documented in
Documentation/sysctl/vm.txt
; the command above will drop both the
file content cache and the inode cache (containing directory contents).
One other source of variation I ran into was caused by how long find
took
to run (and note that it runs concurrently with xargs
): most of the time
it completed quickly, but in a few situations it was (inconsistently)
delayed by what else was going on, causing the whole search to take much
longer than usual. This was also easy to avoid: we capture the list of
files in advance:
$ find -name '*.txt' > filelist
$ echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
$ time <filelist xargs -n 100 ./pangram > pangram.out
real 2m58.310s
user 0m55.787s
sys 0m6.630s
So where to start? As I see it, there are at least three things we can try:
Let’s take a look at the algorithm first.
I’m not going to repeat the whole thing here (see the previous post for that), but in summary: we read through each file until we’ve seen enough letters that we might have found a pangram, then scan backwards until we find one, or until we hit a limit (I used 200 bytes); we then resume scanning forwards from where we left off.
Can we improve on this? Perhaps. We clearly need to visit all the letters at least once, but — as suggested to me by Jesse Sheidlower, who wrote the PangramTweets Twitter bot that kicked this all off — we can avoid backtracking during the initial search if we keep some additional state1.
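In case it’s useful, here’s roughly what that looks like. This is my reconstruction from the description in the footnote rather than the exact code I benchmarked, and it only reports window lengths (you still need to backtrack if you want to print the text):
#include <ctype.h>
#include <stdio.h>
#include <stddef.h>

/* Non-backtracking scan: track the letter-offset of the most recent
 * occurrence of each letter; once every letter has been seen, the minimum
 * of those offsets is the start of the smallest window ending here. */
static void scan(const char *buf, size_t len)
{
    long last[26];
    for (int i = 0; i < 26; i++)
        last[i] = -1;
    long letter_index = 0;
    long oldest = -1;   /* previous minimum, so we only report changes */

    for (size_t i = 0; i < len; i++) {
        if (!isalpha((unsigned char)buf[i]))
            continue;
        last[tolower((unsigned char)buf[i]) - 'a'] = letter_index;

        /* The real version only needs to recompute this when the letter we
         * just saw was the previous oldest one; a linear scan keeps the
         * sketch short. */
        long min = last[0];
        for (int j = 1; j < 26; j++)
            if (last[j] < min)
                min = last[j];

        if (min >= 0 && min != oldest) {
            printf("%ld-letter window ending at byte offset %zu\n",
                   letter_index - min + 1, i);
            oldest = min;
        }
        letter_index++;
    }
}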
Does this help? Unfortunately, not really: it takes exactly the same amount of time as the backtracking version.
Perhaps that’s not too surprising, though. It’s fairly clear that the problem should be I/O-bound, and so — unless the backtracking causes additional I/O (which appears not to be the case) — we should see if we can perhaps spend less time waiting on the I/O we have.
As it stands, the total cost of our I/O is pretty much unavoidable: we need to read each file completely into RAM.
We could reduce the overall I/O by changing what we read2. For example, we could:
However, these fundamentally change the problem we’re trying to solve, not just the way we’re solving it, and so I’m going to stick with what I have for now.
So far, we’re using mmap()
to read the file. This gives us a memory range
into which the kernel will read the file’s contents as-needed, using some
amount of asynchronous read-ahead. If we try to read a page that hasn’t
been read from disk yet, we’ll block3.
At least in my case, using mmap()
to map the file on-demand isn’t much
different to using read()
to read the whole file in one go, which is
itself a bit of a surprise: the former’s asynchronous read-ahead should mean
that we can get started more quickly. I ran across an email about mmap()
on the linux-kernel mailing list where Linus explains that
mmap()
is actually quite expensive to start with. In any case, most of our
files are in the range 256–512KB4, and so perhaps there’s just
not a lot of read-ahead to do.
One thing we could try is reducing the time we spend waiting for I/O by
providing hints to the kernel about our usage of an area of memory or a
file. For example, to hint that we’re about to read a buffer sequentially,
we can write madvise(buf, len, MADV_SEQUENTIAL)
.
In theory, this should allow us to optimise the file I/O based on our usage. In practice (at least in my case), it turns out that these are actually pessimisations.
While we have several different ways to hint to the kernel, as far as I can see, they boil down to just two choices: whether or not we need the data immediately, and what the access pattern is for the data in memory.
If we need the data “now” (MADV_WILLNEED for madvise(), MAP_POPULATE for mmap(), POSIX_FADV_WILLNEED for posix_fadvise(), etc), then the kernel will issue a synchronous read
there-and-then, returning once the file’s data is in the page cache. This
can be no faster than issuing a blocking read()
for the whole file, and
— I assume due to the overhead of mmap()
— actually ends up a bit
slower in practice.
Otherwise, the access pattern is one of “normal”, “random”, or “sequential”. “Normal” is what we get by default; it triggers some amount of read-ahead.
“Random” (MADV_RANDOM
, etc) is straightforward: it switches off
read-ahead, so every new page access causes a single-page blocking read.
This is terrible for performance, as you might expect — in our case, it
roughly doubles the runtime.
“Sequential” (MADV_SEQUENTIAL
) is less well-defined. It’s not completely
clear to me what it does in practice — the posix_fadvise()
man page says
that it doubles the read-ahead window (on Linux), but the kernel source
implies that it might be a bit more complex than that — but in any case,
it seems to have a small negative effect overall.
madvise()
and friends provide just one way to tune I/O scheduling, but
they are the easiest to use. We could also look at overlapped or threaded
I/O, but that’s significantly more complex — and perhaps there’s an easier
way to improve our utilization anyway.
In this case, I’m talking about disk and CPU utilization. While some of the disk reads will occur while we’re searching the buffer we’ve already read into memory, most won’t, and so the CPU should often be waiting for an I/O operation to complete.
It’d be nice if we could get a bit more detail about how we’re doing than
simple wallclock time, so I’m going to measure the current CPU and disk
utilization using iostat
.
iostat
is a whole-machine profiler, so it’s probably a good idea not to
have much else going on at the time (though that said, I didn’t see too much
impact from the copy of Chrome I had running). Alternatively, we could look
into per-process monitoring via iotop or pidstat, or an
event/tracing approach like SystemTap or
iosnoop5. (Incidentally, Brendan Gregg’s Linux
Performance page is a great resource for finding out more about
Linux performance tools.)
Running iostat
— with -N
to print meaningful names for the device mapper
devices, and -x
to print extended disk statistics — produces something
like the following (rather wide, sorry) output:
$ iostat -N -x
Linux 3.13.0-30-generic (malcolmr) 06/09/14 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.28 0.07 0.77 0.11 0.00 97.77
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.17 0.04 11.92 4.62 239.54 25.25 32.03 0.01 0.33 0.29 0.43 0.17 0.28
sda5_crypt 0.00 0.00 11.81 4.66 238.68 25.24 32.05 0.04 2.43 0.91 6.29 0.29 0.48
sysvg-root 0.00 0.00 11.65 4.48 238.02 25.24 32.65 0.04 2.48 0.92 6.54 0.30 0.48
sysvg-swap_1 0.00 0.00 0.12 0.00 0.47 0.00 8.00 0.00 0.15 0.15 0.00 0.15 0.00
$
It’s important to note that if you run iostat
this way, you actually get a
running average since boot, which isn’t very useful at all. What I chose to
do instead was to start the run and then execute iostat -N -x 30 3
, which
outputs three reports separated by 30 seconds. The first is the
average-since-boot, which we can ignore, but the other two are averages over
the 30 seconds since the prior report.
Having two good reports allows us to check how variable the numbers we’re seeing are (in my case, fairly reliable). Here’s the kind of output I got.
First, we start a run:
$ time <filelist xargs -n 100 ./pangram > pangram.out
and then concurrently run iostat
:
$ iostat -N -x 30 3
[...]
avg-cpu: %user %nice %system %iowait %steal %idle
8.19 0.00 19.54 15.99 0.00 56.28
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.20 0.00 655.57 2.37 68680.93 23.33 208.85 0.26 0.39 0.39 0.39 0.35 22.93
sda5_crypt 0.00 0.00 654.43 2.37 68666.40 23.33 209.16 1.19 1.82 1.82 0.73 1.31 86.00
sysvg-root 0.00 0.00 654.43 2.17 68666.40 23.33 209.23 1.19 1.82 1.82 0.80 1.31 86.05
sysvg-swap_1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The bottom line is that the CPU is idle most of the time (%iowait
is “idle
but there are runnable processes waiting for I/O”), and the disk is also
idle some of the time (%util
, on the far right, which is the proportion of
the time that there was at least one outstanding I/O operation to the
device).
Note that %util
of 100% does not mean that the device cannot take any
more requests, just that there was at least one request pending for the
device at all times. On the other hand, a device with %util
less than
100% (as we have here) is definitely under-utilized.
I’m not entirely sure what the discrepancy between the reported %util
for
sda5_crypt
and sda
(the dm-crypt volume and raw disk) is due to, but I
wasn’t able to get the latter above 50%. It looks to me like perhaps
dm-crypt is unwilling to forward more than one request at a time to the raw
device: avgqu-sz
(the average request queue depth, including active
requests) for sda
never got to 0.9, no matter how much load I
added6.
Now that we have evidence that both the CPU and disk really are underutilized, how do we improve things? Well, the easiest way is to run over more than one file at a time:
$ time <filelist xargs -n 100 -P 2 ./pangram > pangram.out
real 1m37.855s
user 0m51.309s
sys 0m5.786s
Well, that’s much better already. The options to xargs
tell it to run
pangram
with at most 100 files, and execute two copies in parallel. (It’s
important to limit the number of files per invocation, otherwise xargs
will just pass everything to a single copy.)
At this point, we’re keeping the disk (sda5_crypt
) a lot busier: %util
is up from 86% to nearly 99%, and rkB/s
(the read throughput) is up from
69MB/s to 122MB/s. In fact, we can continue to increase the number of
concurrent processes to reach a peak of about 149MB/s:
You’ll see that it tops out somewhere around P=8. I’m not sure how to
explain the drop in read throughput from around P=32: it seems to correspond
to the point at which %idle
drops to zero (i.e. there’s a task waiting on
I/O at all times), but I don’t see why that would necessarily cause things
to run more slowly (and all the metrics apart from r/s
and rkB/s
are
linear in the amount of load).
Whatever the reason, we’re done here: we’ve reduced the runtime of this task from three minutes to about 1m20s, less than half the time it took originally.
While this was something of an artificial example (we really didn’t need to run this more than once, after all), and while some of the above is surely specific to my setup, I hope it was an interesting exercise.
If I had to tl;dr the above, it would be:
In many cases, it isn’t necessary to spend the time on performance tuning, but when it is, the above is probably a good roadmap. Typically, you’d only bother when you’re doing something like handling a user request, where the latency is a target by itself, or when you’re running something repeatedly, perhaps because it’s a core library function, or perhaps because you’re processing a lot of data.
Briefly: we track the byte- and letter-offset in the file at which we most-recently saw each letter, and also the minimum letter-offset over all letters (which will change infrequently, thanks to the non-uniform letter distribution); when that minimum letter-offset changes, we have a new pangram, and immediately know the number of letters it contains. This is O(N) in the number of letters in the file (for the search; we still have to backtrack to find the text to output), and does have one other significant advantage: it avoids any artificial limit on the number of non-letter bytes (whitespace, etc) that can appear within a sequence. ↩
I haven’t actually tested any of these, by the way, so they may not actually help, but they all sound reasonable. ↩
These two cases can be distinguished via /proc/vmstat
: read-ahead reads are counted as page faults (pgfault
), while synchronous (blocking) reads are counted as major page faults (pgmajfault
). ↩
The filesize distribution of Gutenberg texts is actually pretty close to a log-normal distribution centred around 2^18 bytes (about 256KB). ↩
Oddly, Linux doesn’t appear to have a good solution for iostat
-like accounting for cgroups (though pidstat comes close) — or if it does, I couldn’t find it. ↩
Perhaps this is a red herring, but I would have expected that passing those requests down to the raw disk device could only help (and I note that Intel use a queue depth of 32 when measuring SSD performance). As it stands, the maximum read throughput I can get from the raw device (via dm-crypt) is about 150MB/s, a little under half of the real-world sequential read performance I see in reviews. ↩
Inspired by Google’s recent decision to boost the ranking of HTTPS sites, and because it’s something I’ve been meaning to do for a while (and also because it’s generally the right thing to do), I’ve just moved this blog to serve via HTTPS.
I pretty much just walked through this set of instructions from Eric Mill, using the SSL configuration from Mozilla’s OpSec team (seriously, don’t try to do this bit yourself: the folks at Mozilla know what they’re doing). All told, it only took a couple of hours.
Like Eric, I also got my free certificate from StartSSL; they seem reasonable enough at the moment, and I can always change later if I feel like it.
Other than needing to switch to a protocol-relative URL for Google Web Fonts, the site worked first time (though it helps that it’s fairly simple: all the odd stuff got left behind when I split the serving of this blog to a Google Compute Engine instance).
However, unlike Tim, I didn’t keep the HTTP version of the
site around: all http://
URLs now result in a 301 to the HTTPS
equivalent1. I haven’t yet enabled HSTS to pin the site to
HTTPS, but I’ll probably do so in a week or so, once I’ve checked to see if
any problems turn up.
I’m also not entirely concerned about backward-compatibility with old clients (I used the Non-Backward Compatible Ciphersuite list, for example). I was originally planning to only enable TLS 1.2, but it turns out that I do still care about some older clients (no, not Windows XP): GoogleBot and pre-KitKat versions of Android (presumably the Android browser rather than Chrome-on-Android), which only support TLS 1.02. In the end, I only ended up disabling SSL2 and SSL3.
Once I’d tested the site, the only thing I needed to do was to register the HTTPS URL in Google Webmaster Tools, and update a few incoming redirects to avoid long redirect chains.
I also found the following sites useful:
In summary: for many sites, enabling HTTPS is pretty trivial. If you’re making a new site, consider making it HTTPS-only.
Except for robots.txt
, which serves the content directly. I’m not sure if that’s actually important, but it seemed like robots might not want to follow redirects to fetch robots.txt
, even if they would for the other content. ↩
In addition, the version of curl
I have on my desktop only supports TLS 1.1, so I would have at least wanted to enable that. ↩
Noda Time 1.3.0 came out today1, bringing a healthy mix of new features and bug fixes for all your date and time handling needs. Unlike with previous releases, the improvements in Noda Time 1.3 don’t really have a single theme: they add a handful of features and tidy up some loose ends on the road to 2.0 (on which more below).
So in no particular order…
Noda Time 1.3 adds support for the Persian (Solar Hijri) calendar, and experimental support for the Hebrew calendar. Support for the latter is “experimental” because we are not entirely convinced that calculations around leap years work as people would expect, and because there is currently no support for parsing and formatting month names. See the calendars page in the user guide for more details.
Speaking of parsing and formatting, both should be significantly faster in 1.3.0. Parse failures should also be much easier to diagnose, as errors now indicate which part of the input failed to match the relevant part of the pattern.
The desktop build of Noda Time should now be usable from partially-trusted
contexts (such as ASP.NET shared hosting), as it is now marked with the
AllowPartiallyTrustedCallers
attribute.
Finally, we also fixed a small number of minor bugs, added annotations for
ReSharper users, and added a few more convenience methods —
ZonedDateTime.IsDaylightSavingTime()
and OffsetDateTime.WithOffset()
,
for example — in response to user requests. There’s also a new option to
make the JSON serializer use a string representation for Interval
.
Again, see the User Guide and 1.3.0 release notes for more information about all of the above.
You can get Noda Time 1.3.0 from the NuGet repository as usual (core, testing, JSON support packages), or from the links on the Noda Time home page.
Meanwhile, development has started on Noda Time 2.0. Noda Time 2.0 will not be binary-compatible with Noda Time 1.x, but it will be mostly source-compatible: we don’t plan to make completely gratuitous changes.
Among other things, Noda Time 2.0 is likely to contain a change in the resolution of Instant and Duration from ticks to nanoseconds.
We don’t expect to have a release of Noda Time 2.0 until next year, so we may well make some additional releases in the 1.3.x series between now and then, but in general we’ll be focussing on 2.0. If you’re interested in helping out, come and talk to us on the mailing list.
And once again, I’m going to plagiarise this post for the official Noda Time blog post. ↩
In the previous post, I talked about finding pangrammatic windows in a corpus of text files from Project Gutenberg (in particular, the 2010 DVD image). Here I’m going to talk a bit about the implementation I used.
I think the problem itself is quite interesting. Restated, it’s “search a given text for all substrings that contain all the letters of the alphabet, and that do not themselves contain another matching substring” (the latter, because given “fooabc…xyzbar” we only want to emit “abc…xyz”).
I can imagine asking that kind of question in an interview1. If you enjoy that kind of thing, you might want to go and think about how you’d solve it yourself.
Back already? When I brought up the idea at work, one sensible suggestion (thanks, Christian!) was to keep going until we’d seen each character at least once, keeping track by setting a bit per letter. Once we’d seen all the letters once, we could scan backwards to work out whether we’d found the end of a pangrammatic sequence or not.
Since the frequency of some letters (Q, Z) is very low in English text, we’d expect to only have to scan backwards occasionally. We’d also limit the size of that backward scan to avoid examining O(N^2) characters.
The only other wrinkle is what to do when we scan backwards: if we see all the letters (and we can use the same mechanism as before to track which we’ve seen), then we immediately know that we have a pangrammatic window, so we can output it. Otherwise we keep going for some maximum number of characters — I used 200 — and then give up.
What then? After a match covering offsets [a, b], we can’t forget about everything and jump back to offset b+1, as we might be looking at a string like “zaabc…xyz” (where we’d want to emit “zaabc…xy” and then the shorter “abc…xyz”). It’s always safe to restart at offset a+1, but we can do better: we can keep the set of letters we’ve seen (i.e. all of them) and remove the character at the start of the matched substring (“z”, in this case), which by definition must have only occurred once, and then continue from offset b+1.
In the much more likely case that we don’t see a pangrammatic sequence, we also continue the search at b+1, with the seen set covering what we’ve seen in the range (b-window size, b]. (Note that if we knew that the character at the start of the window had only appeared once, then we could remove it as before, but in general, we can’t.)
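The “have we seen every letter yet?” bookkeeping is just a 26-bit set. As a rough sketch (the real seen_all() in pangram.c below may differ in detail, and the backward scan and restart logic follow the description above):
#include <ctype.h>
#include <stdint.h>

#define ALL_LETTERS ((1u << 26) - 1)   /* one bit per letter of the alphabet */

/* Add one byte to the set of letters seen so far, and report whether every
 * letter has now been seen at least once. */
static int seen_all(uint32_t *seen, unsigned char c)
{
    if (isalpha(c))
        *seen |= 1u << (tolower(c) - 'a');
    return *seen == ALL_LETTERS;
}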
Download pangram.c
. Compile and run using something
like:
$ gcc -std=gnu99 -O2 -DMAX_PANGRAM=200 pangram.c -o pangram
$ ./pangram file1 file2 file3
The compile flags just define the maximum window size, MAX_PANGRAM
(to 200
bytes, the figure I chose in the end), and enable optimisations (which I was
surprised to see make a noticeable difference to the runtime).
The implementation maps to the algorithm I described above: main() simply uses mmap() to read the contents of each file in turn into memory, then invokes pangram(). pangram() walks through the file byte-by-byte, calling seen_all() to update the letters we’ve seen in seen; when seen_all() returns true, we call try_scan_backwards() to check whether we have a pangram in the last MAX_PANGRAM bytes, and also to update seen with the new set of letters that are actually within that window (as described above). Finally, output_pangram() prints the file and contents to stdout.
I’m fairly happy with the result. It’s not the best code I’ve written, but it’s not too bad.
Loading the whole file into memory in one go isn’t particularly great (we
only really need a sliding window of MAX_PANGRAM
bytes, so we’re wasting a
lot of memory), but it makes the code much simpler, and memory pressure
isn’t something I need to worry about here. The largest file I’m dealing
with is 43MB (Webster’s Unabridged Dictionary, pgwht04.txt
), and my laptop
has 16GB of RAM, so there’s no reason to try anything cleverer: mmap()
is
simple, and it works.
How do we actually go about running this over all the texts? I’d previously loopback-mounted the ISO image and unzipped everything into a directory (though some of the zipfiles contained directories themselves), but that still gave me…
$ find -name '*.txt' | wc -l
32473
… just over 32,000 files to consider (totalling about 11.6GB of text). I’d decided to do this simply: I didn’t try to filter out non-English texts (assuming, correctly as it turned out, that foreign-language text was unlikely to show up in the results anyway), and for the same reason, I also didn’t bother dealing with different file encodings (as the files use a mixture of at least UTF-8 and ISO-8859-1).
From the directory containing the unzipped texts, we can run a search by using something simple like this:
$ time find -name '*.txt' | xargs -n 100 ./pangram > pangram.out
real 2m56.795s
user 0m57.155s
sys 0m6.667s
I’ve used -n 100
so that xargs
will run our program with 100 files at a
time, rather than all 32,000. That’ll be important later, though initially
I was a little worried about command-line length limits, probably
unnecessarily2.
The resulting output file contains about 42,000 results, each with the letter count (which is what matters, not the byte count) followed by the filename and text, so we can easily find the shortest sequences:
$ sort -n pangram.out | head -n3
26 ./10742.txt: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
26 ./10742.txt: B C D E F G H I J K L M N O P Q R S T U V W X Y Z & a
26 ./10742.txt: C D E F G H I J K L M N O P Q R S T U V W X Y Z & a b
Okay, it needs a bit of manual review to weed out the nonsense, but it’s good enough.
The only thing I’m not entirely sure about here is the safety of combining
the results from stdout
if I run more than one copy of pangram
at a time
(spoilers!). Well, rather: I’m pretty sure it’s not safe, but it appears
to work in practice. Mostly.
We printf()
to stdout
, which I’d thought was line-buffered. However,
without an explicit fflush(stdout)
after the printf()
(which output
always finishes with a newline anyway), a small fraction of the output is
lost when I concurrently append to a single output file: I’m missing some
lines (a few hundred in 42,000 or so), and I get the ends of a few others.
With fflush(stdout)
, I seem to get the right results again, unless I
spawn a large number of concurrent processes (say, 300), so I’m guessing
there’s a race somewhere that I’m occasionally losing. The reason that I’m
a little confused is that I expected this to either work fine, because
writes of less than PIPE_BUF
bytes (512 by POSIX; in practice, at least
4KB) are atomic — or if that didn’t apply in this situation, I’d expected
it to interleave the results completely.
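If I did want to lean on that guarantee deliberately, one option (not what pangram.c does) would be to format each result into a buffer and hand it to the kernel in a single write(), so that a complete line reaches the output in one call; for pipes, at least, POSIX promises that writes under PIPE_BUF won’t interleave:
#include <stdio.h>
#include <unistd.h>

/* Emit one result line ("<letters> <file>: <text>") with a single write(),
 * so the whole line is passed to the kernel in one call. */
static void emit_result(int letters, const char *filename, const char *text)
{
    char line[4096];
    int n = snprintf(line, sizeof line, "%d %s: %s\n", letters, filename, text);
    if (n > 0 && (size_t)n < sizeof line)
        (void)write(STDOUT_FILENO, line, (size_t)n);
}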
Three minutes is a bit long to wait; can we make it run faster?
Yes, we can. But that’s a post for another day.
Note to interview candidates: I will not actually be asking that question. Do not revise that question. (Or do. I’m a footnote, not a cop.) ↩
Definitely unnecessarily: I learned more recently that xargs automatically caps command-line lengths according to the maximum size (with a lower-bound on that cap of 128KB); see xargs --show-limits
. ↩
Over on Language Log, there’s a post about pangrammatic windows, and a bot that searches Twitter posts for them. Pangrammatic windows are pangrams — a piece of text using all the letters in the (English) alphabet — that occur within otherwise naturally-occurring text.
For example, the shortest known natural sequence is 42 letters, from Piers Anthony’s Cube Route, discovered in an article in Word Ways:
I thought it might be interesting to work out how you’d go about searching a given text for pangrammatic windows. A short chat at work and some quick hacking later, and I had a simple proof-of-concept, but no data to run against.
That was easily solved by downloading the Project Gutenberg April 2010 DVD image1 and unzipping everything within. That gave me 11.6GB of text files, ranging in size from 336 bytes (one of the chapters of Moby Dick) to a single 43MB file comprising Webster’s Unabridged Dictionary.
I’ll post about the technical side separately, but suffice to say that this search doesn’t exactly tax a modern PC: my laptop has enough RAM to load all of the Gutenberg text into memory, and even from cold, it takes only 80 seconds to search through it all.
So what did I find? Well, firstly, several thousand occurrences of “the alphabet”. In retrospect, that probably should have been obvious.
I did find another 42-letter sequence, but I don’t think it can really count, as it occurs during a discussion of pangrams itself: De Morgan (the mathematician), while snarking about numerology, writes about trying to construct a meaningful sentence using all the letters save ‘v’ and ‘j’ exactly once:
The shortest sequence that seems to fit within the rules is the following 53-letter sequence, from The Life of Charles Dickens:
However, this, and a similar 56-letter sequence (“Köckeritz! Where is the king?”) in Napoleon and the Queen of Prussia both still seem somewhat unnatural to me, since they depend upon proper names to work (and to be fair, the same is true of the Piers Anthony quote as well).
Given that, I think the contender for the shortest truly “natural” pangrammatic window in the Gutenberg corpus is the following 57-letter sequence, from Andre Norton’s YA-esque civil war adventure, Ride Proud, Rebel!:
Funnily enough, one thing that I did expect to find, but didn’t, was any of the common example pangrams — in fact, the word “pangram” does not appear (with that meaning) in the Gutenberg corpus at all! The closest I got were the two near-misses: “the quick, brown fox jumped over the lazy dog” and “the swift brown fox jumps over the lazy dog”, the former of which is, I think, a misquote (the latter isn’t, as it’s called out in the text as an almost-pangram).
That’s it for this post. I also have a separate post that goes into a little detail about the code itself.
Hey, 14-year-old me? Remember when you spent over an hour on the phone to download 150KB of BBS software on a 300 baud connection? I just took about the same time to download 8.4GB, and I have enough space to store an uncompressed copy too. The future rocks! But while we’re here: could you buy some Apple stock during 2002? Thanks! ↩
Google Compute Engine is Google’s “run a virtual machine on Google infrastructure” product. It’s broadly similar to Amazon’s EC2, in that you get an unmanaged (Linux) virtual machine that you can run pretty much anything on, one difference being that it seems to be aimed at larger workloads: 16-core machines with hundreds of GB of RAM, 5TB disks, that kind of thing.
While I’d been meaning to look at it for a while, I didn’t think I had any reason to use it; I certainly don’t have any workloads of the scale people seem to be talking about. A short while ago, a friend at work mentioned he was using it to run a private Minecraft server, which seemed pretty small to me, so I thought perhaps I’d take another look.
It turns out that Compute Engine is just as suited to small-scale workloads as large ones, and while you do have to pay to use it, it works out to be pretty inexpensive. Having spent a little time with it now, I figured it was time to document what I found out.
Boring disclaimer time first, though: I don’t work on Compute Engine, so this isn’t anything official, just some guy on the internet. Also, in the interests of full disclosure: I’m getting an employee discount on the cost of using Compute Engine (though it’s cheap enough that I’d be happy paying full price anyway). With that in mind…
Stalkers and readers with good memories will recall that I started proxying this site via Google’s PageSpeed Service a little over two years ago. PageSpeed Service is a reverse proxy running in Google’s data centres that applies various performance rewrites to the original content (minifying CSS, and so on), and it does a pretty good job overall. As an additional benefit, it’s a (short TTL) caching proxy, so nobody needs depend directly on the copy of Apache running at the end of a DSL pipe on my server at home.
However, I’ve always been slightly bothered by the fact that that dependency still exists. There’s the usual “home network isn’t very reliable” problem1, but rather more importantly, that server’s on my home network, and given the choice, I’d rather not have it running a public copy of Apache as well as everything else.
Anyway, it turns out that I’m going to need to reinstall that server in a bit anyway, so I figured that it might be a good time to see whether Compute Engine was a good fit to run a simple low-traffic Apache server like the one that serves this site (spoiler: yes).
I was hoping that I’d have something clever to say about what I needed to do to set it up, but in truth the Compute Engine quickstart is almost embarrassingly easy, and ends up with a running copy of Apache, not far from where I needed to be.
One thing I did decide to do while experimenting was to script the whole install, so that a single script creates the virtual machine (the “instance”), installs everything I need, and sets up Apache to serve this site. Partly2 this was to make sure I recorded what I’d done, and partly so that I could experiment and reset to a clean state when I messed things up.
That may have been a bit excessive for a simple installation, but it does mean that I now have good documentation that I can go into some detail about.
With Compute Engine, the first thing you need to do (assuming you’ve completed the setup in the quickstart) is to create an instance, which is what Compute Engine calls a persistent virtual machine. My script ended up using something like the following, which creates an instance together with a new persistent disk to boot from:
$ gcutil --project=farblog addinstance www \
--machine_type=f1-micro \
--zone=us-central1-b \
--image=debian-7 \
--metadata_from_file=startup-script:startup.sh \
--authorized_ssh_keys=myuser:myuser.pub \
--external_ip_address=8.35.193.150 \
--wait_until_running
gcutil
is the command-line tool from the Google Cloud SDK that
allows you to configure and control everything related to Compute Engine
(other tools in the SDK cover App Engine, Cloud Storage, and so on, but I
didn’t need to use any of those).
Taking it from the top, --project
specifies the Google Developers
Console project ID (these projects are just a way to group
different APIs for billing and so on; in this case, I’m only using Compute
Engine). You can also ask gcutil
to remember the project (or any other
flag value) so that you don’t need to keep repeating it.
addinstance
is the command to add a new instance, and www
is my
(unimaginative) instance name. Everything after this point is optional: the
tool will prompt for the zone, machine type, and image, and use sensible
defaults for everything else.
The machine type comes next: f1-micro
is the smallest machine
type available, with about 600MB RAM and a CPU
reservation suitable to occasional “bursty” (rather than continuous)
workloads. That probably wouldn’t work for a server under load, but it
seems to be absolutely fine for one like mine, with a request rate measured
in seconds between requests, rather than the other way around.
Next is the zone (us-central1-b
), where the machine I’m using will be
physically located. This currently boils down to the choice of few
different locations in the US and Europe (at the time of writing, four
different zones across two regions named us-central1
and europe-west1
).
As with Amazon, the European regions are slightly more expensive (by about
10%) than the US ones, so I’m using a zone in a US region.
While Compute Engine was in limited preview, the choice of zone within a region was a bit more important, as different zones had different maintenance schedules, and a maintenance event would shut down the whole zone for about two weeks, requiring you to bring up another instance somewhere else. However, the US zones no longer have downtime for scheduled maintenance: when some part of the zone needs to be taken offline, the instances affected will be migrated to other physical machines transparently (i.e. without a reboot).
This is pretty awesome, and really makes it possible to run a set-and-forget service like a web server without any complexity (or cost) involved in, for example, setting up load balancing across multiple instances.
After the zone, I’ve specified the disk image that will be used to initialise a new persistent root disk (which will be named the same as the instance). Alternatively, rather than creating a new disk from an image, I could have told the instance to mount an existing disk as the root disk (in either read/write or read-only mode, though a given disk can only be mounted read/write by one instance at any time).
The image really is just a raw disk image, and it appears it can contain pretty much anything that can run as an x86-64 KVM guest, though all the documentation and tools currently assume you’ll be running some Linux distribution, so you may find it a little challenging to run something else (though plenty of people seem to be).
For convenience, Google provides links to images with recent versions of
Debian and CentOS (with RHEL and SUSE available as “premium” options), and
above I’m using the latest stable version of Debian Wheezy (debian-7
,
which is actually a partial match for something like
projects/debian-cloud/global/images/debian-7-wheezy-v20131120
).
Continuing with the options, --metadata_from_file and --authorized_ssh_keys both set instance metadata: the first sets the metadata value with the key startup-script to the contents of the file startup.sh, while the second sets the metadata value with the key sshKeys to a list of users and public keys that can be used to log into the instance (here, myuser is the username, and myuser.pub is the SSH public key file).
Both of these are specific to the instance (though it’s also possible to inherit metadata set at the project level), and can be queried — along with a host of other default metadata values — from the instance using a simple HTTP request that returns either a text string or JSON payload.
I’m not going to go into metadata in any detail other than the above.
The startup-script
metadata value is used to store the contents of a
script run (as root) after your instance boots. In my
case, I’m just using this to set the hostname of my instance, which was
otherwise unset3, which in turn makes a bunch of tools throw
warnings. I found the easiest way to fix this was to specify a startup
script containing just hostname www
.
The sshKeys
metadata value is used to store a list of users and SSH public
keys. This is read by a daemon (installed as
/usr/share/google/google_daemon/manage_accounts.py
) that ensures that each
listed user continues to exist (with a home directory,
.ssh/authorized_keys
containing the specified key, etc), and also
ensures that each listed user is a sudoer (present in
/etc/sudoers
)4.
Note that you don’t need to specify any of this at all. By default,
Compute Engine creates a user account on your instance with a name set to
your local login name, and creates a new ssh keypair that it drops into
~/.ssh/google_compute_engine{,.pub}
on the machine you created the
instance from. You can then simply use gcutil ssh instance-name
to ssh
into the instance.
This is helpful when you’re getting started, but it does mean that if you
want to ssh from anywhere else, you either need to copy those keys around,
or do something like the above to tell Compute Engine to accept a given
public key. Since I wanted to be able to ssh programmatically from machines
that didn’t necessarily have gcutil
installed, I found it simpler to just
create an ssh keypair manually, specify it as above, and use standard
ssh to ssh to the instance.
--external_ip_address
allows you to choose a “static” external IP address
(from one you’ve previously reserved). Otherwise, the instance is
assigned5 an ephemeral external IP address chosen when the
instance is created. This is reclaimed if you delete the instance, so
you probably don’t want to rely on ephemeral IP addresses as the target of
e.g. a DNS record.
However, you can promote an ephemeral address that’s assigned to an instance so that it becomes a static address, and there’s no charge for (in-use) static addresses, so there’s no problem if you start using an ephemeral address and later want to keep it. (Strictly speaking, external IP addresses are actually optional, as all of your instances on the same “network” can talk to each other using internal addresses, but this isn’t something simple installations are likely to use, I wouldn’t have thought.)
Compute Engine doesn’t currently have support for IPv6, oddly, though there’s a message right at the top of the networking documentation saying that IPv6 is an “important future direction”, so hopefully that’s just temporary. (EC2, for what it’s worth, doesn’t support IPv6 on their instances either, though their load balancers do, so you can use a load balancer as a [costly] way to get IPv6 accessibility.)
Finally (phew!), --wait_until_running
won’t return until the instance has
actually started booting (typically about 25 seconds; you can add a brand
new instance and be ssh’d into a shell in less than a minute.) Note that
the machine won’t have any user accounts until the initial boot has
finished, so if you’re scripting this you’ll need to spin a bit until ssh
starts working.
I did need to spend a fair amount of time working out how to configure the instance once it existed, but that was mostly because I’m not too familiar with Debian.
There isn’t a great deal to say about this part (and obviously it’ll depend
upon what you’re doing), but in my case I simply ran sudo apt-get install
to get the packages I needed (apache2
and mercurial
, and a few others
like less
and vim
), downloaded and installed
mod_pagespeed
, the Apache module that does the same thing
as PageSpeed Service, built my site, and set up Apache to serve it.
There are still two things I’m not quite happy with: the unattended-upgrades Debian package, which I believe I’ve now configured to apply security updates correctly, but I don’t fully understand what options I have here, or whether I have in fact configured it correctly; and using fetchmail (on another machine) in ODMR mode to inject mail to an existing account.
I’d estimate the monthly charge for my single instance to be around $15 + VAT (before the discount I’m getting). If I have the numbers right, that’s about what I’m paying for electricity to run my (not very efficient) server at home at present.
That price is dominated by the cost of the machine, which from the pricing documentation is currently $0.019/hour; the disk and network cost (for me) is going to end up at significantly less than a dollar a month.
I mention VAT above because something that’s not currently clear from the pricing documentation is that individuals in the EU pay VAT on top of the quoted prices (likewise true for EC2). Businesses are liable for VAT too, but they’re responsible for working out what to pay themselves, and do so separately.
One other aspect of the tax situation is a bit surprising (for individuals again, not businesses): VAT for Compute Engine is charged at the Irish VAT rate (23%), because when you’re in the EU, you’re paying Google Ireland. (This is in contrast to Amazon, who charge the UK rate even though you’re doing business with Amazon Inc. - tax is complicated.) Admittedly, the difference on the bill above is less than 30p/month, but it still took a little bit of time to figure out what was going on.
Despite all the talk of “big data” and large-scale data processing, is Compute Engine a viable option for running small-scale jobs like a simple static web server? Absolutely.
And while it’s easy to get started, it also looks like it scales naturally: I haven’t looked at load balancing (or protocol forwarding) in any detail, but everything else I’ve read about seems quite powerful and easy to start using incrementally.
From the management side, I’m impressed by the focus on scriptability:
gcutil
itself is fine as far as it goes, but the underlying Compute Engine
API is documented in terms of REST and JSON, and the Developers
Console goes out of its way to provide links to show you the REST
results for what it’s doing (as it just uses the REST API under the hood).
There are also a ton of client libraries
available (gcutil
is written against the Python API, for example), and
support from third-party management tools like
Scalr.
I still don’t think that I personally have any reason to use Compute Engine for large-scale processing, but I’m quite happy using it to serve this content to you.
Funny story: I wasn’t sure whether to mention reliability, since it’s actually been pretty good. Then a few hours after writing the first draft of this post, my router fell over, and it was half a day before I could get access to restart it. So there’s that. ↩
There was another reason I originally wanted to script the installation: the preview version of Compute Engine that I started out using supported non-persistent (“scratch”) machine-local disks that were zero-cost. Initially I was considering whether I could get the machine configured in a way that it could boot from a clean disk image and set itself up from scratch on startup. It turned out to be a little more complicated than made sense, so I switched to persistent disks, but kept the script (and then the 1.0 release of Compute Engine came along and did away with scratch disks anyway). ↩
It turns out this was caused by a bug in the way my Developers Console project was set up, many years ago; it doesn’t happen in the general case, and it’ll be fixed if I recreate my instance. ↩
This is actually a bit of a pain. I’m using this to create a service user that is used both for the initial install and website content updates, but I could probably do with separating out the two roles and creating the non-privileged user manually. ↩
Not actually assigned, but kinda: the instance itself only ever has an RFC 1918 address assigned to eth0
(by default, drawn from 10.240.0.0/16, though you can customise even that). Instead, it’s the “network” — which implements NAT6 between the outside world and the instance — that holds the external-IP-to-internal-IP mapping. The networking documentation covers this in extensive detail. ↩
I think even the NAT aspect is optional: Protocol forwarding (just announced today, as I write this) appears to allow you to attach multiple external IP addresses directly to a single instance, presumably as additional addresses on eth0
. ↩
Noda Time 1.2.0 finally came out last week, and since I promised I’d write a post about it, here’s a post about it — which I’ve also just partially self-plagiarised in order to make a post for the Noda Time blog, so apologies if you’ve read some of this already. I promise there’s new content below as well.
While the changes in Noda Time 1.1 were around making a Portable Class Library version and filling in the gaps from the first release, Noda Time 1.2 is all about serialization1 and text formatting.
On the serialization side, Noda Time now supports XML and binary serialization natively, and comes with an optional assembly (and NuGet package) to handle JSON serialization (using Json.NET).
On the text formatting side, Noda Time 1.2 now properly supports formatting
and parsing of the Duration
, OffsetDateTime
, and ZonedDateTime
types.
We also fixed a few bugs, and added some more convenience methods —
Interval.Contains()
and ZonedDateTime.Calendar
, among others — in
response to requests we received from people using the
library2.
Finally, it apparently wouldn’t be a proper Noda Time major release without
fixing another spelling mistake in our API: we replaced Period.Millseconds
in 1.1, but managed not to spot that we’d also misspelled Era.AnnoMartyrm
,
the era used in the Coptic calendar. That’s fixed in 1.2, and I think
(hope) that we’re done now.
There’s more information about all of the above in the comprehensive
serialization section of the user guide, the pattern
documentation for the Duration
,
OffsetDateTime
, and
ZonedDateTime
types, and the 1.2.0 release notes.
You can pick up Noda Time 1.2.0 from the NuGet repository as usual, or from the links on the Noda Time home page.
That’s the summary, anyway. Below, I’m going to go into a bit more detail about XML and JSON serialization, and what kind of things you can do with the new text support.
Using XML serialization is pretty straightforward, and mostly works as you’d expect. Here’s a complete example demonstrating XML serialization of a Noda Time property:
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;
using NodaTime;
public class Person
{
    public string Name { get; set; }
    public LocalDate BirthDate { get; set; }
}

static class Program
{
    static void Main(string[] args)
    {
        var person = new Person {
            Name = "David",
            BirthDate = new LocalDate(1979, 3, 22)
        };
        var x = new XmlSerializer(person.GetType());
        var namespaces = new XmlSerializerNamespaces(
            new XmlQualifiedName[] { new XmlQualifiedName("", "urn:") });
        var output = new StringWriter();
        x.Serialize(output, person, namespaces);
        Console.WriteLine(output);
    }
}
As you can see, there’s nothing special here, and the output is also as you’d expect:
<?xml version="1.0" encoding="utf-8"?>
<Person>
<Name>David</Name>
<BirthDate>1979-03-22</BirthDate>
</Person>
There are a couple of caveats to be aware of regarding XML serialization,
though, most notably that the Period
type requires special handling.
Period
is an immutable reference type, which XmlSerializer
doesn’t
really support, and so you’ll need to serialize via a proxy PeriodBuilder
property instead.
The other notable issue (which also applies to binary serialization) is that
.NET doesn’t provide any way to provide contextual configuration, and so
when deserializing a ZonedDateTime
, we need a way to find out which time
zone provider to use.
By default, we’ll use the TZDB provider, but if you’re using the BCL provider (or any custom provider), you’ll need to set a static property:
DateTimeZoneProviders.Serialization = DateTimeZoneProviders.Bcl;
The serialization section in the user guide has more details about both of these issues.
There are also two other limitations of XmlSerializer
that aren’t specific
to Noda Time, but are good to know about if you’re just getting started:
- Types that implement IXmlSerializable (as the Noda Time types do) can only be serialized as elements, and so annotating your properties with the XmlAttribute attribute won’t work (it appears that .NET will throw an exception, while Mono will instead do something strange).
- Types that don’t implement IXmlSerializable (and have no writable public properties) are silently serialized as empty elements and deserialized to their default values. This is unlikely to be what you want, and it’s what will happen if you accidentally run using a pre-1.2 Noda Time assembly.

Noda Time’s JSON serialization makes use of Json.NET, which means that
to use it, you’ll need to add references to both the Json.NET assembly
(Newtonsoft.Json.dll
) and the Noda Time support assembly
(NodaTime.Serialization.JsonNet.dll
).
The only setup you need to do in code is to inform Json.NET how to serialize
Noda Time’s types (and again, which time zone provider to use). This can
either be done by hand, or via a ConfigureForNodaTime
extension
method. Again, the user guide has
all the details.
Once that’s done, using the serializer is straightforward:
using System;
using System.IO;
using Newtonsoft.Json;
using NodaTime;
using NodaTime.Serialization.JsonNet;
internal class Person
{
    public string Name { get; set; }
    public LocalDate BirthDate { get; set; }
}

static class Program
{
    static void Main(string[] args)
    {
        var person = new Person {
            Name = "David",
            BirthDate = new LocalDate(1979, 3, 22)
        };
        var json = new JsonSerializer();
        json.ConfigureForNodaTime(DateTimeZoneProviders.Tzdb);
        var output = new StringWriter();
        json.Serialize(output, person);
        Console.WriteLine(output);
    }
}
Output:
{"Name":"David","BirthDate":"1979-03-22"}
The Json.NET serializer is significantly more configurable than the .NET XML serializer; the Json.NET documentation is probably a good place to start if you’re interested in doing that.
Noda Time 1.2 adds parsing and formatting for the Duration
,
OffsetDateTime
, and ZonedDateTime
types, which previously only had
placeholder ToString()
implementations. Given a series of assignments
like the following:
var paris = DateTimeZoneProviders.Tzdb["Europe/Paris"];
ZonedDateTime zdt = SystemClock.Instance.Now.InZone(paris);
OffsetDateTime odt = zdt.ToOffsetDateTime();
Duration duration = Duration.FromSeconds(12345);
the result of calling ToString()
on each of the zdt
, odt
, and
duration
variables would produce something like the following in 1.1:
Local: 26/11/2013 19:35:28 Offset: +01 Zone: Europe/Paris
2013-11-26T19:35:28.00081+01
Duration: 123450000000 ticks
In 1.2, these types use a standard pattern by default instead: the general
invariant pattern (‘G’), for ZonedDateTime
and OffsetDateTime
, and the
round-trip pattern (‘o’) for Duration
:
2013-11-26T19:35:28 Europe/Paris (+01)
2013-11-26T19:35:28+01
0:03:25:45
More usefully, we can now use custom patterns:
var pattern = ZonedDateTimePattern.CreateWithInvariantCulture(
"dd/MM/yyyy' 'HH:mm:ss' ('z')'", null);
Console.WriteLine(pattern.Format(zdt));
which will print “26/11/2013 19:35:28 (Europe/Paris)
”.
The null
above is an optional time zone provider. If not specified, as
shown above, the resulting pattern can only be used for formatting, and not
for parsing3. This is why the standard patterns are format-only:
they don’t have a time zone provider.
If you do specify a time zone provider, however, you can parse your custom format just fine:
var pattern = ZonedDateTimePattern.CreateWithInvariantCulture(
"dd/MM/yyyy' 'HH:mm:ss' ('z')'", DateTimeZoneProviders.Tzdb);
var zdt = pattern.Parse("26/11/2013 19:35:28 (Europe/Paris)").Value;
Console.WriteLine(zdt);
which prints “2013-11-26T19:35:28 Europe/Paris (+01)
”, as you would expect.
As well as formatting the time zone ID (the “z” specified in the format string above), you can also format the time zone abbreviation (using “x”), which given the above input would produce “CET”, for Central European Time.
Now, if you’ve seen Jon’s “Humanity: Epic fail” talk — or watched
his recent presentation at DevDay Kraków, which covers some of the
same content — then you’ll already know that time zone abbreviations
aren’t unique. For that reason, if you include a time zone abbreviation
when creating a ZonedDateTimePattern
, the pattern will also be
format-only.
In addition to the time zone identifiers, both ZonedDateTime
and
OffsetDateTime
patterns accept a format specifier for the offset in
effect. This uses a slightly unusual format, as Offset
can be formatted
independently: it’s “o<…>”, where the “…” is an Offset
pattern
specifier. For example:
var pattern = ZonedDateTimePattern.CreateWithInvariantCulture(
"dd/MM/yyyy' 'HH:mm:ss' ('z o<+HH:mm>')'", null);
Console.WriteLine(pattern.Format(zdt));
which will unsurprisingly print “26/11/2013 19:35:28 (Europe/Paris
+01:00)
”.
For OffsetDateTime
, the offset is a core part of the type, while for
ZonedDateTime
, it allows for the disambiguation of otherwise-ambiguous
local times (as typically seen during a daylight saving transition).
If the offset is not included, the default behaviour for ambiguous times is to consider the input invalid. However, this can also be customised by providing the pattern with a custom resolver.
Finally, to Duration
. Duration formatting is a bit more interesting,
because we allow you to choose the granularity of reporting. For our
duration above, of 12,345 seconds, the round-trip pattern shows the number
of days, hours, minutes, seconds, and milliseconds (if non-zero), as
“0:03:25:45
”.
We can also format just the hours and minutes:
var pattern = DurationPattern.CreateWithInvariantCulture("HH:mm");
var s = pattern.Format(duration);
Console.WriteLine(s);
which prints “03:25
” — or we can choose to format just the minutes and
seconds:
var pattern = DurationPattern.CreateWithInvariantCulture("M:ss");
var s = pattern.Format(duration);
Console.WriteLine(s);
which does not print “25:45
”, but instead prints “205:45
”, reporting
the total number of minutes and a ‘partial’ number of seconds. Had we
instead used “mm:ss
” as the pattern, we would indeed have seen the former
result; the case of the format specifier determines whether a total or
partial value is used.
Once again, there’s more information on all of the above in the relevant sections of the user guide.
or serialisation. I apologise in advance for the spelling, but the term turns up in code all the time (e.g. ISerializable
), and I find it makes for awkward reading to mix and match the two. ↩
There’s definitely a balance to be had between the Pythonesque “only one way to do it” maxim and providing so many convenience methods that they cloud the basic concepts, and I think for 1.0 we definitely tended towards the former — which isn’t that bad: it’s easy to expand an API, but hard to reduce it. Some things that were a little awkward should be easier with 1.2, though. ↩
The error message you’ll see is “UnparsableValueException: This pattern is only capable of formatting, not parsing.” ↩
“Measure twice, cut once.” I can’t recall exactly where I was when I first heard that: perhaps a school carpentry lesson? Or for some reason I’m now thinking it was a physics lesson instead, but no matter. What is true is that I recently discovered that it applies to software engineering as much as carpentry.
Here’s the background: this blog is generated as a set of static files, by transforming an input tree into an output tree. The input tree contains a mixture of static resources (images, etc) alongside ‘posts’, text files containing metadata and Markdown content. The output tree contains exactly the same files, except that the posts are rendered to HTML, and there are some additional generated files like the front page and Atom feed.
There are tools to do this kind of thing now (we use
Jekyll for Noda Time’s web site, for example), but I
wrote my own (and then recently rewrote it in Python)1. My version
does three things: it scans the input and output trees, works out what the
output tree should look like, then makes it so. It’s basically make
(with
a persistent cache and content-hash-based change detection) plus a dumb
rsync
, and it’s not particularly complex.
For my most-recent post, I needed to add support for math rendering, which I did by conditionally sourcing a copy of MathJax from their CDN. So far, so good, but then I wanted to be able to proof the post while I was on a plane, so I decided to switch to a local copy of MathJax instead.
Problem: a local install of MathJax contains nearly 30,000 files, and with all of those in the input tree, my no-change runs now took about twelve seconds.
But, you know, optimisation opportunity! I carried out some basic profiling and figured out roughly where the 12 seconds I was seeing was going: scanning the input and output trees and writing out my persistent cache accounted for a large share of it.
The remainder I couldn’t see any obvious way to improve, but the scanning and cache-writing times surprised me.
The input and output tree scanning is done by using os.walk()
to walk the
tree, and os.stat()
to grab each file’s mtime and size (which I use as
validators for cache entries).
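In outline, the scan amounts to something like the following (a simplified sketch rather than the real code, with invented names):

import os

def scan_tree(root):
    """Walk `root`, recording (mtime, size) for every file under it."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)  # one stat(2) call per file
            entries[os.path.relpath(path, root)] = (st.st_mtime, st.st_size)
    return entries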
Clearly that was an inefficient way to do it: I’m calling stat(2) about 30,000 times, when I should be reading that information in the same call as the one that reads the directory, right? Except that there’s no such call: the Linux VFS assumes that a file’s metadata is separate from the directory entry2; this isn’t DOS.
Perhaps I was thrashing the filesystem cache, then? Maybe I should be sorting (or not sorting) the directory entries, or stat-ing all the files before I recursed into subdirectories? Nope; doesn’t make a difference.
Well, I guess we’re blocking on I/O then. After all, git
doesn’t take
long to scan a tree like this, so it must be doing something clever; I
should do that. Ah, but git
is multi-threaded, isn’t it?3
I’ll bet that’s how it can be fast: it’s overlapping the I/O operations so
that it can make progress without stalling for I/O.
So I wrote a parallel directory scanner in Python, trying out both the
multiprocessing
and threading
libraries. How long did each
implementation take? About five seconds, same as before. (And raise your
hand if that came as a surprise.)
The next thing I tried was replicating the scan in C, just to double-check
that readdir
-and-stat
was a workable approach. I can’t recall the
times, but it was pretty quick, so Python’s at fault, right?
Wrong. It’s always your fault.
I realised then that I’d never tried anything outside my tool itself, and ported the bare C code I had to Python. It took exactly the same amount of time. (At which point I remembered that Mercurial, which I actually use for this blog, is largely written in Python; that should have been a clue that it wasn’t likely to be Python in the first place.)
So finally I started taking a look at what my code was actually doing with
the directory entries. For each one, it created a File
object with a
bunch of properties (name, size, etc), along with a cache entry to store
things like the file’s content hash and the inputs that produced each
output.
Now the objects themselves I couldn’t do much about, but the cache entry
creation code was interesting: it first generated a cache key from the name,
mtime, and size (as a dict
), then added an empty entry to my global cache
dict
using that key. The global cache was later to be persisted as a JSON
object (on which, more later), and so I had to convert the entry’s cache key
to something hashable first.
And how to generate that hashable key? Well, it turned out that as I
already had a general-purpose serialiser around, I’d made the decision to
reuse the JSON encoder as a way to generate a string key from my cache key
dict
(because it’s not like performance matters, after all). Once I’d
replaced that with something simpler, tree scans dropped from 5.2s to 2.9s.
Success!
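To give a flavour of the change (again a sketch, not the actual code): the key only needs to be deterministic and usable as a dictionary key, so a simple formatted string does the job, and is far cheaper than running the whole dict through the JSON encoder.

import json

def cache_key_slow(name, mtime, size):
    # What I was effectively doing: serialise a dict just to get a string key.
    return json.dumps({'name': name, 'mtime': mtime, 'size': size},
                      sort_keys=True)

def cache_key_fast(name, mtime, size):
    # A plain formatted string is deterministic, hashable, and still a valid
    # JSON object key when the cache is persisted later.
    return '%s|%s|%s' % (name, mtime, size)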
I’d also noticed something odd while I was hacking about: when I removed
some of my DEBUG
level logging statements, things sped up a bit, even
though I was only running at INFO
level. I briefly considered that
perhaps Python’s logging was just slow, then decided to take another look at
how I was setting up the logging in the first place:
logging.basicConfig(
    level=logging.DEBUG,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))
logging.getLogger().handlers[0].setLevel(
    logging.DEBUG if args.verbose else logging.INFO)
Python’s logging is similar to Java’s, so this creates a logger that logs
everything, then sets the default (console) log handler to only display
messages at INFO
level and above. Oops.
I’d stolen the code from another project where I’d had an additional
always-on DEBUG
handler that wrote to a file, but here I was just wasting
time formatting log records I’d throw away later. I changed the logging to
set the level of the root logger instead, and sped things up by another
second. More success!
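The fix amounts to setting the level on the root logger itself, rather than leaving it at DEBUG and filtering in the handler; something along these lines (with args.verbose as in the original snippet):

import logging

logging.basicConfig(
    # Setting the level here (on the root logger) means DEBUG records are
    # rejected before they're ever formatted, rather than being formatted
    # and then thrown away by the handler.
    level=logging.DEBUG if args.verbose else logging.INFO,
    format=('%(levelname).1s %(asctime)s [%(filename)s:%(lineno)s] '
            '%(message)s'))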
Finally, I decided to take a look at the way I was writing out my cache.
This is a fairly large in-memory dict
mapping string keys to dict
values
for each file I knew about. I’d known that Python’s json
module wasn’t
likely to be particularly fast, but almost three seconds to write an 11MB
file still seemed pretty slow to me.
I wasn’t actually writing it directly, though; the output was quite big and
repetitive, so I was compressing it using gzip
first:
with gzip.open(self._cache_filename, 'wb') as cache_fh:
    json.dump(non_empty_values, cache_fh, sort_keys=True,
              default=_json_encode_datetime)
I noticed that if I removed the compression entirely, the time to write the cache dropped from about 2900ms to about 800ms, but by this point I was assuming that everything was my fault instead, so I decided to measure the time taken to separately generate the JSON output and write to the file.
To my surprise, when I split up the two (using json.dumps()
to produce a
string instead of json.dump()
), the total time dropped to just 900ms. I
have no real idea why this happens, but I suspect that something is either
flushing each write to disk early or that calls from Python code to Python’s
native-code zlib module are expensive (or json.dump()
is slow).
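The version I ended up with is roughly the following (a sketch, reusing the names from the snippet above): build the JSON text in memory first, then hand it to the compressed stream in a single write.

with gzip.open(self._cache_filename, 'wb') as cache_fh:
    # Serialise to a string first and write it in one go, rather than
    # letting json.dump() issue many small writes to the gzip stream.
    data = json.dumps(non_empty_values, sort_keys=True,
                      default=_json_encode_datetime)
    cache_fh.write(data.encode('utf-8'))  # dumps() output is ASCII by default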
In total, that brought my no-change runtime down from twelve seconds to just about six.
So, in summary, once I realised that I should actually measure where my time was spent rather than guessing what it might be spent on, I was able to reduce my runtime by about half, quite a big deal in an edit-compile-run cycle. It took it from “I’m bored waiting; maybe read Slashdot instead” to something that was tolerable.
And so success, and scene.
But.
But there’s actually a larger lesson to learn here, too (and in fact I very
nearly titled this post “The \sqrt
of all evil” in light of that): I
didn’t need to do any of this at all.
A few weeks after the above, I realised that I could solve my local editing problem an entirely different way. I moved MathJax out of the content tree entirely, and now (on local runs) just drop a symlink into the output tree after the tool is done. So if you take a look at the page as it serves now, you’ll see I’m back to sourcing MathJax via their CDN.
This means that I’m back down to O(200) input files rather than O(30k), and my no-change builds now take 30ms. It was a fun journey, but I’m not entirely sure that cutting 45ms from a 75ms run was worth all the effort I put in…
Why? Mostly because I can: it’s fun. But also because it gives me an excuse to play around: originally with C and SQLite, and more recently with Python, a language I don’t use much otherwise. I could also say that static blog generators weren’t as common in 2006, but to be honest I don’t think I bothered looking. ↩
This is a good thing: hard links would be pretty tricky otherwise. ↩
Nope. git
is multi-threaded when packing repositories, but as far as I’m aware that’s the only time it is. ↩
Last month1, John Regehr asked (roughly) whether there’s a standard way to encode a combination of a fixed number of choices as an integer.
As it happens, this is something I’ve occasionally wondered about, too.
The short answer is that there is a standard (though not particularly fast) way to do this called the “Combinatorial Number System”. As usual, Wikipedia has the details, though in this case I found their explanations a little hard to follow, so I’m going to go through it here myself as well.
First, let’s back up for a bit and make sure we know what we’re talking about.
If you wanted a way to compactly encode any combination of — let’s say — 32 choices, your best bet (assuming a uniform distribution) would be to use a 32-element bit-vector, with one bit set per choice made.
If we were implementing that in C++, we could do something like this:
uint32_t encode(const std::set<int>& choices) {
  uint32_t result = 0;
  for (int choice : choices) {
    assert(0 <= choice && choice < 32);
    result |= 1 << choice;
  }
  return result;
}
This pattern is common enough that in practice you’d be more likely to use
the bit-vector directly, rather than materialise anything like an explicit
set
of choice numbers. But while it’s a good solution for the general
case, what if your problem involved some fixed number of choices instead
of an arbitrary number?
For the purposes of the rest of this discussion, let’s pretend we’re going to make exactly four choices out of the 32 possible. That gives us about 36,000 different possible combinations to encode, which we should be able to fit into a 16-bit value.
Actually, for almost all purposes, there’s still nothing really wrong with the implementation above, even though it is using twice as many bits as needed — unless perhaps you had a very large number of combinations to store (and are willing to trade simplicity for space). However, as we’ll see later, the solution to this problem has a few other uses as well. Plus, I think it’s an interesting topic in itself.
If you just want to read about how the combinatorial number system itself works, feel free to skip the next section, as I’m going to briefly take a look at a ‘better’ (though still non-optimal) alternative to the above.
As an improvement to the above one-bit-per-choice implementation, we could simply encode the number of each choice directly using four groups of five bits each:
uint32_t encode(const std::set<int>& choices) {
  assert(choices.size() == 4);
  uint32_t result = 0;
  for (int choice : choices) {
    assert(0 <= choice && choice < 32);
    result = (result << 5) | choice;
  }
  return result;
}
This is still fairly simple, and at 20 bits of output, it’s more compact than the 32-bit bit-vector; but it’s still not optimal, given that we said earlier that 16 bits should be enough.
We can actually see why this is: the encoding we’ve chosen is too expressive for what we actually need to represent2. In this case, there are two ways in which our encoding distinguishes between inputs that could be encoded identically.
The biggest waste occurs because we’re encoding an ordering. For example, \( \lbrace 1, 2, 3, 4 \rbrace \) and \( \lbrace 4, 3, 2, 1 \rbrace \) have different encodings under our scheme, yet really represent the same two combinations of choices.
The second (and much more minor) inefficiency comes from the ability to encode ‘impossible’ combinations like \( \lbrace 4, 4, 4, 4 \rbrace \). Optimally, choices after the first would be encoded using a slightly smaller number of bits, as we have fewer valid choices to choose from by then.
In this case, we can actually quantify precisely the degree to which this encoding is sub-optimal: being able to represent an ordering means that we have \( 4! = 24 \) times too many encodings for the ‘same’ input, while allowing duplicates means that we have \( 32^4 \div ( 32 \cdot 31 \cdot 30 \cdot 29 ) \approxeq 1.2 \) times too many (i.e. the number of choices we can encode, divided by the number we should be able to encode). Combining the two factors gives us the difference between an optimal encoding and this one.
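To put concrete numbers on that, using the figures above:

\[ {32^4 \over \binom {32} 4} = {1048576 \over 35960} \approx 29.2 \approx 24 \times 1.2 \]

In other words, the five-bits-per-choice encoding carries about \( \log_2 29.2 \approx 4.9 \) bits of redundancy, which squares with using 20 bits where roughly 15.1 would do.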
It’s interesting to think about how we might improve on the above: perhaps we could canonicalise the input by ordering the choices by number, say, then rely on that fact somehow: if we knew that the choices were in decreasing order, for example, it’s clear to see that we could identify the combination \( \lbrace 3, 2, 1, 0 \rbrace \) entirely from the information that the first choice is number 3.
But let’s move on to something that is optimal.
And so we arrive at the “combinatorial number system”. This system describes a bijection between any combination of (a fixed number of) \(k\) choices and the natural numbers.
Interestingly, this means that this scheme does not depend upon knowing the number of things you’re choosing from: you can convert freely between the two representations just by knowing how many choices you need to make.
First, a brief refresher on binomials. The number of ways we can choose \(k\) things from \(n\) is the binomial coefficient, which can be defined recursively:
\[ \eqalign{ \binom n 0 &= \binom n n = 1 \cr \binom n k &= \binom {n-1} {k-1} + \binom {n-1} k } \]
… or, in a way that’s simpler to compute directly — as long as the numbers involved are small — in terms of factorials:
\[ \binom n k = {n! \over k!(n - k)!} \]
For example, we can compute that there are exactly 35,960 different ways to choose four things from a set of 32:
\[ \binom {32} 4 = {32! \over 4!(32 - 4)!} = 35960 \]
Note that in some cases (for example, the one immediately below), we may also need to define:
\[ \binom n k = 0 \text { when } k \gt n \]
That is, it’s not possible to choose more things from a set than were originally present.
But that’s enough about binomials. The combinatorial number system, then, is defined as follows:
Given a combination of choices \( \lbrace C_k, C_{k-1}, \dots, C_1 \rbrace \) from \(n\) elements such that \( n \gt C_k \gt C_{k-1} \gt \dots \gt C_1 \ge 0 \), we compute a value \(N\) that encodes these choices as:
\[ N = \binom {C_k} k + \binom {C_{k-1}} {k-1} + \dots + \binom {C_2} {2} + \binom {C_1} {1} \]
Somewhat surprisingly, this produces a unique value \( N \) such that:
\[ 0 \le N \lt \binom n k \]
And since the number of possible values of \( N \) is equal to the number of combinations we can make, every \( N \) maps to a valid (and different) combination.
\( N \) will be zero when all the smallest-numbered choices are made (i.e. when \( C_k = k - 1 \) and so \( C_1 = 0 \)), and will reach the maximum value with a combination containing the largest-numbered choices.
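As a quick worked example (the combination is picked purely for illustration), encoding the four choices \( \lbrace 7, 4, 2, 0 \rbrace \) gives:

\[ N = \binom 7 4 + \binom 4 3 + \binom 2 2 + \binom 0 1 = 35 + 4 + 1 + 0 = 40 \]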
We could implement this encoding using something like the following:
uint32_t binom(int n, int k) {
  assert(0 <= k);
  assert(0 <= n);
  if (k > n) return 0;
  if (n == k) return 1;
  if (k == 0) return 1;
  return binom(n-1, k-1) + binom(n-1, k);
}

uint32_t encode(const std::set<int>& choices) {
  std::set<int, std::greater<int>> choices_sorted(
      choices.begin(), choices.end());
  int k = choices.size();
  uint32_t result = 0;
  for (int choice : choices_sorted) {
    result += binom(choice, k--);
  }
  return result;
}
… though in reality, we’d choose a much more efficient way to calculate binomial coefficients, since the recursive implementation above ends up calling binom() a number of times proportional to the resulting value of \(N\).
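One standard improvement, for example, is to compute each coefficient iteratively from the multiplicative form, multiplying and then dividing at each step so that every intermediate result stays an integer:

\[ \binom n k = \prod_{i=1}^{k} {n - k + i \over i} \]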
Decoding can operate using a greedy algorithm that first identifies the greatest-used choice number, then successively removes the terms we added previously:
std::set<int> decode(uint32_t N, int k) {
  int choice = k - 1;
  while (binom(choice, k) < N) {
    choice++;
  }
  std::set<int> result;
  for (; choice >= 0; choice--) {
    if (binom(choice, k) <= N) {
      N -= binom(choice, k--);
      result.insert(choice);
    }
  }
  return result;
}
We could also choose to remove the initial loop and just start choice
at
the greatest possible choice number, if we knew it in advance.
As to how all this works, consider the list produced for successive \(N\).
For \(k = 4\), the enumeration of the combinations begins:
\[ \displaylines{ \lbrace 3, 2, 1, 0 \rbrace \cr \lbrace 4, 2, 1, 0 \rbrace \cr \lbrace 4, 3, 1, 0 \rbrace \cr \lbrace 4, 3, 2, 0 \rbrace \cr \lbrace 4, 3, 2, 1 \rbrace \cr \lbrace 5, 2, 1, 0 \rbrace \cr \vdots } \]
As you can see, this list is in order. More specifically: it’s in lexicographic order. This isn’t by coincidence, but is actually a direct result of the way we construct the equation above. Let’s do that.
First, construct an (infinite) list of all possible \(k\)-combinations, with the choices that form each individual combination in descending order, as above. Sort this list in lexicographic order, as if the choices were digits in some number system, again as shown above.
Pick any entry in that sorted list. We’re going to count the number of entries that precede our chosen one.
To do so, for each ‘digit’ (choice) in our chosen entry, count all valid combinations of the subsequence that extends from that digit to the end of the entry, while only making use of the choices with numbers smaller than the currently-chosen one. Sum those counts, and that’s the number of preceding entries.
That sounds much more complex than it really is, but what we’re doing is equivalent to, in the normal decimal number system, saying something like: “the count of numbers less than 567 is equal to the count of numbers 0xx–4xx, plus the count of numbers 50x–55x, plus the count of numbers 560–566”. Except in this case, the ‘digits’ in our numbers are strictly decreasing.
I skipped a step. How do we find the number of combinations for each subsequence? That’s actually easy: if the choice at the start of our subsequence currently has the choice number \(C_i\) and the subsequence is of length \(i\), then the number of lexicographically-smaller combinations of that subsequence, \(N_i\), is the number of assignments we can make of \(C_i\) different choices3 to the \(i\) different positions in the subsequence.
Or alternatively,
\[ N_i = \binom {C_i} i \]
and so:
\[ \eqalign{ N &= \sum_{i=1}^k N_i \cr &= \binom {C_k} k + \binom {C_{k-1}} {k-1} + \dots + \binom {C_2} {2} + \binom {C_1} {1} } \]
\(N\), of course, is both the count of entries preceding the chosen entry in our sorted list, and the index that we assign to that entry.
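As a sanity check against the enumeration above, \( \lbrace 5, 2, 1, 0 \rbrace \) appears sixth in the list (index 5), and indeed:

\[ \binom 5 4 + \binom 2 3 + \binom 1 2 + \binom 0 1 = 5 + 0 + 0 + 0 = 5 \]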
Okay, so perhaps it’s not that common to need to encode a combination as an integer value, but there’s another way to use this to be aware of here: if you pick a random \(N\) and ‘decode’ it, you end up with a uniformly-chosen random combination. That’s something that I have wanted to do on occasion, and it’s not immediately clear how you’d do it efficiently otherwise.
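Here’s a self-contained sketch of that idea (in Python rather than the C++ above, and using math.comb for the binomial coefficients, so it needs Python 3.8+); decode() follows the same greedy algorithm as before:

import math
import random

def decode(N, k):
    """Greedily decode N into the k-combination it represents (largest first)."""
    result = []
    # Find an upper bound on the largest choice number used.
    choice = k - 1
    while math.comb(choice, k) <= N:
        choice += 1
    for i in range(k, 0, -1):   # i = k, k-1, ..., 1 choices still to make
        # The largest 'choice' with C(choice, i) <= N is the next element.
        while math.comb(choice, i) > N:
            choice -= 1
        N -= math.comb(choice, i)
        result.append(choice)
    return result

def random_combination(n, k):
    """Return a uniformly-chosen k-combination of {0, ..., n-1}."""
    N = random.randrange(math.comb(n, k))  # 0 <= N < C(n, k)
    return decode(N, k)

print(random_combination(32, 4))  # e.g. [27, 14, 9, 2]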
(For completeness, a third usage that Wikipedia mentions is in being able to construct an array indexed by \(N\) in order to associate some information with each possible \(k\)-combination. I can see what they’re saying, but I can’t see many cases where this might actually be something that you need to do.)
A simple bit-vector is optimal for representing any number of choices \( [0, n] \) made from \(n\) items. The combinatorial number system is optimal for representing a fixed number of choices \(k\).
What’s an optimal way to represent a bounded number of choices \( [0, k] \)? Or more generally, any arbitrarily-bounded number of choices?
I’m a slow typist. ↩
The other possibility is that the encoding may be wasteful in a straightforward information-theoretic sense, akin to using six bits to represent “a number from 0 to 37” rather than ~5.25. While this is strictly a superset of the over-expressiveness problem mentioned in the main text, it seems useful to differentiate the two, and consider the semantics expressed by the encoded representation we’ve decided upon. ↩
That is, the count of those choice numbers that are smaller than \(C_i\), since choice numbers start from zero. ↩