farblog

 by Malcolm Rowe

Collecting Coupons

I’ll freely admit that I don’t need to use a lot of maths1 in my day job: pretty much the only things I really need to understand are basic statistics and, occasionally, how to run a t-test to see whether an optimisation has actually improved things (though I guess I don’t even need to understand the maths for that one).

But sometimes maths ends up happening anyway, and it can be interesting when that happens. Here’s one such story:

Let’s say that we have some frontend service that we’d like to understand the performance of in a production configuration. Maybe we want to understand its RAM usage under load, or the latency for some given types of requests, that kind of thing.

In our example system, we’ll send requests to a single frontend task, but there’s a wrinkle: that frontend uses a sharded or replicated backend RPC service to service these requests, and the frontend will only connect2 to a given backend task when needed. That connection setup will take a little extra time whenever it occurs, which we don’t want to count in our test.

If we keep sending requests, then eventually the frontend task should have connected to all the backend tasks, and we’ll have a steady-state we can start analysing.

The question then is, how long will it take for our system to set up all of these connections and reach that steady-state? Or, more precisely, how many warm-up requests do we need to send before we can start our real measurements?

Let’s say that our frontend has a hundred backend tasks that it can connect to, and that we’ll model the selection of each backend task as uniformly random for each request. Do we need to send a few hundred requests to our frontend before all of those backends are connected, or a few thousand? Or more?

Incidentally, when I did something like this for real, I used a pragmatic solution instead: keep sending requests until the metric I was monitoring looked like it had settled down, then start measuring. But it did make me wonder how we might go about calculating this mathematically.

Back-of-the-envelope maths time: we know that it has to be at least 100 requests, since we can only make at most one connection to a backend task for any given frontend request. We also know that doing it in exactly 100 requests should both be possible and extremely unlikely:

  1. The first request we send will definitely trigger a connection to some backend task, since we haven’t made any connections at all yet.
  2. The second request will probably make another connection. Since we have 100 backend tasks, the chance is only $1 / 100$ that we pick the same backend as we used for the first request, so in $99 / 100$ cases we’ll connect to a new backend task.
  3. The third request will connect to a third backend in $98 / 100$ cases, and so on.

After we’ve sent that 100th request, the chance that we’ve connected to all 100 backends is therefore:

$$ \eqalign{ &\frac{100}{100} \times \frac{99}{100} \times \frac{98}{100} \times \dots \times \frac{2}{100} \times \frac{1}{100}\cr = &\frac{100!}{100^{100}} } $$

… which is a very small number: about 1 in $10^{42}$.

So if we want to have a high chance of having connected to all of the backend tasks, we’ll definitely need more than 100 requests, but it’s still not very clear how many: in theory, it could take an unbounded number of requests, since we might never pick all 100!

It turns out that this problem has a name: it’s the coupon collector’s problem, based on the idea of collecting all of $n$ different coupons by drawing randomly, one-by-one.

Calculating the expected number of requests we need is fairly straightforward: we can just sum the expected number of requests we need to make each new connection:

  1. For the first connection, we always need exactly one request.
  2. For the second connection, the probability of making a new request is $99 / 100$, so the expected number of requests we need is $100 / 99$.
  3. For the third connection, the expected number of requests is $100 / 98$, and so on.

To make the last connection, we expect that it’ll take $100 / 1$ or 100 requests, which makes sense: we have to pick the one unconnected backend from 100 options (in the same way that on average you’d expect to need to make six die rolls on a regular six-sided die in order to roll some pre-chosen number).

So, as Wikipedia tells us, if we calculate3

$$ \sum_{k=1}^{n} \frac{n}{k} $$

for n=100, we get about 518.74, which is the expected number of requests we’d need to send.

But what does this number actually tell us? If we send 519 requests, does that mean that we’ll probably have connected to all of the backends? Or that we have exactly a 50% chance of having done so? How many requests do we need if we want to have a 95% chance of having connected to all of the backends, say?

What we’ve just calculated is the expected value, and is an average of the number of requests.

In other words, if we were to run this experiment a large number of times, counting the number of requests until we successfully connected to all the backends ($C_1, C_2, C_3, \dots$), then the expected value is simply the average of those counts.

Since some of these counts might be quite large, this expected value isn’t actually that useful for thinking about the number of requests we’re going to need to send. What we really want to know is what the actual distribution looks like.

Since randomly picking those last few backends takes quite a few requests, the distribution is heavily skewed to the right. To cut to the chase, it looks like this:

A right-skewed probability distribution graph for the coupon collector's problem with 100 coupons. The curve rises sharply to a peak at 460 attempts, then slowly tapers off into a long tail to the right.
Probability distribution (PMF) for the coupon collector’s problem, for n=100 coupons

For every point $x$ on the x-axis, the height of the curve shows the probability of collecting 100 coupons — or connecting to all 100 backends, in our example — in exactly $x$ attempts, so for $x = 100$ we have the very small (but non-zero) probability we calculated earlier, while for $x \lt 100$ (the hatched area) the probability is zero, as those values are impossible.

The expected value (518.74) we calculated above is marked, as are the median (497) and 95th percentile (754). All of these are larger than the mode (460), which is the most-likely number of attempts we’d need.

This graph also has non-zero values everywhere for $x \geq 100$, though it drops off quickly: the probability that we’d take more than 1000 attempts is only 0.4% or so, for example.

Wikipedia goes into the details of how to calculate this distribution exactly, though I don’t feel competent enough to explain it here. One useful nugget, though, is that we can estimate some of the statistics if we map this distribution to what is apparently called a Gumbel distribution:

For example, for $n=100$ and $p=0.95$, this gives a mode of 460, which is correct, and a 95th percentile of 758, which is pretty close to the real value of 754.

In this example, we could be very sure (~99.57%) that if we were to send about a thousand requests, we’ll have connected to all of our backends.

So, the next time you need to work out how many requests it takes to warm up a cache or fill a connection pool, you could use this to work out some magic numbers. But there’s an important caveat for this specific example: all of the above relies on our initial assumption that picking a backend is uniformly random.

In our non-spherical-cow world, load balancers don’t pick backends at random, and instead use a round-robin or similar strategy that would most likely end up actively picking the unconnected backends, significantly reducing the number of requests we actually need to connect to everything.

Maths gives us an interesting worst-case upper bound, but it also suggests why the pragmatic solution was a better idea: sometimes it’s just easier to watch your latency metrics until they flatten out, and let the maths happen in the background.


  1. or “math”, for y’all in en-US. 

  2. Perhaps you’re wondering whether our hypothetical frontend could just pre-connect to all possible backend tasks upon startup? Sure, it could, but let’s pretend that there are good reasons that it doesn’t. In any case, the same principles here can also apply to other areas like cache warming, for example, so let’s just go with the maths for now. 

  3. We can actually compute a closed-form approximation of the expected value by way of harmonic numbers. It turns out that the exact value we want is $n H_n$, where $H_n$ is the nth harmonic number. We can approximate $H_n$ as $\ln{n} + \gamma + \frac{1}{2n}$, where $\gamma$ is Euler’s constant, approximately $0.577$, and so multiplying that approximation of $H_n$ by $n$ produces a reasonable approximation of the expected value for the coupon collector’s problem. 

A moderately deep dive into filesystem times on Linux

This last weekend, I spent a little time noodling around with the static site generator that I wrote to generate this website. One thing I remembered was that a few of its tests were slightly flaky, and this time I really wanted to dig into what was going on.

One such flaky test did the following:

  1. Get the current time, as start.
  2. Do some processing that should update a file.
  3. Get the current time again, as end.
  4. Expand the range of start and end so that they fall on second boundaries, because “some filesystems can only store modification times to the second”.
  5. Check that the target file’s modification time falls within [start, end).

Perhaps you already know where this is going, but while the test above might work on a strictly POSIX-compliant system (more on that later), it doesn’t work in practice on Linux, occasionally failing because the modification time we see in step 2 is slightly before the start time we obtain in step 1, albeit just by a few milliseconds!

Exactly why this happens is down to how Linux updates file modification times when a file is written to.

I’m going to dig into this below, but to avoid making this any longer, I’m also going to limit myself a bit, and make some simplifying assumptions:

First, a little history

Let’s start our story back in 1979.

While earlier Unixen had creation and modification times, it was V7 Unix that introduced the “atime”, “mtime”, and “ctime” names that later made their way into the POSIX standard. The original1 POSIX standard defined these as:

time_t    st_atime   Time of last access. 
time_t    st_mtime   Time of last data modification. 
time_t    st_ctime   Time of last status change. 

I think the first two are pretty clear, but ctime is a little confusing: from the “ctime” name, you might have assumed it was “creation time”, but as you can see from above, it’s a “status change” time. Effectively, mtime is the time that the file contents changed, while ctime is the time the file metadata changed (and since the metadata includes the mtime, any update to mtime must also update ctime).

While POSIX doesn’t include any concept of file creation time, some operating systems and filesystems do. For example, ext2 does, and Linux has a statx() system call that returns an extra btime field with a file’s creation time (“btime” from “birth time”, following prior usage in BSD).

What POSIX says about how file times are updated is actually quite interesting. Rather than just specifying that modification times are updated when a file is written to, the various times are instead “marked for update” whenever certain specific operations complete, so (e.g.) a successful write() will mark mtime and ctime as to-be-updated.

Implementations are then free to actually run that update later on (“At an update point in time, any marked fields shall be set to the current time”, says POSIX), subject to certain operations that must trigger an update (calling stat(), for instance).

All that seems okay for our test, though? While we might not see exactly the right modification time, we should still see a time that sits within the range we’re expecting? Except we don’t.

As far as I can tell, the reason for that is that Linux doesn’t strictly follow POSIX here, though I must admit that it’s not completely clear: POSIX does leave quite a lot of room for ambiguity anyway2.

What times can filesystems actually store?

Before we talk about Linux in general, it’s probably worth talking about the difference between granularity (what the filesystem can store) and accuracy (how close the stored time is to real time), since that’s something that’s confused me in the past.

Different filesystems support different granularities for the times that they can store. To pick a few examples:

To be extremely specific about ext2/3/4 for a moment: the on-disk representation supports nanosecond resolution (and dates after 2038) if the filesystem was created with an inode size of 256 bytes rather than 128 bytes (as shown in tune2fs -l output).

If you have an ext2/3/4 filesystem, it’s almost certainly using 256-byte inodes already: they became the default in 2008 when creating a filesystem of 512MiB or larger, and nowadays are used by default regardless of size. Modern Linux kernels3 support whatever the filesystem supports.

Incidentally, this does make me wonder whether I actually had access to an older filesystem when I was writing the test I mentioned at the start of this post, or whether I’d just misunderstood why the times I was seeing in my test didn’t match up with what I expected4.

So if some filesystems only support certain granularities, what happens if we try to set a finer-grained value ourselves? In this case, the kernel will truncate the value to what the filesystem supports first (see s_time_gran in the kernel source, which is how each filesystem declares its granularity). This truncation (rather than round-to-nearest, say) is also what POSIX requires.

We can see this behaviour easily by creating a file with a fixed modification time on a few different filesystems.

First, ext4. As mentioned above, this supports nanosecond granularity, so the value we specify is used directly for the access and modification times (while the current time is recorded for the status change and file creation times):

$ touch foo -d '1985-10-26T01:21:03,123456789'; stat foo
[...]
Access: 1985-10-26 01:21:03.123456789 -0700
Modify: 1985-10-26 01:21:03.123456789 -0700
Change: 2025-11-27 02:11:45.779834694 -0800
 Birth: 2025-11-27 02:11:45.779834694 -0800

If we explicitly create the ext4 filesystem with 128-byte inodes, we can see that we only have times stored to seconds resolution (and that we no longer have a creation time):

Access: 1985-10-26 01:21:03.000000000 -0700
Modify: 1985-10-26 01:21:03.000000000 -0700
Change: 2025-11-27 02:16:14.000000000 -0800
 Birth: -

While for something like FAT, we get this mix of granularities:

Access: 1985-10-25 17:00:00.000000000 -0700
Modify: 1985-10-26 01:21:02.000000000 -0700
Change: 1985-10-26 01:21:02.000000000 -0700
 Birth: 2025-11-27 02:17:57.490000000 -0800

On FAT filesystems, access times use day granularity (the value above is midnight UTC for the time I gave), modification times use two-second granularity (FAT doesn’t store a separate ctime, so ctime is always equal to mtime), while creation times use 10ms granularity.

Back to ext4: if we create a file with the current time, we can also see the weird behaviour I mentioned right at the start:

$ date --iso-8601=ns; touch foo; stat foo
2025-11-27T02:22:03,520075379-08:00
  File: foo
[...]
Access: 2025-11-27 02:22:03.518535805 -0800
Modify: 2025-11-27 02:22:03.518535805 -0800
Change: 2025-11-27 02:22:03.518535805 -0800
 Birth: 2025-11-27 02:22:03.518535805 -0800

The recorded times are 1.5 milliseconds before the time printed by the preceding date command! What’s going on?

What’s the time, Mister Linux?

So what about accuracy? How does Linux choose what mtime value to write when a file is updated, and why does it seem like time is going backwards here?

On the face of it, this seems like an odd question: why wouldn’t the kernel just set the modification time equal to the current time every time a file is written? I think that’s what most people (myself included) would have assumed was happening.

The main reason (and probably the only reason, as far as Linux is concerned) is efficiency: files are written to a lot, and recording a new modification time on every write would potentially mean doing a lot of extra work, even if everything stays in cache. Not only that, but merely fetching the current time can be surprisingly expensive5, especially on very old hardware without access to CPU cycle counters.

Instead, Linux (up until 6.13) maintains a ‘coarse’ current time that updates every timer ‘tick’ (usually once every 4ms or 10ms6), and then all writes made (to any file) during that interval will record exactly the same modification time, even if the filesystem could store a finer-grained value.

This is absolutely the explanation for my test failures above: the modification time I’m seeing is the time of the last timer tick, which is before the start time that I captured before writing the file. Having tested this explicitly, it looks like in practice the mtime I see is indeed always older than the real current time by somewhere between 0–4ms plus a constant (~700µs), which is exactly what we’d expect.

Is Linux actually being POSIX-compliant here? Not that it matters too much, but I’d say not? While POSIX allows file time updates to be delayed, I don’t see that it allows the time used during an update to be anything other than the current time at the point the update is run7, which is the same time printed by date, etc.

In other words, POSIX seems to require that the recorded time is on-or-after the actual time, and Linux records a time that’s on-or-before the actual time.

Multigrain timestamps!

While I was researching this, I ran across a new feature called multigrain timestamps. This isn’t present in the version of Debian that I’m currently using (it was added to Linux 6.13, so it’ll be in Debian 14), but it does change kernel behaviour in an interesting way.

The kernel documentation is pretty easy to read, but in summary this feature watches to see if a file’s timestamp has been observed (with stat(), e.g.) since its times were last updated, and if so, the next write will use a fine-grained timestamp (i.e. the ‘real’ time), if using a coarse-grained timestamp would make it look like the modification time hadn’t changed.

(There’s a small extra wrinkle in that, to maintain ordering, the act of using a fine-grained timestamp for any file also has to drag along the current ‘coarse’ timestamp, otherwise a file that only needs a coarse-grained timestamp could appear to have been updated before an earlier update that needed a fine-grained timestamp.)

It seems like this was primarily intended for NFSv3 exports, which want to use timestamps to see if a file has changed since they were last read, but I think it should also help in any other cases where you have something watching a file for changes.

(It won’t help fix my tests, though, since the decision about whether to use a fine-grained or coarse-grained timestamp is made when the file is written, and the first write to a file will — I assume — always use a coarse-grained timestamp.)

Fixing (‘fixing’) my tests

So how should I fix my flaky tests?

I could have fixed them by replacing “Get the current time” with “Create a temporary file on the same filesystem as the target file and read its mtime” (or alternatively I’m pretty sure I could read CLOCK_REALTIME_COARSE directly, albeit that might not be easy from Python).

If I were to do that, the three times would then be monotonically increasing. In practice, this test takes much less than 4ms to run, so it’s actually extremely likely that all three times would be exactly the same. (This should not be that much of a surprise.)

But taking a step back, it’s also worth considering why I had this test in the first place.

In this case, the functionality was inherited from an even older incarnation of this tool that relied upon Subversion’s use-commit-times feature to record when each post was last updated, and a desire to have the (HTML) output file have a last-modified timestamp close to that of the (Markdown) source file.

I don’t store posts in Subversion any more, so this whole feature wasn’t really achieving much. And by far the most-sensible thing to do here is to simply to delete the code (and tests) that was playing around with mtime, and just let the kernel pick a reasonable time by itself.

And so that’s how I fixed my flaky test.

References

In the interests of citing my sources, here’s a few more that I used while researching this post.

The Linux source and mailing lists are a great resource for finding out why things work the way they do. In particular, the following were particularly useful:

Also:


  1. I’m quoting from the older (2004) POSIX standard here for simplicity: later versions replace the time_t fields with struct timespec fields with slightly different names, and redefine the existing members as macros. 

  2. Among the other things that POSIX leaves unspecified: whether file time updates must run for all files at once, or whether it’s technically compliant for two files updated one after another to end up with mtimes in the opposite order. (Nobody does that last one in practice — it’d break tools like make that need to see if an input file has changed relative to an output file — but I don’t see how POSIX disallows it.) 

  3. If you happen to be looking at the Linux source code (as I was when writing this post), note that only fs/ext4 supports nanosecond resolution and years past 2038. In contrast, fs/ext2 does not: even though it can use ext2 filesystems with 256-byte inodes, it won’t do anything with the additional data. However, fs/ext4 is almost certainly what any modern kernel will be configured to use for ext2 (and ext3) filesystems, as fs/ext2 is essentially just kept around for reference purposes now. 

  4. That test dates from 2013, but at the time I would have been running them on a small Debian 7 box under my desk (whereas now I’m running on a small Debian 13 box under someone else’s cloud), but I don’t think that’s old enough for the filesystem to have been created with 128-byte inodes. It’s just about possible that I was referring to an even-older Slackware box that I might still have been using at the time. 

  5. Applications reading the current time repeatedly is also why nowadays userspace calls like clock_gettime() and gettimeofday() usually don’t involve a kernel syscall (and so won’t appear in strace output); see man vdso for a detailed explanation. In that case the returned time is always correct, though. 

  6. The kernel tick rate is determined by CONFIG_HZ, which on Debian is set to 250, giving an update every 1s/250 = 4 milliseconds. 

  7. I do see the (non-normative) POSIX rationale, which says “The accuracy of the time update values is intentionally left unspecified so that systems can control the bandwidth of a possible covert channel”, which is a) not what I would have guessed for the primary justification! but also b) reads more to me as not setting an upper-bound on the amount of time that can pass between a write and an (unforced) file time update, rather than allowing an arbitrary (and otherwise unmentioned) jitter to be introduced into update time values. 

Noda Time 3.0.0

Noda Time 3.0.0 came out yesterday1, bringing a shiny new parcel of date- and time-related functionality.

What’s new in 3.0? Firstly, there’s a couple of things in 3.0 that just plain make it easier to use Noda Time:

Performance

Although not as significant as the changes from Noda Time 1.x to 2.x, performance is still a key concern for Noda Time.

In 3.0.0, we’ve managed to eke out a little more performance for some common operations: finding the earlier of two LocalDate values now takes somewhere between 40–60% of the time it did in Noda Time 2.x, while parsing text strings as LocalTime and LocalDate values using common (ISO-like) patterns should also be a little faster, taking around 90% of the time it did in Noda Time 2.x.

Caveats

The change from Noda Time 2.x to 3.0 is not as big a change as the one from Noda Time 1.x to 2.0, but there are still some small incompatibilities to watch out for.

The migration document details everything that we’re aware of, but there are two points worth calling out explicitly:

In general, though, we expect that most projects using Noda Time 2.x should be able to replace it with Noda Time 3.0.0 transparently.

Availability

You can get Noda Time 3.0.0 from the NuGet repository as usual (core and testing packages), or from the links on the Noda Time home page.

Note that the serialization packages were decoupled from the main release during the 2.x releases, and so (for example) there is no new version of NodaTime.Serialization.JsonNet; the current version of that library will work just fine with Noda Time 3.0.0.

What’s next?

Good question. While Noda Time is fairly mature as a library, we do have a few areas we’d like to explore for the future: making use of Span<T> in text parsing, and providing a little more information from CLDR sources (stable timezone IDs, for example). If you’re interested in helping out, come and talk to us on the mailing list.


  1. And once again, I’m going to copy/paste this to produce the official Noda Time blog post. (The evidence suggests that this is the only way I’ll get any content on my personal site, after all.) 

Books of 2018

[Insert obligatory “well, it’s been a while since I’ve written anything for this blog” paragraph here.]

With 2018 finally complete, I thought it might be fun to take a quick look at the books I read last year.

Goodreads has a “reading challenge” each year wherein you can set a target number of books to read. In 2016, I hit my target of 34 books, albeit only by cramming both the SRE book and The Calendar of the Roman Republic (long story) on the last day of that year. Buoyed by success, I increased it to 38 books for 2017… and then got distracted by life and fell a bit short.

So, for 2018, I kept the same target as for 2017, and tried to not get distracted. A few weeks ago, I’d got a little bit ahead of that — woohoo me! — and decided it might be fun to put together a short review of each. So here are all the books I read in 2018, in (roughly) chronological order.

To sum up: I managed to read 40 books last year, almost all of which were fiction, mostly urban fantasy and sci-fi, to nobody’s surprise. (I also started and failed to finish a bunch of non-fiction books).

I think I did a better job of picking books with diverse protagonists this time round, and while most of the books I read were published in the last few years (40% were published in 2018), I managed to also seek out a few older ones (Kindred, for example, I’m really glad I got round to reading).

Onward to 2019!


  1. I’d have called it sci-fi purely because it has time-travel, but I ran across an interview with Butler in which she points out, “Kindred is fantasy. I mean literally, it is fantasy. There’s no science in Kindred.” She has a point. 

  2. … though from what I can tell, 6 Ma is squarely in the Miocene epoch, not the Pliocene. In A Pliocene Companion, Word of God resolves this by stating that, in-universe, the Pliocene is considered to start around 11 Ma (not 5.6 or 5.33 Ma, as in our reality). 

  3. And to a large extent, discrimination that’s still present today: there’s a line where our heroine says that “people would ignore what I said until [my husband] repeated it”, which sounds familiar enough.