farblog

by Malcolm Rowe

A moderately deep dive into filesystem times on Linux

This last weekend, I spent a little time noodling around with the static site generator that I wrote to generate this website. One thing I remembered was that a few of its tests were slightly flaky, and this time I really wanted to dig into what was going on.

One such flaky test did the following:

  1. Get the current time, as start.
  2. Do some processing that should update a file.
  3. Get the current time again, as end.
  4. Expand the range of start and end so that they fall on second boundaries, because “some filesystems can only store modification times to the second”.
  5. Check that the target file’s modification time falls within [start, end).
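The steps above can be sketched in Python (this is an illustration, not the real test: `process_file` stands in for the actual processing step, which isn't shown in this post):

```python
import math
import os
import time

def process_file(path):
    # Stand-in for the real processing step: rewrite the file, which
    # should also update its modification time.
    with open(path, "w") as f:
        f.write("updated\n")

def check_mtime_updated(path):
    start = time.time()            # 1. current time, as start
    process_file(path)             # 2. processing that updates the file
    end = time.time()              # 3. current time again, as end
    # 4. widen the range to second boundaries, because "some
    #    filesystems can only store modification times to the second"
    start, end = math.floor(start), math.ceil(end)
    # 5. check the mtime falls within [start, end); on Linux this
    #    occasionally fails, with mtime landing just below start
    mtime = os.stat(path).st_mtime
    assert start <= mtime < end, (start, mtime, end)
```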

Perhaps you already know where this is going, but while the test above might work on a strictly POSIX-compliant system (more on that later), it doesn’t work in practice on Linux: it occasionally fails because the modification time set by step 2 turns out to be slightly before the start time we obtained in step 1, albeit only by a few milliseconds!

Exactly why this happens is down to how Linux updates file modification times when a file is written to.

I’m going to dig into this below, though to avoid making this any longer, I’m also going to limit myself a bit and make some simplifying assumptions.

First, a little history

Let’s start our story back in 1979.

While earlier Unixen had creation and modification times, it was V7 Unix that introduced the “atime”, “mtime”, and “ctime” names that later made their way into the POSIX standard. The original[1] POSIX standard defined these as:

time_t    st_atime   Time of last access. 
time_t    st_mtime   Time of last data modification. 
time_t    st_ctime   Time of last status change. 

I think the first two are pretty clear, but ctime is a little confusing: from the “ctime” name, you might have assumed it was “creation time”, but as you can see from above, it’s a “status change” time. Effectively, mtime is the time that the file contents changed, while ctime is the time the file metadata changed (and since the metadata includes the mtime, any update to mtime must also update ctime).
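The distinction is easy to see in practice (a minimal Python sketch; the sleep just steps past the kernel’s coarse-clock tick, discussed later): a permissions change touches only metadata, so it bumps ctime while leaving mtime alone.

```python
import os
import tempfile
import time

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello\n")
path = f.name

before = os.stat(path)
time.sleep(0.05)       # step past at least one coarse-clock tick
os.chmod(path, 0o644)  # metadata-only change: no data written

after = os.stat(path)
assert after.st_mtime_ns == before.st_mtime_ns  # data time untouched
# Status-change time never goes backwards (and is usually strictly
# newer here, unless the filesystem only stores whole seconds).
assert after.st_ctime_ns >= before.st_ctime_ns
os.remove(path)
```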

While POSIX doesn’t include any concept of file creation time, some operating systems and filesystems do. For example, ext2 does, and Linux has a statx() system call that returns an extra btime field with a file’s creation time (“btime” from “birth time”, following prior usage in BSD).

What POSIX says about how file times are updated is actually quite interesting. Rather than just specifying that modification times are updated when a file is written to, the various times are instead “marked for update” whenever certain specific operations complete, so (e.g.) a successful write() will mark mtime and ctime as to-be-updated.

Implementations are then free to actually run that update later on (“At an update point in time, any marked fields shall be set to the current time”, says POSIX), subject to certain operations that must trigger an update (calling stat(), for instance).
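The model POSIX describes can be sketched as a toy (an illustration of the standard’s wording only, not how any real kernel is structured):

```python
import time

class Inode:
    """Toy model of POSIX 'marked for update' file times."""

    def __init__(self):
        self.atime = self.mtime = self.ctime = time.time()
        self.marked = set()

    def write(self):
        # A successful write() marks mtime and ctime for update...
        self.marked |= {"mtime", "ctime"}

    def update_point(self):
        # ...and at some later "update point", any marked fields are
        # set to the current time.
        now = time.time()
        for field in self.marked:
            setattr(self, field, now)
        self.marked.clear()

    def stat(self):
        # stat() is one of the operations that must force an update
        # before returning the times.
        self.update_point()
        return {"atime": self.atime, "mtime": self.mtime,
                "ctime": self.ctime}
```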

All that seems okay for our test, though? While we might not see exactly the right modification time, we should still see a time that sits within the range we’re expecting? Except we don’t.

As far as I can tell, the reason for that is that Linux doesn’t strictly follow POSIX here, though I must admit that it’s not completely clear: POSIX does leave quite a lot of room for ambiguity anyway[2].

What times can filesystems actually store?

Before we talk about Linux in general, it’s probably worth talking about the difference between granularity (what the filesystem can store) and accuracy (how close the stored time is to real time), since that’s something that’s confused me in the past.

Different filesystems support different granularities for the times that they can store. To pick a few examples: ext4 can generally store times to the nanosecond, older ext2-style layouts only to the whole second, and FAT uses a different granularity for each kind of time (all demonstrated below).

To be extremely specific about ext2/3/4 for a moment: the on-disk representation supports nanosecond resolution (and dates after 2038) if the filesystem was created with an inode size of 256 bytes rather than 128 bytes (as shown in tune2fs -l output).

If you have an ext2/3/4 filesystem, it’s almost certainly using 256-byte inodes already: they became the default in 2008 when creating a filesystem of 512MiB or larger, and nowadays are used by default regardless of size. Modern Linux kernels[3] support whatever the filesystem supports.

Incidentally, this does make me wonder whether I actually had access to an older filesystem when I was writing the test I mentioned at the start of this post, or whether I’d just misunderstood why the times I was seeing in my test didn’t match up with what I expected[4].

So if some filesystems only support certain granularities, what happens if we try to set a finer-grained value ourselves? In this case, the kernel will truncate the value to what the filesystem supports first (see s_time_gran in the kernel source, which is how each filesystem declares its granularity). This truncation (rather than round-to-nearest, say) is also what POSIX requires.
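We can watch the truncation from userspace (a Python sketch that assumes nothing about the filesystem in use): ask for an arbitrary nanosecond timestamp and read back whatever was actually stored. Because the kernel truncates rather than rounds, the stored value can be earlier than what we asked for, but never later.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

# An arbitrary timestamp with a full nanosecond component
# (2023-11-14T22:13:20.123456789Z, as nanoseconds since the epoch).
requested = (1_700_000_000_123_456_789,   # atime
             1_700_000_000_123_456_789)   # mtime
os.utime(path, ns=requested)

st = os.stat(path)
# The kernel truncated the value to the filesystem's declared
# granularity (s_time_gran), so it can only have moved earlier.
assert st.st_mtime_ns <= requested[1]
print(st.st_mtime_ns)
os.remove(path)
```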

We can see this behaviour easily by creating a file with a fixed modification time on a few different filesystems.

First, ext4. As mentioned above, this supports nanosecond granularity, so the value we specify is used directly for the access and modification times (while the current time is recorded for the status change and file creation times):

$ touch foo -d '1985-10-26T01:21:03,123456789'; stat foo
[...]
Access: 1985-10-26 01:21:03.123456789 -0700
Modify: 1985-10-26 01:21:03.123456789 -0700
Change: 2025-11-27 02:11:45.779834694 -0800
 Birth: 2025-11-27 02:11:45.779834694 -0800

If we explicitly create the ext4 filesystem with 128-byte inodes, we can see that we only have times stored to seconds resolution (and that we no longer have a creation time):

Access: 1985-10-26 01:21:03.000000000 -0700
Modify: 1985-10-26 01:21:03.000000000 -0700
Change: 2025-11-27 02:16:14.000000000 -0800
 Birth: -

While for something like FAT, we get this mix of granularities:

Access: 1985-10-25 17:00:00.000000000 -0700
Modify: 1985-10-26 01:21:02.000000000 -0700
Change: 1985-10-26 01:21:02.000000000 -0700
 Birth: 2025-11-27 02:17:57.490000000 -0800

On FAT filesystems, access times use day granularity (the value above is midnight UTC for the time I gave), modification times use two-second granularity (FAT doesn’t store a separate ctime, so ctime is always equal to mtime), while creation times use 10ms granularity.

Back to ext4: if we create a file with the current time, we can also see the weird behaviour I mentioned right at the start:

$ date --iso-8601=ns; touch foo; stat foo
2025-11-27T02:22:03,520075379-08:00
  File: foo
[...]
Access: 2025-11-27 02:22:03.518535805 -0800
Modify: 2025-11-27 02:22:03.518535805 -0800
Change: 2025-11-27 02:22:03.518535805 -0800
 Birth: 2025-11-27 02:22:03.518535805 -0800

The recorded times are 1.5 milliseconds before the time printed by the preceding date command! What’s going on?

What’s the time, Mister Linux?

So what about accuracy? How does Linux choose what mtime value to write when a file is updated, and why does it seem like time is going backwards here?

On the face of it, this seems like an odd question: why wouldn’t the kernel just set the modification time equal to the current time every time a file is written? I think that’s what most people (myself included) would have assumed was happening.

The main reason (and probably the only reason, as far as Linux is concerned) is efficiency: files are written to a lot, and recording a new modification time on every write would potentially mean doing a lot of extra work, even if everything stays in cache. Not only that, but merely fetching the current time can be surprisingly expensive[5], especially on very old hardware without access to CPU cycle counters.

Instead, Linux (up until 6.13) maintains a ‘coarse’ current time that updates every timer ‘tick’ (usually once every 4ms or 10ms[6]), and then all writes made (to any file) during that interval will record exactly the same modification time, even if the filesystem could store a finer-grained value.

This is absolutely the explanation for my test failures above: the modification time I’m seeing is the time of the last timer tick, which is before the start time that I captured before writing the file. Having tested this explicitly, it looks like in practice the mtime I see is indeed always older than the real current time by somewhere between 0–4ms plus a constant (~700µs), which is exactly what we’d expect.
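This is easy to observe directly (a Python sketch): write a file and immediately compare its mtime against the fine-grained clock. The recorded time is never ahead of the real time, and typically trails it by up to one tick.

```python
import os
import tempfile
import time

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x")
    path = f.name

# time.time_ns() reads CLOCK_REALTIME, the fine-grained clock that
# date uses; the file's mtime came from the coarse clock at write time.
now_ns = time.time_ns()
mtime_ns = os.stat(path).st_mtime_ns

lag_ns = now_ns - mtime_ns
assert lag_ns >= 0  # the mtime is never ahead of the real time
print(f"mtime lags the fine-grained clock by {lag_ns / 1e6:.3f} ms")
os.remove(path)
```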

Is Linux actually being POSIX-compliant here? Not that it matters too much, but I’d say not? While POSIX allows file time updates to be delayed, I don’t see that it allows the time used during an update to be anything other than the current time at the point the update is run[7], which is the same time printed by date, etc.

In other words, POSIX seems to require that the recorded time is on-or-after the actual time, and Linux records a time that’s on-or-before the actual time.

Multigrain timestamps!

While I was researching this, I ran across a new feature called multigrain timestamps. This isn’t present in the version of Debian that I’m currently using (it was added to Linux 6.13, so it’ll be in Debian 14), but it does change kernel behaviour in an interesting way.

The kernel documentation is pretty easy to read, but in summary this feature watches to see if a file’s timestamp has been observed (with stat(), e.g.) since its times were last updated, and if so, the next write will use a fine-grained timestamp (i.e. the ‘real’ time), if using a coarse-grained timestamp would make it look like the modification time hadn’t changed.

(There’s a small extra wrinkle in that, to maintain ordering, the act of using a fine-grained timestamp for any file also has to drag along the current ‘coarse’ timestamp, otherwise a file that only needs a coarse-grained timestamp could appear to have been updated before an earlier update that needed a fine-grained timestamp.)

It seems like this was primarily intended for NFSv3 exports, where clients use timestamps to see whether a file has changed since they last read it, but I think it should also help in any other cases where something is watching a file for changes.

(It won’t help fix my tests, though, since the decision about whether to use a fine-grained or coarse-grained timestamp is made when the file is written, and the first write to a file will — I assume — always use a coarse-grained timestamp.)

Fixing (‘fixing’) my tests

So how should I fix my flaky tests?

I could have fixed them by replacing “Get the current time” with “Create a temporary file on the same filesystem as the target file and read its mtime” (or alternatively I’m pretty sure I could read CLOCK_REALTIME_COARSE directly, albeit that might not be easy from Python).

If I were to do that, the three times would then be monotonically non-decreasing. In practice, this test takes much less than 4ms to run, so it’s actually extremely likely that all three times would be exactly the same. (This should not be that much of a surprise.)
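That fix would look something like this (a sketch only; `fs_now_ns` is a hypothetical helper, not code from the real tool): sample “now” from the filesystem itself, so that all three times come from the same coarse clock, truncated to the same granularity.

```python
import os
import tempfile

def fs_now_ns(dirpath):
    # Hypothetical helper: sample the filesystem's idea of "now" by
    # creating a throwaway file next to the target and reading back
    # its mtime. This uses the same coarse clock, truncated to the
    # same granularity, as the target file's own timestamps.
    fd, path = tempfile.mkstemp(dir=dirpath)
    try:
        os.close(fd)
        return os.stat(path).st_mtime_ns
    finally:
        os.remove(path)

d = tempfile.mkdtemp()
start = fs_now_ns(d)
target = os.path.join(d, "target.txt")
with open(target, "w") as f:
    f.write("updated\n")
end = fs_now_ns(d)
mtime = os.stat(target).st_mtime_ns

# All three samples come from one clock, so they can never be out of
# order (and will very often be exactly equal).
assert start <= mtime <= end, (start, mtime, end)
```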

But taking a step back, it’s also worth considering why I had this test in the first place.

In this case, the functionality was inherited from an even older incarnation of this tool that relied upon Subversion’s use-commit-times feature to record when each post was last updated, and a desire to have the (HTML) output file have a last-modified timestamp close to that of the (Markdown) source file.

I don’t store posts in Subversion any more, so this whole feature wasn’t really achieving much. By far the most sensible thing to do here is simply to delete the code (and tests) that was playing around with mtime, and just let the kernel pick a reasonable time by itself.

And so that’s how I fixed my flaky test.

References

In the interests of citing my sources: while researching this post, I relied particularly on the Linux source and mailing lists, which are a great resource for finding out why things work the way they do.


  1. I’m quoting from the older (2004) POSIX standard here for simplicity: later versions replace the time_t fields with struct timespec fields with slightly different names, and redefine the existing members as macros. 

  2. Among the other things that POSIX leaves unspecified: whether file time updates must run for all files at once, or whether it’s technically compliant for two files updated one after another to end up with mtimes in the opposite order. (Nobody does that last one in practice — it’d break tools like make that need to see if an input file has changed relative to an output file — but I don’t see how POSIX disallows it.) 

  3. If you happen to be looking at the Linux source code (as I was when writing this post), note that only fs/ext4 supports nanosecond resolution and years past 2038. In contrast, fs/ext2 does not: even though it can use ext2 filesystems with 256-byte inodes, it won’t do anything with the additional data. However, fs/ext4 is almost certainly what any modern kernel will be configured to use for ext2 (and ext3) filesystems, as fs/ext2 is essentially just kept around for reference purposes now. 

  4. That test dates from 2013; at the time I would have been running it on a small Debian 7 box under my desk (whereas now I’m running on a small Debian 13 box under someone else’s cloud), but I don’t think that’s old enough for the filesystem to have been created with 128-byte inodes. It’s just about possible that I was referring to an even-older Slackware box that I might still have been using at the time. 

  5. Applications reading the current time repeatedly is also why nowadays userspace calls like clock_gettime() and gettimeofday() usually don’t involve a kernel syscall (and so won’t appear in strace output); see man vdso for a detailed explanation. In that case the returned time is always correct, though. 

  6. The kernel tick rate is determined by CONFIG_HZ, which on Debian is set to 250, giving an update every 1s/250 = 4 milliseconds. 

  7. I do see the (non-normative) POSIX rationale, which says “The accuracy of the time update values is intentionally left unspecified so that systems can control the bandwidth of a possible covert channel”, which is a) not what I would have guessed for the primary justification! but also b) reads more to me as not setting an upper-bound on the amount of time that can pass between a write and an (unforced) file time update, rather than allowing an arbitrary (and otherwise unmentioned) jitter to be introduced into update time values.