The fastest Linux timestamps · hmpc
We believe that this document is fully human-written.
Hacker News Article AI Analysis
Content Label
Human
AI Generated
0%
Human
100%
Window 1 - Human
TL;DR: We can speed up timestamps on x86 Linux by 30% and
maintain the same precision as the standard system clock by implementing our own timers without
relying on vDSO. Almost nobody should do this.
Table of contents
Timing the timers
The TSC
When syscalls aren’t
Faster monotonic clocks
Making our own vDSO
Measuring tails
Stable timers
Conclusion
Appendix: Methodology
Timing the timersOne of my pet projects at my last job was to introduce distributed tracing to a low-latency pipeline
(think 1–10 microseconds per stage) using OpenTelemetry. As part of this effort I spent
a considerable amount of time designing, implementing, and optimising our own C++ tracing client
library, as the official one has too much
overhead. My goal was for the latency impact per component to stay under 5% so both developers and
users would feel comfortable leaving traces always on in production; this translated to a budget of
about 50–100 ns (a few hundred clock cycles) per span.As you might imagine, at this scale you must carefully consider every aspect of the design and
implementation, from ID generation to serialisation. One of these not-so-small details is how to
timestamp spans. OTLP uses two time
fields,
one each for the start and end of the span as measured by the local wall clock. Although the end
time is an absolute timestamp, it’s expected that it will always be later than the start time, as
its primary purpose is to measure the span duration. The official client handles
this
roughly as:Span::Span(/* ... */)
{
// ...
start_time_ = std::chrono::system_clock::now();
start_steady_time_ = std::chrono::steady_clock::now();
// ...
}
void Span::End(/* ... */)
{
// ...
auto end_steady_time = std::chrono::steady_clock::now();
auto
Window 2 - Human
duration = end_steady_time - start_steady_time_;
end_time_ = start_time_ + duration;
// ...
}
It takes the start time from the real-time clock and uses two timestamps from the monotonic clock to
calculate a nonnegative span duration without interference from discontinuous system clock
adjustments. The end time is a synthetic timestamp obtained by adding the duration to the start
time, rather than directly from any clock.Does querying the system clocks three times per span have any significant performance impact? If
you’re at all familiar with Linux internals, you might expect the answer to be no: after all, in
practically any application using the C library the clock_gettime() syscall (indirectly called by
the now() functions) will be routed through vDSO to avoid context-switching into the kernel.
Let’s do a quick benchmark to confirm:void BM_NaiveBackToBack(benchmark::State& state)
{
for (auto _ : state) {
auto ts = std::chrono::system_clock::now();
auto start = std::chrono::steady_clock::now();
auto duration = std::chrono::steady_clock::now() - start;
benchmark::DoNotOptimize(ts + duration);
}
}
On my laptop1 (see the appendix for details on the setup) this
yields iteration times between 46 and 49 ns—almost our entire time budget for a span, spent just on
timestamping! Clearly this will not suffice.2If we’re to meet our latency constraints, we’ll need to understand how Linux clocks work under the
hood and find out how much weight we can shed. We’ll see what the x86 timestamp counter is and how
it works, do a deep dive into the implementation of vDSO, and use our newly acquired knowledge to
chop over 50% of the timing overhead from our initial attempt. All the benchmarking code is
available. If you’re already familiar with the TSC and
vDSO internals feel free to skip to the good stuff.This post focuses on x86 Linux, although vDSO works largely the same way in other
architectures. I will be using Linux 6.8 as a reference, as that is my current kernel version.
Window 3 - Human
If
you want to replicate these results on more recent kernels, be aware that the layout of the data
page was modified
in version 6.15.The TSCAlmost anyone who has written microarchitecture benchmarks or otherwise needed fast and accurate
timestamps on x86 platforms is well acquainted with the CPU’s timestamp counter, or TSC. Quoting
from Intel’s System Programming
Guide:
The time-stamp counter […] is a 64-bit counter that is set to 0 following a RESET of the
processor. Following a RESET, the counter increments even when the processor is halted by the HLT
instruction or the external STPCLK# pin.
The time stamp counter in newer processors may support an enhancement, referred to as invariant
TSC. […] The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is
the architectural behavior moving forward. On processors with invariant TSC support, the OS may use
the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more
efficient and do not incur the overhead associated with a ring transition or access to a platform
resource.
(By “newer processors” in the second paragraph we should understand any relatively modern CPU,
really. You can confirm whether this is the case on your machine by looking for the constant_tsc
and nonstop_tsc flags in /proc/cpuinfo.)An invariant TSC behaves exactly as you would expect from a clock: it’s fully synchronised across
cores (so it doesn’t matter from which one you read it), runs at a constant rate independent of
frequency scaling3, and doesn’t stop even when the system is idle or suspended. It’s still,
of course, subject to frequency deviations like any other clock, which is one reason why NTP or PTP
synchronisation is important.Although, as the manual suggests, TSC reads are in fact much more efficient than the alternatives,
they’re not free.
Window 4 - Human
The cost of reading the TSC is twofold: the instruction itself is slow (rdtsc
has a reciprocal throughput of 25 core clock
cycles on Skylake) and the instruction
stream must be serialised first, either through an explicit lfence or by using rdtscp4
(32 cycles), so that all preceding instructions execute before we read the counter; otherwise we can
easily introduce errors of 10-30%, depending on the workload.When syscalls aren’tI won’t spend much time explaining what the vDSO itself is, as the man
page does a pretty good job. The long and short
of it is that when a user-space process invokes certain system calls through the C library (most
notably, clock_gettime()), the library directs that call to a small shared library mapped into the
process (the vDSO), which avoids the overhead of switching into ring 0 by reading the required
information from a memory region shared with the kernel, called the vvar or data page.The data page is
placed four
pages before the vDSO mapping itself and
contains two vdso_data
structures (one for the high-resolution clock, subject to frequency adjustments, and one for the raw
clock) at offset
128. We can find the
vDSO code to read from these structures in
lib/vdso/gettimeofday.c.
We’re particularly interested in the do_hres()
function, which is used
for all calls to high-resolution clocks such as CLOCK_REALTIME and CLOCK_MONOTONIC
(corresponding to system_clock and steady_clock, respectively). Here’s a simplified version of
the function assuming an x86 target without time namespaces and an invariant TSC:static int
do_hres(const struct vdso_data *vd, clockid_t clk, struct __kernel_timespec *ts)
{
const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
u64 cycles, last, sec, ns;
u32 seq;
do {
seq = vdso_read_begin(vd);
cycles = __arch_get_hw_counter(vd->clock_mode, vd);
ns = vdso_ts->nsec;
last =
Window 5 - Human
vd->cycle_last;
ns += (cycles - last) * vd->mult;
ns = ns >> vd->shift;
sec = vdso_ts->sec;
} while (unlikely(vdso_read_retry(vd, seq)));
ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
return 0;
}
The way it works is as follows: on every kernel tick, the CPU responsible for updating the timers
does a bunch of math
to advance the system clock and
updates the data page
with the resulting values. This is the kernel’s best estimate of the current time for each clock,
along with the current cycle count of the underlying clock
source (on x86, of
course, the TSC) and
a multiplier/shift pair to efficiently convert cycles into nanoseconds (essentially a fixed-point
representation of the estimated clock period). The whole structure is protected by
a seqlock: the kernel increments the sequence number before
and after each update so readers can recognise when an update is in progress (the sequence number is
odd) or when it occurred during the read (the sequence number changes).Back in userspace, all do_hres() needs to do is load the values from the page, get the current
cycle count from the TSC (through __arch_get_hw_counter(), which uses lfence+rdtsc or rdtscp
if available), convert the difference to a nanosecond offset, and finally fold the excess amount
into the seconds value.Faster monotonic clocksBecause all the heavy lifting is done in the kernel, calls to clock_gettime() through vDSO are
quite efficient for most purposes. But there is still some extraneous work slowing down our use
case:
Even though the only difference between the wall clock and the monotonic clock is the base time,
we need to do two initial calls to get both timestamps, duplicating the critical region, several
loads, the conversion from cycles to nanoseconds, the division into seconds, and above all the
expensive TSC read. (Reading the TSC twice also means we use two slightly different timestamps
for the start of the span.)
Window 6 - Human
Because we only use the monotonic clock to measure a time interval, we could skip the calculation
of timestamps and convert only the cycle difference to nanoseconds using the multiplier and shift
values: delta = ((cycles_end - cycles_start) * mult) >> shift.
Because OpenTelemetry timestamps are in nanoseconds since the epoch, we don’t need to normalise
the seconds field or even calculate it at all. Although this does not
use an actual division
instruction, it still carries a cost.
Let’s see if we can reach a solution that addresses all of these points. To start, we’ll refactor
our initial timing logic into its own class so we can more easily benchmark different approaches:class NaiveTimer
{
public:
using duration = std::chrono::steady_clock::duration;
using time_point = std::chrono::system_clock::time_point;
time_point start() noexcept
{
const auto ts = std::chrono::system_clock::now();
start_ = std::chrono::steady_clock::now();
return ts;
}
duration elapsed() const noexcept
{ return std::chrono::steady_clock::now() - start_; }
private:
std::chrono::steady_clock::time_point start_;
};
As I mentioned earlier, the standard way to measure time intervals with low overhead on x86 is by
directly reading the TSC. We can replace our calls to the monotonic clock as follows5:static inline __attribute__((always_inline)) uint64_t rdtsc()
{
uint32_t id;
return __rdtscp(&id);
}
template <typename Estimate>
class TscTimer
{
public:
using duration = std::chrono::duration<int64_t, std::nano>;
using time_point = std::chrono::system_clock::time_point;
time_point start() noexcept
{
const auto ts =