The fastest Linux timestamps · hmpc

H hmpcabral.com ↗

▲ 71 points • 18 comments • by hmpc • 4w ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

4 %

AI likelihood · overall

Human

100% human-written 0% AI-generated

SEGMENTS · HUMAN 6 of 6

SEGMENTS · AI 0 of 6

WORD COUNT 1,719

PEAK AI % 5% · §6

Analyzed

Apr 27

backend: pangram/v3.3

Segments scanned

6 windows

avg 287 words each

Distribution

100 / 0%

human / AI fraction

Verdict

Human

Pangram v3.3

Article text · 1,719 words · 6 segments analyzed

Human AI-generated

§1 Human · 3%

TL;DR: We can speed up timestamps on x86 Linux by 30% and maintain the same precision as the standard system clock by implementing our own timers without relying on vDSO. Almost nobody should do this. Table of contents

Timing the timers The TSC When syscalls aren’t Faster monotonic clocks Making our own vDSO Measuring tails Stable timers Conclusion Appendix: Methodology

Timing the timersOne of my pet projects at my last job was to introduce distributed tracing to a low-latency pipeline (think 1–10 microseconds per stage) using OpenTelemetry. As part of this effort I spent a considerable amount of time designing, implementing, and optimising our own C++ tracing client library, as the official one has too much overhead. My goal was for the latency impact per component to stay under 5% so both developers and users would feel comfortable leaving traces always on in production; this translated to a budget of about 50–100 ns (a few hundred clock cycles) per span.As you might imagine, at this scale you must carefully consider every aspect of the design and implementation, from ID generation to serialisation. One of these not-so-small details is how to timestamp spans. OTLP uses two time fields, one each for the start and end of the span as measured by the local wall clock. Although the end time is an absolute timestamp, it’s expected that it will always be later than the start time, as its primary purpose is to measure the span duration. The official client handles this roughly as:Span::Span(/* ... */) { // ... start_time_ = std::chrono::system_clock::now(); start_steady_time_ = std::chrono::steady_clock::now(); // ... }

void Span::End(/* ... */) { // ... auto end_steady_time = std::chrono::steady_clock::now(); auto

§2 Human · 1%

duration = end_steady_time - start_steady_time_; end_time_ = start_time_ + duration; // ... } It takes the start time from the real-time clock and uses two timestamps from the monotonic clock to calculate a nonnegative span duration without interference from discontinuous system clock adjustments. The end time is a synthetic timestamp obtained by adding the duration to the start time, rather than directly from any clock.Does querying the system clocks three times per span have any significant performance impact? If you’re at all familiar with Linux internals, you might expect the answer to be no: after all, in practically any application using the C library the clock_gettime() syscall (indirectly called by the now() functions) will be routed through vDSO to avoid context-switching into the kernel. Let’s do a quick benchmark to confirm:void BM_NaiveBackToBack(benchmark::State& state) { for (auto _ : state) { auto ts = std::chrono::system_clock::now(); auto start = std::chrono::steady_clock::now(); auto duration = std::chrono::steady_clock::now() - start;

benchmark::DoNotOptimize(ts + duration); } } On my laptop1 (see the appendix for details on the setup) this yields iteration times between 46 and 49 ns—almost our entire time budget for a span, spent just on timestamping! Clearly this will not suffice.2If we’re to meet our latency constraints, we’ll need to understand how Linux clocks work under the hood and find out how much weight we can shed. We’ll see what the x86 timestamp counter is and how it works, do a deep dive into the implementation of vDSO, and use our newly acquired knowledge to chop over 50% of the timing overhead from our initial attempt. All the benchmarking code is available. If you’re already familiar with the TSC and vDSO internals feel free to skip to the good stuff.This post focuses on x86 Linux, although vDSO works largely the same way in other architectures. I will be using Linux 6.8 as a reference, as that is my current kernel version.

§3 Human · 1%

If you want to replicate these results on more recent kernels, be aware that the layout of the data page was modified in version 6.15.The TSCAlmost anyone who has written microarchitecture benchmarks or otherwise needed fast and accurate timestamps on x86 platforms is well acquainted with the CPU’s timestamp counter, or TSC. Quoting from Intel’s System Programming Guide: The time-stamp counter […] is a 64-bit counter that is set to 0 following a RESET of the processor. Following a RESET, the counter increments even when the processor is halted by the HLT instruction or the external STPCLK# pin.

The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. […] The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource. (By “newer processors” in the second paragraph we should understand any relatively modern CPU, really. You can confirm whether this is the case on your machine by looking for the constant_tsc and nonstop_tsc flags in /proc/cpuinfo.)An invariant TSC behaves exactly as you would expect from a clock: it’s fully synchronised across cores (so it doesn’t matter from which one you read it), runs at a constant rate independent of frequency scaling3, and doesn’t stop even when the system is idle or suspended. It’s still, of course, subject to frequency deviations like any other clock, which is one reason why NTP or PTP synchronisation is important.Although, as the manual suggests, TSC reads are in fact much more efficient than the alternatives, they’re not free.

§4 Human · 1%

The cost of reading the TSC is twofold: the instruction itself is slow (rdtsc has a reciprocal throughput of 25 core clock cycles on Skylake) and the instruction stream must be serialised first, either through an explicit lfence or by using rdtscp4 (32 cycles), so that all preceding instructions execute before we read the counter; otherwise we can easily introduce errors of 10-30%, depending on the workload.When syscalls aren’tI won’t spend much time explaining what the vDSO itself is, as the man page does a pretty good job. The long and short of it is that when a user-space process invokes certain system calls through the C library (most notably, clock_gettime()), the library directs that call to a small shared library mapped into the process (the vDSO), which avoids the overhead of switching into ring 0 by reading the required information from a memory region shared with the kernel, called the vvar or data page.The data page is placed four pages before the vDSO mapping itself and contains two vdso_data structures (one for the high-resolution clock, subject to frequency adjustments, and one for the raw clock) at offset 128. We can find the vDSO code to read from these structures in lib/vdso/gettimeofday.c. We’re particularly interested in the do_hres() function, which is used for all calls to high-resolution clocks such as CLOCK_REALTIME and CLOCK_MONOTONIC (corresponding to system_clock and steady_clock, respectively). Here’s a simplified version of the function assuming an x86 target without time namespaces and an invariant TSC:static int do_hres(const struct vdso_data *vd, clockid_t clk, struct __kernel_timespec *ts) { const struct vdso_timestamp *vdso_ts = &vd->basetime[clk]; u64 cycles, last, sec, ns; u32 seq;

do { seq = vdso_read_begin(vd); cycles = __arch_get_hw_counter(vd->clock_mode, vd); ns = vdso_ts->nsec; last =

§5 Human · 2%

vd->cycle_last; ns += (cycles - last) * vd->mult; ns = ns >> vd->shift; sec = vdso_ts->sec; } while (unlikely(vdso_read_retry(vd, seq)));

ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); ts->tv_nsec = ns;

return 0; } The way it works is as follows: on every kernel tick, the CPU responsible for updating the timers does a bunch of math to advance the system clock and updates the data page with the resulting values. This is the kernel’s best estimate of the current time for each clock, along with the current cycle count of the underlying clock source (on x86, of course, the TSC) and a multiplier/shift pair to efficiently convert cycles into nanoseconds (essentially a fixed-point representation of the estimated clock period). The whole structure is protected by a seqlock: the kernel increments the sequence number before and after each update so readers can recognise when an update is in progress (the sequence number is odd) or when it occurred during the read (the sequence number changes).Back in userspace, all do_hres() needs to do is load the values from the page, get the current cycle count from the TSC (through __arch_get_hw_counter(), which uses lfence+rdtsc or rdtscp if available), convert the difference to a nanosecond offset, and finally fold the excess amount into the seconds value.Faster monotonic clocksBecause all the heavy lifting is done in the kernel, calls to clock_gettime() through vDSO are quite efficient for most purposes. But there is still some extraneous work slowing down our use case: Even though the only difference between the wall clock and the monotonic clock is the base time, we need to do two initial calls to get both timestamps, duplicating the critical region, several loads, the conversion from cycles to nanoseconds, the division into seconds, and above all the expensive TSC read. (Reading the TSC twice also means we use two slightly different timestamps for the start of the span.)

§6 Human · 5%

Because we only use the monotonic clock to measure a time interval, we could skip the calculation of timestamps and convert only the cycle difference to nanoseconds using the multiplier and shift values: delta = ((cycles_end - cycles_start) * mult) >> shift. Because OpenTelemetry timestamps are in nanoseconds since the epoch, we don’t need to normalise the seconds field or even calculate it at all. Although this does not use an actual division instruction, it still carries a cost. Let’s see if we can reach a solution that addresses all of these points. To start, we’ll refactor our initial timing logic into its own class so we can more easily benchmark different approaches:class NaiveTimer { public: using duration = std::chrono::steady_clock::duration; using time_point = std::chrono::system_clock::time_point;

time_point start() noexcept { const auto ts = std::chrono::system_clock::now(); start_ = std::chrono::steady_clock::now(); return ts; }

duration elapsed() const noexcept { return std::chrono::steady_clock::now() - start_; }

private: std::chrono::steady_clock::time_point start_; }; As I mentioned earlier, the standard way to measure time intervals with low overhead on x86 is by directly reading the TSC. We can replace our calls to the monotonic clock as follows5:static inline __attribute__((always_inline)) uint64_t rdtsc() { uint32_t id; return __rdtscp(&id); }

template <typename Estimate> class TscTimer { public: using duration = std::chrono::duration<int64_t, std::nano>; using time_point = std::chrono::system_clock::time_point;

time_point start() noexcept { const auto ts =