Linux latency measurements and compositor tuning | farnoy.dev




Ever since moving back from Windows, I’ve been paranoid about latency in games on Linux. Slight changes to the environment or settings can, all of a sudden, make the mouse feel very floaty. There have been many community discussions on this topic and I’m certainly not alone in this.

To investigate, I used a small Teensy microcontroller to measure click-to-photon latency. It acts as a USB HID mouse and is paired with a light sensor pressed against the screen. I flashed it with an existing Open Source LDAT sketch, with slight modifications. The resulting setup can log hundreds of samples to a CSV file, unattended.

Measurements were done on two computers: a desktop and a laptop. They each have an Ada-generation RTX card and a Zen 4 CPU. I used virtually the same NixOS config on both, along with an up-to-date Windows 11 install on each. They were connected to the same display for most of the tests: an LG C1 at 120 Hz over HDMI. I have Radeon GPUs lying around, and plan on testing gamescope on them in particular, but that’s going to have to wait until the next batch of tests.

App settings were selected to avoid hardware bottlenecks. My goal was to easily hit 120 FPS on a 120 Hz output and test for any queueing effects in the software stack. I used KDE Wayland 6.6.4, Proton-GE 10-33, MangoHud 0.8.2 for FPS limiting (using the late method), Nvidia 595.58.03. Originally, I had meant to compare with X11 sessions as well, but with KDE removing them soon, I dropped it. On Windows, I used either the Nvidia control panel or RTSS for frame-rate limiting, interchangeably.

Despite the automated nature of the tool, launching & cataloging the runs still ends up being a lot of work. Controlling all the variables is a major pain, and I often discovered new things partway through the testing, which invalidated prior measurements. A few examples of odd behaviors are:

LG webOS toggling Black Frame Insertion when you connect a different computer on the same port.

Using KDE Konsole to mark the start of my test run initiates big wl_shm presentation surfaces, which take a long time to copy over PCIe to GPU VRAM. This trains the compositor to be extra pessimistic about timings that just spiked.

Switching V-Sync modes in specific games not applying the change immediately.

Synthetic tests

As a quick validation run and an easy test of display settings, I built my own latency testing tool. It’s just a black square that goes white immediately when you click on it; perfect for the tool to react to. I added a configurable delay to simulate input processing. The test was performed on a clean Chromium profile with nothing except for out-of-the-box defaults.

How to read the charts

Each chart varies one parameter, shown on the Y axis. Latency is horizontal: represented as milliseconds on the bottom X axis, or in the number of frames at 120 Hz on the top. Charts may have multiple facets and those partition another variable - one facet per value. Within each facet, each Y value can have multiple horizontal boxplots - one per combination of other tunables. Same color across Y rows = only the Y parameter changes between the boxes. You can hover each box, or check the legend, to see which settings it corresponds to. Each bar is an IQR boxplot over many samples, with min-max whiskers and a vertical line representing the median.
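For anyone who wants to recompute the numbers behind a box from a raw log, here is a minimal Python sketch. It assumes a CSV with a column named latency_ms, which is a hypothetical layout and not necessarily what the LDAT sketch writes, and the function names are made up for illustration. It derives the min, quartiles, median and max that each box encodes, and expresses the median in frames at a given refresh rate.

```python
import csv
from statistics import quantiles


def load_latencies(csv_file, column="latency_ms"):
    """Read one latency value (ms) per row from an open CSV file.

    The column name is an assumption; adjust it to the actual log format.
    """
    return [float(row[column]) for row in csv.DictReader(csv_file)]


def summarize(samples_ms, refresh_hz=120.0):
    """Return the values a min-max-whisker IQR boxplot is drawn from."""
    # method="inclusive" treats the samples as the whole population;
    # charting libraries may interpolate quartiles slightly differently.
    q1, med, q3 = quantiles(samples_ms, n=4, method="inclusive")
    frame_ms = 1000.0 / refresh_hz  # one refresh interval, 8.33 ms at 120 Hz
    return {
        "min": min(samples_ms),
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": max(samples_ms),
        "median_frames": med / frame_ms,
    }
```

Calling summarize(load_latencies(open("run.csv"))) on a log (the file name is a placeholder) gives one box per run; the median_frames field corresponds to the top X axis of the charts.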

This looks exactly as expected - the medians and minimums are shifted roughly by the amount of delay. Except, why is my desktop computer slower than the laptop? They are running the same versions of software from a reused NixOS config, on very similar hardware too. I expected them to match, or for the desktop to have a slight edge, if anything. To minimize the differences further, I created a brand new user account on the desktop and ran the test again:

There it was, something about my desktop profile was introducing at least 3 ms of latency! From here, I tried a bunch of things: plasma-manager to diff my existing profile against a clean one, removing all virtual desktops and disabling all KWin effects and any display scaling. While randomly closing apps, I found the culprit: the Zed editor. Apparently, an open Zed window can add latency to all my other apps even while idle in the background. Thankfully, this does not affect fullscreen games.

I identified this issue after measuring everything else, so I’m glad this finding didn’t invalidate my existing in-game measurements. More on this later, in the KWin section.

LG display settings

Next up, I tested various settings on the TV.

Setting the input mode to PC (which locks out a bunch of picture settings) made no impact, while Black Frame Insertion seemed to add exactly one frame’s worth of delay. This one really hurts because I love using it. It seems like their implementation adds extra buffering, even though it could be done with a lagless rolling scan.

HDR had a tiny but measurable effect.

I planned on testing HDMI Auto Low Latency Mode - displays are supposed to apply their low-latency settings when the source requests it. When I was daily-driving a Radeon GPU on Windows, I remember the driver applied ALLM unconditionally in all fullscreen apps. I had to resort to faking EDIDs to stop it from seeing the mode as supported. Linux doesn’t seem to support ALLM, and I didn’t find an option on Nvidia’s Windows driver either.

Game tests

I hoped to find a game that supported all three major graphics APIs so I could compare between them. There are a handful of those, typically based on Unreal Engine, but they describe all but one of the APIs as experimental. I ended up with three games tested, each with a reproducible measurement setup. Comparing across games is pointless, since they’re all going to have different animation timings. Instead, the focus is on the different tunables available for each API.

Doom Eternal (Vulkan)

This one was easy to set up - just load a dark level (so any of them) with infinite ammo and observe the heavy cannon’s muzzle flash against a dark wall. The game uses Vulkan, so it doesn’t need a translation layer on Linux. I couldn’t get it to run directly on Wayland, despite this exact issue having been fixed last year. Starting off, the only difference between platforms is the wider tail (at p75) on Linux:

If we don’t cap FPS below the refresh rate, the game starts buffering frames when V-Sync is enabled. That latency can be recovered by disabling V-Sync, as seen in the next chart. This still doesn’t produce frame tearing because the game is running through XWayland.
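To build some intuition for why an uncapped, V-Synced game accumulates latency, here is a toy Python model of a FIFO swapchain. Everything in it is an assumption made for illustration: the fixed render time, the queue depth, the way the limiter delays the start of the next frame, and the function name. It is not how any particular driver or compositor behaves, and its output is not a measurement.

```python
from collections import deque
from statistics import median


def simulate_latency_ms(refresh_hz, render_ms, queue_depth, fps_cap=None, n_frames=600):
    """Toy FIFO swapchain: input-to-scanout latency (ms) for each displayed frame."""
    period = 1000.0 / refresh_hz
    min_interval = 1000.0 / fps_cap if fps_cap else 0.0
    queue = deque()            # input-sample timestamps of frames waiting for scanout
    latencies = []
    next_vblank = period
    now = 0.0                  # time at which the game thread is free again
    last_start = float("-inf")

    def scan_out_until(t):
        # The display takes the oldest queued frame at every vblank up to time t.
        nonlocal next_vblank
        while next_vblank <= t:
            if queue:
                latencies.append(next_vblank - queue.popleft())
            next_vblank += period

    for _ in range(n_frames):
        # Input is sampled when the frame starts; the limiter delays that start.
        start = max(now, last_start + min_interval)
        last_start = start
        done = start + render_ms
        scan_out_until(done)
        if len(queue) >= queue_depth:
            # Back-pressure: the game blocks until a vblank frees a slot.
            done = next_vblank
            scan_out_until(done)
        queue.append(start)
        now = done
    return latencies


uncapped = simulate_latency_ms(120, render_ms=2.0, queue_depth=2)
capped = simulate_latency_ms(120, render_ms=2.0, queue_depth=2, fps_cap=118)
print(f"uncapped median: {median(uncapped):.1f} ms, capped median: {median(capped):.1f} ms")
```

In this model the uncapped game keeps the queue full, so every frame waits behind the ones already queued, while a cap slightly below the refresh rate lets the queue drain and each frame only waits for the next vblank. Real swapchains are more involved, so treat this only as an illustration of the queueing effect.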


VRR by itself isn’t a significant factor:

Neither are Nvidia’s Windows-exclusive settings that I tested:

Borderlands 3 (DX11, DX12)

I modded my save game to remove the magazine attachment on a weapon so it doesn’t drain any ammo. This makes it ideal for looping 500 muzzle flash measurements back to back. Windows had consistently lower latency, sometimes significantly so when V-Sync was used:

Going with native Proton Wayland (PROTON_ENABLE_WAYLAND=1, shown as proton_wayland in the charts) can claw back the latency in these cases:

DX12 is consistently slower across both operating systems. There might be some Unreal Engine hack that improves it, like the OneFrameThreadLag CVar, but I did not test any.

I tried various platform- and API-specific switches:

Nvidia’s Ultra Low Latency Mode on Windows

VKD3D_SWAPCHAIN_LATENCY_FRAMES=1 and VKD3D_SWAPCHAIN_IMAGES=2 for DX12

DXVK_CONFIG="d3d9.maxFrameLatency=1;dxgi.maxFrameLatency=1" for DX11

The only one that had an impact was VKD3D_SWAPCHAIN_LATENCY_FRAMES=1, but even then, DX12 was still lagging significantly behind DX11:

Capping FPS and not letting frames queue up at the refresh rate mark makes the biggest difference:

Hades 2 (DX12)

This game had a longer wind-up in the animation, but it was consistent. Measurements showed similar behavior as in prior tests. The following settings help, all things being equal, but are not necessarily cumulative:

Capping at/below refresh rate.

Using wine_wayland/PROTON_ENABLE_WAYLAND=1.

Setting VKD3D_SWAPCHAIN_LATENCY_FRAMES=1 - though with wine_wayland at a fixed refresh rate, this capped frame rate at half of refresh.

Summary and recommendations

My takeaway is to prioritize wine_wayland, use late FPS limiting, VKD3D_SWAPCHAIN_LATENCY_FRAMES=1 in DX12 games, and VRR if the game can’t hit a stable target or has bad frame pacing.
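The recommendations above can be collected into a small launcher helper. This is a rough Python sketch, not a tested recipe: PROTON_ENABLE_WAYLAND and VKD3D_SWAPCHAIN_LATENCY_FRAMES come from the tests in this post, while the MANGOHUD and MANGOHUD_CONFIG values (fps_limit, fps_limit_method) are my assumption about how to request late limiting and should be checked against the MangoHud documentation for your version. The function name and its parameters are hypothetical.

```python
def build_launch_env(api, fps_cap, base_env=None):
    """Return environment variables for a Proton game, keyed on its graphics API.

    api     -- "dx11" or "dx12" (the two translated APIs covered by the tests)
    fps_cap -- frame-rate limit, chosen at or below the display refresh rate
    """
    api = api.lower()
    if api not in ("dx11", "dx12"):
        raise ValueError(f"unsupported api: {api!r}")

    env = dict(base_env or {})
    # Native Wayland path in Proton (shown as proton_wayland in the charts).
    env["PROTON_ENABLE_WAYLAND"] = "1"
    # Assumed MangoHud settings for a late FPS limiter; verify before relying on them.
    env["MANGOHUD"] = "1"
    env["MANGOHUD_CONFIG"] = f"fps_limit={int(fps_cap)},fps_limit_method=late"
    if api == "dx12":
        # Only helped in DX12 titles. Combined with wine_wayland at a fixed
        # refresh rate, it capped the frame rate at half of refresh in Hades 2.
        env["VKD3D_SWAPCHAIN_LATENCY_FRAMES"] = "1"
    return env
```

The returned dictionary can be merged into os.environ and passed as the env argument of subprocess.run when starting the game, or copied by hand into the launch options of whatever launcher you use.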

I really hoped to push the V-Sync, non-VRR results lower, but I don’t see how to get there. It definitely seems like having XWayland in the mix breaks some signaling, which lets the swapchain buffer more when FPS is at refresh rate and V-Synced.

Gaming over the network

All of these experiments were done with Borderlands 3 (DX11); results here can be compared against local tests shown earlier. This was a direct 2.5GbE network between the two hosts, with an RTT of ~0.3 ms. Egress delay was added with:

# tc qdisc replace dev $DEVICE root netem delay ${DELAY}ms

Typical networks have symmetrical delay, but I wanted to confirm which specific direction impacts latency, so I tested asymmetric scenarios too. First up, USB/IP is used to send input between the computers, but display output is captured directly from the host running the game:

The results look exactly as expected. With no injected network delay, it comes out exactly as fast as local results. USB/IP should be a good solution if you want to put your hardware somewhere in the basement. You could save a bunch of money on active USB cables or fancy docks this way? Then, I tested routing input with Moonlight, but still capturing the direct video output of the host, not the encoded stream:

This confirms that Moonlight matches USB over IP in sending input over the network. There were no meaningful cross-platform differences here, either. Adding egress delay on the Sunshine host made no difference - Moonlight keeps sending fresh input without stalling for acknowledgement. Next, I tested the typical round trip of click —> Moonlight —> Sunshine —> Moonlight —> display. This is where I encountered a recent regression in kernel 7.0 which manifested as the video stream never starting, but there was a simple workaround.
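Since the network tests repeat the same netem command with different delays on each side, it can help to generate the commands from a scenario description. The Python sketch below only builds the argument lists for the tc invocation quoted earlier; the function names, the host labels and the interface name are placeholders, and actually running the commands (as root, on the right machine) is left to the reader.

```python
def netem_delay_cmd(device, delay_ms):
    """Build the tc argv that sets a fixed egress delay on one interface."""
    if delay_ms < 0:
        raise ValueError("delay must be non-negative")
    return ["tc", "qdisc", "replace", "dev", device,
            "root", "netem", "delay", f"{delay_ms}ms"]


def scenario_cmds(device_by_host, delay_ms_by_host):
    """Map each host label to the command that applies its egress delay.

    Giving the two hosts different delays produces an asymmetric scenario;
    equal delays produce the symmetric case.
    """
    return {
        host: netem_delay_cmd(device_by_host[host], delay_ms)
        for host, delay_ms in delay_ms_by_host.items()
    }


# Example: delay only the packets leaving the client (labels are hypothetical).
cmds = scenario_cmds(
    {"client": "eth0", "game_host": "eth0"},
    {"client": 10, "game_host": 0},
)
for host, argv in cmds.items():
    print(host, " ".join(argv))
```

Because netem on the root qdisc only affects egress, each direction of the link has to be configured on the machine that sends in that direction, which is what makes the asymmetric tests possible.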

Finally, I compared across platforms by running Moonlight on Windows as well, keeping Sunshine on the Linux host in these scenarios:

Windows delivered a slightly more responsive experience overall. Some of the gap may be explained by the next section on KWin. But I can’t explain the impact of network delay - it looks like Windows was time travelling in this test. The long tail seen in experiments that used Moonlight only for relaying input is surprising as well.

KWin deep dive

Why is KWin slower than DWM on Windows?