How Cloudflare responded to the “Copy Fail” Linux vulnerability

blog.cloudflare.com

2026-05-07 · 8 min read

On April 29, 2026, a Linux kernel local privilege escalation vulnerability was publicly disclosed under the name "Copy Fail" (CVE-2026-31431). Cloudflare’s Security and Engineering teams began assessing the vulnerability as soon as it was disclosed. We reviewed the exploit technique, evaluated exposure across our infrastructure, and validated that our existing behavioral detections could identify the exploit pattern within minutes. There was no impact to the Cloudflare environment, no customer data was at risk, and no services were disrupted at any point. Read on to learn how our preparedness paid off.

Background

Our Linux kernel release process

Cloudflare operates a global Linux server infrastructure at an immense scale, with datacenters located across 330 cities. We maintain a custom Linux kernel build based on the community's Long-Term Support (LTS) versions to manage updates effectively at this volume. At any given time, we may utilize multiple LTS versions from various series, such as 6.12 or 6.18, which benefit from extended update periods.

The community regularly merges and releases security and stability updates which trigger an automated job to generate a new internal kernel build approximately every week. These builds undergo testing in our staging data centers to ensure stability before a global rollout. Following a successful release, the Edge Reboot Release (ERR) pipeline manages a systematic update and reboot of the edge infrastructure on a four-week cycle. Our control plane infrastructure typically adopts the most recent kernel, with reboots scheduled according to specific workload requirements.

By the time a CVE becomes public knowledge, the necessary fix has typically been integrated into stable Linux LTS releases for several weeks. Our established procedures ensure that we have already deployed these patches.

At the time of the "Copy Fail" disclosure, the majority of our infrastructure was running the 6.12 LTS version, while a subset of machines had begun transitioning to the newer 6.18 LTS release.

About the Copy Fail vulnerability

It helps to understand the vulnerability before getting to the response story. A comprehensive write-up can be found in the original Xint Code disclosure post.

AF_ALG and the kernel crypto API

The Linux kernel's internal crypto API manages functions like kTLS and IPsec.

Userspace programs access this via the AF_ALG socket family, allowing unprivileged processes to request encryption or decryption. The algif_aead module facilitates this for Authenticated Encryption with Associated Data (AEAD) ciphers.

An unprivileged program follows these steps:

1. Opens an AF_ALG socket and binds to an AEAD template.
2. Sets a key and accepts a request socket.
3. Submits input via sendmsg() or splice().
4. Executes the operation using recvmsg().

The splice() system call is critical here, as it moves data by passing page cache references instead of copying the data.

Memory mechanics: page cache and in-place crypto

The page cache is a shared system cache for file contents. Modifying a page belonging to a setuid binary effectively edits that program for all users until the page is evicted.

The crypto API uses scatterlists, which are structures linking various memory pages. In 2017, algif_aead was optimized for in-place operations, chaining destination and reference pages together. This design lacked enforcement to prevent algorithms from writing past intended boundaries.

The vulnerability: out-of-bounds write

When the program calls recvmsg(), the authencesn wrapper in the kernel performs a 4-byte write past the legitimate output region:

    scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);

By using splice(), an attacker can chain a target file's page cache pages to the scatterlist. The out-of-bounds write then taints the cached file. The attacker controls three things:

- File: Any readable file.
- Offset: Tunable via assoclen and splice parameters.
- Value: Controlled via AAD bytes 4-7 in sendmsg().

The exploit, step by step

The default exploit targets /usr/bin/su, a setuid-root binary present on essentially every distribution.

- Cache Reference: Open /usr/bin/su as O_RDONLY and read() to populate the page cache. Use splice() on the file descriptor to pass these page cache references into the crypto scatterlist.
- Setup: Create an AF_ALG socket, bind() to authencesn(hmac(sha256),cbc(aes)), set a key, and accept a request socket. None of these steps requires privileges.

- Write Construction: For each 4-byte shellcode chunk:
  - sendmsg() with AAD bytes 4-7 containing the shellcode.
  - splice() the binary into a pipe and then into the AF_ALG socket so that assoclen + cryptlen targets the desired .text offset.
- Trigger: recvmsg() initiates decryption. authencesn writes its scratch data to the target offset of /usr/bin/su in the page cache. Although the function returns -EBADMSG, the 4-byte write is now in the global page cache.
- Execution: Running execve("/usr/bin/su") loads the tainted page cache. Since the binary is setuid-root, the injected shellcode executes with root privileges.

The upstream fix (commit a664bf3d603d) reverts the 2017 in-place optimization, which closes the exploit path.

How we responded

When the vulnerability was disclosed, many workstreams started in parallel:

- Mapping the blast radius: Our security team worked with kernel engineers to determine which kernel versions were vulnerable and assess the potential exposure.
- Validating coverage: Security reviewed the exploit technique and confirmed that our existing behavioral detections could identify the exploit pattern during authorized internal validation.
- Proactive threat hunting: Security began searching our fleet-wide logs for signs that the vulnerability had been exploited before it was publicly known, going back 48 hours before disclosure.
- Engineering a mitigation: Kernel engineers began building a runtime mitigation that would protect the fleet without breaking production services.
- Continuing software updates: Our engineering teams worked on delivering an updated Linux kernel, which required carefully rebooting and rolling it out across our servers.

There was no customer impact at any point during this response.

Validating detection coverage

One of the first things our security team did was confirm that our existing endpoint detection would catch this exploit. Our servers run behavioral detection that continuously monitors process execution patterns. It doesn't rely on knowing about specific vulnerabilities; it watches for anomalous behavior across the fleet.

When our engineers validated the vulnerability internally as part of the response, the detection platform flagged it within minutes. The system linked the entire execution chain (starting at the script interpreter, moving through the kernel’s cryptographic subsystem, and ending at the privilege escalation binary) and flagged it as malicious based on fleet-wide behavioral patterns.

This happened without a signature update, without a rule change, and without human intervention. Our behavioral detection coverage existed before we wrote any custom logic for this particular Copy Fail exploit.

The confirmation was important because it meant we had coverage before writing a vulnerability-specific rule.

Hunting for exploitation

While our engineering team moved to a more targeted mitigation, our security investigation had been running since disclosure. This is our standard procedure for any critical vulnerability.

Our security team operates on a simple principle for critical vulnerabilities: assume compromise until you can prove otherwise. The investigation started from the assumption that exploitation could have occurred before the vulnerability was public, and we worked systematically to either confirm or rule it out.

The exploit leaves a distinctive trace in kernel logs when it runs. We searched for that trace across our centralized logging infrastructure, covering 48 hours before the vulnerability was publicly disclosed. If someone had exploited this before the world knew about it, we would have seen it.

We pulled access logs for affected systems and reconstructed who connected, when, and what commands they ran. This gave us a complete forensic picture of interactive activity on potentially affected infrastructure.

We checked that system binaries had not been tampered with, validated cryptographic hashes against known-good package manifests, looked for persistence mechanisms, and audited network connections for anything unusual. Everything was clean.

Incident timeline and impact

| Time (UTC) | Event |
| --- | --- |
| 2026-04-29 16:00 | Copy Fail publicly disclosed. |
| 2026-04-29 ~21:00 | Security and Engineering teams began assessing fleet exposure and mitigation options before full declaration of the Incident Response process. |
| 2026-04-29 22:52 | Security confirmed existing behavioral detection covered the Copy Fail exploit pattern. During authorized internal validation, detection flagged the activity within minutes. |
| 2026-04-29 23:01 | Existing behavioral detection generated a high-severity alert for exploit-like activity, confirming detection coverage for the technique. |
| 2026-04-29 (evening) | First mitigation attempt pushed to our staging datacenter. The deployment process surfaced a dependency conflict; the mitigation was rolled back. No production systems were affected. |
| 2026-04-29 (overnight) | Engineering drafted bpf-lsm mitigation program. |
| 2026-04-30 03:14 | Security incident declared to drive cross-functional collaboration and urgency. Security performed fleetwide threat hunting of historical data to confirm that no malicious activity was present on Cloudflare systems. |
| 2026-04-30 (morning) | Engineering tested the bpf-lsm mitigation program and made it production-ready. |
| 2026-04-30 14:25 | Engineering incident declared to coordinate mitigation program and Linux patch rollout. |
| 2026-04-30 ~17:00 | Decision made: ship a patched build of the previous LTS line through reboot automation; do not accelerate the new LTS; lean on bpf-lsm in the meantime. |
| 2026-04-30 (afternoon) | Visibility pipeline (eBPF tracing of AF_ALG socket usage) deployed fleet-wide. Gives a complete picture of all legitimate AF_ALG users. |
| 2026-04-30 (evening) | bpf-lsm mitigation program rolled out behind a separate gate to fully mitigate the fleet. End-to-end verification on a previously-vulnerable test node confirms the exploit no longer works. |
| 2026-05-04 (morning) | Reboot automation resumed at normal pace with the patched kernel. |
| 2026-05-04 onward | Servers that had already passed through reboot automation earlier in the week manually rebooted to pick up the patched kernel. Unpatched servers update per our normal reboot automation. |
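As a rough aid to reading the timeline, the sketch below computes how long after public disclosure each precisely timestamped milestone occurred. The milestone labels are shorthand chosen here for illustration, and approximate entries such as "~21:00" or "(evening)" are deliberately left out.

```python
from datetime import datetime, timezone

# Exact UTC timestamps taken from the timeline table; labels are illustrative shorthand.
DISCLOSURE = "2026-04-29 16:00"
MILESTONES = {
    "detection coverage confirmed": "2026-04-29 22:52",
    "high-severity alert generated": "2026-04-29 23:01",
    "security incident declared": "2026-04-30 03:14",
    "engineering incident declared": "2026-04-30 14:25",
}


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def elapsed_since_disclosure(stamp: str) -> str:
    """Return the time between disclosure and `stamp`, formatted as e.g. '6h52m'."""
    delta = parse_utc(stamp) - parse_utc(DISCLOSURE)
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


if __name__ == "__main__":
    for label, stamp in MILESTONES.items():
        print(f"{label}: +{elapsed_since_disclosure(stamp)}")
```

Read this way, detection coverage was confirmed a little under seven hours after disclosure, and the engineering incident was declared within the first day.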

This graph shows the progress of our mitigation program as it rolled out across our infrastructure.

How did we mitigate it?

Because of the long timeframe involved in deploying a patched Linux kernel, we also pursued mitigating this exploit without a reboot.

Removing the module

The bug was in the algif_aead kernel module. The simple fix, therefore, was to remove the module and prevent it from being reloaded.
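To make "remove the module and keep it from coming back" concrete, here is a minimal sketch, assuming algif_aead is built as a loadable module rather than compiled into the kernel. The helper names are hypothetical and this is not Cloudflare's tooling: it only parses /proc/modules-style text to see whether the module is loaded, and renders the standard modprobe.d `blacklist` and `install` directives that stop it from loading again.

```python
from pathlib import Path

MODULE = "algif_aead"


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return module names from /proc/modules-formatted text (name is the first field)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def modprobe_block_rules(module: str) -> str:
    """Render modprobe.d directives that prevent `module` from being loaded.

    `blacklist` stops alias-based autoloading; `install ... /bin/false`
    makes explicit modprobe requests for the module fail as well.
    """
    return f"blacklist {module}\ninstall {module} /bin/false\n"


def is_loaded(module: str, proc_modules_path: str = "/proc/modules") -> bool:
    """Check the running system; returns False where /proc/modules does not exist."""
    path = Path(proc_modules_path)
    if not path.exists():
        return False
    return module in loaded_modules(path.read_text())


if __name__ == "__main__":
    print(f"{MODULE} loaded: {is_loaded(MODULE)}")
    print(modprobe_block_rules(MODULE), end="")
```

The rendered rules would typically go in a file under /etc/modprobe.d/, and an already loaded module would still need to be unloaded separately. Anything else on the host that depends on the module would be affected too.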