Zero-Copy Pages in Rust: Or How I Learned To Stop Worrying And Love Lifetimes

R redixhumayun.github.io ↗

▲ 80 points • 7 comments • by ingve • 3mo ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

4 %

AI likelihood · overall

Human

100% human-written 0% AI-generated

SEGMENTS · HUMAN 6 of 6

SEGMENTS · AI 0 of 6

WORD COUNT 1,458

PEAK AI % 6% · §5

Analyzed

Apr 20

backend: pangram/v3.3

Segments scanned

6 windows

avg 243 words each

Distribution

100 / 0%

human / AI fraction

Verdict

Human

Pangram v3.3

Article text · 1,458 words · 6 segments analyzed

Human AI-generated

§1 Human · 0%

You can find the source code for the project hereZero-copy is a way to elide CPU copies between the kernel and user space buffers that is particularly useful in high throughput applications like database engines. It makes a huge difference in performance under high load, particularly when your working set is no longer cache resident.What Is Zero-CopyHere is what a typical database engine looks like. For this post, focus on two copy boundaries: the OS boundary, and the path from the buffer pool into the layers above it. ┌─────────────────────────────────────────────────────────┐ │ Query Layer │ └────────────────────────┬────────────────────────────────┘ │ ┌────────────────────────▼────────────────────────────────┐ │ Execution Engine │ └────────────────────────┬────────────────────────────────┘ │ ┌────────────────────────▼────────────────────────────────┐ │ Transaction Manager │ └──────────┬─────────────┴─────────────────┬──────────────┘ │ │ ┌──────────▼──────────┐ ┌────────────▼────────────┐ │ Lock Manager │ │ Log Manager │ └─────────────────────┘ └─────────────────────────┘ │ fresh copies into higher layers ┌────────────────────────▼────────────────────────────────┐ │ Buffer Pool │ └────────────────────────┬────────────────────────────────┘ │ copy at OS boundary ┌────────────────────────▼────────────────────────────────┐ │ Disk │ └─────────────────────────────────────────────────────────┘ Trying to build a high performance engine requires eliding any non-useful work as far as possible and copying data falls squarely in this category.

§2 Human · 0%

Think of each copy operation as an equivalent of memcpy()memcpy can actually cause pipeline stalls which is something you want to avoid in high perf applications. which requires the CPU to copy data from a source and put it into a destination. You’re spending cycles on non-essential work and this can cause eviction of hot data from CPU caches.Here’s a great example of the lifecycle that a typical read or write operation goes through. All those CPU copies are useless work that are burning cycles. Image taken from https://www.linuxjournal.com/article/6345 Now, let’s focus on eliminating copies at the layer between the buffer pool and disk first.The Buffer Pool And Direct IO⊕Here’s a famous tirade from Linus Torvalds on the direct IO interface. He doesn’t like database developers.The buffer pool opens and stores file descriptors with the open() syscall. When we call read() and write() on those file descriptors it goes through the whole cycle you saw earlier with copies between userspace, kernel and DMA.An easy win here is to use direct IO with the O_DIRECT flagA large number of modern databases use this approach, although there are holdouts like Postgres.. This will force the application to bypass the OS page cacheThis assumes 64-bit DMA capable hardware. On systems with 32-bit DMA devices or confidential computing VMs (AMD SEV, Intel TDX), the kernel may silently introduce a SWIOTLB bounce buffer, reintroducing a CPU copy. See the Linux kernel docs on swiotlb..O_DIRECT requires that the buffers submitted are pointer aligned, along with I/O length and file offset. In Rust, we guarantee the former with #[repr(align(4096))] on the buffer holding our page, and 4 KiB page-sized reads and writes at page-aligned offsets satisfy the rest.

§3 Human · 1%

Without this, O_DIRECT reads or writes would often fail with EINVALHere’s a gist showing this in C — the first program uses malloc (not 4096-aligned) and the write fails, the second uses posix_memalign and succeeds..Since we’re bypassing the kernel page cache we don’t get useful boosts like readahead or write coalescing but this is exactly why a buffer pool is so important in a database.The buffer pool is a replacement for the OS page cache designed with specific workloads in mind. It’s always helpful to think of this in terms of mechanism + policy.Mechanism: A fixed size page table which serves page requests for the layers above and evicts some pages to make room for others.Policy: A way to decide which page to evict (eviction policy) ┌─────────────────────────────────────────────────────────┐ │ Query Layer │ │ │ │ TableScan / IndexScan / BTreeScan │ │ (iterates rows, calls next(), get_value()) │ └────────────────────────┬────────────────────────────────┘ │ uses ┌────────────────────────▼────────────────────────────────┐ │ Transaction │ │ │ │ tx_id | ConcurrencyManager | RecoveryManager │ └────────────────────────┬────────────────────────────────┘ │ pin() / unpin() ┌────────────────────────▼────────────────────────────────┐ │ Buffer Pool │ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Frame 0 │ │ Frame 1 │ │ Frame

§4 Human · 4%

2 │ ... │ │ │ │ │ │ │ │ │ │ │ file:3 │ │ file:7 │ │ empty │ │ │ │ pins: 1 │ │ pins: 0 │ │ pins: 0 │ │ │ │ │ │ │ │ │ │ │ │ 4KB data │ │ 4KB data │ │ │ │ │ └──────────┘ └──────────┘ └──────────┘ │ │ │ │ policy: LRU | CLOCK | SIEVE │ └────────────────────────┬────────────────────────────────┘ │ │ ┌────────────────────────▼────────────────────────────────┐ │ Disk │ │ │ │ [ file:1 ] [ file:3 ] [ file:7 ] [ file:9 ] │ └─────────────────────────────────────────────────────────┘ Choosing the right policy for your system depends on the characteristics of your workload but most systems typically go with CLOCK which is an LRU approximation. Here’s a non-exhaustive list of replacement policies used in buffer pools.O_DIRECT removes the copy at the OS boundary. The next problem is avoiding fresh copies once the page is already in the buffer pool.Eliminating Copies From The Read PathSo far, zero-copy has meant removing copies between the kernel and the buffer pool. From here on, I’m going to broaden it slightly to mean removing redundant copies inside the engine too.Rust has a great and terrible way to avoid dealing with copies of data - references. It’s great because it’s a single character (&), it’s terrible because now we have to learn to deal with lifetimes.The simplest way to think about lifetimes is that you are proving to the compiler that any reference held by type A will not outlive the data it points to.

§5 Human · 6%

Let’s start with defining the raw bytes for a single page like thispub struct PageBytes { bytes: [u8; PAGE_SIZE_BYTES as usize], } Now, we’ll define the data that is held within a single buffer pool frame. The RwLock<T> type here is our page latch.#[derive(Debug)] pub struct BufferFrame { page: RwLock<PageBytes>, } Now, we’re going to store this frame’s data inside a PageReadGuard.// To keep the example small, the next type is schematic rather than literal. // I'm using it to show the ownership tradeoff, not the exact implementation // details of `RwLockReadGuard`. /// Read guard providing shared access to a pinned page. pub struct PageReadGuard { page: PageBytes, } This version is simple to model, but it bakes copying into the design. If every higher-level page object owns its own PageBytes, then constructing those objects from buffer-pool storage means materializing fresh owned values.What we actually want is not ownership, but a borrowed view into bytes that already live somewhere else. We can model that by introducing a lifetime.pub struct PageReadGuard<'a> { page: &'a PageBytes, } With this lifetime annotation, we are proving to the compiler that PageReadGuard will not outlive PageBytes, which means higher-level page objects can become views into existing bytes rather than owned copies.In the real implementation, the field is RwLockReadGuard<'a, PageBytes> rather than &'a PageBytes, but the ownership story is the same: the guard borrows the page bytes instead of owning them, and our wrapper carries that borrow forward.pub struct PageReadGuard<'a> { page: RwLockReadGuard<'a, PageBytes> } Typical database engines have two core page types - heap and btree pages. So let’s focus on the former. The structure of the page is going to be a standard slotted page layout.At this point, the bytes are already borrowed through PageReadGuard<'a>. Now, the question is where the guard should live and where the parsed references into the page should live.

§6 Human · 1%

The most natural thing to try is to keep everything in one struct.struct HeapPage<'a> { guard: PageReadGuard<'a>, header: &'a [u8], line_pointers: &'a [u8], record_space: &'a [u8], } This leads us to the classic self-referential struct issue in Rust, which makes pointer invalidation very hard. Imagine for a moment, you have a struct and it has two fields - A and B, with B pointing to A. Now, your struct A moves. What does B point to? It’s going to continue pointing to where A was but that’s invalid and would lead to UB. There are ways around this in Rust with Pin, unsafe raw pointers, Arc pointers and external crates like ouroboros. But, all of these have overhead associated with them.So, the next thing to try is to separate ownership of the guard from the parsed view into the page.pub struct HeapPage<'a> { guard: PageReadGuard<'a>, }

pub struct HeapPageView<'a> { header: &'a [u8], line_pointers: &'a [u8], record_space: &'a [u8], layout: &'a Layout, }

let page = HeapPage::new(guard); let view = HeapPageView::new(&'a page, layout); // borrows from page and page stays alive on stack This works but runs into an issue when we want to mutate the bytes. Imagine that we want to insert a record into the heap page. This requires mutating record_space and also mutating line_pointers. Now, the bounds of both might have changed which means we have stale references in our struct. We need to drop the struct and re-create it again. While it is cheap, it’s a leaky abstraction.