How we made our OCI filesystem 47× faster

microsandbox.dev

▲ 63 points • 38 comments • by appcypher • 2mo ago • HN discussion ↗

Pangram verdict (v3.3): We believe that this document is fully AI-generated.
AI likelihood, overall: 94% (0% human-written, 100% AI-generated).
Segments: 5 windows scanned, averaging 329 words each; 0 of 5 human, 5 of 5 AI; peak AI 100%.
Article text: 1,647 words. Analyzed May 23.


A user in our Discord said microsandbox felt slow. Listing every file in the Python standard library took 5.3 seconds inside a sandbox; in Docker it took milliseconds. We went digging. We fixed it in v0.4: we replaced our user-space filesystem with a Linux disk image that the VM mounts directly. The geometric mean speedup across our mixed guest-visible filesystem suite is 47×, with the worst-case rows more than 1,000× faster, and the host filesystem code is about 5,300 lines shorter.

My first try was monofs: a content-addressed filesystem with block-level dedup, compression, and distributed read replicas. It stored images at 1.3× their original size on disk, and microsandbox is local-first, so the long-tail dedup payoff wasn't worth the up-front cost. For v0.3 I switched to OCI plus a user-space overlay built on a libkrun hook; we got layer dedup and identical behavior on Linux and macOS, but everything still ran outside the kernel.

Every file operation inside the VM had to bounce out to the host through FUSE, which is Linux's mechanism for letting an ordinary program act as a filesystem. To open a file, the VM hands the request to our host process, which walks every layer looking for the file and sends the answer back; the same trip happens for every stat, every readdir, and every cache miss. A single Python import triggers dozens of these round trips before your code even starts running, and a ten-layer image multiplies the cost of each one. We spent the next stretch of v0.3 trying to make that path faster: better caching, fewer syscalls, smaller responses. Each change shaved a few percent. None of them changed the order of magnitude. Docker doesn't have this problem because Docker uses the kernel's own layered-filesystem driver (overlayfs), so file operations never leave the kernel. We were trying to match a kernel filesystem from outside the kernel; no cache could close that gap. So we deleted the filesystem.
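As a rough illustration of why layer count multiplies the cost, the layer walk can be modeled as a top-down search that probes each layer until it finds the path. This is a minimal sketch, assuming each layer is an in-memory map; `Layer`, `lookup_walk`, and the probe counter are hypothetical names, not microsandbox's code, and the real v0.3 path also paid a VM/host round trip per operation.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for one OCI layer: path -> file bytes.
type Layer = HashMap<String, Vec<u8>>;

/// Walk layers from the top (last) down to the base (first). Returns the
/// first match together with the number of layers that had to be probed.
fn lookup_walk<'a>(layers: &'a [Layer], path: &str) -> (Option<&'a [u8]>, usize) {
    let mut probes = 0;
    for layer in layers.iter().rev() {
        probes += 1;
        if let Some(bytes) = layer.get(path) {
            return (Some(bytes.as_slice()), probes);
        }
    }
    (None, probes)
}

fn main() {
    // Ten layers; only the base layer contains the file we ask for.
    let mut layers: Vec<Layer> = vec![Layer::new(); 10];
    layers[0].insert("/usr/lib/python3/os.py".to_string(), b"import sys".to_vec());

    let (hit, probes) = lookup_walk(&layers, "/usr/lib/python3/os.py");
    println!("found={} probes={}", hit.is_some(), probes);

    // A miss is the worst case: every layer is probed before giving up.
    let (miss, probes) = lookup_walk(&layers, "/no/such/file");
    println!("found={} probes={}", miss.is_some(), probes);
}
```

In this model a file that lives in the base layer, and any path that does not exist at all, costs one probe per layer, which is the pattern a caching layer can soften but not remove.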

The new plan was to stop bouncing every file operation between the VM and the host. We'd build a Linux filesystem image ahead of time, hand it to the VM as a virtual disk, and let the VM's own kernel mount it.


With FUSE out of the loop, file operations inside the VM would stay inside the VM.

[Figure: request path before and after. Before: app, guest VFS, virtiofs / FUSE boundary, host filesystem code, layer lookup / overlay logic, response back into the VM. After: app, guest VFS, guest overlayfs, guest EROFS, virtio-blk, cached block-backed image. Caption: Before, every lookup crossed the VM/host boundary. After, normal reads and lookups stay inside the guest kernel.]

The filesystem we picked is EROFS: read-only, in-tree since the kernel needed it for Android, and easy to author. EROFS also solved the macOS problem: the VM's own kernel is Linux regardless of what's running outside it, so once the disk image is built, the host's filesystem stops mattering.

microsandbox runs on both Linux and macOS, and macOS lacks the host-side tools you'd normally use to build a filesystem image: no mkfs.ext4, no mkfs.erofs, no loopback mounts. If our image pipeline depended on any of them, we'd either have to ship a helper VM (heavy, slow to start) or live with a permanent split between platforms, and neither option fit microsandbox's "single self-contained binary" promise. So we wrote the image writers ourselves in Rust. A filesystem is a byte layout on disk; the writers just produce that layout. Three small pieces do the work:

- An EROFS writer that emits the read-only image of an OCI layer.
- An ext4 writer that emits the sparse, journaled scratch area each sandbox gets.
- A VMDK descriptor that stitches everything into one virtual disk.

Nothing in the pipeline shells out, asks for root, or mounts a loopback device, and the same Rust code path builds the images on Linux and Apple Silicon without depending on host-only filesystem tools. The EROFS artifacts round-trip through a reader we also wrote, and CI boots the full stack under the real VM kernel. If a byte is wrong, two different readers tell us about it.
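The "byte layout" idea can be shown with the one part of an EROFS image that is easy to state: the superblock sits at byte offset 1024 and starts with a little-endian magic number, 0xE0F5E1E2, as defined in the Linux kernel's EROFS on-disk format header. The sketch below writes only that magic into an otherwise zeroed buffer and checks it with a separate reader, a miniature version of the writer/reader round trip. The function names are hypothetical, and a real writer also has to emit the rest of the superblock, the inodes, and the data.

```rust
/// Byte offset of the EROFS superblock within the image.
const EROFS_SUPER_OFFSET: usize = 1024;
/// EROFS superblock magic, stored little-endian on disk.
const EROFS_SUPER_MAGIC_V1: u32 = 0xE0F5_E1E2;

/// Hypothetical toy writer: a zeroed image whose only populated field is the magic.
fn write_toy_image(size: usize) -> Vec<u8> {
    assert!(size >= EROFS_SUPER_OFFSET + 4, "image too small for a superblock");
    let mut image = vec![0u8; size];
    image[EROFS_SUPER_OFFSET..EROFS_SUPER_OFFSET + 4]
        .copy_from_slice(&EROFS_SUPER_MAGIC_V1.to_le_bytes());
    image
}

/// Independent toy reader: checks the magic without sharing code with the writer.
fn has_erofs_magic(image: &[u8]) -> bool {
    match image.get(EROFS_SUPER_OFFSET..EROFS_SUPER_OFFSET + 4) {
        Some(b) => u32::from_le_bytes([b[0], b[1], b[2], b[3]]) == EROFS_SUPER_MAGIC_V1,
        None => false, // image is too short to contain a superblock
    }
}

fn main() {
    let image = write_toy_image(4096);
    println!("magic present: {}", has_erofs_magic(&image));
    println!("all-zero buffer: {}", has_erofs_magic(&vec![0u8; 4096]));
}
```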

The obvious way to use these writers was one EROFS image per OCI layer. The VM would get one virtual disk per layer plus one for the scratch area, and the kernel's overlayfs would merge them at boot. It worked: the first measurements landed between 10× and 175× faster than v0.3 depending on the workload, and we were ready to ship.


[Figure: First cut. layer 1 on /dev/vda, layer 2 on /dev/vdb, layer 3 on /dev/vdc, ..., layer 30 on /dev/vd?. Caption: One EROFS image per OCI layer. Python images attached ~10 disks; some custom builds pushed past the microVM's virtio device cap.]

Then we counted the layers. A Python image runs around ten; CUDA images more; some user-built ones push thirty or forty. microVMs cap how many devices they can carry, and we were attaching one disk per layer. We raised the cap, but the real fix was to stop using virtual disks to tell the VM "this image has layers" when the filesystem could carry that information itself.

The EROFS folks pointed us at a feature we hadn't been using: EROFS can build a metadata-only image, just the merged directory tree plus a pointer per file saying which underlying blob holds its bytes and at what offset. The kernel reads that image, treats the whole bundle as one virtual disk, and answers every lookup with a single calculation instead of a search across layers. The pipeline becomes:

1. Pull the OCI layers as usual.
2. Build one small metadata image describing the merged tree.
3. Hand the VM one virtual disk that stitches the metadata and the layer blobs together.
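The merge step in this pipeline can be sketched as folding the layers, bottom to top, into one map from path to "which blob, what offset". This is a simplified model rather than the EROFS metadata format: `Extent`, `LayerIndex`, and `merge_layers` are hypothetical names, directories and hardlinks are ignored, and the only OCI rules applied are the `.wh.<name>` whiteout and the `.wh..wh..opq` opaque marker from the OCI image spec.

```rust
use std::collections::BTreeMap;

/// Where a file's bytes live: which layer blob, and at what offset and length.
#[derive(Debug, Clone, PartialEq)]
struct Extent { layer: usize, offset: u64, len: u64 }

/// Hypothetical per-layer listing: (absolute path, offset in blob, length).
type LayerIndex = Vec<(String, u64, u64)>;

/// Fold layers (base first) into one merged tree, applying OCI whiteouts.
fn merge_layers(layers: &[LayerIndex]) -> BTreeMap<String, Extent> {
    let mut merged: BTreeMap<String, Extent> = BTreeMap::new();
    for (layer_no, entries) in layers.iter().enumerate() {
        // Pass 1: whiteouts in this layer hide entries from the layers below it.
        for (path, _, _) in entries {
            let (dir, name) = path.rsplit_once('/').unwrap_or(("", path.as_str()));
            if name == ".wh..wh..opq" {
                // Opaque marker: drop everything the lower layers put in this directory.
                let prefix = format!("{dir}/");
                merged.retain(|p, _| !p.starts_with(&prefix));
            } else if let Some(hidden) = name.strip_prefix(".wh.") {
                // Plain whiteout: drop the named entry and anything beneath it.
                let target = format!("{dir}/{hidden}");
                let prefix = format!("{target}/");
                merged.retain(|p, _| *p != target && !p.starts_with(&prefix));
            }
        }
        // Pass 2: regular entries from this layer override the same path below.
        for (path, offset, len) in entries {
            let name = path.rsplit('/').next().unwrap_or(path.as_str());
            if !name.starts_with(".wh.") {
                merged.insert(path.clone(), Extent { layer: layer_no, offset: *offset, len: *len });
            }
        }
    }
    merged
}

fn main() {
    let layers: Vec<LayerIndex> = vec![
        vec![("/etc/motd".to_string(), 0, 12), ("/etc/old.conf".to_string(), 12, 40)],
        vec![("/etc/.wh.old.conf".to_string(), 0, 0), ("/etc/motd".to_string(), 0, 20)],
    ];
    let merged = merge_layers(&layers);
    // After the merge, a lookup is one map access, however many layers there were.
    println!("{:?}", merged.get("/etc/motd"));
    println!("{:?}", merged.get("/etc/old.conf"));
}
```

The layer walk happens once, inside `merge_layers`; every later lookup touches only the merged map.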

The VM now only has to attach two rootfs block devices, no matter how many layers the original image had: one read-only VMDK-backed stack for the image (which internally references the merged-metadata image plus the per-layer EROFS extents), and one writable ext4 upper for the sandbox. Overlayfs only ever combines those two. This is the version we shipped, with a small libkrunfw kernel config tweak (CONFIG_EROFS_FS_XATTR + CONFIG_EROFS_FS_SECURITY) so EROFS exposes the xattrs overlayfs needs for whiteouts.
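To make "one virtual disk over several files" concrete, here is a rough sketch of a text VMDK descriptor that concatenates flat extents, plus the arithmetic that maps a disk offset back to a file. The article does not show the descriptor microsandbox emits, so the header values, the `twoGbMaxExtentFlat` create type, the `RDONLY` access mode, and the file names are all assumptions for illustration; `vmdk_descriptor` and `locate` are hypothetical helpers.

```rust
/// VMDK extent sizes are expressed in 512-byte sectors.
const SECTOR: u64 = 512;

/// Emit a descriptor listing each (file name, size in bytes) as a flat,
/// read-only extent. Field values are illustrative assumptions.
fn vmdk_descriptor(extents: &[(&str, u64)]) -> String {
    let mut out = String::new();
    out.push_str("# Disk DescriptorFile\n");
    out.push_str("version=1\n");
    out.push_str("CID=fffffffe\n");
    out.push_str("parentCID=ffffffff\n");
    out.push_str("createType=\"twoGbMaxExtentFlat\"\n\n");
    out.push_str("# Extent description\n");
    for (file, bytes) in extents {
        assert!(*bytes % SECTOR == 0, "extent must be a whole number of sectors");
        let sectors = *bytes / SECTOR;
        out.push_str(&format!("RDONLY {sectors} FLAT \"{file}\" 0\n"));
    }
    out
}

/// Map a byte offset on the stitched disk to (extent index, offset in that file).
fn locate(extents: &[(&str, u64)], mut offset: u64) -> Option<(usize, u64)> {
    for (i, (_, len)) in extents.iter().enumerate() {
        if offset < *len {
            return Some((i, offset));
        }
        offset -= *len;
    }
    None // past the end of the virtual disk
}

fn main() {
    let extents = [("fsmeta.erofs", 4096), ("layer0.erofs", 8192)];
    print!("{}", vmdk_descriptor(&extents));
    println!("{:?}", locate(&extents, 5000));
}
```

The point of the sketch is the shape: the guest sees one contiguous block device, and the descriptor alone decides which backing file serves each sector range.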

At pull time, the host materializes each OCI layer into an EROFS artifact keyed by its diff ID, merges the layer metadata with provenance, writes fsmeta.erofs, and emits a VMDK descriptor over fsmeta.erofs plus the layer extents. At sandbox create time, microsandbox creates a sparse upper.ext4 for that sandbox.
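Creating a sparse file is cheap because the file is extended to its full size without any data being written. A minimal sketch using only the Rust standard library follows; it produces an empty sparse file, not a formatted ext4 image, and `create_sparse`, the file name, and the 4 GiB size are illustrative assumptions. Whether the holes are actually left unallocated depends on the host filesystem.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Create (or truncate) a file and extend it to `size` bytes without writing data.
/// On filesystems that support sparse files, the extended range occupies no blocks.
fn create_sparse(path: &Path, size: u64) -> io::Result<()> {
    let file = File::create(path)?;
    file.set_len(size)?;
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical scratch path and size, for illustration only.
    let path = std::env::temp_dir().join("upper-demo.img");
    create_sparse(&path, 4 * 1024 * 1024 * 1024)?;
    let meta = std::fs::metadata(&path)?;
    println!("apparent size: {} bytes", meta.len());
    std::fs::remove_file(&path)?;
    Ok(())
}
```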


At boot, the guest sees /dev/vda for the read-only lower stack and /dev/vdb for the writable upper, and Linux overlayfs assembles /.

[Figure: Image stack: OCI layers, per-layer EROFS artifacts, merged metadata + provenance, fsmeta.erofs, VMDK descriptor over fsmeta + layer extents, /dev/vda (read-only lower). Sandbox upper: sparse upper.ext4, /dev/vdb (writable upper/work). Both feed guest overlayfs, mounted at /. Caption: Two block devices per sandbox boot, regardless of how many layers the original image declared.]

The host pays the layer-walking cost once at pull time; the guest pays nothing for it at runtime.
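The boot-time assembly corresponds to three mounts: the EROFS lower, the ext4 upper, and an overlay that combines them. The sketch below prints them as shell commands for readability. The `lowerdir`, `upperdir`, and `workdir` options are standard overlayfs mount options, and the upper and work directories must sit on the same filesystem; the mount points `/lower`, `/rw`, and `/newroot` are assumptions, since the article does not describe the guest init's actual paths.

```rust
/// Build the option string for an overlayfs mount with a single lower directory.
/// overlayfs requires `upperdir` and `workdir` to be on the same filesystem.
fn overlay_options(lower: &str, upper: &str, work: &str) -> String {
    format!("lowerdir={lower},upperdir={upper},workdir={work}")
}

/// Hypothetical guest init steps. Mount points are illustrative assumptions.
fn boot_mounts() -> Vec<String> {
    let opts = overlay_options("/lower", "/rw/upper", "/rw/work");
    vec![
        "mount -t erofs -o ro /dev/vda /lower".to_string(),
        "mount -t ext4 /dev/vdb /rw".to_string(),
        format!("mount -t overlay overlay -o {opts} /newroot"),
    ]
}

fn main() {
    for cmd in boot_mounts() {
        println!("{cmd}");
    }
}
```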

We ran the same benchmark suite three times against both versions on a python image, with fresh state between runs. Across fourteen mixed guest-visible filesystem workloads, the geometric mean speedup is 47.18×, and the eight biggest movers are below. The bars fall into two groups:

- Rootfs path: the cleanest measure of the new OCI path; these operations now stay inside the guest kernel instead of bouncing through the host.
- /tmp tmpfs: real guest-visible wins, but from cutting out the FUSE round-trip on guest tmpfs workloads rather than from the new EROFS lower-rootfs path.

[Chart: speedup per workload, log scale from 1× to 1,000×. v0.3.14 baseline = 1×. Higher is better. Bars are grouped as Rootfs path or /tmp tmpfs.]

Workload                 Speedup
file_delete_1k           1109.94×
rename_1k                876.58×
small_file_create_1k     240.78×
metadata_scan_stdlib     240.28×
read_all_py_stdlib       116.40×
deep_tree_traverse       47.16×
concurrent_read_4t       20.93×
random_read_stdlib       4.01×

metadata_scan_stdlib scans the metadata of every file in the Python standard library. It used to take half a second. It now takes about 2 milliseconds.
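A geometric mean averages ratios in log space, so one 1,000× row cannot dominate the way it would in an arithmetic mean. The sketch below computes it for the eight speedups listed above and, from the stated 47.18× over fourteen workloads, derives what the six unlisted workloads must average. Only the eight numbers and the 47.18× figure come from the article; the function names are hypothetical.

```rust
/// Geometric mean: exp of the arithmetic mean of the natural logs.
fn geometric_mean(values: &[f64]) -> f64 {
    let log_sum: f64 = values.iter().map(|v| v.ln()).sum();
    (log_sum / values.len() as f64).exp()
}

/// Given the overall geometric mean of `n_total` values and some of those
/// values, return the geometric mean the remaining values must have.
fn implied_rest(overall: f64, n_total: usize, listed: &[f64]) -> f64 {
    let listed_log_sum: f64 = listed.iter().map(|v| v.ln()).sum();
    let rest_log_sum = n_total as f64 * overall.ln() - listed_log_sum;
    (rest_log_sum / (n_total - listed.len()) as f64).exp()
}

fn main() {
    let top_eight = [1109.94, 876.58, 240.78, 240.28, 116.40, 47.16, 20.93, 4.01];
    println!("geometric mean of the eight listed rows: {:.1}x", geometric_mean(&top_eight));
    println!(
        "implied geometric mean of the six unlisted rows: {:.1}x",
        implied_rest(47.18, 14, &top_eight)
    );
}
```

The eight listed rows alone average around 113×, so the 47.18× headline implies the six smaller movers average around 15×.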

Linux's overlayfs is a large spec, covering whiteouts, opaque directories, hardlinks across copy-up, directory renames, and a handful of xattr conventions that all have to behave exactly right.


Our v0.3 reimplemented most of it in user space, and we were still chasing edge cases the day we deleted it. v0.4 doesn't reimplement any of it, because the VM's own kernel does the merging, and the bugs we used to have aren't fixed; they're gone.

The host still has to understand OCI layer semantics, but only once, at pull time. Whiteouts, opaque directories, hardlinks, xattrs, and case-sensitive paths get normalized into the merged metadata tree before fsmeta.erofs is written. After that, the runtime path is ordinary kernel EROFS plus overlayfs.

macOS's APFS is case-insensitive by default. Plenty of Linux images contain files whose names differ only by case, and extracting them onto a Mac used to collapse the second into the first. v0.4 never extracts to the host filesystem; the EROFS writer streams the tar straight into a binary image where both names live as distinct entries on disk.
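The case-collision problem is easy to reproduce in a toy model: key the same tar entries once by a case-folded name, as a case-insensitive host effectively does, and once by the exact bytes, as an image writer can. Lowercasing is only a rough stand-in for APFS's actual comparison rules, and `distinct_names` and the sample paths are hypothetical.

```rust
use std::collections::BTreeSet;

/// Count how many distinct entries survive when paths are used as keys.
/// `case_insensitive = true` roughly models extracting onto a case-insensitive host.
fn distinct_names(paths: &[&str], case_insensitive: bool) -> usize {
    let keys: BTreeSet<String> = paths
        .iter()
        .map(|p| if case_insensitive { p.to_lowercase() } else { p.to_string() })
        .collect();
    keys.len()
}

fn main() {
    // Two entries whose names differ only by case, as found in some Linux images.
    let tar_entries = ["/usr/share/doc/Makefile", "/usr/share/doc/makefile"];
    println!("case-insensitive host: {} file(s)", distinct_names(&tar_entries, true));
    println!("byte-exact image:      {} file(s)", distinct_names(&tar_entries, false));
}
```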

Because the rootfs is now a real disk image, the surrounding product surface gets cheaper.

- OCI patches. Rootfs patches users want on top of the image get baked into upper.ext4 before boot, instead of bolted on through a runtime overlay protocol.
- Shared lower layers. The per-layer EROFS artifacts are content-addressed by diff ID, so two sandboxes that share a base image share those bytes on disk and in cache.
- Snapshots. A sandbox's writable state is a single ext4 file; preserving or copying it is a file copy.
- Disk-image roots. Custom non-OCI disk-image rootfs reuses the same block-device boot machinery, minus the fsmerge step in front of it.
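The shared-lower-layers point follows directly from keying artifacts by diff ID: a layer that two images have in common is built and stored once. Here is a minimal sketch of such a cache, where `LayerCache`, the `layers/` path scheme, and the placeholder diff IDs are all hypothetical (real diff IDs are full SHA-256 digests).

```rust
use std::collections::HashMap;

/// Hypothetical cache of built layer artifacts, keyed by OCI diff ID.
#[derive(Default)]
struct LayerCache {
    artifacts: HashMap<String, String>, // diff ID -> artifact path
    builds: usize,                      // how many artifacts were actually built
}

impl LayerCache {
    /// Return the artifact path for a layer, building it only on first sight.
    fn get_or_build(&mut self, diff_id: &str) -> &str {
        if !self.artifacts.contains_key(diff_id) {
            self.builds += 1; // stand-in for the expensive tar-to-EROFS conversion
            let path = format!("layers/{}.erofs", diff_id.replace(':', "-"));
            self.artifacts.insert(diff_id.to_string(), path);
        }
        &self.artifacts[diff_id]
    }
}

fn main() {
    // Two images that share their two lowest layers (placeholder diff IDs).
    let image_a = ["sha256:base", "sha256:python", "sha256:app-a"];
    let image_b = ["sha256:base", "sha256:python", "sha256:app-b"];
    let mut cache = LayerCache::default();
    for diff_id in image_a.iter().chain(image_b.iter()) {
        cache.get_or_build(diff_id);
    }
    println!("layers referenced: 6, artifacts built: {}", cache.builds);
}
```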

- OCI rootfs only. Bind volumes (host directories you share into the VM) still go through the old path. Their contents can change at any time while the VM is reading from them, which a read-only disk image cannot represent.
- First pulls aren't faster. We do more work at pull time now to build the images, though it is parallel across layers and bounded by tar decompression, so it lands close to where it was. Subsequent sandbox creates are faster, because we only emit a sparse scratch image.
- Writes to the image are still copy-on-write through overlayfs. Modifying a file from the image copies it up into the writable upper, exactly as in any overlay setup.
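Copy-up can be modeled in a few lines: reads prefer the upper layer and fall back to the lower one, while the first write to a lower file copies it into the upper layer and modifies the copy. This is a toy in-memory model of the overlay semantics, not kernel code; `Overlay` and its methods are hypothetical names.

```rust
use std::collections::HashMap;

struct Overlay {
    lower: HashMap<String, Vec<u8>>, // read-only image contents
    upper: HashMap<String, Vec<u8>>, // per-sandbox writable layer
}

impl Overlay {
    /// Reads see the upper copy if one exists, otherwise the lower file.
    fn read(&self, path: &str) -> Option<&[u8]> {
        self.upper.get(path).or_else(|| self.lower.get(path)).map(|v| v.as_slice())
    }

    /// Appending to a file copies it up from the lower layer first (if needed),
    /// then changes only the upper copy. The lower layer is never touched.
    fn append(&mut self, path: &str, data: &[u8]) {
        let lower = &self.lower;
        self.upper
            .entry(path.to_string())
            .or_insert_with(|| lower.get(path).cloned().unwrap_or_default())
            .extend_from_slice(data);
    }
}

fn main() {
    let mut fs = Overlay { lower: HashMap::new(), upper: HashMap::new() };
    fs.lower.insert("/etc/hosts".to_string(), b"127.0.0.1 localhost\n".to_vec());

    fs.append("/etc/hosts", b"10.0.0.2 db\n");
    println!("merged view: {} bytes", fs.read("/etc/hosts").map_or(0, |b| b.len()));
    println!("lower copy:  {} bytes", fs.lower["/etc/hosts"].len());
}
```

The cost to notice is that the whole file is copied on the first write, regardless of how small the change is.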