
What is Firecracker?

▲ 18 points 0 comments by Kylejeong21 2w ago HN discussion ↗

Pangram verdict · v3.3

We believe that this document is a mix of AI-generated, and human-written content

79 %

AI likelihood · overall

Mixed
20% human-written 80% AI-generated
SEGMENTS · HUMAN 0 of 6
SEGMENTS · AI 6 of 6
WORD COUNT 1,706
PEAK AI % 100% · §1
Analyzed
May 14
backend: pangram/v3.3
Segments scanned
6 windows
avg 284 words each

Article text · 1,706 words · 6 segments analyzed


Every day, AWS Lambda runs trillions of function invocations. AWS Fargate schedules millions of containers. Every one of those runs inside a full virtual machine, with its own kernel, booted in a fraction of a second.

How? About 50,000 lines of Rust called Firecracker, which exists because the industry finally admitted that Linux containers, built to control resource usage, were never designed to be a security boundary.[1]

The isolation problem

Every Docker container on your laptop is three Linux kernel features in a trench coat:

Namespaces are blindfolds. A process inside one gets a private view of the system: its own PID list, network stack, mount table, hostname, and user IDs. PID 1 inside the container is some random PID on the host; the container can't even see the other processes.

cgroups are budgets. Control groups are the kernel's accounting and rate-limiting layer. They cap how much CPU, memory, disk IO, and network bandwidth a process tree is allowed to consume.

seccomp + capabilities are allowlists. Capabilities chop root's powers into ~40 separate privileges (bind low ports, load kernel modules, mount filesystems, etc.) so you can grant only the ones you need. seccomp is a per-process filter that decides which syscalls (userspace's only API into the kernel) the process is even allowed to make.

You can prove it yourself without Docker installed:

    # spin up your own "container" in one line
    unshare --user --map-root-user --mount --pid --net --uts --ipc --fork --mount-proc bash

Everything else Docker does (image layers, registries, DNS) is orchestration on top.

All of that protection funnels through a single Linux kernel, around 30 million lines of code exposing 400+ syscalls. Every container on the host calls into that same kernel. One bug in any one of those syscalls and it's game over for every tenant on that machine.

Full virtual machines solve isolation by brute force: every VM gets its own kernel.

Modern CPUs have a "guest mode" that runs guest instructions on the real silicon. The host only gets pulled in when the guest does something privileged (touches real hardware, faults, gets interrupted). A hypervisor is the thin layer that arbitrates those moments.
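
The "blindfold" claim about namespaces is easy to check by hand. The kernel exposes each process's namespace membership as symlinks under /proc/<pid>/ns, and two processes share a namespace exactly when those links point at the same identifier. A minimal sketch, assuming a Linux host with /proc mounted; the helper names ns_id and same_ns are made up for this example.

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier, e.g. "pid:[4026531836]".
# PID may be a number or "self"; TYPE is an entry under /proc/<pid>/ns
# (pid, net, mnt, uts, ipc, user, ...).
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE: succeed if both processes share that namespace.
same_ns() {
    [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
}
```

Run the unshare one-liner in one terminal, find the host PID of that bash from a second terminal, and `same_ns $$ <that pid> pid` should fail, while the same check between two ordinary shells succeeds. Reading another process's ns links can require extra privileges, so you may need root for the comparison.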


Linux ships its hypervisor as a kernel module called KVM, exposed at /dev/kvm. It rides on hardware virt extensions (vmx on Intel, svm on AMD):

    # do you have hardware virt?
    grep -E 'vmx|svm' /proc/cpuinfo | head -1
    ls -l /dev/kvm

The problem with full VMs is they're slow and fat. A classic QEMU VM emulates a whole imaginary PC (BIOS, PCI bus, IDE controller, VGA card, PS/2 keyboard) because that's what a 1998 OS expected to boot against. The image is hundreds of megabytes. Boot takes seconds. Memory footprint is hundreds of MiB before your workload even starts. For a web request that lives 40ms, a 5-second boot means spending over 100× that just booting the machine.

So you're caught between:

Containers: 50ms boot, 5 MiB overhead, shared-kernel attack surface.
VMs: 5+ second boot, 300+ MiB overhead, hardware-isolated.

Everyone running untrusted multi-tenant code (AWS, and basically every existing AI sandbox vendor) needs the good half of each: container-like speed and density with VM-grade isolation.

Enter microVMs

A VMM (Virtual Machine Monitor) is the user-space process that drives the hypervisor: it sets up guest memory, plugs in virtual devices, and tells KVM to start running guest code.

A microVM is a VM whose VMM has deleted the 1998 PC: no BIOS, no PCI bus, no VGA, no USB, no ACPI (none of the legacy hardware a real desktop boots through, and none of it relevant to a 40ms function call). What's left: KVM, a serial console, and a handful of virtio devices (net, block, vsock).

virtio is the standard "I know I'm running in a VM" device interface. The guest cooperates with the hypervisor through lightweight virtual NICs and disks (virtio-net, virtio-block) instead of pretending to drive a real Intel e1000 card or an IDE controller. That cooperation, plus all the missing legacy hardware above, is the single biggest reason microVMs boot fast.
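
The container-versus-VM trade-off above is plain arithmetic, and it helps to run the numbers. A small sketch using only the round figures from the list (50ms vs 5s boot, 5 vs 300 MiB overhead, a 40ms request); the function names and the 1000-instance host are illustrative assumptions, not measurements.

```shell
#!/bin/sh
# boot_overhead_x BOOT_MS REQUEST_MS: how many request-lifetimes one boot
# costs (integer division, rounded down).
boot_overhead_x() {
    echo $(( $1 / $2 ))
}

# overhead_mib PER_INSTANCE_MIB COUNT: total bookkeeping memory on one host.
overhead_mib() {
    echo $(( $1 * $2 ))
}

boot_overhead_x 5000 40   # classic VM, 5 s boot   -> 125
boot_overhead_x 50 40     # container, 50 ms boot  -> 1
overhead_mib 300 1000     # 1000 classic VMs       -> 300000
overhead_mib 5 1000       # 1000 containers        -> 5000
```

At a thousand instances the classic VM pays roughly 293 GiB of overhead before any workload runs, which is the gap a microVM has to close.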


The result:

~125ms boot from VMM launch to guest userspace running init.
<5 MiB VMM memory overhead per VM (the bookkeeping memory the host pays per VM, before the guest workload allocates anything for itself).
150 VMs/second creation rate on a single host.
~2–8% runtime performance hit vs bare metal.

Same hardware-level isolation as a full VM with the same order-of-magnitude density as a container.

[Interactive figure: bare container / gVisor / Firecracker / full VM, showing what is shared vs isolated in each]

Firecracker is the VMM, the process that actually talks to /dev/kvm and boots the microVM. The rest of this post is that stack end to end.

In November 2018, AWS open-sourced Firecracker at re:Invent. It was already running Lambda in production; it is the thing that makes your import pandas cold-start fast enough to bill by the millisecond. In 2020, the team published the architecture at NSDI '20[2].

The architecture

Firecracker started as a fork of Google's crosvm (itself already written in Rust), with more than half the code removed. Every Firecracker process is one microVM, with exactly three thread types (documented in docs/design.md):

API thread is the order desk. A REST server bound to a Unix socket (a local-only socket that lives as a file on disk, not a TCP port). Accepts configuration before boot and limited actions after.

VMM thread is the hardware shop floor. It pretends to be every device the guest can see. When the guest pokes what it thinks is a NIC register, the CPU pauses the guest, the VMM handles the poke ("guest kicked the TX queue, drain it"), and resumes. The mechanism is memory-mapped I/O (MMIO): the guest reads/writes magic addresses; the CPU traps those out to the host.

vCPU threads are the runners. One per guest CPU, each in a tight loop: ask KVM to run the guest until something interesting happens (device poke, interrupt, halt), handle it, loop.

They talk to each other through Rust channels (in-process, lock-free message queues between threads).
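
You can see the three thread types on a live system by listing the threads of a running Firecracker process. A minimal sketch, assuming Linux with /proc mounted; list_threads is a made-up helper that works for any PID, and nothing in it is Firecracker-specific.

```shell
#!/bin/sh
# list_threads PID: print "TID NAME" for every thread of a process,
# read from /proc/<pid>/task/<tid>/comm.
list_threads() {
    for task in /proc/"$1"/task/*; do
        # Skip unreadable entries (and the unexpanded glob if PID is gone).
        [ -r "$task/comm" ] || continue
        printf '%s %s\n' "${task##*/}" "$(cat "$task/comm")"
    done
}
```

With a microVM booted, `list_threads "$(pidof firecracker)"` should print one line for the main VMM thread, one for the API thread, and one per vCPU. From memory the names look something like fc_api and fc_vcpu 0, but treat those exact names as an assumption and check them against your Firecracker version.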

The guest sees exactly four devices.

The four devices

virtio-net is the VM's NIC, no 1998 emulation. The guest writes packets into a virtqueue (a ring buffer in shared memory); the VMM drains them out through a host-side TAP device (a virtual Ethernet interface the kernel exposes as a file), driven by an epoll event loop so the VMM thread doesn't block.

virtio-block is the VM's disk, just file IO on the host. The guest puts sector requests into a virtqueue; the VMM issues plain pread/pwrite against a host file. No IDE, no AHCI, no SCSI.

virtio-vsock is the VM's intercom to the host. Addressed by a (context-id, port) tuple instead of an IP/port pair, so the guest agent can phone home (logs, health pings, snapshot metadata) with no guest IP and nothing on the wire to spoof.

8250 serial UART is the boot console. A tiny legacy serial chip emulated at a fixed address. Used for early-boot logs and crash dumps before virtio comes up. Cheap, universal, never going away.

Booting a microVM, end to end

The API is the entire control plane: the configuration channel, kept deliberately separate from the data plane (the vCPU threads that actually run guest code). You start the binary pointed at a Unix socket:

    rm -f /tmp/fc.sock
    ./firecracker --api-sock /tmp/fc.sock &

Then you PUT configuration into it:

    # 1. Configure boot source
    curl --unix-socket /tmp/fc.sock -X PUT 'http://localhost/boot-source' \
      -H 'Content-Type: application/json' \
      -d '{
        "kernel_image_path": "./vmlinux-6.1",
        "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
      }'

    # 2. Configure rootfs
    curl --unix-socket /tmp/fc.sock -X PUT 'http://localhost/drives/rootfs' \
      -H 'Content-Type: application/json' \
      -d '{
        "drive_id": "rootfs",
        "path_on_host": "./rootfs.ext4",
        "is_root_device": true,
        "is_read_only": false
      }'

    # 3. Configure network
    curl --unix-socket /tmp/fc.sock -X PUT 'http://localhost/network-interfaces/eth0' \
      -H 'Content-Type: application/json' \
      -d '{
        "iface_id": "eth0",
        "guest_mac": "06:00:AC:10:00:02",
        "host_dev_name": "tap0"
      }'

    # wait for async config writes to apply
    sleep 0.015

    # 4. Trigger actions (start VM)
    curl --unix-socket /tmp/fc.sock -X PUT 'http://localhost/actions' \
      -H 'Content-Type: application/json' \
      -d '{ "action_type": "InstanceStart" }'

Four HTTP calls. That's the entire control plane.

[Interactive animation: the four PUT calls flow into the VMM, vCPUs spin up, and the guest kernel hits init in ~125ms]

The security onion

A single KVM boundary is already strong. Firecracker wraps two more layers around it.

The jailer is a sandbox-builder. Its only job is to box up the VMM before it ever runs. It creates a chroot (a Linux feature that locks a process to a single directory subtree as if that directory were the root of the filesystem; the process literally cannot name anything above it), drops into a new PID namespace so it can't see the host's other processes, switches to an unprivileged uid/gid, applies cgroup CPU/memory limits, and only then execs the Firecracker binary inside that jail:

    jailer \
      --id vm-42 \
      --uid 1000 --gid 1000 \
      --chroot-base-dir /srv/jailer \
      --exec-file /usr/local/bin/firecracker \
      -- \
      --api-sock /run/fc.sock

Now the VMM process itself has no filesystem except a dedicated chroot, no view of other processes on the host, and no root capabilities. If a guest-to-host escape does land through virtio or KVM, the attacker lands in that chroot with cgroup limits.

Seccomp is a per-thread syscall allowlist. Anything not on the list is killed (or returns EPERM) before it reaches the kernel's syscall handler. Early Firecracker releases shipped three levels (newer releases drop the level flag in favor of a built-in default filter, with --no-seccomp and --seccomp-filter as overrides):

Level 0: off. Don't use in prod.
Level 1: allow-list by syscall number.
Level 2: also constrain argument values (e.g. ioctl is fine, but only with KVM_RUN as the command). Default and recommended.

Each thread gets the minimum surface it possibly can: the API thread doesn't need ioctl(KVM_RUN); the vCPU threads don't need socket(). A simplified view of what one rule looks like:

    {
      "vcpu": {
        "default_action": "trap",
        "filter": [
          { "syscall": "ioctl", "args": [{ "index": 1, "value": "KVM_RUN" }] },
          { "syscall": "read" },
          { "syscall": "write" },
          { "syscall": "epoll_wait" }
        ]
      }
    }

Every layer has to fail before an attacker reaches the host.

Snapshots: the cheat code behind Lambda SnapStart

Take a snapshot of a running microVM. Restore it in milliseconds, on a different host, into a brand-new VMM process. Skip kernel boot, skip init, skip JIT warmup.

You freeze the running VM and dump memory + device state to disk:

    curl --unix-socket /tmp/fc.sock -X PATCH 'http://localhost/vm' \
      -H 'Content-Type: application/json' \
      -d '{"state": "Paused"}'

    curl --unix-socket /tmp/fc.sock -X PUT 'http://localhost/snapshot/create' \
      -H 'Content-Type: application/json' \
      -d '{
        "snapshot_type": "Full",
        "snapshot_path": "/snap/vm.state",
        "mem_file_path": "/snap/vm.mem"
      }'

A snapshot captures the post-warmup state, so the restored VM wakes up in the middle of its life, not at the beginning of it.

This is exactly what AWS Lambda SnapStart does: initialize a Java Lambda once, snapshot the microVM, and restore that snapshot on every subsequent cold start (announcement). JVM cold starts suddenly go from 8+ seconds to sub-second.
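
The restore half of the snapshot flow is one more PUT, this time against a fresh Firecracker process that has not booted anything yet. A minimal sketch, assuming the /snapshot/load endpoint and the snapshot_path, mem_backend, and resume_vm fields as I recall them from Firecracker's snapshot documentation. Older releases take mem_file_path instead of mem_backend, so check the field names against your version. The function name and socket path are made up for this example.

```shell
#!/bin/sh
# fc_load_snapshot SOCK STATE_FILE MEM_FILE
# Ask a freshly started Firecracker process listening on SOCK to load a
# snapshot and resume the guest right away.
# Field names are an assumption based on recent Firecracker API docs.
fc_load_snapshot() {
    curl --unix-socket "$1" -X PUT 'http://localhost/snapshot/load' \
        -H 'Content-Type: application/json' \
        -d "{
            \"snapshot_path\": \"$2\",
            \"mem_backend\": { \"backend_type\": \"File\", \"backend_path\": \"$3\" },
            \"resume_vm\": true
        }"
}
```

Usage, reusing the files written by the snapshot/create call: start a new VMM with `./firecracker --api-sock /tmp/fc-restore.sock &`, then run `fc_load_snapshot /tmp/fc-restore.sock /snap/vm.state /snap/vm.mem`. No boot-source, drive, or InstanceStart calls are needed, since the snapshot already carries that state.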