Cutting inference cold starts by 40x with LP, FUSE, C/R, and cuda-checkpoint
We are in the age of inference. Billion- to trillion-parameter neural networks are run on specialized accelerators at quadrillions of operations per second to generate media, author software, and fold proteins at massive scale.

Inference workloads are more variable and less predictable than the training workloads that previously dominated. That makes them a natural fit for serverless computing, where applications are defined at a level above the (virtual) machine so that they can be more readily scaled up and down to handle variable load.

But serverless computing only works if new replicas can be spun up quickly: as fast as demand changes, which can be at the scale of seconds. Naïvely spinning up a new instance of, say, SGLang serving a billion-parameter LLM on a B200 can take tens of minutes or stall for hours on GPU availability.

At Modal, we’ve done deep engineering work over the last five years to solve this problem. In this blog post, we walk through what we did. There are four key ingredients:

- Cloud buffers: maintain a small buffer of healthy, idle GPUs to take on new load
- Custom filesystem: serve container images lazily out of a content-addressed, multi-tier cloud-native cache
- Checkpoint/restore: fast-forward through CPU-side initialization by directly restoring processes into memory
- CUDA checkpoint/restore: fast-forward through GPU-side initialization by directly restoring CUDA contexts into memory

Together, they take AI inference server replica scaling from multiple kiloseconds to just tens of seconds.

We’ve shared bits and pieces of this work along the way, because we believe that secrecy is a bad moat. And if more people learn how to use GPUs efficiently, there will be more available in the market for us! But this blog post represents the first time we’ve put the entire story together in one place. We hope it convinces you that our system is worth buying into, or joining us to build it.

Why care about serverless GPUs?
To maximize GPU Allocation Utilization for inference workloads.

First, let’s frame the problem clearly. GPUs are expensive and scarce, so we want to maximize their utilization, where “utilization” is the following unitless quantity:

Utilization := Output achieved ÷ Capacity paid for

There are many ways to measure utilization, that is, to define output and capacity.
The most sophisticated and most stringent here is probably “Model FLOP/s Utilization” (MFU), which divides raw algorithmic operation requirements by aggregate arithmetic bandwidth. This is catnip for engineers. It’s also especially critical for “hero run” large-scale training, so it draws a lot of investment and attention, e.g. recently as everyone dunked on xAI’s ~10% MFU.

But at the other end of the stack, there’s a more basic form of utilization that wrecks the relationship between achieved output and allocated capacity for inference workloads, GPU Allocation Utilization:

GPU Allocation Utilization := GPU-seconds running application code ÷ GPU-seconds paid for

Aside on "GPU Utilization" terminology

The "GPU utilization" reported by nvidia-smi and similar tools is in between these two extremes. It reports the fraction of the time that kernel code is running on the GPU (literally, the fraction of the sample period during which one or more kernels was executing on the GPU). Read more here.

Inference applications have highly variable scale. Unlike training, the demand for capacity is outside the direct control and management of the engineering organization. Instead, it is driven by external user behavior: by markets or social media algorithms or product teams.

Here’s a sample trace of requests per minute from a time-varying Poisson process we use to model inference applications. Notice not only the seasonal variation (daily cycles) but also the long-term trend of increasing variability in demand as the average demand increases.

Spiky demand raises serious engineering problems. To borrow from Marc Brooker of AWS: “the cost of a system scales with its (short-term) peak traffic, but for most applications the value the system generates scales with the (long-term) average traffic.” Spiky demand means high peak-to-average ratios, which challenge system economics.

Concretely, imagine the capacity planning for such an application.
You might have demand (measured in GPUs required to service requests within latency targets) that looks like this:

[Figure: With a fixed, over-provisioned GPU allocation, utilization is low]

To properly service your anticipated load, you allocate (rack-and-stack, rent on a hyperscaler) 140 GPUs. But most of those GPUs sit idle most of the time: the GPU Allocation Utilization is low.

You might accuse us of talking our book here. But we aren’t the only ones to call this out!
See the excellent blog post by Hebbia. And we have data, not just vibes: according to the State of AI Infrastructure at Scale report in 2024, the majority of organizations achieve less than 70% GPU Allocation Utilization when running at peak demand. Actual GPU Allocation Utilizations are often closer to 10-20%.

With fixed allocations, demand can also exceed supply during unanticipated spikes. Trying to anticipate them just increases cost further, and by more than it increases revenue.

What’s so hard about serverless GPUs? Startup latency.

The immediate solution is to provision auto-scaling capacity: when demand increases, increase your supply. Done naïvely, this actually worsens the problem:

[Figure: If allocation is slow, utilization and QoS suffer]

Without optimization, going from hyperscaler API request to a running service replica can take tens of minutes. You need to do the following:

- spin up a new instance and health-check it (minutes to tens of minutes)
- load application program and filesystem state (minutes)
- start the application program on the host, ready it to service requests (tens of seconds)
- start the application program on the device, ready it to service requests (minutes to tens of minutes)

During all of this time, load is in excess of capacity, and QoS typically degrades (absorbed into higher concurrency or queues and thus inflated tail latencies or, worse, 503s). That means angry users.

If the capacity takes too long to come online, it can even miss a transient spike. But given the unpredictability of demand and the difficulty of allocation, that capacity typically sticks around, under-utilized, for an extended period.

At Modal, we’ve optimized the spin-up of inference applications on GPUs from many tens of minutes down to a few seconds or tens of seconds. With these optimizations, a wide variety of inference applications on GPUs can run “truly serverlessly”: with provisioned supply tightly matched to system demand.
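To make the trade-off concrete, here is a minimal sketch that applies the GPU Allocation Utilization definition to a made-up demand trace. Everything in it is an assumption for illustration: the trace shape (a daily sine cycle plus a 20-minute spike), the function names, and the simplification that an autoscaler's allocation is just demand shifted late by the startup delay (so scale-down is delayed too). "GPU-seconds running application code" is approximated as the smaller of demand and allocation in each minute.

```python
import math

MINUTES_PER_DAY = 1440

def synthetic_demand(days=2):
    """GPUs needed per minute: a daily cycle plus a 20-minute spike each day (made-up numbers)."""
    demand = []
    for t in range(days * MINUTES_PER_DAY):
        minute = t % MINUTES_PER_DAY
        base = 40 + 30 * math.sin(2 * math.pi * minute / MINUTES_PER_DAY)
        spike = 70 if 600 <= minute < 620 else 0
        demand.append(round(base) + spike)
    return demand

def allocation_utilization(demand, allocated):
    """GPU-minutes doing useful work divided by GPU-minutes paid for."""
    used = sum(min(d, a) for d, a in zip(demand, allocated))
    return used / sum(allocated)

def unmet_fraction(demand, allocated):
    """Share of demanded GPU-minutes that found no GPU (a crude QoS proxy)."""
    unmet = sum(max(d - a, 0) for d, a in zip(demand, allocated))
    return unmet / sum(demand)

def autoscaled(demand, startup_delay):
    """Allocation that tracks demand but arrives `startup_delay` minutes late."""
    return [demand[max(t - startup_delay, 0)] for t in range(len(demand))]

demand = synthetic_demand()
fixed = [max(demand)] * len(demand)  # provision for the peak, all the time

scenarios = [
    ("fixed at peak", fixed),
    ("autoscale, 30 min startup", autoscaled(demand, 30)),
    ("autoscale, 1 min startup", autoscaled(demand, 1)),
]
for name, allocated in scenarios:
    print(
        f"{name:28s}"
        f" utilization={allocation_utilization(demand, allocated):.2f}"
        f" unmet={unmet_fraction(demand, allocated):.3f}"
    )
```

In this toy model the fixed allocation never drops a request but pays for idle GPUs most of the day, while the 30-minute autoscaler misses the 20-minute spike entirely and then pays for the capacity after the spike has passed. Only the fast autoscaler keeps both numbers in good shape.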
[Figure: With fast, automatic allocation, utilization and QoS can both be high]

In the rest of this document, we will explain the engineering approach we took and the performance optimizations we implemented for each of the four steps above, which span the stack from cloud storage systems and machine management to local disks, CPUs, and, of course, GPUs.
Together, these optimizations allow inference on Modal to spin up 40x faster: about 50 seconds instead of about 2,000 seconds.

[Figure: Inference servers that take upwards of 2 kiloseconds to boot naïvely boot in ~50 seconds on Modal. The key architectural optimizations that achieve this speedup are indicated and color-coded by the key system component that they target: GPUs and GPU RAM, CPUs and CPU RAM, local solid state disk (SSD), or machine/instance management. We use this color scheme and schematic throughout the post.]

You can remove tens of minutes of latency by taking instance allocation and health checks out of the hot path.

Consider the first step in replica spin-up:

- spin up a new instance and health-check it (minutes to tens of minutes)

We can remove this from the hot path by doing it ahead of time: running a buffer of idle, healthy GPUs, shared by many applications, scheduling new replicas onto those units, and spinning up new devices into the buffer asynchronously. We can also de-allocate units when the buffer grows too large, as replicas spin down.

[Figure: Maintaining a small buffer of ready-to-use, but unused, machines on top of allocated machines allows for new replicas to be quickly scheduled onto empty machines (indicated by bright colors). Servicing requests from the buffer removes tens of minutes of latency from replica spin-up.]

Aside on system- and application-level buffers

If you're running a single workload, rather than a multi-workload system, you might ask why we are only moving instance allocation out of the "hot path" and into the buffer. Can't we move more of our setup work into the buffer? And you can! Modal users can maintain an application-layer buffer of replicas ready to service requests with buffer_containers. But even then, the size of the buffer you need to absorb spikes of a given magnitude scales with how long it takes to create new replicas, and so the optimizations described below are still important for single-workload systems.
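The buffer mechanics described above can be sketched in a few lines. This is an illustrative toy, not Modal's scheduler: the `WarmPool` class, its method names, and the fixed `target_idle` policy are all hypothetical. It shows the split between the hot path (`schedule`, which only ever hands out an already-healthy machine) and the asynchronous path (`reconcile` and `machine_ready`, which top up or trim the buffer).

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class WarmPool:
    """Hypothetical buffer of idle, health-checked machines shared by many apps."""
    target_idle: int
    idle: deque = field(default_factory=deque)
    busy: set = field(default_factory=set)
    pending: int = 0  # machines requested but not yet health-checked

    def machine_ready(self, machine_id):
        """Off the hot path: a requested machine booted and passed its health check."""
        self.pending = max(self.pending - 1, 0)
        self.idle.append(machine_id)

    def schedule(self):
        """Hot path: return an idle machine immediately, or None if the buffer is empty."""
        if not self.idle:
            return None
        machine_id = self.idle.popleft()
        self.busy.add(machine_id)
        return machine_id

    def release(self, machine_id):
        """A replica spun down; its machine goes back into the buffer."""
        self.busy.discard(machine_id)
        self.idle.append(machine_id)

    def reconcile(self):
        """Return how many machines to request (>0) or how many were dropped (<0)."""
        shortfall = self.target_idle - (len(self.idle) + self.pending)
        if shortfall > 0:
            self.pending += shortfall
            return shortfall
        to_drop = min(-shortfall, len(self.idle))
        for _ in range(to_drop):
            self.idle.pop()  # de-allocate surplus idle machines
        return -to_drop

pool = WarmPool(target_idle=2)
print(pool.reconcile())    # buffer is empty, so request machines
pool.machine_ready("gpu-a")
pool.machine_ready("gpu-b")
print(pool.schedule())     # served from the buffer with no cloud API call
print(pool.reconcile())    # top the buffer back up in the background
```

A real system would size the buffer from observed demand and prices rather than a constant, which is where the optimization problem discussed next comes in.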
Managing both active instances and this buffer is a fun linear programming problem, as we’ve written elsewhere. Very roughly, it looks like: We use Google’s GLOP solver, feeding it scraped prices from cloud providers and tasks from users. Because cloud providers don’t always have capacity at prices and in regions that they advertise, we need to also feed back in the observed supply.
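As a rough illustration of the shape of such a problem (and not the formulation used in production, which is not reproduced here), consider the simplest possible version: choose how many GPUs x_i to take from each offer i to minimize total price, subject to covering tasks plus buffer and never exceeding the observed supply s_i. With a single covering constraint and per-offer upper bounds, the linear program's optimum is a cheapest-first fill, so this sketch needs no solver. The offer names, prices, and the `cheapest_fill` helper are hypothetical; a formulation with multiple GPU types, regions, and other constraints would need a general LP solver such as GLOP.

```python
def cheapest_fill(offers, gpus_needed):
    """Solve: minimize sum(price_i * x_i)
    subject to sum(x_i) >= gpus_needed and 0 <= x_i <= observed_supply_i.

    offers: list of (name, price_per_gpu_hour, observed_supply) tuples.
    Returns a dict mapping offer name to the number of GPUs to allocate.
    """
    plan = {}
    remaining = gpus_needed
    # For this one-constraint LP, taking the cheapest supply first is optimal.
    for name, _price, supply in sorted(offers, key=lambda offer: offer[1]):
        if remaining <= 0:
            break
        take = min(supply, remaining)
        if take > 0:
            plan[name] = take
            remaining -= take
    if remaining > 0:
        raise ValueError("infeasible: observed supply is below the GPUs needed")
    return plan

# Hypothetical scraped prices, with supply fed back from what was actually obtainable.
offers = [
    ("cloud-a/us-east", 3.00, 4),
    ("cloud-b/eu-west", 2.50, 3),
    ("cloud-c/us-west", 4.00, 10),
]
tasks, buffer_size = 6, 2
plan = cheapest_fill(offers, tasks + buffer_size)
hourly_cost = sum(price * plan.get(name, 0) for name, price, _ in offers)
print(plan, hourly_cost)
```

Note how the observed supply caps do real work here: the cheapest offer cannot cover the whole request, so the plan spills into more expensive ones.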
Running a buffer limits peak allocation utilization to below 100%. This is a reasonable trade-off to make, since 100% utilization is generally a mirage. Consider that it is common practice to spin up new replicas and even page engineers when utilization of other resources, like CPU or IOPS, gets too high! This is important for robustness. A 100% utilized system has no margin for error, and so faults routinely become failures. We can personally recommend adding more buffers to your life: keep an extra toothbrush in your bathroom; keep a charger for your critical devices at home, the office, and on your person.

This buffer is especially useful for accommodating a wider variety of workloads on a single system. At Modal, we’ve leaned into supporting a variety of “development” workloads, not just production serving, because we can quickly create a new development environment. As an extra win, these environments are reproducible-by-default and on prod-ready infra. Closing the gap with production infrastructure also improves development velocity.

The devil is, of course, in the details. One key piece: health checks are critical for GPUs, which fail at a much higher rate than other hardware, including notoriously finicky components like spinning disks. We wrote about our GPU health-checking system in detail here. The tl;dr is that, in our experience, you need to run a short active health check on boot and monitor for health issues that arise later, but you can defer more intense checks (like dcgmi diag) to a slower cadence (for us, weekly).

[Figure: Critical-level Xid errors per hour per GPU, grouped by (anonymized) cloud. Failure rates are far from negligible!]

You can cut container start from minutes to seconds by serving files lazily out of a content-addressed cache.

Now let’s consider the next step:

- load application program and filesystem state (minutes)

In contemporary practice, this generally means booting up one or more containers or VMs.
Roughly, a container is a root filesystem backing a process with limited permissions. For distributed deployments of many containers, performance is bottlenecked by the construction of the root filesystem on the worker instance. Root filesystems of operating system distros are thicc — tens of thousands of files, gigabytes in size.
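To see why a lazy, content-addressed design helps with root filesystems this large, here is a minimal sketch with only two tiers (a remote store and a per-host cache), both modeled as in-memory dictionaries. The `BlobStore` and `LazyImage` classes, the file paths, and the contents are hypothetical stand-ins, not Modal's filesystem. An image is reduced to a manifest mapping paths to SHA-256 digests, and bytes are fetched only on first read.

```python
import hashlib

class BlobStore:
    """Stand-in for remote content-addressed storage: digest -> bytes."""
    def __init__(self):
        self._blobs = {}
        self.fetches = 0  # counts remote round trips

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data  # identical content always maps to one key
        return digest

    def get(self, digest: str) -> bytes:
        self.fetches += 1
        return self._blobs[digest]

class LazyImage:
    """A root filesystem as a manifest; file bytes are fetched on first read."""
    def __init__(self, manifest, remote, local_cache):
        self.manifest = manifest        # path -> digest
        self.remote = remote
        self.local_cache = local_cache  # digest -> bytes, shared per host

    def read(self, path: str) -> bytes:
        digest = self.manifest[path]
        if digest not in self.local_cache:
            self.local_cache[digest] = self.remote.get(digest)
        return self.local_cache[digest]

remote = BlobStore()
libc = remote.put(b"pretend libc bytes")
app_a = remote.put(b"print('app a')")
app_b = remote.put(b"print('app b')")
docs = remote.put(b"man pages nobody opens")

host_cache = {}
image_a = LazyImage(
    {"/lib/libc.so": libc, "/app/main.py": app_a, "/usr/share/doc": docs},
    remote, host_cache,
)
image_b = LazyImage({"/lib/libc.so": libc, "/app/main.py": app_b}, remote, host_cache)

image_a.read("/lib/libc.so")
image_a.read("/app/main.py")
image_b.read("/lib/libc.so")  # same digest as image A's copy: served from the host cache
image_b.read("/app/main.py")
print(remote.fetches)
```

Two properties fall out of the design: files that are never read (the docs here) are never transferred, and files shared between different images are transferred once per host rather than once per image.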