Speculative KV coding: losslessly compressing KV cache by up to ~4× using a predictor model
Pangram verdict · v3.3
We believe that this document is primarily human-written, with some AI-generated content detected
AI likelihood · overall
MixedArticle text · 1,619 words · 7 segments analyzed
The size of LLM context grows by the day. KV caching is what makes running those long contexts affordable: it trades compute for memory so the model doesn’t re-prefill work it has already done. But as agentic workflows push contexts ever longer, storing and moving the cache starts to dominate everything. To get to the next order of magnitude of LLM capability, we need it to be smaller. You can make it smaller lossily. TurboQuant is a (somewhat controversialPerformance exploration, accusations of academic misconduct.) recent example, dropping the bit-width of K and V and absorbing the resulting quality loss. The cost of that route is that the loss isn’t something you specify in advance: you find out what degraded by running evals and hoping they catch whatever you killed. Lossless compression sidesteps the question entirely, by reconstructing the cache exactly. We explored lossless compression of LLM weights in a recent post. The numbers for KV cache look much the same: empirically, the bytewise entropy of a bf16 cache is about 11 bits per scalar, around 30% smaller than the raw representation. Worse, as we showed thenThat post was about weight compression, but KV cache is reasonably similar. See here for a nice paper., this slack collapses as the bitwidth comes down. FP4 weights are within 5-7% of saturating their format, and the same goes for caches stored at lower precision. Given that low bitwidths have such obvious benefits for performance, we ought to treat them as the baseline. In this post we introduce Speculative KV coding, a method for losslessly compressing the KV cache of a large target model by up to ~4×4\timesThis 4×4\times is on top of lossy fp8 compression of the cache - the gross benefit is ~8×8\times. using a cheaper predictor model — a faster model whose forward pass on the same prompt predicts what the target’s cache will be. By analogy with speculative decoding (Leviathan et al., 2022), the predictor runs in parallel on both encode and decode sides; an arithmetic coder then encodes the true cache at a bitrate set by how well the predictor fits the target.
A KV cache isn’t really random§ Quick refresher on entropy codingI’ve written about the mechanics of arithmetic coding before — see the rANS and tANS posts. For this post all you need is the bitrate formula above; the coder is a black box that achieves it.. You have a stream of symbols drawn from some distribution pp. Shannon’s source coding theorem says the best any lossless coder can do, on average, is H(p)H(p) bits per symbol. Real coders work with a model distribution qq instead of the true pp, and end up paying H(p,q)=H(p)+KL(p ∥ q)H(p, q) = H(p) + \mathrm{KL}(p \,\|\, q) bits per symbol. The KL term is the slack: the price you pay for your model not being the truth. The KV cache isn’t really a list of samples from a random source.
The whole cache is the deterministic output of a forward pass through known weights on a known prompt. There’s exactly one tensor it can be. So the “true” distribution pp is a delta on that tensor, and a delta has zero entropy. Every bit the coder spends is pure KL: H(p,q)=−lnq(KVtrue)H(p, q) = -\ln q(KV_\mathrm{true}). The bitrate is a direct measurement of how much weight your model qq gives the correct KV cache. What we need, then, is a calibrated model of the specific forward pass that produced one, in a form an arithmetic coder can consume. What should qq look like?§ Suppose for the moment that we had access to something that, given the prompt, produced a reasonable per-scalar prediction μ\mu of the KV cache and a calibrated sense σ2\sigma^2 of how wrong that prediction tends to be. The natural way to turn that into a real distribution is a Gaussian centred on μ\mu with variance σ2\sigma^2, so that q(x)=N(x; μ,σ2)q(x) = \mathcal{N}(x;\, \mu, \sigma^2), in which case the cost (in nats per scalar) of encoding the true value splits cleanly into two terms: −lnq(KVfull) = 12ln(2πσ2)⏟spread cost + (KVfull−μ)22σ2⏟miss cost.-\ln q(\text{KV}_\text{full}) \;=\; \underbrace{\tfrac{1}{2}\ln(2\pi\sigma^2)}_{\text{spread cost}} \;+\; \underbrace{\frac{(\text{KV}_\text{full} - \mu)^2}{2\sigma^2}}_{\text{miss cost}}.The two terms pull in opposite directions: the spread cost wants σ2\sigma^2 small, while the miss cost punishes us when KVfull\text{KV}_\text{full} lands far from μ\mu relative to σ\sigma, and blows up entirely if we set σ2\sigma^2 too small for the actual error.
Taking the expectation over the data and minimising in σ2\sigma^2 puts the optimum at σ2=E[(KVfull−μ)2]\sigma^2 = \mathbb{E}[(\text{KV}_\text{full} - \mu)^2], with an expected per-scalar bitrate of 12ln(2πe σ2)\tfrac{1}{2}\ln(2\pi e\, \sigma^2), which is just the log of the typical error magnitude. Better μ\mu buys us bits directly; miscalibrated σ2\sigma^2 wastes them either way, paying for spread we didn’t need if too large and seeing the miss cost run away if too small. What predicts a KV cache?§ The natural class of things to use as a μ\mu predictor is any model whose forward pass on the same prompt carries information about the target’s KV cache. The bitrate we can hope to reach is bounded below by the conditional entropy H(KVfull∣Mpred(prompt))H(\text{KV}_\text{full} \mid M_\text{pred}(\text{prompt})), and the question of how much compression is available reduces to how small that conditional entropy can be made by choosing MpredM_\text{pred} well. The full pipeline is then: Both sides re-run the predictor on the prompt to reconstruct the same per-scalar (μ,σ)(\mu, \sigma). Only the encoder touches the target model. The arithmetic coder turns (KVfull,μ,σ)(\text{KV}_\text{full}, \mu, \sigma) into bits on the encode side, and turns (bits,μ,σ)(\text{bits}, \mu, \sigma) back into KVfull\text{KV}_\text{full} on the decode side. ENCODEDECODEprompttarget modelpredictorKV_full(μ, σ)coderpromptpredictor(μ, σ)coderKV_fullbits Both sides hold the promptOr the prompt is transferred with the bits — it’s of negligible size relative to the KV cache itself.. Both sides run the predictor and reconstruct the same (μ,σ)(\mu, \sigma) deterministically.
The encoder additionally runs the target model, feeds (KVfull,μ,σ)(\text{KV}_\text{full}, \mu, \sigma) into the arithmetic coder, and emits the bitstream; the decoder consumes the bitstream alongside its locally-reconstructed (μ,σ)(\mu, \sigma) and recovers KVfull\text{KV}_\text{full} exactly. The scheme is lossless because the coder is, and because both sides reconstruct (μ,σ)(\mu, \sigma) from the same prompt and the same predictor. What this leaves us picking is the predictor, and the choice is a cost-versus-bits tradeoff bounded by two extremes. At one end the predictor is the target model itself, so μ=KVfull\mu = \text{KV}_\text{full} and σ=0\sigma = 0, no bits ever cross the wire, and we’ve paid an extra full target-model forward pass to compress nothing of our own forward pass, which is silly. At the other end the predictor outputs pure noise, so μ\mu is arbitrary and σ\sigma has to be set large enough to cover the bf16 range; the predictor is essentially free, but the coder ends up paying close to 16 bits per scalar. Real predictors sit between the two, and the question for any particular construction is how cheaply we can drive the conditional entropy down. An optimised version of the same model§ The most direct predictor we can pick is an optimised version of the same model: same architecture, same prompt, with some structure-preserving optimisation applied to make it cheaper to run while keeping its forward pass close to the original. The output KVopt\text{KV}_\text{opt} lives in the same shape as KVfull\text{KV}_\text{full} and lies close to it elementwise, because a small, well-behaved perturbation of the parameters perturbs the forward pass in a small, structured way.
The residual KVfull−KVopt\text{KV}_\text{full} - \text{KV}_\text{opt} is small and structured, which is exactly the regime in which μ=KVopt\mu = \text{KV}_\text{opt} pays cheaply under the Gaussian model from the previous section. For example, take quantisation. The optimised model is the target with its weights cast to a narrower format (FP8, INT4, MXFP4, and so on)The predictor doesn’t have to be specially constructed; quantised variants of major open-weights models ship alongside their full-precision counterparts, so both sides of the pipeline can pull the same artifact off the shelf., and the perturbation in question is the rounding noise introduced at every weight tensor. Quantisation is well-studied enough that we know its forward-pass error is small for any reasonable scheme; that property is what makes serving a quantised model a viable substitute for the full one in the first place, and the same property gives us KVopt\text{KV}_\text{opt} as a low-residual μ\mu for free, with no training step in sight. The per-channel statistics of the residual (which we plug in as σ\sigma) can be measured once on a small calibration set and then frozen, so encoding and decoding both reduce to running the quantised model on the prompt and feeding (μ,σ)(\mu, \sigma) into the coder. Early results§ The simplest concrete instance: predictor is the FP8 version of the target, μ=KVquant\mu = \text{KV}_\text{quant} taken directly, and σ2\sigma^2 is fitted once on training data as the per-(kv, head, channel) empirical residual variance, pooled across positions. We use the Qwen3 model family, since we have both:
A wide range of model sizes Off the shelf fp8 block-quants for each model in the family
On top of (μ,σ)(\mu, \sigma), we found that encoding under a three-component mixture boosted compression ratesThe idea is that in practise, the distribution of ‘residuals’ KVfull−KVquantKV_\mathrm{full} - KV_\mathrm{quant} isn’t really a Gaussian; it’s long-tailed, with outliers.:
q(x) = 0.95 N(x; μ,σ2) + 0.03 N(x; μ,(3σ)2) + 0.02 p^bf16(x)q(x) \;=\; 0.95\,\mathcal{N}(x;\, \mu, \sigma^2) \;+\; 0.03\,\mathcal{N}(x;\, \mu, (3\sigma)^2) \;+\; 0.02\, \widehat{p}_\text{bf16}(x)where p^bf16\widehat{p}_\text{bf16} is the empirical bf16-symbol distribution measured on training data. The wide-Gaussian component covers moderately-mispredicted values; the empirical-marginal component absorbs deep outliers at roughly the unconditional bf16 cost (~11 bits) instead of letting the Gaussian’s tail blow up on them. The following results are for the held-out C4-validation set, after calibrating on 128 train examples from C4 at 1024 tokens each. Swept across Qwen3 target sizes:
targetN(μ,σ)\mathcal{N}(\mu,\sigma)mixtureratio0.6B6.86516.74192.37×1.7B6.63856.53032.45×4B6.41886.32902.53×8B6.25766.17932.59×14B6.07606.00972.66×32B5.97855.91852.70×