The economics of speculative decoding

F fergusfinn.com ↗



Speculative decoding is one of the cleanest performance wins in inference optimisation: it’s lossless, it hits decode latency when not much else does, and in its standard formulation it’s simple and elegant. It works by looking forwards: speculative decoding takes a position on what tokens will come next. For dense transformers the bet is riskless: accepted tokens pay off, rejected tokens cost nothing, a clean arbitrage on spare memory bandwidth. A burst of research activity has recently pushed the envelope on how far forwards we can take that bet, for example Eagle 3.1, DFlash, SSD.

This post looks at two architectural shifts that have changed the underlying economics of speculation: what mixture-of-experts routing does to the decode roofline, and how compressed attention takes away the slack that used to make speculated tokens free. Then it works through what they mean for when, and how far ahead, we should speculate.

The expert tax

FFN layers in older, dense transformers (like the venerable Llama series) have a simple roofline with batch size: arithmetic intensity climbs linearly with batch size as weights get reused across the batch, then flattens onto the compute ceiling. (I wrote about this model before, here.) The win for speculative decoding is clear. If you’re on the slope of the roofline you’re memory bound, and speculated tokens increase the amount of compute you’re doing without increasing the memory transfer. So both accepted & rejected tokens are free until they push you over the knee.

Modern models almost invariably (with some interesting exceptions) use mixture-of-experts (MoE) layers in place of simple dense FFNs. Each token passes first through a ‘routing’ layer, which orders the relevant experts by affinity. The token hidden state is sent to the top $k$ experts, then the results are recombined. This routing means that the arithmetic intensity of the MoE layer can depend on the actual content of the hidden state inputs, not just the shape.
In practice, one training objective (for both training and large-scale inference reasons) is to keep the experts balanced: if $B$ tokens come in, each of the $E$ experts should process an equal share of the work, a fraction $1/E$ of the $Bk$ token-to-expert assignments, or about $Bk/E$ tokens each.
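The routing step described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's router: the Gaussian scores stand in for real router logits, and `route_top_k` and `expert_loads` are hypothetical helper names. With uniform random scores the per-expert load comes out close to the balanced target of $Bk/E$.

```python
import heapq
import random
from collections import Counter


def route_top_k(scores, k):
    """Return the indices of the k highest-affinity experts for one token."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


def expert_loads(batch_scores, k):
    """Count how many tokens in the batch were routed to each expert."""
    loads = Counter()
    for scores in batch_scores:
        loads.update(route_top_k(scores, k))
    return loads


# ASSUMPTION: random Gaussian scores stand in for real router logits.
E, K, B = 256, 6, 1024
rng = random.Random(0)
batch_scores = [[rng.gauss(0.0, 1.0) for _ in range(E)] for _ in range(B)]

loads = expert_loads(batch_scores, K)
mean_load = sum(loads.values()) / E  # B * K / E = 24.0 by construction
busiest = max(loads.values())
idlest = min(loads.get(e, 0) for e in range(E))
print(f"mean load {mean_load}, busiest {busiest}, idlest {idlest}")
```

The mean is fixed by construction; the spread between the busiest and idlest expert is what a load-balancing objective tries to keep small.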


From here on, take DeepSeek-V4-Flash as an example: $k=6$ routed experts out of $E=256$, plus one always-on shared expert. The intensity-vs-batch curve changes in two ways vs. a dense equivalent.

Barely amortising at the bottom. At small batch each new token added to the batch tends to activate fresh experts (at batch 2 the chance the new token’s experts already match is small), so it drags its own weights across the bus and gets little to no amortisation. The intensity leaves the origin at only half its eventual slope, so a token added here, speculated or not, pays close to full freight for its experts.

Shallower slope / distant knee, same ceiling. Once every expert is being triggered, the MoE line climbs more gently, reaching the same ceiling only at a far larger batch. The free-token band is much wider.

Dense climbs steeply; the MoE is shallower by a factor $(k+1)/(E+1)$. The shaded region under each line is the memory-bound stretch, where speculated tokens are roughly free; it runs much wider for the MoE. This assumes uniform routing to experts, which is a good assumption for DeepSeek, and single-node deployment (expert parallelism changes things a bit). We’re using the fp4 threshold since DeepSeek’s experts are natively mxfp4. Not visible on this plot, because of the shallowness of the MoE roofline, is the curve between $B=0$ and $B \approx 43$, where new experts are being brought in.

The whole idea of speculative decoding is to amortise the weight transfer in autoregressive decoding across multiple steps. Notably, the chart tells us that at batch size $1$ this barely works for the MoE layers. But as batch size grows past this low region, there’s a much larger space in which speculative decoding might pay. The implications for speculative decoding are that:

The win when speculative tokens are accepted is no longer so big.

The penalty when speculative tokens are rejected is no longer zero.

Both the win & the penalty from speculative decoding change nonlinearly with batch size.
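The nonlinearity in batch size can be made concrete with a small calculation. This is a rough sketch, not the model behind the chart: it assumes every token picks its $k$ experts uniformly and independently, and that all experts (including the shared one) have the same weight size. The function names are hypothetical.

```python
def expected_distinct_experts(batch, n_experts=256, k=6):
    """Expected number of routed experts touched by `batch` tokens.

    ASSUMPTION: each token picks k distinct experts uniformly at random,
    independently of the other tokens.
    """
    p_untouched = (1.0 - k / n_experts) ** batch
    return n_experts * (1.0 - p_untouched)


def fresh_experts_for_next_token(batch, n_experts=256, k=6):
    """Expected number of not-yet-loaded experts the next token drags in."""
    return (expected_distinct_experts(batch + 1, n_experts, k)
            - expected_distinct_experts(batch, n_experts, k))


def expert_loads_per_token(batch, n_experts=256, k=6, shared=1):
    """Expert weight matrices read from memory per token in the batch."""
    return (expected_distinct_experts(batch, n_experts, k) + shared) / batch


for b in (1, 2, 8, 43, 256, 1024):
    print(f"B={b:5d}  distinct={expected_distinct_experts(b):7.2f}  "
          f"fresh for next token={fresh_experts_for_next_token(b):5.2f}  "
          f"loads per token={expert_loads_per_token(b):5.2f}")
```

At $B=1$ the next token is expected to bring in almost all six of its experts fresh, which is the 'full freight' regime; by the time every expert is resident, an extra token adds essentially no weight traffic.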


The changing face of attention

The ‘expert tax’ at low batch size is part of the story that’s changed. The other part is attention.

A recap: the term for the ratio of FLOPs to memory transferred for an operation is arithmetic intensity. You can figure out whether an operation is memory bound or compute bound by comparing its arithmetic intensity to the ratio of available FLOPs to memory bandwidth for the hardware you’ll run the operation on. Generically, we can write the arithmetic intensity of the attention operation as:

$$AI = \frac{f \cdot TS}{m_c \cdot S + m_q \cdot T}$$

for $T$ query tokens over $S$ context tokens, where $f$ is the (bf16) FLOPs per query-context pair and $m_c$, $m_q$ are the bytes transferred per context and query token.

For models in the Llama-3 vein, at decode, where $S \gg T$, this goes as $\sim T$. (For pure MHA, it truly goes as $\sim T$ with no constant. Llama-3 is not quite so optimisation-blind so it uses GQA, which makes it something like $8T$.) The ridge for a B200 is $281$ FLOPs/byte (bf16). Assuming we don’t have a speculator that can produce hundreds of correct tokens at a time (if we did, we might as well just use it in place of the target model), pretty much any reasonable number of speculation tokens you verify wrings more compute out of a KV read you had already paid for. This means speculation can still be a win for global throughput at high batch sizes, even when the GEMMs hit their ridge point, something that maybe goes underappreciated.

The trend in attention implementations, driven by the binding pressure of KV cache sizes, has been KV cache compression: driving down $m_c$, the bytes stored and transferred per token in the sequence, and often correspondingly $f$.
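The intensity formula above is easy to evaluate directly. The sketch below plugs in Llama-3-70B-style GQA shapes (64 query heads, 8 KV heads, head dimension 128, bf16) as an assumption; those numbers are not taken from this post, and the byte counts ignore output writes and softmax work. Under those assumptions $f/m_c = 8$, matching the $\sim 8T$ figure quoted above.

```python
def attention_intensity(T, S, f, m_c, m_q):
    """Arithmetic intensity (FLOPs/byte) for T query tokens over S context tokens."""
    return f * T * S / (m_c * S + m_q * T)


# ASSUMPTION: Llama-3-70B-style GQA shapes, bf16 (2 bytes per element).
N_Q_HEADS, N_KV_HEADS, HEAD_DIM, BYTES = 64, 8, 128, 2
F = N_Q_HEADS * (2 * HEAD_DIM + 2 * HEAD_DIM)  # QK^T plus AV, 2 FLOPs per multiply-add
M_C = N_KV_HEADS * 2 * HEAD_DIM * BYTES        # K and V bytes per context token
M_Q = N_Q_HEADS * HEAD_DIM * BYTES             # query bytes per query token

RIDGE_BF16 = 281.0  # B200 ridge quoted in the text, FLOPs/byte


def first_compute_bound_T(S, ridge=RIDGE_BF16, max_T=4096):
    """Smallest T at which attention crosses the ridge, or None if it never does."""
    for T in range(1, max_T + 1):
        if attention_intensity(T, S, F, M_C, M_Q) >= ridge:
            return T
    return None


print(F / M_C)                       # asymptotic intensity per query token
print(first_compute_bound_T(16384))  # query tokens needed to reach the ridge
```

With these shapes, attention at a 16k context stays memory-bound until a few dozen query tokens share the same KV read, which is why modest draft lengths were effectively free for this family of models.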


One successful attention implementation, DeepSeek’s Multihead Latent Attention (MLA), does this by storing only a single latent vector per token, for all the attention heads. (The architecture we’ve been discussing is DeepSeek-V4, which is to Attention is All You Need MHA what ASML’s EUV machines are to spirographs. Its variants get a full breakdown in the appendix. The upshot is the same qualitative shape as MLA, but the exact thresholds move with the compression ratio and sequence length. For the calculations on MLA + DeepSeek’s attention variants, see the appendix.)

The arithmetic intensity is:

| $S \backslash T$ | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| 512 | 193 | **322** | **484** | **645** |
| 1,024 | 215 | **387** | **645** | **967** |
| 8,192 | 238 | **469** | **910** | **1719** |
| 16,384 | 240 | **476** | **938** | **1820** |
| 1,048,576 | 242 | **483** | **967** | **1932** |

Compare the bf16 ridge ($\approx 281$ FLOP/byte). (Attention stays in bf16 even when the FFN GEMMs and the KV cache itself drop to fp8/fp4, because the softmax is more sensitive to the precision.) Bold is compute-bound. Decode ($T=1$) is just memory-bound at every context length. Any number of speculation tokens makes MLA immediately compute bound!

(It’s a little more subtle than this. MLA has two algebraically-equivalent formulations: an MQA one (a single latent KV shared across all heads, which is what the table assumes), used at decode, and an MHA one (the latent up-projected to per-head K/V), used in prefill. The MHA form’s attention runs at intensity $\sim T$ rather than $\sim n_h T$, so it stays memory-bound far longer, but only by up-projecting the whole KV context to per-head K/V, a fixed cost that amortises across the attending tokens and so only pays for itself past $T \approx 170$. Speculation never gets near that (we assume $\le 100$ tokens), so we’re always in the MQA regime, where the table holds.)

So there’s no free lunch. When you speculate with DeepSeek, you pay close to full price for your speculated tokens.

How to price a speculated token

We’ve talked about two different things that have changed the cost landscape for speculative decoding. When figuring out how well speculation is going to work as a system, there are two things that matter:

The extra cost that comes from running the draft model. This cost can come to bear in throughput (the FLOPs used on the draft model could have been used on the original model), and in latency (i.e. in the standard formulation the draft model has to run synchronously in the forward pass). (Realistically the draft model will also have its own roofline, which adds straightforwardly to the per-token marginal cost. Eagle / MTP use a fast autoregressive model conditioned on the hidden states of the base model; DFlash uses bidirectional attention with a masked language modelling objective.)

How much each token costs to verify. Accepted, we book it as profit over generating the token anew; rejected, it is a tax for having speculated. For a dense, memory-bound model this cost is roughly zero. That’s no longer quite true, and not just for MoE, since the compressed attention eats the same slack from the other side.

In order to choose how to build a speculation system, we need to pick parameters that balance the value we get from new tokens against the cost we pay for producing, then verifying, those tokens. The chart tells us how much a new speculated token costs to produce + verify, relative to a new token. $1\times$ is break-even. Toggle the components to see how the different parts of the model contribute.
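As a rough illustration of one component of that price, here is the attention term alone under a plain roofline model, where time is the larger of FLOPs/peak and bytes/bandwidth. The MLA shapes below (128 heads, 512-dim latent, 64-dim RoPE part, bf16) are DeepSeek-V3-style values assumed for this sketch rather than taken from the post's appendix; they are used because they reproduce the MLA intensity table earlier in the post. MoE, projections and the drafter are ignored.

```python
RIDGE_BF16 = 281.0  # B200 ridge quoted in the text, FLOPs/byte

# ASSUMPTION: DeepSeek-V3-style MLA shapes in the MQA (decode) formulation.
N_HEADS, D_LATENT, D_ROPE, BYTES = 128, 512, 64, 2
F = N_HEADS * (2 * (D_LATENT + D_ROPE) + 2 * D_LATENT)  # FLOPs per query-context pair
M_C = (D_LATENT + D_ROPE) * BYTES                        # bytes per cached context token
M_Q = N_HEADS * (D_LATENT + D_ROPE) * BYTES              # bytes per query token


def intensity(T, S):
    """Arithmetic intensity of MLA attention for T query tokens over S context tokens."""
    return F * T * S / (M_C * S + M_Q * T)


def attention_time(T, S, ridge=RIDGE_BF16):
    """Roofline time in units of (bytes / bandwidth); the bandwidth cancels in ratios."""
    flops = F * T * S
    bytes_moved = M_C * S + M_Q * T
    return max(flops / ridge, bytes_moved)


def relative_attention_cost(T, S):
    """Attention cost of verifying T tokens in one pass, relative to decoding one."""
    return attention_time(T, S) / attention_time(1, S)


S = 16_384
for T in (1, 2, 4, 8):
    cost = relative_attention_cost(T, S)
    marginal = (cost - 1.0) / (T - 1) if T > 1 else 0.0
    print(f"T={T}  intensity={intensity(T, S):6.0f}  cost={cost:5.2f}x  "
          f"per extra token={marginal:4.2f}x")
```

Under these assumptions each extra verified token costs most of what a freshly decoded token costs on the attention side, so only a small part of it is covered by the already-paid KV read.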


How far ahead should we speculate

The cost model tells us that we need to be careful with speculated tokens, because they’re no longer free. Speculated tokens that are expensive to verify need to be likely to be accepted, otherwise they don’t pay their way relative to tokens generated anew. To figure out how many speculated tokens to work with, we need a model of acceptance.

Pick the simplest speculation model: each draft token gets accepted i.i.d. with fixed probability $\alpha$, draft length $\gamma$. (Constant per-position $\alpha$ is the optimistic case; real acceptance decays with draft depth in some complicated way, and also depends on the content & length of the preceding sequence. So read $\gamma^\star$ as an upper bound. Drafter cost would add to $c$; I’m holding it fixed here.) The expected number of tokens committed by one verifier pass goes like:

$$N(\alpha, \gamma) = \frac{1 - \alpha^{\gamma+1}}{1-\alpha}$$

(This is just a finite geometric series.) The cost of verifying $\gamma+1$ tokens in the target model is:

$$C_\mathrm{verify}(B, \gamma+1, S) = C_\mathrm{attn~proj}(B\cdot(\gamma+1)) + C_\mathrm{MoE}(B\cdot(\gamma+1)) + C_\mathrm{attn}(B, \gamma+1, S)$$

Writing the no-speculation decode cost as $C_0(B,S)=C_\mathrm{verify}(B,1,S)$, the throughput speedup relative to ordinary decode is then:

$$\mathrm{Sp}(\alpha, \gamma) = \frac{N(\alpha,\gamma)\, C_0(B,S)}{C_\mathrm{verify}(B,\gamma+1,S) + C_\mathrm{draft~model}(B,\gamma,S)}$$
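A toy version of this trade-off, using the i.i.d. acceptance model above, might look like the following. The linear cost model is an assumption made for this sketch, not the post's cost model: verifying plus drafting $\gamma$ extra tokens is taken to cost $1 + c\gamma$ times an ordinary decode step, with $c$ a constant marginal price per speculated token.

```python
def expected_committed_tokens(alpha, gamma):
    """N(alpha, gamma): expected tokens committed by one verifier pass."""
    if alpha == 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha, gamma, c):
    """Throughput speedup over ordinary decode.

    ASSUMPTION: (C_verify + C_draft) / C_0 = 1 + c * gamma, i.e. every
    speculated token costs a fixed fraction c of a freshly decoded token.
    """
    return expected_committed_tokens(alpha, gamma) / (1.0 + c * gamma)


def best_gamma(alpha, c, max_gamma=100):
    """Draft length that maximises the speedup under the linear cost model."""
    return max(range(max_gamma + 1), key=lambda g: speedup(alpha, g, c))


for c in (0.0, 0.1, 0.3, 0.6, 1.0):
    g = best_gamma(0.8, c)
    print(f"c={c:.1f}  best gamma={g:3d}  speedup={speedup(0.8, g, c):.2f}x")
```

When $c = 0$ (the dense, memory-bound case) longer drafts never hurt, so the search runs to `max_gamma`; as $c$ approaches $1$ the best draft length collapses to zero, since a speculated token then costs as much as a fresh one but is only accepted with probability $\alpha$.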