I restarted a 10 year old Xeon 174 times to delete twelve flags and gain four tokens a second - point.free
Published on June 15, 2026
22 minutes read

A follow-up to running Gemma 4 on a 2016 Xeon. I took that 25-flag config apart one flag at a time, to find which ones actually do the work, which are harmful, and which are just pitfalls. Most of my tuned config will do nothing for the typical user.

A couple of weeks ago I got Gemma 4, a 26-billion-parameter model, running at reading speed on a 2016 Xeon with no GPU and 128 GB of DDR3. That post spent about eight hours on the front page of Hacker News, which means a lot of people now have a 25-flag command sitting in a terminal somewhere, copied from a blog, with no real idea which of those flags is doing the work.

I have some bad news about that command. You’re likely holding it wrong.

What I said in passing in that post is that half of those flags would not take just by being present. Some need the right hardware. Some need the right host setup. Some only help on the right workload. The engine accepts all of them and tells you almost nothing about which ones actually fired. So this post is me going back and finding out what will actually work for YOU.

The way you find out is an ablation. You take the working config, switch off exactly one flag, measure what changes, put it back, and do the same for every flag in turn. The word is borrowed from neuroscience by way of machine learning, where it normally means knocking out a piece of the model itself, an attention head or a layer, to see what it was for. I am using it a little improperly here, for inference flags rather than model internals, but the idea carries over: turn one thing off, measure, repeat, and the differences tell you what each piece was worth.

It was a lot of work. Like a lot. One hundred and seventy-four runs, each one a fresh server reloading twenty-five gigabytes of weights off a spinning disk before it can answer a single token.
Three prompts per launch, several repetitions each, and one entire overnight run I had to throw in the bin over a deadlock I will get to. I am telling you the count because the count is the point.
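
The one-flag-at-a-time loop described above can be sketched in a few lines. This is a minimal illustration, not the harness I actually ran: the `BASELINE` mapping is a placeholder built from three flags named in this post, and `to_argv` and `ablations` are hypothetical helper names. It only generates the argument lists; launching the server and timing the prompts is left out.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Placeholder baseline. The flag names appear in this post, but this is NOT the
# published 25-flag command; None marks a bare switch that takes no value.
BASELINE: Dict[str, Optional[str]] = {
    "-t": "8",
    "--run-time-repack": None,
    "--spec-autotune": None,
}


def to_argv(config: Dict[str, Optional[str]]) -> List[str]:
    """Flatten a flag->value mapping into an argument list."""
    argv: List[str] = []
    for flag, value in config.items():
        argv.append(flag)
        if value is not None:
            argv.append(value)
    return argv


def ablations(
    baseline: Dict[str, Optional[str]],
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (removed_flag, argv) pairs, each with exactly one flag removed."""
    for removed in baseline:
        variant = {k: v for k, v in baseline.items() if k != removed}
        yield removed, to_argv(variant)


if __name__ == "__main__":
    for removed, argv in ablations(BASELINE):
        print(f"without {removed}: {' '.join(argv)}")
```

Flags that carry a value, like the thread count, are usually more interesting to sweep than to delete, so a fuller version would probably also yield variants with a changed value rather than only a missing flag.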
The reason nobody knows which of these flags matter is that finding out is slow and tedious, so almost nobody does it. I did, so here is the answer.

The setup

The box is the same one from the Xeon post. A Xeon E5-2620 v4, eight physical cores, sixteen threads, 128 GB of DDR3, no GPU, no swap. The engine is ik_llama.cpp on the feat/gemma-4-mtp branch. The verifier is gemma-4-26B-A4B-it at Q8_0, paired with its MTP drafter at Q8_0.

I use three test prompts:

- a short chat turn
- a roughly five-thousand-token document to summarize
- a code generation request

Greedy decoding, fixed seed, 256 new tokens, three repetitions, median reported.

Two things to clear out of the way before the numbers.

I ran the benchmark under llama-server rather than the llama-cli from the original command. The server is where the per-request speculative-decoding telemetry lives, which is the only clean way to check whether the drafter actually fired on a given request, and it is what the upstream pull request drives its own benchmarks through. Nothing about the config changes, just the harness around it.

And every number below is the full config with one flag changed, so each delta is that flag’s contribution given everything else is still on, not its effect in a vacuum. Flags interact. Speculation changes how much the thread count matters. Repacking changes what flash attention has to read. The deltas do not add up to the gap between this config and a naive one, and I will not pretend they do.

The whole board

Here is every lever that moves the needle, in one place, before I walk through them. Decode speed, tokens per second, median of the repetitions, each row the full config with exactly one thing changed. The percentage is against the published config, so a flag that helps shows up as a loss when you remove it.

| lever changed | chat | long doc | code |
| --- | --- | --- | --- |
| published config (autotune drafter) | 12.3 | 6.8 | 15.6 |
| drafter, fixed draft 1 | 15.6 (+27%) | 9.8 (+44%) | 15.9 (+2%) |
| drafter, fixed draft 2 | 16.1 (+31%) | 7.4 (+8%) | 17.8 (+14%) |
| drafter, fixed draft 3, autotune off | 13.9 (+13%) | 7.1 (+4%) | 18.3 (+17%) |
| drafter off entirely | 12.2 (wash) | 10.5 (+54%) | 12.2 (−22%) |
| flash attention off | 6.7 (−46%) | 4.3 (−37%) | 7.5 (−52%) |
| threads -t 4 | 7.9 (−35%) | 5.0 (−28%) | 9.5 (−39%) |
| threads -t 16 | 10.8 (−12%) | 7.1 (+3%) | 16.2 (+4%) |
| run-time-repack off | 11.8 (−4%) | 9.2 (+35%) | 13.7 (−12%) |

Everything else I tried, --mla-use, --cpu-moe, --merge-up-gate-experts, --no-kv-offload, the -sm graph cluster, lands within a few percent of the published config on chat and code, which is to say inside the noise. The long-document column is messier than the other two and I will come back to why in the repack section. Read the negatives as the interesting part. Most of these levers do nothing or cost you, and the work is concentrated in flash attention, the physical-core thread count, and the drafter once it is configured the right way.

The drafter is workload-dependent, and one setting was wrong

Speculative decoding is the most tuned part of the config, and it is where I found the most.

The first finding is that the drafter is not a free win across the board. Measured against turning it off entirely:

- On chat it is roughly a wash at the default setting.
- On code it is worth about twenty-eight percent. A real win.
- On the long document it makes things slower. Turning the drafter off is fifty-four percent faster. Summarization is hard to predict, and the drafter’s acceptance rate sits down around 0.15. A draft that gets rejected most of the time is mostly overhead.

So the drafter wants to be gated by workload.
On for code, off for long-context work. The config has it on globally, which quietly costs you about a third of your throughput every time you summarize something. This was not something I had clocked when I wrote the original post, and it is the single most useful thing in this one.

The same flag being a win, a wash, and a loss depending on the prompt is worth sitting with, because it is a tiny hand-measured example of a routing gap. The model already routes internally: a mixture-of-experts gate sends every token to the experts it judges most useful, learned and conditional, per token. But that gate optimizes for the next token, not for which inference machinery pays off on a given kind of work. The choice I am making by hand here, speculate on code, do not speculate on the summary, is a second and coarser routing decision sitting one level up, per request instead of per token, and the config hardcodes one answer to it and applies that everywhere. You could imagine a cheap classifier in front of the engine that picks the decoding policy the way the expert gate picks experts, and --spec-autotune is a clumsy first stab at that idea, aimed at a single knob and, on this workload, aimed badly. I am not saying the expert routing is broken. I am saying there is a routing decision above it that my numbers say is worth making, and I made this one with a shell script and a stopwatch… metaphorically. The stopwatch is a metaphor, to be clear.

The second finding is a setting I had wrong. In the Xeon post I described --spec-autotune as the way to squeeze the most speed out of speculation. Over these generations it is the worst speculation setting I tested. Every fixed --draft-max beats it.

- chat: autotune 12.3 tokens per second, a fixed draft of 2 gets 16.1.
- code: autotune 15.6, a fixed draft of 3 with autotune off gets 18.3.
- long document: autotune 6.8, a fixed draft of 1 gets 9.8.

Twenty-five flags, and the biggest single win in the file was deleting one of them and typing a number.

The mechanism is acceptance rate. A shorter, fixed draft gets accepted by the verifier more often.
A draft of 1 on chat sits at 0.74 acceptance against autotune’s 0.37, and acceptance is the entire currency speculation trades in. The autotuner spends the generation hunting for a draft length, and over a couple of hundred tokens it never settles, so the average gets dragged below the steady state.

The honest caveat, because it matters: 256 tokens is a short generation. It is entirely possible that over a long session, a couple of thousand tokens, the autotuner converges up to the fixed rate and my numbers are simply under-reporting it. I have a test designed to settle that and I have not run it yet. So the careful claim is narrow: at short generation lengths, on this workload, a fixed draft beats autotune. Whether that survives at length is the next experiment. Either way it is a free win today, because picking a number puts the interactive prompts up around sixteen to eighteen tokens per second.

Flash attention is the one doing the heavy lifting

If you keep one flag from the whole spell, keep this one. Flash attention is worth roughly a factor of two. Turn it off and chat falls from 12.3 tokens per second to 6.7, code from 15.6 to 7.5, the long document from 6.8 to 4.3. Forty-six, fifty-two, and thirty-seven percent gone, from a single flag.

It also turns out to be holding up the drafter, which I did not expect. With flash attention off, speculative acceptance collapses to near zero. So the flag doing the most obvious work is also quietly propping up the flag I just spent a whole section on. That is the kind of dependency you only see when you turn things off one at a time.

Threads want physical cores, and there is a wall behind them

This one I got right in the original, and now I have the number to back it. -t 8, one thread per physical core, is the sweet spot, holding chat at 12.3 tokens per second. -t 16, using all sixteen hardware threads, slips to 10.8, about twelve percent worse, because the work is waiting on memory and extra threads add scheduling overhead without adding bandwidth.
The cores are stalled on DDR3, not on each other.
-t 4 is about thirty-five percent slower on decode and closer to forty-five percent slower on prefill, which is the memory wall poking through.

While we are here, that wall is worth making concrete, because it tells you when to stop fiddling with flags. Gemma 4 26B-A4B activates about four billion parameters per token, which at Q8_0 is roughly 4.25 gigabytes read for every token generated. Divide that into the memory bandwidth this box can sustain and you get a ceiling around ten to eleven tokens per second for the verifier on its own. That is almost exactly where the thread sweep flattens out, which is how you know it is a real physical wall and not a tuning problem. Past that point, no flag saves you. The only move left is to read fewer bytes per token, which is the entire subject of the next post.

Repacking earns its keep on prefill

--run-time-repack mostly helps prefill, about nineteen percent on the chat and code prompts, with a small and noisy effect on decode for those two. It spends a few seconds at startup physically rearranging the weight matrices to match how the CPU wants to read them. On a Q8_0 model that is a modest, real win, and worth the startup cost.

There is one honest oddity the board above will have flagged. On the long document, turning repack off reads about thirty-five percent faster on decode, which is the opposite of what a prefill optimization should do. It is not a prefill effect. It is the drafter’s long-document acceptance jumping, from around 0.15 to around 0.41, when the weights are not repacked. I do not yet know why repacking the tensors should move acceptance at all, so I am flagging it as a loose thread rather than dressing it up as a result. It is also most of why the long-document column on the board is noisier than the other two.

It also has a much more interesting and much worse behavior on a different quant, which is a cliffhanger for the next post.

The flags that do not take for free

This is the half I warned you about.
None of it is dramatic, and most of it is not “useless flag,” it is “flag that needs something you might not have”.
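
Before chasing any flag that needs something you might not have, it is worth rerunning the memory-wall arithmetic from the threads section with your own numbers, since that ceiling is what tells you when to stop. The sketch below is a back-of-envelope bound, not a measurement. The function names are made up for illustration, and the 45 GB/s bandwidth figure is an assumption picked only because it lands inside the ten-to-eleven range quoted earlier; measure your own box before trusting it. The 8.5 bits per weight comes from the Q8_0 block layout, 32 one-byte weights plus a two-byte scale.

```python
# Q8_0 stores 32 int8 weights plus one fp16 scale per block: 34 bytes / 32 weights.
Q8_0_BITS_PER_WEIGHT = 8.5


def bytes_per_token(active_params: float,
                    bits_per_weight: float = Q8_0_BITS_PER_WEIGHT) -> float:
    """Bytes of weights that must be read to generate one token."""
    return active_params * bits_per_weight / 8.0


def decode_ceiling(bandwidth_bytes_per_s: float,
                   active_params: float,
                   bits_per_weight: float = Q8_0_BITS_PER_WEIGHT) -> float:
    """Upper bound on tokens per second when decode is purely bandwidth-bound."""
    return bandwidth_bytes_per_s / bytes_per_token(active_params, bits_per_weight)


if __name__ == "__main__":
    active = 4e9               # "about four billion" active parameters per token
    assumed_bandwidth = 45e9   # ASSUMPTION: placeholder sustained bandwidth, bytes/s
    print(f"bytes per token: {bytes_per_token(active) / 1e9:.2f} GB")
    print(f"ceiling: {decode_ceiling(assumed_bandwidth, active):.1f} tokens/s")
```

The bound ignores the KV cache, the drafter, and anything that is not a weight read, so real decode speed for the verifier alone will sit at or below it. Speculation can push the end-to-end number past it only because several tokens get verified per pass over the weights.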