Published on June 01, 2026 · 17 minutes read

The previous post covered getting Gemma 4’s MTP drafters quantized and paired with a verifier. This one is about running the result on a machine that has no business running it.

I have a recycled server. To its credit, it has a whopping 128 GB of RAM, but it’s DDR3… That RAM is 5-6 times slower than the current best laptop RAM. It also has a single Intel Xeon E5-2620 v4 from 2016, which is about 5 times slower than my laptop’s CPU…

Oh, and as I did mention, we have no GPU. And no, the Xeon does not have an integrated GPU.

But, just hear me out…

If we were to just break out ollama here, well… as explained in earlier blog posts, we can’t. We’d be lucky if we could in 6 months, when they add support for the model we need, if they ever do. And even then, ollama simply doesn’t expose enough knobs for us to ever make this run well, and neither does standard llama.cpp.

But. Why would that stop us?

I’ve received feedback that some of the previous posts were too high level, so I’ll try to make things as clear as reasonably possible here. If you’re a tech worker, or a Linux enthusiast who has built a computer and used something like ChatGPT, most of this should be approachable.

So, just to really set the stage, here is the hardware, per lscpu:

CPU: Intel Xeon E5-2620 v4 @ 2.10 GHz
Cores: 8 physical, 16 threads
Instruction sets: AVX2 (no AVX-512, no AVX-VNNI, no BF16)
Cache: 20 MiB L3, 2 MiB L2 total
Memory: 128 GB DDR3
GPU: none

For LLM inference, memory bandwidth is the limiting resource. Every token generated requires hauling gigabytes of weights from RAM into the CPU cache.

When you use a tool like ChatGPT and watch the text stream onto your screen word by word, you are watching the “decoder pass”.
During this phase, the model generates the output one piece (or “token”) at a time.

In this step, the system’s raw processing power is rarely the bottleneck. Instead, the limitation is memory bandwidth. To calculate that next word, the processor has to constantly pull massive amounts of data, the “weights” that contain the model’s learned knowledge, from memory into the compute cores.

The processor executes the required matrix calculations so quickly that it is left sitting idle, waiting for the hardware to physically move the next chunk of weights across the memory bus. In traditional software terms, decoding is heavily memory-bound, not compute-bound.

This is the so-called “memory wall”, one of the single biggest performance hurdles right now, whether you’re on a Xeon or an H100.

Naively running llama-cli on a DDR3 machine without a GPU is horrendously slow, even when it runs at all, because it’s optimized for a generic GPU use case and often leaves a lot of improvements on the table. Further, it simply doesn’t have most of the optimizations that the state of the art currently uses to run these models at scale.

The remedy is to pull every optimization lever ik_llama.cpp exposes. Most of them are slightly obscure.

Here is the magic spell that makes this actually run.

    llama-cli \
      --model gemma-4-26B-A4B-it-Q8_0.gguf \
      --model-draft gemma-4-26B-A4B-it-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
      --spec-type mtp --draft-max 3 \
      --draft-p-min 0.0 --spec-autotune \
      -cnv --color --jinja --special \
      -sm graph -smgs -sas -mea 256 --split-mode-f32 \
      --temp 0.7 -t 8 --parallel 8 \
      --cpu-moe --merge-up-gate-experts \
      --flash-attn on --mla-use 3 \
      --mlock --run-time-repack --no-kv-offload

Under a blackbox tool like ollama you never see this line. On aging hardware you have to understand what each flag does, because half of them won’t take, and the engine will only tell you so in passing.

Speculative decoding.

    --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune

This pairs the 26B verifier with the small drafter from the previous post: up to three tokens per draft (--draft-max 3), all probabilities accepted (--draft-p-min 0.0), and --spec-autotune adjusting the chain length per workload.

This ties directly back to the memory-bound decoder pass discussed above.

When a model uses a long reasoning chain, it is generating those “thinking” tokens one by one. Even if the internal reasoning is hidden from the user and all you see is a short final answer, the hardware still has to perform a full decoder pass for every single token in that hidden chain.

In fact, speculative decoding is currently one of the most brilliant software workarounds the AI industry has invented to get around the “memory wall”. The drafter cheaply proposes a few tokens and the verifier checks them all in one pass, so the big model’s weights are streamed from RAM once for several tokens instead of once per token. --spec-autotune is how you squeeze the maximum speed out of it.

The argument for speculative decoding is stronger on CPU than on GPU. CPU compute is cheap relative to the cost of streaming the verifier’s weights through cache, so spending extra cycles on a tiny drafter buys tokens at very little marginal cost. The drafter’s working set fits in L3; the verifier, however, spills out of everything.
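To make the draft-then-verify loop concrete, here is a minimal Python sketch. It is a toy under loud assumptions, not how ik_llama.cpp implements it: `verifier` and `drafter` are made-up deterministic functions standing in for the two models, acceptance is a plain greedy match rather than the probability-based rule behind --draft-p-min, and "one pass" is just a counter. What it does show is the part that matters here: the output is identical to plain decoding, while the number of expensive verifier passes goes down.

```python
from typing import List, Tuple


def verifier(ctx: List[int]) -> int:
    # Hypothetical stand-in for the big model: a fixed deterministic rule.
    return (ctx[-1] * 3 + 1) % 11


def drafter(ctx: List[int]) -> int:
    # Hypothetical stand-in for the small model: it agrees with the
    # verifier unless the last token is a multiple of 4, then it is wrong.
    guess = verifier(ctx)
    return guess if ctx[-1] % 4 else (guess + 1) % 11


def plain_decode(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    # One verifier pass per generated token.
    out = list(prompt)
    passes = 0
    for _ in range(n_new):
        out.append(verifier(out))
        passes += 1
    return out, passes


def speculative_decode(
    prompt: List[int], n_new: int, draft_max: int = 3
) -> Tuple[List[int], int]:
    out = list(prompt)
    target_len = len(prompt) + n_new
    passes = 0
    while len(out) < target_len:
        # 1) The drafter proposes up to draft_max tokens (cheap).
        draft: List[int] = []
        for _ in range(draft_max):
            draft.append(drafter(out + draft))
        # 2) The verifier checks all draft positions. A real engine does
        #    this as one batched forward pass, so we count a single pass.
        passes += 1
        accepted: List[int] = []
        for tok in draft:
            want = verifier(out + accepted)
            if tok != want:
                accepted.append(want)  # keep the verifier's correction
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the same pass yields a bonus token.
            accepted.append(verifier(out + accepted))
        out.extend(accepted)
    return out[:target_len], passes


if __name__ == "__main__":
    plain_out, plain_passes = plain_decode([1], 12)
    spec_out, spec_passes = speculative_decode([1], 12, draft_max=3)
    print("same output:", spec_out == plain_out)
    print("verifier passes:", plain_passes, "->", spec_passes)
```

Every pass produces at least one token even when the first draft is rejected, so in this toy the verifier never runs more often than it would in plain decoding. The real cost is the drafter's own work, which is why it helps that the drafter is small.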
CPU and MoE routing.

    --cpu-moe --merge-up-gate-experts -t 8 --parallel 8

Gemma 4 26B-A4B has 128 experts with 8 active per token, giving about 3.8B active parameters out of ~25.2B total. --cpu-moe keeps the MoE expert weights in CPU memory. It is mainly meant for hybrid CPU/GPU setups, where it stops the experts from being offloaded; with no GPU here, it mostly states the obvious.

CPUs handle memory very differently than GPUs. While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.

In an MoE model, constantly jumping around between 128 different experts can cause “cache thrashing”, where the CPU constantly has to dump its cache and fetch new weights from the much slower main system RAM (normally DDR4/DDR5, we’re on DDR3!).

The flag does not change which experts the router picks; that is decided by the model. What we can control is where the expert weights live and how they are laid out in memory, which is what this flag and the ones below are about.

--merge-up-gate-experts fuses two per-expert projections into a single matmul, which the logs confirm:

    fused_up_gate = 1

This is a software trick to ease the memory bandwidth bottleneck we discussed earlier.

Inside the experts, the math operations require data to be passed through different layers. Normally, the processor would calculate an “up projection”, write the result to memory, then load the weights for a “gate projection”, calculate that, and combine them. That requires moving data across the memory bus multiple times.

Instead of doing two separate trips over the memory bus, the fused version combines the operations into a single step.

-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other.

Memory pinning, repacking, KV cache.

    --mlock --run-time-repack --no-kv-offload

--run-time-repack reorganizes weight matrices in memory immediately before inference to match the CPU’s cache layout.
The logs confirm:

    ============ Repacked 265 tensors

Processors have their own ultra-fast, built-in memory called caches (L1, L2, and L3). However, these caches work best when data is fed to them in very specific shapes and sizes.

If the AI’s weight matrices are sitting in system RAM in a generic layout, the CPU has to awkwardly pull the data in pieces, resulting in “cache misses” where the CPU stalls. --run-time-repack tells the engine to spend a few seconds during startup physically reorganizing the massive tables of numbers in RAM so they align with how the CPU wants to ingest them. It pays a small time penalty upfront to get the most out of the available memory bandwidth during the actual text generation.

--mlock is meant to pin the model in RAM so the OS cannot swap any of it to disk.

mlock stands for “memory lock”, surprising, I know! In standard operating systems, if the system starts running out of RAM, it will quietly take data that hasn’t been used recently and “swap” (or page) it out to disk.

If an OS tries to swap out 27 GB of AI weights to a disk, the generation speed will drop to a crawl while the system chokes trying to read it back. --mlock tells the Linux kernel: “Pin this 27 GB strictly in physical RAM. Do not ever move it to the disk.”

Notice that if you’re not careful, you’ll see this:

    warning: failed to mlock 27628376064-byte buffer (after previously locking 0 bytes): Cannot allocate memory
    Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).

The flag is fine; the kernel-side memlock limit isn’t set high enough to pin a 27 GB buffer. This is not an LLM-shaped problem at all, it’s a ulimit default, and it’s the kind of footgun the blackbox tools paper over by simply not asking for the optimization in the first place.

Consider that for a moment: many tools will, by default, have no problem letting your model get paged out to swap if the OS decides that’s the best option. You can imagine how much this can hurt performance…

--no-kv-offload tells the engine not to look for a GPU for the KV cache.
There isn’t one to find, but the flag short-circuits the check.

The KV (Key-Value) cache is the AI’s short-term memory: it stores the context of the current conversation so the model doesn’t have to re-read the entire prompt for every new token.

Because the KV cache is constantly being read from and written to, AI engines usually try to “offload” it to a GPU, which has much faster memory than we do.

Since this setup runs purely on a CPU, letting the engine search for a GPU that doesn’t exist is a waste of time and could throw an error. The flag tells the engine to just keep the short-term memory in system RAM alongside the weights.

Graph layout.

I’ve tried my best to keep this easy to understand, but this part is just plain hard to explain in a single blog post.

Now onto the dark arts. A common frustration in bleeding-edge AI software is that the engine is being developed so fast that the developers don’t have time to write official documentation. If you want to know how to optimize the engine, you have to dig through the raw code or read the GitHub Pull Request (PR) comments between the developers.

    -sm graph -smgs -sas -mea 256 --split-mode-f32

These flags govern how the computational graph is allocated across memory regions. Some documentation exists, but the authoritative reference is ultimately the code.

The flag -sm graph tells the engine to use the graph split mode (often known in the industry as Tensor Parallelism). This is entirely about how you divide the massive math workload across multiple processors or memory regions (like multiple CPU sockets or GPUs).

Layer Split (The Default/Fallback): The engine slices the model horizontally. Processor A calculates Layers 1–10, then sends the data over the system bus to Processor B, which calculates Layers 11–20. While Processor A is working, Processor B is sitting idle.

Graph Split (The Goal): The engine slices the computational graph vertically. Processor A and Processor B calculate different halves of Layer 1 at the exact same time, combine their answers, and move to Layer 2 together. This keeps all the hardware busy at the same time, which can drastically improve generation speed.
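The difference between the two splits is easier to see on a tiny example. The Python sketch below is only an illustration under stated assumptions: the “model” is two small made-up integer matrices, `layer_split` and `graph_split` are hypothetical names, and the “workers” run one after another in a loop instead of truly in parallel. It shows that splitting each layer’s rows across workers and concatenating the partial results gives exactly the same answer as running whole layers in sequence, which is why the engine is free to choose the layout that keeps more hardware busy.

```python
from typing import List

Matrix = List[List[int]]
Vector = List[int]


def matvec(w: Matrix, x: Vector) -> Vector:
    # Multiply matrix w (rows x cols) by vector x.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def layer_split(layers: List[Matrix], x: Vector) -> Vector:
    # Each worker owns whole layers. Layer 2 cannot start until layer 1
    # is done, so with one layer per worker only one is busy at a time.
    for w in layers:
        x = matvec(w, x)
    return x


def graph_split(layers: List[Matrix], x: Vector, n_workers: int = 2) -> Vector:
    # Every worker owns a slice of the rows of every layer.
    for w in layers:
        chunk = (len(w) + n_workers - 1) // n_workers
        # In a real engine these partial products run at the same time.
        parts = [matvec(w[i:i + chunk], x) for i in range(0, len(w), chunk)]
        # "Combine their answers": concatenate the partial outputs.
        x = [v for part in parts for v in part]
    return x


if __name__ == "__main__":
    layers = [
        [[1, 2], [3, 4], [5, 6], [7, 8]],  # layer 1: 4 outputs from 2 inputs
        [[1, 0, 1, 0], [0, 1, 0, 1]],      # layer 2: 2 outputs from 4 inputs
    ]
    x = [1, 1]
    print(layer_split(layers, x))
    print(graph_split(layers, x))
```

The combine step after every layer is the price of the graph split: the workers have to exchange their partial results before the next layer can start, so the gain depends on how cheap that exchange is on the hardware at hand.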