GitHub - antirez/ds4: DeepSeek 4 Flash local inference engine for Metal


▲ 499 points • 159 comments • by tamnd • 2w ago • HN discussion ↗


ds4.c is a small native inference engine for DeepSeek V4 Flash. It is intentionally narrow: not a generic GGUF runner, not a wrapper around another runtime, and not a framework. The main path is a DeepSeek V4 Flash-specific Metal graph executor with DS4-specific loading, prompt rendering, KV state, and server API glue.

This project would not exist without llama.cpp and GGML. Make sure to read the acknowledgements section: a big thank you to Georgi Gerganov and all the other contributors.

Now, back to this project. Why do we believe DeepSeek v4 Flash is a pretty special model, deserving a standalone engine? Because after comparing it with powerful smaller dense models, we can report that:

- DeepSeek v4 Flash is faster because it has fewer active parameters.
- In thinking mode, if you avoid max thinking, it produces a thinking section that is a lot shorter than that of other models, even 1/5 of the length in many cases, and crucially, the thinking section length is proportional to the problem complexity. This makes DeepSeek v4 Flash usable with thinking enabled when other models are practically impossible to use in the same conditions.
- The model features a context window of 1 million tokens.
- Being so large, it knows more things if you go sampling at the edge of knowledge. For instance, asking about Italian shows or political questions soon reveals that 284B parameters are a lot more than 27B or 35B parameters.
- It writes much better English and Italian. It feels like a quasi-frontier model.
- The KV cache is incredibly compressed, allowing long-context inference on local computers and on-disk KV cache persistence.
- It works well with 2-bit quantization, if quantized in a special way (read later). This allows running it on MacBooks with 128GB of RAM.
- We expect DeepSeek to release updated versions of v4 Flash in the future, even better than the current one.

That said, a few important things about this project:

The local inference landscape contains many excellent projects, but new models are released continuously, and the attention immediately gets captured by the next model to implement. This project takes a deliberately narrow bet: one model at a time, official-vector validation (logits obtained with the official implementation), long-context tests, and enough agent integration to know if it really works.


The exact model may change as the landscape evolves, but the constraint remains: credible local inference on high-end personal machines or Mac Studios, starting from 128GB of memory.

This software is developed with strong assistance from GPT 5.5, with humans leading the ideas, testing, and debugging. We say this openly because it shaped how the project was built. If you are not happy with AI-developed code, this software is not for you. The acknowledgement below is equally important: this would not exist without llama.cpp and GGML, largely written by hand.

This implementation is based on the idea that compressed KV caches like the one of DeepSeek v4, together with the fast SSDs of modern MacBooks, should change our idea that the KV cache belongs in RAM. The KV cache is actually a first-class disk citizen.

Our vision is that local inference should be a set of three things working well together, out of the box: A) an inference engine with an HTTP API, B) a GGUF specially crafted to run well under a given engine and given assumptions, and C) testing and validation with coding agent implementations. This inference engine only runs with the GGUF files provided. It gets tested against officially obtained logits at different context sizes. This project exists because we wanted to make one local model feel finished end to end, not just runnable. However, this is just alpha-quality code, so we are probably not there yet.

This is Metal-only. May we implement CUDA support in the future? Perhaps, but nothing more.

The CPU path is only for correctness checks. Warning: current macOS versions have a bug in the virtual memory implementation that will crash the kernel if you try to run the CPU code. Remember? Software sucks. It was not possible to fix the CPU inference to avoid the crash, since each attempt requires restarting the computer, which is no fun. Help us, if you have the guts.

Acknowledgements to llama.cpp and GGML

ds4.c does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there. We are thankful and indebted to llama.cpp and its contributors. Their implementation, kernels, tests, and design choices were an essential reference while building this DeepSeek V4 Flash-specific inference path.


Some source-level pieces are retained or adapted here under the MIT license: GGUF quant layouts and tables, CPU quant/dot logic, and certain Metal kernels. For this reason, and because we are genuinely grateful, we keep the GGML authors copyright notice in our LICENSE file.

Model Weights

This implementation only works with the DeepSeek V4 Flash GGUFs published for this project. It is not a general GGUF loader, and arbitrary DeepSeek/GGUF files will not have the tensor layout, quantization mix, metadata, or optional MTP state expected by the engine.

The 2 bit quantizations provided here are not a joke: they behave well, work under coding agents, call tools in a reliable way. The 2 bit quants use a very asymmetrical quantization: only the routed MoE experts are quantized, up/gate at IQ2_XXS, down at Q2_K. They are the majority of all the model space: the other components (shared experts, projections, routing) are left untouched to guarantee quality.

Download one main model:

    ./download_model.sh q2   # 128 GB RAM machines
    ./download_model.sh q4   # >= 256 GB RAM machines

The script downloads from https://huggingface.co/antirez/deepseek-v4-gguf, stores files under ./gguf/, resumes partial downloads with curl -C -, and updates ./ds4flash.gguf to point at the selected q2/q4 model. Authentication is optional for public downloads, but --token TOKEN, HF_TOKEN, or the local Hugging Face token cache are used when present.

./download_model.sh mtp fetches the optional speculative decoding support GGUF. It can be used with both q2 and q4, but must be enabled explicitly with --mtp. The current MTP/speculative decoding path is still experimental: it is correctness-gated and currently provides at most a slight speedup, not a meaningful generation-speed win.

Then build:

    make

./ds4flash.gguf is the default model path used by both binaries. Pass -m to select another supported GGUF from ./gguf/. Run ./ds4 --help and ./ds4-server --help for the full flag list.
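The asymmetric 2-bit mix described above can be pictured as a lookup from tensor role to quant type. The following is only an illustrative sketch in Python: the `Role` names and the `quant_for` helper are hypothetical and do not correspond to real tensor names or to code in this project. Only the quant types (IQ2_XXS for up/gate, Q2_K for down) come from the text.

```python
from enum import Enum


class Role(Enum):
    # Hypothetical tensor roles, invented for this sketch.
    ROUTED_EXPERT_UP = "routed_expert_up"
    ROUTED_EXPERT_GATE = "routed_expert_gate"
    ROUTED_EXPERT_DOWN = "routed_expert_down"
    SHARED_EXPERT = "shared_expert"
    PROJECTION = "projection"
    ROUTER = "router"


# Quant types named in the text for the q2 files: only routed experts change.
Q2_MIX = {
    Role.ROUTED_EXPERT_UP: "IQ2_XXS",
    Role.ROUTED_EXPERT_GATE: "IQ2_XXS",
    Role.ROUTED_EXPERT_DOWN: "Q2_K",
}


def quant_for(role: Role) -> str:
    """Return the quant type for a role, or 'untouched' for everything else."""
    return Q2_MIX.get(role, "untouched")


for role in Role:
    print(f"{role.value:20s} -> {quant_for(role)}")
```

The point of the design is visible in the table shape: the three routed-expert entries carry most of the model size, so quantizing only those shrinks the file while the quality-sensitive parts keep their original precision.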


Speed

These are single-run Metal CLI numbers with the q2 GGUF, --ctx 32768, --nothink, greedy decoding, and -n 256. The short prompt is a normal small Italian story prompt. The long prompt is 11709 tokens and exercises chunked prefill plus long-context decode.

| Machine | Prompt | Prefill | Generation |
| --- | --- | --- | --- |
| MacBook Pro M3 Max, 128 GB | short | 58.52 t/s | 26.68 t/s |
| MacBook Pro M3 Max, 128 GB | 11709 tokens | 250.11 t/s | 21.47 t/s |
| Mac Studio M3 Ultra, 512 GB | short | 84.43 t/s | 36.86 t/s |
| Mac Studio M3 Ultra, 512 GB | 11709 tokens | 468.03 t/s | 27.39 t/s |
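To turn these rates into wall-clock time, divide the token counts by the throughput. The sketch below does this for the long-prompt rows, using the 11709-token prompt and the 256 generated tokens from the benchmark setup. The `estimate_seconds` helper is a name made up for this example, and the result is a rough estimate from single-run numbers, not a measurement.

```python
# Throughput figures copied from the table: (prefill t/s, generation t/s).
LONG_PROMPT_RATES = {
    "MacBook Pro M3 Max, 128 GB": (250.11, 21.47),
    "Mac Studio M3 Ultra, 512 GB": (468.03, 27.39),
}

PROMPT_TOKENS = 11709  # long prompt used in the benchmark
NEW_TOKENS = 256       # -n 256 in the benchmark


def estimate_seconds(prompt_tokens, new_tokens, prefill_tps, gen_tps):
    """Prefill time plus generation time, assuming constant rates."""
    return prompt_tokens / prefill_tps + new_tokens / gen_tps


estimates = {
    machine: estimate_seconds(PROMPT_TOKENS, NEW_TOKENS, prefill, gen)
    for machine, (prefill, gen) in LONG_PROMPT_RATES.items()
}

for machine, seconds in estimates.items():
    print(f"{machine}: about {seconds:.0f} s")
```

Under these assumptions the long-prompt run comes to roughly 59 seconds on the M3 Max and roughly 34 seconds on the M3 Ultra, with prefill dominating in both cases.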

CLI

One-shot prompt:

    ./ds4 -p "Explain Redis streams in one paragraph."

No -p starts the interactive prompt:

    ./ds4
    ds4>

The interactive CLI is a real multi-turn DS4 chat. It keeps the rendered chat transcript and the live Metal KV checkpoint, so each turn extends the previous conversation. Useful commands are /help, /think, /think-max, /nothink, /ctx N, /read FILE, and /quit. Ctrl+C interrupts the current generation and returns to ds4>.

The CLI defaults to thinking mode. Use /nothink or --nothink for direct answers. --mtp MTP.gguf --mtp-draft 2 enables the optional MTP speculative path; it is useful only for greedy decoding, currently uses a confidence gate (--mtp-margin) to avoid slow partial accepts, and should be treated as an experimental slight-speedup path.

Server

Start a local OpenAI/Anthropic-compatible server:

    ./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

The server is Metal-only.


It keeps one mutable graph/KV checkpoint in memory, so stateless clients that resend a longer version of the same prompt can reuse the shared prefix instead of pre-filling from token zero. Request parsing and sockets run in client threads, but inference itself is serialized through one Metal worker. The current server does not batch multiple independent requests together; concurrent requests wait their turn on the single live graph/session. Supported endpoints:
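The combination of prefix reuse and serialized inference can be illustrated with a toy model. This is a minimal Python sketch, not the server's actual code: `SingleSession`, `submit`, and `common_prefix_len` are hypothetical names, token lists stand in for the real KV checkpoint, and the sketch ignores generated tokens and any on-disk cache.

```python
import threading


def common_prefix_len(a, b):
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class SingleSession:
    """Toy stand-in for one live graph/KV checkpoint guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()  # serializes all inference work
        self._cached = []              # tokens whose KV state is considered live

    def submit(self, prompt_tokens):
        """Return how many tokens would still need prefill for this request."""
        with self._lock:               # concurrent callers wait their turn here
            keep = common_prefix_len(self._cached, prompt_tokens)
            self._cached = list(prompt_tokens)
            return len(prompt_tokens) - keep


session = SingleSession()
first = session.submit([1, 2, 3])          # cold start: everything is prefilled
second = session.submit([1, 2, 3, 4, 5])   # same prompt, extended: only the tail
third = session.submit([9, 9])             # unrelated prompt: no shared prefix
print(first, second, third)
```

A stateless client that resends the whole conversation on every turn hits the second case: only the newly appended tokens need prefill, which is why the single shared checkpoint is worth keeping even without request batching.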

    GET  /v1/models
    GET  /v1/models/deepseek-v4-flash
    POST /v1/chat/completions
    POST /v1/completions
    POST /v1/messages

/v1/chat/completions accepts the usual OpenAI-style messages, max_tokens/max_completion_tokens, temperature, top_p, top_k, min_p, seed, stream, stream_options.include_usage, tools, and tool_choice. Tool schemas are rendered into DeepSeek's DSML tool format, and generated DSML tool calls are mapped back to OpenAI tool calls.

/v1/messages is the Anthropic-compatible endpoint used by Claude Code style clients. It accepts system, messages, tools, tool_choice, max_tokens, temperature, top_p, top_k, stream, stop_sequences, and thinking controls. Tool uses are returned as Anthropic tool_use blocks.

Both APIs support SSE streaming. In thinking mode, reasoning is streamed in the native API shape instead of being mixed into final text.

Minimal OpenAI example:

    curl http://127.0.0.1:8000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model":"deepseek-v4-flash",
        "messages":[{"role":"user","content":"List three Redis design principles."}],
        "stream":true
      }'

Agent Client Usage

ds4-server can be used by local coding agents that speak OpenAI-compatible chat completions. Start the server first, and set the client context limit no higher than the --ctx value you started the server with:

    ./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

You can use larger context and larger cache if you wish.
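For a hand-written client rather than an existing agent, the streaming chat completions call can be sketched in Python with only the standard library. The host, port, model name, and request body mirror the curl example; the helper names are made up for this sketch, and the parser assumes the usual OpenAI-style stream of `data:` lines ending with `data: [DONE]`. It only collects `content` deltas and does not handle reasoning or tool-call fields.

```python
import json
import urllib.request

# Host and port follow the curl example; adjust them to your server.
BASE_URL = "http://127.0.0.1:8000"


def build_chat_request(prompt, stream=True):
    """Build a POST request for /v1/chat/completions."""
    payload = {
        "model": "deepseek-v4-flash",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_content(lines):
    """Yield text deltas from OpenAI-style SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(prompt):
    """Send the request and print text as it arrives (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        for text in iter_content(resp):
            print(text, end="", flush=True)
```

With ds4-server running, calling `stream_chat("List three Redis design principles.")` should print the answer incrementally. Keeping the parser separate from the network call makes it easy to test against recorded stream lines.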