DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles - LMSYS Blog
We are thrilled to announce Day-0 support for DeepSeek-V4 across both inference and RL training. SGLang and Miles form the first open-source stack to serve and train DeepSeek-V4 on launch day, with systems purpose-built for its hybrid sparse-attention architecture, manifold-constrained hyper-connections (mHC), and FP4 expert weights.

Figure 1. Decode throughput of SGLang vs the other OSS engine on a 30K-token prompt truncated from "Dream of the Red Chamber". We tried the best-effort spec configuration for each engine based on its official recipe. See benchmark notes for details.

TL;DR

SGLang and Miles ship Day-0 inference and RL for DeepSeek-V4 (1.6T Pro, 284B Flash).
Inference (caching & attention): ShadowRadix prefix cache, HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata, Flash Compressor, Lightning TopK, hierarchical multi-stream overlap.
Inference (kernels & deployment): fast kernel integrations (FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, TileLang mHC), DP/TP/CP attention, EP MoE on DeepEP, PD disaggregation.
RL training: full parallelism (DP/TP/SP/EP/PP/CP), TileLang attention, enhanced stability, FP8 training.
Hardware: Hopper, Blackwell, Grace Blackwell, AMD, NPU.
Launch Commands: SGLang Cookbook

Model Key Features & New Capabilities

DeepSeek-V4 (1.6T Pro, 284B Flash) extends its predecessor DeepSeek-V3.2 along three axes:
Hybrid sparse-attention: each layer mixes sliding window attention with one of two compression mechanisms (4:1 top-k or 128:1 dense), keeping the 1M-token context window tractable.
mHC (Manifold-Constrained Hyper-Connections): a generalization of standard residual connections that improves gradient flow and representation quality.
FP4 expert weights: native FP4 MoE experts for efficient serving on the latest Blackwell hardware.
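To make the three attention scopes concrete, here is a minimal sketch of how many KV entries a single query attends to under each mechanism. The window size (128) and the C4 top-k budget (512) are the values given in the ShadowRadix section of this post; the function name and the floor division for partially filled compression blocks are illustrative assumptions, not SGLang code.

```python
# Illustrative arithmetic only; all names here are hypothetical, not SGLang APIs.
SWA_WINDOW = 128    # raw tokens covered by sliding window attention
C4_RATIO = 4        # 4:1 compression
C4_TOPK = 512       # sparse top-k budget over the C4-compressed KV
C128_RATIO = 128    # 128:1 compression, attended densely


def attention_scope(context_len: int) -> dict:
    """Return how many KV entries one query attends to, per mechanism."""
    # Assumption: only completely filled blocks count as compressed entries.
    c4_entries = context_len // C4_RATIO
    c128_entries = context_len // C128_RATIO
    return {
        "swa": min(context_len, SWA_WINDOW),
        "c4": min(c4_entries, C4_TOPK),   # top-k caps the sparse scope
        "c128": c128_entries,             # dense: every compressed entry
    }


for n in (1024, 1_000_000):
    print(n, attention_scope(n))
```

Under these assumptions, the sparse C4 scope stays capped at 512 entries even at a 1M-token context, and the dense C128 scope grows only to a few thousand entries.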
Designs, Features and Performance Optimizations

ShadowRadix: Native Prefix Caching for Hybrid Attention

Every layer of DeepSeek-V4 combines SWA (sliding window attention over the last 128 raw tokens) with either C4 (top-512 sparse over 4:1-compressed KV) or C128 (dense over 128:1-compressed KV). In addition, each compression layer has a state pool that stores the in-progress compression state for KV slots that are still being compressed. This mechanism breaks the assumptions of traditional prefix caching: three heterogeneous KV pools and two compression-state pools must stay coherent across prefill, decode, and speculative decoding. The following figure shows the per-layer hybrid attention scope for an N = 1024 example.

To solve this coherence problem, we introduce ShadowRadix -- a native prefix caching mechanism for hybrid attention.

One core idea. A radix tree indexes virtual full-token slots -- a unified coordinate system shared by all layers. From each slot, we project shadows (per-pool index mappings) into the physical pools (SWA / C4 / C128). Compression-state ring buffers sit in their own pool, but a second-level arithmetic shadow maps each ring slot from the SWA page index -- logically nested inside SWA, physically independent. This lets lifetimes decouple: tombstoning frees a node's SWA slots once the sliding window moves past it, while its C4/C128 shadows stay alive and shareable. So a 10k-token request keeps only 128 SWA tokens plus its full C4/C128 compressed KV -- and that compressed KV is what other prefix-matching requests reuse.

Figure 2. ShadowRadix storage layout.

Because shadows are just per-pool projections of the same source, each pool can manage its own lifetime independently. We exploit this with a two-counter lock per node -- full_lock_ref covers the source (and therefore its C4/C128 shadows), while swa_lock_ref only tracks whether the node still falls inside someone's sliding window. When the SWA counter hits zero, we drop the node's SWA slots, but the node stays in the tree and its compressed shadows keep serving prefix matches. Matching stays SWA-safe by requiring 128 consecutive live tokens from the match point before extending into the window.
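The two-counter lock and the SWA-safe matching rule can be sketched as follows. This is a simplified, node-granular model under stated assumptions: only the counter names full_lock_ref and swa_lock_ref come from the text, every other name is hypothetical, and the matching function is one possible reading of the "128 consecutive live tokens" rule rather than SGLang's implementation.

```python
from dataclasses import dataclass
from typing import List

SWA_WINDOW = 128  # sliding window size stated in the text


@dataclass
class Node:
    """One radix-tree node; fields other than the two counters are hypothetical."""
    num_tokens: int
    full_lock_ref: int = 0   # protects the source slots and their C4/C128 shadows
    swa_lock_ref: int = 0    # > 0 while the node is inside someone's sliding window
    swa_live: bool = True    # False once the SWA slots have been tombstoned


def lock(node: Node, in_window: bool) -> None:
    node.full_lock_ref += 1
    if in_window:
        node.swa_lock_ref += 1


def unlock_swa(node: Node) -> None:
    """The window moved past the node: release SWA, keep compressed shadows."""
    node.swa_lock_ref -= 1
    if node.swa_lock_ref == 0:
        node.swa_live = False  # tombstone: SWA slots freed, node stays in the tree


def swa_safe_match_len(path: List[Node]) -> int:
    """Longest match point preceded by a full window of live SWA tokens."""
    matched = 0    # tokens matched so far along the path
    live_run = 0   # consecutive live SWA tokens ending at `matched`
    best = 0
    for node in path:
        matched += node.num_tokens
        live_run = live_run + node.num_tokens if node.swa_live else 0
        if live_run >= min(SWA_WINDOW, matched):
            best = matched
    return best


# A tombstoned 256-token prefix followed by live nodes is still matchable,
# but only from a point where 128 live SWA tokens precede the match point.
path = [Node(256, swa_live=False), Node(64), Node(64), Node(100)]
print(swa_safe_match_len(path))
```

In this model a tombstoned node never blocks a match by itself; it only forces the match point forward until a full live window has accumulated behind it.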
The nested ring shadows inherit this for free: since a ring slot's address is swa_page * ring_size + pos % ring_size, releasing an SWA page automatically invalidates the rings sitting inside it without any extra tracking, and ring sizes are picked so that MTP rollback can't race a wrap.

Speculative decoding needs no structural change to this design. The one issue is that draft tokens are written into the ring before verification decides which ones to accept, so a rejected-then-retried step can wrap around and overwrite slots that still belong to the live window. Doubling the ring sizes under spec (C4: 8 -> 16, C128: 128 -> 256) leaves enough headroom that any in-flight speculative write lands outside the active window, so EAGLE works out of the box.

Speculative Decoding

DeepSeek-V4 ships a single-layer MTP head -- a separately trained DSv4 decoder layer that runs SWA-only attention (no compressor, no indexer), with the previous-step hidden state (h_proj) and the next-token embedding (e_proj) combined as its input. We support it Day-0; the real systems work sits one level below. Hybrid attention metadata is heavy, and preparing it eagerly on the scheduler stream becomes the launch bottleneck under speculative decoding -- so we fuse that preparation into the CUDA graph for both draft and verify passes.

Two optimizations drive the speedup:
In-graph metadata preparation. Hybrid attention's per-pass metadata is heavy -- SWA page indices, shadow-mapped pool slots, compressor/indexer plans, per-pool write locations -- but it's all index arithmetic over page tables and lengths, which fits device kernels cleanly. So each captured graph only needs the raw batch state as per-replay input (active requests, current lengths, new-KV destinations), copied into fixed buffers; captured kernels rebuild the rest inside the graph, and Python never touches the per-pass path during replay. This collapses the per-step launch overhead that would otherwise dominate speculative decoding.
Overlap scheduling. CPU-side work (result processing, batch preparation, deallocation) runs in parallel with GPU execution.
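The ring addressing from the ShadowRadix section, and the reason a doubled ring removes the wrap hazard under speculative decoding, can be sketched as follows. The address formula is the one quoted in the text; the helper name aliased_positions and the live and draft position ranges are made-up values for illustration, not measurements or SGLang internals.

```python
def ring_slot(swa_page: int, pos: int, ring_size: int) -> int:
    """Address formula quoted in the text: page-major, position modulo ring size."""
    return swa_page * ring_size + pos % ring_size


def aliased_positions(live, drafts, ring_size):
    """Draft positions whose ring offset collides with a still-live position."""
    live_offsets = {p % ring_size for p in live}
    return [d for d in drafts if d % ring_size in live_offsets]


# Assumption for illustration: 8 positions are still needed by the live
# compression state, and 4 draft positions are written before verification.
live = range(100, 108)
drafts = range(108, 112)

print(aliased_positions(live, drafts, ring_size=8))    # full ring: every draft collides
print(aliased_positions(live, drafts, ring_size=16))   # doubled ring: no collision

# All slots of one SWA page fall in [page * ring_size, (page + 1) * ring_size),
# so releasing the page invalidates exactly that range.
print(ring_slot(swa_page=3, pos=21, ring_size=8))
```

Because the page index is the major term of the address, no per-ring bookkeeping is needed to know which ring slots die with a released SWA page.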
Figure 3. Hybrid sparse attention combined with ShadowRadix and in-graph spec metadata keeps SGLang decode throughput essentially flat from 4K to 900K -- close to the model's full 1M context window. The drop is under 10% on both B200 (199 -> 180 tokens/s) and H200 (266 -> 240 tokens/s).

HiSparse: Turbocharging Sparse Attention with Hierarchical Memory

HiSparse is a recently introduced technique that offloads inactive KV cache to CPU memory, enabling larger batch sizes and higher throughput for sparse attention. HiSparse fits naturally with C4 layers: at each step the indexer's top-k touches only a small fraction of compressed positions, so most C4 KV is inactive at any moment and can live on CPU. C128 is dense (every position is touched) and SWA is already small (128 tokens), so neither benefits from offload. By using a CPU memory pool to extend just the C4 KV cache pool, we improve overall token capacity and throughput for long-context serving by up to 3x.

Left: the GPU keeps only small device buffers for the active working set of the C4 KV cache, while a larger pinned CPU mirror stores the full-context KV cache. At each step, the HiSparse Coordinator swaps missed pages in from CPU and evicts inactive GPU pages using an LRU policy. Newly generated tokens are asynchronously backed up to the CPU mirror. Right: peak throughput for DeepSeek-V4-Flash on 2xB200, 200K-input / 20K-output, swa_full_tokens_ratio=0.001.

Fast Kernel Integrations
New FlashMLA path for hybrid attention: DeepSeek-V4's hybrid attention combines SWA over the local window with an "extra" attention over either C4 (sparse top-k over compressed KV) or C128 (dense over compressed KV). We integrate the refreshed FlashMLA interface so SWA and the extra attention run in a single fused kernel call (the kernel takes both k_cache and extra_k_cache with their respective indices), sharing metadata construction in the forward pass. Target GPUs are Hopper/SM90 and Blackwell/SM100 where the corresponding kernels are supported.
FlashInfer TRTLLM-Gen fused MoE for MXFP8 x MXFP4: DeepSeek-V4 ships native FP4 expert weights, making small-batch MoE decode sensitive to expert-weight bandwidth. We integrate the TRTLLM-Gen fused MoE backend through FlashInfer to pair MXFP8 activations with MXFP4 expert weights and adapt SGLang's weight/scale layout to the kernel's expected format. This path targets Blackwell GPUs and relies on Blackwell-specific FP4 tensor-core machinery and tiled/persistent execution.
TileLang mHC kernels with split-K: mHC replaces the residual stream with a per-token mixture over hc_mult parallel branches. A Sinkhorn normalization over a GEMM output produces the mixture weights. In low-latency decoding, the pre-GEMM TileLang kernel can become the bottleneck because batch sizes are small and parallelism is limited. We therefore extend our two-stage split-K TileLang kernel (mhc_pre_gemm_sqrsum_splitk_kernel) to partition the K dimension across different CTAs, improving GPU utilization at small batch sizes. We also integrate a fused mhc_pre_big_fuse_tilelang path that combines RMSNorm, Sinkhorn, and residual mixing in one kernel, with PDL enabled.
DeepGEMM Mega MoE integration: We integrate DeepGEMM's Mega MoE path on Blackwell. This kernel fuses EP dispatch, the first FP8xFP4 expert GEMM, SwiGLU, the second FP8xFP4 expert GEMM, and EP combine into a single symmetric-memory-based mega-kernel, overlapping NVLink communication with tensor-core computation. Because Mega MoE requires transformed FP4 expert weights with UE8M0 scale factors, we adapt the existing EP/MoE activation path to consume the transformed layout directly. This avoids keeping two resident copies of expert tensors in GPU memory while still enabling both the DeepEP path and the low-latency Mega MoE path.
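As a reference point for the split-K idea used by the mHC pre-GEMM kernel above, here is a minimal pure-Python sketch of a two-stage split-K matrix multiply. It only illustrates how the K dimension is partitioned and then reduced; it is not the TileLang kernel, and the function names, shapes, and split count are assumptions chosen for the example.

```python
import random


def partial_matmul(a, b, k_lo, k_hi):
    """Multiply a (M x K) by b (K x N) using only the K slice [k_lo, k_hi)."""
    rows, cols = len(a), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(k_lo, k_hi)) for j in range(cols)]
        for i in range(rows)
    ]


def splitk_matmul(a, b, num_splits):
    """Two-stage split-K: per-slice partial products, then a reduction."""
    k = len(b)
    bounds = [k * s // num_splits for s in range(num_splits + 1)]
    # Stage 1: each slice stands in for one CTA working on its part of K.
    partials = [partial_matmul(a, b, lo, hi) for lo, hi in zip(bounds, bounds[1:])]
    # Stage 2: sum the partial results across splits.
    rows, cols = len(a), len(b[0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)]


random.seed(0)
M, K, N = 2, 64, 3   # small batch (M) with a comparatively long K dimension
a = [[random.randint(-3, 3) for _ in range(K)] for _ in range(M)]
b = [[random.randint(-3, 3) for _ in range(N)] for _ in range(K)]

# The split-K result matches the unsplit product exactly for integer inputs.
print(splitk_matmul(a, b, num_splits=8) == partial_matmul(a, b, 0, K))
```

With a small M there are few output tiles to distribute, so splitting K is what exposes extra parallel work; the cost is the second reduction stage.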
For background on kernel-level integration patterns, see our earlier blog Scaling DeepSeek on GB200.

Various Kernel Optimizations

We elaborate on two representative kernels below.

Figure 5. Two general kernel-level optimizations. Left: Flash Compressor fuses the naive 5-stage compression chain into one on-chip pass, cutting HBM round-trips from 5 to 2.