deepseek-ai/DeepSeek-V4-Pro · Hugging Face

H huggingface.co ↗

▲ 159 points • 19 comments • by cmrdporcupine • 2mo ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

16 %

AI likelihood · overall

Human

100% human-written 0% AI-generated

SEGMENTS · HUMAN 9 of 9

SEGMENTS · AI 0 of 9

WORD COUNT 1,220

PEAK AI % 20% · §2

Analyzed

Apr 24

backend: pangram/v3.3

Segments scanned

9 windows

avg 136 words each

Distribution

100 / 0%

human / AI fraction

Verdict

Human

Pangram v3.3

Article text · 1,220 words · 9 segments analyzed

Human AI-generated

§1 Human · 18%

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Technical Report👁️

Introduction We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of one million tokens.DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: Hybrid Attention Architecture: We design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. Manifold-Constrained Hyper-Connections (mHC): We incorporate mHC to strengthen conventional residual connections, enhancing stability of signal propagation across layers while preserving model expressivity. Muon Optimizer: We employ the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline. The post-training features a two-stage paradigm: independent cultivation of domain-specific experts (through SFT and RL with GRPO), followed by unified model consolidation via on-policy distillation, integrating distinct proficiencies across diverse domains into a single model.DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, significantly advances the knowledge capabilities of open-source models, firmly establishing itself as the best open-source model available today. It achieves top-tier performance in coding benchmarks and significantly bridges the gap with leading closed-source models on reasoning and agentic tasks.

§2 Human · 20%

Meanwhile, DeepSeek-V4-Flash-Max achieves comparable reasoning performance to the Pro version when given a larger thinking budget, though its smaller parameter scale naturally places it slightly behind on pure knowledge tasks and the most complex agentic workflows. Model Downloads

Model #Total Params #Activated Params Context Length Precision Download

DeepSeek-V4-Flash-Base 284B 13B 1M FP8 Mixed HuggingFace | ModelScope

DeepSeek-V4-Flash 284B 13B 1M FP4 + FP8 Mixed* HuggingFace | ModelScope

DeepSeek-V4-Pro-Base 1.6T 49B 1M FP8 Mixed HuggingFace | ModelScope

DeepSeek-V4-Pro 1.6T 49B 1M FP4 + FP8 Mixed* HuggingFace | ModelScope

*FP4 + FP8 Mixed: MoE expert parameters use FP4 precision; most other parameters use FP8. Evaluation Results

Base Model

Benchmark (Metric) # Shots DeepSeek-V3.2-Base DeepSeek-V4-Flash-Base DeepSeek-V4-Pro-Base

Architecture - MoE MoE MoE

# Activated Params - 37B 13B 49B

# Total Params - 671B 284B 1.6T

World Knowledge

AGIEval (EM) 0-shot 80.1 82.6 83.1

MMLU (EM) 5-shot 87.8 88.7 90.1

MMLU-Redux (EM) 5-shot 87.5 89.4 90.8

MMLU-Pro (EM) 5-shot 65.5 68.3 73.5

MMMLU (EM)

§3 Human · 20%

5-shot 87.9 88.8 90.3

C-Eval (EM) 5-shot 90.4 92.1 93.1

CMMLU (EM) 5-shot 88.9 90.4 90.8

MultiLoKo (EM) 5-shot 38.7 42.2 51.1

Simple-QA verified (EM) 25-shot 28.3 30.1 55.2

SuperGPQA (EM) 5-shot 45.0 46.5 53.9

FACTS Parametric (EM) 25-shot 27.1 33.9 62.6

TriviaQA (EM) 5-shot 83.3 82.8 85.6

Language & Reasoning

BBH (EM) 3-shot 87.6 86.9 87.5

DROP (F1) 1-shot 88.2 88.6 88.7

HellaSwag (EM) 0-shot 86.4 85.7 88.0

WinoGrande (EM) 0-shot 78.9 79.5 81.5

CLUEWSC (EM) 5-shot 83.5 82.2 85.2

Code & Math

BigCodeBench (Pass@1) 3-shot 63.9 56.8 59.2

HumanEval (Pass@1) 0-shot 62.8 69.5 76.8

GSM8K (EM) 8-shot 91.1 90.8 92.6

MATH (EM) 4-shot 60.5 57.4 64.5

MGSM (EM) 8-shot 81.3 85.7

§4 Human · 14%

84.4

CMath (EM) 3-shot 92.6 93.6 90.9

Long Context

LongBench-V2 (EM) 1-shot 40.2 44.7 51.5

Instruct Model DeepSeek-V4-Pro and DeepSeek-V4-Flash both support three reasoning effort modes: Reasoning Mode Characteristics Typical Use Cases Response Format

Non-think Fast, intuitive responses Routine daily tasks, low-risk decisions </think> summary

Think High Conscious logical analysis, slower but more accurate Complex problem-solving, planning <think> thinking </think> summary

Think Max Push reasoning to its fullest extent Exploring the boundary of model reasoning capability Special system prompt + <think> thinking </think> summary

DeepSeek-V4-Pro-Max vs Frontier Models

Benchmark (Metric) Opus-4.6 Max GPT-5.4 xHigh Gemini-3.1-Pro High K2.6 Thinking GLM-5.1 Thinking DS-V4-Pro Max

Knowledge & Reasoning

MMLU-Pro (EM) 89.1 87.5 91.0 87.1 86.0 87.5

SimpleQA-Verified (Pass@1) 46.2 45.3 75.6 36.9 38.1 57.9

Chinese-SimpleQA (Pass@1) 76.4 76.8 85.9 75.9 75.0 84.4

GPQA Diamond (Pass@1) 91.3 93.0 94.3 90.5 86.2 90.1

HLE (Pass@1) 40.0 39.8 44.4 36.4 34.7 37.7

LiveCodeBench

§5 Human · 12%

(Pass@1) 88.8 - 91.7 89.6 - 93.5

Codeforces (Rating) - 3168 3052 - - 3206

HMMT 2026 Feb (Pass@1) 96.2 97.7 94.7 92.7 89.4 95.2

IMOAnswerBench (Pass@1) 75.3 91.4 81.0 86.0 83.8 89.8

Apex (Pass@1) 34.5 54.1 60.9 24.0 11.5 38.3

Apex Shortlist (Pass@1) 85.9 78.1 89.1 75.5 72.4 90.2

Long Context

MRCR 1M (MMR) 92.9 - 76.3 - - 83.5

CorpusQA 1M (ACC) 71.7 - 53.8 - - 62.0

Agentic

Terminal Bench 2.0 (Acc) 65.4 75.1 68.5 66.7 63.5 67.9

SWE Verified (Resolved) 80.8 - 80.6 80.2 - 80.6

SWE Pro (Resolved) 57.3 57.7 54.2 58.6 58.4 55.4

SWE Multilingual (Resolved) 77.5 - - 76.7 73.3 76.2

BrowseComp (Pass@1) 83.7 82.7 85.9 83.2 79.3 83.4

HLE w/ tools (Pass@1) 53.1 52.0 51.6 54.0 50.4 48.2

§6 Human · 14%

GDPval-AA (Elo) 1619 1674 1314 1482 1535 1554

MCPAtlas Public (Pass@1) 73.8 67.2 69.2 66.6 71.8 73.6

Toolathlon (Pass@1) 47.2 54.6 48.8 50.0 40.7 51.8

Comparison across Modes

Benchmark (Metric) V4-Flash Non-Think V4-Flash High V4-Flash Max V4-Pro Non-Think V4-Pro High V4-Pro Max

Knowledge & Reasoning

MMLU-Pro (EM) 83.0 86.4 86.2 82.9 87.1 87.5

SimpleQA-Verified (Pass@1) 23.1 28.9 34.1 45.0 46.2 57.9

Chinese-SimpleQA (Pass@1) 71.5 73.2 78.9 75.8 77.7 84.4

GPQA Diamond (Pass@1) 71.2 87.4 88.1 72.9 89.1 90.1

HLE (Pass@1) 8.1 29.4 34.8 7.7 34.5 37.7

LiveCodeBench (Pass@1) 55.2 88.4 91.6 56.8 89.8 93.5

Codeforces (Rating) - 2816 3052 - 2919 3206

HMMT 2026 Feb (Pass@1)

§7 Human · 17%

40.8 91.9 94.8 31.7 94.0 95.2

IMOAnswerBench (Pass@1) 41.9 85.1 88.4 35.3 88.0 89.8

Apex (Pass@1) 1.0 19.1 33.0 0.4 27.4 38.3

Apex Shortlist (Pass@1) 9.3 72.1 85.7 9.2 85.5 90.2

Long Context

MRCR 1M (MMR) 37.5 76.9 78.7 44.7 83.3 83.5

CorpusQA 1M (ACC) 15.5 59.3 60.5 35.6 56.5 62.0

Agentic

Terminal Bench 2.0 (Acc) 49.1 56.6 56.9 59.1 63.3 67.9

SWE Verified (Resolved) 73.7 78.6 79.0 73.6 79.4 80.6

SWE Pro (Resolved) 49.1 52.3 52.6 52.1 54.4 55.4

SWE Multilingual (Resolved) 69.7 70.2 73.3 69.8 74.1 76.2

BrowseComp (Pass@1) - 53.5 73.2

§8 Human · 16%

- 80.4 83.4

HLE w/ tools (Pass@1) - 40.3 45.1 - 44.7 48.2

MCPAtlas (Pass@1) 64.0 67.4 69.0 69.4 74.2 73.6

GDPval-AA (Elo) - - 1395 - - 1554

Toolathlon (Pass@1) 40.7 43.5 47.8 46.3 49.0 51.8

Chat Template This release does not include a Jinja-format chat template. Instead, we provide a dedicated encoding folder with Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model's text output. Please refer to the encoding folder for full documentation.A brief example:from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [ {"role": "user", "content": "hello"}, {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."}, {"role": "user", "content": "1+1=?"} ]

# messages -> string prompt = encode_messages(messages, thinking_mode="thinking")

# string -> tokens import transformers tokenizer = transformers.AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro") tokens = tokenizer.encode(prompt)

How to Run Locally Please refer to the inference folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.For local deployment, we recommend setting the sampling parameters to temperature = 1.0, top_p = 1.0. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens. License This repository and the model weights are licensed under the MIT License.

§9 Human · 10%

Citation @misc{deepseekai2026deepseekv4, title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence}, author={DeepSeek-AI}, year={2026}, }

Contact If you have any questions, please raise an issue or contact us at service@deepseek.com.