VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Abstract: This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.
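For readers unfamiliar with the Pass@1 metric cited in the abstract, the following is a minimal sketch of the widely used unbiased pass@k estimator, which reduces to the fraction of correct samples when k = 1. The abstract does not describe the report's evaluation protocol, so the function names `pass_at_k` and `mean_pass_at_k`, the sample counts, and the per-problem averaging are illustrative assumptions, not the authors' harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: attempt budget.
    Hypothetical helper for illustration; not taken from the report.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs (assumed aggregation)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator is simply c / n per problem, so a benchmark-level Pass@1 of 80.2 would correspond to an average per-problem success rate of 80.2% under this assumed aggregation.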

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2606.16140 [cs.AI] (or arXiv:2606.16140v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2606.16140 (arXiv-issued DOI via DataCite)

Submission history: From: Sen Xu. [v1] Mon, 15 Jun 2026 02:57:19 UTC (552 KB)