ProgramBench: Can Language Models Rebuild Programs From Scratch?

▲ 150 points 80 comments by jonbaer 3w ago HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

AI likelihood (overall): 17%
Verdict: Human (100% human-written, 0% AI-generated)
Segments: 1 analyzed (1 human, 0 AI), 239 words
Peak AI %: 17% (§1)
Analyzed: May 7 · backend: pangram/v3.3

Authors: John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

Abstract: Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Cite as: arXiv:2605.03546 [cs.SE] (or arXiv:2605.03546v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2605.03546 (arXiv-issued DOI via DataCite, pending registration)

Submission history: From: John Yang B. [v1] Tue, 5 May 2026 09:17:02 UTC (1,752 KB)
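The abstract describes scoring a candidate implementation purely by whether its behavior matches a reference executable, then reporting how many tasks clear a 95% pass rate. The abstract does not describe the actual harness, so the following is only a minimal sketch of that idea. Every name in it (`BehaviorCase`, `observe`, `pass_rate`, `fraction_of_tasks_at_or_above`) is hypothetical, and comparing only exit code and stdout is an assumption about what "behavior" means here.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorCase:
    """One hypothetical end-to-end test: CLI arguments plus stdin text."""
    args: tuple = ()
    stdin: str = ""


def observe(command, case, timeout=10.0):
    """Run a command on one case and return (exit code, stdout).

    Assumption: exit code and stdout are the only observable behavior.
    """
    result = subprocess.run(
        [*command, *case.args],
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


def pass_rate(reference, candidate, cases):
    """Fraction of cases where the candidate matches the reference."""
    if not cases:
        return 0.0
    passed = sum(observe(reference, c) == observe(candidate, c) for c in cases)
    return passed / len(cases)


def fraction_of_tasks_at_or_above(rates, threshold=0.95):
    """Fraction of per-task pass rates that reach the threshold."""
    if not rates:
        return 0.0
    return sum(r >= threshold for r in rates) / len(rates)


if __name__ == "__main__":
    # Toy stand-ins for a reference executable and a rebuilt candidate.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys; sys.stdout.write("
                 "sys.stdin.read().upper().replace('Z', 'z'))"]
    cases = [BehaviorCase(stdin=s) for s in ("abc", "xyz", "hello", "zoo")]
    print(pass_rate(reference, candidate, cases))
```

Because the comparison looks only at what the two programs emit, it places no constraint on how the candidate is structured internally, which is the property the abstract attributes to behavioral tests. In the toy run, the candidate mishandles inputs containing "z", so it matches the reference on two of the four cases.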