By Eric Lu, Ben Pan, Deniz Birlikci, Sam Lee, Ray Wang, Rohan Choudhury, Fermi Ma, TC Qin, Carlo Baronio, Silas Alberti, and more
06.08.26

Raising the bar from correctness to quality

Today’s coding benchmarks have established that models can write correct code. But as AI-generated code becomes the dominant path to production, correctness is now table stakes. The question we should be asking is: can models actually write good code?

We’re excited to introduce FrontierCode, a benchmark that measures how well models meet the standards of high-quality production codebases. What sets us apart:

Would the maintainer actually merge this PR? We’re the first benchmark to measure code mergeability. Our criteria assess end-to-end code quality: correctness, test quality, scope discipline, style, and adherence to codebase standards. Grading uses a novel ensemble of techniques, including unit tests, rubrics, and new types of verifiers.

Crafted by open-source maintainers. 20+ world-class open-source developers built realistic, diverse, and challenging coding tasks from the repos they maintain, spending more than 40 hours per task. They define what “mergeable” means in their repo.

Rigorous quality control. Rubric grading is subjective, so we built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review, in which every task is manually reviewed by a Cognition researcher. We achieve an 81% lower false positive rate compared to SWE-Bench Pro.

Our benchmark provides the strongest available signal of a model’s ability to write high-quality, maintainable code.
We find that even today’s most capable models struggle on this new standard.

20+ world-class open-source maintainers
40 hours of effort per task
Every task manually reviewed by Cognition researchers
81% lower false positive rate compared to SWE-Bench Pro
First-ever benchmark measuring code quality and subtle human preferences

Results

We present three nested subsets of FrontierCode at increasing difficulty: Extended, Main, and Diamond.
Diamond comprises the 50 hardest tasks, Main the 100 hardest (including Diamond), and Extended the full set of 150.

We report two metrics, pass rate and score:

A solution passes if it clears all blocker criteria, i.e., criteria that a maintainer would consider hard stops during code review, and fails otherwise.
A solution’s score is a weighted aggregate of the rubric items. Solutions that do not clear the blocker criteria receive 0.

Each model is run 5 times at every available reasoning effort. For each effort, we average the metric across the 5 trials, then report each model’s result at its best-performing reasoning effort.

FrontierCode Diamond remains unsaturated: the best-performing model, Claude Opus 4.8, achieves a score of only 13.4%. Other models score significantly lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less. However, GPT-5.5 uses up to 4x fewer tokens than Opus 4.8, achieving a better cost-intelligence tradeoff.

On FrontierCode Main and Extended, Opus 4.8 still maintains a clear lead, at 34.3% and 51.8%, respectively. We also observe a large gap between open-source models and the frontier. Kimi K2.6, the best-performing open-source model, achieves just 3.8% on Diamond, 16% on Main, and 37% on Extended.

The rest of this post is a deep dive into why and how we built FrontierCode.

Why we built FrontierCode

The first generation of coding benchmarks, such as SWE-Bench Verified and Pro, was designed for less capable models. These benchmarks fall short on many measures of realism and robustness.

Fundamentally, they only test functional correctness, not quality. Moreover, they are prone to misclassification errors. Experiments from METR have found that high-scoring models on these benchmarks often produce patches that wouldn’t be accepted by human maintainers.

How do we define misclassifications? They fall into two categories:

False Positives: The verifier should not reward solutions that are wrong.
Test coverage may be incomplete, allowing the model to write an incorrect solution that’s still accepted.
False Negatives: The verifier should not penalize solutions that are correct. Tests can be either too specific, e.g. checking for exact error strings or function names, or unsolvable, testing for a behavior that appears neither in the instruction nor in the codebase.

We show through analysis of agent trajectories that FrontierCode produces 81% fewer misclassification errors than other leading benchmarks. By this measure, FrontierCode scores provide the most accurate ranking currently available.

Existing benchmarks also suffer from a lack of diversity in several ways.

While other benchmarks generated issues from single PRs via programmatic scraping, FrontierCode tasks are hand-selected by repo maintainers from multi-PR chains and freeform requests. We also triple the number of represented languages relative to SWE-Bench Pro.

Existing benchmarks are also known to provide too much guidance, in the form of overly specified and detailed prompts. Today’s frontier models need far less hand-holding. FrontierCode expects the agent to infer the maintainer’s intent, given the same context as a human contributor.

Our prompts contain two parts. The first is the task description. The second is the codebase guidelines for generic testing, lint, and style practices, like those found in AGENTS.md. The task descriptions are humanlike and deliberately concise: a third the length of SWE-Bench Pro’s.

[Figure: example prompts from each benchmark, shown at the same scale.]

Furthermore, we’ve chosen to scale the difficulty of tasks using quality rubrics, rather than simply increasing patch size. Despite having smaller patches than benchmarks like DeepSWE, FrontierCode is harder for agents to solve.

To produce an evaluation for code quality as ambitious as FrontierCode, we had to embed quality into every step of the benchmark creation process.

How we built FrontierCode

A Team of Open Source Maintainers

FrontierCode aims to measure whether models can produce code that would be merged into production codebases.
To ensure this, we collaborated directly with the maintainers of 36 flagship open-source repositories. This team of all-star experts has collectively reviewed and merged thousands of commits to their codebases. They can apply deep stylistic and design knowledge to every PR they see.

Each maintainer invested more than 40 hours per task, undergoing multiple rounds of iteration with other eval engineers and Cognition researchers. They’ve distilled their judgment into concrete evaluation criteria: any PR that satisfies these standards would actually be approved.

Here’s what they say about FrontierCode:

“Working with the team behind FrontierCode was a privilege. Taking on the AI evaluation problem felt like nothing less than an art… Where others grade like a CI, FrontierCode grades like a tech lead.”
Tomer Nosrati, CEO and Tech Lead of Celery (28.6k stars)

“What sets FrontierCode apart is the attention to detail. Each task is calibrated to a depth that simply hasn’t been seen before in LLM benchmarking. We should be moving away from benchmarks that can be gamed and instead using ones like FrontierCode to demonstrate genuine model intelligence and creativity.”
Martin McKeaveney, Co-Founder and CTO of Budibase (28k stars)

“I’m grateful to have worked with leading experts in the Open Source community. We had deep discussions on correctness versus quality and what mergeability means in the context of their repository. FrontierCode is a milestone for AI models respecting subjective quality in the real world.”
Merlijn Vos, Core Maintainer of uppy (30.8k stars)

“FrontierCode’s unique value comes from the human experience encoded in its evals: years of judgment about what makes code high-quality and worthy of merging.
The almost obsessive care brought to every criterion is why I believe this benchmark sets a new bar for SWE evaluation.”
Claudio Costa, Core Maintainer of Mattermost (37k stars)

Beyond Unit Tests

FrontierCode measures mergeability by evaluating code along the following axes:

Behavioral correctness: Does the patch successfully solve the problem?
Regression safety: Does it break anything in the existing codebase?
Mechanical cleanliness: Does it pass the project’s build, lint, and style checks?
Test correctness: Do the agent’s tests actually capture the desired behavior?
Scope: Does the patch touch only what it needs to?
Code quality: Does the code conform to codebase conventions, follow sound design patterns, and remain readable to collaborators?

The following table describes how we use both classical unit tests and novel methods, such as adaptive classical grading, scope, and reverse-classical tests (more on these methods below), to evaluate these criteria.

| Category | Method | How it works | Passes when |
| --- | --- | --- | --- |
| Behavioral correctness | classical | Injects test files into the repository, runs them, then cleans up. | All injected tests pass |
| Mechanical cleanliness, regression safety | command | Runs a shell command. | Exit code 0 |
| Test correctness | reverse-classical | Runs agent’s submitted tests against the base commit. | The tests fail |
| Behavioral correctness for complex tasks | adaptive classical grading | Uses an LLM to adapt reference tests or application code to align with the implementation. | Adapted tests pass |
| Scope | scope | Checks file boundaries, diff size constraints, and optionally semantic locality of changes. | Diff within constraints |
| Code quality | prompt | An LLM reviews agent’s diff against a natural-language prompt. | LLM score meets threshold |

Each criterion is either a blocker or a non-blocker:

Blockers represent mergeability requirements, i.e., criteria that a maintainer would consider hard stops during code review. These include correctness checks, as well as non-correctness concerns like performance or scope restrictions.
Non-blockers represent quality signals such as code style, type safety, and readability, which would not necessarily block a merge.

If a solution satisfies all the blockers, it is considered passing, and its score is the weighted aggregate of all the rubric items it passes. Otherwise it receives a score of zero.

Novel Grading Methods

We’ve introduced three main techniques to strengthen criteria against misclassifications, while allowing space for multiple valid solutions:

Reverse-Classical: The reverse-classical criterion is a way to ensure that agent-written tests are meaningful: when we run them on the original, broken codebase, they must fail. This gives us an automated, deterministic check that the agent understood the problem well enough to write an effective test for it.

Code Scope: A good PR should exercise restraint: it modifies only what it needs to, without touching unrelated files or introducing unnecessary refactors. The scope criterion is an automated check that enforces these boundaries.
It combines three types of constraints:

files: For fast, deterministic checks on which files are allowed, denied, or must be deleted.
size: To enforce limits on the number of changed lines, net line growth, or total files modified.
semantic: For LLM-based checks that verify the locality or nature of a change within a specific part of a file (e.g., inside a single function).

Adaptive Classical Grading: Open-ended coding tasks can have many valid solutions. Static unit tests are too rigid; good solutions can fail for superficial differences like function names or error wording.
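
To make the files and size constraints of the scope criterion described above more concrete, here is a minimal Python sketch of how such a check might look. It is an illustration only, not FrontierCode's implementation: the `FileChange` and `ScopeConstraints` types, every field name, and the use of glob patterns are assumptions made for this example. The LLM-based semantic constraint is left out.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class FileChange:
    """One file touched by the agent's diff (hypothetical shape)."""
    path: str
    added: int = 0         # lines added
    removed: int = 0       # lines removed
    deleted: bool = False  # True if the file was removed entirely


@dataclass
class ScopeConstraints:
    """Hypothetical constraint names, invented for this sketch."""
    allow: List[str] = field(default_factory=list)        # globs; if set, every touched path must match one
    deny: List[str] = field(default_factory=list)         # globs that must not be touched
    must_delete: List[str] = field(default_factory=list)  # exact paths the patch has to remove
    max_changed_lines: Optional[int] = None               # limit on added + removed
    max_net_growth: Optional[int] = None                  # limit on added - removed
    max_files: Optional[int] = None                       # limit on total files modified


def check_scope(changes: List[FileChange], c: ScopeConstraints) -> List[str]:
    """Return readable violations; an empty list means the diff is within constraints."""
    violations = []

    # files: allow / deny / must-delete checks
    for ch in changes:
        if c.allow and not any(fnmatchcase(ch.path, g) for g in c.allow):
            violations.append(f"{ch.path}: outside allowed files")
        if any(fnmatchcase(ch.path, g) for g in c.deny):
            violations.append(f"{ch.path}: touches a denied file")
    deleted = {ch.path for ch in changes if ch.deleted}
    for path in c.must_delete:
        if path not in deleted:
            violations.append(f"{path}: expected to be deleted")

    # size: changed lines, net line growth, file count
    changed = sum(ch.added + ch.removed for ch in changes)
    net = sum(ch.added - ch.removed for ch in changes)
    if c.max_changed_lines is not None and changed > c.max_changed_lines:
        violations.append(f"{changed} changed lines exceeds {c.max_changed_lines}")
    if c.max_net_growth is not None and net > c.max_net_growth:
        violations.append(f"net growth {net} exceeds {c.max_net_growth}")
    if c.max_files is not None and len(changes) > c.max_files:
        violations.append(f"{len(changes)} files exceeds {c.max_files}")

    return violations


if __name__ == "__main__":
    diff = [
        FileChange("src/queue.py", added=12, removed=3),
        FileChange("docs/README.md", added=40),
    ]
    rules = ScopeConstraints(allow=["src/*", "tests/*"], max_changed_lines=50, max_files=2)
    for violation in check_scope(diff, rules):
        print(violation)
```

In this example the diff is flagged twice: once because `docs/README.md` falls outside the allowed globs, and once because the 55 changed lines exceed the limit of 50. Returning a list of violations rather than a bare boolean is a design choice made here so that a failing check can explain itself.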