Beyond Benchmarks: Disagreement Among Frontier LLMs on Real-World Fact-Checks

Lenz Research · Snapshot v1.0 · data as of May 21, 2026

On 67% of real fact-checks, top AI models don't agree on the answer.

1,000 claims, rated by 5 frontier LLMs.

Jordanov, Kosta · Lenz Research · kosta@lenz.io

We presented 1,000 recent real user claims to the five top frontier LLMs and asked each one for a verdict. These aren't benchmark items with public answer keys; they're claims that real users submitted to a fact-checking platform for verification. Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False). On 67% of claims, the panel splits.

Key findings

- 67% of claims (672 / 1,000; 95% CI: 64–70%) have at least one frontier model dissenting from the panel majority, or no majority forms at all.
- 34% of claims (343 / 1,000; 95% CI: 31–37%) involve a gap of 2 or more buckets between the most-disagreeing pair of frontier verdicts: a substantive disagreement on the answer, not just a calibration shift.
- Krippendorff's α (ordinal) = 0.639 across 5 raters on 1,000 items: nontrivial but limited agreement.
- The panel converges on definitive verdicts; the middle of the rubric is where it fractures. Within the 328 unanimous claims, only 4 are unanimous-Misleading and 0 are unanimous-Mostly-True.
- Some models concentrate verdicts at the True/False poles; others spread across the middle two buckets.

Contents

- How often the frontier disagrees
- Substantive vs nuance disagreement
- Model-vs-model agreement
- Per-model behavior (verdict distribution + agreement-with-rest)
- Detailed results (by domain, majority verdict, unanimous)
- Data
- Methodology
- Reproducibility
- Limitations
- FAQ
- Ethics & data use
- Changelog
- Appendix: Example claims where the frontier fractures

1. How often the frontier disagrees

On 67% of claims (672 / 1,000; 95% CI: 64–70%), the frontier panel doesn't agree: at least one model dissents from the majority verdict, or no strict majority forms at all.

The breakdown: for each claim we looked at the five frontier verdicts and asked whether at least three picked the same answer (a strict majority). If yes, we counted how many of the remaining models dissented. If no majority emerged at all (verdicts split across three or four different buckets), the claim falls in the "Models split, no majority" row.

Most of these claims are unlikely to appear in any training corpus with a gold label attached: there's no canonical answer key to pattern-match against and no benchmark leaderboard to anchor to.

We refer below to the "majority" and to "dissent from the majority." A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness.

| Frontier verdict pattern | Claims | Share of corpus (95% CI) |
| --- | --- | --- |
| All 5 agreed (unanimity) | 328 | 33% (30–36%) |
| 1 of 5 dissented | 224 | 22% (20–25%) |
| 2 of 5 dissented | 316 | 32% (29–35%) |
| Models split, no majority (e.g. 2-2-1 or 2-1-1-1) | 132 | 13% (11–15%) |
| ≥1 model dissents (incl. splits) | 672 | 67% (64–70%) |
| ≥2 models dissent (incl. splits) | 448 | 45% (42–48%) |

Panel agreement: Krippendorff's α (ordinal) = 0.639 (n = 1,000 claims, 5 raters). This indicates nontrivial but limited agreement: the models' verdicts are structured rather than random, but not consistent enough to treat the panel as a single interchangeable judge. Ordinal α is the standard Krippendorff variant for an ordered categorical scale (True / Mostly True / Misleading / False). See §7.5 Statistical analysis for choice of metric.

Lower bound on model error. For each claim, exactly one of the four verdict buckets is the correct answer. If we assume the panel's most popular bucket is the correct one (the most charitable assumption), the minimum number of models that picked a wrong verdict is:

- ≥1 model wrong on 67% of claims (any non-unanimous panel)
- ≥2 wrong on 45% of claims (3-2, 3-1-1, or no-majority splits)
- ≥3 wrong on 13% of claims (no bucket reaches a majority, so at most 2 can be right)

Relaxing the "most popular is correct" assumption can only raise these counts, never lower them. The actual error rates are likely higher still: even the 33% of cases where all five agree can, and likely do, include shared blind spots.
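The pattern rows and the lower-bound argument above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' pipeline: the function name `classify_panel` and the pattern strings are hypothetical, and the input is assumed to be exactly five verdict labels per claim.

```python
from collections import Counter

def classify_panel(verdicts):
    """Return (pattern, min_wrong) for one claim's five verdicts.

    Hypothetical helper for illustration. `min_wrong` is the lower bound
    on wrong models, assuming the most popular bucket is the correct one.
    """
    if len(verdicts) != 5:
        raise ValueError("expected exactly five verdicts per claim")
    top = max(Counter(verdicts).values())  # size of the most popular bucket
    min_wrong = 5 - top                    # everyone outside it is wrong at best
    if top == 5:
        pattern = "unanimous"
    elif top >= 3:                         # strict majority exists
        pattern = f"{5 - top} of 5 dissented"
    else:                                  # 2-2-1 or 2-1-1-1
        pattern = "no majority"
    return pattern, min_wrong

if __name__ == "__main__":
    print(classify_panel(["True", "True", "True", "False", "Misleading"]))
    print(classify_panel(["True", "True", "False", "False", "Misleading"]))
```

With five raters and four buckets, the most popular bucket always holds at least two verdicts, which is why a no-majority split implies at least three wrong models.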

2. Substantive vs nuance disagreement

On 34% of claims (343 / 1,000; 95% CI: 31–37%), at least two frontier models pick verdicts that are 2 or more buckets apart in our 4-bucket rubric, a disagreement that goes beyond calibration.

Not every disagreement is equal. A "True" vs "Mostly True" split is a confidence-calibration shift. A "True" vs "False" split is a substantive disagreement about the answer.

We measure this as the max pairwise bucket distance across the 5 verdicts on each claim, where the verdicts are ordered True (0) → Mostly True (1) → Misleading (2) → False (3).

| Distance | Interpretation | Claims | Share (95% CI) |
| --- | --- | --- | --- |
| 0 | Full unanimity (all 5 picked the same bucket) | 328 | 33% (30–36%) |
| 1 | Nuance only (e.g. True ↔ Mostly True) | 329 | 33% (30–36%) |
| 2 | Substantive (True ↔ Misleading, or Mostly True ↔ False) | 132 | 13% (11–15%) |
| 3 | Polar (True ↔ False) | 211 | 21% (19–24%) |
| ≥2 | Buckets apart (substantive or polar) | 343 | 34% (31–37%) |

Caveat. Bucket distance treats True / Mostly True / Misleading / False as an ordinal scale; an equal-spaced interpretation is a simplification. A 2-bucket gap can still reflect rubric ambiguity, temporal-framing differences, or differing interpretations of "Misleading." We report it as a coarse "substantive vs nuance" indicator, not a metric of error magnitude.
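As a minimal sketch of the distance measure described above: on a one-dimensional ordered scale, the largest pairwise gap is simply the highest rank minus the lowest. The names `ORDER`, `KIND`, `max_bucket_distance`, and `disagreement_kind` are hypothetical, chosen for illustration only.

```python
# Bucket ranks as given in the text: True (0) ... False (3).
ORDER = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}

# Short labels for each distance, paraphrasing the table's interpretations.
KIND = {0: "unanimous", 1: "nuance", 2: "substantive", 3: "polar"}

def max_bucket_distance(verdicts):
    """Largest pairwise rank gap among one claim's verdicts."""
    ranks = [ORDER[v] for v in verdicts]
    return max(ranks) - min(ranks)  # equals the max over all pairs

def disagreement_kind(verdicts):
    """Map a claim's verdicts to a coarse disagreement label."""
    return KIND[max_bucket_distance(verdicts)]

if __name__ == "__main__":
    claim = ["True", "Mostly True", "True", "Misleading", "True"]
    print(max_bucket_distance(claim), disagreement_kind(claim))
```

A claim counts toward the 34% figure when `max_bucket_distance` returns 2 or 3.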

3. Model-vs-model agreement

Highest peer agreement: Gemini 3 Pro × Gemini 3 Pro + Search (75%), which is unsurprising since they share a base model. Lowest: Claude Opus 4.7 × Gemini 3 Pro, Claude Opus 4.7 × Gemini 3 Pro + Search, and Gemini 3 Pro × Sonar Pro (53%); three pairs tie at the floor.

How often each pair of frontier models picked the same verdict label, across all claims in the corpus:

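| | GPT-5.4 | Claude Opus 4.7 | Gemini 3 Pro | Gemini 3 Pro + Search | Sonar Pro |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 | - | 65% (62–68%) | 65% (62–68%) | 60% (57–63%) | 60% (57–63%) |
| Claude Opus 4.7 | 65% (62–68%) | - | 53% (50–56%) | 53% (50–56%) | 58% (55–61%) |
| Gemini 3 Pro | 65% (62–68%) | 53% (50–56%) | - | 75% (72–77%) | 53% (50–56%) |
| Gemini 3 Pro + Search | 60% (57–63%) | 53% (50–56%) | 75% (72–77%) | - | 58% (55–61%) |
| Sonar Pro | 60% (57–63%) | 58% (55–61%) | 53% (50–56%) | 58% (55–61%) | - |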
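A pairwise agreement matrix of this kind could be computed roughly as follows. This is a sketch under assumptions: the input layout (a dict mapping each model name to its list of verdict labels, aligned by claim) and the function name `pairwise_agreement` are hypothetical, and the model names in the demo are placeholders rather than the paper's data.

```python
from itertools import combinations

def pairwise_agreement(verdicts_by_model):
    """Share of claims on which each pair of models chose the same label.

    `verdicts_by_model` is an assumed layout: model name -> list of labels,
    with index i referring to the same claim in every list.
    """
    result = {}
    for a, b in combinations(verdicts_by_model, 2):  # each unordered pair once
        va, vb = verdicts_by_model[a], verdicts_by_model[b]
        if len(va) != len(vb) or not va:
            raise ValueError("verdict lists must be non-empty and aligned")
        same = sum(x == y for x, y in zip(va, vb))   # exact label matches
        result[(a, b)] = same / len(va)
    return result

if __name__ == "__main__":
    toy = {  # placeholder data, not from the report
        "model_a": ["True", "False", "True", "Misleading"],
        "model_b": ["True", "False", "False", "Misleading"],
        "model_c": ["False", "False", "True", "Mostly True"],
    }
    for pair, share in pairwise_agreement(toy).items():
        print(pair, f"{share:.0%}")
```

Exact-match agreement ignores how far apart two labels are, so a True vs Mostly True split counts the same as True vs False here.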

4. Per-model behavior

Two angles on the same five models: how each one distributes its verdicts (4.1), and how often each one's verdict matches the strict majority of the other four (4.2).

4.1 Verdict distribution

Some models concentrate verdicts at the True/False poles; others distribute more broadly across the middle two buckets. This reflects model-level decision priors interacting with the specific claims; without ground truth, we can't separate the two. The table below shows the share of claims each model assigned to each bucket, with a 95% Wilson CI for each cell.

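| Model | True | Mostly True | Misleading | False |
| --- | --- | --- | --- | --- |
| GPT-5.4 | 42% (39–45%) | 16% (14–19%) | 12% (10–14%) | 30% (28–33%) |
| Claude Opus 4.7 | 38% (35–41%) | 26% (23–29%) | 19% (17–22%) | 17% (15–20%) |
| Gemini 3 Pro | 54% (51–57%) | 3% (2–4%) | 3% (2–4%) | 40% (37–43%) |
| Gemini 3 Pro + Search | 52% (49–55%) | 4% (3–5%) | 9% (7–11%) | 35% (32–38%) |
| Sonar Pro | 35% (32–38%) | 23% (21–26%) | 16% (14–18%) | 26% (23–28%) |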
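For readers who want to check intervals like these, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` is hypothetical, and z = 1.96 is assumed for a 95% level. The demo uses the headline count of 672 out of 1,000 because the per-cell counts are not given in this section; whether the headline interval was also computed with the Wilson method is an assumption.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z = 1.96 assumed for 95%."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom          # shrunk toward 0.5
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

if __name__ == "__main__":
    lo, hi = wilson_interval(672, 1000)
    print(f"{lo:.1%} to {hi:.1%}")
```

Rounded to whole percentages, 672 / 1,000 and 343 / 1,000 give 64–70% and 31–37%, which matches the intervals reported in the key findings.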

4.2 Agreement with the rest of the panel

Across the five models, peer-majority agreement ranges from 69% to 81%. This is peer alignment in this corpus, not correctness: no model is treated as ground truth here, and eligible n differs per row.

For each model, how often does its verdict match the strict majority (≥3/4) of the other four? A claim is eligible only when a ≥3/4 majority exists among the other four.

| Model | Agreement w/ peer majority (CI) | Eligible n | Ineligible | Tier |
| --- | --- | --- | --- | --- |
| GPT-5.4 | 81% (78–84%) | 650 | 350 | parametric |
| Claude Opus 4.7 | 70% (67–74%) | 691 | 309 | parametric |
| Gemini 3 Pro | 77% (74–80%) | 683 | 317 | parametric |
| Gemini 3 Pro + Search | 76% (73–79%) | 693 | 307 | retrieval |
| Sonar Pro | 69% (66–73%) | 675 | 325 | retrieval |

5. Detailed results

5.1 Per-domain frontier disagreement

Denominator per row: claims in that domain (the Claims column).

| Domain | Claims | Any disagreement (CI) | Substantive, ≥2 buckets (CI) | No majority (CI) |
| --- | --- | --- | --- | --- |
| Finance | 75 | 67% (55–76%) | 39% (28–50%) | 20% (13–30%) |
| General | 179 | 68% (60–74%) | 40% (33–48%) | 12% (8–17%) |
| Health | 171 | 71% (64–78%) | 29% (23–36%) | 12% (8–17%) |
| History | 131 | 53% (44–61%) | 24% (17–32%) | 13% (8–20%) |
| Legal