I benchmarked caveman against two words

M maxtaylor.me ↗

▲ 89 points • 67 comments • by max-t-dev • 2mo ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully human-written

33 %

AI likelihood · overall

Human

100% human-written 0% AI-generated

SEGMENTS · HUMAN 2 of 3

SEGMENTS · AI 0 of 3

WORD COUNT 906

PEAK AI % 43% · §2

Analyzed

Apr 29

backend: pangram/v3.3

Segments scanned

3 windows

avg 302 words each

Distribution

100 / 0%

human / AI fraction

Verdict

Human

Pangram v3.3

Article text · 906 words · 3 segments analyzed

Human AI-generated

§1 Human · 27%

RepoCaveman is a popular Claude Code compression plugin. The pitch is in the name: ultra-compressed responses, ~75% fewer tokens, all the technical accuracy. Six modes, slash commands, intensity dials, classical Chinese variants.I benchmarked it against two words: "be brief."Same quality. Same range of tokens. The plugin didn't beat the boring default on either axis.This article is the long version of the video. If you want the verdict in two minutes, watch it.What I testedCategoryFailure modeSkill claim testednBug diagnosisDrops the why, gives fix without cause—5Concept explanationStrips nuance, edge cases, or compresses technical terms into plain EnglishTechnical terms exact5Architectural tradeoffsDrops caveats that change the advice—4Multi-step setupCollapses or reorders steps—4Security / destructive opsMissing warnings on irreversible actionsAuto-Clarity escape3Error interpretationParaphrases or truncates the error stringErrors quoted exact324 prompts across six categories: bug diagnosis, concept explanations, architecture tradeoffs, multi-step setup, security and destructive ops, error interpretation. Each prompt has a per-prompt rubric. Facts the answer must cover (key_points), terms it must use (must_use_terms), and dangerous wrong claims to avoid (must_avoid).The dataset shape:A real entry:Five arms: baseline. Claude default, no instruction. brief. "Be brief." prepended to every prompt. lite, full, ultra. Caveman plugin at three intensity levels. Each arm ran the full 24-prompt dataset through claude -p on claude-opus-4-7. A separate Claude (claude-sonnet-4-6) scored every response against its prompt's rubric. Semantic match on key points, literal match on required terms, trap detection on avoided claims.The harness is open source here.Quality didn't moveFirst check: did compression hurt correctness?Every arm scored within 1.5% of every other arm. Baseline 0.985. Brief 0.985. Lite 0.976. Full 0.975. Ultra 0.970. Every arm hit 100% of its key_points. Zero must_avoid triggers in 120 responses.

§2 Mixed · 43%

Compression didn't drop substantive content. Setting quality aside, the only axis worth comparing is tokens.The headline resultArmmean tokensbaseline636brief419lite401full404ultra449"Be brief." cut tokens 34% versus baseline. Caveman lite and full landed close to brief. Ultra, the strictest mode, produced the longest answers of the three caveman arms.This looked bad for ultra. It's a false story.The category splitSplitting tokens by category gives a clearer picture.On bug diagnosis, concept explanations, architecture tradeoffs, and error interpretation, ultra is shortest or tied with the other caveman arms. Compression is working as advertised.On multi-step setup and security warnings, every caveman mode gets more variable. Ultra catches the eye in the aggregate, but it's not specifically worse. All three caveman arms swing hard on these categories.The reason is in the skill itself. Caveman has an "Auto-Clarity" rule that explicitly drops compression for safety warnings, irreversible actions, and multi-step sequences. Exactly these two categories. When the safety escape engages, all three modes loosen toward natural prose. The compression just isn't running.That's not a bug. It's a designed feature. Caveman knowing when to stop compressing.So what's caveman actually for?If a two-word prompt matches it on tokens and quality, the value isn't compression. It's structure.Consistent output shapeEvery caveman response follows the same pattern:Predictable in a way that "be brief." isn't. If you want a uniform feel across sessions, or have downstream tooling that consumes Claude output, that consistency is real value.The intensity dialSlash command to switch lite, full, ultra mid-session. Two words can't do that.Persistence across long sessionsCaveman re-injects the ruleset on every prompt via SessionStart and UserPromptSubmit hooks.The goal is to keep the pattern from drifting across long sessions. My benchmark didn't test this. Every run was single-shot via claude -p. But the mechanism is real, and "be brief." in CLAUDE.md doesn't have an equivalent.The safety escapeAuto-Clarity dropping compression on destructive ops is the variance you saw in the chart above. Caveman explicitly distinguishes when to stop compressing. Two words don't make that distinction. On my data this didn't change outcomes. "

§3 Human · 28%

be brief." never tripped a must_avoid trap either. But the design exists.What I cut from the videoA few findings that didn't earn their place in a two-minute video but are worth flagging here.Lite missed a required term once. On a queue tradeoff question (SQS vs BullMQ vs Kafka), lite's markdown-table format compressed the comparison so tight it dropped the term "at-least-once". Score 0.70. The only row below 0.90 in the 120-row sweep. n=1, but it's a real failure mode for benchmarks that enforce specific terminology.Ultra triggered tool-use behaviour the other modes didn't. On a Dockerfile setup question, ultra opened with "Need write perms. Retry after approve, or paste inline:". It tried to call the Write tool, got blocked, and dumped the file inline anyway. That single response added ~1300 tokens to ultra's setup category mean. Caveman's terse examples seem to prime tool-first behaviour, which is a side-effect of compression style I didn't see coming.The arch_tradeoffs token inflation isn't what I thought. My initial findings doc claimed caveman's [thing] [action] [reason] pattern pushed the model toward bulleted enumerations on N-way comparison questions. Looking closer, lite and full have the same pattern but produced cleaner outputs (lite often wrote tables, full wrote prose). The pattern isn't the cause. I don't have a clean attribution.What you should actually doIf all you want is shorter outputs, start with "be brief." in your prompt or CLAUDE.md. Two words. Matched caveman's tokens and quality.Reach for caveman when you need consistent output structure across sessions. That's the differentiator that survived the benchmark.The bigger lesson: most prompt-engineering advice hasn't been measured against the boring default. Measure it.Repo: cc-compression-bench · Video: youtu.be/wijoYNiZq3M · Caveman plugin: juliusbrussee/cavemanIf you've got a compression strategy you want benchmarked against the same dataset, the harness is strategy-agnostic. Adding an arm is one shell script. PRs welcome.