Agent Skills

A addyosmani.com ↗

▲ 376 points • 212 comments • by BOOSTERHIDROGEN • 3w ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is a mix of AI-generated, and human-written content

69 %

AI likelihood · overall

Mixed

28% human-written 72% AI-generated

SEGMENTS · HUMAN 1 of 5

SEGMENTS · AI 2 of 5

WORD COUNT 1,691

PEAK AI % 93% · §2

Analyzed

May 4

backend: pangram/v3.3

Segments scanned

5 windows

avg 338 words each

Distribution

28 / 72%

human / AI fraction

Verdict

Mixed

Pangram v3.3

Article text · 1,691 words · 5 segments analyzed

Human AI-generated

§1 Mixed · 48%

A senior engineer’s job is mostly the parts that don’t show up in the diff. Specs. Tests. Reviews. Scope discipline. Refusing to ship what can’t be verified. AI coding agents skip those parts by default. Agent Skills is my attempt to make them not optional.

The default behaviour of any AI coding agent is to take the shortest path to “done.” Ask for a feature and it writes the feature. It does not ask whether you have a spec, write a test before the implementation, consider whether the change crosses a trust boundary, or check what the PR will look like to a reviewer. It produces code, declares victory, and moves on.

This is the same failure mode every senior engineer has spent their career learning to avoid. The senior version of any task includes work that doesn’t show up in the diff: surfacing assumptions, writing the spec, breaking the work into reviewable chunks, choosing the boring design, leaving evidence that the result is correct, sizing the change so a human can actually review it. Those steps are most of what separates engineers who ship reliable software at scale from people who push code that breaks.

Agents skip those steps for the same reason any junior would. They’re invisible. The reward signal points at “task complete” not “task complete and the design doc exists.” So we have to bolt the senior-engineer scaffolding back on.

Agent Skills is my attempt at that scaffolding. It just crossed 26K stars, so apparently I’m not alone in wanting it. This post is the part the README doesn’t quite cover: why each design choice exists, how it maps onto standard SDLC and Google’s published engineering practices, and what you should steal from the project even if you never install a single skill.

What a “skill” actually is

The word “skill” is doing a lot of work in the Claude Code / Anthropic vocabulary, and it helps to be precise. A skill is a markdown file with frontmatter that gets injected into the agent’s context when the situation calls for it. Somewhere between a system-prompt fragment and a runbook.

A skill is not reference documentation. It is not “everything you should know about testing.” It is a workflow: a sequence of steps the agent follows, with checkpoints that produce evidence, ending in a defined exit criterion.

§2 AI · 93%

That distinction is the whole game. If you put a 2,000-word essay on testing best practices into the agent’s context, the agent reads it, generates plausible-looking text, and skips the actual testing. If you put a workflow there (write the failing test first, run it, watch it fail, write the minimum code to pass, watch it pass, refactor), the agent has something to do, and you have something to verify.

Process over prose. Workflows over reference. Steps with exit criteria over essays without them. That single distinction separates a useful skill from a pretty markdown file. It also explains why so many “AI rules” repos end up doing nothing in practice. The rules are essays.

The SDLC the skills encode

The twenty skills in the repo organise around six lifecycle phases, with seven slash commands sitting on top. Define (/spec) is where you decide what you’re actually building. Plan (/plan) breaks the work down. Build (/build) implements it in vertical slices. Verify (/test) proves it works. Review (/review) catches what slipped through. Ship (/ship) gets it to users safely. /code-simplify sits across the bottom of the whole thing.

This isn’t a coincidence. It’s the same SDLC every functioning engineering organisation runs, just in different vocabulary. Google calls it design doc → review → implementation → readability review → launch checklist. Amazon calls it the working-backwards memo and the bar raiser. Every healthy team has some version of this loop.

What’s new with AI coding agents is that most agents skip most of these phases by default. You ask for a feature, you get an implementation, and the spec, plan, tests, review, and launch checklist all just don’t happen. Skills push the agent through the same phases a senior engineer forces themselves through, because shipping the code without them is how you produce incidents.

A complex feature might activate eleven skills in sequence. A small bug fix might use three. The router (using-agent-skills) decides which apply. The point is that the workflow scales to the actual scope, not to the assumed scope.

Five principles that are doing the work

Five design decisions in the project are the load-bearing ones. The rest of the system follows from them.

1. Process over prose

Already covered. Workflows are agent-actionable; essays are not.

§3 AI · 92%

The same is true for human teams. If your team handbook is 200 pages, no one reads it under time pressure. If it’s a small set of workflows with checkpoints, people actually run them.

2. Anti-rationalization tables

This is the most distinctive design decision in the project, and the one I most want other teams to steal.

Each skill includes a table of common excuses an agent (or a tired engineer) might use to skip the workflow, paired with a written rebuttal. A few examples close to the originals:

“This task is too simple to need a spec.” → Acceptance criteria still apply. Five lines is fine. Zero lines is not. “I’ll write tests later.” → Later is the load-bearing word. There is no later. Write the failing test first. “Tests pass, ship it.” → Passing tests are evidence, not proof. Did you check the runtime? Did you verify user-visible behaviour? Did a human read the diff?

The reason this works is that LLMs are excellent at rationalisation. They will produce a plausible-sounding paragraph explaining why this particular task doesn’t need a spec, or why this particular change is fine to merge without review. Anti-rationalization tables are pre-written rebuttals to lies the agent hasn’t yet told.

The pattern is just as good for human teams. Most engineering decay isn’t anyone choosing to do bad work. It’s people accepting plausible-sounding justifications for skipping the parts they don’t feel like doing. A team that writes down its anti-rationalizations is a team that has fewer of them.

3. Verification is non-negotiable

Every skill terminates in concrete evidence. Tests pass. Build output is clean. The runtime trace shows the expected behaviour. A reviewer signs off. “Seems right” is never sufficient.

This is the same principle that makes Anthropic’s harness recover from failures, that makes Cursor’s planner/worker/judge split actually catch bugs, that makes any long-running agent recoverable. The agent is a generator. You need a separate signal that the work is done. Skills bake that signal into every workflow.

4. Progressive disclosure

Do not load all twenty skills into context at session start. Activate them based on the phase. A small meta-skill (using-agent-skills) acts as a router that decides which skill applies to the current task.

§4 Mixed · 66%

This is the harness engineering lesson applied at skill granularity. Every token loaded into context degrades performance somewhere, so you load what’s relevant and leave the rest on disk. Progressive disclosure is how you get a twenty-skill library into a 5K-token slot without poisoning the well.

5. Scope discipline

The meta-skill encodes a non-negotiable I’d staple to every agent if I could: “touch only what you’re asked to touch.” Don’t refactor adjacent systems. Don’t remove code you don’t fully understand. Don’t brush against a TODO and decide to rewrite the file.

This sounds obvious until you watch an agent decide that fixing one bug requires modernizing three unrelated files. Scope discipline is the single biggest determinant of whether an agent’s PR is mergeable or has to be unwound. It’s also the principle that maps most cleanly onto Google’s code review norms, where reviewers will block a PR for doing more than one thing.

The Google DNA

The skills are saturated with practices from Software Engineering at Google and Google’s public engineering culture. This is intentional. Most of what makes Google-scale software work is documented and public, and it is exactly the part agents are most likely to skip.

A partial map of which skill encodes which practice:

Hyrum’s Law in api-and-interface-design. Every observable behaviour of your API will eventually be depended on by someone, so design with that in mind. The test pyramid (~80/15/5) and the Beyoncé Rule in test-driven-development. “

§5 Human · 28%

If you liked it, you should have put a test on it.” Infrastructure changes don’t catch bugs; tests do. DAMP over DRY in tests. Google’s testing philosophy is explicit that test code should read like a specification even at the cost of some duplication. Over-abstracted tests are a known anti-pattern. ~100-line PR sizing, with Critical / Nit / Optional / FYI severity labels in code-review-and-quality. Straight from Google’s code review norms. Big PRs don’t get reviewed; they get rubber-stamped. Chesterton’s Fence in code-simplification. Don’t remove a thing until you understand why it was put there. Trunk-based development and atomic commits in git-workflow-and-versioning. Shift Left and feature flags in ci-cd-and-automation. Catch problems as early as possible, decouple deploy from release. Code-as-liability in deprecation-and-migration. Every line you keep is one you have to maintain forever, so prefer the smaller surface.

None of these are new ideas. The point is that none of them are in the agent by default. A frontier model has read the phrase “Hyrum’s Law” in its training data, but it does not apply Hyrum’s Law when it’s designing your API at 3am. Skills are how you make sure it does.

How to actually use it

Three modes, in roughly increasing commitment.

Mode 1: install via marketplace. If you’re using Claude Code:

/plugin marketplace add addyosmani/agent-skills /plugin install agent-skills@addy-agent-skills

You get the slash commands (/spec, /plan, /build, /test, /review, /ship, /code-simplify) and the agent activates the relevant skills automatically based on context. This is the path I’d recommend most people start on.

Mode 2: drop the markdown into your tool of choice. The skills are plain markdown with frontmatter. Cursor users put them in .cursor/rules/. Gemini CLI has its own install path. Codex, Aider, Windsurf, OpenCode, anything that accepts a system prompt can read them. The tooling matters less than the workflow underneath.

Mode 3: read them as a spec.