I use Claude Code as my primary development tool, not as a chatbot, but as an agent that writes code, runs tests, opens PRs, and deploys. It calls tools in a loop until the job’s done, and the quality of that loop depends on the instructions guiding it.
Those instructions are skills: reusable prompt modules for specific tasks. I’ve got about 40. Yesterday I audited every one.
## The Routing Problem
The most important part of any skill isn’t the instructions. It’s the description.
When an agent starts a session, it doesn’t load every skill into memory. That’d blow through the context window before any real work started. Instead, it reads the descriptions: short summaries, maybe a hundred tokens each, and uses those to decide which skill to load for a given request. The description is the routing mechanism. It’s how the agent matches “commit these changes” to the commit skill instead of the create-pr skill or the git documentation.
A bad description doesn’t just make a skill slightly worse. It makes it invisible.
I found skills in my library with descriptions like “Use for code review.” Except I have three review-adjacent skills. One for code changes, one for documentation quality, one for security posture. That description could match any of them. So the agent guesses. Sometimes it loads a security assessment when I wanted a code review.
Here’s what the fix looks like. My debugging skill had this:
```yaml
description: Use when encountering any bug, test failure, or unexpected behavior.
```

An orchestrating agent can’t distinguish that from generic troubleshooting. After the audit:
```yaml
description: Investigates bugs, test failures, and unexpected behavior using
  root-cause-first methodology. Use when debugging errors, diagnosing failures,
  tracing regressions. Applies to "why is this failing", "this broke", "find
  the bug", "trace this error". Requires hypothesis before any fix attempt.
```

It includes trigger phrases the agent would actually generate, hints at expected behavior, and narrows scope. The description needs to be a little pushier than feels natural. More specific than you’d think necessary.
One technical constraint that bit me: the description has to fit on a single logical line. The YAML spec allows multi-line strings, but the skills runtime reads the description field as a single value. If your formatter wraps it into a block scalar or breaks it across lines in a way the parser doesn’t expect, the skill silently vanishes from the agent’s awareness. I had two skills that’d been ghosts for weeks before I noticed.
## Why Edge Cases Matter
You’d think a commit skill is simple. Check status, write a message, commit.
What happens when there’s nothing to commit? The naive approach writes a commit message anyway and fails. What about pre-commit hooks? If linting fails, does the agent bypass the hooks with --no-verify and move on? (Mine did, once. That was a fun morning.) What about .env files, API keys, credentials that got accidentally staged? A commit skill without that guardrail will happily commit your secrets to a public repo.
And the split problem. A developer fixes a bug and adds a feature in the same session. Without an edge case telling the agent to consider splitting, it creates one commit with both changes, and now your git history is useless for bisect.
Every one of those happened to me. The point isn’t that commit logic is unusually tricky. It’s that every seemingly simple skill hides failure modes until you’ve been burned by them. The create-pr skill needs to check CI before opening. The verification skill needs to ban phrases like “should work” and require actual command output as evidence. These aren’t pedantic additions. They’re the difference between a tool that helps and one that creates problems faster than a human would.
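The commit guardrails above can be sketched as a pure check an agent runs before writing any commit message. The patterns and wording are mine and hypothetical; a real version would feed it the output of `git diff --cached --name-only`:

```python
import re

# Paths I'd treat as "never commit without explicit confirmation".
# Hypothetical list; tune it to your own repos.
RISKY_PATTERNS = [
    r"(^|/)\.env(\..*)?$",   # .env, .env.local, ...
    r"credentials",          # credentials.json and friends
    r"\.pem$",               # private keys
    r"secret",
]

def commit_guardrails(staged_paths: list[str]) -> list[str]:
    """Return human-readable warnings; the agent should stop on any of
    them instead of committing."""
    warnings = []
    if not staged_paths:
        warnings.append("nothing staged: do not invent a commit")
    for path in staged_paths:
        if any(re.search(pat, path, re.IGNORECASE) for pat in RISKY_PATTERNS):
            warnings.append(f"possible secret staged: {path}")
    return warnings
```

The empty-staging case and the secrets case both fail closed: the skill instructs the agent to surface the warning rather than work around it.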
## Methodology and Discipline
When I started building skills, I wrote them like checklists. Step 1, step 2, step 3. Then something unexpected would happen and the agent had no framework for a judgment call. It’d either follow the checklist blindly into a bad outcome or stall out asking for help.
The skills that hold up teach how to think about the problem, not just what to do. My debugging skill opens with a hard constraint: no fixes without root cause investigation first. Then it lays out a methodology: read errors, reproduce, check recent changes, trace data flow backward, form a single hypothesis and test it minimally. If three fixes fail, stop. You’re not debugging anymore, you’re guessing. Step back and question the architecture.
Here’s what that looks like in the actual skill:
```markdown
## Iron Law

No fix without root cause. Period.

- Read the actual error. The full error.
- Reproduce it. If you can't reproduce it, you don't understand it.
- If three targeted fixes fail, stop. You're guessing, not debugging.

## Rationalizations to Reject

- "Let me just try..." → Hypothesis first.
- "Quick fix for now, investigate later" → Later never comes.
- "It works on my machine" → Then your machine is the variable.
```

Compare that to “Step 1: Read the error. Step 2: Try a fix. Step 3: If it doesn’t work, try another fix.” The checklist produces motion. The methodology produces thinking.
Those rationalizations matter because agents don’t give up. They’ll try fix after fix after fix, each time convinced this one will work. Without explicit instructions to stop after repeated failures, they’ll burn through an entire session producing increasingly creative but fundamentally misguided patches. Having “quick fix for now” flagged as a red flag means the agent catches itself before going down that path.
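The stop rule is mechanical enough to express as code. This is a hypothetical sketch of the loop the skill’s prose enforces, not a real agent API; `try_fix` stands in for “apply a minimal fix for this hypothesis and re-run the reproduction”:

```python
def debug_loop(hypotheses, try_fix, max_attempts=3):
    """Hypothesis-first debugging with a hard stop. try_fix(h) applies a
    minimal fix for hypothesis h and returns True if the bug is gone.
    Illustrative sketch of the skill's stop rule."""
    for attempt, hypothesis in enumerate(hypotheses, start=1):
        if attempt > max_attempts:
            # Three targeted fixes failed: this is guessing, not debugging.
            return "stop: question the architecture"
        if try_fix(hypothesis):
            return f"fixed: {hypothesis}"
    return "stop: out of hypotheses; back to root-cause investigation"
```

The important property is that both exit paths other than success are explicit stops, so the agent can’t grind through a session producing patch after patch.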
Three of my skills don’t produce output at all. They enforce process. I rewrote the TDD skill to frame it as a design discipline, not a testing methodology. The point of writing the test first isn’t to have more tests. It’s to force better interfaces. When you write the test, you’re designing the API from the consumer’s perspective. That’s the insight worth encoding. Not “write test, then write code.”
The verification skill got reframed around a specific failure mode: agents pattern-match toward “done” because they’re language models trained on text where things resolve. Confidence is their default. The skill counteracts that by requiring actual command output, not assertions that things work.
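A minimal sketch of that counterweight, assuming my own hypothetical banned-phrase list (the real skill is prose instructions, not code):

```python
# Phrases the verification skill treats as assertions, not evidence.
BANNED_PHRASES = ["should work", "probably works", "looks correct", "seems fine"]

def verify_claim(claim: str, command_output: str) -> list[str]:
    """Return objections to a completion claim. Empty list means the
    claim is backed by evidence, not confidence."""
    objections = [
        f'banned phrase: "{p}"' for p in BANNED_PHRASES
        if p in claim.lower()
    ]
    if not command_output.strip():
        objections.append("no command output attached as evidence")
    return objections
```

“All 42 tests pass” with the test runner’s output attached clears the check; “this should work now” with nothing attached does not.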
## The Retrospective Loop
One thing that came out of this audit was a new skill: the retrospective. After a feature ships, this skill reviews what happened. Git history, conversation patterns, where the agent misunderstood, where I had to repeat myself, where skills triggered incorrectly.
It maps each friction point to a concrete change. Not “improve the commit skill” but “add edge case to commit skill: warn when staged files include patterns matching .env or credentials.” Concrete enough to execute. Then it waits for approval before touching anything.
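One way to picture the retrospective’s output is as a record per friction point. The shape below is my hypothetical illustration of the format, not the skill’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class ProposedChange:
    """One friction point mapped to a concrete, approvable edit."""
    friction: str           # what went wrong in the session
    skill: str              # which skill to change
    change: str             # concrete enough to execute verbatim
    approved: bool = False  # nothing is applied until this flips

change = ProposedChange(
    friction="agent committed a staged .env file",
    skill="commit",
    change="add edge case: warn when staged files match .env or credential patterns",
)
```

The `approved` default is the point: the retrospective proposes, and only a sign-off turns a proposal into an edit.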
This closes the loop. Skills improve because the agent identifies where they fell short, proposes changes, and after I sign off, applies them. The library gets better without me having to remember every lesson I learned three weeks ago.
## The Failure Asymmetry
When you’re supervising an agent closely, a vague skill feels survivable. You notice the drift, redirect, recover. The human absorbs the failure so naturally it doesn’t register as failure.
When agents chain skills without supervision (and they increasingly do), that same vague skill produces degraded output that the next skill treats as correct. It processes it further, passes it downstream, and six steps later you’ve got a result that looks completely wrong in a way that’s impossible to trace back to the skill that introduced the error.
“Good enough when I’m watching” and “good enough for agents running at 2am” are not the same standard. Agent-ready skills need deterministic output formats, explicit failure modes, and guardrails for the fact that nobody’s going to catch a subtle error in step three of an eight-step chain.
This is why I spent hours on skills that already “worked.” They worked with my oversight. That’s a lower bar than it sounds.
## Building a Skill Library
Don’t write skills from scratch. Do the work yourself, with the agent, for a few weeks. Pay attention to the moments where you repeat yourself. Where you explain the same convention. Where the agent makes a mistake you’ve corrected before. Those are your skills waiting to be written.
When you do write them, build from your actual outputs, not your intentions. Feed the agent examples of your best work and ask it to reverse-engineer the methodology. You’ll discover decisions you’ve been making automatically that you couldn’t have articulated if you tried. Those invisible decisions are exactly what a skill needs to capture.
Keep them lean. Under 500 lines. A skill that’s 1,200 lines long isn’t thorough. It’s unfocused, and the agent will lose the thread halfway through.
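That budget is easy to enforce mechanically. A hypothetical sketch; a real version would walk the skills directory instead of taking a dict of name-to-text:

```python
def audit_lengths(skills: dict[str, str], max_lines: int = 500) -> list[str]:
    """Flag skills over the line budget. skills maps skill name -> file text."""
    flagged = []
    for name, text in skills.items():
        count = len(text.splitlines())
        if count > max_lines:
            flagged.append(f"{name}: {count} lines (budget {max_lines})")
    return flagged
```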
And audit them. Skills rot the same way docs rot, but the consequences are worse because nobody notices until the agent does something wrong and you can’t figure out why.
Every skill you get right is a permanent upgrade to every session you’ll ever run. Every one you leave vague is a bug you haven’t found yet.