
Plans Are the Engineering

9 min read

I’ve been using AI coding agents heavily for the last year. Claude Code, Codex, Gemini, the whole lineup. Across four projects and roughly 150 plans, the biggest lesson had nothing to do with prompting or model selection. It was that in agent-driven work, the engineering moves upstream into planning, decomposition, and verification.

Keep plans small: usually three to five tasks, bounded tightly enough that a mid-level engineer could finish them in a sitting.

What Goes Wrong With Big Plans

Hand an AI agent something like “build the authentication system” and you’ll get one of two outcomes:

  1. It charges ahead, makes architectural decisions you never approved, and you burn more time unwinding those choices than you saved.
  2. It gets lost halfway through. Context drifts. The second half of the implementation quietly contradicts the first.

Same root cause both times. The plan blew past the agent’s effective working memory. Not the context window. The useful context window. Big difference.

The Rules I Follow Now

No implementation code in plans. A plan describes what changes and why. File names, command signatures, acceptance criteria, those are fine. But the second you put actual implementation code in there, the agent treats it like gospel instead of guidance. Let it write the code when it gets to that step. It’ll make better decisions with the actual codebase in front of it.

Each step must be independently testable. Not “we’ll test at the end.” Every step needs a way to verify it worked before you move on. Does it play nicely with what came before? Does it set up the next step correctly? Forward and backward, both directions matter.
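One way to make "independently testable" concrete is to give each step a gate script that fails fast on the first broken check. This is a minimal sketch, not taken from any of the projects described; the checks at the bottom are placeholders standing in for whatever commands the plan actually names:

```shell
#!/usr/bin/env bash
# Hypothetical per-step verification gate: run each named check and
# stop at the first failure, so the step can't be marked complete
# unless everything it claims actually passes.
set -euo pipefail

gate() {
  local name="$1"; shift
  if "$@"; then
    echo "PASS: $name"
  else
    echo "FAIL: $name" >&2
    return 1
  fi
}

# Placeholder checks; a real plan would list project commands such as
# `pnpm -F tsvergeos typecheck` or `uv run pytest`.
gate "typecheck" true
gate "unit tests" true
```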

No TDD islands. A step that says “write tests for the auth module” followed by “implement the auth module” creates this gap where tests got written against imagined code. Write the test and the implementation together. Same step.

Hand-off ready. If you can’t hand a step to a mid-level engineer who’s never seen the project and have them execute it, it’s too vague. Or too coupled to context that only exists in your head.

What This Looks Like at Scale

I recently built a TypeScript SDK for the VergeOS platform. 75+ services, full API parity with existing Go and Python SDKs. The kind of project that sounds like months of work.

I broke it into 46 plans.

Each plan used the same template: overview, requirements, research findings, affected files, implementation steps, a verification gate, and a checklist. The sections and format never varied.

The steps within each plan were granular. “Implement VM types” is one step, maybe 1-2 hours. “VM service class with 9 methods and full TSDoc” is another. “Unit tests with 10 assertions” is another. Each step listed the exact files it touched, described what to build in detail, and ended with an acceptance criterion you could run in a terminal: `pnpm -F tsvergeos typecheck` comes back clean, `pnpm -F tsvergeos test` passes, specific test counts are met.

No step was vague. No step said “implement the remaining services.” Every step had a scope you could hold in your head.

Here’s what an actual plan looks like, from the vrg CLI project:

```markdown
# Phase 9a: Alarm Management

**Scope:** `vrg alarm` commands
**Dependencies:** None

## Overview
Add alarm management commands for VergeOS monitoring.

## SDK Reference
| CLI concept | SDK manager | SDK source file |
|---|---|---|
| Alarms | `client.alarms` | `pyvergeos/resources/alarms.py` |

## Task Checklist
- [ ] Create `alarm.py` with list, get, snooze, unsnooze, resolve, summary
- [ ] Create `alarm_history.py` with list + get commands
- [ ] Register in `cli.py`
- [ ] Add test fixtures to `conftest.py`
- [ ] Create `test_alarm.py` and `test_alarm_history.py`
- [ ] Run `uv run ruff check && uv run mypy src/verge_cli && uv run pytest`
```

Six tasks. SDK reference pointing to the exact source file. A mechanical verification gate at the end. The full plan also included command signatures, column definitions, data mappings, and test fixtures, but the skeleton above is the part that matters for execution. The agent knows exactly what to build, what to test against, and how to prove it worked.

Reference Docs Are Half the Battle

The plans alone aren’t enough. Each one referenced a set of docs that gave the agent the context it needed to make good decisions: the VergeOS API documentation (337 endpoint docs with complete field schemas), the Go SDK’s 64 service files and 18 architecture decision records, the Python SDK’s 82 resource managers and filter patterns, and a cross-SDK comparison table showing where the implementations diverged.

These reference docs answered the questions the agent would otherwise guess at. What are the actual field names? Which action methods exist? Where does the Go SDK deviate from the API? What patterns did the Python SDK establish for filter builders? The answers were written down, in reference files, before a single line of SDK code existed.

Without those references, every plan becomes an invitation for the agent to hallucinate API shapes. With them, the agent is working from the same source of truth you are.

The Pattern Repeats

The SDK wasn’t the first time I worked this way. It was just the cleanest example.

Before it, I built Marvin, a RAG knowledge base with search, connectors, Slack integration, and a React frontend. 42 plans across 8 phases. The key difference from the SDK: Marvin had runtime wiring as a verification gate. The task runner checked that components were actually initialized at startup, not just importable. Unit tests passing meant nothing if the service wasn’t wired into the running app.

Then vrg, a Python CLI wrapping the VergeOS SDK. 58 plans across 10 phases. Same structure, but with a harder gate: every command had to pass live system verification against a real VergeOS cluster. If a command failed against the live system after three fix attempts, the task runner logged the issue and moved on rather than letting one failure block the chain.

Then vdash, a React dashboard for multi-site VergeOS management. 30+ iterations using autoresearch instead of the plan runner, each scoped to one feature surface (VM management, tenant drawer, storage page). The verification was different too: coverage scripts that checked whether SDK services were actually consumed by UI components, not just imported.

Four projects. ~150 plans total. The same structure kept paying off, even when the results still needed cleanup.

Running Plans Overnight

This is where the planning discipline pays off. I wrote a task runner: a bash script that loops through the plans in order, invokes Claude with each one, and waits for it to execute exactly one task, mark it complete, and commit the changes before moving on to the next plan.
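The loop itself can be tiny. Here is a minimal sketch under assumptions about the layout (a `plans/` directory of markdown files) and the agent CLI (`claude -p`); it is illustrative, not the actual runner script:

```shell
#!/usr/bin/env bash
# Hypothetical task-runner loop: one agent invocation per plan, one
# commit per completed task. AGENT_CMD and PLAN_DIR can be overridden,
# which also makes the loop testable without a live agent.
set -euo pipefail

AGENT_CMD="${AGENT_CMD:-claude -p}"
PLAN_DIR="${PLAN_DIR:-plans}"

run_plans() {
  local plan
  for plan in "$PLAN_DIR"/*.md; do
    echo "=== $plan ==="
    # Ask the agent to do exactly one unchecked task, mark it done
    # in the plan's checklist, and stop.
    $AGENT_CMD "Execute exactly one unchecked task from $plan, mark it complete, then stop."
    git add -A
    git commit -q -m "plan: $(basename "$plan"): one task"
  done
}
```

Because each iteration is one plan, one task, and one commit, a failure leaves a clean git history to resume from.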

For the SDK, eleven hours of autonomous execution produced a mostly working 75-service SDK. For vrg, the same runner worked through 58 plans across ten phases of CLI commands. The task runner prompt was nearly identical across projects. Same structure, same protocol, different reference docs.

If any plan had been vague, or large, or missing its reference context, the chain would have broken at that link.

Why Small Plans?

The sizing rule I use when prompting Claude to break down an implementation: each plan should be three to five tasks, small enough that a mid-level engineer could finish them in a few hours. That’s the bar. Not because a human is doing the work, but because it’s the right unit of scope for an agent to hold in context without drifting.

This matches a broader pattern others are landing on too: long-running sessions degrade, while short bounded loops anchored in files and git stay reliable. Each plan is a fresh context window. The agent reads the plan, reads the reference docs, executes one bounded task, commits, and stops. No accumulated context to rot.

The sizing also forces you to think about decomposition. Breaking work into chunks isn’t busywork. It’s design. The seams where you split the plan reveal the actual architecture of what you’re building. When I broke the SDK into 46 plans, the plan boundaries mapped almost exactly to the service boundaries in the final code. The decomposition was the architecture.

Shakedowns Between Phases

Every 6-8 plans, I inserted a shakedown. A short plan (3 steps) that audits everything built so far. Dead code and wiring audit. Test coverage audit. Build output validation. These aren’t feature work. They’re quality gates the agent runs between phases.

The shakedowns caught things that individual plan verification missed. A service that passed its own tests but wasn’t wired into the barrel exports. Type definitions that compiled but didn’t match the actual API response shape. Dead code left behind from an early refactor that the agent never cleaned up because no single plan owned it.
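A wiring audit of the barrel-export kind can be a few lines of shell. This sketch assumes a conventional TypeScript layout (service modules under one directory, re-exported from a single barrel file); the paths and the grep heuristic are illustrative, not the SDK's actual structure:

```shell
#!/usr/bin/env bash
# Hypothetical shakedown check: flag any service module whose name is
# not mentioned in the barrel file, i.e. code that compiles and passes
# its own tests but is never exported to consumers.
audit_barrel() {
  local services_dir="$1" barrel="$2" missing=0 f name
  for f in "$services_dir"/*.ts; do
    name="$(basename "$f" .ts)"
    if ! grep -q "$name" "$barrel"; then
      echo "NOT EXPORTED: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

A nonzero exit code makes the audit usable as a gate in the same runner that executes feature plans.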

Six shakedown phases across 46 plans. Each one found something worth fixing.

Where It Doesn’t Land Clean

None of these projects came out of the task runner at 100%. Not one.

The SDK had services that passed their own tests but broke when you actually imported them together. Marvin had components that were wired up correctly in isolation but failed under real concurrent load. vrg had commands that worked against the live system during development and then hit edge cases the test fixtures never covered.

Every project came out somewhere between 75% and 90% done. Each one took roughly half a day of back-and-forth with Claude Code to get over the line to what I’d call MVP. Debugging the gaps, fixing the integration seams, handling the edge cases the plans didn’t anticipate.

But here’s the thing: going from MVP to 1.0 took far less effort than it would have without the autonomous run. The base was solid. The architecture was sound. The patterns were consistent. I was fixing gaps in a coherent codebase, not wrestling with a pile of disconnected experiments. The planning bought me a foundation worth finishing.

The Real Shift

Think about what I actually did for the VergeOS SDK. I wasn’t manually writing 75 services. I was steering the system: defining the planning structure, directing the agent to extract and compare reference material, reviewing the decomposition, and tightening the verification gates until the execution loop was reliable. The agent did most of the legwork, including large parts of the research, document synthesis, and plan drafting. My contribution was setting constraints, making judgment calls, and validating outputs.

I didn’t review most of the generated code line by line. I reviewed it through verification gates, shakedowns, and integration checks instead. That was enough to tell me whether each step worked and where the seams still needed attention.

The same thing happened with Marvin, vrg, and vdash. Different languages, different problem domains, different verification gates. Same result: the hours I spent steering bought multiples of that in autonomous execution. And the plans were portable across projects in a way the code never could be. The task runner prompt, the plan structure, the verification protocol, all of it transferred. The implementations didn’t.

That’s the shift I keep seeing. The skill isn’t “can you write a good function.” It’s “can you build a workflow where an agent can produce both the plans and the code, without the architecture falling apart.”

The plans are more valuable than the code they produce. The code will change. The plans preserve the architecture, the decisions, and the constraints that made the code trustworthy in the first place.
