
The Spec-Driven Development Playbook: How to Stop Vibe-Coding Your Agents Into Production

Vibe coding survives prototypes; production breaks it. A spec template, three rigor levels, and the path to spec-driven development without a revolution.

Vibe coding is the prototyping technique that broke when it tried to grow up.

There is a phrase that captured the first phase of AI-assisted development perfectly: vibe coding. You sit with a coding assistant, describe what you want in approximate language, watch it produce something that looks plausible, run it, and iterate by feel. For a prototype it works. For a side project it is genuinely fun. For production software with maintenance horizons, customers, regulators, and on-call rotations, it falls over in a recognizable way: the code looks right, sometimes works, but solves a slightly different problem than the team actually had, and nobody can quite reconstruct how it got there.

The response that has crystallized in the past quarter is Spec-Driven Development, abbreviated SDD. The methodology is not new — formal specifications predate large language models by decades, and behavior-driven development covered some of the same ground — but the practice has been re-formed around a specific premise: when an AI does much of the implementation, the artifact a team should treat as canonical is the specification, not the resulting code. A January 2026 arXiv paper formalized the typology, and a wave of tools — GitHub Spec Kit, AWS Kiro, Claude Code’s skills system, OpenSpec, Tessl, and others — turned the philosophy into something teams can actually adopt.

This post is the playbook for that adoption. It covers where SDD pays off, where it does not, the three levels of rigor, the structure of a useful spec, the current tools, and the path most teams should take to introduce the practice without a heroic rewrite.

What changed — and why “spec-driven” exists as a movement

Specifications are not new in software. Most engineering organizations have been writing some form of design doc, RFC, PRD, or API contract for years. So why is “spec-driven development” a recognizable category in 2026 when it was just “writing things down” in 2022?

Two things changed.

The first is that AI coding agents made the cost of generating code-from-specification dramatically lower. The economic relationship between writing a spec and writing the code has inverted for a meaningful share of work: in the new world, the spec is the limiting factor and the code is comparatively cheap. That changes which artifact deserves the team’s investment.

The second is that AI agents produced a specific new failure mode — fluent but subtly wrong code — that traditional reviews catch poorly. A spec, executed as a contract the agent must satisfy, gives the team something to check the output against. Without it, every diff is judged on whether it looks plausible, which is exactly the property AI is best at producing.

The shift is from prompt-as-context (write a request, hope) to spec-as-contract (write the constraints, generate, verify). Engineering leaders should hear it that way: not as a methodology fad, but as the realization that AI changes the relative cost of specification and implementation, and the practice should follow those economics.
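The generate-then-verify half of that loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow of any tool named in this post: `SpecExample`, `verify_against_spec`, and the `slugify` candidate are hypothetical names, and `slugify` stands in for agent-generated code that looks plausible but misses one constraint.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class SpecExample:
    """One concrete input/output pair taken from a spec (hypothetical shape)."""
    description: str
    args: tuple
    expected: Any


def verify_against_spec(candidate: Callable[..., Any],
                        examples: list[SpecExample]) -> list[str]:
    """Return failure messages; an empty list means every example passed."""
    failures = []
    for ex in examples:
        try:
            actual = candidate(*ex.args)
        except Exception as exc:  # a crash also counts as a contract violation
            failures.append(f"{ex.description}: raised {type(exc).__name__}")
            continue
        if actual != ex.expected:
            failures.append(
                f"{ex.description}: expected {ex.expected!r}, got {actual!r}"
            )
    return failures


# Stand-in for agent output: reads fine, but ignores the length limit.
def slugify(title: str, max_len: int) -> str:
    return "-".join(title.lower().split())


examples = [
    SpecExample("lowercases and hyphenates", ("Hello World", 20), "hello-world"),
    SpecExample("truncates to max_len", ("Hello World", 5), "hello"),
]

failures = verify_against_spec(slugify, examples)
for message in failures:
    print(message)
```

The candidate passes the first example and fails the second, which is the kind of gap a plausibility-only review tends to miss.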

The three levels of rigor

The arXiv paper presents three levels of specification rigor, each appropriate for different situations. This taxonomy is the most useful contribution of the recent literature, and the right starting point for any team thinking about adoption.

Spec-first is the heaviest. A complete specification is written and reviewed before any code is generated. The spec covers requirements, design, acceptance criteria, and constraints. Code is produced from the spec, often with multiple iterations. The spec is updated when behavior changes, before the code is regenerated. Best for: compliance-sensitive workloads, teams of three or more sharing context, anything with a maintenance horizon longer than six months, and greenfield projects where requirements can be settled in advance.

Spec-anchored is the middle path. A lightweight spec is generated at the start of each feature or sprint — enough to constrain the AI’s interpretation, not enough to be exhaustive. The spec evolves with the code; either can be the source of an update, with the discipline of keeping both in sync. Best for: most teams switching away from vibe coding. The on-ramp for the discipline without the overhead of full upfront documentation.
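One way to support the keep-both-in-sync discipline is a check that flags change sets touching a feature's code without touching its spec. The sketch below assumes a hypothetical repository layout in which each feature lives under `features/<name>/` with its spec at `features/<name>/spec.md`; neither the layout nor the function name comes from any of the tools discussed here.

```python
from pathlib import PurePosixPath


def features_missing_spec_update(changed_paths: list[str],
                                 spec_name: str = "spec.md") -> list[str]:
    """Return feature names whose code changed but whose spec did not.

    Assumes the hypothetical layout features/<name>/..., with the spec
    stored directly at features/<name>/spec.md.
    """
    code_changed: set[str] = set()
    spec_changed: set[str] = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) < 3 or parts[0] != "features":
            continue  # outside the assumed convention; not checked
        feature = parts[1]
        if len(parts) == 3 and parts[2] == spec_name:
            spec_changed.add(feature)
        else:
            code_changed.add(feature)
    return sorted(code_changed - spec_changed)


changed = [
    "features/billing/spec.md",
    "features/billing/invoice.py",
    "features/search/ranking.py",
    "README.md",
]
stale = features_missing_spec_update(changed)
print(stale)  # ['search']
```

Not every code change alters behavior, so a result like this is probably better treated as a prompt for the reviewer than as a hard merge gate.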

Spec-as-source is the experimental end. The specification is the persistent artifact; code is generated on demand. To change behavior, you change the spec and regenerate; you never edit the generated code. Tessl is the most-discussed implementation of this pattern. Best for: forward-looking experiments where the team has unusual discipline. Production use of this pattern is rare and the tooling is not mature.

The realistic guidance: start at spec-anchored. The discipline catches enough of the vibe-coding failure modes to be worth the cost, and the overhead is small enough that teams will actually do it. Spec-first is the right destination for high-stakes workloads but a bad starting point: the upfront effort is large, and teams routinely abandon it. Spec-as-source is interesting and worth tracking, but deploying it as a primary methodology in 2026 is a bet on immature tooling rather than a safe default.

A six-element spec template

A spec is only useful if it is concrete enough to reduce ambiguity but light enough that someone will actually write it. The template that has held up across most teams converges on six elements.

  • Intent and scope. A two-to-five-sentence description of what this feature does and, just as important, what it does not do. The boundary is the part that most often saves you.
  • User stories or acceptance criteria. A small number of clear “when X, then Y” statements. Three to ten is the right range. More than that means the spec is doing the work of the implementation.
  • Interfaces and contracts. API shape, input and output schemas, error responses, idempotency rules. This is where the AI’s biggest hallucination risk lives, so it earns the most precision.
  • Constraints and non-functional requirements. Performance budgets, security rules, privacy requirements, dependencies, platform constraints. The things that, if left unspecified, the agent will quietly invent.
  • Tests and validation. Concrete examples — inputs, outputs, error cases. These often double as evals: the agent’s output is verified against them automatically.
  • Open questions and decisions log. What is unresolved. What was decided and why. The history of the spec, which is what future teammates and future agents will rely on.

Two principles tie the template together. First, every section should be skimmable in under a minute. If it is longer, decompose it. Second, the spec is for the next person, not the current one — the agent rebuilding from the spec a year from now, or the engineer joining the project. Write it for them.
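The six elements map naturally onto a small data structure, and the size guidance (two to five sentences of intent, three to ten acceptance criteria) can be checked mechanically. The sketch below is one possible encoding, assuming a hypothetical `FeatureSpec` schema and `lint_spec` helper; the sentence counter is a rough heuristic, and the invoice-export feature is invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FeatureSpec:
    """The six template elements as plain fields (hypothetical schema)."""
    intent_and_scope: str
    acceptance_criteria: list[str]
    interfaces: str
    constraints: list[str]
    validation_examples: list[str]
    decisions_log: list[str] = field(default_factory=list)


def lint_spec(spec: FeatureSpec) -> list[str]:
    """Return warnings based on the template's size guidance."""
    warnings = []
    # Rough heuristic: sentence-ending punctuation followed by space or end.
    sentences = len(re.findall(r"[.!?](?:\s|$)", spec.intent_and_scope.strip()))
    if not 2 <= sentences <= 5:
        warnings.append(f"intent and scope has {sentences} sentences; aim for 2 to 5")
    count = len(spec.acceptance_criteria)
    if not 3 <= count <= 10:
        warnings.append(f"{count} acceptance criteria; aim for 3 to 10")
    if not spec.interfaces.strip():
        warnings.append("interfaces and contracts section is empty")
    if not spec.constraints:
        warnings.append("no constraints listed; the agent will invent them")
    if not spec.validation_examples:
        warnings.append("no validation examples; nothing to verify output against")
    return warnings


spec = FeatureSpec(
    intent_and_scope=(
        "Export a customer's invoices as CSV. "
        "Does not cover PDF export or scheduled delivery."
    ),
    acceptance_criteria=[
        "When the customer has no invoices, then the file has only a header row.",
        "When an invoice is voided, then it is excluded from the export.",
    ],
    interfaces="GET /invoices/export returns text/csv; 403 if access is denied.",
    constraints=["Export completes within 5 seconds for 10,000 invoices."],
    validation_examples=[],
)
warnings = lint_spec(spec)
for warning in warnings:
    print(warning)
```

Here the lint flags the short criteria list and the missing validation examples, while the two-sentence intent passes. The decisions log is left unchecked on purpose, since an empty log is legitimate for a brand-new spec.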

Tool comparison: where the field actually is

The tooling landscape has converged faster than most categories. As of early 2026, a handful of tools cover most of what teams actually use.

GitHub Spec Kit is the open-source reference implementation. It provides templates, a CLI, and prompts to guide an AI coding agent through a structured workflow: constitution, specification, plan, tasks. It supports a long list of coding agents (Copilot, Claude Code, Gemini CLI, Cursor, and many others). The strength is portability — your spec is not locked to one vendor — and the weakness is that the current default treats each spec as a branch-scoped artifact, which makes it closer to spec-first per-feature than spec-anchored over time. Good starting point for teams whose agents vary.

AWS Kiro is the most opinionated. An IDE built on Code OSS, with the spec workflow baked into the product (requirements.md, design.md, tasks.md generated for every feature) and “agent hooks” that keep specs and code in sync. The deep AWS-ecosystem integration (Bedrock, CDK, IAM Identity Center) is a real lift for AWS-first teams and feels like overhead otherwise. The pure-product experience is the most mature; the lock-in is the most explicit.

Claude Code’s skills system treats SKILL.md files as scoped specifications the agent reads before acting on a class of task. Less feature-spec-oriented than the others, more domain-rule-oriented. Useful as a complement to either Spec Kit or Kiro rather than a replacement.

OpenSpec and Tessl represent the spec-anchored and spec-as-source ends of the spectrum respectively. Smaller communities, more experimental tooling, but worth tracking for teams that want to push further than the mainstream products allow.

The honest call: pick the tool that fits your existing agent and platform commitment. The methodology is more important than the tool, and the tool you pick today will not be the tool you use in two years.

Introducing SDD into a brownfield codebase

The literature focuses on greenfield. The reality for most teams is brownfield: large existing codebases, established conventions, working features that nobody documented. What follows is a four-step path that has worked for teams I have seen.

  1. Pick one feature area, not the whole codebase. SDD is hard to introduce as a top-down mandate. It is easy to introduce as “the next feature in module X will be written this way.” Pick a module with a clean boundary, ideally one with a high rate of AI-assisted work and a recent incident that exposed the vibe-coding failure mode.
  2. Write a retrospective spec for one existing piece. Take a recent feature in the chosen area and write the spec the team should have written. This is calibration. It teaches the team what a useful spec looks like, surfaces ambiguities in the existing code, and produces an artifact the agent can use immediately.
  3. Run the next feature spec-anchored. Use the lightweight template. Generate code from it. Update the spec when behavior changes during implementation. Treat the spec as a review artifact alongside the diff. Iterate.
  4. Evaluate after three features, not after one. The first feature done this way is slower; the third is usually faster than the vibe-coded version because the team has caught the ambiguities upfront and the agent has tighter context. If after three features the team’s review burden has not decreased and the rework rate has not improved, the practice is not worth keeping in that area. Try elsewhere or stop.
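The keep-or-stop rule in step 4 can be written down so the decision is made on numbers rather than impressions. The sketch below is an assumption-laden illustration: `FeatureOutcome`, its two metrics, and the use of simple averages are all hypothetical choices, and the figures are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class FeatureOutcome:
    """Per-feature measurements; both metrics are hypothetical proxies."""
    review_hours: float  # reviewer time spent on the feature
    rework_rate: float   # share of merged changes revised later (0.0 to 1.0)


def should_keep_practice(baseline: list[FeatureOutcome],
                         spec_anchored: list[FeatureOutcome],
                         min_features: int = 3) -> bool:
    """Keep the practice if review burden or rework rate improved on average."""
    if not baseline:
        raise ValueError("need at least one baseline feature to compare against")
    if len(spec_anchored) < min_features:
        raise ValueError(
            f"need at least {min_features} spec-anchored features to judge"
        )
    review_improved = (mean(f.review_hours for f in spec_anchored)
                       < mean(f.review_hours for f in baseline))
    rework_improved = (mean(f.rework_rate for f in spec_anchored)
                       < mean(f.rework_rate for f in baseline))
    # Stop only when neither measure improved, as the rule above states.
    return review_improved or rework_improved


baseline = [FeatureOutcome(6.0, 0.30), FeatureOutcome(5.0, 0.25)]
spec_anchored = [
    FeatureOutcome(7.0, 0.20),  # the first feature is slower, as expected
    FeatureOutcome(5.0, 0.15),
    FeatureOutcome(4.0, 0.10),
]
keep = should_keep_practice(baseline, spec_anchored)
print(keep)  # True
```

Averaging includes the slow first feature, so a team that wants to reward the learning curve could weight later features more heavily; that variation is a judgment call rather than part of the rule.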

The constant temptation in brownfield is to treat SDD adoption as a documentation project. It is not. The spec exists to guide the agent and anchor review; if no agent reads it and no reviewer references it, it is wasted effort. Tie every spec to the work it produces.

The deeper point about spec-driven development is that it is not really about specs. It is about describing intent before generating output, in an environment where output is cheap and intent is the scarce resource. Informal intent was good enough while writing code was the bottleneck and specification was the comparatively cheap part. With those costs inverted, the methodology has to invert too. Teams that adapt will look two years from now like they always worked this way: careful, deliberate, structured. Teams that do not will have a particular kind of debt that takes longer to recognize and longer to repay. The forcing function is new. The playbook is not.