Software engineering is undergoing a phase transition. The convergence of capable foundation models, agentic tool orchestration, and structured development methodologies has created a new paradigm where the human engineer's role shifts from implementer to orchestrator, from writing code line-by-line to specifying intent, guiding execution, and verifying outcomes. Addy Osmani captures this inversion succinctly: in late 2025, AI wrote 80% of code for early adopters, placing disproportionate emphasis on the human's role in owning outcomes, maintaining quality bars, and ensuring that tests actually validate behavior [1].
This tutorial examines the three pillars of this emerging stack: Claude Code as the agentic execution environment, Specification-Driven Development (SDD) as the methodology that transforms vague intent into executable contracts, and Verification & Validation (V&V) as the assurance framework that closes the loop between specification and implementation. Each pillar is presented first independently, with formal definitions, architecture, and best practices, then unified into a cohesive engineering pipeline.
Claude Code: The Agentic Execution Environment
Claude Code represents a fundamental shift in the developer's relationship with AI. Andrej Karpathy described it as "a little spirit or ghost that lives on your computer" [3]: anything you can achieve by typing commands into a terminal is something Claude Code can automate. The tool is intentionally low-level and non-opinionated, providing near-raw access to the model without imposing specific workflows [2]. By early 2026, Claude Code was authoring roughly 135,000 public GitHub commits per day, representing approximately 4% of all public commits, and Anthropic reported that 90% of its own code is AI-written [36].
The introductory material above establishes what Claude Code is. The remainder of this section is designed for principal engineers and senior practitioners who need to understand how it works at an architectural level, and how to exploit its extension points to build enterprise-grade agentic workflows. The relevant extension points are hooks, subagents, skills, worktrees, and headless execution.
1.1 Architecture: The Master Agent Loop
At its core, Claude Code is built around a deceptively simple architecture: a layered system centered on a
single-threaded master loop (internally codenamed nO), enhanced with a real-time
steering capability (the h2A asynchronous dual-buffer queue), a rich toolkit, intelligent planning
through TODO lists, controlled sub-agent spawning, and comprehensive safety measures [37]. The design thesis is explicit: a simple, single-threaded master loop
combined with disciplined tools and planning delivers controllable autonomy. Anthropic deliberately chose
this approach over complex multi-agent swarm architectures for debuggability, transparency, and reliability [38].
The operational flow is straightforward: user input arrives → the model reasons and selects tools →
tools execute → results feed back to the model → the cycle continues until a plain text response
without tool calls terminates the loop [37]. The h2A queue enables
real-time steering: users can inject new instructions, constraints, or redirections while the
agent is actively working, without requiring a full restart [38]. This is what makes
Claude Code feel interactive even during long autonomous operations.
A critical architectural detail: Claude Code dispatches shell commands to Claude Haiku (Anthropic's smallest, fastest model) for pre-execution safety classification. Haiku responds with structured output classifying the command's risk level, enabling Claude Code to avoid dangerous operations without slowing down the main reasoning loop [39]. This is the first line of defense against prompt injection. Because it runs outside the main model's context, it is lightweight, fast, and structurally immune to the attack it is defending against.
1.2 The Tool System: 14 Primitives for Agentic Agency
Without tools, Claude can only respond with text. Tools are what make Claude Code agentic [40]. The system ships with 14 built-in tools organized into five functional categories:
| Category | Tools | Purpose |
|---|---|---|
| Command Line | Bash, Glob, Grep, LS | Shell execution, file pattern matching, content search, directory listing |
| File I/O | Read, Write, Edit, MultiEdit | Read, create, modify, and batch-modify files |
| Web | WebFetch, WebSearch | Retrieve URLs, search the internet for current information |
| Notebooks | NotebookRead, NotebookEdit | Parse and modify Jupyter notebooks (special handling for long cell output) |
| Orchestration | TodoWrite, Task | Planning via structured TODO lists; subagent delegation |
TodoWrite deserves special attention. It functions as cognitive scaffolding: the model creates a structured JSON task list with IDs, content, status, and priority levels before executing work. The UI renders these as interactive checklists. System reminder messages inject the current TODO state after each tool use, preventing the model from losing track of its objectives in long conversations [37]. When the TODO list is empty, a system reminder nudges the model to create one, invisibly to the user. This design pattern reveals something important: Claude Code's planning capability is not emergent magic; it is engineered scaffolding implemented through tool-call conventions and system prompts.
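As a sketch, a TodoWrite payload built from the fields described above might look like the following. The exact schema is internal to Claude Code and may differ; the field names (IDs, content, status, priority) follow the description in [37], and the task contents are purely illustrative:

```json
{
  "todos": [
    { "id": "1", "content": "Read existing auth middleware", "status": "completed", "priority": "high" },
    { "id": "2", "content": "Add OAuth2 callback handler", "status": "in_progress", "priority": "high" },
    { "id": "3", "content": "Update integration tests", "status": "pending", "priority": "medium" }
  ]
}
```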
The tool system is designed for parallel dispatch. Claude can invoke multiple independent tool
calls in a single response turn, dramatically reducing latency for operations like reading several files or
running concurrent searches. The system prompt explicitly instructs: "batch your tool calls together for optimal
performance" [41]. A strict Read-before-Write invariant is
enforced: Claude must read a file's current state before writing to it, preventing blind overwrites. The
Grep tool is built on ripgrep and the system prompt explicitly prohibits using grep or
rg via Bash, ensuring all search operations go through the instrumented, permission-aware tool
layer [41].
1.3 Context Engineering: The Governing Constraint
Context management is not a feature of Claude Code; it is the governing constraint that shapes every architectural decision. Every message sent, file read, tool result, and conversation turn consumes tokens from a fixed budget, and the size of that budget varies significantly across models. A 500-line TypeScript file alone consumes approximately 4,000 tokens [42]. In practice, without optimization, a developer fills even the largest context window during sustained intensive work [43].
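The budget arithmetic is worth internalizing. A back-of-the-envelope sketch, assuming the ~8 tokens per line implied by the 500-line/4,000-token figure and a 50K-token reasoning reserve (both assumptions, not measured values):

```shell
# Back-of-the-envelope context budgeting. 8 tokens/line is an assumption
# derived from the 500-line ≈ 4,000-token figure; 50K reserve is assumed.
lines_per_file=500
tokens_per_line=8
file_tokens=$((lines_per_file * tokens_per_line))   # 4000 tokens per file

window=200000                       # a Haiku-sized 200K window
reserved=50000                      # headroom kept free for reasoning
usable=$((window - reserved))
files_before_full=$((usable / file_tokens))
echo "~${files_before_full} such files fill a 200K window (with 50K reserved)"
```

Under these assumptions, a few dozen medium-sized files exhaust a 200K window, which is why delegation to subagents and on-demand loading matter so much.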
| Model | Context Window | Max Output | Extended Thinking | Context Engineering Notes |
|---|---|---|---|---|
| Claude Opus 4.6 | 1,000,000 tokens | 128K tokens | Yes (adaptive) | 5× the window of Haiku; enables full-repository reasoning without compaction for most projects. Higher cost per token ($5/$25 per MTok in/out) makes context hygiene an economic concern even when the window is not technically exhausted. |
| Claude Sonnet 4.6 | 1,000,000 tokens | 64K tokens | Yes (adaptive) | Same window as Opus at lower cost ($3/$15 per MTok). Claude Code's default model for most coding tasks. The lower max output (64K vs 128K) means very large single-file generations may require chunking. |
| Claude Haiku 4.5 | 200,000 tokens | 64K tokens | Yes (manual only) | Used internally by Claude Code for safety classification of bash commands. The 200K window is the binding constraint when Haiku is used as the primary model; compaction becomes critical around 30 minutes of intensive work. |
The 1M-token window on Opus and Sonnet represents a 5× expansion over the previous generation and fundamentally changes context strategy: entire medium-sized codebases (∼750K lines) can fit within a single session. However, more context does not mean better performance. The governing constraint is not just window size but context quality: irrelevant tokens dilute attention regardless of how much capacity remains.
Compaction: The Safety Net
Automatic compaction triggers when context usage reaches approximately 92–95% of the window. The
compaction engine (wU2) summarizes earlier context while preserving essential information like
session names, plan mode state, and custom configurations [46]. A
counter-intuitive finding: triggering compaction earlier (around 75% utilization) actually extends
productive session length, because the remaining 25% provides sufficient working memory for high-quality
reasoning. Claude Code appears to have adopted this approach, reserving approximately 50K tokens for active
reasoning [47].
CLAUDE.md fully survives compaction. After /compact, Claude re-reads CLAUDE.md
from disk and re-injects it fresh. If an instruction disappeared after compaction, it was given only in
conversation, not written to CLAUDE.md [48]. This is why persistent instructions
must live in files, not in chat.
Manual compaction via /compact accepts an optional parameter that guides summarization:
/compact "Preserve only the modifications on auth/ and the tests". PreCompact
hooks execute automatically just before compaction, enabling state preservation (backing up
transcripts, logging critical context) before information is lossy-compressed [42].
Since Claude Code version 2.0, the compaction engine preserves structured prompts with 92% fidelity versus 71%
for narrative prompts [45].
Advanced Context Strategies
Multi-session horizontal scaling is the strategy of distributing work across several specialized Claude Code instances, each with its own context window. One terminal for backend, another for frontend, a third for tests. Average response time drops from 8.2 seconds to 2.1 seconds with three targeted sessions [45].
Plan mode (Shift+Tab) restricts the agent to read-only access. It reads your
codebase, proposes a strategy, and waits for validation before executing. This reduces total token consumption
by 40–60% on complex tasks by separating the thinking phase from the execution phase [42].
At the API level, context editing provides fine-grained control: the
clear_tool_uses strategy automatically clears the oldest tool results in chronological order when
context grows beyond a threshold. This is distinct from compaction (which summarizes); context editing
surgically removes stale tool outputs that have already been processed [49]. For long-running agentic workflows, Anthropic recommends combining
both: compaction for conversation management, and context editing for tool result hygiene.
1.4 Memory Architecture: CLAUDE.md and Auto Memory
CLAUDE.md supports a hierarchy of scopes: ~/.claude/CLAUDE.md (global preferences)
→ project-root CLAUDE.md (repository-wide conventions) → subdirectory CLAUDE.md files
(module-specific rules). This mirrors configuration inheritance patterns from tools like
.editorconfig or tsconfig.json. For large teams, .claude/rules/ files
provide modular instruction sets that load alongside CLAUDE.md [48].
Instructions that outgrow a single file can be split out via @path imports or .claude/rules/ files.
Auto memory lets Claude learn from your corrections without manual effort. When you correct
Claude's behavior (e.g., "No, we always use named exports"), Claude can write that correction to its auto memory
directory as plain markdown files you can read, edit, or delete via /memory. Subagents can
maintain their own auto memory [48]. This creates a compounding learning loop: each
session benefits from the corrections of all previous sessions.
The /init command generates a draft CLAUDE.md by analyzing your project structure. Most
practitioners start with /init and aggressively prune. Over time, the most valuable additions come
from code reviews: when a PR reveals an undocumented convention, that is a signal to update CLAUDE.md [6].
1.5 Model Context Protocol (MCP): External Tool Integration
MCP servers communicate via three transport protocols: stdio (local processes via
stdin/stdout), SSE/HTTP (remote streaming endpoints), and HTTP
(non-streaming). Configuration lives in a .mcp.json file at the project root, which can be
committed to git for team sharing [8].
```json
// .mcp.json - Example JIRA + GitHub integration
{
  "mcpServers": {
    "jira": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-jira"],
      "env": {
        "JIRA_HOST": "${JIRA_HOST}",
        "JIRA_API_TOKEN": "${JIRA_API_TOKEN}"
      }
    },
    "github": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" }
    }
  }
}
```
MCP tools appear as regular tools in the permission system and follow a naming convention:
mcp__<server>__<tool> (e.g., mcp__github__search_repositories). This means
hooks can match them via regex: mcp__memory__.* matches all tools from a memory server [11].
When many MCP tools are configured, tool definitions can consume a significant portion of the context window. Claude Code addresses this with MCP tool search, which dynamically loads tools on-demand when tool descriptions would consume more than 10% of the context [7]. An even more aggressive optimization: compiling MCP servers to Skills can reduce token consumption by 80–98% by converting dynamic tool definitions into static instruction sets [21].
1.6 Agent Skills: On-Demand Domain Expertise
Agent Skills are folders containing a SKILL.md descriptor and optional scripts. Unlike
slash commands (which are user-triggered), skills activate automatically when their description matches the
current task context. They function as on-demand domain expertise that loads contextually [10].
slash commands (which are user-triggered), skills activate automatically when their description matches the
current task context. They function as on-demand domain expertise that loads contextually [10].
Skills solve the fundamental problem of context budget allocation: you cannot put all domain expertise in the system prompt without destroying the context budget. A custom UI library's usage patterns, a specific testing framework's conventions, or a deployment pipeline's requirements are valuable when relevant but wasteful when not. The skill character budget scales with the context window: approximately 2% of total context. Users with larger context windows can see more skill descriptions without truncation [46].
Skills reside in .claude/skills/. Each folder contains a SKILL.md with YAML
frontmatter (name, description) followed by instructional content. The description field is critical: it
determines when the skill activates via semantic matching. Skills are injected into context via the
tool_result mechanism, loading them as if they were the output of a tool call rather than as part
of the system prompt. This is an important distinction: it means skills can be loaded and unloaded dynamically
without invalidating the prompt cache [50].
The decision framework for choosing among the three context mechanisms:

- CLAUDE.md: Always-on context loaded at session start. Use for conventions, build commands, architecture rules. Cost: permanent token consumption.
- Skills: Loaded on-demand when semantically relevant. Use for domain-specific instructions that are only sometimes needed. Cost: token consumption only when active.
- Hooks: Deterministic shell commands at lifecycle points. Use for actions that must always happen regardless of model judgment. Cost: zero token impact (runs outside the model) [36].
1.7 Hooks: Deterministic Lifecycle Control
Claude Code exposes 21 lifecycle events across eight categories and four handler types. The critical distinction: hooks are deterministic (guaranteed execution), while CLAUDE.md instructions are probabilistic (model-dependent adherence). If an action must happen every time, it belongs in a hook [36].
The four handler types are: command (type: "command") for shell scripts that
receive JSON on stdin; HTTP (type: "http") for POSTing event JSON to remote
endpoints (useful for external logging and webhooks); prompt (type: "prompt") for
single-turn LLM evaluation using a fast Claude model; and agent (type: "agent")
for spawning a subagent with tool access (Read, Grep, Glob) to perform multi-turn verification before returning
a decision. Eight events support all four types; the remaining thirteen are command-only [11].
| Event | Fires When | Handler Types | Advanced Use |
|---|---|---|---|
| **Session Management** | | | |
| `SessionStart` | Session begins/resumes | command | Inject git status + TODO.md as context; set env vars via `CLAUDE_ENV_FILE` |
| `InstructionsLoaded` | CLAUDE.md or `.claude/rules/*.md` loaded | command | Audit which instruction files are loaded; compliance logging; track lazy loads |
| `SessionEnd` | Session terminates | command | Cleanup, logging, session statistics; 1.5s default timeout |
| **User Input** | | | |
| `UserPromptSubmit` | User submits prompt, before processing | all four | Prompt filtering, logging, input transformation; `decision: "block"` erases the prompt |
| **Tool Execution** | | | |
| `PreToolUse` | Before a tool call executes | all four | Block `rm -rf`, protect `.env` files; `updatedInput` modifies tool args |
| `PermissionRequest` | Permission dialog appears | all four | Auto-approve known-safe patterns; `updatedPermissions` applies "always allow" rules |
| `PostToolUse` | After tool succeeds | all four | Auto-format with Prettier after `Write\|Edit`; auto-lint; `updatedMCPToolOutput` for MCP tools |
| `PostToolUseFailure` | After tool fails | all four | Structured error logging; provide corrective `additionalContext` to Claude |
| **Notifications** | | | |
| `Notification` | Claude Code sends a notification | command | Custom notification routing; Slack alerts for `permission_prompt` or `idle_prompt` |
| **Agent Management** | | | |
| `SubagentStart` | Subagent spawned | command | Inject context into subagents via `additionalContext`; log subagent spawning |
| `SubagentStop` | Subagent finishes | all four | Validate subagent output; `decision: "block"` prevents stopping |
| `Stop` | Main agent finishes responding | all four | Generate summary, notify Slack; `decision: "block"` forces Claude to continue |
| `TeammateIdle` | Agent team teammate goes idle | command | Quality gate for agent teams; require build artifacts before stopping |
| `TaskCompleted` | Task marked completed | all four | Quality gate: exit code 2 blocks completion + feeds stderr as feedback |
| **Configuration & Cleanup** | | | |
| `ConfigChange` | Config file changes mid-session | command | Audit settings changes; block unauthorized modifications (except managed policy) |
| `WorktreeCreate` | Worktree created via `--worktree` | command | Custom VCS support (SVN, Perforce); hook prints worktree path to stdout |
| `WorktreeRemove` | Worktree removed at session exit | command | Cleanup worktree state; archive changes; failures logged in debug mode only |
| **Context Management** | | | |
| `PreCompact` | Before context compaction | command | Back up transcripts; matcher distinguishes auto vs manual |
| `PostCompact` | After compaction completes | command | Log compaction summaries; update external state from `compact_summary` |
| **MCP Integration** | | | |
| `Elicitation` | MCP server requests user input | command | Auto-respond to MCP input requests; `action: "accept"` with form values skips dialog |
| `ElicitationResult` | User responds to MCP elicitation | command | Override or audit user responses before they reach the MCP server |
The exit code protocol is the key to hook power: exit 0 means "proceed normally" (with optional
JSON output for decision control), while exit code 2 means "block the action and feed stderr back to the
model as feedback". This creates closed-loop quality gates. For example, a TaskCompleted
hook can run the test suite and, on failure, exit 2 with the error message. The model receives the test
failures as feedback and continues working to fix them [11].
```bash
#!/bin/bash
# .claude/hooks/quality-gate.sh — TaskCompleted hook
INPUT=$(cat)
TASK_SUBJECT=$(echo "$INPUT" | jq -r '.task_subject')

if ! npm test 2>&1; then
  echo "Tests not passing. Fix failing tests before completing: $TASK_SUBJECT" >&2
  exit 2  # Block completion, feed error back to model
fi
exit 0
```
Hook configuration is snapshotted at session start for security: if a malicious CLAUDE.md or prompt injection
modifies the settings file mid-session, the changes are not applied until the user reviews them in
/hooks. Enterprise deployments can enforce policy hooks organization-wide [51].
1.8 Subagents, Agent Teams, and Worktrees
Subagents: Isolated Context Workers
Subagents are spawned via the Task tool (internally dispatch_agent). Each subagent
runs in its own context window with its own set of allowed tools, completely isolated from the main conversation
[52]. When a subagent completes, only its summary returns to the
parent. The verbose intermediate outputs (test logs, file contents, search results) remain contained within
the subagent. This is the primary mechanism for managing context pollution: delegate context-heavy operations
to a subagent and receive only the distilled result.
Two built-in subagent types exist: Explore (read-only codebase research, triggered
automatically for open-ended questions) and general-purpose (full tool access for
implementation tasks). Custom subagents are defined as markdown files in .claude/agents/ with YAML
frontmatter specifying name, description, tools, and whether the agent is proactive (auto-triggered
when relevant) or explicit-only [52]. A critical constraint: subagents cannot
spawn other subagents. If your workflow requires nested delegation, chain subagents from the main conversation
[52].
`.claude/agents/security-reviewer.md`:

```markdown
---
name: security-reviewer
description: Security review specialist for authentication, injection, and credential exposure
tools:
  - Read
  - Grep
  - Glob
  - TodoWrite
proactive: true
---

You are an expert security reviewer. When invoked:

1. Scan for hardcoded credentials and API keys
2. Check authentication and authorization logic
3. Review input validation and sanitization
4. Assess error handling for information leakage
5. Flag dependency vulnerabilities

Return findings with severity ratings and remediation steps.
```
Agent Teams: Coordinated Multi-Session Work
Shipped as an experimental feature with Opus 4.6 in February 2026, Agent Teams extend subagents to coordinated multi-session work [53]. Unlike subagents (which report results back in isolation), team members can share findings, challenge assumptions, and coordinate directly with each other via a shared task list and mailbox-based messaging system.
The architecture: one session acts as the team lead, coordinating work and synthesizing
results. Teammates work independently, each in its own context window, communicating through direct messaging.
Think of subagents as contractors you send on separate errands; Agent Teams are a project team sitting in the
same room [53]. Teams are enabled via
CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 and created with natural language: "Set up a team: one agent
handles the backend API routes, one builds the frontend login forms, one writes the integration tests."
Worktrees: Filesystem Isolation
When multiple agents edit the same files simultaneously, you get merge conflicts, race conditions, and corrupted
state. Worktrees solve this by giving each agent its own copy of the repository via
git worktree. The --worktree flag creates a new directory with its own branch, files,
and git index [54]. Changes merge back through standard git workflows.
Custom subagents can specify isolation: worktree in their frontmatter to automatically run in
isolated worktrees [12]. Agent Teams can be combined with worktrees: teammates
coordinate tasks through messaging while maintaining full filesystem isolation.
1.9 Extended Thinking and Model Selection
Claude Code supports extended thinking, a mode in which the model allocates internal compute to reason through complex solution spaces before producing output. Thinking is triggered by keywords embedded in prompts, each allocating a different budget [55]:
| Keyword | Budget | Best For |
|---|---|---|
| `think` | ~4K tokens | Routine tasks needing modest deliberation |
| `think hard` | ~10K tokens | Multi-step problems, bug investigation |
| `megathink` | ~20K tokens | Design work, API design, architecture |
| `ultrathink` | ~32K tokens | Complex architecture decisions, deep security review |
Opus 4.6 introduced adaptive thinking: the model automatically determines reasoning depth based on problem complexity, eliminating the need for manual keyword selection in most cases [36]. The latest models also support interleaved thinking, which reasons between tool calls rather than only at the start, enabling the model to adjust its approach based on intermediate results [55].
The active model can be switched at any time via /model
during a session. In headless CI/CD pipelines, specify it via the --model flag [36].
1.10 Headless Mode and CI/CD Integration
The -p (or --print) flag transforms Claude Code from an interactive assistant into a
programmable Unix utility that plays nicely with pipes, redirects, and shell scripts [56]. The Agent SDK (previously called "headless mode") is available as a CLI, or
as Python and TypeScript packages for full programmatic control with structured outputs, tool approval
callbacks, and native message objects.
```bash
# Code review as a CI pipeline step
git diff origin/main...HEAD | claude -p \
  "Review this diff for bugs, security issues, and performance concerns. \
Format as JSON with severity ratings." \
  --output-format json \
  --allowedTools Read,Grep,Glob \
  --max-turns 5 > review.json

# Quality gate: block merge on critical findings
if jq -e '.[] | select(.severity == "critical")' review.json > /dev/null; then
  echo "Critical issues found. Blocking merge."
  exit 1
fi
```
Three output formats serve different automation needs: text (human-readable, default),
json (structured with token stats and session metadata), and stream-json (real-time
streaming for progress monitoring) [57]. The --allowedTools flag
restricts what tools the agent can use. This is critical for CI/CD pipelines where read-only tools like
Read,Grep,Glob are appropriate but
Bash or Write are not.
The /batch command extends this to codebase-wide migrations. It launches an
internal orchestrator that investigates the codebase, decomposes work into 5–30 independent units, and
processes them in parallel via worktree-isolated agents: /batch migrate from React to Vue,
/batch add type annotations to all untyped functions [58].
For Docker-based CI, Claude Code runs in a Node.js Alpine image with approximately 225 MB total footprint.
Session persistence across pipeline steps is handled via --session-id for naming sessions and
--resume for continuing them [57].
1.11 Security Architecture
Claude Code implements a defense-in-depth security model with four permission modes, sandboxing, and active prompt injection detection [59]:
| Mode | Behavior | Use Case |
|---|---|---|
| Normal (default) | Explicit approval for every write, shell command, network call | Interactive development, unfamiliar codebases |
| Plan | Read-only access; no modifications allowed | Codebase auditing, refactoring strategy |
| Auto-accept | File reads/writes auto-approved; shell commands still require approval | Routine development (40% speedup per Anthropic benchmarks) |
| Bypass | All confirmations removed | Trusted CI/CD only; never on unaudited repositories |
Sandboxing is enabled by default on macOS (via Seatbelt) and Linux (via bubblewrap), adding
less than 15ms of latency while isolating the filesystem [60]. Write access is
restricted to the folder where Claude Code was started and its subfolders. The permission system uses granular
allow/deny rules in settings.json that support regex patterns: Bash(npm *) allows only
npm commands, while Bash() would allow all shell commands (a common and dangerous misconfiguration)
[60].
Prompt injection defense operates at multiple layers: the Haiku safety classifier pre-screens bash commands, the
permission system gates all side-effecting operations, context-aware analysis detects embedded malicious
instructions, and a command blocklist blocks risky network commands like curl and wget
by default [59]. A documented CVE (CVE-2025-55284) demonstrated API key theft via
DNS exfiltration, underscoring that these layers are necessary, not theoretical [61].
1.12 Prompting Strategies for Agentic Coding
Prompting an agentic system differs fundamentally from prompting a chatbot. The most effective pattern, validated across Anthropic's internal teams, follows a four-phase workflow: Explore, Plan, Code, Commit [4].
Explore (Read-Only)
Instruct Claude to research the codebase without writing any code. "Look at the authentication module in src/auth/ and the related tests. Understand the patterns used. Do NOT write any code yet." This front-loads context and prevents premature implementation.
Plan
Ask Claude to propose an implementation strategy. "Now create a plan for adding OAuth2 support. Identify which files need changes, what new files are needed, and any migration concerns. Do not implement yet." Review this plan before proceeding.
Code
With context loaded and plan approved, direct Claude to implement. "Implement the OAuth2 support following your plan. Start with the provider configuration, then the callback handler, then the session integration." Specificity reduces drift.
Commit
Have Claude write a commit message and prepare the change for review. Many Anthropic engineers delegate over 90% of their git operations to Claude, including commit messages, branch management, and PR descriptions [4].
Be specific on first attempt. "Reference the existing widget implementations on the homepage, especially HotDogWidget.php. Implement a new calendar widget following that pattern" dramatically outperforms "add a calendar widget" [13].
Manage context deliberately. Use /compact with explicit summaries before
context fills. For long-running sessions, break work into 30-minute sprints with compaction between them [43].
Trigger appropriate thinking depth. Use think for routine tasks,
ultrathink for architecture decisions. Match the thinking budget to the problem complexity rather
than defaulting to maximum depth [55].
Use checklists as inline verification. "Before committing, verify: (1) all tests pass, (2)
no console.log statements remain, (3) error handling covers the edge cases in the spec." Better yet, encode
these as a TaskCompleted hook so they are enforced deterministically.
Delegate to subagents for context hygiene. Any operation that produces large output (running tests, fetching documentation, processing logs) should be delegated to a subagent so only the summary enters the main conversation's context [52].
Specification-Driven Development
SDD emerged as the antithesis of "vibe coding," the pattern where developers describe goals conversationally and receive code in return. As the Thoughtworks analysis notes, SDD may not have the visibility of a term like vibe coding, but it is one of the most important practices to emerge in 2025 [14]. The core insight is that AI coding agents are literal-minded pair programmers: they excel at pattern recognition but need unambiguous instructions. We should not treat them like search engines [15]. By the time GitHub Spec Kit accumulated 72,000+ stars and 110 releases through February 2026, SDD had become a first-class engineering practice supported by every major AI coding platform [62].
The introductory framing above establishes what SDD is. The remainder of this section treats SDD as what InfoQ more precisely characterizes it: not merely a methodology analogous to TDD, but an architectural pattern that inverts the traditional source of truth by elevating executable specifications above code itself [63]. We formalize the underlying model, examine the constitution pattern's role as an invariant enforcement mechanism, analyze the spec maturity spectrum, survey failure modes, and provide a practitioner's guide to enterprise adoption.
2.1 Formal Characterization
A specification in SDD is more than a Product Requirements Document (PRD). Technically, a specification should explicitly define the external behavior of the target software: input/output mappings, preconditions/postconditions, invariants, constraints, interface types, integration contracts, and sequential logic/state machines [14]. Formally, we can characterize SDD as a transformation pipeline:
Let I = human intent (vague, incomplete), S = specification (precise, structured), P = implementation plan, T = task decomposition, C = code. Traditional development performs the mapping I → C directly via human cognition. SDD decomposes this into:
I → S → P → T → C
where each arrow represents a verifiable transformation step. Crucially, S is the invariant: code can be regenerated from S, but S cannot be reconstructed from C without information loss.
This decomposition provides a fundamental advantage: each transformation can be independently verified. You can check whether S faithfully captures I (specification review), whether P correctly operationalizes S (plan review), whether T correctly decomposes P (task review), and whether C satisfies T (testing). In the vibe coding paradigm, all these transformations happen implicitly inside a single prompt-response pair, making failure diagnosis nearly impossible.
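The decomposition above can be sketched as a pipeline of checked transformations. This is an illustrative model, not any tool's API; the stage names and toy transforms are assumptions made for the example:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    """One arrow in I -> S -> P -> T -> C: a transform plus its review check."""
    name: str
    transform: Callable[[str], str]
    verify: Callable[[str, str], bool]  # (input, output) -> did output capture input?

def run_pipeline(intent: str, stages: list[Stage]) -> str:
    artifact = intent
    for stage in stages:
        produced = stage.transform(artifact)
        # Each transformation is independently verifiable, so a failure is
        # localized to one stage instead of hiding inside a single
        # prompt-response pair.
        if not stage.verify(artifact, produced):
            raise ValueError(f"verification failed at stage: {stage.name}")
        artifact = produced
    return artifact

# Toy stages: each "transform" refines the upstream artifact; each "verify"
# checks the upstream artifact is still reflected downstream.
stages = [
    Stage("specify",   lambda i: f"SPEC({i})",  lambda i, o: i in o),
    Stage("plan",      lambda s: f"PLAN({s})",  lambda s, o: s in o),
    Stage("tasks",     lambda p: f"TASKS({p})", lambda p, o: p in o),
    Stage("implement", lambda t: f"CODE({t})",  lambda t, o: t in o),
]
result = run_pipeline("login feature", stages)
```

In a real workflow the verify steps are spec review, plan review, task review, and testing; the point of the sketch is only that each arrow has its own check.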
The real-world consequences of skipping this decomposition are stark. According to TechCrunch, 25% of Y Combinator's Winter 2025 cohort shipped codebases that were 95% AI-generated, yet teams were drowning in technical debt, security holes, and implementations that compiled but did not actually solve the right problems [64]. SDD exists because the cost of a vague specification is not a vague implementation. Vague specifications produce confidently wrong implementations.
2.2 The Specification Maturity Spectrum
Not all SDD implementations are created equal. Martin Fowler's analysis identifies three maturity levels that represent fundamentally different relationships between specification and code [65]:
| Maturity Level | Spec Role | Code Role | Editing Model | Drift Risk |
|---|---|---|---|---|
| Spec-First | Written upfront, used for the task at hand | Source of truth after generation | Humans edit code directly | High (spec becomes stale) |
| Spec-Anchored | Maintained throughout lifecycle; changes start with the spec | Generated output, but may be edited | Spec → regenerate affected code | Medium (drift caught in CI) |
| Spec-as-Source | The only artifact humans edit | Transient byproduct; never manually edited | Only spec is edited; code is fully regenerated | Low (spec is the system) |
Most current tools target spec-anchored, where the specification is the living source of truth that evolves with the project, and code is regenerated when the spec changes. GitHub Spec Kit and AWS Kiro operate at this level. Tessl (in private beta) is exploring spec-as-source, where generated files are explicitly marked as non-editable and only the specification can be modified by humans [65]. The Thoughtworks perspective adds a dissenting nuance: more traditional technologists argue that executable code should remain the source of truth, with specifications serving as generation drivers similar to how tests drive code in TDD. The debate over which artifact is "ultimate" remains unresolved and leads to fundamentally different workflows [14].
2.3 The Five Phases
Phase 1: Specify. You provide a high-level description of what you are building and why. The
coding agent generates a detailed specification focusing on user experience, success criteria, and functional
requirements. The key instruction: focus on the "what" and "why," not the technical details [15]. The result is a spec.md file that serves as the contract. Spec Kit
adds a /speckit.clarify step here: a structured, coverage-based questioning workflow that
records answers in a Clarifications section, reducing rework downstream before the plan phase begins [16]. The resulting file is written to a feature-specific directory:
.specify/specs/001-feature-name/spec.md.
Phase 2: Plan. You provide high-level technical direction (stack preferences, architectural
constraints, existing patterns). The agent generates a detailed implementation plan including data models, API
contracts, component architecture, and a research document covering technology-specific concerns [16]. The output includes plan.md, data-model.md,
research.md, and API contract files. All of these are written to the same feature directory
(e.g., .specify/specs/001-feature-name/plan.md). This phase is where the agent performs
codebase-aware
research by analyzing existing patterns, dependencies, and conventions before proposing architecture. In Spec
Kit, dependency management is handled explicitly: tasks are ordered to respect dependencies between components
(models before services, services before endpoints), with parallel-safe tasks marked with [P] [16].
Phase 3: Tasks. The agent decomposes the specification and plan into small, reviewable,
independently testable work units, written to .specify/specs/001-feature-name/tasks.md.
Instead of "build authentication," you get concrete tasks like "create a user
registration endpoint that validates email format" [15]. Each task maps to a user
story with explicit completion criteria. This is analogous to test-driven development for AI: each task is
something the agent can complete and validate independently.
Phase 4: Implement. The agent tackles tasks one by one (or in parallel via agent teams and
worktrees). Instead of reviewing thousand-line code dumps, you review focused changes that solve specific
problems. The agent knows what to build (specification), how to build it (plan), and what to work on next
(tasks) because the /speckit.implement command template instructs it to read all three files
from the feature directory before generating code [15].
Phase 5: Validate. This phase, absent from the original four-phase model but critical for enterprise adoption, closes the loop. Validation combines automated tests (unit, integration, acceptance), BDD scenario execution against the implementation, drift detection (does the code still conform to the spec?), and human review for non-functional requirements [66]. Modern SDD embeds spec validation into CI/CD, checking every commit against the specification so drift is caught immediately rather than during quarterly reviews. Tools like Specmatic generate mock servers from specs and validate that implemented services match their contracts in CI. Any deviation fails the build [66].
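One lightweight form of drift detection can be sketched as a fingerprint comparison: stamp the spec's content hash into the generated artifact at generation time, and fail CI when the current spec no longer matches that stamp. This is a simplification (tools like Specmatic validate actual contract behavior, not hashes), and the stamping convention here is an assumption for illustration:

```python
import hashlib

def spec_fingerprint(spec_text: str) -> str:
    """Short content hash identifying the exact spec version."""
    return hashlib.sha256(spec_text.encode()).hexdigest()[:12]

def check_drift(spec_text: str, recorded_fingerprint: str) -> bool:
    """CI gate: True when the code was generated from the current spec.

    The fingerprint would be stamped into generated code at generation time
    (e.g. in a header comment). A mismatch means the spec has changed since
    the code was generated, so regeneration (or review) is required.
    """
    return spec_fingerprint(spec_text) == recorded_fingerprint

spec_v1 = "REQ-001: reject orders exceeding the credit limit\n"
stamp = spec_fingerprint(spec_v1)                      # stamped at generation time
spec_v2 = spec_v1 + "REQ-002: log rejected orders\n"   # spec later evolves

in_sync = check_drift(spec_v1, stamp)   # code matches the spec it came from
drifted = check_drift(spec_v2, stamp)   # spec moved on; build should fail
```

Note this only catches spec-side drift; out-of-band code edits need a complementary check (e.g. hashing generated files as well).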
2.4 The Constitution Pattern: Architectural Invariants
At the heart of SDD lies the constitution, a set of immutable principles that govern how
specifications become code. The constitution lives at .specify/memory/constitution.md and acts as
the architectural DNA
of the system, ensuring that every generated implementation maintains consistency, simplicity, and quality [67]. Unlike CLAUDE.md, which is loaded into context automatically at
session start,
the constitution is consumed on demand by Spec Kit slash commands. When you run
/speckit.plan,
the command template instructs the agent to read the constitution file before generating the plan. When you run
/speckit.analyze, the template instructs the agent to validate outputs against the constitution.
The agent discovers it by path, not by magic. By analogy to political constitutions that constrain governmental
action,
software constitutions constrain code generation to produce implementations that are correct by construction [68].
Spec Kit's constitution defines nine articles covering principles such as: every feature must begin as a standalone library (forcing modular design), test-first with BDD/Gherkin, vertical slice architecture, observability by default, security by design, and dependency management [69]. The constitution's power lies in its immutability: while implementation details evolve, core principles remain constant. This provides consistency across time (code generated today follows the same principles as code generated next year), consistency across LLMs (different AI models produce architecturally compatible code), and architectural integrity (every feature reinforces rather than undermines system design) [67].
Research on Constitutional Spec-Driven Development formalizes enforcement using RFC 2119 semantics: MUST (non-negotiable, build-breaking), SHOULD (strong recommendation, warning-level), and MAY (optional guidance). Each principle maps to specific CWE vulnerability identifiers and includes a rationale. A banking microservices case study demonstrated a 73% reduction in security vulnerabilities, 56% faster time to first secure build, and 4.3x improvement in compliance documentation coverage using constitutional constraints [68].
The constitution generalizes beyond security. Marri identifies four categories of constitutional extension [68]: Architectural Principles (layered separation, dependency inversion, bounded context boundaries), Operational Requirements (observability, health checks, graceful degradation), Performance Constraints (response time budgets, resource limits), and Compliance Requirements (data residency, audit logging, access control). These prevent AI-generated code from violating structural invariants that are difficult to detect through testing alone.
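The RFC 2119 enforcement mapping described above can be sketched as a small severity model. The names and message formats here are illustrative assumptions, not Spec Kit's or Marri's actual implementation:

```python
from enum import Enum

class Level(Enum):
    MUST = "must"      # non-negotiable: a violation breaks the build
    SHOULD = "should"  # strong recommendation: a violation emits a warning
    MAY = "may"        # optional guidance: never blocks

def enforce(violations: list[tuple[str, Level]]) -> tuple[bool, list[str]]:
    """Return (build_passes, messages) for a set of constitutional violations.

    Hypothetical helper: only MUST-level violations fail the build,
    mirroring RFC 2119 semantics applied to a project constitution.
    """
    passes = True
    messages = []
    for principle, level in violations:
        if level is Level.MUST:
            passes = False
            messages.append(f"ERROR: {principle} (MUST) violated -- build fails")
        elif level is Level.SHOULD:
            messages.append(f"WARNING: {principle} (SHOULD) violated")
        else:
            messages.append(f"NOTE: {principle} (MAY) not followed")
    return passes, messages

ok, msgs = enforce([
    ("Input validation on all endpoints", Level.MUST),
    ("Structured logging", Level.SHOULD),
])
```

A real enforcement pipeline would also attach each principle's CWE identifier and rationale, as the research describes.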
2.5 Writing Effective Specifications
GitHub's analysis of over 2,500 agent configuration files revealed a clear pattern: the most effective specifications cover six core areas [17]:
1. Commands. Put executable commands early, with full flags: npm test, pytest -v, npm run build. The agent references these constantly.
2. Testing. How to run tests, what framework, where test files live, what coverage expectations exist.
3. Project Structure. Where source code lives, where tests go, where docs belong. Be explicit: "src/ for application code, tests/ for unit tests."
4. Code Style. Naming conventions, formatting rules, import patterns, error handling approaches. Link to existing examples.
5. Git Workflow. Branch naming, commit message format, PR process, merge strategy.
6. Boundaries. A three-tier system: Always do (actions without asking), Ask first (require approval), Never do (hard stops) [17].
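The three-tier boundary system can be sketched as a simple classifier over proposed commands. The tier names, patterns, and strictest-first matching order are assumptions for illustration, not any tool's built-in behavior:

```python
import fnmatch

# Hypothetical three-tier boundary table, in the spirit of the
# always-do / ask-first / never-do split described above.
BOUNDARIES = {
    "always": ["npm test", "npm run lint*", "git status"],
    "ask":    ["git push*", "npm publish*"],
    "never":  ["rm -rf *", "git push --force*"],
}

def classify(command: str) -> str:
    """Map a proposed command to a tier; tiers are checked strictest-first,
    so a command matching both 'never' and 'ask' is treated as 'never'."""
    for tier in ("never", "ask", "always"):
        if any(fnmatch.fnmatch(command, pattern) for pattern in BOUNDARIES[tier]):
            return tier
    return "ask"  # unknown commands default to requiring approval
```

Defaulting unknown commands to "ask" keeps the boundary system fail-safe: the agent must earn autonomy explicitly, pattern by pattern.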
Structural Principles for Specification Design
Separate concerns into modular specs. Red Hat's guidance recommends separating specifications by concern: one spec for architecture, another for documentation, others for testing or security. This modular approach lets multiple "how" specs compose harmoniously while keeping each one tightly scoped [18]. A feedback loop through "lessons learned" files reduces agent errors over time.
Write for behavior, not implementation. Specifications should use domain-oriented ubiquitous language to describe business intent rather than technology-specific implementations. A spec that says "the system must reject orders exceeding the customer's credit limit" is superior to "add an if-statement in OrderService.java checking creditLimit > orderTotal" [14]. The former survives a language migration; the latter does not.
Embed preconditions, postconditions, and invariants. Drawing on Meyer's Design by Contract, effective SDD specifications explicitly define what must be true before an operation (preconditions), what must be true after (postconditions), and what must always hold (invariants). These map directly to testable assertions. The Thoughtworks analysis emphasizes that specifications should define input/output mappings, constraints, interface types, and sequential logic/state machines [14].
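The credit-limit rule from above maps directly onto contract clauses. A minimal sketch (function and parameter names are illustrative):

```python
def place_order(balance_owed: float, credit_limit: float, order_total: float) -> float:
    """Return the new balance owed after accepting an order.

    Each Design by Contract clause becomes a testable assertion:
    precondition  -- the order must not push the customer over their limit
    postcondition -- the balance grows by exactly the order total
    invariant     -- the balance never exceeds the credit limit
    """
    # Precondition: reject orders exceeding the customer's credit limit.
    assert balance_owed + order_total <= credit_limit, "order exceeds credit limit"
    new_balance = balance_owed + order_total
    # Postcondition: the balance reflects the accepted order.
    assert new_balance == balance_owed + order_total
    # Invariant: must hold after any operation on the account.
    assert new_balance <= credit_limit
    return new_balance
```

Because the clauses are stated behaviorally ("reject orders exceeding the credit limit") rather than as an if-statement in a named Java class, the same contract survives a reimplementation in another language.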
Use the Clarity Gate as a quality metric. If a different AI agent (or a fresh session) cannot generate functionally equivalent code from the same spec, then the spec has implicit assumptions baked in. Those assumptions will cause drift. Spec quality is inversely proportional to the number of implicit assumptions [23].
Specify what NOT to do. Constraints on prohibited behavior are as important as positive requirements. Conference practitioners formalize this as: be specific about edge cases, define explicit acceptance criteria, list constraints ("do NOT use global state"), reference existing code patterns, and include error handling expectations. Conversely, avoid describing implementation details, writing vague requirements ("make it fast"), assuming the agent knows your codebase, or mixing multiple features in one spec [69].
2.6 Tools, Frameworks, and the Ecosystem
| Tool | Creator | Maturity Level | Key Features |
|---|---|---|---|
| GitHub Spec Kit | GitHub | Spec-Anchored | 72.7K stars, 110 releases (Feb 2026). Constitution-based governance. Commands: /speckit.constitution, /speckit.specify, /speckit.clarify, /speckit.plan, /speckit.tasks, /speckit.implement. Supports 22+ AI platforms. Best for greenfield projects [16] [62]. |
| OpenSpec | Fission AI | Spec-Anchored | Maintains a top-level unified spec representing the live system. Better for brownfield/1→N projects. Commands: /opsx:propose, /opsx:apply, /opsx:verify, /opsx:archive. Faster iteration cycle than Spec Kit [19] [70]. |
| AWS Kiro | Amazon | Spec-Anchored | Agentic IDE with 3-phase workflow (Requirements, Design, Tasks) + Steering Docs (analogous to constitution). 250K developers in first 3 months. Deep AWS integration, strong brownfield support [20] [69]. |
| Tessl | Tessl | Spec-as-Source | Private beta. Generated files marked non-editable; only specs edited by humans. The most radical implementation of the "code is a transient byproduct" philosophy [65]. |
| Native Claude Code SDD | Anthropic | Spec-Anchored | No external framework required. Uses CLAUDE.md for project-level rules (distinct from a formal constitution), subagents for parallel research, Tasks system + worktrees for implementation delegation, hooks for enforcement. The agent reads and writes spec files via standard tools; there is no built-in awareness of any spec directory structure [21]. |
2.7 SDD and Context Engineering: Two Halves of One Problem
A specification without proper context delivery is a beautifully written document that the agent cannot properly implement. An emerging synthesis treats SDD and context engineering as inseparable [64]. SDD addresses what to build; context engineering addresses what information guides the building. Neither succeeds alone.
Specifications act as "super-prompts" that break down complex problems into modular components aligned with
agents' context windows [66]. But some knowledge is tacit: the senior developer's
intuition about which database queries will scale, the UX designer's understanding of user expectations. These
do not easily translate into specifications. Effective SDD must handle both explicit specifications and implicit
organizational knowledge through mechanisms like the constitution, Claude Code's auto memory
(stored in ~/.claude/projects/, distinct from Spec Kit's .specify/memory/),
and lessons-learned feedback files.
This intersection explains why multi-agent architectures are emerging as a natural complement to SDD: they distribute specifications across specialized agents, each with focused context, rather than overloading a single agent's window with the entire system's specification, constitution, plan, and task list simultaneously [64].
2.8 Enterprise Adoption: Failure Modes and Best Practices
Known Failure Modes
Specification drift. The most common failure. The spec says one thing; the code does another; nobody notices until production. SDD without CI-embedded validation is just documentation that ages badly. The fix: embed spec conformance checks in the build pipeline so drift fails the build [66].
Over-specification. The Thoughtworks Technology Radar (Volume 33) warns that current AI-driven spec workflows often involve overly rigid, opinionated processes. Experienced programmers find that over-formalized specs can slow down change and feedback cycles, reintroducing the rigidity that agile methods sought to escape [14]. The rule of thumb from conference practitioners: if you can explain the task in one sentence, skip the spec. If it takes two or more prompts to explain, write a spec [69].
Constitution over-adherence. Agents sometimes follow constitutional principles too eagerly, generating unnecessary complexity. One practitioner reported that a constitution article requiring library-first architecture caused the agent to generate duplicate class hierarchies where a simple function would suffice [65]. Constitutions need calibration against practical use.
Multi-repository coordination gaps. InfoQ notes a critical unsolved challenge: current tools typically keep specs co-located with code in a single repository, while modern architectures span microservices, shared libraries, and infrastructure repositories [62]. Cross-service specification coordination remains largely manual.
LLM non-determinism. Even structured specs can lead to varying outputs across regenerations. Techniques like property-based testing address this by automatically verifying that invariants from specs are satisfied regardless of implementation variation [66].
Proven Best Practices
Treat specs as living documents, not static blueprints. When something does not make sense, go back to the spec; when a project grows complex, refine it; when tasks feel too large, break them down [15]. The constitution can evolve through a documented amendment process requiring rationale, maintainer approval, and backwards compatibility assessment [67].
Use the Constitution as the zero-th phase. Before specifying any feature, establish your project's governing principles. The mandatory workflow becomes: Constitution → 𝄆 Specify → Plan → Tasks → Implement 𝄇 (the repeat symbol indicates the cycle runs for each feature while the constitution remains stable) [69].
Parallelize specification and implementation. SDD is not waterfall. Within the specification phase, AI can help flesh out edge cases. Within the planning phase, parallel research subagents can investigate technology-specific questions. Within implementation, tasks with no dependencies can execute concurrently across agent teams [16].
Prioritize "human reviewable" spec sizes. From a review burden perspective, keeping specs human-reviewable in terms of size matters. Sheer volume can make detailed review daunting. Specification styles that facilitate meaningful conversation promote better dialogue and thinking through solutions in concert with AI, rather than rubber-stamping large generated artifacts [71].
Scale adoption by problem size. Small features (single service) use focused specification-to-implementation workflows. Medium systems (multi-service) add constitution-based governance, typically requiring 2–4 weeks for phased integration. Large systems require multi-agent orchestration, decomposition pipelines, and constitutional governance [62].
Bridge SDD with compliance frameworks. The EU AI Act requires high-risk AI systems to comply with obligations starting August 2, 2026, with fines up to €35 million or 7% of global annual turnover. SDD specifications, particularly constitutional documents with CWE mappings and enforcement levels, serve as compliance evidence and audit trails. Organizations with strong AI governance are approximately 25–30% more likely to achieve positive AI outcomes [62].
2.9 File Layout and Agent Discovery
A common question when moving from SDD theory to practice: where do these files actually live, and how does the
agent find them? The critical point to internalize first: Claude Code has no native awareness of
the .specify/ directory. It will not walk that folder, auto-load those files, or treat
them specially. The .specify/ namespace is entirely a Spec Kit convention. The entire integration
between Claude Code and Spec Kit is mediated by two mechanisms: slash command templates
(markdown files in .claude/commands/ that instruct the agent which files to read and write) and
@path imports in CLAUDE.md (which inline file contents into the
agent's context at session start). Without one of these two bridges, a file in .specify/ is
invisible to the agent.
| File | Canonical Location | Discovery Mechanism | Loaded When? |
|---|---|---|---|
| CLAUDE.md | Project root ./CLAUDE.md or ./.claude/CLAUDE.md | Claude Code walks the directory tree upward from the working directory, loading every CLAUDE.md it finds | Always, at session start |
| .claude/rules/*.md | .claude/rules/ (recursive) | Claude Code loads unconditional rules at launch; path-specific rules load when the agent touches matching files | Always (unconditional) or on-demand (path-filtered via YAML paths: frontmatter) |
| constitution.md | .specify/memory/constitution.md | Spec Kit slash commands instruct the agent to read this path explicitly; can also be imported via @.specify/memory/constitution.md in CLAUDE.md for always-on loading | On demand (via slash command) or always (via @path import) |
| spec.md | .specify/specs/NNN-feature-name/spec.md | The slash command template tells the agent to read this path using its standard Read tool; there is no automatic resolution. The agent follows the instruction because the prompt says to. | On demand, during the specify/plan/analyze phases |
| plan.md | .specify/specs/NNN-feature-name/plan.md | Same as spec.md; co-located in the feature directory | On demand, during the plan/implement phases |
| tasks.md | .specify/specs/NNN-feature-name/tasks.md | Same as spec.md; co-located in the feature directory | On demand, during the tasks/implement phases |
In practice, the canonical project layout for a team using Claude Code with Spec Kit looks like this:
project-root/
├── CLAUDE.md # Always-on agent context
├── .claude/
│ ├── settings.json # Permission rules, model config
│ ├── commands/ # Spec Kit slash commands
│ │ ├── speckit.constitution.md
│ │ ├── speckit.specify.md
│ │ ├── speckit.plan.md
│ │ ├── speckit.tasks.md
│ │ ├── speckit.implement.md
│ │ └── speckit.analyze.md
│ ├── agents/ # Custom subagent definitions
│ │ └── security-reviewer.md
│ ├── rules/ # Conditional and unconditional rules
│ │ └── api-design.md
│ └── hooks/ # Lifecycle hooks (quality gates)
│ └── quality-gate.sh
├── .specify/
│ ├── memory/
│ │ └── constitution.md # Architectural invariants
│ ├── templates/ # Spec Kit templates
│ └── specs/
│ ├── 001-user-auth/
│ │ ├── spec.md # Feature specification
│ │ ├── plan.md # Implementation plan
│ │ ├── tasks.md # Task decomposition
│ │ └── research.md # Technology research
│ └── 002-payment-flow/
│ ├── spec.md
│ ├── plan.md
│ └── tasks.md
└── src/ # Implementation code
The @path Bridge. The simplest integration is an @path import in your CLAUDE.md. Adding
@.specify/memory/constitution.md to your CLAUDE.md ensures the constitution is always in
context. For active features, adding @.specify/specs/001-feature-name/spec.md temporarily
during implementation ensures the agent always has the current spec visible. Remove these imports when the
feature ships to reclaim context budget.
Verification & Validation for AI-Generated Code
3.1 The Verification Problem at Scale
As autonomous coding systems proliferate, the volume of produced code quickly exceeds the limits of thorough human oversight. OpenAI's alignment team articulated the core tension: we cannot assume that code-generating systems are trustworthy or correct; we must check their work [25]. Empirical data underscores this urgency: a 2024 study of 733 Copilot-generated snippets found that 29.5% of Python and 24.2% of JavaScript snippets contained security weaknesses [26].
The challenge is compounded by a phenomenon that Bright Security terms review degradation: over time, teams treat AI-generated code as boilerplate, and developers may lose awareness of secure coding principles if they rely too heavily on AI for decisions [27]. Traditional line-by-line code review simply does not scale to the volume of AI-generated output. We need a structured, abstract approach.
3.2 The Multi-Layer V&V Framework
Like the Swiss Cheese Model from safety engineering, no single evaluation layer catches every issue. Anthropic's evals team recommends combining multiple methods, each covering different failure modes [28]. The following framework synthesizes best practices from Anthropic, OpenAI, GitHub, and the formal verification community into a six-layer model.
Layer 1: Static Analysis & Linting. The fastest, cheapest layer. Catches syntax errors, style violations, type mismatches, and known vulnerability patterns. Tools include ESLint, Pylint, mypy, CodeQL, and Semgrep. This layer can be enforced automatically via Claude Code hooks on every file write [11].
Layer 2: Automated Testing. Unit tests, integration tests, and end-to-end tests verify functional correctness. Simon Willison notes that a robust test suite gives AI agents superpowers because they can validate and iterate quickly when tests fail [17]. The SDD methodology ensures that test expectations are derived from the specification, not invented by the implementing agent.
Layer 3: Property-Based & Contract Testing. Rather than testing specific inputs and outputs, property-based testing (PBT) generates thousands of random inputs and verifies that invariant properties hold. This catches edge cases that example-based tests miss. Agentic PBT systems can now synthesize candidate properties from code analysis, translate them into executable Hypothesis tests, and refine properties based on counterexamples [29]. Contract testing (via tools like Pact) verifies that API consumers and providers agree on interface contracts.
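The core idea of Layer 3 can be sketched with a stdlib-only property check; a real project would use Hypothesis, which also generates inputs more cleverly and shrinks counterexamples to minimal failing cases. The function under test here is an assumption chosen for illustration:

```python
import random

def dedupe(items: list[int]) -> list[int]:
    """Function under test: remove duplicates, preserving first-seen order."""
    seen: set[int] = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_properties(trials: int = 1000) -> int:
    """Verify invariants over many random inputs instead of hand-picked examples."""
    rng = random.Random(0)  # fixed seed so failures are reproducible
    for _ in range(trials):
        items = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        result = dedupe(items)
        # Invariant 1: no duplicates in the output.
        assert len(result) == len(set(result))
        # Invariant 2: same set of elements as the input.
        assert set(result) == set(items)
        # Invariant 3: idempotence -- deduping twice changes nothing.
        assert dedupe(result) == result
    return trials

checked = check_properties()
```

The properties ("no duplicates", "same elements", "idempotent") come from the specification, not from any particular implementation, which is exactly why this layer survives regeneration of the code.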
Layer 4: LLM-as-a-Judge. For criteria that are hard to test automatically (code style, readability, adherence to architectural patterns), a second agent reviews the first agent's output against the specification's quality guidelines. This adds a layer of semantic evaluation beyond syntax checks [17]. OpenAI's automated code reviewer processes over 100,000 external PRs per day, with authors making code changes in response to 52.7% of comments [25].
Layer 5: Spec Conformance Verification. This layer directly addresses the question: "Does the implementation satisfy every requirement in the specification?" The agent is prompted to compare its output against the spec item by item: "Review the above requirements list and ensure each is satisfied, marking any missing ones" [17]. Conformance suites (language-independent tests, often YAML-based, that any implementation must pass) formalize this process [17].
Layer 6: Human Architectural Review. The most expensive but highest-assurance layer. Humans evaluate whether the overall system design is sound, whether the specification captured the right requirements, and whether the architecture handles edge cases, scalability, and operational concerns that automated tools cannot assess.
3.3 Formal Methods for Agent Outputs
For high-stakes systems, the V&V framework can incorporate formal verification techniques. The key insight is that formal verification of agent outputs is tractable because we are not verifying the model (which is a black box) but rather verifying the output against a specification (which is a well-defined problem) [30].
Define the agent's output as a state transition SA → SB (the codebase before and after the change). Define invariants as properties that must hold in SB. Verification becomes: does SB satisfy all invariants? If any invariant fails, the transition is rejected and SA remains unchanged. This is directly analogous to database transactions where constraints prevent invalid states from being committed [30].
This pattern, called transactional integrity for agent outputs, ensures that even if the model misbehaves, the system does not enter an invalid state. The invariants can range from simple (all tests pass) to complex (formal properties verified by tools like Z3 or property-based testing frameworks). The Proof-Carrying Agents paradigm extends this further: agents accept or reject pipeline branch merges solely on the basis of verifier outputs, enforcing correctness without continuous human oversight [29].
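The transactional pattern above can be sketched in a few lines, modeling the codebase as a plain dict for illustration (the state representation and invariant set are assumptions of the example):

```python
import copy
from typing import Callable

State = dict  # stand-in for "the codebase"

def apply_transactionally(
    state: State,
    change: Callable[[State], None],
    invariants: list[Callable[[State], bool]],
) -> tuple[State, bool]:
    """Apply an agent's change S_A -> S_B only if every invariant holds in S_B.

    Mirrors a database transaction: a failed constraint means the proposed
    transition is rejected and S_A survives unchanged.
    """
    candidate = copy.deepcopy(state)  # work on a copy so S_A is never mutated
    change(candidate)
    if all(inv(candidate) for inv in invariants):
        return candidate, True   # commit S_B
    return state, False          # roll back to S_A

# A simple invariant: the "tests" entry must stay green.
invariants = [lambda s: s.get("tests") == "pass"]

good = lambda s: s.update(feature="added", tests="pass")
bad = lambda s: s.update(feature="broken", tests="fail")

s0 = {"tests": "pass"}
s1, committed = apply_transactionally(s0, good, invariants)
s2, bad_committed = apply_transactionally(s1, bad, invariants)
```

In practice the invariants range from "all tests pass" to Z3-verified formal properties, but the commit/rollback skeleton is the same.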
3.4 The LLM-as-a-Judge Pattern in Detail
The LLM-as-a-Judge pattern exploits what OpenAI's alignment team calls the verification-generation gap: generating correct code requires broad search and many tokens, while falsifying a proposed change usually needs only targeted hypothesis generation and checks [25]. Verification is fundamentally easier than generation, which means a reviewer agent with modest compute can catch errors in code produced by a generator agent with substantial compute.
In practice, the Writer/Reviewer pattern from Claude Code's agent teams implements this directly. One session generates code; a fresh session (with clean context, no implementation bias) reviews it. The reviewer has access to the specification and can flag deviations. At Anthropic, this pattern has proven effective because a fresh context improves code review quality, as the reviewer will not be biased toward code it just wrote [4].
3.5 Practical V&V Patterns for AI-Assisted Development
Pattern 1: Hook-Enforced Quality Gates
Use Claude Code hooks to enforce automated checks at every relevant lifecycle point. A PostToolUse
hook with matcher "Write|Edit" runs the linter and type checker after every file change. A
TaskCompleted hook runs the full test suite and rejects task completion (exit code 2) if tests
fail. This creates continuous, invisible V&V without manual intervention.
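A gate script of this kind might look like the following sketch. The check commands are placeholders, and the exact hook I/O conventions (what arrives on stdin, how exit codes are interpreted) should be confirmed against your Claude Code version's hooks documentation:

```python
#!/usr/bin/env python3
"""Sketch of a quality-gate hook script; commands and conventions are
illustrative assumptions, not a canonical Claude Code hook."""
import subprocess
import sys

CHECKS = [
    ["npm", "run", "lint"],   # L1: static analysis
    ["npm", "test"],          # L2: automated tests
]

def run_gate(checks=CHECKS) -> int:
    """Return an exit code: 0 lets the agent proceed, 2 blocks the action."""
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Write to stderr so the failure reason can be surfaced to the
            # agent, letting it self-correct instead of silently proceeding.
            print(f"quality gate failed: {' '.join(cmd)}", file=sys.stderr)
            print(result.stdout + result.stderr, file=sys.stderr)
            return 2
    return 0

# In the actual hook entry point, finish with: sys.exit(run_gate())
```

Because the gate is a plain executable, the same script can double as a pre-commit hook or CI step, keeping the human and agent quality bars identical.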
Pattern 2: Spec-Test Duality
For every requirement in the specification, there should exist at least one test that would fail if the requirement were not met. This is the SDD analog of TDD's "red-green-refactor" cycle. The specification defines what success looks like; the tests operationalize that definition into executable assertions. GitHub's Spec Kit generates task checklists that can serve as the basis for these tests [16].
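The duality can be checked mechanically if requirements carry stable IDs. The REQ-NNN tagging convention below is an assumption for illustration, not a Spec Kit format:

```python
import re

def uncovered_requirements(spec_text: str, test_sources: list[str]) -> list[str]:
    """Return requirement IDs that no test mentions.

    Illustrative convention: the spec tags each requirement as REQ-NNN and
    each test references the IDs it covers (e.g. in a comment or name).
    An empty result means every requirement has at least one test that
    could fail if the requirement were unmet.
    """
    required = set(re.findall(r"REQ-\d{3}", spec_text))
    covered: set[str] = set()
    for source in test_sources:
        covered |= set(re.findall(r"REQ-\d{3}", source))
    return sorted(required - covered)

spec = """
REQ-001: reject orders exceeding the credit limit
REQ-002: email addresses are validated on registration
"""
tests = ["def test_credit_limit():  # covers REQ-001\n    ..."]
missing = uncovered_requirements(spec, tests)
```

Run as a CI step, a non-empty `missing` list fails the build, turning the spec-test duality from a guideline into an enforced gate.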
Pattern 3: The Dual-Agent Verification Loop
Have one agent write code and another write tests independently from the same specification. If the tests fail against the implementation (or vice versa), there is either a spec ambiguity, an implementation bug, or a test error. The disagreement is itself diagnostic [4].
Pattern 4: Conformance Suite as Contract
Build a conformance suite of language-independent tests (often YAML-based input/output pairs) that any implementation must pass. If the implementation is regenerated from the specification, the conformance suite validates that the new implementation is functionally equivalent. This decouples verification from any specific implementation, making the specification truly regenerable [17].
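A conformance runner can be sketched as follows. Real suites typically keep the cases in YAML files so they are language-independent; inline dicts are used here only to keep the sketch dependency-free, and the email-normalization behavior is an assumed example:

```python
# Language-independent conformance cases: pure input/output pairs.
CASES = [
    {"name": "strips whitespace", "input": "  a@b.com ", "expect": "a@b.com"},
    {"name": "lowercases domain", "input": "a@B.COM", "expect": "a@b.com"},
]

def normalize_email(raw: str) -> str:
    """One candidate implementation; any regeneration must pass the same cases."""
    local, _, domain = raw.strip().partition("@")
    return f"{local}@{domain.lower()}"

def run_conformance(impl, cases) -> list[str]:
    """Return the names of failing cases; an empty list means conformant."""
    return [c["name"] for c in cases if impl(c["input"]) != c["expect"]]

failures = run_conformance(normalize_email, CASES)
```

Because `run_conformance` takes the implementation as a parameter, the suite validates functional equivalence across regenerations without ever referencing a specific implementation.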
Pattern 5: Progressive Trust Escalation
Not all code changes require the same level of scrutiny. Define a risk taxonomy tied to the boundary system from the specification. Changes to authentication logic, database schemas, or infrastructure configuration require full six-layer V&V. Changes to UI formatting or documentation may need only layers 1 and 2. This risk-proportionate approach makes V&V economically sustainable at scale.
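The risk taxonomy can be sketched as a path-pattern table mapping changed files to the V&V layers (1-6) they must pass. The patterns and tier assignments are illustrative assumptions that each team would calibrate:

```python
import fnmatch

# Hypothetical risk rules: glob patterns over changed paths -> required layers.
RISK_RULES = [
    (["auth/*", "db/migrations/*", "infra/*"], {1, 2, 3, 4, 5, 6}),  # high risk
    (["src/*"],                                {1, 2, 4, 5}),        # normal code
    (["docs/*", "*.md"],                       {1, 2}),              # low risk
]

def required_layers(changed_paths: list[str]) -> set[int]:
    """Union of required layers across all changed files, so the strictest
    file in a change set decides the overall bar."""
    layers: set[int] = set()
    for path in changed_paths:
        for patterns, rule_layers in RISK_RULES:
            if any(fnmatch.fnmatch(path, p) for p in patterns):
                layers |= rule_layers
                break  # first matching rule decides this file's tier
    return layers or {1, 2}  # unknown paths still get the cheap layers
```

A PR touching both auth/ and docs/ thus requires all six layers: risk is inherited from the riskiest file, never averaged away.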
The Unified Pipeline: Putting It All Together
The three pillars are not independent tools to be used in isolation. They form a closed-loop engineering system
where specifications drive agent behavior when loaded into context (via slash commands or
@path imports), V&V results feed back into specification refinement, and the agentic
execution environment (Claude Code) provides the substrate for the entire process. The coupling between
these layers is intentionally loose: Claude Code provides the execution machinery but has no built-in
knowledge of SDD artifacts. Spec Kit provides the SDD workflow but relies on Claude Code's tool access
to read and write files. The integration point is the slash command template, a markdown file that
bridges the two by telling the agent exactly which files to consume and produce.
4.1 Integrated Architecture
4.2 End-to-End Workflow
Here is the complete workflow, combining all three pillars into a practical engineering process:
Setup: Configure the Execution Environment
Create or refine CLAUDE.md with project context, commands, architecture rules, and coding
standards. Import the constitution into always-on context with
@.specify/memory/constitution.md.
For the feature you are about to build, import the active spec:
@.specify/specs/001-feature-name/spec.md. Configure MCP servers for external tool access
(Jira, GitHub, databases). Define hooks for quality gates (PostToolUse for linting,
TaskCompleted for test enforcement). Create skills for domain-specific expertise.
This is a one-time investment per project (plus a per-feature @path import that you add
and remove as features ship).
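A CLAUDE.md following this setup might resemble the excerpt below. The commands and rules are placeholders; the `@path` lines use Claude Code's import syntax mentioned above:

```markdown
# CLAUDE.md (illustrative excerpt)

## Commands
- Test: `npm test`        <!-- placeholder; use the project's real commands -->
- Lint: `npm run lint`

## Architecture rules
- All database access goes through the repository layer.

## Always-on context
@.specify/memory/constitution.md

## Active feature (per-feature import; remove when the feature ships)
@.specify/specs/001-feature-name/spec.md
```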
Specify: Transform Intent into Contract
Describe the feature at a high level, focusing on what and why. Use /speckit.specify or
equivalent to generate a structured specification. Review the spec against the six core areas (commands,
testing, structure, style, workflow, boundaries). Apply the Clarity Gate: could a different agent generate
equivalent code from this spec alone? Iterate until the answer is yes.
Plan: Architect the Solution
Provide technical constraints and preferences. Use /speckit.plan to generate data models, API
contracts, and component architecture. Dispatch research subagents for technology-specific questions. Review
and approve the plan before proceeding.
Decompose: Create Verifiable Tasks
Use /speckit.tasks to break the plan into small, independently testable work units. For each
task, define completion criteria derived from the specification. Establish the conformance suite: what tests
must pass for each task to be considered done?
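The mapping from tasks to conformance tests can be sketched as a simple gate. Task IDs and test names below are hypothetical:

```python
# Hypothetical task records derived from tasks.md: each task carries its
# completion criteria as the conformance-suite test names that must pass.
TASKS = {
    "T001-create-user-model": ["test_user_schema", "test_user_validation"],
    "T002-login-endpoint":    ["test_login_success", "test_login_bad_password"],
}

def task_done(task_id, passing_tests):
    """A task is done only when every test named in its criteria has passed."""
    required = set(TASKS[task_id])
    return required <= set(passing_tests)

print(task_done("T001-create-user-model",
                {"test_user_schema", "test_user_validation"}))   # True
print(task_done("T002-login-endpoint", {"test_login_success"}))  # False
```

Making the criteria explicit per task is what allows a TaskCompleted hook, rather than the agent's own judgment, to decide when work is finished.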
Implement with Continuous V&V
Deploy the dual-agent pattern: the primary agent implements tasks while a test agent writes tests from the
specification (not from the implementation). Hooks enforce L1 (static analysis) on every file write.
TaskCompleted hooks enforce L2 (automated tests). The review agent provides L4 (semantic
review) on each completed task. L5 (spec conformance) is checked before the task is marked done.
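A hook enforcing L1 on every file write can be sketched as follows. This assumes Claude Code's documented hook contract (a JSON payload arrives on stdin; exit code 2 signals failure and feeds stderr back to the agent); the `ruff check` invocation is a placeholder for whatever L1 tooling the project actually uses:

```python
#!/usr/bin/env python3
"""PostToolUse hook sketch: run a linter (L1) on every file the agent writes.

Assumes Claude Code's hook contract: a JSON payload on stdin, and exit
code 2 returns stderr to the agent as a failure it must address. The
linter command is a placeholder for the project's real L1 tooling.
"""
import json
import subprocess
import sys

def lint_exit_code(payload, linter=("ruff", "check")):
    """Return 0 (pass) or 2 (fail, with details printed to stderr)."""
    file_path = payload.get("tool_input", {}).get("file_path", "")
    if not file_path.endswith(".py"):
        return 0  # this sketch only lints Python files
    result = subprocess.run([*linter, file_path], capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stdout + result.stderr, file=sys.stderr)
        return 2
    return 0

# Production entry point: sys.exit(lint_exit_code(json.load(sys.stdin)))
# Demo with an in-memory payload (a non-Python file passes through):
demo = {"tool_name": "Write", "tool_input": {"file_path": "docs/notes.md"}}
print(lint_exit_code(demo))  # prints 0
```

Because the hook is deterministic and runs outside the model, it enforces the quality bar even when the agent would otherwise move on.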
Validate and Close the Loop
Run the full conformance suite against the complete implementation. Apply L6 (human architectural review)
for high-risk changes. If V&V reveals specification gaps, the feedback loop works mechanically:
run /speckit.analyze, which performs a read-only cross-artifact consistency check across
spec.md, plan.md, and tasks.md. The analyze command outputs a
structured remediation report identifying inconsistencies, ambiguities, and gaps. After human review
and approval of the remediation plan, the agent uses its standard Write and Edit
tools to modify the spec files in place at their known paths in .specify/specs/.
Within the current session, the agent already has the updated content in memory because it just
wrote it. For subsequent sessions, the @path import in CLAUDE.md re-reads from disk at
launch, and Spec Kit slash commands always re-read the feature directory at invocation time, so
changes are picked up automatically. The spec remains the source of truth. Commit the updated
specification alongside the code.
This pipeline is not theoretical. It synthesizes documented practices from Anthropic's internal teams [4], GitHub's Spec Kit methodology [15], Addy Osmani's specification framework [17], OpenAI's automated code review system [25], and the formal verification community's work on agent output assurance [29]. Each piece has been validated independently; the contribution here is their integration into a coherent end-to-end methodology suited for teams adopting agentic AI development at scale.
The DORA 2025 report's finding applies directly: AI is an amplifier of your development practices. Good processes get better, with high-performing teams seeing 55-70% faster delivery. Bad processes get worse, accumulating debt at unprecedented speed [1]. The pipeline described here is the good process.