Previously, I covered Ralph Wiggum as Anthropic’s official Claude Code plugin: install it, run /ralph-loop, wake up to working code. This post covers the underlying bash-based methodology: same technique, different implementation.

Geoffrey Huntley originated the technique. Clayton Farr documented and refined it into a full playbook. Huntley then forked Farr’s playbook, which reads as an endorsement.

The Core Philosophy

Three principles drive the methodology:

  • Context is scarce: With ~176K usable tokens from a 200K window, you can’t afford bloated prompts. Keep each iteration lean.
  • Plans are disposable: A plan that drifts is cheaper to regenerate than to salvage. Don’t fight stale state.
  • Backpressure beats direction: Instead of telling the agent what to do, engineer an environment where wrong outputs get rejected automatically.

Human roles shift from “telling the agent what to do” to “engineering conditions where good outcomes emerge naturally through iteration.”

— Clayton Farr

Three-Phase Workflow

Phase 1: Requirements

Human and LLM conversation. Identify jobs to be done. Break them into topics of concern. Create a spec file for each topic in specs/.

No code yet. Just requirements documented clearly enough that an agent can gap-analyze against them.

Phase 2: Planning

The agent reads specs, examines existing code, and generates IMPLEMENTATION_PLAN.md: a prioritized task list with no implementation. Pure gap analysis.

This happens in “planning mode” with its own prompt (PROMPT_plan.md). The agent exits after writing the plan.

Phase 3: Building

The agent picks the top task from IMPLEMENTATION_PLAN.md, implements it, runs validation, updates the plan, commits, and exits. Fresh context for the next iteration.

One task per iteration

This is the key insight. Each iteration processes exactly one task. Context stays lean. The agent stays in the smart zone instead of accumulating cruft.
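A minimal sketch of the outer loop behind this rule, assuming the agent is any command that reads a prompt on stdin, does one unit of work, and exits. The names `AGENT_CMD` and `run_loop` are hypothetical, and the playbook's real loop.sh passes more flags than the bare `claude -p` default shown here (an unattended run also needs whatever permission flags your setup requires).

```bash
# Assumption: AGENT_CMD reads a prompt on stdin and exits when done.
AGENT_CMD="${AGENT_CMD:-claude -p}"

# run_loop MODE MAX_ITERATIONS
# Starts a brand-new agent process per iteration, so no context carries over.
# The only shared state is what the agent wrote to disk (plan, code, commits).
run_loop() {
  local mode="$1" max="$2" i=1
  local prompt_file="PROMPT_${mode}.md"

  if [ ! -f "$prompt_file" ]; then
    echo "missing $prompt_file" >&2
    return 1
  fi

  while [ "$i" -le "$max" ]; do
    echo "=== ${mode} iteration ${i}/${max} ==="
    # Fresh process, fresh context: the prompt is re-read from disk each time.
    $AGENT_CMD < "$prompt_file" || return 1
    i=$((i + 1))
  done
}
```

Under these assumptions, `run_loop plan 1` would write the plan once and `run_loop build 20` would work through at most twenty tasks, one per process.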

File Structure

Five files and a specs/ directory make it work:

project/
├── loop.sh                    # Bash orchestrator
├── PROMPT_plan.md             # Planning mode instructions
├── PROMPT_build.md            # Building mode instructions
├── AGENTS.md                  # Build/test/lint commands (~60 lines)
├── IMPLEMENTATION_PLAN.md     # Shared state between iterations
└── specs/
    └── *.md                   # One file per topic of concern

loop.sh: The outer loop. Feeds prompts to Claude, manages modes, enforces iteration limits. Simple bash.

PROMPT_plan.md / PROMPT_build.md: Mode-specific instructions loaded each iteration. Planning mode generates the plan. Building mode executes from it.

AGENTS.md: Operational guide. Build commands, test commands, validation steps. Keep it to about 60 lines. Everything the agent needs to verify its work.

specs/*: Source of truth for requirements. One markdown file per topic. The agent references these during gap analysis and implementation.

IMPLEMENTATION_PLAN.md: Persistent state. The agent reads it, picks a task, implements, updates it, exits. Next iteration reads the updated plan.
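Before starting a run, it can help to confirm the layout is in place. The helper below is an illustrative sketch, not part of the playbook: `check_layout` is a hypothetical name, and treating a missing plan as a note rather than an error is my assumption, since planning mode is what creates it.

```bash
# check_layout [DIR]
# Reports missing files and warns when AGENTS.md exceeds the 60-line target.
# Returns non-zero if a required file is missing.
check_layout() {
  local dir="${1:-.}" status=0 f lines

  for f in loop.sh PROMPT_plan.md PROMPT_build.md AGENTS.md; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      status=1
    fi
  done

  # specs/ must contain at least one markdown file.
  if ! ls "$dir"/specs/*.md >/dev/null 2>&1; then
    echo "missing: specs/*.md"
    status=1
  fi

  # The plan is generated by planning mode, so its absence is only a note.
  [ -f "$dir/IMPLEMENTATION_PLAN.md" ] || echo "note: no plan yet, run planning mode"

  if [ -f "$dir/AGENTS.md" ]; then
    lines=$(wc -l < "$dir/AGENTS.md" | tr -d ' ')
    [ "$lines" -le 60 ] || echo "warning: AGENTS.md is $lines lines (target: 60 or fewer)"
  fi

  return "$status"
}
```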

Backpressure Mechanisms

Autonomous loops converge when wrong outputs get rejected. Three layers:

  • Downstream gates: Tests, type-checking, linting, build validation. If these fail, the agent knows to retry or adjust.
  • Upstream steering: Existing code patterns guide the agent’s approach. It discovers conventions through exploration rather than explicit instruction.
  • LLM-as-judge: For subjective criteria (tone, UX feel, aesthetics), use another LLM call with binary pass/fail. The loop converges on these eventually through iteration.

Start with hard gates

Tests and builds are deterministic. Start there. Add LLM-as-judge for subjective criteria only after the mechanical backpressure is working.
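As a rough sketch of what a hard gate looks like in bash: run each check in order and stop at the first failure. `run_gates` is a hypothetical helper, and the commands you pass it are whatever your AGENTS.md lists; the ones in the usage note are placeholders.

```bash
# run_gates CMD...
# Runs each gate command in order and stops at the first failure, so the
# agent (or the loop) sees exactly which check rejected the change.
run_gates() {
  local gate
  for gate in "$@"; do
    echo "gate: $gate"
    if ! sh -c "$gate"; then
      echo "REJECTED by: $gate"
      return 1
    fi
  done
  echo "all gates passed"
}
```

For a TypeScript project this might be called as `run_gates "npm run lint" "npx tsc --noEmit" "npm test"`, assuming those scripts exist in the project.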

Context Efficiency

The “dumb zone” starts at around 40% context utilization. Past that point, the model’s reasoning degrades.

Ralph’s one-task-per-iteration structure sidesteps this. Each iteration starts fresh. The agent loads only what it needs: the prompt, the plan, relevant code. Context stays in the sweet spot.

Additional tactics:

  • Spawn subagents for expensive exploration instead of consuming main context
  • Keep AGENTS.md lean - 60 lines, not 600
  • Trust code patterns over exhaustive prompt instructions

Plan Disposability

Plans drift. Requirements change. The agent misunderstands something early and compounds the error.

The fix: regenerate. Switching back to planning mode and rerunning gap analysis is cheap. Fighting a stale plan wastes more iterations.

Treat IMPLEMENTATION_PLAN.md as coordination state, not a contract. When it’s wrong, throw it away.
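In practice, throwing the plan away can be a two-line operation. This is a sketch under assumptions: `regenerate_plan` is a hypothetical helper, and the default `./loop.sh plan 1` stands in for however your own loop script enters planning mode.

```bash
# regenerate_plan [PLAN_CMD]
# Deletes the stale plan and reruns planning mode to rebuild it from the
# specs and the current code. If the old plan was committed, it remains
# recoverable from git history.
# Assumption: "./loop.sh plan 1" is a hypothetical planning-mode invocation.
regenerate_plan() {
  local plan_cmd="${1:-./loop.sh plan 1}"
  rm -f IMPLEMENTATION_PLAN.md
  sh -c "$plan_cmd"
}
```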

Advanced Patterns

The playbook proposes five enhancements beyond the core loop:

  • Acceptance-driven backpressure: Derive test requirements from acceptance criteria during planning. Connect specs → required tests → implementation. This prevents “cheating”: the agent can’t claim a task is done without passing the tests that prove it.
  • LLM-as-judge for subjective criteria: Binary pass/fail reviews for tone, aesthetics, UX quality. Create a fixture pattern (llm-review.ts) that Ralph discovers and learns to apply.
  • Work-scoped branches: Create a scoped IMPLEMENTATION_PLAN.md per branch upfront with a plan-work mode. Scope at plan creation (deterministic) rather than runtime filtering (probabilistic).
  • JTBD → Story Map → SLC releases: Reframe specs as user journey activities. Slice horizontally through the story map to identify Simple, Lovable, Complete releases rather than building everything at once.
  • AskUserQuestionTool for requirements: Use Claude’s interview capabilities during Phase 1 to systematically clarify edge cases and acceptance criteria before writing specs.
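Of these five, the LLM-as-judge pattern is the easiest to sketch in a few lines. The playbook's fixture is TypeScript (llm-review.ts); the bash version below is a stand-in for illustration, not that fixture. `JUDGE_CMD` and `judge` are hypothetical names, and I am assuming the judge is any command that reads a prompt on stdin and prints the model's reply.

```bash
# Assumption: JUDGE_CMD reads a prompt on stdin and prints the reply.
JUDGE_CMD="${JUDGE_CMD:-claude -p}"

# judge CRITERION ARTIFACT_FILE
# Asks for a one-word verdict and maps it to an exit code, so a subjective
# check can sit in the same pipeline as tests and linters.
# Exit codes: 0 = pass, 1 = fail or unclear reply, 2 = judge command failed.
judge() {
  local criterion="$1" artifact="$2" verdict
  verdict=$(
    {
      echo "Criterion: $criterion"
      echo "Reply with exactly one word: PASS or FAIL."
      echo "--- artifact ---"
      cat "$artifact"
    } | $JUDGE_CMD
  ) || return 2

  case "$verdict" in
    PASS*) return 0 ;;
    *)     return 1 ;;   # anything other than a clear PASS is a rejection
  esac
}
```

Defaulting to rejection on an unclear reply keeps the check binary: the loop retries instead of accepting an ambiguous verdict.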

Prompt Structure

The playbook’s prompts follow a phase structure:

  • 0a-0d: Orientation (read files, understand context)
  • 1-4: Main instructions (what to do this iteration)
  • 999+: Guardrails (what not to do, safety rails)

Key language patterns that improve agent behavior: “study the codebase first,” “don’t assume not implemented,” “ultrathink before acting,” “capture the why in commits.”
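To make the numbering concrete, here is a sketch that writes a skeleton build prompt in that layout. Only the 0a-0d / 1-4 / 999+ structure and the quoted phrases come from the playbook as summarized above; the wording of each line and the `write_prompt_skeleton` name are my own illustration.

```bash
# write_prompt_skeleton FILE
# Writes a prompt whose numbering follows the 0a-0d / 1-4 / 999+ layout.
# The wording of every line is illustrative, not copied from the playbook.
write_prompt_skeleton() {
  cat > "$1" <<'EOF'
0a. Study specs/* to learn the requirements.
0b. Study IMPLEMENTATION_PLAN.md to see what is done and what is next.
0c. Study AGENTS.md for the build, test, and lint commands.
0d. Study the existing source before changing it.

1. Pick the single most important open task from the plan.
2. Search the codebase first; don't assume not implemented.
3. Implement the task and run the validation commands.
4. Update the plan, then commit and capture the why in the message.

999. Do not start a second task in this iteration.
9999. Do not weaken or delete tests to make them pass.
EOF
}
```

The large guardrail numbers leave room to add main instructions later without renumbering the rules at the bottom.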

Topic Scope Test

How do you know if a spec file covers one topic or several? Try describing it in one sentence without “and.” If you need conjunctions, split it.

Example JTBD: “Help designers create mood boards”

  • Topics: image collection, color extraction, layout, sharing
  • Each topic → one spec file → multiple implementation tasks
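The one-sentence test is simple enough to automate crudely. The check below is a sketch, with `one_topic` as a hypothetical name; it only looks for the standalone word “and”, so it is a proxy for the judgment call rather than a replacement for it.

```bash
# one_topic "DESCRIPTION"
# Succeeds when the one-sentence description has no standalone "and".
# -w matches whole words only, so "random" or "brand" do not trip it.
one_topic() {
  if printf '%s\n' "$1" | grep -qiw 'and'; then
    return 1
  fi
}

one_topic "Extract a color palette from collected images" && echo "one topic"
one_topic "Collect images and share the finished board"   || echo "split it"
```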

What This Doesn’t Solve

  • Bad specs: Garbage in, garbage out. The methodology assumes you’ve done Phase 1 properly.
  • Architectural decisions: Novel abstractions still need human judgment. Ralph handles execution, not design.
  • Cost: Each iteration burns tokens. 50 iterations on a large codebase can hit $50-100+. Set limits.
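The figures above work out to roughly $1-2 per iteration, which gives a simple way to turn a budget into an iteration cap. This is back-of-the-envelope arithmetic in whole dollars; `max_iterations` is a hypothetical helper, and the per-iteration cost is something you would measure on your own codebase.

```bash
# max_iterations BUDGET_USD COST_PER_ITERATION_USD
# Whole-dollar integer division: how many iterations fit in the budget.
max_iterations() {
  echo $(( $1 / $2 ))
}

max_iterations 50 2    # prints 25
```

The result is the number to pass as the loop's iteration limit.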

Try It Yourself

This post summarizes the methodology. For the full details, read Farr’s complete playbook - it covers advanced patterns, edge cases, and working examples that go beyond what’s here.

Source repos:

Start with a small project and clear requirements. Write specs. Let it plan. Let it build. Watch what breaks and add backpressure accordingly.

The official plugin handles the loop mechanics with CLI ergonomics. The bash playbook gives you full control.

Update: June 2026

Two things happened to this methodology since January: it got sharper, and it started getting automated.

Huntley sharpened the inputs. His March porting essay refines the freeform specs/ and PROMPT.md into structured, citation-linked specs: the agent gets links back to the original implementation it can consult mid-task, not just a prose description. Same three-phase shape this post describes, with the requirements phase hardened so the agent stops guessing. If you run the playbook, steal this one move - cite your sources inside the spec.

The platform started building the five files for you. In late May, Claude Code shipped Dynamic Workflows: give it a goal and Claude writes its own JavaScript orchestrator, spawns parallel subagents, and iterates until convergence. That is loop.sh, the plan generation, and the one-task-per-iteration discipline, generated on demand instead of hand-assembled. The hand-built methodology in this post is becoming the “how it works under the hood” for a native feature, which is exactly the absorption cycle the rest of this series tracks.

But it stays individual-scale. The one place the methodology still breaks is teams. Attempts to share IMPLEMENTATION_PLAN.md across a team in early 2026 produced merge conflicts and a dozen diverging versions overwriting each other. The lesson the community settled on: the shared artifact is the PR, not the plan file. This is a single-developer power tool. Scaling it to a team is still an unsolved problem, not a config flag.