Coding agents lie. Not maliciously, but constantly. “The bug is fixed.” “The layout matches the design.” “All tests pass.”

Half the time, they’re wrong.

There are several ways to deal with this. Ralph Wiggum loops force agents to keep iterating until acceptance criteria actually pass. Beads gives agents memory so they can track issues across sessions. But for visual work, you need something else: make them prove it with screenshots.

Steve Yegge shared this approach in a recent interview. Screenshots as source of truth. Visual verification at every step.

The Pattern

Yegge is building a React client for his 30-year-old game. The agent iterates on code, but every claim gets verified against a reference screenshot:

I basically say, okay, our screenshot has to match this reference screenshot. First make it look kind of like this. Then let’s work on the theming. Now let’s work on the fonts. I’m just iterating and refining.
— Steve Yegge

The reference screenshot is of the Steam client - ugly, old, beloved by players. The React client needs to match it. Not pixel-perfect, but functionally equivalent.

Every iteration:

  1. Agent makes changes
  2. Agent claims success
  3. Screenshot gets refreshed via Playwright
  4. Agent compares current state to reference (see the sketch after this list)
  5. If mismatch, agent identifies what’s wrong and continues

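Here's roughly what steps 3 and 4 look like in code: plain Playwright for the capture, pixelmatch for the comparison. The file names, localhost URL, and 2% mismatch budget are illustrative, not Yegge's actual setup.

import fs from 'node:fs';
import { chromium } from 'playwright';
import { PNG } from 'pngjs';
import pixelmatch from 'pixelmatch';

// Step 3: refresh the screenshot of the work in progress.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('http://localhost:3000'); // the client under development
await page.screenshot({ path: 'current.png' });
await browser.close();

// Step 4: compare against the reference (assumes identical dimensions).
const reference = PNG.sync.read(fs.readFileSync('reference-steam.png'));
const current = PNG.sync.read(fs.readFileSync('current.png'));
const { width, height } = reference;
const diff = new PNG({ width, height });
const mismatched = pixelmatch(reference.data, current.data, diff.data, width, height, { threshold: 0.1 });
fs.writeFileSync('diff.png', PNG.sync.write(diff));

// "Not pixel-perfect, but functionally equivalent": allow a small budget.
console.log(mismatched > width * height * 0.02 ? 'Mismatch: keep iterating' : 'Close enough');
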
The key insight: agents can see images. They know when reality doesn’t match their claims. Force them to look, and they self-correct.

Catching Lies Early

It said ‘the screenshot still shows gray placeholders.’ I caught you lying to me. That’s what they do. They’re like, okay, we’re all done now. It caught itself because I made it catch itself.
— Steve Yegge

Without visual verification, you’re trusting text claims. “The spells are rendering correctly.” Are they? You’d have to check manually.

With visual verification, the agent checks itself. The screenshot is evidence. The agent either matches it or admits it doesn’t.

This is the “prove it to me” pattern:

  • Don’t accept “it’s done”
  • Demand evidence
  • Make the agent evaluate its own output

The Headless Plugin

I built a Headless plugin for exactly this workflow. Parallel browser agents capture screenshots, compare states, report differences. I used it for .NET 4.5 to 10 migrations, running parity checks in an agentic loop until the migrated site matched the legacy one.

/headless:parity https://legacy.example.com https://migrated.example.com

Each agent:

  1. Starts a Playwright browser session
  2. Navigates and interacts
  3. Captures screenshots at each step
  4. Compares legacy vs migrated states (sketched below)
  5. Reports differences with severity

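Stripped down, each agent's loop looks something like this. The URLs and page list are placeholders; the real plugin layers interaction scripts and severity scoring on top.

import { chromium } from 'playwright';

const legacyUrl = 'https://legacy.example.com';
const migratedUrl = 'https://migrated.example.com';

const browser = await chromium.launch();
const legacy = await browser.newPage();
const migrated = await browser.newPage();

// Walk both sites through the same steps, capturing evidence at each one.
for (const path of ['/', '/login', '/dashboard']) { // hypothetical pages
  await legacy.goto(legacyUrl + path);
  await migrated.goto(migratedUrl + path);
  const name = path === '/' ? 'home' : path.slice(1);
  await legacy.screenshot({ path: `legacy-${name}.png`, fullPage: true });
  await migrated.screenshot({ path: `migrated-${name}.png`, fullPage: true });
  // The agent then diffs each pair and reports differences with severity.
}
await browser.close();
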
The agents see what they’re testing. They’re not guessing based on DOM structure. They’re looking at rendered output.

Why Playwright?

Playwright gives you real browser rendering. CSS, JavaScript, fonts, images - everything that affects visual output. DOM inspection misses styling issues. Screenshots catch them.

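One contrived example of the gap, assuming a page where CSS hides an element the DOM still contains:

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://migrated.example.com'); // placeholder URL

// The DOM says the element exists...
const inDom = (await page.locator('#checkout').count()) > 0;
// ...but boundingBox() is null when CSS has collapsed or hidden it:
// a styling bug DOM inspection alone never flags.
const box = await page.locator('#checkout').boundingBox();
console.log({ inDom, visiblyRendered: box !== null });
await browser.close();
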
Video Capture for Temporal Bugs

Screenshots have a blind spot: time.

Yegge’s React client has a flickering bug. The map renders, then disappears, then renders again. A screenshot might catch it in either state. You need multiple frames to see the problem.

The flickering thing - it’s going to have to take multiple screenshots, catch it in the act. It’s not a static screenshot, it’s a behavior over time.
— Steve Yegge

I’ve added video capture to the Headless plugin for exactly this:

# Start session with video recording
npx browser.ts start https://legacy.com https://migrated.com --video

# Interact, wait, let animations play...

# Stop - returns video paths
npx browser.ts stop <session-id>

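Plugin aside, Playwright's standard recordVideo option gives you the raw capability. A minimal sketch, with placeholder URL and paths:

import { chromium } from 'playwright';

const browser = await chromium.launch();
// recordVideo makes Playwright capture every page in this context.
const context = await browser.newContext({ recordVideo: { dir: 'videos/' } });
const page = await context.newPage();

await page.goto('https://migrated.example.com'); // placeholder URL
await page.waitForTimeout(15_000); // let animations and flicker play out

const video = page.video();
await context.close(); // recording is finalized on close
console.log('Recorded:', await video?.path());
await browser.close();
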
Use video for:

  • Flickering: Elements appearing/disappearing intermittently
  • Animation issues: Transitions, loading spinners, hover states
  • Race conditions: Content loading in wrong order
  • Timing bugs: Things that look fine frozen but break in motion

The videos save to temp files. Feed them to a multimodal model for analysis. Gemini Pro handles video natively: up to 2 hours at 1 FPS, 258 tokens per frame. My Gemini plugin can analyze the captured footage directly.

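Here's a rough sketch of that handoff with Google's @google/genai SDK. The model name and prompt are illustrative, and the Files API makes you wait while the video is processed server-side.

import { GoogleGenAI, FileState, createUserContent, createPartFromUri } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Upload the footage, then poll until server-side processing finishes.
let file = await ai.files.upload({
  file: 'videos/session.webm',
  config: { mimeType: 'video/webm' },
});
while (file.state === FileState.PROCESSING) {
  await new Promise((r) => setTimeout(r, 2000));
  file = await ai.files.get({ name: file.name! });
}

const response = await ai.models.generateContent({
  model: 'gemini-2.5-pro', // illustrative; any video-capable model works
  contents: createUserContent([
    createPartFromUri(file.uri!, file.mimeType!),
    'Watch for elements that flicker, vanish, or render out of order. List each with a timestamp.',
  ]),
});
console.log(response.text);
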
Token cost

Video analysis is expensive. At 258 tokens per frame and 1 FPS, even 10-15 seconds of footage costs roughly 2,600-3,900 tokens before you've asked a single question. Use screenshots for static validation, video only for temporal bugs.

The Frontier: Live Streaming

The logical endpoint is real-time video streaming to agents. Not recorded clips, but live feeds during development.

What I would like is something that can record 10 to 15 seconds of video, whatever is achievable for the thing to upload and analyze, so you can say we’re going to focus on this interaction and I want to see the following things happen.
— Steve Yegge

This is coming. The infrastructure exists (WebRTC, screen sharing). The bottleneck is cost and latency. But for high-value applications like game development, it’s worth it.

Imagine: agent watches you play, spots visual bugs in real-time, fixes them while you continue testing.

Multimodal Prompting Tips

Based on Yegge’s workflow and my own experience:

  • Reference images are anchors. Put them in your project. Tell agents “match this.” They’ll iterate toward it.
  • Screenshot after every claim. Agent says it’s fixed? Screenshot. Agent says layout matches? Screenshot. Trust but verify.
  • Use descriptive filenames. reference-homepage.png, current-homepage.png. Agents parse filenames for context.
  • Capture console errors too. Visual parity isn’t enough. Errors that show up in the new version’s console but not the old one are a regression (see the sketch after this list).
  • Video is expensive but necessary. For animations and timing, screenshots won’t cut it. Budget for video analysis.

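The console capture from the tips above is a couple of event listeners in Playwright. A minimal sketch, with a placeholder URL:

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Collect console errors and uncaught exceptions alongside the screenshots.
const errors: string[] = [];
page.on('console', (msg) => { if (msg.type() === 'error') errors.push(msg.text()); });
page.on('pageerror', (err) => errors.push(err.message));

await page.goto('https://migrated.example.com'); // placeholder URL
await page.screenshot({ path: 'current-homepage.png', fullPage: true });
console.log(errors.length ? `Console regressions:\n${errors.join('\n')}` : 'Console clean');
await browser.close();
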
Combining the Approaches

Visual verification pairs well with the other tools for handling agent unreliability:

  • Ralph Wiggum: Autonomous loops that keep running until acceptance criteria pass. Combine with Headless for visual acceptance criteria: “loop until screenshot matches reference.”
  • Beads: Session memory. File an issue: “Map flickering - intermittent render bug.” Attach the video. Next session, agent has visual evidence.
  • Visual verification: Evidence of current state. The screenshot doesn’t lie.

Each reinforces the others. Beads tracks what needs fixing. Ralph Wiggum keeps iterating. Visual verification proves when it’s actually done.


Agents are optimistic. They want to report success. Visual verification adds accountability. The screenshot doesn’t lie. The video shows what actually happened.

Make them prove it.