Coding agents lie. Not maliciously, but constantly. “The bug is fixed.” “The layout matches the design.” “All tests pass.”

Half the time, they’re wrong.

There are several ways to deal with this. Ralph Wiggum loops force agents to keep iterating until acceptance criteria actually pass. Beads gives agents memory so they can track issues across sessions. But for visual work, you need something else: make them prove it with screenshots.

Steve Yegge shared this approach in a recent interview. Screenshots as source of truth. Visual verification at every step.

The Pattern

Yegge is building a React client for his 30-year-old game. The agent iterates on code, but every claim gets verified against a reference screenshot:

I basically say, okay, our screenshot has to match this reference screenshot. First make it look kind of like this. Then let’s work on the theming. Now let’s work on the fonts. I’m just iterating and refining.
— Steve Yegge

The reference screenshot is of the Steam client - ugly, old, beloved by players. The React client needs to match it. Not pixel-perfect, but functionally equivalent.

Every iteration:

  1. Agent makes changes
  2. Agent claims success
  3. Screenshot gets refreshed via Playwright
  4. Agent compares current state to reference (see the sketch after this list)
  5. If mismatch, agent identifies what’s wrong and continues

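Here's roughly what steps 3 and 4 look like in code: plain Playwright for the capture, pixelmatch for the comparison. The file names, localhost URL, and 2% mismatch budget are illustrative, not Yegge's actual setup.

import fs from 'node:fs';
import { chromium } from 'playwright';
import { PNG } from 'pngjs';
import pixelmatch from 'pixelmatch';

// Step 3: refresh the screenshot of the work in progress.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('http://localhost:3000'); // the client under development
await page.screenshot({ path: 'current.png' });
await browser.close();

// Step 4: compare against the reference (assumes identical dimensions).
const reference = PNG.sync.read(fs.readFileSync('reference-steam.png'));
const current = PNG.sync.read(fs.readFileSync('current.png'));
const { width, height } = reference;
const diff = new PNG({ width, height });
const mismatched = pixelmatch(reference.data, current.data, diff.data, width, height, { threshold: 0.1 });
fs.writeFileSync('diff.png', PNG.sync.write(diff));

// "Not pixel-perfect, but functionally equivalent": allow a small budget.
console.log(mismatched > width * height * 0.02 ? 'Mismatch: keep iterating' : 'Close enough');
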
The key insight: agents can see images. They know when reality doesn’t match their claims. Force them to look, and they self-correct.

Catching Lies Early

It said ‘the screenshot still shows gray placeholders.’ I caught you lying to me. That’s what they do. They’re like, okay, we’re all done now. It caught itself because I made it catch itself.
— Steve Yegge

Without visual verification, you’re trusting text claims. “The spells are rendering correctly.” Are they? You’d have to check manually.

With visual verification, the agent checks itself. The screenshot is evidence. The agent either matches it or admits it doesn’t.

This is the “prove it to me” pattern:

  • Don’t accept “it’s done”
  • Demand evidence
  • Make the agent evaluate its own output

The Headless Plugin

I built a Headless plugin for exactly this workflow. Parallel browser agents capture screenshots, compare states, report differences. I used it for .NET 4.5 to 10 migrations, running parity checks in an agentic loop until the migrated site matched the legacy one.

/headless:parity https://legacy.example.com https://migrated.example.com

Each agent:

  1. Starts a Playwright browser session
  2. Navigates and interacts
  3. Captures screenshots at each step
  4. Compares legacy vs migrated states (sketched below)
  5. Reports differences with severity

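Stripped down, each agent's loop looks something like this. The URLs and page list are placeholders; the real plugin layers interaction scripts and severity scoring on top.

import { chromium } from 'playwright';

const legacyUrl = 'https://legacy.example.com';
const migratedUrl = 'https://migrated.example.com';

const browser = await chromium.launch();
const legacy = await browser.newPage();
const migrated = await browser.newPage();

// Walk both sites through the same steps, capturing evidence at each one.
for (const path of ['/', '/login', '/dashboard']) { // hypothetical pages
  await legacy.goto(legacyUrl + path);
  await migrated.goto(migratedUrl + path);
  const name = path === '/' ? 'home' : path.slice(1);
  await legacy.screenshot({ path: `legacy-${name}.png`, fullPage: true });
  await migrated.screenshot({ path: `migrated-${name}.png`, fullPage: true });
  // The agent then diffs each pair and reports differences with severity.
}
await browser.close();
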
The agents see what they’re testing. They’re not guessing based on DOM structure. They’re looking at rendered output.

Why Playwright?

Playwright gives you real browser rendering. CSS, JavaScript, fonts, images - everything that affects visual output. DOM inspection misses styling issues. Screenshots catch them.

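One contrived example of the gap, assuming a page where CSS hides an element the DOM still contains:

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://migrated.example.com'); // placeholder URL

// The DOM says the element exists...
const inDom = (await page.locator('#checkout').count()) > 0;
// ...but boundingBox() is null when CSS has collapsed or hidden it:
// a styling bug DOM inspection alone never flags.
const box = await page.locator('#checkout').boundingBox();
console.log({ inDom, visiblyRendered: box !== null });
await browser.close();
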
Video Capture for Temporal Bugs

Screenshots have a blind spot: time.

Yegge’s React client has a flickering bug. The map renders, then disappears, then renders again. A screenshot might catch it in either state. You need multiple frames to see the problem.

The flickering thing - it’s going to have to take multiple screenshots, catch it in the act. It’s not a static screenshot, it’s a behavior over time.
— Steve Yegge

I’ve added video capture to the Headless plugin for exactly this:

# Start session with video recording
npx browser.ts start https://legacy.com https://migrated.com --video

# Interact, wait, let animations play...

# Stop - returns video paths
npx browser.ts stop <session-id>

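Plugin aside, Playwright's standard recordVideo option gives you the raw capability. A minimal sketch, with placeholder URL and paths:

import { chromium } from 'playwright';

const browser = await chromium.launch();
// recordVideo makes Playwright capture every page in this context.
const context = await browser.newContext({ recordVideo: { dir: 'videos/' } });
const page = await context.newPage();

await page.goto('https://migrated.example.com'); // placeholder URL
await page.waitForTimeout(15_000); // let animations and flicker play out

const video = page.video();
await context.close(); // recording is finalized on close
console.log('Recorded:', await video?.path());
await browser.close();
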
Use video for:

  • Flickering: Elements appearing/disappearing intermittently
  • Animation issues: Transitions, loading spinners, hover states
  • Race conditions: Content loading in wrong order
  • Timing bugs: Things that look fine frozen but break in motion

The videos save to temp files. Feed them to a multimodal model for analysis. Gemini Pro handles video natively: up to 2 hours at 1 FPS, 258 tokens per frame. My Gemini plugin can analyze the captured footage directly.

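Here's a rough sketch of that handoff with Google's @google/genai SDK. The model name and prompt are illustrative, and the Files API makes you wait while the video is processed server-side.

import { GoogleGenAI, FileState, createUserContent, createPartFromUri } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Upload the footage, then poll until server-side processing finishes.
let file = await ai.files.upload({
  file: 'videos/session.webm',
  config: { mimeType: 'video/webm' },
});
while (file.state === FileState.PROCESSING) {
  await new Promise((r) => setTimeout(r, 2000));
  file = await ai.files.get({ name: file.name! });
}

const response = await ai.models.generateContent({
  model: 'gemini-2.5-pro', // illustrative; any video-capable model works
  contents: createUserContent([
    createPartFromUri(file.uri!, file.mimeType!),
    'Watch for elements that flicker, vanish, or render out of order. List each with a timestamp.',
  ]),
});
console.log(response.text);
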
Token cost

Video analysis is expensive. At 258 tokens per frame and 1 FPS, even 10-15 seconds of footage costs roughly 2,600-3,900 tokens before you've asked a single question. Use screenshots for static validation, video only for temporal bugs.

The Frontier: Live Streaming

The logical endpoint is real-time video streaming to agents. Not recorded clips, but live feeds during development.

What I would like is something that can record 10 to 15 seconds of video, whatever is achievable for the thing to upload and analyze, so you can say we’re going to focus on this interaction and I want to see the following things happen.
— Steve Yegge

This is coming. The infrastructure exists (WebRTC, screen sharing). The bottleneck is cost and latency. But for high-value applications like game development, it’s worth it.

Imagine: agent watches you play, spots visual bugs in real-time, fixes them while you continue testing.

Multimodal Prompting Tips

Based on Yegge’s workflow and my own experience:

  • Reference images are anchors. Put them in your project. Tell agents “match this.” They’ll iterate toward it.
  • Screenshot after every claim. Agent says it’s fixed? Screenshot. Agent says layout matches? Screenshot. Trust but verify.
  • Use descriptive filenames. reference-homepage.png, current-homepage.png. Agents parse filenames for context.
  • Capture console errors too. Visual parity isn’t enough. Errors that show up in the new version’s console but not the old one are a regression (see the sketch after this list).
  • Video is expensive but necessary. For animations and timing, screenshots won’t cut it. Budget for video analysis.

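The console capture from the tips above is a couple of event listeners in Playwright. A minimal sketch, with a placeholder URL:

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Collect console errors and uncaught exceptions alongside the screenshots.
const errors: string[] = [];
page.on('console', (msg) => { if (msg.type() === 'error') errors.push(msg.text()); });
page.on('pageerror', (err) => errors.push(err.message));

await page.goto('https://migrated.example.com'); // placeholder URL
await page.screenshot({ path: 'current-homepage.png', fullPage: true });
console.log(errors.length ? `Console regressions:\n${errors.join('\n')}` : 'Console clean');
await browser.close();
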
Combining the Approaches

Visual verification pairs well with the other tools for handling agent unreliability:

  • Ralph Wiggum: Autonomous loops that keep running until acceptance criteria pass. Combine with Headless for visual acceptance criteria: “loop until screenshot matches reference.”
  • Beads: Session memory. File an issue: “Map flickering - intermittent render bug.” Attach the video. Next session, agent has visual evidence.
  • Visual verification: Evidence of current state. The screenshot doesn’t lie.

Each reinforces the others. Beads tracks what needs fixing. Ralph Wiggum keeps iterating. Visual verification proves when it’s actually done.


Agents are optimistic. They want to report success. Visual verification adds accountability. The screenshot doesn’t lie. The video shows what actually happened.

Make them prove it.