Opus 4.8 landed on May 28, forty-one days after 4.7, at exactly the same price. The press wrote up the benchmark: agentic coding from 64.3% to 69.2%, 84% on Online-Mind2Web, the strongest browser-agent model Anthropic has tested. Real numbers, real lead.

But the line Anthropic actually led with wasn’t a score. It was honesty. Opus 4.8 is “around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked,” and more honest about its own progress on a task. That is the headline. And it has a token bill.

The Feature Is Doubt

Watch what people on X singled out in the first day. Not the coding score. Self-correction. One user posted the model catching itself mid-answer: my first instinct was wrong, hence the correction. Another liked that you can hand it a goal and it keeps working, checking, until the end state is actually reached. The praise clusters around the same trait: the model became more willing to say it isn’t sure, and to go back and fix what it got wrong.

That is genuinely the thing that matters for agentic work. I’ve written before that AI finding the bug just relocates the question of who fixes it, and that agents merging code just moves the bottleneck to who ships it. The constraint in a fleet was never raw capability. It was trust. A model that flags its own flaws moves the human checkpoint later, which is worth real money.

The problem is that every honest behaviour is a token spender.

Self-Doubt Has a Token Bill

Think about what “more honest about its progress” mechanically requires:

  • Checking your own work is more steps. A model that re-reads its diff before declaring done runs longer than one that declares done and stops.
  • Flagging uncertainty is more output. “I’m not confident about the edge case in the retry logic” is tokens that a cheerful, wrong “done!” never spends.
  • Asking the clarifying question is a round trip. Cheaper to guess. The honest move costs a turn.

Anthropic’s own framing is that 4.8 completes tasks in fewer steps, and that’s plausible: fewer wasted steps chasing a wrong assumption. But fewer wasted steps is not fewer tokens. Verification is real work, and the model now does more of it by default.

The Effort Dial Is a Confidence Tax

Here’s where it stops being a side effect and becomes the product. Claude Code now exposes effort as a literal dial, labelled Faster at one end and Smarter at the other: low, medium, high (the default), xhigh, max. The pitch is blunt. Slide toward Smarter and the model spends more tokens to get better results. You are buying certainty by the token, on a slider.

The far Smarter end is the tell. Past a divider, beyond max, Claude Code offers a setting it calls ultracode: xhigh effort fused with Dynamic Workflows, the research-preview feature that lets Claude plan a job and fan out hundreds of parallel subagents in one session. That’s confidence through redundancy and breadth, and it multiplies spend by the agent count. The most thorough notch on the dial is, by construction, the most expensive one. This is the harness graduating from a DIY pattern into a priced product, and the thing it prices is thoroughness.

The same week, Anthropic shipped the matching idea into Claude Code as a free, hook-based security reviewer that flags vulnerabilities as you write and fixes them in the same session. Its cheapest layer is deterministic and free; the deeper layers spend model tokens, capped on purpose. That’s a post of its own, but it’s the same thesis: cheap checks for the obvious, metered model judgment for the doubt that earns it.

A wrong answer is cheap. A model willing to keep working until it’s sure is not. Opus 4.8 turned doubt into a setting, and the setting bills by the token.

The Timing Is the Tell

None of this would matter much if tokens were free. They’re about to stop being free in exactly the place this model is built to run.

On June 15, Anthropic moves automation off the flat subscription and onto a metered credit at full API rates. claude -p, the Agent SDK, GitHub Actions, third-party agents: all metered. So the same month the meter starts, the flagship model’s best new trait is a willingness to spend more to be sure, and the headline feature is a dial that spends even more.

Put those together, but get the doors right. Interactively, ultracode and Max effort still run on your subscription. They just drain a fixed budget faster, and once it’s gone the overage bills at API rates, by opt-in. Automated, the same behaviour goes straight onto the June 15 meter at full API rates. Two doors, one destination: the more the model second-guesses itself and fans out, the faster you arrive at per-token pricing. The model got honest enough to be worth paying for, right as the thing you pay for started switching from a seat to a token.

You Were Already Paying for This

There’s a reading of this post that’s wrong, and it might be the truer one, so it’s worth saying out loud. The claim “honesty is expensive” assumes the verification is new. For anyone already disciplined about this, it isn’t.

If you were any good at driving these models, you were already spending these tokens. You had it weigh a second approach before committing to the first. You sent it back into the codebase to see how the change played against the callers, the tests, the thing three files over that quietly assumes the old behaviour. You made it argue against its own design. That was the work, and it cost tokens every single time.

Opus 4.8 didn’t add that cost. It absorbed it. The model now does by default what the careful operator was doing by hand: catching its own mistakes, flagging what it isn’t sure of, going back without being told to. The verification didn’t get more expensive. It moved inside the model.

Bundled, it may even be cheaper. A manual review pass is a blunt instrument: a whole extra round trip that re-examines everything, because you can’t know in advance which line is wrong. A model second-guessing itself is targeted, it spends the tokens on the part it’s actually unsure about, not the whole file. Doubt aimed by the thing that has it beats blanket suspicion applied from outside.

So who actually pays more? The people who weren’t doing it. If you shipped the first draft and let production find the flaws, 4.8 hands you a bill you used to defer, not a new one. The cost of being right was always there. You were either paying it in your own loops, or paying it later, in incidents.

What It Doesn’t Fix

The honest pushback, because “more honest model” is not the same as “free lunch”:

  • Self-reported honesty is still self-reported. “4x less likely to miss its own flaws” is a confidence interval, not a guarantee. It misses fewer. It does not miss none. The human checkpoint doesn’t disappear, it just gets a better first pass.
  • Confidence got more expensive right as it got metered. Budgeting a flat seat was easy. Budgeting a fleet running Max effort with variable fan-out is a forecast, not a number.
  • It’s modest, and it knows it. Anthropic called it a “modest but tangible improvement.” GPT-5.5 still wins agentic terminal coding on Terminal-Bench 2.1. Some users say the frontend and design sense feels a step back from 4.7. The vibe-check consensus was that the model is stronger than the app around it.
  • Honesty you don’t need is honesty you overpaid for. Plenty of work doesn’t need a model that double-checks itself and fans out subagents. Routine edits still run fine and cheap on Sonnet 4.6. Paying Max effort for a one-line fix is paying for doubt you didn’t have.
Match the effort to the task

The effort dial is a cost dial. Default to High, reach for xhigh or Max only when the cost of being wrong is higher than the cost of the tokens: a migration, a security-sensitive change, a refactor you can’t easily review. Instrument your headless runs with token logging now, before June 15 turns those counts into an invoice, so you learn which tasks actually justify the spend.

Closing

The most honest thing about Opus 4.8 isn’t in the model card. It’s the shape of the release. Anthropic built a model whose standout quality is restraint, the willingness to slow down, check, and admit doubt, and shipped it weeks before the billing change that turns all of that into a meter.

The cost of being right didn’t appear with 4.8. It moved. It used to live in your own verification loops, or further downstream in the incidents you shipped without them. Now it lives inside the model, on by default, with a dial and a subagent count attached. The honest model is the better model. Whether it’s the expensive one depends entirely on whether you were already paying.

This post was drafted by the model it’s about. The benchmark figures came back inconsistent across sources, so it went back and reconciled them before writing the number down. That extra pass is the feature. It also cost extra tokens.