Amazon built an internal leaderboard that ranked engineers by how much AI they used. Called Kirorank, it tracked token consumption across the company’s Kiro platform and an in-house agentic tool called MeshClaw, and it was introduced to push AI adoption.
It worked exactly as designed, which is to say it failed. Employees gamed it by pointing agents at pointless tasks to inflate their numbers and climb the board. Costs went up. Useful output didn’t. Amazon killed it.
What Happened
- The metric was tokens. Kirorank ranked people by AI activity, measured in consumption. The implicit message: more usage is better.
- Staff optimized the metric. They called it “tokenmaxxing”: artificially running up usage to rank higher. Some assigned agents to do unnecessary work purely to burn tokens.
- The bill noticed before the value did. A senior VP later conceded the board was built with good intentions but generated real extra cost from inflated usage.
- The fix was to measure outcomes. Amazon now tracks “normalized deployments” (AI-generated code that actually ships and is useful) instead of raw token counts.
This Is Goodhart’s Law With a Meter
The principle is old: when a measure becomes a target, it stops being a good measure. Amazon just gave it a 2026 costume. Token usage is an input. It tells you how much compute someone spent, not how much value they created. The instant you rank people by an input, you get more of the input, decoupled from anything you actually wanted.
What makes the AI version worse than the usual KPI-gaming is that the gaming has a direct, metered cost. A sandbagged sales number is free to fake. A tokenmaxxed leaderboard burns real GPU dollars every time someone games it. The metric doesn’t just mislead, it bills you for being misled.
An Amazon engineer, to the Financial Times: “There is just so much pressure to use these tools. Some people are just using MeshClaw to maximise their token usage.”
Read that again. The pressure was to use the tool, not to do the work better. That’s the tell that the metric had become the job.
Why Everyone Is About to Do This
It would be comforting to file this under “Amazon being Amazon,” but the incentive is everywhere. Leadership has spent fortunes on AI tooling and wants to see it landing. Value from AI is genuinely hard to measure. Token spend is trivially easy to measure. So orgs reach for the number they can count instead of the one that matters, and reward it.
I’ve made versions of this argument before: that the tools we measure with were designed for humans, not agents, and that vendor benchmarks measure the wrong thing on purpose. Kirorank is the internal-management edition. Same disease, new ward.
Measure the outcome, not the consumption. Shipped-and-survived changes, defect rates, cycle time, customer-facing impact, all of these resist gaming in a way token counts never will. The day you put usage on a leaderboard, you’ve told people their job is to use the tool. They will believe you. Amazon’s pivot to “normalized deployments” is the right instinct: count what ships, not what burns.
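To make the contrast concrete, here is a minimal sketch of how the same team can rank in opposite orders depending on which number goes on the board. Everything in it is hypothetical: the names, the figures, and the "shipped minus reverted" score are illustrative assumptions, not Amazon's definition of "normalized deployments" or anything Kirorank actually computed.

```python
from dataclasses import dataclass


@dataclass
class Engineer:
    name: str
    tokens_used: int       # raw consumption: the input
    changes_shipped: int   # AI-assisted changes that reached production
    changes_reverted: int  # of those, how many were later rolled back


def surviving_changes(e: Engineer) -> int:
    # Hypothetical outcome score: shipped changes that were not reverted.
    return e.changes_shipped - e.changes_reverted


def rank_by_tokens(team: list[Engineer]) -> list[str]:
    # The usage-board ordering: biggest spender first.
    ordered = sorted(team, key=lambda e: e.tokens_used, reverse=True)
    return [e.name for e in ordered]


def rank_by_outcome(team: list[Engineer]) -> list[str]:
    # The outcome ordering: most surviving changes first.
    ordered = sorted(team, key=surviving_changes, reverse=True)
    return [e.name for e in ordered]


# Invented data for illustration only.
team = [
    Engineer("ana", tokens_used=9_000_000, changes_shipped=2, changes_reverted=1),
    Engineer("ben", tokens_used=1_200_000, changes_shipped=7, changes_reverted=0),
    Engineer("cy", tokens_used=4_000_000, changes_shipped=4, changes_reverted=1),
]

print(rank_by_tokens(team))   # ['ana', 'cy', 'ben']
print(rank_by_outcome(team))  # ['ben', 'cy', 'ana']
```

In this toy data the biggest spender sits at the top of the token board and at the bottom of the outcome board. An outcome score like this one can still be gamed, by shipping trivial changes for instance, but gaming it at least requires shipping something.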
What This Isn’t
- Not an argument against measuring at all. You should know whether the expensive new tools are working. The failure was the choice of metric, not the act of measuring.
- Not proof AI is useless. Tokenmaxxing is a management failure, not a capability one. The same engineers, measured on outcomes, might be genuinely more productive. We just can’t tell from a usage board.
- Not a story of Amazon failing to notice. It caught the problem, and switching to deployments that matter was the correct fix, made quickly. The cautionary part is how many orgs are running a Kirorank right now and calling the rising number “adoption.”
The Takeaway
If you reward usage, you get usage. If you reward tokens, you get tokenmaxxing, and a bill for it. The only AI metric worth putting in front of people is the one that survives contact with their incentives: did something useful ship because of this. Everything else is paying to be lied to.


