Two weeks ago I wrote about Kiro deleting and recreating an AWS production environment. I ended that post with a specific observation: “Amazon will weather this. Cost Explorer in one China region isn’t existential.”
I was right about the weathering. Wrong about the timeline.
Amazon.com Goes Down
On March 5, Amazon.com went down for six hours. Checkout, pricing, accounts. Not an internal AWS service in a single region. The storefront. The thing that makes the money.
The cause: a faulty software deployment following AI-assisted changes.
Amazon hasn’t confirmed Kiro was directly involved, and they likely never will. But the pattern is identical to the December incidents: AI-assisted code changes pushed to production, insufficient review, cascading failure. The same failure mode, at a scale that’s harder to dismiss.
- Mid-December: Kiro deletes and recreates Cost Explorer environment. 13-hour outage in one China region.
- December (separate incident): Amazon Q Developer causes service disruption under similar circumstances.
- March 5: Amazon.com goes down for 6 hours. Checkout, pricing, accounts affected.
- March 10: Emergency engineering meeting convened by SVP Dave Treadwell.
Three incidents in three months. Each one bigger than the last.
The Emergency Meeting
On March 10, SVP Dave Treadwell convened an emergency engineering meeting. The outcome: a new policy requiring senior engineer sign-offs for AI-assisted code deployed by junior staff.
Read that again. Amazon’s response to AI-caused outages is human review checkpoints. Deterministic guardrails. Exactly the kind of safeguards I wrote about needing to exist before the mandate, not after it.
This is the right response. It’s also three months late.
The 80% weekly Kiro usage mandate came first. The peer review requirements came after the outages. The sign-off policy came after the retail site went down. Every safeguard has been reactive, bolted on after damage that was - to quote their own senior engineer - “entirely foreseeable.”
The pattern, every time: ship the capability. Mandate adoption. Discover the failure mode in production. Add the guardrail. Blame the human.
The Internal Rebellion
Here’s the part Amazon really doesn’t want you to know: approximately 1,500 engineers have signed an internal petition pushing for Claude Code access.
Their argument is straightforward. Claude Code outperforms Kiro on multi-language refactoring. Engineers who know which tool works best for the job are being overridden by product strategy. The 80% mandate isn’t about engineering quality. It’s about adoption metrics for an internal product.
Exceptions to the Kiro mandate now require VP-level approval. Amazon is spending more organizational energy enforcing tool adoption than ensuring tool safety.
This mirrors exactly what I described with Microsoft and Copilot: mandate your own tool, track usage metrics, tie it to performance reviews. Engineers route around it. The difference is that Microsoft’s engineers are quietly paying out of pocket for alternatives. Amazon’s engineers are organizing.
1,500 signatures isn’t a preference survey. It’s a signal that the people closest to the work believe the mandate is actively harming their productivity. When VP-level approval is required to use a different text editor, something has gone structurally wrong.
The Denial Pattern
After the December incident, Amazon said it was “user error, not AI error.” The engineer had too-broad permissions. It was “a coincidence that AI tools were involved.”
After the March incident, the framing shifted slightly. No longer “coincidence” - now it’s “faulty software deployment.” But the AI-assisted qualifier is buried. The emergency meeting happens behind closed doors. The new sign-off policy is internal, not public.
The denial is evolving, not disappearing. Each incident gets a slightly more sophisticated explanation for why the systemic issue isn’t systemic:
- December: “User error. Permissions were too broad.”
- March: “Deployment error. Process wasn’t followed.”
- Next time: ???
The common thread they won’t say out loud: we mandated adoption of agentic AI tools faster than our safety infrastructure could keep up. Every post-incident fix confirms this. You don’t add senior sign-off requirements unless the existing process failed. You don’t convene emergency SVP meetings unless the problem is bigger than any single team.
What This Actually Means
This isn’t about Amazon being uniquely reckless. They’re just the most visible example of a pattern playing out everywhere. The sequence is predictable:
- Capability ships fast. Kiro went from preview to GA to company-wide mandate in five months.
- Safeguards ship slow. Peer review, sign-off policies, and permission scoping arrived months after the mandate.
- Incidents fill the gap. The space between “capability deployed” and “safeguards in place” is where outages live.
The question isn’t whether AI agents cause production incidents. They will. Stochastic systems with deterministic authority will always find failure modes that nobody anticipated. The question is whether your organization builds safety culture faster than it ships capabilities.
Amazon’s answer, so far, is no. The mandate came first. The guardrails came after the fires.
The good news: senior sign-offs and mandatory review are real safeguards. They’re deterministic. They don’t care what the agent thinks is optimal. That’s exactly the Prettier pattern: make the dangerous path structurally harder.
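To make the "structurally harder" point concrete, here's a minimal sketch of what a deterministic merge gate could look like. Everything here is hypothetical: the `ChangeRequest` fields and the `may_deploy` rule are illustrative, not Amazon's actual tooling. The point is that the rule is plain code, not a model's judgment.

```python
# Hypothetical deterministic guardrail: a merge gate that refuses
# AI-assisted changes from junior staff unless a senior engineer
# has signed off. Names and fields are illustrative only.

from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    author_level: str                    # e.g. "junior" or "senior"
    ai_assisted: bool                    # flagged by tooling or self-reported
    approvals: list = field(default_factory=list)  # levels of approvers

def may_deploy(cr: ChangeRequest) -> bool:
    """AI-assisted changes by non-senior authors require at least
    one senior approval. The agent's opinion of its own change
    never enters the decision."""
    if cr.ai_assisted and cr.author_level != "senior":
        return "senior" in cr.approvals
    return True

# A junior engineer's AI-assisted change with no senior sign-off is blocked.
blocked = may_deploy(ChangeRequest("junior", True))
# The same change with a senior approval passes.
allowed = may_deploy(ChangeRequest("junior", True, approvals=["senior"]))
```

The design choice is the whole argument: the check runs before deployment, it is a few lines of boring logic, and no amount of agent confidence can route around it.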
The bad news: it took a retail outage affecting millions of customers to get there. And the 80% mandate is still in place.
“We’ve already seen at least two production outages. The engineers let the AI agent resolve an issue without intervention. The outages were small but entirely foreseeable.”
That quote was from December. It’s March. The outages aren’t small anymore.