Recently I opened my laptop to find a production event queue stalled. By the end of the day I had merged and deployed eight pull requests, closed the incident, and - in the quiet hours after - read through a greenfield candidate being built to eventually replace part of the same system. Seeing the two on the same day gave me the cleanest data point I have yet for what AI does and does not do in real engineering.

The Fire

The shape will be familiar to anyone who has run an event-driven backend against external vendor APIs.

A background-worker service was wedging on outbound HTTP calls. Every HTTP client had been registered with a long default timeout and an aggressive retry policy. Layered under the service’s own retry loop, those defaults turned a single upstream blip into a multi-minute wedge per event. The worker burned its lifetime inside one hung call. Events piled up. And the housekeeping job designed to surface exactly this situation had silently stopped ticking hours earlier, so the observability signal was already dead by the time anyone looked.
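The arithmetic of the wedge is worth making explicit. With illustrative numbers (all of them assumed - the real values are not quoted here), layered defaults compound multiplicatively:

```python
# All numbers are assumed for illustration, not the production values.
DEFAULT_TIMEOUT_S = 100   # long default HTTP client timeout
CLIENT_RETRIES = 3        # aggressive client-level retry policy
SERVICE_RETRIES = 2       # the service's own retry loop layered on top

# One upstream blip: every client attempt runs to its full timeout,
# and the service-level loop replays the whole stack of attempts.
worst_case_s = DEFAULT_TIMEOUT_S * (1 + CLIENT_RETRIES) * (1 + SERVICE_RETRIES)
print(worst_case_s / 60)  # -> 20.0 minutes one worker spends wedged on one event
```

With these toy numbers a single failing call costs twenty minutes of worker lifetime - exactly the shape of a multi-minute wedge per event, multiplied across every event behind it in the queue.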

Four root causes. Three layers. All latent for months, all activated together by one upstream API flap.

The Fix Chain

The chain, in order:

1. Strip runaway retry policies and add per-call timeouts.
2. Wrap every background handler in a deadline at the kernel level.
3. Add drop-on-full backpressure to fire-and-forget telemetry.
4. Port the pattern into sibling services.
5. Rewrite a silently-wedging monitor.
6. Pin every behaviour with targeted tests.
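The first link of the chain - a per-call timeout under a bounded retry budget - can be sketched in a few lines. This is a minimal asyncio sketch with toy values, not the production code; `flaky_vendor` is an invented stand-in for the outbound HTTP call:

```python
import asyncio

# Toy values for the sketch, not the production numbers.
PER_CALL_TIMEOUT = 0.05
MAX_ATTEMPTS = 3

async def flaky_vendor(state):
    # Stand-in for the outbound HTTP call: hangs twice, then answers.
    state["calls"] += 1
    if state["calls"] < 3:
        await asyncio.sleep(60)  # simulate a hung upstream call
    return "ok"

async def call_with_timeout(state):
    # Per-call deadline plus a hard attempt cap: an upstream blip now
    # costs fractions of a second, not the worker's lifetime.
    for attempt in range(MAX_ATTEMPTS):
        try:
            return await asyncio.wait_for(flaky_vendor(state), PER_CALL_TIMEOUT)
        except asyncio.TimeoutError:
            if attempt == MAX_ATTEMPTS - 1:
                raise

state = {"calls": 0}
print(asyncio.run(call_with_timeout(state)))  # -> ok, after two timed-out attempts
```

The point is the shape, not the numbers: the timeout bounds each attempt, the attempt cap bounds the whole call, and nothing underneath gets to retry on its own.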
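The backpressure link is equally small. A minimal sketch, assuming a simple in-process queue (`TELEMETRY` and its capacity are invented for the demo): fire-and-forget telemetry drops data points under load instead of ever blocking the worker.

```python
import queue

# Toy capacity for the demo; the real buffer would be far larger.
TELEMETRY = queue.Queue(maxsize=2)

def emit(event):
    # Drop-on-full backpressure: a slow telemetry sink costs data
    # points, never worker threads.
    try:
        TELEMETRY.put_nowait(event)
        return True   # accepted
    except queue.Full:
        return False  # dropped; the worker carries on

print([emit(i) for i in range(3)])  # -> [True, True, False]
```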

One AI session from morning coffee to 2am. The agent drove the mechanics: logs, database queries, PR descriptions, git and deploy flows, the test suite, watching new pods come up. Three tools, three repositories, ten-step sequences per fix, no lost context.

I kept the judgement. Safe to touch this many rows without ops? Deploy now or wait for morning traffic? Ship with this follow-up queued or hold the deploy? The AI could propose answers and did. The call was mine, and it should have been.

Most telling: a separate reviewer model caught a bug the first agent missed - a parameter-type precedence issue in the bulk SQL that forced an implicit column conversion and destroyed the index seek. Clean-looking query. Latent plan-killer. Three roles, three modes. None redundant.
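The plan-killer generalises: any conversion forced onto the indexed column side of a predicate turns a seek into a scan. The production bug was a parameter-type precedence issue; the sketch below reproduces the same failure shape in SQLite by making the column-side conversion explicit (table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ext_id TEXT, payload TEXT)")
conn.execute("CREATE INDEX ix_events_ext_id ON events (ext_id)")

def plan(sql, param):
    # Join the detail column of the query plan into one string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (param,)).fetchall()
    return " ".join(r[-1] for r in rows)

# Predicate on the raw column: the planner can seek the index.
seek = plan("SELECT * FROM events WHERE ext_id = ?", "abc")
# Conversion applied to the column: the index is useless, full scan.
scan = plan("SELECT * FROM events WHERE CAST(ext_id AS INTEGER) = ?", 7)

print("USING INDEX" in seek, "USING INDEX" in scan)  # -> True False
```

Both queries return identical rows and pass identical tests; only the plan differs, which is why this class of bug survives review.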

AI did not replace me that day. It let me be the staff engineer across eight pull requests instead of writing two by hand.

AI took the tax on breadth. Across the incident I was architect, debugger, reviewer, deploy engineer, DBA, test author, and PR author. That breadth normally eats a day in context switches. The agent ate the switching tax instead. Coding was never the bottleneck; coordination was.

The Twist

Later that night I read through the greenfield candidate. Months of work. Hundreds of tests. Impressive velocity. Clean architecture. All the things you want in a fresh system.

Every class of bug we’d fixed that day was already present. Wrong cancellation token threaded through multiple layers - the exact pattern that wedged the mature system’s worker. Overly broad exception catches swallowing failure. Validate-then-commit outside a transaction. Layered timeout-retry-circuit-breaker resilience with no per-attempt timeout - the same shape we had just stripped out of the old system.
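The validate-then-commit bug deserves a concrete shape, because it passes every single-threaded test. A minimal SQLite sketch (the `seats` table and `reserve` helper are invented for illustration): the fix is to fold the validation into the same atomic statement or transaction as the write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (id INTEGER PRIMARY KEY, taken INTEGER DEFAULT 0)")
conn.execute("INSERT INTO seats (id) VALUES (1)")
conn.commit()

def reserve(conn, seat_id):
    # Validate and commit in ONE atomic step. The broken shape reads
    # `taken` first and writes later with no transaction around the
    # pair, so two workers can both pass validation and double-book.
    with conn:  # sqlite3 wraps this block in a transaction
        cur = conn.execute(
            "UPDATE seats SET taken = 1 WHERE id = ? AND taken = 0", (seat_id,)
        )
        return cur.rowcount == 1  # True only for the worker that won

print(reserve(conn, 1), reserve(conn, 1))  # -> True False
```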

Not because the builder is bad. These are latent bugs. They look fine on paper, fine in hundreds of tests, fine in staging. They only turn bad the first time a vendor flaps, a database hits lock contention, or a pod restarts under load. The greenfield system will look fine right up until it has run in production long enough to have a bad day.

The Edge Cases Are the System

A mature system is not defined by its features. The features are the easy part. What a mature system encodes, and what a rewrite does not get for free, is the accreted edge-case knowledge from years of hitting production: that a specific vendor returns error bursts during maintenance and needs a wider backoff; that an external pipeline occasionally hangs for nearly two minutes without error; that a specific index seek dies under implicit type coercion and silently degrades from milliseconds to minutes; that a rolling-deploy pattern aborts in-flight writes, so a handler’s lifetime token must not link to the shutdown token.
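The last item in that list - a handler lifetime token that must not link to the shutdown token - has a direct analogue in most async runtimes. A Python asyncio sketch (all names invented): the in-flight write is shielded, so cancelling the handler, as a rolling deploy does, cannot abort it mid-commit.

```python
import asyncio

async def commit_write(store):
    # Stand-in for an in-flight database write.
    await asyncio.sleep(0.01)
    store.append("committed")

async def handler(store):
    # Shield the write: cancelling the handler (rolling-deploy
    # shutdown) must not abort a write that has already started.
    await asyncio.shield(commit_write(store))

async def main():
    store = []
    task = asyncio.create_task(handler(store))
    await asyncio.sleep(0)     # let the handler start its write
    task.cancel()              # the "shutdown token" fires mid-flight
    try:
        await task
    except asyncio.CancelledError:
        pass                   # the handler is cancelled...
    await asyncio.sleep(0.05)  # ...but the shielded write still lands
    return store

print(asyncio.run(main()))  # -> ['committed']
```

The unshielded version of this test would leave `store` empty - a half-finished write aborted by a routine deploy, which is precisely the scar the mature system already carries.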

None of this is in the architecture, the feature list, or the greenfield tests. All of it was paid for in incidents.

The Incident IS the Spec

Every incident that leaves the original system running - which is almost all of them - adds a line to the unwritten specification of what the system actually does. Timeout values, retry bounds, lock hints, token propagation rules. The spec lives in the commits, the post-mortems, and the minds of the engineers who were on call. A greenfield rewrite inherits none of it.

This is not an argument against rewrites. It is an argument against the fantasy - gaining currency because AI makes greenfield feel fast - that you can AI your way to a replacement in a quarter. You can write the features in a quarter. You cannot recreate the scars in a quarter. The scars are most of what makes the old system actually work.

The Division of Labour

AI’s most valuable mode is not replacing the engineer or writing ninety percent of the code. It is removing the coordination tax. In an incident I’m not bottlenecked by typing speed. I’m bottlenecked by how many windows I can hold in my head. An AI that runs the windows while I keep the taste and the final call is a dramatically more productive system than either of us alone.

If you’re experienced and not yet comfortable running half a dozen parallel threads through an agent during an outage, you’re leaving capacity on the floor. If you’re junior, the skill to invest in is not prompting - it’s the judgement that lets you contribute when the AI writes the code.

And if you’re the person insisting a messy, integration-heavy production system can be AI-rebuilt from scratch because the new prototype looks cleaner: you’re looking at the code. Go look at the incidents. That is where the system actually lives.