Sam Altman launched GPT-5.4 yesterday. Native computer use. Tool search. 75% on OSWorld, surpassing human performance. By any technical measure, it’s OpenAI’s best model.
He launched it into a firestorm.
295% spike in ChatGPT uninstalls. Claude at #1 on the App Store. 2.5 million people in the QuitGPT movement. 900 tech workers signing letters supporting the company OpenAI just undercut. Chalk on the sidewalk outside OpenAI’s offices: “Where are your red lines?”
Gizmodo’s headline: “OpenAI, in Desperate Need of a Win, Launches GPT-5.4.”
The Model Is Good
Let’s be clear about what GPT-5.4 actually delivers. The enterprise and agentic improvements are real:
- Computer use: 75% on OSWorld-Verified, up from GPT-5.2’s 47.3%. The first general-purpose model to beat human performance (72.4%) at desktop navigation
- Knowledge work: 83% on GDPval across 44 occupations, up from 70.9%
- Tool search: Instead of loading all tool definitions upfront, the model discovers and retrieves them on demand. 47% token reduction on Scale’s MCP Atlas benchmark with no accuracy loss
- Hallucination reduction: 33% fewer false claims, 18% fewer errors per response
- Legal work: 91% on BigLaw Bench. Spreadsheet modeling at 87.3%, up from 68.4%
These aren’t incremental. Computer use nearly doubling is a generational jump. Tool search is genuinely clever engineering. The knowledge work benchmarks suggest a model that’s becoming seriously useful for professional services beyond coding.
Claude Code shipped tool search with Opus 4.6 in February: the model receives a lightweight tool index and retrieves full definitions on demand. GPT-5.4 adopts the same pattern, with a 47% token reduction on Scale’s MCP Atlas benchmark. The approach is sound. The framing of it as an OpenAI innovation is generous.
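The pattern itself is simple enough to sketch. Here is a hypothetical illustration of on-demand tool retrieval (the names `TOOL_INDEX`, `FULL_DEFINITIONS`, and `get_tool_definition` are illustrative, not any vendor's actual API): the model's context starts with one-line summaries, and full schemas are pulled in only for the tools the model decides to use.

```python
# Hypothetical sketch of on-demand tool retrieval. Instead of loading every
# full tool schema into context upfront, the model first sees a lightweight
# index and requests full definitions as needed.

TOOL_INDEX = {
    "search_flights": "Find flights between two airports",
    "book_hotel": "Reserve a hotel room",
    "get_weather": "Current weather for a city",
}

FULL_DEFINITIONS = {
    "get_weather": {
        "name": "get_weather",
        "description": "Current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    # ...full schemas for the other tools would live here too
}

def build_initial_context(task: str) -> str:
    """Send only one-line tool summaries upfront -- a fraction of the tokens."""
    index_lines = [f"- {name}: {summary}" for name, summary in TOOL_INDEX.items()]
    return f"Task: {task}\nAvailable tools (summaries only):\n" + "\n".join(index_lines)

def get_tool_definition(name: str) -> dict:
    """Retrieve the full schema only when the model asks for this tool."""
    return FULL_DEFINITIONS[name]

context = build_initial_context("What's the weather in Tokyo?")
# The model reads the index, decides it needs get_weather, then requests it:
schema = get_tool_definition("get_weather")
```

The token savings come from the gap between a one-line summary and a full JSON schema, which for real tool catalogs can be hundreds of lines each.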
The Model Is Also Marginal
For developers already on GPT-5.3-Codex, the gains are thin.
SWE-Bench Pro: 56.8% to 57.7%. That’s a 0.9 percentage point improvement. Opus 4.6 sits at 80.8% on SWE-Bench Verified. Gemini 3.1 Pro hits 80.6%. The frontier models are converging within a percentage point of each other on the benchmarks that matter most to developers.
“It should be clear with the releases of both Opus 4.6 and Codex 5.3 that benchmark-based release reactions barely matter. For this release, many developers barely looked at evaluation scores.”
— Interconnects (AI newsletter)
The era of a single “best” model is over. February-March 2026 gave developers three near-equivalent options: Opus 4.6 for complex multi-file architecture, GPT-5.4 for enterprise workflows and computer use, Gemini 3.1 Pro for cost-sensitive reasoning at $2/$12 per million tokens. The smart play is routing between all three, not pledging loyalty.
Gary Marcus called the broader GPT-5 family “overhyped and underwhelming.” IEEE Spectrum ran “GPT 5’s Rocky Launch Highlights AI Disillusionment.” The GPT-5 model lineup has expanded so rapidly since mid-2025 that choosing the right variant is now its own problem.
The Week That Rewrote the Rules
You can’t evaluate GPT-5.4 without the week it launched into. I covered the Pentagon crisis in detail in Same Terms, Different Treatment, but the timeline bears repeating:
- February 25: Pentagon moves to blacklist Anthropic for refusing “all lawful purposes” language, including autonomous weapons and mass surveillance
- February 27: Trump orders all federal agencies to stop using Anthropic. OpenAI announces its own Pentagon deal hours later
- March 1: Claude hits #1 on the App Store. QuitGPT begins
- March 3: Altman calls his own deal “opportunistic and sloppy,” renegotiates
- March 4: Amodei calls OpenAI’s messaging “straight up lies.” 900 employees sign letters backing Anthropic
- March 5: GPT-5.4 launches. Altman tells the Morgan Stanley conference that companies shouldn’t be “more powerful than the government”
OpenAI launched a flagship model into a consumer revolt triggered by their own actions. The technical narrative of GPT-5.4 is inseparable from the trust narrative of the week that surrounded it.
The Wall Street Journal reported that U.S. strikes in Iran used Anthropic’s technology hours after Trump announced the ban. The military was using Claude in active operations while simultaneously labeling the company a national security threat.
“There Is No Wall”
Altman’s recurring claim: there is no wall on AI scaling. He first said it in late 2024 when reports emerged that all major labs were hitting data limits. He’s echoed it since. GPT-5.4 is supposed to be the proof.
The technical case is mixed. Computer use and knowledge work jumped significantly. Coding didn’t. The frontier is converging. Whether that’s a wall or a plateau depends on which benchmark you cherry-pick.
But the more interesting wall is one Altman doesn’t talk about.
OpenAI’s revenue run rate tops $25 billion. ChatGPT has 900 million weekly active users. And in a single weekend, 1.5 million of them walked away over a Pentagon contract. One-star reviews spiked 775%. Downloads fell 13% day-over-day.
The technical wall may not exist. The trust wall is very real.
“Many employees really respect Anthropic for standing up to the Pentagon and are frustrated with OpenAI’s handling of their own contract.”
— A current OpenAI employee, to CNN (on condition of anonymity)
100 of Altman’s own employees signed a letter backing his competitor. Research scientist Aidan McLaughlin posted publicly: “I personally don’t think this deal was worth it.” OpenAI’s safety researcher Jasmine Wang called for independent legal counsel to analyze the contract language.
When your own safety team questions the deal, the “no wall” pitch rings hollow. The wall isn’t compute or data. It’s whether anyone trusts you with what you’re building.
The Competitive Landscape Just Flipped
Before this week, the AI market was a three-way race on capabilities. After this week, it’s a two-dimensional competition: capabilities and trust.
Anthropic got punished by the government and rewarded by the market. Claude downloads surged 37% on Friday, then 51% on Saturday. Free users up 60% since January. Paid subscriptions more than doubled this year. Anthropic’s revenue run rate hit $19 billion. The company that got blacklisted is growing faster than ever.
OpenAI got rewarded by the government and punished by the market. The QuitGPT movement, the chalk protests, the internal dissent. Altman spending a weekend doing damage control instead of celebrating a major model launch. GPT-5.4’s release announcement buried under Pentagon headlines.
Google employees signed letters too. 800+ Alphabet workers joined the “We Will Not Be Divided” letter. The Center for American Progress called for Congress to investigate. Lawfare concluded the supply chain risk designation is “close to untenable” legally.
The model quality gap is nearly closed. Opus 4.6 leads coding (80.8% SWE-Bench Verified). GPT-5.4 leads computer use (75% OSWorld) and knowledge work (83% GDPval). Gemini 3.1 Pro leads reasoning (94.3% GPQA Diamond) at the lowest price. Route between all three. Don’t bet your workflow on one vendor’s ethics holding up.
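That routing advice can be sketched in a few lines. This is a hypothetical illustration, not a production recommendation: the task categories and model identifiers below are assumptions drawn from the benchmark leads described above.

```python
# Hypothetical task-based router across the three frontier models,
# mapping task categories to the benchmark leaders named in the text.

ROUTES = {
    "coding": "claude-opus-4.6",       # leads SWE-Bench Verified (80.8%)
    "computer_use": "gpt-5.4",         # leads OSWorld (75%)
    "knowledge_work": "gpt-5.4",       # leads GDPval (83%)
    "reasoning": "gemini-3.1-pro",     # leads GPQA Diamond (94.3%)
}

# Cheapest option as the fallback for anything uncategorized.
DEFAULT_MODEL = "gemini-3.1-pro"

def route(task_type: str) -> str:
    """Pick a model by task category; fall back to the cost-sensitive default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)

print(route("coding"))        # routes to the coding leader
print(route("summarize"))     # unrecognized category -> cheapest model
```

The point of a layer like this is exactly the one the paragraph makes: no single vendor wins every dimension, so the dispatch logic, not the loyalty, is the durable asset.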
What GPT-5.4 Actually Tells Us
Strip away the politics and GPT-5.4 reveals something important about where AI is heading:
Agentic is the new frontier. Computer use, tool search, multi-step workflows. The model improvements that matter most aren’t “smarter answers” but “can operate software autonomously.” This is the transition from chatbot to agent, and GPT-5.4 is further along that path than anything OpenAI has shipped.
Benchmarks are converging. Three labs within a percentage point on SWE-Bench. The differentiation is shifting from raw capability to specialization, pricing, and ecosystem. Gemini 3.1 Pro at 60% cheaper output than Opus is a real competitive lever. GPT-5.4’s computer use lead is another.
The model zoo is getting confusing. GPT-5.0, 5.2, 5.3-Codex, 5.4, 5.4 Thinking, 5.4 Pro. Six models in eight months. OpenAI’s naming scheme is becoming a liability.
Enterprise is the growth play. GPT-5.4’s strongest gains are in legal, financial, and professional services. The model is clearly aimed at enterprise customers who care about GDPval scores, not developers who care about SWE-Bench.
The Question Nobody’s Asking
Here’s what I keep coming back to. Altman said at the Morgan Stanley conference that he’s “terrified of a world where AI companies act like they have more power than the government.” He framed Anthropic’s refusal to comply as corporate overreach.
But Anthropic didn’t refuse the government. Anthropic refused to enable autonomous weapons and mass domestic surveillance. Two red lines. The same two that OpenAI’s own employees support, that 900 tech workers signed letters defending, that consumers revolted over.
The question isn’t whether companies should be more powerful than the government. The question is whether the government should be able to strip safety guardrails from AI by threatening the companies that build them. This week proved it can.
GPT-5.4 is a very good model. It’s also a model launched by a company that undercut its competitor’s safety stance, admitted the move was “sloppy,” renegotiated under public pressure, and still can’t release the actual contract language for independent review.
The wall Altman should worry about isn’t scaling. It’s the growing number of users, employees, and developers who watched this week unfold and started building exits.
“The main reason they accepted and we did not is that they cared about placating employees, and we actually cared about preventing abuses.”
— Dario Amodei, Anthropic CEO (internal memo)