February 5, 2026. Anthropic ships Opus 4.6 with agent teams, 1M token context, and a “vibe working” thesis. An hour later, OpenAI drops GPT-5.3-Codex. Same day. Same benchmarks. Different energy.
One CEO was talking about vibe working. The other was writing a 400-word essay at 6am about Super Bowl ads. GPT-5.3-Codex deserved its own headline. Its CEO made sure it didn’t get one.
What GPT-5.3-Codex Actually Is
OpenAI’s most capable agentic coding model. But the pitch is bigger than coding.
GPT-5.3-Codex combines the coding performance of GPT-5.2-Codex with the reasoning capabilities of GPT-5.2, and runs 25% faster. It's available on all paid ChatGPT plans across the Codex app, CLI, IDE extension, and web.
The standout claim: first model instrumental in creating itself. The Codex team used early versions to debug its own training pipeline, manage deployment, and diagnose test results. Self-referential training isn’t new conceptually, but claiming it publicly at this scale is.
“Instrumental in creating itself” is doing heavy lifting here. The model found bugs in its own training pipeline and helped deploy itself. Whether that’s a milestone or marketing depends on how much weight you give to “instrumental.”
Beyond raw coding:
- Interactive collaboration: Steer the model mid-task without losing context. Less “fire and forget,” more working alongside it
- Computer use: OSWorld-Verified jumped to 64.7% from 38.2%. The model can operate desktop environments, not just write code
- Cybersecurity: First model OpenAI classifies as “High capability” for security tasks, with $10M committed in API credits for cyber defense
- Professional work: GDPval at 70.9%: financial analysis, document generation, slide decks. The delegation thesis gets more infrastructure
The Benchmark Wars
Both Opus 4.6 and GPT-5.3-Codex launched the same day with competing claims.
| Benchmark | GPT-5.3-Codex | Opus 4.6 | Notes |
|---|---|---|---|
| Terminal-Bench 2.0 | 77.3% | 65.4% | GPT reclaims the lead |
| SWE-Bench Pro | 56.8% | - | Different from Opus’s SWE-Bench Verified (80.8%) |
| OSWorld-Verified | 64.7% | - | Computer use, not just coding |
| GDPval | 70.9% | 1,606 Elo | Different scoring systems, both strong |
| Cybersecurity CTF | 77.6% | - | New category for GPT-5.3 |
| SWE-Lancer IC Diamond | 81.4% | - | Freelance-style engineering tasks |
OpenAI reports SWE-Bench Pro. Anthropic reports SWE-Bench Verified. Different benchmarks, different difficulty levels. Terminal-Bench 2.0 is the only direct comparison, and GPT-5.3-Codex’s 77.3% is a meaningful jump over Opus 4.6’s 65.4%. But OpenAI’s numbers are self-reported at “xhigh” reasoning effort. Independent testing will tell the real story.
The Codex app hit 500,000 downloads in the three days since its Monday launch. OpenAI is pushing hard on product surface: app, CLI, IDE, web. Anthropic's coding story is still primarily Claude Code in the terminal. I've been using Codex as a strategic architect alongside Claude for months. GPT-5.3-Codex makes that pairing stronger.
The Rant
Same morning GPT-5.3-Codex dropped, Sam Altman posted a 400-word essay on X about Anthropic’s Super Bowl campaign. 8.7 million views. 6,100 replies. The ratio was not kind.
The ads: four commercials with the tagline “Ads are coming to AI. But not to Claude.” One depicts a man asking a chatbot for advice about his mom, only for the response to twist into an ad for “Golden Encounters,” a fictional cougar-dating site. Funny. Pointed. Directly mocking OpenAI’s announced plan to put conversation-specific ads in ChatGPT.
Altman started diplomatically:
"First, the good part of the Anthropic ads: they are funny, and I laughed." — Sam Altman, on X
Then it escalated. He called the ads “clearly dishonest.” Called Anthropic’s approach “doublespeak.” Said they “serve an expensive product to rich people.” Accused them of wanting to “control what people do with AI” and blocking companies from their coding product “including us.” Called them “authoritarian.” Twice.
The finale: “One authoritarian company won’t get us there on their own, to say nothing of the other obvious risks. It is a dark path.”
All of this over a Super Bowl ad about a cougar-dating site.
OpenAI has explicitly confirmed plans for conversation-specific ads in ChatGPT. Altman said free users would see ads. The Anthropic commercials depict literally what OpenAI announced they’d do. Calling this depiction “clearly dishonest” when it’s your stated business model is a bold communications strategy.
The internet did what the internet does with defensive essays about jokes.
The top reply, with 3,500 likes: “It’s a funny ad. You should have just rolled with it. Your tweet should have just said ‘The Anthropic ads are funny, and I laughed.’ Instead, it’s cope.”
Nikita Bier: "Comms advice: Never respond to playful humor with an essay. Just say 'damn, they cooked us.'"

@satn: "Anthropic: makes funny ad about ad-driven AI. Sam: writes 400-word defensive essay calling them authoritarian."
By reacting, Altman validated Anthropic’s entire marketing strategy. The ads were designed to provoke. They worked. And now instead of people discussing GPT-5.3-Codex’s Terminal-Bench scores, they’re discussing whether Sam Altman can take a joke.
The Reactive Pattern
This isn’t the first time OpenAI has shipped under competitive pressure, and it shows.
GPT-5.2 was the code red response to Gemini 3, pushed out in under a month. GPT-5.2-Codex followed a week later: almost certainly an early 5.3 checkpoint shipped ahead of schedule. OpenAI’s own blog says they used “early versions” of GPT-5.3-Codex during its training. Those early versions had to go somewhere. Now GPT-5.3-Codex drops the same day as Opus 4.6. And Altman’s rant is itself a reactive move: Anthropic runs ads, Altman can’t let it go, the conversation shifts from benchmarks to ego.
The pattern: competitor ships, OpenAI reacts, the reaction overshadows the substance. Every time. The models are strong enough to stand alone. The leadership keeps getting in the way.
What Feb 5 Actually Reveals
Strip away the drama, and the day tells you where this is heading.
The models are converging. Terminal-Bench, SWE-Bench, GDPval: both companies are within striking distance on every benchmark that matters. The era of one model being clearly better is ending. Model quality is becoming table stakes.
The business models are diverging. Anthropic is premium-only, no ads, enterprise-focused (80% of revenue from enterprise). OpenAI is free-tier with ads, consumer-focused, trying to bring “AI to billions who can’t pay for subscriptions.” Both are defensible. They’re fundamentally different bets.
The ecosystems are diverging faster. OpenAI has the Codex app (500K downloads), computer use, and web dev capabilities. Anthropic has Claude Code with agent teams, 1M context, and context compaction. The moat isn’t the model anymore. It’s the tooling around it.
"This time belongs to the builders, not the people who want to control them." — Sam Altman, on X
Set aside the pettiness, and this line captures a real strategic difference. OpenAI is building a platform for everyone. Anthropic is building a tool for professionals. Both are right for their respective users.
Where This Leaves Developers
For anyone building software, Feb 5 was a good day. Two frontier models launched within an hour, both pushing agentic coding forward. Competition is giving us:
- Better models faster: Terminal-Bench jumped from 64% to 77.3% in under two months
- Lower effective costs: OpenAI is running GPT-5.3-Codex 25% faster, which at the same token pricing means less time and compute per completed task
- Product innovation: Agent teams, interactive collaboration, computer use. The feature set expands weekly
The delegation era just got two competing visions of what delegation looks like. Pick the ecosystem that fits your workflow, not the benchmark that won this week.
GPT-5.3-Codex is a genuinely strong release. Terminal-Bench 2.0 at 77.3% is real. The bootstrapping story is compelling. The Codex app at 500K downloads shows product momentum. This model should be standing on its own two feet.
Instead, its own CEO turned its launch day into a discourse about whether he can handle a joke. The models will keep converging. The leaders won’t. And for developers, the models matter more than the egos behind them.


