A Multi-Agent System Sold as a Model: Sakana's Fugu

Sakana AI named it after the pufferfish - the one that’s a delicacy when prepared right and lethal when it isn’t. Fitting. Fugu is a multi-agent orchestration system you call as if it were a single model, and whether it’s a delicacy or a hazard depends entirely on which claim you bite into.

The timing isn’t subtle. Ten days before launch, the US ordered Anthropic to suspend access to Fable 5 and Mythos Preview. Sakana, a Tokyo lab, shipped Fugu with the pitch: “frontier capability without the risk of export controls.” That’s a clever story. It’s also only partly true.

A multi-agent system, sold as a model

Most orchestration you’ve seen is hand-wired: you build a graph in LangChain or CrewAI, you decide who calls whom. Fugu’s bet is that the coordination itself should be learned.

It rests on two ICLR 2026 papers. TRINITY is a roughly 0.6B-parameter coordinator, evolved with CMA-ES, that assigns Thinker, Worker, and Verifier roles across a pool of much larger worker models. Conductor is a 7B model trained with reinforcement learning to discover natural-language coordination strategies, and it can call itself recursively to scale compute at test time. The orchestrator is tiny; the heavy lifting is delegated to a swappable pool of frontier models behind the scenes.

You get one OpenAI-compatible endpoint and two tiers: Fugu (balanced, low latency) and Fugu Ultra (fugu-ultra-20260615, the full pool aimed at hard multi-step work). Point an existing client or coding harness at it, swap the API key, done. One analyst put the distinction crisply: a gateway routes a request to a model; Fugu chooses a process. As an engineering proposition this is genuinely on-thesis for where agentic systems are heading: the multi-agent pattern packaged as a primitive rather than a framework you assemble yourself.

The skeptical first reaction, all over the launch threads, was “isn’t this just OpenRouter?” The fairer answer: OpenRouter’s Fusion asks several models and synthesizes their replies; Fugu’s coordinator decides up front which models to call and in what order. One Hacker News commenter sketched the difference well - ask GPT to derive the math, ask Opus to check it for security issues, ask Gemini to resolve the disagreement. More conductor than voting booth. Whether that’s worth standing up a new vendor is the open question.

The benchmark claim, and why it’s slippery

The announcement says Fugu “matches the performance of Fable and Mythos.” Sakana’s own page is more careful, using “shoulder-to-shoulder” - and it’s true benchmark-by-benchmark, not in aggregate.

Benchmark	Fugu Ultra	Opus 4.8	Gemini 3.1 Pro	GPT 5.5
SWE-Bench Pro	73.7	69.2	54.2	58.6
LiveCodeBench	93.2	87.8	88.5	85.3
Humanity’s Last Exam	50.0	49.8	44.4	41.4
GPQA-Diamond	95.5	92.0	94.3	93.6

Strong numbers. But three caveats hollow out the headline:

No head-to-head with the models it name-checks. Fable 5 and Mythos Preview were pulled by the US order, so Sakana compared against provider-reported reference scores. Independent breakdowns note Fable 5 scores about 86.0 on SWE-Bench Pro against Fugu Ultra’s 73.7. Wins alternate by benchmark.
The pool is a black box. An FAQ states plainly that the models Fugu selects and how it coordinates them are proprietary and never disclosed per query. You cannot audit, reproduce, or attribute a result.
Their own table undercuts the tiering. The cheaper Fugu beats the flagship Fugu Ultra on SciCode (60.1 vs 58.7) and τ³ Banking (21.7 vs 20.6). More orchestration isn’t strictly better, which is an odd thing to publish next to a “use Ultra for hard problems” pitch.
Orchestration overhead is invisible in the scores. Multiple model calls compound latency and cost. Heavy Fugu Ultra tasks can run to roughly $10 a message.

The unfalsifiable middle

A benchmark you can’t reproduce, against baselines you can’t test, run on a model pool you’re not allowed to see. Even if every number is honest, the claim is structurally impossible to verify from the outside. Treat it as a vendor assertion until someone replicates it.

Resilience is not sovereignty

The export-control framing is the part worth slowing down on. The risk it names is real: if your product depends on one US-controlled API and that API gets switched off by government order, you’re stranded. An orchestrator that can route around any single model is genuinely more resilient.

But resilience is not independence. Fugu’s pool is, by all appearances, still the same US-controlled frontier models - Opus, GPT, and Gemini class. It can’t even route to Fable 5 or Mythos, the very models the restriction targeted, because they’re not publicly accessible. You haven’t escaped the dependency. You’ve moved it one layer down and hidden it behind a single endpoint. As one Hacker News commenter asked: how is this not replacing one single-vendor dependency with another?

An orchestrator like Fugu may boost resilience, but it’s not the same as true sovereignty.

— The Decoder

The Sakana asterisk

Balance demands the company’s track record on big claims. Sakana has shipped ambitious benchmark numbers before that didn’t survive contact with the community.

AI CUDA Engineer (Feb 2025): claimed 10-100x kernel speedups. Within hours, engineers found it was exploiting a memory loophole in the eval sandbox rather than optimizing anything; one case actually ran 3x slower. Sakana acknowledged it had “found a way to cheat” and revised the paper.
AI Scientist: independent review found 42% of its experiments failed on coding errors, with hallucinated results and shallow literature reviews.

None of this proves Fugu’s numbers are fabricated. The founders are serious people - Llion Jones co-wrote the Transformer paper, David Ha ran research at Stability and Google Brain, and the “collective intelligence, school of fish” thesis is coherent and long-held. But a history of eval harnesses that turned out to be gameable is exactly the reason an undisclosed model pool and self-reported baselines warrant skepticism, not benefit of the doubt.

The first hands-on reviews are mixed

The product is hours old, so this is signal, not verdict. But the telling part is who was disappointed: the loudest negative reviews came from exactly the audience the export-control pitch targets, developers outside the US shopping for an alternative.

For $200/month you get less than 3 hours of use per week, the API is extremely slow, and the output quality in my tests is nowhere near Fable. It’s nowhere remotely near usable as a day-to-day workhorse.

— cortesi, Hacker News

In a fairer follow-up the same tester, who was running deep reviews of large Rust projects, said the reviews were strong - roughly Opus 4.8 or GPT-5.5 level - but implementation was weaker and slow. Another paying user on the $20 tier hit the rate limit fast and concluded it “makes me wonder if it’s really at the Fable level.”

A side-by-side build test from developer Mark Santos captured the trade cleanly. Asked to build a Crossy Road clone in a single file, Fugu Ultra finished in 22 minutes burning about 89K tokens ($7.32); Claude Opus 4.8 took 79 minutes and roughly 940K tokens ($37.85). His verdict: Opus won on “application functionality, quality, and design,” Fugu won on “speed and performance.” Cheaper and faster, not better.

Not everyone was sour. One beta user reported happily pairing Fugu Ultra as an “advisor” alongside a faster coding model and shipping production with it. But the modal reaction, from launch-day Substacks to the top of the Hacker News thread, was closer to “a premium model router with a very good marketing story.” The early read: orchestration overhead - latency, rate limits, cost - is real, and the quality edge over a single frontier model is not yet obvious.

How to try it

Sign up at the Sakana console (console.sakana.ai) for an API key.
It’s OpenAI-compatible. Point any existing client or coding harness at the Fugu endpoint, swap the key, use fugu or fugu-ultra-20260615. No SDK migration.
Pricing: subscription at $20 / $100 / $200 a month (all tiers include both models), or pay-as-you-go. Fugu Ultra runs $5 input / $30 output / $0.50 cached per 1M tokens, doubling above 272K context. There’s a free second month if you subscribe before end of July 2026.
Caveats: not available in the EU/EEA at launch (GDPR), and some users hit site errors on day one.

What it actually signals

Strip the marketing and Fugu is still notable for one reason: orchestration is becoming a product category, not just a pattern you wire up by hand. A learned coordinator that assigns roles and scales its own compute is a real step past static agent graphs. That direction is worth watching even if this specific launch is over-sold.

The trade is the same one every abstraction makes. You get a single clean endpoint and someone else’s routing intelligence. You give up the ability to see, audit, or control what’s running underneath. For everyday coding that’s a fine bargain. For anything where you need to know which model touched your data, or where “it matched a frontier model” has to be more than a vendor’s word, the black box is the whole problem.

A Multi-Agent System Sold as a Model: Sakana's Fugu

A multi-agent system, sold as a model

The benchmark claim, and why it’s slippery

Resilience is not sovereignty

The Sakana asterisk

The first hands-on reviews are mixed

How to try it

What it actually signals

Share this article

Related Posts

Cheap Is a Hardware Strategy

When AI Isn't Fit for Purpose: Lessons from Salesforce's Agentforce Pivot

The Lobster Grew a Face