Agent Architecture Is Now The Bottleneck, Not LLM Capability

#llm-agents #system-design #agent-memory #multi-agent #benchmarks

Last week, four papers dropped on arXiv within 72 hours of each other. Together they end the argument about where agent progress will come from for the next two years.

Nobody is saying models don't matter. Nobody is saying GPT-5.5 is not better than GPT-4o. But for any non-trivial agent task running longer than 10 steps, you will get larger performance improvements from good system design than from upgrading GPT-4o to GPT-5.5.

That is the core result nobody was expecting two years ago. That is the thing every production ML team is now waking up to.

The shift from model scaling to harness scaling

The first paper, From Model Scaling to System Scaling, formalizes this shift. It names the layer around the foundation model the agent harness: all the code that handles memory, context construction, tool routing, verification, loop orchestration and governance.
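To make the list of harness responsibilities concrete, here is a minimal sketch of my own. It is not code from the paper and not CheetahClaws. Every name in it (`Harness`, `build_context`, `stub_model`, the `tool:argument` action format) is a hypothetical choice for illustration, and the model is a stub function standing in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Harness:
    """Hypothetical minimal harness; all names are illustrative assumptions."""
    model: Callable[[str], str]             # context -> "tool:argument" or "final:answer"
    tools: Dict[str, Callable[[str], str]]  # tool routing table
    verify: Callable[[str], bool]           # verification layer
    max_steps: int = 10                     # governance: hard step budget
    memory: List[str] = field(default_factory=list)

    def build_context(self, task: str) -> str:
        # Context construction: the task plus only the five most recent memory entries.
        return "\n".join([f"TASK: {task}", *self.memory[-5:]])

    def run(self, task: str) -> str:
        # Orchestration loop: ask the model, route the action, record the result.
        for _ in range(self.max_steps):
            action = self.model(self.build_context(task))
            name, _, arg = action.partition(":")
            if name == "final":
                if self.verify(arg):
                    return arg
                self.memory.append(f"REJECTED: {arg}")
                continue
            tool = self.tools.get(name)
            result = tool(arg) if tool else f"unknown tool {name}"
            self.memory.append(f"{name}({arg}) -> {result}")
        raise RuntimeError("step budget exhausted")


def stub_model(context: str) -> str:
    # Stand-in for an LLM call: look something up once, then answer.
    if "lookup(" in context:
        return "final:42"
    return "lookup:answer"


harness = Harness(
    model=stub_model,
    tools={"lookup": lambda query: "42"},
    verify=lambda answer: answer.isdigit(),
)
print(harness.run("what is the answer?"))  # prints 42
```

Even at this size, the model call is one line. Everything else is harness, and each piece (how much memory goes into context, what happens to a rejected answer, when the loop gives up) is a design decision that changes outcomes without touching the model.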

Until six months ago, everyone treated this harness as disposable boilerplate. Benchmarks measured only final task success. Papers would test an agent, attribute all performance to the model, and never mention the 3,000 lines of orchestration code that did 90% of the actual work.

This was always a lie. Agent performance does not come from the model. It emerges from the interaction between model, memory substrate, context constructor, routing layer, orchestration loop and verification layer. Change any one of these components by 10% and you can change final task success by 40%.

The paper introduces CheetahClaws, an open reference harness implementation. When run with exactly the same GPT-4o base model, it outperformed Claude Code by 27% and OpenClaw by 39% on identical long-horizon tasks. No model changes. No fine-tuning. Only better harness design.

This is not a trick. This is the state of the field right now. You can beat every public agent demo today with GPT-4o and good system code.

Provenance collapse is the silent failure mode

Most agents still die on memory. Not running out of context window. Forgetting where facts came from.

The Mitigating Provenance-Role Collapse paper documents the single most common unreported failure mode in long-running agents. When you store memory as unstructured flat text, agents lose track of what they observed, what they inferred, what someone told them, and what they just guessed.

After 20 steps this stops being an edge case. It becomes the default state. Agents will confidently assert things they made up three steps earlier as hard observed fact. They will repeat mistakes indefinitely. They will treat their own prior output as ground truth input. Nobody benchmarks this. Nobody talks about it. Every production agent team has fought this bug.

The paper proposes MemIR, a typed memory representation that separates raw evidence, retrieval cues, and truth-bearing claims at write time. It does not use fancier embeddings. It does not use a better vector database. It adds structure to what gets stored.
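The idea of typing memory at write time can be sketched in a few lines. This is my own illustration and not MemIR's actual schema, which I am not reproducing here. The class names, field names, and the four `Role` values (taken from the observed / told / inferred / guessed distinction above) are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Role(Enum):
    OBSERVED = "observed"  # came directly from a tool result or the environment
    TOLD = "told"          # asserted by a user or another agent
    INFERRED = "inferred"  # derived by the model from other entries
    GUESSED = "guessed"    # unsupported assumption


@dataclass(frozen=True)
class MemoryRecord:
    evidence: str          # raw evidence, stored verbatim
    cues: Tuple[str, ...]  # retrieval cues, used for lookup only
    claim: str             # the truth-bearing claim
    role: Role             # provenance, fixed at write time
    step: int              # agent step at which the record was written


class TypedMemory:
    """Hypothetical store that keeps provenance attached to every claim."""

    def __init__(self) -> None:
        self._records: List[MemoryRecord] = []

    def write(self, record: MemoryRecord) -> None:
        self._records.append(record)

    def recall(self, cue: str, trusted_only: bool = False) -> List[MemoryRecord]:
        # Match on cues only; optionally keep just directly observed claims.
        hits = [r for r in self._records if cue in r.cues]
        if trusted_only:
            hits = [r for r in hits if r.role is Role.OBSERVED]
        return hits


memory = TypedMemory()
memory.write(MemoryRecord("HTTP 200 from /health", ("deploy",), "service is up", Role.OBSERVED, 3))
memory.write(MemoryRecord("", ("deploy",), "deploy probably finished", Role.GUESSED, 4))
print([r.claim for r in memory.recall("deploy", trusted_only=True)])  # prints ['service is up']
```

The point of the sketch is that the guess from step 4 can never come back later dressed as an observation, because its role was recorded when it was written and the record is immutable.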

On BEAM-100K tasks requiring source tracking over 100k tokens, MemIR improved success rate from 21% to 68% with the exact same base model. No other changes.

You do not need a 10 million token context window. You need to stop storing garbage in your memory.

When you actually need multi-agent systems

Right now, half the agent demos on Twitter are 12-agent swarms doing something a single agent could do with a three-line prompt.

This week Anthropic published production guidance on this exact topic. They have seen dozens of teams spend 3 months building elaborate multi-agent orchestration, only to find that a well-prompted single agent got identical or better results.

Multi-agent systems are not magic. They have very real coordination overhead. That overhead is worth paying only in three narrow cases:

  1. Context pollution would degrade a single agent. When running a subtask would leave toxic state in the conversation history that breaks later reasoning, spin up a clean subagent and throw it away when done.
  2. The task can be cleanly partitioned into independent parallel work. There is no gain from running things sequentially just to keep it all in one context.
  3. Specialization produces materially better tool selection. Different subagents can have entirely different tool sets and system prompts, with no cross-contamination.
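The first two cases can be sketched together. This is a rough illustration under my own assumptions, not Anthropic's recommended implementation: `Model` is a hypothetical callable that maps a message history to a reply, and `run_subagent` and `fan_out` are names I made up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Model = Callable[[List[str]], str]  # hypothetical: message history -> reply


def run_subagent(model: Model, system_prompt: str, subtask: str) -> str:
    # Case 1: the subagent starts from a fresh history, so nothing from the
    # parent conversation leaks in. The local history is dropped on return
    # and the parent sees only the reply.
    history = [system_prompt, subtask]
    return model(history)


def fan_out(model: Model, system_prompt: str, subtasks: List[str]) -> List[str]:
    # Case 2: independent subtasks run in parallel. pool.map returns results
    # in the same order as the input subtasks.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda s: run_subagent(model, system_prompt, s), subtasks))


def echo_model(history: List[str]) -> str:
    # Stand-in for an LLM call: uppercase the last message.
    return history[-1].upper()


print(fan_out(echo_model, "You summarize files.", ["a.py", "b.py"]))  # prints ['A.PY', 'B.PY']
```

Case 3 falls out of the same shape: give each call to `run_subagent` its own system prompt and its own tool set instead of sharing one.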

Outside these three cases you will always get worse performance and higher cost with multiple agents. There are no exceptions to this rule today.

Benchmarks are systematically overstating agent capability

All existing agent benchmarks are lying.

The Claw-Anything paper demonstrates this cleanly. Every popular agent benchmark gives the agent a clean isolated state, no irrelevant noise, no history, exactly the information required to complete the task. On these benchmarks GPT-5.5 scores 82% pass@1.

Claw-Anything runs the same tasks inside a simulated user environment with 3 months of prior activity history, irrelevant events, conflicting signals, and realistic state drift. On this benchmark GPT-5.5 scores 34.5% pass@1.
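The difference between the two setups is easy to picture in code. The sketch below is not the Claw-Anything pipeline. It is a toy of my own that takes the events a task actually needs and buries them in unrelated history while preserving their relative order. The function name and the event strings are hypothetical.

```python
import random
from typing import List


def with_distractors(task_events: List[str], distractors: List[str], seed: int = 0) -> List[str]:
    """Insert each distractor at a random position in the task history.

    Inserting one element at a time never reorders the elements already
    present, so the task events keep their original relative order.
    """
    rng = random.Random(seed)  # seeded so the noisy history is reproducible
    history = list(task_events)  # copy; the caller's list is left untouched
    for event in distractors:
        history.insert(rng.randint(0, len(history)), event)
    return history


clean = ["user asks to book flight", "user confirms dates"]
noise = ["newsletter received", "calendar reminder", "old booking cancelled"]
noisy = with_distractors(clean, noise, seed=7)
print(len(noisy))  # prints 5
```

A clean benchmark hands the agent `clean`. A realistic one hands it something like `noisy`, scaled up to months of history, and the agent has to work out which entries matter.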

That 47.5-point gap (82% versus 34.5%) is the difference between demo agents and agents that actually work for real users. Nobody is closing this gap with bigger models. This gap is entirely harness and memory design.

The paper also shows that training on realistic noisy environments improved base agent performance by 23.7%. This is the largest single improvement reported for general-purpose agents in the last 12 months. It did not come from scaling parameters. It came from better training data that matched actual deployment conditions.

Agency lives in the model, reliability lives in the harness

There is one point that almost everyone gets wrong.

Agency is not implemented in code. Agency is learned during model training. No amount of loop orchestration will ever make a model that cannot act, act. You cannot bolt agency onto a generic LLM.

But reliability, consistency, auditability, safety and long-horizon performance are almost entirely implemented in the harness. The model is the driver. The harness is the vehicle. A good driver will crash a bad car. A bad driver will not win a race even in the best car.

For the last ten years everyone only talked about the driver. We argued about who had the fastest driver. We measured driver lap times on empty test tracks.

Now we are starting to build actual roads. Now we are finding out that brakes, steering, suspension and seat belts matter a lot more than raw engine power.

This is not the end of model scaling. This is the point where agent engineering becomes a real engineering discipline. Most of the gains from here will not come from papers published by OpenAI and Anthropic. They will come from ordinary engineers building good systems.

That is the good part. Most of us can work on this.

References

  1. From Model Scaling to System Scaling: Scaling the Harness in Agentic AI http://arxiv.org/abs/2605.26112v1
  2. Claw-Anything: Benchmarking Always-On Personal Assistants http://arxiv.org/abs/2605.26086v1
  3. Mitigating Provenance-Role Collapse in Long-Term Agents http://arxiv.org/abs/2605.25869v1
  4. When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills http://arxiv.org/abs/2605.25832v1
  5. Building multi-agent systems: when and how to use them https://claude.com/blog/building-multi-agent-systems-when-and-how-to-use-them