Appearance
We stopped building better models six months ago. Nobody cares about the next 1% on MMLU. Every serious applied ML team is building agents.
This is not hype. This is where the work is happening right now. It is also where almost everyone is making exactly the same mistakes, wasting exactly the same money, and repeating exactly the same arguments that were settled in distributed systems engineering twenty years ago.
This article will not tell you agents are the future. It will tell you what works today, what doesn't, what costs too much, and which architectural choices you will regret three months from now.
We stopped building models. We started building agents.
If you have only seen agent demos you have not seen agents.
A demo agent runs one task, once, for 90 seconds, on a cherry picked problem. A production agent runs 1200 tasks an hour, retries failed steps 3.7 times on average, burns 27x more tokens than the original budget estimate, and will quietly drift into producing garbage output for three days before anyone notices.
73% of enterprises report AI costs exceeded original projections. Gartner puts the agent loop token multiplier between 5x and 30x per completed task. That is not an edge case. That is the default outcome for teams that follow the advice you see on Twitter.
The good news is we now have actual data, actual research, and actual production systems that avoid this. None of them look like the demo architecture.
Delegation is not task splitting. It is a learned capability.
The standard multi-agent pitch goes like this: give a main agent tools to spawn subagents, it will decompose tasks automatically. This does not work.
Any competent model can split a task into bullet points. Almost no model can decide when to delegate, what information to pass down, what to ask for back, and how to integrate partial results without losing state. This is delegation intelligence, and it is almost entirely absent from base models.
Natural text contains almost no training examples of good delegation. Humans do not write down the internal decision process of when to ask a colleague for help. They just do it.
SearchSwarm solves this by building a guided execution harness that produces correct delegation trajectories, then fine tuning on those trajectories. The resulting 30B model achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results at that parameter scale.
This is the pattern that will replace raw prompting for agent capabilities. You do not prompt a model to be good at delegation. You generate good execution traces, then bake that behaviour into the weights. There is no shortcut.
Web agents do not train on demos. They train on real environments.
Web agent research has been stuck for three years because everyone trained on synthetic DOM datasets. AliyunConsoleAgent broke this trend by training entirely inside the live production Alibaba Cloud console.
Cloud console UI verification is a 4 million inspection per year problem with less than 1% manual coverage. Frontier proprietary models hit 65.34% success rate on this benchmark, but cost too much to run at scale.
AliyunConsoleAgent uses a two stage pipeline:
- Supervised fine tuning on trajectories distilled from the frontier model
- Reinforcement learning with GRPO running directly against live cloud environments, with reward signals pulled from backend audit logs not UI state.
The final 32B model hits 63.52% success rate. That is 1.82 percentage points behind the best frontier model, at 92% lower inference cost.
The lesson here is unambiguous. If you want an agent that works in the real world you train it in the real world. Synthetic datasets will always leave a permanent performance gap.
Skills are not prompts. They are cost engineering.
Everyone treats agent skills as prompt compression. Make the skill shorter, save tokens. This is backwards.
A recent study of skill rewriting found that shorter skills very often increase total execution cost. Removing redundant looking anchor comments, validation checks and failure recovery steps makes the agent wander, retry, and burn far more tokens exploring paths that the original skill would have avoided.
There is no universal optimal skill template. API anchoring, workflow guarding, and formula anchoring all produce different cost and quality tradeoffs for different task families. The best learned policy reduced total task cost by 14.7% on cross model transfer while maintaining output quality.
Skill design is not prompt engineering. It is operational knowledge engineering with an economic tradeoff at every line. You do not optimize for the shortest skill. You optimize for the lowest total cost of the entire execution path.
The lightest first rule for agent extensions.
If you build agents with Claude Code you have stared at the same choice: should this be a Skill, an MCP server, a Plugin, or just a CLI call?
Almost everyone reaches for the heaviest tool first. This is the single largest avoidable source of token waste in production agent systems today.
These are not competing options. They are nested layers, ordered by context cost:
- Skill: Static procedure and knowledge. Only the name and description load at startup, tens of tokens each. Body loads on demand.
- CLI call: Executed only when used. Zero startup overhead.
- MCP server: Full tool schema loads at startup and is carried on every single turn. Roughly 700 tokens per exposed tool, regardless of whether the tool is ever used.
- Plugin: Distribution wrapper for all of the above.
One public measurement found connected MCP servers consuming 49% of a 200k context window before the user typed a single message. A single idle GitHub MCP server adds 18k tokens of permanent overhead to every turn.
The rule is simple: reach for the lightest thing that works. Stop at Skill if a Skill is enough. Only move down the stack when you have proven the lighter layer cannot do the job.
The loop is not the product.
You have seen the tweet: you should not be prompting agents. You should be designing loops that prompt your agents.
This is good advice that everyone misinterprets.
A loop is not an upgrade. A loop is an amplifier. It will amplify good output. It will amplify bad output. It will amplify token burn. And it will do all of this completely silently until the invoice arrives.
Building an agent loop without guardrails is the equivalent of giving an intern root access and a corporate credit card then going on holiday.
For a loop to be production safe it must have three non negotiable properties, all implemented outside the LLM:
- A hard maximum turn count and token ceiling that kills execution unconditionally
- Deterministic verification gates. The agent never gets to decide when it is done.
- A structured state ledger that logs every turn, not just raw chat history.
The LLM is never the orchestrator. The deterministic finite state machine is the orchestrator. The LLM is a stateless utility function you call inside strict boundaries. This is the architectural inversion 90% of teams never make.
Building block composition is the real agent superpower.
Agents are bad at building things from scratch. They are extremely good at gluing together existing working components.
This is not a limitation. This is their entire practical value.
Hugging Face Spaces now expose an agents.md endpoint for every Gradio space. This is a plain text document that tells an agent exactly how to call the space, upload files, poll for results, and handle errors. No SDK required. No custom integration.
A coding agent recently built a complete interactive 3D gallery of Paris monuments by chaining exactly two public spaces: Ideogram 4 for reference images, TripoSplat for 3D reconstruction. It then wrote the viewer, compressed the assets, and deployed the final result as a static space. No human wrote any code. No human touched any model.
The marginal cost of a new working application is now approaching the cost of describing it. This is the actual revolution that no one is talking about.
Security and audit for untrusted execution.
Agents run code. Code does bad things.
Anthropic's official security review action is the first production grade agent security tool that behaves correctly. It does not run inside the agent loop. It runs outside, on the diff output, after execution. It never gives the agent authority over its own audit.
It will miss vulnerabilities. It will produce false positives. But it has one critical property that almost every other agent security tool lacks: it cannot be prompted or tricked by the code it is auditing.
This is the only safe pattern. Audit always runs at a higher privilege level than the code being audited. Audit is never implemented as an agent capability.
What production agent systems actually look like right now.
ARIS is the most mature open source deep research agent in use today. It is also the most boring.
You will not find a cool architecture diagram on its readme. You will not find claims of AGI. You will find twelve consecutive patch releases fixing SSE parsing edge cases, secret redaction bugs, idle timeouts, credential leakage, process cleanup and MCP correlation errors.
That is what a production agent system looks like. 5% interesting capability work. 95% boring operational error handling that no one will ever tweet about.
ARIS also follows every rule outlined in this article. It has deterministic guardrails. It uses lightest first extension. It logs structured state. It never lets the agent decide when to stop. It costs 1/10th what comparable agent systems cost to run.
The quiet unglamorous future.
Agent systems will not look like science fiction. They will look like cron jobs. They will look like Airflow pipelines. They will look like all the boring reliable infrastructure we have been building for the last forty years, with one small difference: inside the loop, there is an LLM.
All the hard problems are solved. None of them are reasoning problems. They are cost problems. They are audit problems. They are boundary problems. They are drift problems. They are the exact same problems every distributed system has always had.
If you are building agents right now, stop optimizing for the demo. Stop chasing better reasoning. Start building boring things. Build guardrails. Build ledgers. Build cost controls. Build failure modes that fail loudly.
That is the work that matters. That is the work that will still be running five years from now.