Token waste is killing your agents: engineering the scaffolding that actually matters

You asked your coding assistant to rename a function. It sent 8,400 tokens to do it. Someone checked.

That one data point from a developer who instrumented their Cursor session tells you everything about why agent systems bleed money in production. The model didn't need 8,400 tokens. The scaffolding around it decided that was an acceptable amount of context to ship for a trivial refactoring task. And nobody noticed until someone looked.

This is the real engineering problem in agentic systems right now. Not which model to pick. Not whether MCP is the right protocol. The problem is that the infrastructure between your user's request and the model's output is wasteful by default, and nobody builds in the discipline to fix it until the bill arrives.

The 8,400-token rename

The Cursor token audit is worth reading in full. The author wanted to rename a single function. The AI coding assistant shipped roughly 200 lines of context to accomplish what amounts to a find-and-replace operation. The breakdown is instructive: file contents, surrounding context, tool call overhead, formatting. Each piece feels reasonable in isolation. Together, they're absurd.

This connects directly to what Nicolas Frankel calls "Tokensparsamkeit" in his piece on coding assistants. The German word evokes thrift, and his argument is simple: more data does not automatically produce better decisions. Most coding assistants operate on the assumption that dumping maximum context into the prompt yields better results. In practice, you get noise, higher latency, and a bill that scales with your verbosity.

Frankel's point has a sharp edge for agent builders. If your agent's default behavior is to include entire files, full conversation history, and every available tool description in every request, you're not being thorough. You're being lazy about context management. The model doesn't need your entire codebase to rename a function. It needs the symbol, its references, and the surrounding scope. Maybe 400 tokens, not 8,400.
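
To make "the symbol, its references, and the surrounding scope" concrete, here is a minimal sketch of a context selector for a rename. The function names, the one-line window, and the four-characters-per-token estimate are all assumptions for illustration, not how Cursor or any other assistant builds its prompt.

```python
import re


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return (len(text) + 3) // 4


def rename_context(source: str, symbol: str, window: int = 1) -> str:
    """Return only the lines that mention `symbol`, plus `window` lines around each."""
    lines = source.splitlines()
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    keep = set()
    for i, line in enumerate(lines):
        if pattern.search(line):
            start = max(0, i - window)
            stop = min(len(lines), i + window + 1)
            keep.update(range(start, stop))
    # Prefix 1-based line numbers so an edit can be mapped back to the file.
    return "\n".join(f"{i + 1}: {lines[i]}" for i in sorted(keep))
```

Calling `estimate_tokens(rename_context(source, "old_name"))` before the request goes out gives you a number to compare against what your assistant actually sends.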

Small models, big harness

The most compelling agent engineering work right now isn't happening with frontier models. It's happening with 4B parameter models that are forced to work within tight constraints.

SmallCode is a coding agent built specifically for small local models like Gemma and Qwen. The author was frustrated that every existing coding agent (OpenCode, Cursor, Claude Code) assumes you're running a massive cloud model. Try them with a local 4B model and tool calls fail, context overflows, and multi-step tasks collapse.

The reported result: 87/100 on benchmark tasks with a Gemma model that activates only 4B parameters per token. For comparison, OpenCode reportedly scores around 75% with 14B models. The harness does the heavy lifting.

The techniques that make this work are worth enumerating because they're generalizable to any agent system under token or compute constraints:

Compound tools. Instead of making the model chain four tool calls (find file, read file, edit file, verify), SmallCode gives it one tool that does all four. Small models lose coherence after three or more sequential calls, and the author reports that collapsing the chain cuts failures in half. The insight: each tool call is a chance for the model to hallucinate a parameter, misinterpret a result, or drift off task. Reducing the number of decision points reduces the failure surface.
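
A compound tool can be sketched in a few lines. This is an illustration of the idea, not SmallCode's implementation: the `edit_file` name, its parameters, and the "exactly one match" rule are assumptions.

```python
from pathlib import Path


def edit_file(root: str, filename: str, old: str, new: str) -> dict:
    """Find, read, edit, and verify in one call (hypothetical compound tool)."""
    matches = sorted(Path(root).rglob(filename))  # 1. find
    if not matches:
        return {"ok": False, "error": f"no file named {filename}"}
    path = matches[0]
    text = path.read_text()  # 2. read
    count = text.count(old)
    if count != 1:
        # Refuse ambiguous edits instead of guessing which occurrence was meant.
        return {"ok": False, "error": f"expected 1 match, found {count}"}
    path.write_text(text.replace(old, new))  # 3. edit
    verified = new in path.read_text()  # 4. verify
    return {"ok": verified, "path": str(path)}
```

The model makes one decision (what to replace with what) instead of four, and every failure comes back as a short, structured error it can act on.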

Improvement loop. Every time the model writes code, SmallCode compiles and lints it immediately. If it fails, the errors feed back automatically. The model doesn't need to be smart enough to get it right on the first try. It just needs to fix errors when shown them. This is a fundamentally different design philosophy: optimize for recovery, not perfection.
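
The loop itself is small. In this sketch the checker is Python's built-in `compile()`, standing in for whatever compiler and linter the real harness runs, and `generate` is a hypothetical callable wrapping the model.

```python
from typing import Callable, Optional


def check_python(code: str) -> Optional[str]:
    """Return None if the code compiles, else a short error message."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


def improvement_loop(
    generate: Callable[[str], str], task: str, max_rounds: int = 3
) -> Optional[str]:
    """Ask for code, check it, and feed errors back until it passes or rounds run out."""
    prompt = task
    for _ in range(max_rounds):
        code = generate(prompt)
        error = check_python(code)
        if error is None:
            return code
        # Feed back only the error, not the whole history, to keep the prompt small.
        prompt = f"{task}\n\nYour last attempt failed:\n{error}\nFix it."
    return None
```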

Decompose on failure. If the model fails the same task twice, SmallCode stops retrying and breaks the problem into smaller pieces. "Fix this 200-line file" becomes "fix line 45 only." This prevents the death spiral where a model keeps making the same mistake with slightly different wording.
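
One way to express "stop retrying, then split" is a small recursive function. The task shape here (a line range) and the halving strategy are assumptions chosen to keep the sketch short; `attempt` is a hypothetical callable that returns None on failure.

```python
from typing import Callable, List, Optional, Tuple

Task = Tuple[int, int]  # hypothetical task shape: (first_line, last_line)


def split_range(task: Task) -> List[Task]:
    """Halve a line range; a single line cannot be split further."""
    first, last = task
    if first >= last:
        return [task]
    mid = (first + last) // 2
    return [(first, mid), (mid + 1, last)]


def solve(
    attempt: Callable[[Task], Optional[str]], task: Task, max_tries: int = 2
) -> Optional[List[str]]:
    """Try a task up to `max_tries` times, then decompose instead of retrying."""
    for _ in range(max_tries):
        result = attempt(task)
        if result is not None:
            return [result]
    parts = split_range(task)
    if len(parts) == 1:
        return None  # nothing left to split
    fixes: List[str] = []
    for part in parts:
        sub = solve(attempt, part, max_tries)
        if sub is None:
            return None
        fixes.extend(sub)
    return fixes
```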

Escalation. If decomposition also fails and the user has a Claude or OpenAI key configured, SmallCode auto-escalates to the bigger model for just that one task. You stay local 95% of the time, cloud 5%. This hybrid approach is pragmatic and probably the right architecture for most production agents.
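
Escalation reduces to an ordered list of strategies, cheapest first. The stage names and callables below are placeholders, not SmallCode's API. Each stage returns None when it gives up.

```python
from typing import Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[str], Optional[str]]]


def run_ladder(task: str, stages: List[Stage]) -> Tuple[str, Optional[str]]:
    """Try each stage in order and report which one produced the answer."""
    for name, stage in stages:
        result = stage(task)
        if result is not None:
            return name, result
    return "failed", None


def build_stages(local, decomposed, cloud=None) -> List[Stage]:
    """Only add the cloud stage when the user has configured a key (assumption)."""
    stages: List[Stage] = [("local", local), ("decomposed", decomposed)]
    if cloud is not None:
        stages.append(("cloud", cloud))
    return stages
```

Logging the stage name for every task is what lets you check whether you really are local 95% of the time.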

Token budgeting. Small models have 32k to 256k context windows. SmallCode never dumps a whole file into context. It summarizes, truncates, and manages every token so the model never sees truncation markers in the middle of important code. This is the Tokensparsamkeit principle applied at the system level.
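
A budget is easiest to enforce if you only ever cut at line boundaries. The sketch below assumes a rough four-characters-per-token estimate and a caller-supplied priority for each chunk; a real harness would use the model's tokenizer and smarter summarization.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token, rounded up.
    return (len(text) + 3) // 4


def fit_to_budget(chunks: List[Tuple[int, str]], budget: int) -> str:
    """Pack chunks by priority (lower number first), keeping whole lines only."""
    remaining = budget
    chosen: List[str] = []
    for _, text in sorted(chunks, key=lambda chunk: chunk[0]):
        kept: List[str] = []
        for line in text.splitlines():
            cost = estimate_tokens(line + "\n")
            if cost > remaining:
                break  # stop at a line boundary, never mid-line
            kept.append(line)
            remaining -= cost
        if kept:
            chosen.append("\n".join(kept))
    return "\n".join(chosen)
```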

Code graph. Instead of grep-searching the codebase, SmallCode indexes code into a symbol graph: functions, classes, call relationships. When you ask "how does auth work," it walks the graph and returns just the relevant connected code, not 15 random files that happen to contain the word "auth."

Each of these techniques trades model intelligence for system intelligence. And they work.

The 80% cost cut

The trading agent architecture post demonstrates similar principles in a different domain. The author built an autonomous AI trading agent and found that extra services were burning tokens on tasks that didn't require model inference.

The architecture that cut costs by 80% follows a pattern: keep the LLM out of decisions it doesn't need to make. Pre-process data with deterministic code. Only invoke the model for genuine judgment calls. Cache aggressively. Structure your tool responses to be minimal rather than comprehensive.

This sounds obvious. In practice, most agent architectures default to routing everything through the model because it's simpler to build. You don't have to think about what requires reasoning and what doesn't. You just throw it all at the LLM and let it sort it out. The cost shows up later.

The trading agent architecture likely implements something like: deterministic data fetching and normalization, rule-based pre-filtering, and then model invocation only for the actual trading decision. Each layer removes tokens from the model's input. Each layer also removes a failure mode.
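
As a hedged illustration of that layering (the post's real code is not shown here, and the 2% threshold, data shape, and `ask_model` callable are invented for the example):

```python
from typing import Callable, Dict, List


def decide(
    ticks: Dict[str, List[float]],
    ask_model: Callable[[str], str],
    min_move: float = 0.02,
) -> Dict[str, str]:
    """Filter with deterministic rules first; call the model only for real candidates."""
    decisions: Dict[str, str] = {}
    for symbol, prices in ticks.items():
        if len(prices) < 2 or prices[0] <= 0:
            decisions[symbol] = "hold"  # bad or missing data: no model call
            continue
        move = (prices[-1] - prices[0]) / prices[0]
        if abs(move) < min_move:
            decisions[symbol] = "hold"  # rule-based filter: no model call
            continue
        # Send a one-line summary, not the raw price series.
        decisions[symbol] = ask_model(f"{symbol} moved {move:+.1%}. buy, sell, or hold?")
    return decisions
```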

MCP bloat and the code mode fix

Model Context Protocol was supposed to standardize how agents access external tools and data. In practice, it's created a new source of bloat.

The ZenStack post on "saving bloated MCP with code mode" identifies the core tension. MCP servers expose tools with verbose schemas. Each tool description, each parameter specification, each example gets included in the prompt whether the model needs it for this particular request or not. If your MCP server exposes 30 tools, the model sees 30 tool descriptions on every request, even if it will only use one.

Code mode is the proposed fix. Instead of exposing every tool through MCP's schema-driven interface, you give the agent a code execution environment where it can call functions directly. The model writes or invokes code that calls your APIs. The tool descriptions shrink to a single "you can execute code" instruction. The actual API details live in the code, not in the prompt.
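
A back-of-the-envelope comparison shows the scale of the difference. The tool definitions below are made up, shaped loosely like MCP tool entries (name, description, input schema), and the token estimate is the usual rough four-characters-per-token assumption.

```python
import json
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token, rounded up.
    return (len(text) + 3) // 4


def make_tool(name: str) -> dict:
    """Build one hypothetical tool definition with a single string parameter."""
    return {
        "name": name,
        "description": f"Call the {name} endpoint and return the matching record.",
        "inputSchema": {
            "type": "object",
            "properties": {"id": {"type": "string", "description": "Record identifier"}},
            "required": ["id"],
        },
    }


def schema_cost(tools: List[dict]) -> int:
    """Estimate the prompt tokens spent on tool definitions alone."""
    return estimate_tokens(json.dumps(tools))


schema_mode = [make_tool(f"tool_{i}") for i in range(30)]
code_mode = [{
    "name": "run_code",
    "description": "Execute Python. Backend functions are importable from the `api` module.",
    "inputSchema": {
        "type": "object",
        "properties": {"code": {"type": "string"}},
        "required": ["code"],
    },
}]

print(schema_cost(schema_mode), "vs", schema_cost(code_mode))
```

The schema-mode cost is paid on every request; the code-mode cost stays flat no matter how many functions sit behind `api`.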

This is a tradeoff. You lose the structured schema validation that MCP provides. You gain a massive reduction in prompt tokens. For agents that need to work with many tools, this can be the difference between a prompt that fits in context and one that doesn't.

The deeper lesson: protocol design affects token economics. A protocol that's elegant from a software engineering perspective can be pathological from a token budget perspective. The article's title asks whether MCP is dead. It isn't, but it needs to be used with the same thrift you'd apply to any other context source.

Engineering agent memory

Ken Walger's piece on engineering agent memory bridges two worlds that most practitioners treat separately: the stateless world of LLM inference and the stateful world of production systems.

LLMs are stateless. Every request is independent. Agents need memory: conversation history, task state, learned preferences, and accumulated context. How you bridge that gap determines whether your agent is useful beyond a single interaction or degrades into amnesia after a few turns.

The key distinction is between different memory types with different persistence and retrieval characteristics. Working memory (the current conversation context) has different requirements than episodic memory (what happened in past sessions) or semantic memory (generalized knowledge the agent has accumulated).

For token optimization, the critical question is: what goes into working memory on each request? If you stuff the entire conversation history into every prompt, you're burning tokens linearly with conversation length. Summarization, retrieval-augmented selection, and structured state representations all reduce the per-request token cost at the expense of some information loss.
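
The simplest version of that tradeoff keeps the last few turns verbatim and folds everything older into one summary line. The default summarizer below is a crude placeholder (it truncates each turn); in practice you would plug in a model call or a structured state object.

```python
from typing import Callable, List, Optional


def build_working_memory(
    turns: List[str],
    keep_recent: int = 4,
    summarize: Optional[Callable[[List[str]], str]] = None,
) -> List[str]:
    """Keep the last `keep_recent` turns verbatim; fold older ones into one summary."""
    cut = len(turns) - keep_recent
    if cut <= 0:
        return list(turns)
    older, recent = turns[:cut], turns[cut:]
    if summarize is None:
        # Placeholder summarizer (assumption): first 40 characters of each old turn.
        summarize = lambda old: "Earlier: " + "; ".join(t[:40] for t in old)
    return [summarize(older)] + recent
```

With this shape, per-request cost is bounded by the summary plus `keep_recent` turns rather than growing with the whole conversation, provided the summarizer itself stays short.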

The engineering challenge is deciding what to lose. A coding agent that forgets the file it was editing two turns ago is broken. A coding agent that includes the full contents of every file it has ever touched is also broken, just in a different way. The right answer depends on the task, and building the logic to make that decision is real engineering work.

Infrastructure: Lambda, file systems, and agents

AWS Lambda now has a persistent file system (EFS), and someone immediately put AI agents on it. This is more interesting than it sounds.

Serverless agent execution has been a hard problem because agents need state. They write intermediate results, cache tool responses, and maintain working files. Lambda's ephemeral filesystem meant you had to externalize all of that to S3 or a database, adding latency and complexity to every step. With EFS, a Lambda-based agent can read and write files as if it were a local process.

The practical implication: you can run agent loops on Lambda without the state management overhead that previously made it painful. An S3 event fires, your Lambda wakes up, and the agent picks up the working directory left by earlier invocations. This changes the economics of running agents at scale: Lambda bills per request and per unit of execution time, so you pay nothing while the agent sits idle between invocations. Time spent waiting on a model response inside an invocation is still billed.
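
A minimal handler sketch, assuming the function has an EFS access point mounted and that the mount path is passed in through a hypothetical `AGENT_WORKDIR` environment variable (Lambda requires EFS mount paths to live under `/mnt/`). The state file layout is invented for the example.

```python
import json
import os
from pathlib import Path


def handler(event, context):
    """Append the triggering S3 key to a state file that survives across invocations."""
    workdir = Path(os.environ.get("AGENT_WORKDIR", "/mnt/agent"))
    workdir.mkdir(parents=True, exist_ok=True)
    state_file = workdir / "state.json"

    # Load whatever earlier invocations left behind.
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        state = {"steps": []}

    # Standard S3 event notification shape.
    key = event["Records"][0]["s3"]["object"]["key"]
    state["steps"].append(key)
    state_file.write_text(json.dumps(state))
    return {"steps_so_far": len(state["steps"])}
```

This sketch ignores concurrency. If two invocations can touch the same working directory at once, you would need file locking or one directory per task.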

Combined with the token optimization patterns from SmallCode and the trading agent architecture, you can build agent systems that are cheap at every layer: small models for most tasks, minimal token budgets, serverless execution, and persistent storage only where needed.

Observability: breaking into the black box

One post stands out for a different reason. The author got tired of AI black boxes and built one you can break into. Dead Star AI is a human-in-the-loop reasoning engine built on Gemma 4 that exposes its internal reasoning process.

This matters for token optimization because you can't optimize what you can't see. If your agent sends 8,400 tokens for a rename and you never instrument the request, you never know. Observability into agent reasoning isn't just about trust or safety. It's about cost engineering. You need to know what context the model actually used, what it ignored, and what was redundant.
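
Instrumenting this does not require a tracing platform. A ledger that records the estimated size of each prompt component is enough to find the 8,400-token rename. The class and part names are illustrative, and the estimate is the rough four-characters-per-token assumption; swap in your provider's reported usage where you have it.

```python
from collections import Counter
from typing import Dict, List, Tuple


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token, rounded up.
    return (len(text) + 3) // 4


class TokenLedger:
    """Accumulate estimated token counts per prompt component across requests."""

    def __init__(self) -> None:
        self.totals: Counter = Counter()

    def record(self, parts: Dict[str, str]) -> int:
        """Record one request and return its estimated total tokens."""
        request_total = 0
        for name, text in parts.items():
            tokens = estimate_tokens(text)
            self.totals[name] += tokens
            request_total += tokens
        return request_total

    def report(self) -> List[Tuple[str, int, float]]:
        """Return (part, tokens, share of total) rows, largest first."""
        grand_total = sum(self.totals.values()) or 1
        return [(name, tok, tok / grand_total) for name, tok in self.totals.most_common()]
```

Call `record({"system": ..., "tools": ..., "files": ..., "history": ..., "user": ...})` wherever you assemble the prompt, and read `report()` at the end of a session to see which component dominates.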

The human-in-the-loop aspect also connects back to SmallCode's escalation pattern. When you can see the model's reasoning, you can make better decisions about when to intervene, when to let it retry, and when to escalate to a larger model. Blind automation is expensive. Informed automation is cheap.

Low-level: llama.cpp MTP logits optimization

At the lowest level, a recent llama.cpp pull request avoids copying logits during prompt decode in multi-token prediction (MTP). This is the kind of optimization that doesn't show up in benchmark headlines but matters for anyone running agents locally.

Prompt processing speed directly affects agent latency. When your agent makes multiple tool calls in sequence, each one requires prompt processing. If you can shave milliseconds off each prompt decode by avoiding unnecessary memory copies, the compound effect across an agent loop is meaningful. The PR description is terse: "improved prompt processing speed." For agent builders running local models, this is a free performance gain. Update your llama.cpp.

The pattern that connects

Read these sources together and a pattern emerges. The engineers shipping working agent systems are not the ones chasing the biggest model or the most tools. They're the ones building tight scaffolding around modest models.

Compound tools reduce decision points. Token budgeting prevents context bloat. Improvement loops optimize for recovery instead of perfection. Deterministic pre-processing keeps the model out of decisions it doesn't need to make. Code graphs replace grep searches. Escalation architectures let you run local 95% of the time. Observability makes waste visible.

These are not glamorous techniques. Nobody is going to give a keynote about truncating file contents before sending them to a 4B model. But this is the work that turns agent demos into agent systems. The model is the easy part. Everything around it is where the engineering happens.
