Production LLM Agents: No One Cares About Your Model. They Care About Your Harness.

#llm-agents #hermes-agent #harness-engineering #agent-memory #production-ml

An LLM once talked an engineer into the wrong database.

It answered with perfect confidence. He shipped it. It cost him a weekend and a migration he still complains about.

This story is not an edge case. It is the default failure mode of every single agent system deployed today. You do not get warned. You do not see the disagreement. You get one polished answer, and you only find out it was wrong after you have already acted on it.

Every production agent deployment built in the last six months has run into exactly this wall. And every one of them has arrived at the same conclusion: the model stopped being the bottleneck six months ago.

The hard problems now are all outside the model. They are memory architecture. They are harness design. They are cost drift, skill rot, trust boundaries, and the quiet compounding failure modes that no demo will ever show you.

The quiet shift no one is talking about

We crossed an invisible threshold this year.

For the first time, the base model is good enough for 90% of production agent use cases. You can swap GPT-4o, Claude 3.5 Sonnet, DeepSeek V3, or a good local 70B model and get broadly equivalent results on real tasks. The difference between the best model and a good one is smaller than the difference between a good harness and a bad one.

This is not a popular thing to say. It does not drive Twitter engagement. It does not sell model API credits. But every engineer actually running agents in production will tell you the same thing. Benchmark scores stopped correlating with real world performance once you add a competent harness.

Harness engineering is the discipline that emerged to fill this gap. It is the design of the scaffolding that surrounds the model: context delivery, verification loops, memory systems, sandboxes, permission gates. It is everything that determines if an agent will actually work, instead of just looking good in a demo.

Over the last month four separate production Hermes Agent deployments landed on DEV.to and GitHub. None of them talked about model capability. All of them talked about exactly these problems. This article synthesizes what they found.

Memory is not storage. It is infrastructure.

Most agent memory systems are digital attics. You put things in. You hope to find them later. You mostly don't.

This is the wrong mental model. Storage is where things go to accumulate. Infrastructure is load bearing. Agents do not need a warehouse. They need a power grid.

The difference is not semantic. Storage fails silently. You put something in and nothing comes out, or something wrong comes out, and the agent keeps going with degraded information it cannot see is degraded. Infrastructure fails loudly. That is not a downside. That is the point.

The core failure of vector search memory is the sequencing problem. Causality is only visible in retrospect.

An agent logs: deployment failed due to timeout. A week later it logs: switched to async pattern, deployment succeeded. These two entries belong together. They are the before and after of the same causal chain.

At write time you cannot link them. When the failure happens there is no resolution to tag. When the resolution happens the failure is already buried in the index. And these two entries are semantically distant. Vector search will never reliably find the connection.

The working fix emerging from production deployments has three parts:

  1. Instrumented capture at write time. Log intent, not just outcome. When an agent makes a tool call, record what it was attempting, not just what happened. "Attempting calibration sequence v2" carries far more signal than "calibration failed".
  2. Temporal mirror reflection pass. Run a small reasoning pass once per ingestion event. This pass looks for structural complementarity across recent entries, not semantic similarity. It finds the failure and resolution pairs that vector search misses. This is a fixed cost at write time, not a compounding cost at query time.
  3. Forensic receipts. Once a causal link is found, store it as an explicit UUID edge, not another embedding. The agent does not search for the connection. It is already encoded.
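
The three parts above can be sketched in a few lines. This is a minimal illustration, not the implementation from any of the deployments: the `Entry`, `CausalEdge`, and `Memory` names and the `subject` tag are assumptions made for the example, and the reflection pass is a plain rule (a success resolves the most recent unresolved failure on the same subject) standing in for the small reasoning pass described above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    intent: str      # what the agent was attempting, captured at write time
    outcome: str     # "failure" or "success"
    subject: str     # hypothetical tag for the thing being worked on
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class CausalEdge:
    cause_id: str        # UUID of the failure entry
    resolution_id: str   # UUID of the entry that resolved it
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Memory:
    def __init__(self, window: int = 50):
        self.entries: list[Entry] = []
        self.edges: list[CausalEdge] = []
        self.window = window   # how many recent entries one reflection pass inspects

    def ingest(self, entry: Entry) -> None:
        """Run one reflection pass against recent entries, then store the entry."""
        self._reflect(entry)
        self.entries.append(entry)

    def _reflect(self, new: Entry) -> None:
        # Structural complementarity, not semantic similarity: pair a success
        # with the most recent unresolved failure on the same subject.
        if new.outcome != "success":
            return
        resolved = {edge.cause_id for edge in self.edges}
        for old in reversed(self.entries[-self.window:]):
            if (old.subject == new.subject
                    and old.outcome == "failure"
                    and old.id not in resolved):
                self.edges.append(CausalEdge(old.id, new.id))
                return

    def resolution_for(self, failure_id: str) -> Optional[Entry]:
        """Deterministic lookup: follow the stored edge, no search involved."""
        by_id = {e.id: e for e in self.entries}
        for edge in self.edges:
            if edge.cause_id == failure_id:
                return by_id[edge.resolution_id]
        return None
```

The cost profile is the point of the design: `_reflect` runs once per ingestion over a bounded window, and `resolution_for` never touches an embedding index.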

This architecture trades a small fixed overhead at ingestion for deterministic retrieval. You pre-pay for precision once, instead of paying repeatedly for imprecision every time you query memory.

The jury pattern done right

The single most effective pattern for eliminating overconfident wrong answers is also the simplest: never ask one model.

The Council implementation built for the Hermes Agent Challenge demonstrates this perfectly. It works like this:

  1. Take any judgement call. Fan it out to three separate models in parallel. Two hosted, one local.
  2. Collect positions and reasoning from each. If they disagree, run a second round. Show every juror the arguments from the others. Ask each to hold or change their position.
  3. Pass all deliberated opinions to an independent judge agent. Return a single verdict, a confidence score, and a full breakdown of where and why they disagreed.

This is not a vote. This is deliberation.
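
A rough sketch of the two-round flow follows. It is not the Council code: every name here is hypothetical, the jurors are stub functions where a real system would call three different models, and `tally_judge` is a placeholder that only counts final positions. A real judge would be a separate model call that weighs the reasoning as well.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Opinion:
    juror: str
    position: str
    reasoning: str


def poll(jurors: dict, question: str, seen: list) -> list:
    """Fan the question out to every juror in parallel; order follows `jurors`."""
    def ask(item):
        name, juror = item
        # Each juror sees the other jurors' opinions, never its own.
        return juror(question, [o for o in seen if o.juror != name])
    with ThreadPoolExecutor(max_workers=len(jurors)) as pool:
        return list(pool.map(ask, jurors.items()))


def deliberate(jurors: dict, question: str, judge: Callable) -> dict:
    first = poll(jurors, question, [])
    final = first
    if len({o.position for o in first}) > 1:   # disagreement: run round two
        final = poll(jurors, question, first)
    return judge(first, final)


def tally_judge(first: list, final: list) -> dict:
    """Placeholder judge: majority position plus a record of movement and dissent."""
    verdict, votes = Counter(o.position for o in final).most_common(1)[0]
    return {
        "verdict": verdict,
        "confidence": votes / len(final),
        "changed": [b.juror for a, b in zip(first, final) if a.position != b.position],
        "dissent": [o for o in final if o.position != verdict],
    }


# Stub jurors for illustration only.
def fixed(name: str, position: str) -> Callable:
    return lambda question, others: Opinion(name, position, "holds position")


def persuadable(name: str, position: str) -> Callable:
    def juror(question, others):
        if others:  # round two: adopt the most common position among the others
            top = Counter(o.position for o in others).most_common(1)[0][0]
            return Opinion(name, top, "changed after reading the others")
        return Opinion(name, position, "initial view")
    return juror


jurors = {
    "hosted_a": fixed("hosted_a", "postgres"),
    "hosted_b": fixed("hosted_b", "postgres"),
    "local": persuadable("local", "mongodb"),
}
result = deliberate(jurors, "Which database fits this workload?", tally_judge)
```

Returning `changed` and `dissent` alongside the verdict is what keeps the disagreement visible to the person acting on the answer.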

On factual questions one juror will almost always cite the correct reference, and the rest will fall in line. On judgement calls you will see actual movement. Jurors change their minds. Confidence climbs as a 2-1 split becomes unanimous.

The most important finding: homogeneous panels agree too easily. Diversity of model family beats raw model capability every single time. Three different 7B models will produce a more reliable verdict than one 400B model.

Every verdict and dissent is written into memory. Over time the system learns which juror to trust for which class of question. Weighting adjusts automatically, with human approval required for any change.
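
One way to keep weighting automatic while still gating it on a human is to separate proposing a change from applying it. The sketch below assumes a hypothetical `JurorWeights` class and a smoothed accuracy score per question class. The scoring formula is an illustrative choice, not something the Council implementation specifies.

```python
from collections import defaultdict


class JurorWeights:
    def __init__(self, jurors):
        # Active weights per question class; every juror starts at 1.0.
        self.active = defaultdict(lambda: {j: 1.0 for j in jurors})
        # Track record per question class and juror: [correct, total].
        self.record = defaultdict(lambda: defaultdict(lambda: [0, 0]))
        self.pending = {}

    def log(self, question_class: str, juror: str, was_correct: bool) -> None:
        tally = self.record[question_class][juror]
        tally[0] += int(was_correct)
        tally[1] += 1

    def propose(self, question_class: str) -> dict:
        """Compute new weights from the track record without applying them."""
        proposal = {}
        for juror, (correct, total) in self.record[question_class].items():
            # Add-one smoothing so a single early miss does not zero a juror out.
            proposal[juror] = (correct + 1) / (total + 2)
        self.pending[question_class] = proposal
        return proposal

    def approve(self, question_class: str, approved_by: str) -> None:
        """Apply a pending proposal; refuses to run without a named approver."""
        if not approved_by:
            raise PermissionError("weight changes need human approval")
        self.active[question_class].update(self.pending.pop(question_class))
```

Nothing in `propose` touches `active`, so an unattended run can accumulate evidence and queue proposals but cannot change how verdicts are weighted.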

This pattern catches 80% of the confidently wrong answers that sink single model deployments. It adds ~12 seconds of latency per verdict. For any decision that matters, that is the best tradeoff you will ever make.

Compounding capability compounds every problem

Hermes Agent's defining feature is that it compounds. It writes its own reusable skills as plain markdown files. Once it has solved a problem, it never has to re-derive the solution. The marginal cost of that task trends towards zero.

This is genuinely revolutionary. It is also the trap.

Compounding is a system property, not a feature. Properties do not take sides. The same loop that compounds capability also compounds cost, drift, and trust surface.

Cost drift is the first thing that bites. The happy path says skills make tasks cheaper. That is true per run. But two forces work in the opposite direction:

  • Every skill added to the library increases context overhead. At 5 skills the overhead is negligible. At 300 unpruned skills the agent spends more tokens deciding which skill to use than it spends executing the task.
  • Autonomy removes the natural brake. A stateless chatbot only costs money when you type. A scheduled agent costs money when you are asleep. The most common failure reported is waking up to a $47 surprise bill from an overnight recursive run.
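
The brake that autonomy removes can be put back in code. Here is a minimal sketch of a hard spend cap plus step limit, checked before every call. `BudgetGuard`, `BudgetExceeded`, and `run` are hypothetical names, and the per-step cost is assumed to be an estimate the caller supplies. This is not a Hermes configuration option.

```python
class BudgetExceeded(RuntimeError):
    pass


class BudgetGuard:
    def __init__(self, max_usd: float, max_steps: int):
        self.max_usd = max_usd
        self.max_steps = max_steps
        self.spent_usd = 0.0
        self.steps = 0

    def charge(self, estimated_usd: float) -> None:
        """Call before every model or tool call; raises instead of overspending."""
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} reached")
        if self.spent_usd + estimated_usd > self.max_usd:
            raise BudgetExceeded(f"spend cap of ${self.max_usd:.2f} would be exceeded")
        self.steps += 1
        self.spent_usd += estimated_usd


def run(steps, guard: BudgetGuard):
    """Execute (estimated_usd, action) pairs until done or the guard trips."""
    completed = 0
    try:
        for estimated_usd, action in steps:
            guard.charge(estimated_usd)   # check first, act second
            action()
            completed += 1
    except BudgetExceeded as stop:
        return completed, str(stop)
    return completed, "finished"
```

Checking before the call rather than after is the design choice that matters: a recursive run stops at the cap instead of one step past it.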

Skill rot is the silent failure that arrives at day 90. A self authored skill is code that no human reviewed, with no tests, no owner, and no expiry. When an external API changes the skill does not know. It will continue to produce confidently wrong output forever. Nothing in the default loop will ever notice.
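
A skill can be given an expiry the same way a certificate has one. The sketch below is one possible shape, under stated assumptions: the `Skill` record, its `ttl_days` field, and the `smoke_test` callable are all hypothetical, and the default loop has nothing like them.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable


@dataclass
class Skill:
    name: str
    verified_on: date                # last date the skill was known to work
    ttl_days: int                    # hypothetical trust window
    smoke_test: Callable[[], bool]   # cheap check against the real dependency

    def is_expired(self, today: date) -> bool:
        return today > self.verified_on + timedelta(days=self.ttl_days)


def usable(skill: Skill, today: date) -> bool:
    """Trust a fresh skill; re-verify an expired one before using it."""
    if not skill.is_expired(today):
        return True
    try:
        passed = skill.smoke_test()
    except Exception:
        passed = False               # a crashing check counts as a failed check
    if passed:
        skill.verified_on = today    # renew the trust window
    return passed
```

When `usable` returns `False`, the safe fallback is to re-derive the procedure rather than run the stale one.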

Self improvement and overfitting are the same gradient. The loop cannot tell them apart. You can only hope it is pointed in a good direction.

Agents need operating systems, not chat interfaces

The biggest mistake people make with Hermes is treating it like a better chatbot. It is not. It is an agent runtime.

Runtimes need operating discipline. The verifiable agent harness implementation demonstrates this clearly. It splits agent state into separate layers, each with the correct durability for the information it holds:

| Layer | Responsibility |
| --- | --- |
| Hermes memory | Stable facts only |
| Hermes skills | Reusable procedures |
| Repo files | Project local state |
| Task tracker | Active work ownership |
| Session search | Historical recall |
| Human approval | External side effects |

The rule is simple: store information in the lowest layer that is durable enough for its expected lifetime.

This one rule eliminates 90% of the drift that plagues long running agents. Task state does not get mixed up with global memory. Project conventions do not leak into other work. Reusable fixes do not get lost in chat history.

Every action runs through an evidence loop: Intent → Action → Artifact → Verification → Report. The agent never says "I did it". It says "here is the artifact, here is how I verified it".
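
The loop is easy to express as a wrapper that refuses to report success without an artifact and a check. This sketch assumes the artifact is a file on disk. `Report` and `run_with_evidence` are hypothetical names, not part of the harness described above.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Report:
    intent: str
    artifact: Path
    verified: bool
    evidence: str

    def summary(self) -> str:
        return (f"{self.intent}: artifact={self.artifact}, "
                f"verified={self.verified}, {self.evidence}")


def run_with_evidence(intent: str,
                      action: Callable[[], Path],
                      verify: Callable[[Path], bool]) -> Report:
    artifact = action()                      # Action must return its Artifact
    if not artifact.exists():
        return Report(intent, artifact, False, "artifact missing")
    verified = verify(artifact)              # Verification is separate from the action
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
    return Report(intent, artifact, verified, f"sha256={digest}")


# Usage: write a config file, then verify its content independently.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "config.txt"

    def write_config() -> Path:
        target.write_text("timeout=30\n")
        return target

    report = run_with_evidence(
        "set timeout to 30s",
        write_config,
        lambda path: "timeout=30" in path.read_text(),
    )
    print(report.summary())
```

The hash gives a reviewer something concrete to compare against later, which is the difference between "I did it" and a claim that can be checked.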

This is not magic. This is boring production engineering. And it is exactly what every demo skips.

The failure mode taxonomy no one publishes

All production agent deployments fail in the same small set of ways. None of these failure modes are mentioned in marketing material. All of them are predictable. All of them are manageable.

| Failure mode | What it looks like | Root cause | Control |
| --- | --- | --- | --- |
| Cost blowout | Surprise overnight bill | Unbounded recursion + delegation | Hard spend caps; step limits |
| Skill rot | Confidently wrong output | Stale procedure trusted over re-derivation | Skill expiry dates; smoke tests |
| Skill drift | Behaviour changes slowly for no reason | Refinement overfit to recent noise | Version control on skills directory |
| Skill collision | Same input, different output | Overlapping contradictory skills | Periodic audit and deduplication |
| Durable injection | Malicious behaviour survives restarts | Poisoned skill persisted to disk | Sandboxed execution; approval gated writes |
| Silent failure | Task "succeeds" but output is garbage | No verification step | Output checks; human in the loop on high stakes actions |
| Context bleed | Cross task state contamination | Shared memory across unrelated work | Profile isolation; scoped subagents |

If this table looks exactly like standard production engineering, that is the point. Agents are just a new kind of production system. You engineer for their failure modes exactly like you would for any other software.

What actually works right now

None of this is theoretical. These are the controls that every production Hermes deployment has adopted by day 30:

  1. Put the entire skills directory under git version control. Every skill the agent writes becomes a diff you can read, blame, and revert. Self improvement becomes a series of pull requests from your agent to your repo. Review them like you would review a teammate's code.
  2. Set a hard spend cap and step limit before you do anything else. Start tiny. Widen boundaries deliberately. Cost is the only failure you can fully prevent with config.
  3. Sandbox by default. Docker is the standard choice. Grant credentials like you grant SSH keys. Read only until a capability has earned write access.
  4. Prune skills every two weeks. Delete anything that has not been used in 30 days. The agent will re-derive it if it actually needs it. Bloat is a far bigger risk than inconvenience.
  5. Never run an agent with more than 12 active skills. If you need more, split them into separate profiles.
  6. For any high stakes decision, run the jury pattern. The latency penalty is worth every millisecond.
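
The pruning and active-skill rules in the list above reduce to a small planning function. The sketch assumes you already have a last-used date per skill from some usage log, which is hypothetical here. It returns a plan rather than deleting anything, so the removal itself can go through the same reviewed git workflow as any other change.

```python
from datetime import date, timedelta

MAX_ACTIVE = 12          # active skill ceiling from the list above
UNUSED_AFTER_DAYS = 30   # pruning threshold from the list above


def plan_prune(last_used: dict, today: date) -> dict:
    """Split skills into delete, keep, and overflow for a separate profile."""
    cutoff = today - timedelta(days=UNUSED_AFTER_DAYS)
    stale = {name for name, used in last_used.items() if used < cutoff}
    # Most recently used first, so the overflow is the least recently used.
    live = sorted(
        (name for name in last_used if name not in stale),
        key=lambda name: last_used[name],
        reverse=True,
    )
    return {
        "delete": sorted(stale),
        "keep": live[:MAX_ACTIVE],
        "move_to_other_profile": live[MAX_ACTIVE:],
    }
```

A skill last used exactly 30 days ago is kept. Only skills older than the cutoff land in `delete`.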

The open problems

We know enough now to build reliable agents that run for months. There are still hard unsolved problems:

We do not have a principled way to calculate minimum viable instrumentation. The Observer's Tax says any instrumentation heavy enough to change agent behaviour corrupts the signal you are trying to capture. We do not yet know where that line sits.

We do not have good models for skill decay. How often should an agent re-verify a procedure? How long can a skill be trusted before it rots?

We do not have good cross agent causal semantics. When multiple agents write to the same memory store, how do you preserve provenance and avoid overwriting causal edges?

None of these problems will be solved by better models. They will be solved by better harnesses, better memory architecture, and better operating discipline.

Closing

This is the quiet phase of agent adoption. All the hype is still about new models and benchmark scores. But out in production, everyone has already moved on.

The people actually building working agents are not arguing about context window sizes. They are pruning skill directories. They are setting spend caps. They are designing memory systems that preserve causal chains. They are building boring, reliable infrastructure.

That is where the real progress is happening right now.