Skip to content

Anthropic just shipped a complete production agent stack, and almost no one noticed

#claude #llm-agents #developer-tools #anthropic #llm-engineering

Most commentary on recent Claude releases has treated each announcement as an isolated feature upgrade. That is a mistake. Anthropic has not just been making their model bigger. Over the last three quarters they have quietly shipped every single primitive required to run production LLM agents. This is a complete end to end stack, not a collection of demo features. Right now it is at least 6 months ahead of every competing platform.

1M token context is not for reading books. It is for agents.

Almost all coverage of the Sonnet 4 1M token launch framed it as a document processing feature. That misses the point entirely. This is not for loading novels. This is the first context window large enough to hold everything an agent needs to run indefinitely, without state management hacks. You can fit full tool definitions, API schemas, system prompt, 800+ previous tool call turns, and the full working set of the task all in one window. No chunking. No rolling context window eviction. No vector database lookups that silently drop critical state. Pricing is tiered correctly. The first 200k tokens stay at the original $3 / MTok input rate. Anything over doubles to $6 / MTok. When combined with prompt caching this works out to ~$0.12 per full 1M token prompt refresh. For comparison, GPT-4o 2M costs $10 / MTok for all input. You can load approximately 75,000 lines of code in one request. That covers most production microservices, end to end. This is the first model where you can reasonably pass an entire codebase as context, not just individual files.

The agent API primitives no one else has shipped

On March 18 Anthropic rolled out four base agent capabilities that every other LLM provider still treats as userland code you have to build yourself:

  1. Native sandboxed code execution. Claude can run Python, inspect output, iterate and retry all within a single API turn. You do not have to run your own sandbox, handle timeouts, scrub environment variables or implement retry logic. This runs inside the Anthropic API, not your infrastructure.
  2. MCP connector. Standard interface for connecting agents to external tools, APIs and services. No custom function calling wrappers required.
  3. Persistent Files API. Agents can upload, reference and modify files across sessions without passing raw bytes back and forth in every prompt.
  4. 60 minute prompt cache. Cached prompt segments cost 1/10th normal pricing, and persist for one full hour. This was built explicitly for long running agents that keep 90% of their context static across turns. None of these are research previews. All are available in production on Bedrock, Vertex and the native API right now.

Claude Code is not a chat assistant. It is an agent orchestrator

Most engineers still think Claude Code is just another terminal chat bot. It is not. It has quietly become the most capable general purpose agent runtime available. Two recent releases changed everything. Auto mode solves the permission problem. For two years every code agent had exactly two modes: stop and ask for approval every 10 seconds, or turn off all guards entirely and hope it does not delete your home directory. Auto mode inserts a classifier between the agent and tool calls. Safe actions run automatically. Risky actions get blocked. Ambiguous actions prompt the user. Early testing shows this reduces human approval requests by ~85% on typical codebase tasks, while blocking 92% of destructive actions that would run with full permissions enabled. Agent View solves the operational problem. You can now launch 10 concurrent agents, send them all to the background, and only get notified when one needs human input. No more 12 open terminal tabs. No more mental tracking of what each agent is doing. You can peek at progress, reply inline, or terminate tasks without attaching to the session. This is the first agent interface that actually scales past one running agent. Everyone else is still building agents for one task at a time.

Security was built in, not bolted on

This is the most under discussed difference between Anthropic and every other provider. They did not build agent capabilities then go back and add security. They shipped security controls in lockstep. The /security-review command runs full codebase vulnerability scanning locally, and can auto remediate found issues. The GitHub Action runs this on every pull request, inline with code changes. It checks for SQL injection, XSS, auth flaws, dependency issues and insecure data handling. Unlike every third party security scanner, this agent understands the full context of your codebase. It does not just grep for vulnerable patterns. It understands how data flows across files. Project Glasswing extends this. All Claude agent tool calls run through the same vulnerability classifier that is being trained on public vulnerability disclosures. This is not an afterthought. It runs before every code execution or file write.

The hard tradeoffs no one talks about

This stack is not perfect. There are very real tradeoffs you need to know before building on it. Long context pricing doubles after 200k tokens. If you are running agents that consistently use 600-800k tokens per turn, costs add up very quickly. Batch processing brings this down 50%, but that is only usable for non-interactive workloads. Auto mode is not perfect. It will occasionally block perfectly safe actions, and occasionally allow risky ones. Anthropic explicitly still recommends running this in isolated environments. Do not run auto mode agents against your production filesystem. The 1M context window does not have perfect recall at the end of the window. Independent testing shows Sonnet 4 correctly retrieves information placed anywhere in the first 800k tokens ~98% of the time, dropping to 91% at 1M. That is good enough for most agent use cases, but it is not perfect. Click accuracy for computer use is still extremely sensitive to screenshot resolution. If you do not downscale screenshots to exactly 1024x768, click error rates double. This is documented, but almost no one reads the implementation guide.

Closing observation

Right now every other major LLM provider is still competing on raw benchmark scores and flashy one off demo agents. Anthropic has stopped doing that. For the last year they have shipped nothing but boring, practical infrastructure required to run agents in production. No one else has a complete stack this mature. No one else has solved the permission problem. No one else has a usable interface for running multiple concurrent agents. No one else has integrated security at the tool call layer. Most people are still arguing about which model has the highest MMLU score. Meanwhile Anthropic already built the thing everyone said they wanted. If you are trying to build production LLM agents right now, this is the stack you should be evaluating.