Appearance
If you are building production LLM systems today, you are almost certainly leaving 30-70% of the Claude platform's capability on the table. Most engineers only ever call the raw Messages API. Almost no one uses the full stack of tooling, cost optimizations, architectural patterns and operational tricks that Anthropic has quietly shipped and dogfooded internally over the last six months.
This is not marketing material. Every pattern, number and guideline here comes directly from official Anthropic engineering posts and internal usage reports. None of this is hypothetical. All of this is running in production right now, both at Anthropic and at their largest customers.
The advisor strategy is the single biggest cost win available today
This is not a trick. This is a supported, production ready architectural pattern that will change how you run agentic workloads.
The standard pattern everyone uses is run the best model you can afford end to end. The advisor pattern inverts this. Run Sonnet or Haiku as the executor for 95% of the work. Only escalate to Opus when the executor encounters a decision it cannot resolve. Opus never calls tools, never generates user output, never does busy work. It only writes 400-700 token guidance notes back to the executor.
Benchmark numbers are unambiguous. On SWE-bench Multilingual, Sonnet + Opus advisor gains 2.7 percentage points over raw Sonnet, while reducing cost per task by 11.9%. On BrowseComp, Haiku + Opus advisor scores 41.2%. That is double raw Haiku's 19.7% score, for 85% less cost than running Sonnet alone.
You do not need to write any orchestration logic. This is a native API feature. Add one entry to your tools array, set a max_uses cap, and the model will decide when to escalate automatically. All context handoff happens server side inside a single messages request. No extra round trips. No context management code on your end.
Almost every production agent deployed today can be upgraded to this pattern with 3 lines of code change. Almost no one is doing this yet.
Prompt caching is not just for cost. It is for throughput.
Everyone knows prompt caching cuts cost by up to 90%. Almost no one knows that as of February 2025, cache read tokens no longer count against your Input Tokens Per Minute rate limit.
This is not a minor change. This completely changes the scaling equation for high volume workloads. For any workload with a shared system prompt, RAG context or instruction set, you can now run effectively unlimited input throughput. You are only ever limited by output tokens.
Caching also no longer requires manual segment tracking. Set a cache breakpoint once, and Claude will automatically match and reuse the longest possible cached prefix for every subsequent request. You do not need to track cache keys, version segments or handle invalidation. The platform handles all of it.
Cognition, the builders of Devin, report this single change let them double their production throughput without requesting any rate limit increases.
Token efficient tool use cuts output tokens by 70%
Tool call formatting was one of the largest hidden waste sources on the Claude API. Prior versions would output full structured JSON schemas on every tool invocation, often generating more tokens for the tool wrapper than for the actual parameters.
The new token efficient tools beta fixes this. Enable it with one header flag, and average output token consumption for tool calling workloads drops by 14%. For workloads that make heavy use of parallel tool calls, savings go up to 70%.
This feature is backwards compatible. No changes required to your tool definitions, parsing logic or error handling. There is no downside to enabling this on every single request you make today.
Also use the new native text_editor tool for any document or code modification tasks. This eliminates the extremely wasteful pattern where Claude would re-output an entire 1000 line file just to change 3 lines.
Claude Code session management rules
Claude Code is not just a chat interface for writing code. It is the most mature production agent runtime available today. How you manage sessions will determine 80% of your success with it.
The 1 million token context window is not an invitation to run one session forever. Context rot is real. Model performance degrades linearly as the context window fills. By the time you hit 800k tokens, reasoning quality has dropped by roughly 30%.
Follow these hard rules, directly from the Claude Code engineering team:
- Start a new session for every new discrete task.
- If you are correcting a failed attempt, do not reply "that didn't work". Rewind to before the attempt, and restate the prompt with the new information.
- Never let autocompact run. Always run compact manually with explicit instructions for what to retain and what to drop.
- For any task that will generate more than 100 lines of intermediate output, explicitly tell Claude to spawn a subagent. Only the final conclusion will be brought back into the parent context.
If you follow these rules you will get consistently better results, run into far fewer strange failures, and spend half as much on tokens.
Stop using Markdown. Use HTML.
This is the single most counterintuitive, highest impact productivity tip that has come out of the Claude team in the last year.
Markdown was designed for humans writing text. It is a terrible format for agents generating output for humans to read. Once agents are producing documents longer than 100 lines, almost no one actually reads the full Markdown output.
Every engineer on the Claude Code team now asks for HTML output by default. HTML lets Claude generate structured documents, tabbed interfaces, inline annotations, color coding, interactive diagrams and adjustable parameters. You can share a link directly. No one has to download an attachment.
You do not need to write any templates. Just add "output the result as a single standalone HTML file" to the end of your prompt. Claude will handle everything else.
This is not about pretty output. This is about actually reading the work the agent produces. The probability that a colleague will read and understand a report generated by Claude doubles when it is delivered as HTML instead of Markdown.
How non engineers are building production tools at Anthropic
The most important story about the Claude platform is not being told to engineers. Jared Sires was an account executive. He had never opened a terminal, never written a line of code. 12 months after joining Anthropic he is the GTM product manager building internal tools used by 80% of the sales organization.
He built CLAFTS, the internal email drafting tool, using Claude Code. It is 4300 lines of code. Almost all of it was written by Claude. It saves every sales rep 10-15 hours per week.
This is not an edge case. This is the new normal. The barrier to building production tooling is no longer knowing how to code. It is knowing how to clearly describe a problem, and iterate on the output.
All of the tooling discussed in this article was built to enable exactly this. This is not about replacing engineers. It is about expanding who can build useful software.
Claude Cowork and the end of the conversational AI interface
Almost every AI tool released in the last three years has followed the same pattern: you type a question, you get an answer. All of the work to turn that answer into something useful remains manual.
Claude Cowork breaks this pattern. It does not reply to questions. It executes tasks end to end. It reads and writes local files, connects to your calendar, CRM, email and internal tools. It runs multi step workflows, reports progress, and asks for approval only when it hits a decision boundary.
This is not a chat bot. This is the first general purpose knowledge work agent that you can actually delegate work to. When configured correctly it will run in the background all day, doing the administrative work that currently occupies 60% of most knowledge worker time.
Distillation on Bedrock
For high volume production workloads, distillation is now the default correct approach. You can now distill task specific behaviour from Sonnet down to Haiku, with near identical accuracy, at 20% of the cost and 60% lower latency.
Unlike fine tuning, you do not need to prepare training data. Amazon Bedrock will automatically generate synthetic training examples from the teacher model, run training, evaluate and host the distilled model.
This is the correct pattern for any workload that runs more than 10,000 requests per month. Build and validate your workflow once on Sonnet or Opus. Distill it down to Haiku for production.
Production operational tips
Run four variants of every workload in your eval suite: raw Haiku, Haiku + advisor, Sonnet + advisor, raw Opus. In 9 out of 10 cases one of the advisor configurations will beat the raw larger model on both cost and quality.
Use the Anthropic Console for all prompt development. Stop copying prompts between Slack and Notion. The console has native version control, side by side evaluation, automatic prompt refinement and one click production code export.
Always set max_uses on the advisor tool. 3 is a good default for most workloads. You will almost never see more than 3 advisor escalations on any single task.
Cache everything. If you are sending the same 100 line system prompt on every request, you are throwing away money and throughput.
Closing observations
None of this is secret. All of this has been published publicly on the Claude blog over the last six months. Almost no one is using most of it.
Most engineers still treat LLM platforms as a black box text completion API. They spend thousands of hours writing orchestration code, fine tuning models and optimizing prompts, while ignoring the 10x improvements that are already available, supported, and documented natively on the platform.
The teams that win over the next 12 months will not be the teams that build the best custom models. They will be the teams that actually use the full capability of the platforms that already exist.