Agent Skills Are Eating MCP: The Infrastructure Stack for Production Agents

Summary

MCP tool registries are collapsing under their own weight as agents scale. Skills-based architectures, persistent memory, and serverless deployment patterns form the replacement stack engineers are actually shipping. Anthropic's public skills repository and projects like CLI-Anything represent the shift from protocol-heavy integration to composable, self-contained agent capabilities.

Background & context

The Model Context Protocol (MCP) was designed to solve a real problem: giving LLMs standardized access to external tools and data sources. For a few integrations, it works fine. You wire up a file system tool, a database query tool, maybe a web search tool. The context window can handle the tool descriptions. The agent picks the right tool most of the time.

Then you add ten more tools. Then twenty. Each tool needs a schema description injected into the prompt. Each tool adds decision surface area for the model. The context window fills with tool metadata instead of user context. Latency climbs because the model has to reason over more options per turn. This is the MCP bloat problem, and it gets worse as agent complexity grows.

The community response has been to patch MCP with workarounds. "Code Mode" is one such approach: instead of registering every tool through MCP, you give the agent the ability to write and execute code directly. The agent generates its own integration logic on the fly rather than depending on a pre-registered tool for every possible operation. It trades tool-calling determinism for flexibility, and for many use cases that trade pays off.
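The execution half of Code Mode can be sketched in a few lines of Python. This is an illustrative sketch, not the ZenStack implementation: `run_generated_code` is a hypothetical helper, and the `source` string stands in for whatever the model produced. A child process with a timeout limits runaway code but is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(source: str, timeout: float = 5.0) -> dict:
    """Run model-generated Python in a child process and capture its output."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "generated.py"
        script.write_text(source)
        try:
            completed = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout.strip(),
        "stderr": completed.stderr.strip(),
    }


# Stand-in for code the agent wrote instead of calling a registered tool.
result = run_generated_code("print(sum(range(1, 11)))")
print(result["ok"], result["stdout"])
```

The structured result (success flag, stdout, stderr) is what goes back into the model's context, so the agent can read the error and retry when its own code fails.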

But Code Mode is a band-aid. The deeper fix is rethinking the abstraction entirely. Instead of registering individual tools through a protocol, you give agents skills: self-contained, composable capabilities that bundle the tool, the reasoning about when to use it, and the execution logic into one unit. Anthropic's public skills repository (github.com/anthropics/skills) is the clearest signal of this direction. So is K-Dense-AI's scientific-agent-skills, which packages research, analysis, and writing capabilities as ready-to-deploy agent modules.

Technical deep dive

Skills vs. tools: the architecture shift

A tool in MCP is a function signature plus a natural language description. The model sees the signature, reads the description, and decides whether to call it. The orchestration logic lives in the model's reasoning, which means it lives in the prompt. More tools, more prompt tokens, more reasoning overhead.

A skill packages three things together: the tool interface (what it does), the invocation logic (when and how to call it), and the post-processing (what to do with the result). The model does not need to reason over raw tool schemas. It reasons over skill descriptions that are already scoped to a task domain.
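A minimal Python sketch of that three-part bundle follows. The names (`Skill`, `should_run`, `run`, `postprocess`) are hypothetical and chosen for illustration; they are not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Skill:
    """Hypothetical skill: interface, invocation logic, and post-processing in one unit."""
    name: str
    description: str                    # the text the model reasons over
    should_run: Callable[[str], bool]   # invocation logic: when to call
    run: Callable[[str], str]           # tool interface: what it does
    postprocess: Callable[[str], str]   # what to do with the raw result

    def handle(self, request: str) -> Optional[str]:
        """Run the skill end to end, or return None if it does not apply."""
        if not self.should_run(request):
            return None
        return self.postprocess(self.run(request))


# Illustrative skill: counts the words that follow a "count:" prefix.
word_count = Skill(
    name="word_count",
    description="Count the words in a piece of text.",
    should_run=lambda req: req.startswith("count:"),
    run=lambda req: str(len(req.removeprefix("count:").split())),
    postprocess=lambda raw: f"{raw} words",
)

print(word_count.handle("count: skills bundle three things"))
print(word_count.handle("what is the weather?"))
```

The model only ever sees `description`; the decision about when to run and how to shape the result lives in the skill rather than in the prompt.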

Anthropic's skills repository demonstrates this pattern. Each skill is a self-contained folder built around a SKILL.md file: a short name and description up front, followed by instructions, with optional scripts and resources alongside. The agent loads the skills relevant to its current task rather than loading every available capability into context. That turns capability selection into a retrieval problem rather than a prompt-size problem.
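Selective loading can be sketched as a small retrieval step over skill descriptions. The sketch below is an assumption-laden illustration: `select_skills` and `CATALOG` are hypothetical, and word overlap stands in for the embedding search a real system would more likely use.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_skills(
    task: str, catalog: dict[str, str], top_k: int = 2, min_overlap: int = 2
) -> list[str]:
    """Return up to top_k skill names whose descriptions best overlap the task."""
    task_tokens = tokenize(task)
    scored = []
    for name, description in catalog.items():
        overlap = len(task_tokens & tokenize(description))
        if overlap >= min_overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical catalog: only these one-line descriptions are searched.
CATALOG = {
    "pdf_report": "generate a pdf report from tabular data",
    "sql_query": "run a sql query against the analytics database",
    "web_search": "search the web for recent news",
}

print(select_skills("build a pdf report from this sales data", CATALOG))
```

Only the selected skills' full instructions would then be loaded into context; the rest of the catalog costs nothing beyond its one-line descriptions.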

CLI-Anything takes this further. Instead of wrapping individual CLI commands as MCP tools, it makes the entire CLI surface of any software available to agents through a unified interface. The agent discovers available commands, constructs invocation strings, and parses output. No per-tool MCP registration required. The CLI itself becomes the skill interface, and the agent dynamically determines what it needs from the available command set.
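The discover, invoke, and parse loop can be sketched with the Python standard library. This is not CLI-Anything's API: `run_cli` and `parse_flags` are hypothetical helpers, and the sketch assumes the target program prints its long flags in `--help` output.

```python
import re
import subprocess
import sys


def run_cli(argv: list[str], timeout: float = 30.0) -> dict:
    """Invoke a command without a shell and return a structured result."""
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=False
    )
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout.strip(),
        "stderr": completed.stderr.strip(),
    }


def parse_flags(help_text: str) -> list[str]:
    """Extract long flags such as --verbose from help text (a crude heuristic)."""
    return sorted(set(re.findall(r"(?<![\w-])--[a-z][a-z0-9-]*", help_text)))


# Discover: read the help text of a CLI. The Python interpreter is used here
# only because it is guaranteed to be installed wherever this sketch runs.
flags = parse_flags(run_cli([sys.executable, "--help"])["stdout"])
print(flags)

# Invoke and parse: pass arguments as a list, never as a shell string.
print(run_cli([sys.executable, "--version"])["stdout"])
```

Passing `argv` as a list instead of a single shell string matters here: the agent constructs the invocation, so avoiding the shell removes one class of injection bugs.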

Engineering agent memory

Skills solve the "what can the agent do" problem. Memory solves the "what does the agent know" problem. Stateless agents start fresh every session. They cannot build on prior interactions, maintain context across sessions, or learn from their own execution history.

The progression from stateless to persistent intelligence follows three layers.

Working memory is the current conversation context. This is what most agents already have. It lives in the prompt and disappears when the session ends.

Episodic memory stores interaction histories that the agent can retrieve across sessions. This requires a vector store or similar retrieval system. When the agent encounters a situation similar to one it has handled before, it can pull relevant prior interactions into context. The engineering challenge is retrieval quality: pulling the right episodes without flooding the context window with irrelevant history.

Semantic memory is the agent's accumulated knowledge about the world, its users, and its own capabilities. This is structured knowledge that persists indefinitely. It includes user preferences, domain knowledge, and procedural memory (how to accomplish specific tasks). This is where skills and memory intersect: a skill can update semantic memory with new procedural knowledge, and semantic memory can inform which skills the agent should load.

Kenwalger's "Engineering Agent Memory" article maps this progression clearly. The key implementation detail is that each memory layer requires its own storage backend, its own retrieval mechanism, and its own update policy. Working memory is just the prompt. Episodic memory needs a vector database with cosine similarity search and a relevance threshold. Semantic memory needs a structured store (a knowledge graph or relational database) with explicit schema and update logic.
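Episodic retrieval with cosine similarity and a relevance threshold can be sketched in plain Python. This is an illustrative sketch, not code from the article cited above: the three-dimensional vectors are toy values standing in for real embeddings, and `retrieve_episodes` and the 0.8 threshold are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_episodes(
    query_vec: list[float],
    episodes: list[tuple[str, list[float]]],
    threshold: float = 0.8,
    top_k: int = 3,
) -> list[str]:
    """Return the most similar episodes, dropping any below the threshold."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in episodes]
    relevant = [pair for pair in scored if pair[0] >= threshold]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in relevant[:top_k]]


# Toy episodic store: (summary, embedding). Real embeddings have hundreds of dimensions.
EPISODES = [
    ("User asked for a CSV export", [1.0, 0.0, 0.0]),
    ("User prefers dark mode", [0.0, 1.0, 0.0]),
    ("User asked for a JSON export", [0.9, 0.1, 0.0]),
]

print(retrieve_episodes([1.0, 0.05, 0.0], EPISODES))
```

The threshold is what keeps irrelevant history out of the context window: without it, `top_k` would always return something, even when nothing in the store is related to the query.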

Running agents on serverless infrastructure

AWS Lambda got a persistent file system (EFS) and someone immediately put AI agents on it. This is more interesting than it sounds.

The traditional agent deployment pattern uses a long-running server or container. The agent maintains state in memory, calls external APIs, and waits for responses. This works, but you pay for idle time and have to manage the infrastructure yourself.

Lambda with EFS changes the equation. The agent can persist files across invocations without S3 round trips. It can maintain a working directory with downloaded models, cached API responses, and intermediate results. Cold starts are still a problem, but for many agent workloads the latency budget is already measured in seconds, not milliseconds.

The pattern looks like this: an S3 event or API Gateway request triggers the Lambda function. The agent loads its current state from EFS, executes its task (which may involve multiple LLM calls, tool invocations, and file operations), writes updated state back to EFS, and returns. Each invocation is a discrete agent step. For longer-running tasks, you chain Lambda invocations through SQS or Step Functions.
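A handler following this load, execute, write-back shape might look like the sketch below. It rests on several assumptions: the EFS access point is mounted at `/mnt/agent` (overridable through a hypothetical `AGENT_STATE_DIR` environment variable), the event carries `agent_id` and `task` fields, and `run_agent_step` is a placeholder for the real LLM and tool calls.

```python
import json
import os
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read prior state from the shared file system, or start fresh."""
    if path.exists():
        return json.loads(path.read_text())
    return {"steps_completed": 0, "notes": []}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, to reduce the risk of a truncated state file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def run_agent_step(state: dict, task: str) -> dict:
    """Placeholder for the real work: LLM calls, tool invocations, file operations."""
    state["steps_completed"] += 1
    state["notes"].append(f"handled: {task}")
    return state


def handler(event: dict, context: object) -> dict:
    """Lambda entry point: one invocation is one discrete agent step."""
    # Assumption: the mount path is configured on the function; /mnt/agent is illustrative.
    state_dir = Path(os.environ.get("AGENT_STATE_DIR", "/mnt/agent"))
    state_file = state_dir / f"{event['agent_id']}.json"
    state = run_agent_step(load_state(state_file), event["task"])
    save_state(state_file, state)
    return {"agent_id": event["agent_id"], "steps_completed": state["steps_completed"]}
```

The sketch does no locking, so two concurrent invocations for the same `agent_id` could overwrite each other's state. Routing each agent's steps through a queue or a Step Functions workflow, as described above, is one way to serialize them.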

This is not the right pattern for every agent. Real-time voice agents need sub-500ms response times, which cold starts can easily consume. Computer-use agents need to maintain a browser session. But for batch processing, document generation, and data analysis agents, serverless with persistent storage is a clean fit. Debs Obrien's documentation agent, which produced 55 pages of documentation and 59 screenshots in 4 days, is a textbook candidate for this pattern: discrete generation tasks that can run as independent Lambda invocations with shared EFS state for the output directory.

Computer-use and voice as alternative interaction models

Most agent frameworks assume a text-in, text-out interaction model. Two emerging approaches break this assumption.

Agent-S (from simular-ai) is an open framework for computer-use agents. Instead of calling APIs or CLI tools, the agent interacts with a graphical interface the way a human would: clicking buttons, typing in fields, reading screen content. This is slower and less reliable than API calls but it works with any software, including software that has no API. The agent uses visual grounding to locate UI elements, plans a sequence of actions, and executes them with error recovery.

Dograh is an open-source voice agent platform. Voice agents have stricter latency requirements than text agents. The user expects a response within 500ms, which means the agent cannot afford multiple LLM calls per turn for most interactions. Dograh handles this with a streaming architecture: the LLM generates tokens as the user speaks, and the text-to-speech pipeline starts before the full response is complete. The skill system is integrated at the intent-classification level: a fast model classifies the user's intent, loads the relevant skill, and the skill's execution logic runs in parallel with response generation.
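The idea of starting speech before the full response exists can be illustrated with a generator that releases each sentence as soon as its last token arrives. This is a generic sketch, not Dograh's implementation: `sentences_from_tokens` is a hypothetical helper, and the token list stands in for a streaming LLM response.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its final token arrives."""
    buffer: list[str] = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith(SENTENCE_END):
            yield "".join(buffer).strip()
            buffer = []
    # Flush whatever is left when the stream ends without punctuation.
    tail = "".join(buffer).strip()
    if tail:
        yield tail


# Stand-in for a token stream; a real stream would arrive over the network.
stream = ["Your", " order", " shipped", ".", " It", " arrives", " Friday", "."]
for sentence in sentences_from_tokens(stream):
    # A text-to-speech call would go here, one sentence at a time.
    print(sentence)
```

Because the function is a generator, the first sentence can be handed to text-to-speech while later tokens are still being produced, which is where the latency saving comes from.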

The Sweets Vault project from Google demonstrates multimodal agent integration with physical hardware. The agent uses Gemini's multimodal capabilities to process images of handwritten text (from a child's practice sheet), integrates with a physical reward dispenser, and maintains a motivational system across sessions. This is a consumer-facing example, but the architecture pattern (multimodal input, hardware actuation, persistent state) applies to industrial settings too.

Comparison & analysis

The skills approach and the MCP approach solve the same problem with different tradeoffs.

MCP centralizes tool registration. Every tool is defined once, and any connected agent can use it. This is clean for small tool counts. Anthropic's Claude Desktop with five MCP servers works well. The protocol handles discovery, authentication, and execution. But a setup that exposes every registered tool does not scale to hundreds of tools, because the model must reason over all of those schemas every turn. The protocol itself does not prescribe a mechanism for selective tool loading, so any filtering has to be built on the client or server side.

Skills decentralize capability management. Each skill is a self-contained module. The agent loads only the skills it needs for the current task. This keeps the context window focused. The tradeoff is that skills require a retrieval or routing mechanism: the agent (or a supervisor) must decide which skills are relevant before loading them. A bad routing decision means the agent lacks a capability it needs.

In practice, the two approaches are converging. MCP servers can implement skill-like patterns by grouping related tools and exposing them conditionally. Skills can use MCP as a transport layer for individual tool calls within a skill. The ZenStack article on saving bloated MCP with Code Mode describes exactly this hybrid: keep MCP for well-defined tools, use code generation for ad-hoc operations that do not warrant a dedicated tool definition.

Agent-S vs. traditional tool-calling is a starker contrast. Traditional tool-calling is fast and deterministic. If the agent calls the get_weather tool with the right parameters, it gets the weather. Agent-S's computer-use approach is slower (seconds per action vs. milliseconds per API call) and probabilistic (the agent might click the wrong button). But computer-use works with any application. You do not need an API. You do not even need the application's source code. For legacy systems, internal tools without APIs, and cross-application workflows, computer-use is the only automated option short of building custom integrations for every system.

Langflow sits in a different position in the stack. It is a visual workflow builder for constructing agent pipelines. You drag nodes (LLM calls, tool invocations, conditional logic) onto a canvas and connect them. This is useful for prototyping and for teams that prefer visual configuration over code. The tradeoff is flexibility: complex agent behaviors that require dynamic skill loading, conditional memory retrieval, or custom execution logic are harder to express in a node-graph model than in code. Langflow works best for well-defined, repeatable workflows. Hand-coded agent loops work better for adaptive, context-dependent behavior.

The xianyu-auto-reply system demonstrates what production agent deployment looks like when you need multi-account management, automated fulfillment confirmation, and a web admin interface for monitoring. It is not a research project. It is a shipping product with real users. The architecture choices (multi-account isolation, message queue decoupling, admin dashboard) reflect production constraints that most agent framework tutorials ignore.

Practical implications

For engineers building production agents right now, three decisions matter most.

First, choose your capability abstraction early. If you are building an agent with fewer than ten tools, MCP is fine. If you expect the tool count to grow, start with a skills-based architecture from the beginning. Migrating from MCP tools to skills later is painful because it requires rethinking your prompt construction, your tool discovery logic, and your execution pipeline. Anthropic's skills repository provides a starting structure. Fork it, add domain-specific skills, and build your retrieval layer. The awesome-llm-apps repository (100+ runnable agent and RAG applications) provides working reference implementations for most common agent patterns if you need a starting point.

Second, invest in memory infrastructure before you need it. Stateless agents are easy to build and hard to scale. The moment your agent needs to maintain user preferences, learn from past interactions, or operate across sessions, you need episodic and semantic memory. Set up a vector store for episodic retrieval (pgvector, Qdrant, or Chroma all work) and a structured store for semantic knowledge. The retrieval logic can be simple at first: cosine similarity with a relevance threshold. It will get more complex as your agent handles more diverse tasks, but the storage layer should be stable from day one.

Third, match your deployment model to your latency requirements. Use serverless (Lambda with EFS) for batch agents with second-level latency budgets, containers with persistent connections for real-time agents, and edge deployment with model caching for voice agents that need sub-500ms response times. The Sweets Vault project shows that even hardware-integrated agents can run on cloud infrastructure if you accept some latency in the physical actuation path.

The human-in-the-loop question is also worth deciding early. Dead Star AI (built on Gemma 4) forces explicit approval steps before the agent takes action. This adds latency but prevents autonomous mistakes. For agents that modify production systems, send communications, or spend money, human approval gates are not optional. For agents that read data, generate drafts, or perform analysis, autonomous execution is fine. Build the approval mechanism into your skill definitions from the start: each skill declares whether it requires human confirmation before execution.
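A per-skill approval flag can be sketched as follows. The names (`GatedSkill`, `requires_confirmation`, `execute`) are hypothetical and are not drawn from Dead Star AI or any specific framework; the `approve` callback stands in for whatever review channel a team uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GatedSkill:
    """Hypothetical skill that declares whether a human must approve it."""
    name: str
    action: Callable[[str], str]
    requires_confirmation: bool = False


def execute(skill: GatedSkill, argument: str, approve: Callable[[str], bool]) -> str:
    """Run the skill, asking the approver first when the skill demands it."""
    if skill.requires_confirmation:
        prompt = f"Allow '{skill.name}' with argument {argument!r}?"
        if not approve(prompt):
            return f"blocked: {skill.name} was not approved"
    return skill.action(argument)


# Read-only skill: runs autonomously.
summarize = GatedSkill("summarize", lambda text: text[:20])
# Side-effecting skill: must be approved before it runs.
send_email = GatedSkill(
    "send_email", lambda to: f"sent to {to}", requires_confirmation=True
)


def deny_all(prompt: str) -> bool:
    """Approver that rejects everything; swap in a real review step in practice."""
    return False


print(execute(summarize, "Quarterly revenue grew in every region", deny_all))
print(execute(send_email, "ops@example.com", deny_all))
```

Putting the flag on the skill rather than in the agent loop means a new skill cannot be added without someone deciding, explicitly, whether it is safe to run unattended.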

The open-source tooling for this stack is maturing fast. CLI-Anything gives agents access to any CLI application. Shadowbroker demonstrates multi-source data aggregation with agent-driven correlation analysis across flight tracking, satellite data, and seismic events. Dograh handles voice interaction. Agent-S handles computer-use. Scientific-agent-skills provides domain-specific capabilities for research and engineering work. The missing piece is orchestration: a standard way to compose these components into a coherent agent system. Langflow addresses this for visual workflows. For code-first teams, the orchestration layer is still custom-built, and that is probably fine for now. The components are stable enough to compose; the glue just needs to be written per-project.

References

"How to Save Bloated MCP with Code Mode" - https://dev.to/zenstack/how-to-save-bloated-mcp-with-code-mode-33e3
"Building 'Sweets Vault': a multimodal Gemini Agent with physical hardware integration" - https://dev.to/googleai/building-sweets-vault-a-multimodal-gemini-agent-with-physical-hardware-integration-1nmh
"Lambda Just Got a File System. I Put AI Agents on It." - https://dev.to/aws/lambda-just-got-a-file-system-i-put-ai-agents-on-it-1ej8
"How I Documented an Entire Product in 4 Days with an AI Agent" - https://dev.to/debs_obrien/how-i-documented-an-entire-product-in-4-days-with-an-ai-agent-3338
"I Got Tired of AI Black Boxes So I Built One You Can Break Into" - https://dev.to/itxashancode/i-got-tired-of-ai-black-boxes-so-i-built-one-you-can-break-into-295n
"Engineering Agent Memory" - https://dev.to/kenwalger/engineering-agent-memory-4a42
"CLI-Anything: Making ALL Software Agent-Native" - https://github.com/HKUDS/CLI-Anything
"Shadowbroker: Open-source intelligence for the global theater" - https://github.com/BigBodyCobain/Shadowbroker
"Dograh: Open Source Voice Agent Platform" - https://github.com/dograh-hq/dograh
"Scientific Agent Skills" - https://github.com/K-Dense-AI/scientific-agent-skills
"Awesome LLM Apps" - https://github.com/Shubhamsaboo/awesome-llm-apps
"Anthropic Skills" - https://github.com/anthropics/skills
"Agent-S: Open agentic framework for computer use" - https://github.com/simular-ai/Agent-S
"Langflow: Building and deploying AI-powered agents and workflows" - https://github.com/langflow-ai/langflow
"Xianyu Auto Reply Fix: AI customer service with multi-account management" - https://github.com/GuDong2003/xianyu-auto-reply-fix
