
Architectures for Agent Systems: A Survey of Isolation, Integration, and Governance

Large Language Model (LLM)-based agent systems – software that leverages LLMs to autonomously plan and execute multi-step tasks using external tools – are rapidly moving from proof-of-concept demos into enterprise deployment. These agents promise to automate coding, IT operations, data analysis, and more, but deploying them in production raises new challenges in security, reliability, and integration. Over the last half-year, the community has converged on key strategies: strong isolation for executing untrusted actions, standardized protocols for tool integration, and governance frameworks to align agent behavior with enterprise policies. This survey provides a systematic review of recent developments (roughly the latter half of 2025), including agent sandbox architectures, emerging standards like MCP, open-source projects, industry initiatives, and research advances. We focus on the pain points encountered when bringing agent systems to production and how the latest solutions address (or still fall short on) those needs.

1. Agent System Architecture in the Enterprise

An enterprise-ready agent system typically consists of several layers: (i) an LLM-based reasoning core (the "agent" that decides which actions to take), (ii) an interface to invoke external tools or services (e.g. via APIs, command-line, databases), and (iii) an execution environment or runtime where the agent's tool actions (like running code or shell commands) actually occur. Surrounding these are components for memory/state storage, orchestration (especially if multiple agents work together), and monitoring & control (for safety and compliance). The overarching architectural challenge is that these systems are highly dynamic and open-ended: the agent may generate arbitrary code or tool requests at runtime, often based on unpredictable input. This requires a different approach to software architecture than traditional deterministic services.

Isolation and Safety by Design. Unlike a bounded microservice, an AI agent might decide to execute unvetted code or make system-altering calls. A core architectural principle emerging in 2025 is to sandbox the agent's actions – running them in an isolated environment that protects the host system and network. For example, the open-source Agent Sandbox for Kubernetes was introduced as a new Kubernetes primitive to run AI agents safely. Instead of letting LLM-generated code run in a standard container (which could still abuse the host kernel or other pods), Agent Sandbox uses a hardened isolation layer (a gVisor userspace kernel, with optional Kata Containers lightweight-VM support) to create a secure barrier between the agent's code and the cluster node's OS. This prevents potentially malicious or errant code from interfering with other applications or the host. The Sandbox is managed via a custom Kubernetes resource (CRD) called Sandbox, which represents a single, stateful, long-lived pod with a stable identity and persistent storage. This design reflects a shift from treating agent workloads as ephemeral stateless functions to treating them as session-oriented services that may hold state over time. Indeed, the Agent Sandbox supports features like pausing and resuming the VM, automatically reviving it if a network reconnect is needed, and even memory sharing across sandboxes for efficiency. It also provides a templating and pool mechanism – SandboxTemplate and SandboxClaim – to manage pools of pre-warmed sandbox pods. Pre-warming is crucial because launching a fresh isolated VM can be slow; by keeping a pool of ready-to-go sandboxes, startup latency for a new agent session is dramatically reduced (Google reports sub-second startup latency, a ~90% improvement over cold-starting sandboxes). In Google's GKE, this is paired with a new Pod Snapshots feature that can checkpoint and restore running sandbox pods (even GPU workloads), cutting startup from minutes to seconds and avoiding idle resource waste. In short, the sandbox architecture is purpose-built for autonomous agents: it provides stronger isolation than ordinary containers, yet supports persistent state and fast elasticity to accommodate long-running, interactive agent tasks at scale.
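To make the Sandbox resource model concrete, the sketch below creates a hypothetical Sandbox custom resource through the Kubernetes Python client. The API group, version, and spec fields are illustrative assumptions rather than the project's published schema (consult the agent-sandbox repository for the real API); only the kubernetes client calls themselves are standard.

```python
# Hypothetical sketch: creating an Agent Sandbox custom resource from Python.
# The group/version and spec fields below are assumed for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
api = client.CustomObjectsApi()

sandbox = {
    "apiVersion": "agents.x-k8s.io/v1alpha1",   # assumed group/version
    "kind": "Sandbox",
    "metadata": {"name": "coding-agent-session-42"},
    "spec": {
        # assumed fields: an isolated runtime class plus a resource-limited container
        "podTemplate": {
            "spec": {
                "runtimeClassName": "gvisor",    # gVisor/Kata supply the isolation boundary
                "containers": [{
                    "name": "agent-runtime",
                    "image": "example.com/agent-runtime:latest",
                    "resources": {"limits": {"cpu": "2", "memory": "4Gi"}},
                }],
            }
        },
    },
}

api.create_namespaced_custom_object(
    group="agents.x-k8s.io",
    version="v1alpha1",
    namespace="agents",
    plural="sandboxes",
    body=sandbox,
)
```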

Stateful Singleton Runtimes. Traditional cloud apps often scale by running many stateless instances behind a load balancer, but agent use-cases (like an AI coding assistant or an autonomous scheduler) often manifest as a single specialized "worker" with memory (such as cached tools or context) that persists across many tool calls. The Kubernetes Agent Sandbox explicitly targets these singleton, stateful workloads – not just for AI agents but also things like CI/CD build agents or single-node databases that require stable identity and disk state. This reflects a broader industry recognition: agent applications need new runtime primitives that can maintain continuity of state and identity across a session (for example, so the agent can incrementally build on previous tool outputs, or maintain an authenticated session to a service). Recent designs propose durable execution for agents – the ability to pause an agent's process, snapshot its memory or file system, and later resume or even migrate it. The GKE Agent Sandbox + Pod Snapshot combo is an early real-world example of this, effectively treating an agent's environment as a checkpointable virtual machine. We anticipate emerging orchestration support where an agent can be hibernated when idle and quickly reawakened when needed, balancing responsiveness with efficient resource use.

Tool Interface Layer. The other critical piece of architecture is how agents interface with external tools and data. Historically, each AI assistant platform invented its own plugin system or API schema (e.g. OpenAI's Plugins, LangChain's tool abstractions). This led to a fragmented ecosystem where tools had to be rewritten for each agent framework. Over 2025, a consensus has grown around Model Context Protocol (MCP) as a standard interface between AI models (the clients) and tools or services (the servers). MCP was released by Anthropic in late 2024, and by late 2025 it had become "the universal standard protocol for connecting AI models to tools, data, and applications". Conceptually, MCP defines a simple JSON-RPC-based client-server protocol by which an AI agent can discover available tools, invoke them with arguments, and receive results/observations. The tools can be anything: database queries, file system operations, web requests, code compilation – each exposed by an MCP server that the agent connects to. The power of a common protocol is that it transforms the integration problem from M×N (every model integrating with every tool) to M+N modularity. A tool developer can create an MCP server once, and any compliant agent (whether it's OpenAI's, Anthropic's, or an open-source project) can use it. This dramatically reduces duplicated effort and makes the system more maintainable. GitHub engineers describe MCP as creating a "USB-C for AI" – a universal port for tools. In practice, MCP connections can be local (via stdio pipes) or remote (HTTP+SSE streams), and are typically stateful sessions, which aligns well with the idea of agent tools that maintain context (e.g. a database connection that stays open, or a browser that retains cookies).
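As a rough illustration of what this client-server exchange looks like on the wire, the sketch below sends simplified JSON-RPC requests to a local MCP server over stdio. The server path is hypothetical, the initialization handshake and error handling are omitted, and framing is reduced to one JSON object per line; it is meant only to show the discover-then-invoke pattern, not to be a spec-complete client.

```python
# Simplified illustration of the JSON-RPC 2.0 message shapes an MCP client exchanges
# with a tool server over stdio. Method names mirror the spec's tools/list and
# tools/call; the handshake and capability negotiation are omitted.
import json
import subprocess

server = subprocess.Popen(
    ["python", "my_mcp_server.py"],          # hypothetical local MCP server
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

def rpc(method, params, msg_id):
    """Send one JSON-RPC request and read one response line."""
    req = {"jsonrpc": "2.0", "id": msg_id, "method": method, "params": params}
    server.stdin.write(json.dumps(req) + "\n")
    server.stdin.flush()
    return json.loads(server.stdout.readline())

# 1. Discover what the server offers.
tools = rpc("tools/list", {}, 1)
# -> {"result": {"tools": [{"name": "query_db", "inputSchema": {...}}, ...]}}

# 2. Invoke a tool with JSON arguments and receive an observation.
result = rpc("tools/call",
             {"name": "query_db",
              "arguments": {"sql": "SELECT count(*) FROM orders"}}, 2)
print(result["result"])
```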

Orchestration and Multi-Agent Workflows. Many real tasks may be too complex for a single agent or might benefit from specialized agents collaborating. The architecture is therefore expanding to support multi-agent systems where agents communicate or coordinate. Some protocols, like Agent-to-Agent (A2A) messaging, are emerging to standardize inter-agent communication (for instance, Google's Agent2Agent protocol and Microsoft's adoption of A2A in their framework). In a multi-agent setup, you might have one agent that specializes in planning, another in executing code, another in validation, etc., passing context or subtasks among them. Orchestration frameworks now often support deterministic workflows (where the chain of sub-tasks is predefined, akin to a business process) alongside LLM-driven orchestration (where agents dynamically decide how to break down and assign tasks). For example, Microsoft's new open-source Agent Framework explicitly supports both Agent Orchestration (LLM-driven, creative, adaptive) and Workflow Orchestration (fixed logic, for reliable repeatability) within one runtime. This framework, released in late 2025, consolidates previous research prototypes (like Semantic Kernel's planner and AutoGen from MSR) into an enterprise-ready SDK. It emphasizes connectors to enterprise systems, open standards (MCP, A2A, OpenAPI), and built-in telemetry, approvals, and long-running durability to meet enterprise needs. The trend here is that agents are being treated as first-class components of software systems, with the same expectations for monitoring, security, and lifecycle management as microservices or human-in-the-loop workflows.

Summary: The architecture of modern agent systems is coalescing around a modular, layered design. A secure sandboxed execution layer ensures that any generated code or commands run in isolation with controlled privileges. A standardized tool interface layer (MCP and similar protocols) decouples agent reasoning from the implementation of tools, enabling a rich ecosystem of reusable capabilities. On top of these, orchestration mechanisms allow composing multiple agents and tools into larger autonomous workflows, while providing hooks for humans and existing DevOps processes to supervise and intervene when needed. In the following sections, we delve deeper into three crucial aspects of enterprise agent systems: (a) the sandbox and runtime isolation mechanisms, (b) the emerging standards and ecosystems of tools/plugins, and (c) the security, governance, and observability considerations that are top-of-mind as organizations deploy these systems.

2. Isolated Execution Environments for Agents (Sandboxing)

Running untrusted or machine-generated code has always been risky – the difference now is that with LLM agents the code is being generated and executed on the fly, without a human vetting each command. This opens the door to accidental failures or even malicious exploits if the agent is tricked or if its outputs are unsafe. As a result, sandboxing has become a foundational requirement for agent systems. Sandboxing in this context means confining the agent's actions (code execution, file system writes, network calls, etc.) to an environment where it can't harm other processes or breach data it shouldn't access.

Table 1: Research / OSS Projects (Papers, Benchmarks, Open-Source Runtimes)

| Name | Category | Sandbox/Isolation Boundary | Key Capabilities | Reference |
| --- | --- | --- | --- | --- |
| Kubernetes SIGs: agent-sandbox | OSS (K8s primitives/controller) | Sandbox CRD in Kubernetes (with Template/Claim/WarmPool) | Manage "isolated + stateful + singleton" workloads; standardized API for agent runtime | GitHub |
| AIO Sandbox (agent-infra/sandbox) | OSS (all-in-one environment) | Single Docker container (integrated multi-tools) | Browser/Shell/File/MCP/VSCode Server unified; unified workspace for agents & dev | GitHub |
| Alibaba OpenSandbox | OSS (universal sandbox platform) | Unified protocol + multi-language SDK + sandbox runtime | Universal sandbox foundation for command/file/code/browser/agent execution | GitHub |
| E2B (e2b-dev/E2B) | OSS (cloud sandbox infrastructure) | Cloud-isolated sandbox (SDK-controlled) | Run AI-generated code in the cloud; Python/JS SDK; agent code interpreter | GitHub |
| E2B Desktop (e2b-dev/desktop) | OSS (virtual desktop sandbox) | Isolated virtual desktop environment | "Computer use" agents: desktop GUI, customizable dependencies, per-sandbox isolation | GitHub |
| LLM Sandbox (vndee/llm-sandbox) | OSS (lightweight code sandbox) | Containerized isolation (configurable security policies) | Run LLM-generated code; customizable security policies and isolated container environments | GitHub |
| SkyPilot Code Sandbox (alex000kim/…) | OSS (self-hosted execution service) | SkyPilot deployment + Docker sandboxing | Self-hosted, multi-language execution, token auth, MCP integration (for agent tools) | GitHub |
| Microsandbox (zerocore-ai/microsandbox) | OSS (microVM execution environment) | Hardware-isolated microVM (fast startup) | Run untrusted workloads via microVM; emphasis on isolation strength and startup speed | GitHub |
| ERA (BinSquare/ERA) | OSS (local microVM sandbox) | Local microVM ("microVM with container ease-of-use") | Run untrusted/AI-generated code locally with hardware-level isolation | GitHub |
| SandboxAI (substratusai/sandboxai) | OSS (runtime) | Isolated sandbox | Secure execution runtime for AI-generated Python code and shell commands | GitHub |
| Python MCP Sandbox (JohanLi233/mcp-sandbox) | OSS (MCP server) | Docker container isolation | Exposes "secure Python execution" as a tool to agent/LLM clients via MCP | GitHub |
| Code Sandbox MCP (Automata-Labs-team/…) | OSS (MCP server) | Docker container isolation | MCP server providing a containerized secure code execution environment for AI applications | GitHub |
| ToolSandbox (Apple) | Research + OSS (evaluation benchmark) | Evaluation sandbox with stateful tool execution + user simulator | Evaluate LLM tool use: state dependencies, multi-turn dialogue, dynamic evaluation; open-source | arXiv |
| ToolEmu | Research (risk evaluation framework) | LM-emulated sandbox (simulates tool execution with an LM) | Uses an LM to simulate tool execution for scalable agent risk testing; includes an automatic safety evaluator | OpenReview |
| HAICOSYSTEM | Research + OSS (safety evaluation ecosystem) | Modular interaction sandbox (human-agent-tool multi-turn simulation) | Multi-domain scenario simulation and multi-dimensional risk evaluation (operational/content/social/legal); code platform | arXiv |
| EnterpriseBench | Research (enterprise environment evaluation sandbox) | Evaluation environment for enterprise tasks/tools/data | Evaluate LLM agents in enterprise scenarios (task execution, tool dependencies, data retrieval) | — |
| Managing Linux servers with LLM-based AI agents | Research (empirical evaluation) | Dockerized Linux sandbox | Agents execute server tasks in a Dockerized Linux environment; performance is evaluated | ScienceDirect |
| Multi-Programming Language Sandbox for LLMs | Research (multi-language execution sandbox) | Container-isolated sub-sandbox | Multi-language compilation/execution isolation (sub-sandbox isolated from the main environment) | arXiv |
| awesome-sandbox (restyler/awesome-sandbox) | OSS (ecosystem overview/list) | N/A (aggregation) | Curated list and analysis of code sandboxing solutions; good entry point for long-tail coverage | GitHub |

Note: Achieving exhaustive coverage is impractical (especially given the long tail of the MCP ecosystem), so this table covers mainstream/representative projects plus ecosystem indexes. The awesome-sandbox list serves as an entry point for additional coverage.

Table 2: Commercial / Cloud Service Projects (Agent Sandbox / Code Sandbox / Runtime)

| Product/Service | Vendor | Isolation/Execution Model | Key Capabilities | Reference |
| --- | --- | --- | --- | --- |
| Code Interpreter (Tools) | OpenAI | Managed Python sandbox execution | Model writes and runs Python; for data analysis/coding/math | OpenAI Platform |
| Code Interpreter (Assistants on Azure) | Microsoft Azure OpenAI | Managed Python sandbox execution | Assistants API runs Python in a sandbox environment (per Azure docs) | Microsoft Learn |
| E2B (Managed Cloud) | E2B | Managed cloud sandbox (enterprise agent cloud) | Sandbox as agent runtime; emphasis on concurrency and execution infrastructure | E2B |
| Daytona | Daytona | Managed/platform sandbox infrastructure | "Stateful infra for AI agents"; ultra-fast creation and isolated execution | Daytona |
| Agent Sandbox | Novita AI | Managed agent runtime | Low startup latency, high concurrency; code execution/network access/browser automation | Novita AI |
| Sandboxes (Desktop / GUI) | Bunnyshell | Firecracker microVM virtual desktop | For GUI/computer use: isolated desktop, VNC/noVNC, desktop automation API | Bunnyshell |
| Agent Sandbox on GKE | Google Cloud (GKE) | Agent Sandbox controller deployed and run on GKE | Isolated execution of untrusted commands in a cluster; official installation and usage guide | Google Cloud Documentation |
| AgentCore "agent sandbox" | AWS Bedrock AgentCore | Console testing sandbox | AWS docs: test agents in the agent sandbox | AWS Documentation |
| Modal Sandboxes | Modal | Modal platform sandbox execution unit | Official example: build a code-executing agent with Modal Sandboxes + LangGraph | Modal |
| Vercel Sandbox | Vercel | Vercel-managed execution environment (Sandbox product) | Scalable execution (fluid compute, pay-per-active-CPU, etc.) | Vercel |
| Docker Sandboxes (Experimental) | Docker | Local containerized sandbox (for coding agents) | Docker official: run coding agents in local isolated environments and enforce boundaries | Docker |

Agent Sandbox on Kubernetes. The Kubernetes-based Agent Sandbox, spearheaded by Google and open-sourced as a SIG project in late 2025, exemplifies state-of-the-art sandbox design. A sandbox instance is essentially a microVM (micro virtual machine) launched per agent session, managed through K8s APIs. Internally it leverages technologies like gVisor (userspace kernel) to intercept syscalls and Kata Containers (lightweight VM isolation) to provide a robust security boundary. This means even if an agent's code tries to perform a malicious syscall or exploit a kernel bug, it's constrained within a sandbox kernel that has minimal privileges on the host. The sandbox also limits network access by default on GKE (only allowing what's necessary for the agent tools), reducing the risk of an agent scanning internal networks or exfiltrating data. At KubeCon NA 2025, Google showcased how they can schedule thousands of sandbox pods in parallel, thanks to the lightweight nature of gVisor, and how pre-warmed sandbox pools enable sub-second startup latencies even with the isolation. This addresses the performance concern that isolation often introduces: by carefully engineering snapshot/restore and pooling, the overhead can be kept low enough for interactive use.

From an API standpoint, the Sandbox CRD provides features tailored to long-running agent processes: you can specify resource limits, attach persistent volumes for agent state, and use the Kubernetes scheduler to place sandboxes on appropriate nodes (e.g. ones with GPU if the agent needs it). It also has life-cycle controls like scheduled deletion (to clean up sandboxes after use) and the mentioned pause/resume. Collectively, these features fulfill OWASP's top recommendation for mitigating agent risks: "system isolation, access segregation, permission management, command validation, and other safeguards". In fact, OWASP added an entry to its Top 10 for LLMs called "Agent Tool Interaction Manipulation" – the risk of an AI agent being induced to misuse its tools or perform unintended actions. The primary defense listed is to run the agent in a locked-down environment with fine-grained permission controls on what it can do. By confining an agent to a Kubernetes sandbox with only specific Kubernetes API access (or none at all beyond its tools) and no broad host access, even a compromised agent will have limited blast radius.

Local Sandboxing Solutions. Not all organizations use Kubernetes or need cloud-scale multi-tenancy; for individual developers or on-prem deployment, lighter-weight sandbox solutions are emerging. One notable project is ERA (by BinSquare), which provides a local sandbox for running AI-generated code with "microVM security guarantees plus container ease of use". ERA uses technologies like krunvm (a lightweight microVM runner) under the hood, orchestrated in a way that feels like using Docker containers. The idea is to give developers a quick way to test AI-written scripts safely on their laptop or CI pipeline, without having to set up full Kubernetes. Similarly, some frameworks allow using WebAssembly (Wasm) sandboxes for certain tasks (since Wasm can restrict file and network access for code running within it). The InfoQ article on sandboxing mentions Lightning AI's LitSandbox and a library called container-use as alternatives, which likely explore isolating Python execution or providing wrapper APIs that simulate a sandbox. While these are not yet as standardized as the Kubernetes Agent Sandbox, they indicate a broad interest in making sandboxing accessible across environments.

Integration with Agent Frameworks. Modern agent frameworks are starting to build in assumptions about sandboxing. For example, LangChain (one of the earliest agent libraries) historically would just execute Python code or bash commands directly on the host, which is obviously dangerous in production. By late 2025, we see frameworks like LangGraph 1.0 (the evolution of LangChain's agent module) emphasizing "durable and safe" execution, and CrewAI (another open-source agent framework) adding features for asynchronous tool execution and monitoring to potentially plug into sandboxed runtimes. Microsoft's Agent Framework integrates with their Azure Foundry services, which likely means an agent's code execution can be routed to a managed sandbox (e.g. an isolated Azure Function or container instance) – in their blog they highlight "enterprise-grade deployment from the beginning", including security and compliance hooks. We also see new tools like Aspire's AI agent isolation module (by Microsoft) which aims to allow developers to run multiple agent instances in parallel without conflict, hinting at port isolation and MCP proxy layers. All these efforts point to execution isolation becoming a default part of agent system design. It's no longer assumed that an agent's code runs in the same process as the host application or with full OS privileges – instead, agents run in a contained, observable slot, much like how web browsers run untrusted JavaScript in a sandboxed process.

Transactional and Fault-Tolerant Execution. A sophisticated angle to sandboxing is making execution fault-tolerant. If an agent's action fails or does something unwanted, can we roll it back? One recent research prototype, Fault-Tolerant Sandboxing for AI Coding Agents, introduced a transactional file system wrapper for agent execution. It intercepts file system writes and system changes during an agent's tool use, and if the agent misbehaves or a policy violation is detected, the sandbox can roll back to a clean snapshot. In their experiments, 100% of unsafe actions were intercepted and rolled back, at a cost of ~14.5% performance overhead. However, they note a key limitation: this works for local state (files, processes) but not for external side-effects. If the agent made a cloud API call that created resources or sent emails, a local rollback doesn't undo those. This is pushing the conversation toward distributed transaction semantics for agents – treating a sequence of tool API calls as a saga that might need compensating actions if aborted. While not solved yet, it's a recognized gap (researchers call for integrating compensating transactions for external tools to truly sandbox at the multi-system level). For now, sandboxing primarily ensures the agent's local environment can be reset to a safe state even if one step goes awry.
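A minimal sketch of the transactional idea, assuming the agent's local effects are confined to a single working directory: snapshot before a step, restore on failure. This is far simpler than the research prototype (which intercepts writes at the filesystem layer), and, as noted above, it does nothing about external side-effects.

```python
# Snapshot/rollback wrapper for an agent's local workspace. Any exception raised
# inside the with-block (e.g. a policy violation) restores the pre-step state.
import pathlib
import shutil
import tempfile

class FsTransaction:
    def __init__(self, workdir: str):
        self.workdir = pathlib.Path(workdir)
        self.snapshot = None

    def __enter__(self):
        self.snapshot = pathlib.Path(tempfile.mkdtemp(prefix="agent-snap-"))
        shutil.copytree(self.workdir, self.snapshot, dirs_exist_ok=True)
        return self

    def rollback(self):
        shutil.rmtree(self.workdir)
        shutil.copytree(self.snapshot, self.workdir)

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:      # failure or policy violation -> revert local changes
            self.rollback()
        shutil.rmtree(self.snapshot, ignore_errors=True)
        return False                   # re-raise so the orchestrator sees the failure

# Usage: wrap each tool step.
# with FsTransaction("/sandbox/workspace"):
#     run_agent_tool_step()
```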

Human Takeover and Hybrid Sandboxes. An intriguing development in sandbox design is support for human-in-the-loop interventions not just via yes/no approval prompts, but via full manual control of the sandbox. The idea is that if an agent reaches a step where it is stuck or needs privileged action (like entering a password or solving a tricky problem), a human operator can seamlessly take over the agent's sandbox session, do what's needed, and then hand control back to the AI. The research prototype AgentBay embodies this concept: it provides a unified isolated session that the AI agent can control via API (e.g. issuing OS commands, browser actions) and that a human can remote into graphically at any moment. AgentBay implements a custom Adaptive Streaming Protocol (ASP) to make this possible with very low latency. Unlike traditional screen sharing (RDP/VNC), ASP dynamically switches between sending high-level commands and video frames, adjusting to network conditions and whether the AI or human is currently in charge. The result is a much smoother experience for the human supervisor, even on weaker networks. In tests, allowing a human to intervene in AgentBay's sandbox improved task success rates by over 48% on complex benchmarks, showing the value of fluid HITL (Human-In-The-Loop) control. This approach directly addresses enterprise needs for control: rather than the agent being a black-box automation that might get stuck, it becomes a cooperative automation that an analyst or engineer can jump into whenever needed, without compromising the isolation or requiring the task to be restarted. We foresee future enterprise agent platforms offering a "panic button" or agent assist mode that spawns a secure VNC/Browser session for an operator, all actions logged, then closes back to autonomous mode.

In summary, sandboxing in agent systems has evolved into a multi-faceted capability: it's not only about securing the environment (with VMs, syscall filters, network restrictions), but also about managing the agent's lifecycle and state (persistent storage, snapshots, warm pools) and facilitating controlled handoffs (pause/resume and human takeover). The investments by major players – e.g. Google contributing Agent Sandbox as an open Kubernetes SIG project – indicate that these sandboxing techniques will likely become standard infrastructure in cloud platforms. Just as Kubernetes gave us primitives for scalable microservices, we are now getting primitives for safe autonomous agent execution on the cloud and the edge.

3. Tool Ecosystem and Standardization: From Plugins to MCP

In parallel with sandboxing the runtime, the industry has tackled the tool integration problem for agents. Early agent implementations often hard-coded a set of tools or required developers to write custom "plugin" adapters for each use case. This doesn't scale when enterprises might want agents to access dozens of internal APIs, databases, and third-party services. The last six months have seen a strong push toward standardizing how agents discover and use tools, yielding a more interoperable ecosystem.

3.1 Model Context Protocol (MCP) and the AAIF

Model Context Protocol (MCP) has emerged as the de facto standard protocol in this space. As mentioned, MCP defines a client-server schema where the AI agent (client) can list what tools a server offers, call those tools with JSON arguments, and receive results. It also covers things like authentication handshakes (e.g. OAuth flows to let an agent "log in" to use a tool on a user's behalf) and streaming responses (for tools that send incremental results). By late 2025, MCP's momentum was cemented by the formation of the Agentic AI Foundation (AAIF) under the Linux Foundation. In December 2025, the Linux Foundation announced AAIF with MCP as a founding contribution alongside OpenAI's AGENTS.md and Block's Goose. The goal is to provide a neutral, open governance home for these agent standards so that no single company controls them. The AAIF launch announcement notes that MCP had already exploded in adoption: over 10,000 MCP servers published, covering everything from dev tools to Fortune 500 internal integrations, and support built into major AI platforms including Claude, ChatGPT, GitHub Copilot, Google Gemini, VS Code, Cursor, and many others. This is remarkable considering MCP was only open-sourced in late 2024 – it resonated because it addressed an urgent pain point: without it, every AI vendor and every enterprise would be duplicating integrations. By rallying around MCP, the community effectively agreed on a "lingua franca" between agents and tools.

From an enterprise perspective, MCP brings several benefits:

  • Interoperability: A tool (say a database query interface) can be implemented once as an MCP server and then used by different agents (Anthropic's, OpenAI's, self-hosted ones) without custom adapters. This has analogies to drivers or connectors in classical software – build it once, use anywhere.
  • Security and Auditability: MCP messages are structured (JSON) and typically go through a client library in the agent runtime, where they can be logged and inspected. This makes it easier to audit what the agent asked a tool to do, as opposed to the agent running free-form shell commands that are hard to intercept. The protocol includes a capability advertisement step (the server tells what it can do), which can be checked against policies. It also often requires an auth handshake (e.g. OAuth) for the agent to gain access to the tool on behalf of a user, which means existing identity systems can mediate access.
  • Modularity and Future-proofing: As InfoQ summarized, MCP shifts integration from a tangled web into a modular architecture, reducing the "plugin fatigue" problem and making it easier to add new tools or swap out models. It also levels the playing field – small open-source projects can publish MCP servers that become as easily usable as those from big vendors, fostering a community ecosystem of tools.
  • Neutral Governance: With AAIF, companies like AWS, Google, Microsoft, Anthropic, and OpenAI are all at the same table (indeed all are listed as platinum members). This reduces the risk that MCP splinters into competing versions; it's likely to become analogous to HTML or SQL – a baseline standard that everyone implements, with maybe some extensions.

It's worth noting that MCP is evolving to cover more than just "traditional API calls." Recent extensions include Agent-to-Agent messaging (so an agent can expose itself as a tool to others via MCP) and binary data support (for image and file transfer). The AGENTS.md standard, also under AAIF, complements MCP by providing a way for software projects to declare to agents how to interact with them. AGENTS.md is essentially a README for AI agents, placed in a code repo to describe the project, its build/test tools, key contexts, and constraints. Over 60k open-source repos have adopted AGENTS.md to guide coding agents. By standardizing this, when an agent (like GitHub Copilot or Cursor) is working on a new codebase, it can automatically read AGENTS.md to understand the project's specific commands (e.g. how to run tests) rather than relying on general knowledge. This reduces errors and makes code-writing agents more reliable across different environments.

MCP Tool Ecosystem. Many companies and open-source teams have published MCP servers for their systems. For instance, GitHub released an official GitHub MCP Server that exposes GitHub operations (issues, PRs, repo contents, etc.) via MCP. This allows an agent to perform GitHub actions (like creating an issue or commenting on a PR) in a safe way – the server enforces GitHub's API policies and scopes. Similarly, we have MCP servers for databases (SQL tools), cloud resources (AWS, Azure MCP servers), information lookups (Wikipedia, web search), and even OS-level tasks (there are MCP servers that wrap shell commands or Docker). A typical enterprise might run a suite of internal MCP servers: one for their ticketing system, one for their customer database, one for DevOps (Kubernetes control, such as the Kubernetes MCP server discussed in Section 4.2). By doing so, they create a catalog of approved tools that their AI agents can use. Some companies are building MCP Gateways or registries to manage this catalog, which we'll discuss in the security section.
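As an example of what one entry in such a catalog might look like, below is a sketch of an internal MCP server exposing a single ticket-lookup tool. It assumes the FastMCP helper from the official MCP Python SDK (interface details may differ across SDK versions), and the ticket lookup itself is a placeholder for a real internal API.

```python
# Sketch of an internal MCP server exposing one approved tool, assuming the
# MCP Python SDK's FastMCP helper. The lookup is a placeholder.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticketing")

@mcp.tool()
def get_ticket(ticket_id: str) -> str:
    """Return a summary of an internal support ticket."""
    # Placeholder: in practice this would call the ticketing system's API
    # using credentials scoped to the requesting user.
    return f"Ticket {ticket_id}: status=open, priority=P2"

if __name__ == "__main__":
    mcp.run()   # serves over stdio by default; agents connect via their MCP client
```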

Local-First and Offline Agents. While MCP often assumes a client (agent) connecting to a server over HTTP, it's flexible enough to work in "all local" scenarios too (using stdio pipes). The Goose framework (contributed by Block to AAIF) is described as a "local-first AI agent framework". Goose uses MCP for tool extensions – meaning you can run goose agents on your laptop, and they can spin up local MCP servers for local tools (say, accessing a local filesystem or application) without needing cloud connectivity. This is important for cases where data privacy requires everything to remain on-prem or on-device. It also means an enterprise could package up an agent + tool suite to run entirely in an isolated network (e.g. an AI agent that helps with internal network diagnostics, running in a secure enclave with no internet access, but with MCP hooking into internal systems). The push toward standardization via MCP doesn't imply centralization in the cloud – on the contrary, it can democratize who provides tools (open-source implementations, self-hosted services, etc.) as long as they speak the protocol.

Beyond MCP: Other Standards. While MCP is currently the frontrunner, there are other noteworthy efforts. OpenAPI-based tool use: some agent frameworks allow importing any OpenAPI spec and will auto-generate an "agent tool" from it. For example, Microsoft's Agent Framework highlights that any REST API with an OpenAPI definition can be instantly turned into a tool, with the framework handling schema parsing and secure invocation. This is complementary to MCP: one could imagine MCP servers automatically exposing an OpenAPI, or vice versa. Another is the concept of capability description languages – OpenAI's Function Calling spec is one example, where the model is told function signatures and it outputs JSON for calls. Some researchers propose more formal schemas for tool affordances. At the moment, however, MCP seems to be converging those threads: it provides a structured way for an agent to query "what can I do?" and then invoke a function with arguments, which is essentially function calling over a channel. It's likely we'll see alignment or bridging between OpenAPI, JSON-RPC, and whatever else emerges, to avoid fragmenting this again.
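The mapping from an OpenAPI operation to a function-calling style tool description is largely mechanical, as the toy sketch below suggests; real frameworks additionally resolve $ref schemas, handle authentication, and parse responses. The example operation is invented for illustration.

```python
# Toy conversion of one OpenAPI operation into a function-calling tool description.
def openapi_operation_to_tool(path: str, method: str, operation: dict) -> dict:
    properties = {
        p["name"]: {
            "type": p.get("schema", {}).get("type", "string"),
            "description": p.get("description", ""),
        }
        for p in operation.get("parameters", [])
    }
    required = [p["name"] for p in operation.get("parameters", []) if p.get("required")]
    name = operation.get("operationId") or (
        (method + path).replace("/", "_").replace("{", "").replace("}", "")
    )
    return {
        "name": name,
        "description": operation.get("summary", ""),
        "parameters": {"type": "object", "properties": properties, "required": required},
    }

op = {
    "operationId": "getTicket",
    "summary": "Fetch a support ticket",
    "parameters": [{"name": "ticket_id", "in": "path", "required": True,
                    "schema": {"type": "string"}}],
}
print(openapi_operation_to_tool("/tickets/{ticket_id}", "get", op))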

In essence, if sandboxing addresses the agent's "body," MCP addresses the agent's "arms and legs". It standardizes how the agent reaches out to interact with the world. This was a necessary step for agents to become truly useful in enterprise settings, because no single vendor can supply every integration. By lowering the integration barrier, companies can leverage a far broader set of tools. However, as we'll discuss next, giving an AI agent access to many tools also broadens the attack surface and governance burden – thus, standardization and security have to go hand in hand.

4. Security, Governance, and Trust in Agent Systems

Deploying autonomous agents in an enterprise inherently raises the question: how do we trust them? Unlike a deterministic script, an AI agent can come up with unexpected actions, and it might be influenced by inputs (or adversaries) in ways we can't fully predict. Over the past months, a significant focus of both practitioners and researchers has been on closing the "trust gap" – ensuring that agents do what they're supposed to and nothing more, or at least that we can detect and mitigate when they misbehave. Several key themes have emerged: permission and policy models, supply chain security of tools, prompt injection defenses, auditing and observability, and fail-safe mechanisms. We'll examine each in turn.

4.1 Prompt Injection and Confused Deputy Problems

Prompt injection – where an external input is crafted to manipulate the agent's LLM into ignoring its instructions or performing unintended actions – has proven to be a very real threat. In the context of agent tools, prompt injection can become a "confused deputy" attack: the LLM is the deputy that has privileges (access to tools) and the attacker exploits it via crafted input (a prompt) to misuse those privileges. A simple example: an attacker might embed a malicious command in a user-provided email, which the agent then dutifully executes with its shell tool. Real incidents and proofs-of-concept have shown this is not just theoretical. The consensus in discussions (e.g. on Hacker News) is that prompt injection is analogous to XSS (cross-site scripting) in web apps – you cannot fully eliminate it just by sanitizing inputs, because the model's behavior with arbitrary text is hard to constrain. Thus, relying solely on prompt-based safeguards (like "don't execute if user says to do something bad") is brittle.

The more robust approach is structural: limit what the agent can do even if it's tricked. This means enforcing policy at the tool invocation layer. For instance, if the agent tries to run a shell command, have a policy that disallows rm -rf or network calls to sensitive endpoints. If it uses a database tool, ensure it cannot query tables it shouldn't. This is where sandboxing and permission models overlap. In a sandbox, you can intercept system calls – e.g. prevent file writes outside a certain directory, or limit network access to only whitelisted domains. With MCP, you can implement an allow-deny policy per tool – e.g. forbid a certain combination of API calls or detect if the arguments look suspicious (like a SQL query that's dumping all user data).
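A trivial sketch of such a structural check, sitting between the agent and a shell tool: it rejects commands outside an allowed working directory or matching a small denylist. The patterns and paths are illustrative examples, not a complete or recommended policy.

```python
# Policy check applied before a shell command reaches the sandboxed tool.
import re

BLOCKED_PATTERNS = [
    r"\brm\s+-rf\b",                      # destructive recursive deletes
    r"\bcurl\b.*\binternal\.corp\b",      # calls to a sensitive internal endpoint (example)
]
ALLOWED_CWD_PREFIX = "/sandbox/workspace"

class PolicyViolation(Exception):
    pass

def check_shell_command(command: str, cwd: str) -> None:
    """Raise PolicyViolation instead of passing a disallowed command to the tool."""
    if not cwd.startswith(ALLOWED_CWD_PREFIX):
        raise PolicyViolation(f"cwd {cwd!r} is outside the sandbox workspace")
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            raise PolicyViolation(f"command matches blocked pattern {pattern!r}")

# The agent runtime calls this before handing the command to the sandboxed shell:
# check_shell_command("rm -rf /", "/sandbox/workspace")  -> raises PolicyViolation
```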

One concrete advancement is the research AgentBound framework, which proposes attaching a declarative access control policy to MCP servers. Inspired by Android's app permissions, AgentBound allows a tool to declare what host resources it needs (files, network targets, etc.), and an admin can approve or limit those. At runtime, an enforcement engine monitors the agent's calls and blocks anything outside the allowed scope. Impressively, AgentBound's evaluation auto-generated policies for 296 popular MCP servers with about 80.9% accuracy from the code, and could block the majority of malicious actions with negligible overhead. This suggests that intelligent tooling can help manage the policy burden: we can analyze a tool's code to infer "this tool should only ever need to access X API or Y file", then use that as a sandbox rule.

Another line of defense is schema validation. Many tools expect inputs of a certain form (JSON with specific fields, numbers in ranges, etc.). If the agent's output deviates, it can indicate either a prompt injection or a model error. Rigorously validating the agent's action format before executing it can catch some attacks or mistakes. In fact, OWASP's recommendation of command validation falls here – e.g. if an agent tries to execute sudo rm -rf /, the sandbox or tool wrapper should detect that and refuse.
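A minimal example of this kind of pre-execution validation, using the jsonschema library to check a proposed tool call against the tool's declared input schema; the schema itself is invented for illustration.

```python
# Validate a proposed tool call before execution; malformed or out-of-range
# arguments are rejected rather than passed to the tool.
from jsonschema import ValidationError, validate

QUERY_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "table": {"type": "string", "enum": ["orders", "tickets"]},
        "limit": {"type": "integer", "minimum": 1, "maximum": 1000},
    },
    "required": ["table", "limit"],
    "additionalProperties": False,
}

def validate_tool_call(arguments: dict) -> bool:
    try:
        validate(instance=arguments, schema=QUERY_TOOL_SCHEMA)
        return True
    except ValidationError as err:
        # Log and refuse: could be a model error or an injection attempt.
        print(f"rejected tool call: {err.message}")
        return False

validate_tool_call({"table": "orders", "limit": 50})              # True
validate_tool_call({"table": "users; DROP TABLE", "limit": 50})   # False (not in enum)
```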

It's widely acknowledged that prompt injection cannot be fully solved at the model level, so enterprise systems are layering these runtime controls. Some are even exploring two-model setups: one model generates a plan or interprets user input without any tools (and thus with no privileges), then a separate "execution model" with tools enabled but a much more constrained input (only the sanitized plan). This is analogous to separating policy decision and policy enforcement. However, this approach is in its infancy – researchers have noted it's tricky to ensure the two models stay in sync and that the first model doesn't inadvertently become a covert channel for bad instructions.

4.2 Tool Supply Chain Security

As the MCP tool ecosystem grows, a new class of security concerns appears: the tools themselves may have vulnerabilities or could be malicious. We've effectively extended our "attack surface" to any code that implements a tool API. In July 2025, security researchers disclosed critical flaws in some community-developed MCP servers:

  • The MCP Server for Kubernetes (an MCP tool that allowed agents to run kubectl commands on a cluster) had a command injection flaw. It constructed shell commands from user input without sanitization, so an attacker could embed | or && to execute arbitrary commands on the host. Not only that, the advisory demonstrated a prompt injection chain: if an agent was asked to read a pod's logs (which contained malicious instructions), the agent might then call a vulnerable kubectl tool with those instructions, leading to RCE (Remote Code Execution) on the MCP server host. This is a vivid example of how an innocuous high-level task (read logs) can cascade into a full compromise via weaknesses in the tool implementation. It underscores that agent security is only as strong as the weakest tool in its arsenal.
  • Another advisory for mcp-package-docs (a tool for reading package documentation) had a similar shell injection issue. Essentially, many early tools naively used exec() on strings, a practice long known to be dangerous in any software context.
  • An even more subtle exploit was demonstrated against the AI coding assistant Cursor: an agent could be tricked into writing a malicious MCP server configuration to disk (effectively "installing" a new tool), which would then be loaded and executed, giving the attacker code execution on the system. In response, Cursor had to forbid agents from writing to certain config directories.

These incidents highlight supply chain risk: when you install an MCP server from NPM or pip, do you know it's safe? Could it have a dependency hijacked to steal data? Traditional supply chain best practices – code signing, vetting maintainers, vulnerability scanning – all apply here. But additionally, the dynamic nature of agent tool use requires new thinking. For example, an agent might fetch a tool definition (schema) from somewhere at runtime – that channel could be compromised (a malicious tool listing that lies about what it does). To address this, the community is discussing tool registries with verification. Imagine an "App Store" for MCP tools where each tool is reviewed, sandboxed, and cryptographically signed. The Linux Foundation AAIF might play a role in hosting a global registry, or there may be vendor-specific ones.

Some researchers call for transparency logs and an "SBOM" (Software Bill of Materials) approach for agent tools. For instance, an enterprise might want a log of every tool version the agent ever used, so that if one is later found to be malicious, past agent runs can be audited. They also want assurance that the tool code running is exactly the code that was audited. This is akin to how modern browsers handle extensions: with strict signing and review processes.

On the defense side, one idea is dynamic tool vetting – before an agent uses a new tool, run that tool in a test mode on known benign inputs to see if it behaves correctly, or run it in a shadow sandbox with instrumented monitoring to detect unexpected actions. This is analogous to how app stores do a review, but potentially automated and at runtime. For now, this is an open research problem; we haven't seen full implementations yet, but it's identified in literature as a needed control.

In summary, securing the tool ecosystem requires both preventive measures (secure coding practices for tool developers, automated scans for dangerous patterns like execSync on inputs) and mitigations (running tools with least privilege, e.g. a tool that only needs to read a database should not also have OS write access). The principle of least privilege should apply at every level: the agent only has access to certain tools, the tool only has access to certain system resources. Achieving this in practice means plumbing through the user's identity and intent: e.g., if an agent is acting on behalf of Alice, the database tool should run under Alice's credentials or a role with her permissions, not a superuser. This is an area where enterprise IAM (Identity and Access Management) integration is critical – mapping the human user's identity to the agent's allowed actions. Recent work is exploring how to tie enterprise SSO/OAuth tokens into agent sessions in a fine-grained way, so that an agent cannot escalate its privileges beyond what the user would normally have through regular apps.

4.3 Monitoring, Auditing, and Policy Enforcement

Observability is notoriously difficult for AI systems because of their nondeterminism and unstructured outputs. But for agents, observability is non-negotiable in enterprise settings. Operators need to be able to ask: "What sequence of steps did the agent take? Why did it take a certain action? What tool calls were made with what parameters? Did anything unusual happen?" To that end, agent platforms are incorporating extensive logging and tracing capabilities:

  • Structured Traces: There's a push to use standards like OpenTelemetry to trace agent execution like any microservice call graph. Each agent action (e.g. "called Tool X with params Y, got result Z") can be a span in a trace. This allows using existing APM (Application Performance Monitoring) tools to visualize agent workflows. Some commercial platforms now show a real-time step-by-step trace of the agent's reasoning and tool use (often known as an "Agent console" or debug pane). A minimal tracing sketch appears after this list.
  • Semantic Logging: Beyond raw tool call logs, there's interest in capturing higher-level events. For example, flag if an agent's plan changed drastically mid-execution (could indicate it got confused or was manipulated), or if it requested an unusually large amount of data from a tool. Logging the content of prompts and responses is tricky (for privacy reasons), but logging the intents and outcomes is feasible. Additionally, cryptographic logging (hash chaining the logs) has been suggested so that forensic analysis can trust that logs weren't tampered with.
  • Auditing for Compliance: In sectors like finance or healthcare, any automated system needs audit trails for compliance. If an agent made a change to a customer's record, we need to know who/what prompted that and that it was authorized. Solutions here include linking agent actions to a user session and storing that context (e.g. "Agent acted on behalf of Alice, in response to request R, at time T"). Some enterprises restrict certain tools to manual-confirmation mode where a human must approve the agent's action in a dashboard (common for things like executing a trade or sending an email). Ensuring the agent properly presents the action for approval (and doesn't hide the true intent) is an active UX/security challenge.
  • Policy Engines: Enterprises are beginning to employ policy-as-code systems (like Open Policy Agent or custom rule engines) to govern agent behavior. For example, a policy might be: "Agents cannot call the production database tool with a WHERE clause missing a limit, unless the user is in admin role." When an agent attempts such a query, the policy engine can intercept and either block it or route it for approval. This ties into MCP Gateway architectures, where instead of the agent connecting directly to tool servers, it connects to a Gateway proxy that mediates all calls. Microsoft's preview of an MCP Gateway shows features like session persistence (to keep agent-tool sessions sticky) and a central place to enforce auth, rate limiting, etc. We can foresee these gateways becoming very sophisticated, implementing org-wide guardrails (e.g. no agent can call external web APIs that are not in a vetted list, to prevent data exfiltration).
  • Evaluation and Testing: An emerging practice is to treat agents like code and develop evaluation suites for them. Before deploying an agent update (new model version or new tool), run a battery of scenarios (some normal, some adversarial) to see how it behaves. In late 2025, multiple benchmarks for agent safety were released to facilitate this. The MCP-SafetyBench is one such benchmark: it tests LLM agents on realistic multi-step tasks across five domains (web browsing, financial analysis, code repo management, navigation, and web search) while injecting 20 types of attacks (from prompt tampering to tool output manipulation). The sobering result: no current model is remotely immune to MCP-based attacks – even top-tier models had 30–48% of tasks compromised. They also found a negative correlation between task performance and security: models that are more capable at completing tasks also tend to be more exploitable, presumably because they more eagerly follow any instruction including malicious ones. This points to a fundamental safety-utility trade-off. Enterprises must calibrate how "aggressive" or autonomous they want the agent to be. Some are introducing adjustable risk settings – e.g. a slider from conservative (fewer tools, more confirmations) to aggressive (full autonomy, high risk). A metric called NRP (Normalized Risk-Performance) was proposed to quantify this balance. Ultimately, continuous evaluation will be key: as new attacks are discovered, adding them to test suites and ensuring the agent (with all its tools and policies) can handle or resist them.
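Picking up the structured-traces point from the list above, the sketch below wraps a tool invocation in an OpenTelemetry span so each agent action shows up in a standard trace. Attribute names are illustrative choices, and provider/exporter setup (OTLP endpoint, sampling, etc.) is omitted.

```python
# Trace each tool invocation as an OpenTelemetry span so agent runs appear in APM tooling.
from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

def call_tool_traced(tool_name: str, arguments: dict, invoke):
    """Wrap a tool invocation in a span recording what was asked and how it ended."""
    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("agent.tool.name", tool_name)
        span.set_attribute("agent.tool.arguments", str(arguments))
        try:
            result = invoke(tool_name, arguments)
            span.set_attribute("agent.tool.status", "ok")
            return result
        except Exception as err:
            span.record_exception(err)
            span.set_attribute("agent.tool.status", "error")
            raise
```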

4.4 Identity, Authentication, and Governance

A less glamorous but absolutely crucial aspect is identity and access management (IAM) for agents. When an agent performs an action, whose authority is it under? In a multi-user environment (say an AI assistant in a company), the agent might have to act as different users at different times. Traditional OAuth wasn't designed for a scenario where an LLM is effectively a headless client acting interactively on behalf of a user. Over the past months, developers have hit practical snags integrating OAuth with MCP. For example, the OAuth Dynamic Client Registration used by MCP (so an agent can automatically register itself to use an API) sometimes fails with enterprise IdPs due to strict URL checks. Some IdPs don't allow dynamic clients at all. There are calls to allow static client credentials or out-of-band provisioning for agents in such cases. This is more of a standards gap than a research one – it's being worked through in the MCP working group.

From an enterprise architecture view, many want the agent to integrate with existing SSO. That means when an employee invokes an agent, the agent should use that employee's OAuth token to access tools. This ensures all actions are attributable and within the user's permissions. It's straightforward for some tools (like an MCP server can simply require a token from the agent), but complex for others (e.g. a shell tool on a server – how to scope that per user?). Some solutions involve impersonation tokens or scoped API keys: e.g. the agent might have a key that only allows certain operations and is tagged to the user.

The concept of "least privilege" comes into sharp focus here: the agent should only have the minimum access needed for the task, and ideally only for the duration needed. Techniques like OAuth token exchange or short-lived credentials are recommended. If an agent is spun up to do a build job, give it a temporary token that expires after, so even if it went rogue, it couldn't do damage later. One recent architecture paper emphasizes integrating enterprise identity with these agents so that all actions flow through the normal IAM checks and logs of the enterprise. That means, for instance, an agent using a Jira tool would appear in the Jira audit logs as "actions performed via AI agent on behalf of Bob". This transparency is needed for trust – people won't use the agent if it's a black box doing things in the shadows.
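One way to realize short-lived, narrowly scoped agent credentials is sketched below using PyJWT. The claim names (act, task, scope) are illustrative choices, and in practice such tokens would be minted by the identity provider via OAuth token exchange rather than by application code holding a signing key.

```python
# Illustrative issuance of a short-lived, narrowly scoped token for one agent task.
import time

import jwt  # PyJWT

SIGNING_KEY = "replace-with-real-key-management"

def mint_agent_token(user: str, task_id: str, scopes: list, ttl_seconds: int = 900) -> str:
    now = int(time.time())
    claims = {
        "sub": user,                 # agent acts on behalf of this user
        "act": "ai-agent",           # marks the action as agent-performed for audit logs
        "task": task_id,
        "scope": " ".join(scopes),   # e.g. "jira:read build:trigger"
        "iat": now,
        "exp": now + ttl_seconds,    # expires shortly after the task should finish
    }
    return jwt.encode(claims, SIGNING_KEY, algorithm="HS256")

token = mint_agent_token("alice", "build-1234", ["build:trigger"])
```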

Governance also extends to deciding which tasks to automate vs require human approval, what data agents are allowed to see, and how to prevent data leakage. Some enterprises restrict agents from accessing production data entirely, using them only on sanitized or test datasets until trust is built. Others put heavy monitoring on outputs (e.g. scanning everything the agent is about to output to a user for sensitive data). These are areas where data loss prevention (DLP) tools intersect with AI. A future vision is that an enterprise agent platform will integrate DLP classifiers that flag if an agent's response likely contains company confidential info, and either redact it or alert a human.

Finally, we must mention user trust and adoption: beyond technical measures, building trust in agents involves user education and incremental rollout. Many organizations start with "read-only" agents (they can suggest actions but not execute them) and then gradually allow more autonomy as confidence grows. By having robust logs and a clear override path, users are more likely to accept the agent's help. Trust is also enhanced by making the agent's reasoning visible (hence the popularity of chain-of-thought traces displayed to users) and by giving users easy ways to correct or stop the agent. In essence, transparency and control are the antidotes to the unpredictability of AI.

The advancements in the last half-year – from sandbox isolation to protocol standardization and new benchmarks – all aim to shrink the trust gap. Yet, open challenges remain (discussed in the next section) before one can confidently say an autonomous agent is as well-understood and controlled as a traditional software microservice.

5. Open Challenges and Future Directions

Despite rapid progress, enterprise agent systems still have unsolved research questions and practical gaps. We conclude by highlighting some of the most pressing ones, as identified by recent discussions and publications, which represent opportunities for future work:

  • Unified Cross-Layer Security Model: Today we have pieces – OAuth for identity, MCP scopes for tool access, sandbox for OS isolation – but they don't always speak the same language. There is no single policy that says, for example, "User X's agent can read from database Y but not write, and can run code but only use 2 CPU and no internet, and these conditions are cryptographically verified." A comprehensive model that ties user identity, agent capabilities, tool permissions, and sandbox OS permissions into one coherent framework is needed. Early proposals like AgentBound (inspired by mobile app permissions) are a start. In the future, we might see capability tokens that encode all these at once – the agent carries a token which the sandbox and tools all check, limiting what it can do in each context. Formal verification of such models (to prove an agent cannot do X) would greatly enhance trust.
  • Rollback of External Side Effects: As noted, while we can roll back filesystem changes in a sandbox, we cannot yet roll back an email sent or a transaction made. Developing agent transaction protocols or sagas is an open challenge. One idea is to require critical tools to provide a compensation function – e.g. an MCP server for cloud VMs could have an "undo" for creating a VM (which would delete it). An agent planner could then use these to revert a series of actions if needed. This also ties into training the LLM or using a secondary verifier to decide when to roll back (e.g. if it notices an outcome diverges from the expected state). Without solving this, enterprises will be hesitant to let agents perform irreversible operations autonomously. A conceptual sketch of this compensation pattern appears after this list.
  • Advanced Threat Defenses: The taxonomy of potential attacks (context injection, tool poisoning, cross-tool data leaks, etc.) is growing. Defenses like context signing (cryptographically signing tool outputs or important prompts to prevent tampering) have been suggested but not widely implemented. The idea there is: an agent would only trust tool outputs that come with a signature or hash, so an attacker who intercepts or modifies the content (like a man-in-the-middle on an HTTP tool) would fail. Similarly, isolating tools from each other (so one tool can't directly influence another except through the agent's vetted reasoning) is a challenge – currently the agent's memory is the meeting point of all tool data, making it a melting pot where a malicious output in one tool can affect decisions involving another.
  • Benchmarking and Standards for Evaluation: The community has started benchmarks like MCP-SafetyBench and MSB, but we need continuous evaluation pipelines. Perhaps an open leaderboard where agent developers can submit their agent (with a certain set of tools and policies) to be evaluated against a suite of scenarios, similar to how language models are benchmarked on GLUE or SuperGLUE for NLP. This could drive competition and improvement in safety. Also, evaluation should include cost and latency metrics – an agent that is safe but takes hours or $$$ to complete a task isn't practical. Balancing efficiency with safety will likely lead to innovations like adaptive risk modes (the agent switches to a more cautious approach if it senses something sensitive, trading speed for safety dynamically).
  • Human-Agent Interaction Paradigms: AgentBay's approach to HITL is one example of making agents more usable in the real world. There is still work to do on when and how an agent should ask for help. If it asks too often, it's not useful; if it asks too rarely, it might make an irrecoverable error. Finding that sweet spot (perhaps through reinforcement learning or feedback from users) is an ongoing area. Also, UI/UX research into how to present agent decisions to users in a clear way will be important (so users can confidently approve or deny actions). In enterprises, this might mean integrating agent controls into existing interfaces – e.g. showing an "AI agent suggestion" in a Jira ticket with a one-click approve.
  • Cross-Organization Collaboration and Data Sharing: Enterprise agents often need to work across silos – e.g. an agent might coordinate between a supplier's system and the company's internal system. This raises questions of federated trust: how do you let an agent use two domains' tools in a secure way? This touches on things like standardizing how agents convey identity across org boundaries, and how audit logs are shared. The AAIF being under Linux Foundation hints at future inter-company standards to address this, since agents won't stop at the corporate firewall.
  • Ethical and Compliance Considerations: Beyond security, enterprises must ensure agents comply with regulations and ethical norms. For example, if an agent interacts with personal data, privacy laws apply. How do we audit that an agent didn't retain or leak personal data beyond allowed purposes? Techniques like data tagging and tracking could be employed – marking certain outputs as containing sensitive info and preventing them from being used in contexts that aren't allowed. Ensuring AI explanations for decisions (especially if used in regulated domains) is another angle – if an agent makes a decision that affects a customer, one might need a rationale logged for compliance, which is tricky given the opaque reasoning of LLMs.
  • Improving Model Robustness: Finally, at the heart is the LLM itself. There's ongoing research into fine-tuning models to be more resistant to manipulation (advantageous to safety but often at odds with capability). Techniques like constitutional AI or adversarial training on tool-use scenarios might yield models that inherently refuse certain dangerous actions or at least flag uncertainty. Also, specialized models for parsing and validating the agent's outputs (e.g. a secondary model that checks if a proposed action seems safe/rational) could be integrated. OpenAI and others are exploring "moderator" models that look at the main model's outputs. In agents, a "policy model" might examine the plan and tool uses and raise red flags for anything that violates training-time learned safe patterns.
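As referenced in the rollback item above, here is a conceptual saga sketch: each external action registers a compensating step, and on abort the completed steps are undone in reverse order. The cloud.create_vm / cloud.delete_vm calls in the usage comment are hypothetical stand-ins for real tool APIs.

```python
# Conceptual saga: register a compensation for each completed external action,
# then undo them in reverse order if the agent's plan is aborted.
class AgentSaga:
    def __init__(self):
        self.compensations = []   # (description, callable) pairs, newest last

    def run_step(self, description, action, compensate):
        result = action()
        self.compensations.append((description, compensate))
        return result

    def abort(self):
        # Undo completed external effects in reverse order; failures here must be
        # surfaced to a human, since compensation can itself fail.
        while self.compensations:
            description, compensate = self.compensations.pop()
            print(f"compensating: {description}")
            compensate()

# saga = AgentSaga()
# vm = saga.run_step("create VM",
#                    lambda: cloud.create_vm("build-agent"),
#                    lambda: cloud.delete_vm("build-agent"))
# ...
# saga.abort()   # called when a later step fails or a policy check trips
```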

Outlook: The next year will likely bring a maturation of the agent ecosystem akin to what 2010-2015 saw for cloud microservices – an explosion of tools and best practices to handle deployment, security, monitoring, and standardization. The formation of AAIF is a strong indicator that industry players see collaboration as the way forward; no one wants a fragmented, Wild West environment when so much is at stake (both in terms of safety and potential business value). We will probably see AgentOps teams emerge in organizations, analogous to MLOps, focused on managing and supervising fleets of agents. They'll use dashboards (like GitHub's Agent HQ mission control) to oversee agent activities across the enterprise. And just as DevOps developed guardrails and CI/CD for code, AgentOps will develop guardrails and continuous evaluation for autonomous AI behaviors.

In conclusion, enterprise agent systems are transitioning from the lab to the real world, carrying with them both excitement (unprecedented automation capabilities) and caution (novel failure modes). Sandbox architectures and protocols like MCP have laid a foundation that makes these systems more modular, controllable, and interoperable than before. Yet, achieving a level of trust comparable to traditional software will require continued innovation in permission modeling, verification, and human oversight integration. The last half-year's progress has been remarkable – what was mostly sci-fi a year ago (multiple AIs collaborating on complex tasks with minimal human input) is now demonstrably feasible. The coming months will likely see pilots turn into production deployments in enterprises, each teaching new lessons. By actively sharing these lessons and converging on open standards and benchmarks, the community can accelerate the safe adoption of agentic AI. The end goal is an ecosystem where AI agents become reliable teammates – tirelessly automating drudgery and navigating complexity – while humans retain ultimate control and understanding of their behavior. The path to get there is challenging, but as this survey shows, the groundwork is rapidly being put in place.
