Runtime Observability and Enforcement for Opaque AI Agents with eBPF: Beyond Sandboxes and Approvals

As AI coding agents run autonomously for hours inside harnesses and sandboxes the platform team may not own, approval-based control breaks down. This post argues for separating agent security into three layers (intent authorization, execution isolation, side-effect verification) and using eBPF-based observability (AgentSight) and enforcement (ActPlane) as an independent runtime layer below the harness.

AI coding agents now run for hours, complete entire features end-to-end, optimize production GPU kernels, and merge thousands of pull requests autonomously. Meanwhile, most agent security still relies on human-in-the-loop approval, and Anthropic's own data shows users approve 93% of prompts without meaningful review. The result is predictable: products add bypass modes, users disable permission gates, and 65% of firms report agent security incidents.

But the deeper problem is not approval fatigue. It is that the agent harness (the prompt loop, tool routing, permission logic, and sandbox defaults) is increasingly a third-party product the platform team did not write, running in a sandbox the platform team may not own. The harness is not a trusted security boundary. This post argues for separating agent security into three layers with three different owners: intent authorization (harness-owned), execution isolation (ownership contested), and side-effect verification (must be platform-owned). When the layers agree, you have confidence. When they disagree, you need independent observability and enforcement at the OS level to detect it, and that is exactly the layer most agent platforms are missing. We are building two projects in this direction: AgentSight for runtime observation and ActPlane for runtime harness enforcement. Both use eBPF to provide an independent observability and enforcement layer below the agent harness.

Why Now: Complexity Up, Guardrails Behind

The important change in 2026 is not that agents exist. It is the scale and duration of what they do.

A year ago, the typical agent task was "fix this bug" or "write this function." In 2026, agents routinely run for hours on complex, multi-step work. OpenAI documented a Codex session that ran for 25 hours uninterrupted, consuming 13 million tokens and producing 30,000 lines of code from a blank repository. Anthropic's agentic coding report cites a 12.5-million-line codebase change completed in a single 7-hour run. Meta's KernelEvolve uses multi-agent coordination to write and optimize production GPU kernels, compressing work that previously required weeks of expert systems engineering into hours. On SWE-bench Verified, top agents now resolve 60–70% of real GitHub issues, up from under 30% in early 2024. Devin has merged hundreds of thousands of pull requests across enterprise customers with a 67% merge rate. Goldman Sachs deployed hundreds of Devin instances across a 12,000-person engineering team.

Beyond coding, general-purpose autonomous agents have gone mainstream. OpenClaw, an open-source agent with over 300,000 GitHub stars, connects to LLMs and executes shell commands, browser automation, email, calendar, and file operations on the user's machine. CrowdStrike called it "the AI Super Agent" security teams need to worry about: between January and April 2026, 470 security advisories were filed against it across three disclosure waves.

These are not research demos. They are production workflows: background tasks, parallel execution, multi-hour sessions, end-to-end feature development, kernel optimization, and enterprise-scale code changes.

Meanwhile, the guardrails designed to keep agents safe have not kept pace.

Most agent security still relies on human-in-the-loop approval: a prompt asks the user to approve or deny each action before it executes. This works for short sessions with a few tool calls. It does not work when an agent makes hundreds of decisions over hours of autonomous operation.

The evidence suggests that approval-based control is already failing in practice. Anthropic's own data shows that Claude Code users approve 93% of permission prompts, a rate consistent with rubber-stamping rather than meaningful review. An independent stress test of Claude Code's auto mode found an 81% false negative rate on ambiguous state-changing actions, meaning the classifier allowed 4 out of 5 actions that should have required human review. Real incidents have followed: in documented cases, users running agents without permission gates had their home directories deleted by rm -rf commands the agent generated. A 2026 industry survey found that 65% of firms reported AI agent security incidents, primarily unauthorized data access, credential exposure, and exfiltration to external endpoints, with most involving organizations lacking proper agent access controls.

Products have responded by adding bypass mechanisms. Claude Code offers --dangerously-skip-permissions. Windsurf's Cascade agent proceeds autonomously where Cursor stops to ask. Community guides now focus on "how to safely use YOLO mode." Anthropic researcher Nicholas Carlini ran 16 parallel Claude agents with permissions bypassed, with the caveat: "Run this in a container, not your actual machine."

This is the tension: the more capable agents become, the more users want to let them run uninterrupted, and the less effective human-in-the-loop becomes as the primary security boundary.

That tension is what creates the need for a different security model.

The Accountability Gap

The deeper issue is not just that agents are more capable. It is that the agent harness, the component that decides what the agent does, is increasingly a third-party product the platform team did not write.

A modern agent harness is not a thin wrapper around a model. It includes a prompt loop, planning and retry logic, tool routing, MCP clients, permission modes, approval gates, hooks, memory, logs, credential handling, and sometimes sandbox defaults. In many deployments, that harness comes from a hosted coding-agent service or an open-source framework the platform team does not control.

This is already visible across the ecosystem. GitHub Copilot's coding agent runs autonomously in GitHub Actions, researching repositories, creating plans, making changes, and opening pull requests. OpenAI Codex runs background tasks in sandboxed cloud environments with controlled network access. Claude Code runs cloud sessions in Anthropic-managed VMs with scoped credentials. A Kubernetes SIG is defining Agent Sandbox for isolated, stateful agent workloads. Recent research datasets show agent-authored pull requests at scale across real repositories.

The ownership split is now explicit in major platforms. Anthropic's shared responsibility framework divides agent security into four layers (Model, Harness, Tools, Environment) and stresses that an agent's behavior depends on all four working together, so the harness, tools, and environment, the layers shaped by the deploying party, are as decisive as the model itself. Anthropic itself notes that even together, these layered safeguards are not a guarantee. The question the framework leaves open is what happens when a failure crosses these layers, and whether the deployer has independent observability to detect it. In cloud infrastructure, the analogous gap in shared responsibility led to independent observability and audit services (CloudTrail, Config, GuardDuty) controlled by the customer, not the provider. Agent infrastructure has no equivalent yet: the deployer is told it owns harness, tools, and environment, but often has no independent way to verify what those layers actually did at runtime.

GitHub's agentic workflow architecture starts from the premise that "agents cannot be trusted by default, especially in the presence of untrusted inputs", using kernel-enforced communication boundaries that hold even if the agent container is compromised. OpenAI's Codex documentation acknowledges that "devcontainers provide substantial protection, but they do not prevent every attack."

The platform team still owns the repository, the CI runner, the Kubernetes cluster, the service accounts, the secrets, and the internal network. But the runtime acting on those assets may be opaque.

There is also a second split that matters even more for platform teams: the sandbox may not be controlled by the environment owner either. If the agent runs in a provider-managed cloud (Claude Code on the web runs in Anthropic-managed isolated VMs with scoped credential proxies; Codex runs in OpenAI-managed containers), the platform team cannot attach its own monitoring, modify isolation policy, or inspect the sandbox internals. Even Anthropic's own managed agent architecture explicitly decouples the "brain" (Claude + harness) from the "hands" (sandboxes), treating containers as disposable and ensuring tokens are never reachable from the sandbox where generated code runs. This is good architecture, but it is the provider's architecture, not the platform team's.

When agents run locally or on self-hosted infrastructure (GitHub now supports self-hosted runners for its coding agent, and Kubernetes Agent Sandbox provides gVisor/Kata-backed isolation under the platform operator's control), the environment owner can wrap the agent in its own sandbox and observability. When agents run in provider-managed environments, independent observability and enforcement must move to the boundaries the platform team does control.

This creates the accountability gap: the platform team is responsible for production impact from a workload it cannot fully inspect, running in a sandbox it may not own.

The old mental model was simple: the agent is risky, so put it in a sandbox. The new reality has a different trust boundary: the agent and its harness are part of the workload, and the environment owner needs independent runtime observability.

Three Layers, Three Questions

MCP, sandboxes, and OS-level observability are all necessary for agent security. They are not interchangeable. Each answers a fundamentally different question, and each has a different owner.

Intent authorization (MCP, tool gateways, approval prompts) answers: what is the agent supposed to do? Which tools may it call, under which identity, with which scopes? This is the right place to enforce access control before a dangerous action happens. But a tool approval is not proof of side effects. A framework log saying "run tests" does not prove that the process tree only ran tests. An MCP server can be well-authenticated and still be part of a workflow that causes unexpected local effects. This layer is typically owned or mediated by the agent harness.

Execution isolation (containers, VMs, network policy, namespaces) answers: what can the agent reach? Which files, network endpoints, credentials, and syscalls are available? This is the right place to limit blast radius. But a sandbox does not automatically record what the agent attempted within its constraints: which process read a secret, which subprocess opened a network connection, whether the sandbox policy matched the approved intent. This layer's ownership is contested: it may belong to the agent provider, the platform team, or both.

Side-effect verification (OS/runtime observability) answers: what actually happened? Which processes ran, which files were read, which network connections were opened, which credentials were accessed? This layer provides facts about execution, independent of what the framework reported or the sandbox intended. This layer must be owned by the environment operator. Otherwise there is no independent source of truth.

The security model is the combination:

authorize intent  →  isolate execution  →  verify side effects
(harness-owned)      (ownership contested)  (must be platform-owned)

When all three layers agree, you have confidence. When they disagree, you need OS-level observability and controls, independent of the harness, to detect the mismatch, contain the damage, and reconstruct what happened.

Why Independence Matters

The reason to keep these layers independent follows from the trends above, but also from a deeper structural argument about ownership and trust.

Approval fatigue

When approvals are relaxed (as the evidence above shows they routinely are), the other two layers must compensate. If you auto-approve routine actions, you need an independent way to verify what those actions actually did. If you bypass permissions for speed, you need stronger containment and stronger observability.

Harness opacity

When the harness is opaque, application-level telemetry cannot be the sole source of truth. OpenTelemetry GenAI conventions and framework-level tracing are valuable when you own the framework. But opaque agent apps, closed-source runtimes, hosted execution, stripped binaries, and arbitrary subprocess trees can all break the assumption that the framework trace is complete. OpenClaw illustrates this directly: its behavior is non-deterministic across runs, producing different tool-calling sequences for the same input, which makes static code review inadequate and drove multiple teams to build dedicated runtime observability tools for it (OneClaw, ClawTrace). Security researchers have already found 30+ vulnerabilities across all major AI IDEs (Cursor, Copilot, Windsurf, Claude Code), enabling data theft and remote code execution through prompt injection into agent tool chains.

The MCP layer records intended tool calls. The OS layer records actual side effects. When the harness is opaque, the gap between these two is exactly where security incidents live.

The trust boundary is an ownership boundary

The deepest reason for independence is that the three layers serve different owners with different incentives.

The harness provider's goal is to complete the user's task: maximize autonomous coding productivity, reduce permission friction, deliver results. The platform team's goal is to protect the repository, secrets, cluster, CI runner, internal network, and production APIs. These goals are not opposed, but they are not identical. When they conflict, when the fastest path to task completion involves reading credentials, opening network connections, or modifying files outside the workspace, the harness will optimize for completion unless an independent boundary stops it.

This is why Bhattarai and Vu argue that "probabilistic compliance is not compliance": training-based and classifier-based defenses may reduce empirical attack rates, but cannot provide deterministic guarantees under adversarial conditions. Only architectural enforcement can. Red Hat's experience deploying multi-agent systems on Kagenti frames the same insight differently: this is "a multi-tenancy problem disguised as an AI problem". The agent is an untrusted tenant. The platform needs the same kind of isolation, identity, and audit controls it would apply to any untrusted workload.

The OWASP Top 10 for Agentic Applications reinforces this framing. Its top risk (ASI01, Agent Goal Hijacking) is that "agents cannot reliably distinguish instructions from data," and a single malicious input from a repository, issue, MCP response, or web page can redirect the agent to perform harmful actions using its legitimate tools. This is not a hypothetical: Bishop Fox demonstrated confused deputy attacks where instructions embedded in support tickets caused agents to exfiltrate data using authorized tools, with "the user's name on every audit log entry." Docker documented a GitHub prompt injection chain where a malicious issue hijacked an MCP-connected agent to steal confidential data from private repositories.

The threat model for platform teams therefore has three adversary categories:

Threat | Which layer fails | Runtime observability detects
Compromised agent (prompt injection, malicious repo/issue/MCP response) | Intent layer: agent is tricked into unintended actions | Actual side effects diverge from stated intent
Untrusted harness (opaque permission logic, incomplete logs, unauditable internal state) | Cannot verify harness completeness | OS-level facts independent of harness reporting
Sandbox escape or policy gap (container breakout, mounted credentials, network bypass) | Isolation layer fails or is misconfigured | Detects behavior outside expected sandbox boundary

AISI's SandboxEscapeBench makes the third category concrete: frontier models can reliably escape container sandboxes under misconfigurations that plausibly occur in real systems, and the researchers discovered four unintended escape paths the benchmark designers had missed. Their recommendation: "treat plain Docker isolation as insufficient by default."

In all three cases, OS/runtime observability is the independent control that lets the platform team detect the problem, regardless of which other layer failed.

What OS-Level Monitoring Captures

At the OS/runtime layer, observability captures:

  • Process lineage: the full tree from agent to subprocess to network call
  • File access: which paths were read or written, including credential paths
  • Network behavior: connections, destinations, timing, data volume
  • Container metadata: namespace, cgroup, pod identity, service account
  • Subprocess behavior: commands that bypass framework instrumentation

This data is collected below the application layer, typically via eBPF, audit subsystems, or kernel instrumentation. It does not require modifying the agent app. Its key property is independence: the observability is owned and operated by the environment operator, not by the agent provider.
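
To make the process-lineage item concrete, here is a minimal sketch of rebuilding an ancestry chain from process-start records. The `ExecEvent` record shape and the sample PIDs are assumptions for illustration, not the output format of any particular eBPF collector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecEvent:
    """One process-start record (hypothetical shape, not a real tool's schema)."""
    pid: int
    ppid: int
    comm: str


def lineage(events: list[ExecEvent], pid: int) -> list[str]:
    """Walk parent links from `pid` up to the oldest ancestor we have a record for."""
    by_pid = {e.pid: e for e in events}
    chain: list[str] = []
    seen: set[int] = set()  # guards against cycles in malformed input
    while pid in by_pid and pid not in seen:
        seen.add(pid)
        event = by_pid[pid]
        chain.append(event.comm)
        pid = event.ppid
    return list(reversed(chain))


# Made-up sample records for illustration only.
events = [
    ExecEvent(pid=100, ppid=1, comm="agent"),
    ExecEvent(pid=101, ppid=100, comm="shell"),
    ExecEvent(pid=102, ppid=101, comm="python"),
    ExecEvent(pid=103, ppid=102, comm="curl"),
]

print(" → ".join(lineage(events, 103)))
```

A production collector would also need timestamps or a per-boot unique process ID, because the kernel reuses PIDs over a long session.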

This makes cross-layer comparison possible:

Framework report:    run tests
Sandbox policy:      workspace mounted, registry allowed, SA token mounted
OS observability:    agent → shell → python → curl
                     read: /var/run/secrets/.../token
                     connect: unknown external host

Each layer saw a different part of the event. Without the OS layer, this is an undetected credential theft: a service account token read and exfiltrated while the framework logged only "running tests." The platform team discovers the breach days later, if at all. OS-level observability is what turns an invisible data leak into a real-time detection.
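
The comparison above can be sketched as a small check over one session. Everything here is an assumption for illustration: the `SessionView` fields, the host names, and the secret path (a placeholder, not the elided path in the example) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SessionView:
    """What each layer reported for one agent session (hypothetical fields)."""
    declared_intent: str                      # from the framework log
    allowed_hosts: set[str]                   # from the sandbox policy
    sensitive_prefixes: tuple[str, ...]       # paths the platform treats as secrets
    files_read: list[str] = field(default_factory=list)       # from OS events
    hosts_contacted: list[str] = field(default_factory=list)  # from OS events


def find_mismatches(view: SessionView) -> list[str]:
    """Return findings where OS-level facts diverge from intent or policy."""
    findings = []
    for path in view.files_read:
        # str.startswith accepts a tuple of prefixes.
        if path.startswith(view.sensitive_prefixes):
            findings.append(f"secret read during '{view.declared_intent}': {path}")
    for host in view.hosts_contacted:
        if host not in view.allowed_hosts:
            findings.append(f"egress outside sandbox policy: {host}")
    return findings


view = SessionView(
    declared_intent="run tests",
    allowed_hosts={"registry.example.internal"},
    sensitive_prefixes=("/var/run/secrets/",),
    files_read=["/workspace/tests/test_api.py", "/var/run/secrets/example/token"],
    hosts_contacted=["registry.example.internal", "unknown.example.net"],
)

findings = find_mismatches(view)
for finding in findings:
    print(finding)
```

The check only works if the OS-level fields come from a collector the platform team operates. If they are copied from the framework log, the comparison proves nothing.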

Deployment Reality

OS-level observability is strongest when you control the host, node, or VM where the agent executes. If the agent runs entirely in a provider-managed environment, you may not be able to attach eBPF inside it.

In that case, the same model applies, but observability shifts to the boundaries you do control:

  • Repository permissions and branch protection
  • Scoped credentials with minimal lifetime
  • CI/CD and GitHub audit logs
  • Network proxies and webhook events
  • Artifact access logs
  • Provider-supplied session logs

This observability is weaker than owning the runtime boundary, but it is still better than treating the agent transcript as the only source of truth.

The design question for platform teams is:

Where is the lowest layer I actually control? That is where independent observability should live.

AgentSight and ActPlane: Observe, Then Enforce

We are building two open-source tools that implement the verification layer described above, each addressing a different half of the problem.

AgentSight is a zero-instrumentation observability tool for AI agents. It uses eBPF to intercept SSL/TLS traffic and monitor process behavior at the system boundary, with no code changes, no SDKs, and no framework integration required. Point it at any agent process (Claude Code, Codex, a custom Python agent) and it captures the full picture: process lineage, LLM API calls (prompts and completions), file access, network connections, and tool invocations, all correlated into a live timeline. This is the "see what actually happened" layer. Because it operates below the application, it works even when the agent runtime is opaque, closed-source, or running arbitrary subprocesses that bypass framework-level tracing. In practice, this means detecting credential access, data exfiltration attempts, and unauthorized network connections as they happen, not days later when an external party reports the breach.

ActPlane is an OS-level harness for AI agents. Where AgentSight observes, ActPlane enforces. You write behavioral contracts in a YAML-based rule language (labeled information-flow control, not static allow-lists), and ActPlane compiles them into an eBPF program that enforces constraints at the kernel level: every exec, file open, and network connect in the agent's entire process tree is checked against the policy. When a rule is violated, ActPlane blocks the action and feeds a human-readable reason back to the agent through its hook system, so the agent self-corrects rather than failing silently. The rule language supports data-flow tracking across fork/exec chains, causal ordering ("run tests before committing"), and staleness invalidation, going well beyond what sandboxes or tool-layer guards can express.
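
To illustrate how labeled information-flow control differs from a static allow-list, here is a toy user-space model of label propagation across fork chains. It is not ActPlane's rule language or its eBPF implementation: the `FlowPolicy` class, its method names, and the sample paths and hosts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    pid: int
    labels: set[str] = field(default_factory=set)


class FlowPolicy:
    """Toy labeled information-flow tracker (illustrative only)."""

    def __init__(self, secret_prefixes, allowed_hosts):
        self.secret_prefixes = tuple(secret_prefixes)
        self.allowed_hosts = set(allowed_hosts)
        self.procs: dict[int, Proc] = {}

    def _proc(self, pid: int) -> Proc:
        return self.procs.setdefault(pid, Proc(pid))

    def on_fork(self, parent_pid: int, child_pid: int) -> None:
        # A child inherits a copy of its parent's labels.
        parent = self._proc(parent_pid)
        self.procs[child_pid] = Proc(child_pid, set(parent.labels))

    def on_open(self, pid: int, path: str) -> None:
        # Reading a secret path labels the reading process.
        if path.startswith(self.secret_prefixes):
            self._proc(pid).labels.add("secret")

    def on_connect(self, pid: int, host: str) -> tuple[bool, str]:
        # The decision depends on what the process has touched, not only on the host.
        proc = self._proc(pid)
        if "secret" in proc.labels and host not in self.allowed_hosts:
            return (False, f"pid {pid} holds secret-labeled data; connect to {host} denied")
        return (True, "ok")


policy = FlowPolicy(["/run/secrets/"], ["registry.example.internal"])
policy.on_fork(100, 101)                          # agent spawns a shell
policy.on_open(101, "/run/secrets/demo-token")    # shell reads a secret
policy.on_fork(101, 102)                          # shell spawns a child
print(policy.on_connect(102, "unknown.example.net"))  # denied: label inherited
print(policy.on_connect(100, "unknown.example.net"))  # allowed: agent never read it
```

The same destination is denied for one process and allowed for another, which a static allow-list cannot express. This toy tracks only fork inheritance; a real system would also have to follow data through pipes, files, and exec.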

The two tools are complementary. AgentSight provides runtime observability: independent, below-the-application visibility into what the agent did. ActPlane provides the enforcement plane: deterministic, kernel-level guarantees about what the agent cannot do. Together they implement the "verify side effects" layer of the three-layer model, independent of the harness provider and independent of who owns the sandbox.

Both are possible implementations of this architecture, not the only ones. The important point is the separation: observe and enforce at a layer the environment operator controls, regardless of which agent runtime sits above.

This also addresses ecosystem gaps Anthropic identifies: the need for cross-deployment security telemetry sharing and open standards for agent security. Independent runtime observability that travels with the workload, rather than being locked to a specific harness or provider, is the foundation for both.

Practical Checklist

If you are building or evaluating an agent platform, ask these questions at each layer.

Intent authorization (MCP / tool access):

  • Are MCP servers allowlisted?
  • Are OAuth scopes minimal and audience-bound?
  • Are local MCP servers treated as code execution risk?
  • Are high-risk tools gated by human approval?
  • Are tool calls logged with enough context for audit?

Execution isolation (sandboxing):

  • Is filesystem access default-deny or broad workspace mount?
  • Can the agent reach cloud metadata endpoints?
  • Is network egress restricted by domain, IP, or proxy?
  • Are service account tokens mounted into the environment?
  • Are process, memory, CPU, and runtime duration bounded?
  • Who owns the sandbox policy: the platform team or the agent provider?

Side-effect verification (runtime observability):

  • Can you reconstruct process lineage for an agent session?
  • Can you see file and credential access below the framework?
  • Can you correlate network egress with pod, service account, and command?
  • Can you detect mismatch between tool intent and OS side effects?
  • Can you replay an incident without trusting only framework logs?
  • Can you demonstrate to auditors (SOC 2, ISO 27001) how automated agent access to production data and credentials is monitored and logged?

Guardrail integration:

  • Which side effects should be blocked immediately?
  • Which should trigger alert or human review?
  • Which policies belong in MCP config, sandbox config, Kubernetes policy, eBPF/LSM, or network controls?
  • What happens when framework logs and OS-level observability disagree?
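
One way to make the first two guardrail questions explicit is a small triage table that maps observed side effects to responses. The event kinds, condition names, and chosen responses below are hypothetical examples, not recommendations for any specific deployment.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    ALERT = "alert"
    BLOCK = "block"


# Hypothetical triage table: (event kind, matched condition) -> response.
TRIAGE = {
    ("file_read", "workspace_path"): Action.ALLOW,
    ("file_read", "credential_path"): Action.BLOCK,
    ("connect", "unlisted_host"): Action.ALERT,
    ("exec", "outside_workspace"): Action.ALERT,
}


def triage(kind: str, condition: str) -> Action:
    # Unlisted combinations default to ALERT so gaps in the table surface for review.
    return TRIAGE.get((kind, condition), Action.ALERT)


print(triage("file_read", "credential_path").value)
print(triage("connect", "unlisted_host").value)
print(triage("ptrace", "any").value)
```

Keeping the table as data rather than scattered conditionals makes it easier to review which side effects are blocked, which only alert, and which nobody has classified yet.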

Closing

Agent runtimes are becoming more capable, more managed, and more opaque. The security model cannot depend on any single layer, especially when the layers have different owners.

The harness is not a trusted boundary. Sandbox ownership depends on the deployment model. The only layer the environment operator can guarantee it owns is OS/runtime observability.

MCP authorizes intent. Sandboxes constrain execution. OS-level observability verifies side effects. Each is necessary; none is sufficient. The practical model is their separation:

authorize intent  →  isolate execution  →  verify side effects
(harness-owned)      (ownership contested)  (must be platform-owned)

The implementation details vary by deployment, but the separation and the ownership question are the parts that should remain stable.

If you are exploring this space, AgentSight and ActPlane are our open-source starting points for the observation and enforcement layers respectively.

References