Profiling AI Agents with Semantic Flamegraphs

Your AI agent spent $3000 this month. Which activities consumed that budget? agentpprof applies the flamegraph paradigm to AI agent traces, mapping natural language prompts to semantic tags and aggregating them like a CPU profiler. This post explains why existing observability tools fail at budget attribution and how semantic flamegraphs restore aggregation for agent workloads.

At the end of the month, the bill shows the agent spent $3000. What types of work consumed that budget? How much went to code review, how much to debugging, how much to documentation? The question seems simple, but none of the existing agent observability tools can answer it directly.

agentpprof is a profiling tool built for exactly this question. It reads local agent trace history and aggregates prompts and tool calls by semantic intent into flamegraphs: width represents token consumption, execution time, or operation count. At a glance, you can see where the budget went by category. It is part of the AgentSight project, which provides eBPF-based observability for AI agent behavior.

The Aggregation Problem

LLM observability platforms like LangSmith, Langfuse, and Phoenix can show token counts and latency for each call, but when you have 80,000 calls, they can only arrange them by timestamp into a timeline. You can inspect each one and see "this call used 500 tokens," but you cannot answer "how much did review tasks cost in total?" These tools are designed for single-trace debugging: timeline views help you locate the failing span at 14:03, span trees show call hierarchy, and waterfall charts reveal parallelism. They excel at answering "what happened," but for the question "where did the budget go by category," inspecting 80,000 spans one by one does not scale.

Datadog and Laminar are starting to move in the right direction with semantic classification: Datadog uses topic clustering to group user messages, and Laminar uses Signals to extract structured events from traces. But their clustering primarily targets the distribution of user inputs, not an aggregate view in which width represents budget share. You can see "30% of users asked about code," but not "code review consumed 40% of the token budget."

CPU profilers solved a similar aggregation problem long ago. Flamegraphs compress millions of function calls into one chart, width representing time share. The stack indicates context, and repeated calls to the same function merge into wider bars. This works because function names are deterministic: the same code path produces the same stack, and identical stacks can be directly merged.

Agent traces break this assumption. Prompts are natural language: non-deterministic, variable-length, multilingual, and often conversational. "Fix the bug" and "修一下这个 error" (roughly "fix this error") express the same intent but share no common string. If you use raw prompt text as frame labels, every prompt becomes its own isolated bar, the flamegraph grows too wide to read, and the point of aggregation is lost. Raw prompts also often contain sensitive information, which makes them unsuitable for sharing.

Semantic Flamegraphs: Restoring Aggregation

agentpprof restores aggregation by introducing semantic tagging: mapping free-form prompts to short, stable labels like debug, review, paper, or docs. Once tagged, prompts behave like function names: repeated activities merge, and the flamegraph becomes readable.

The value of flamegraphs is not just aggregation but also stack-based causal linking. Traditional CPU flamegraph stacks are function call chains: main → parse → tokenize means tokenize was called by parse, which was called by main. Semantic flamegraph stacks are causal chains of agent behavior: prompt:debug → call:llm/analysis → tool:bash → file:src/main.rs means this file modification was performed by bash, the bash call was chosen by the LLM, and the LLM was responding to a debug-type prompt.

| Aspect | Traditional CPU Flamegraph | Semantic Flamegraph |
| --- | --- | --- |
| Stack meaning | Function call chain | prompt → LLM → tool → effect causal chain |
| Aggregation | Same function name merges | Same semantic tag merges |
| Width meaning | CPU time share | token / time / operation count share |
| Question answered | Where does the program spend CPU | Where does the agent spend budget by category |

This causal linking lets you trace back or drill down from any layer: from a file being modified, trace back to which tool, which LLM decision, which user intent caused it; or from a prompt category, see what LLM calls, tool executions, and system effects it triggered.
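The trace-back and drill-down described above can be sketched over folded stacks, the `frame;frame;frame weight` text form that flamegraph tooling consumes. This is a minimal illustration, not agentpprof's implementation: the event records, their field names, and the helper functions are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical, simplified events: each carries its causal chain and a token weight.
events = [
    {"stack": ["prompt:debug", "call:llm/analysis", "tool:bash", "file:src/main.rs"], "tokens": 1200},
    {"stack": ["prompt:debug", "call:llm/analysis", "tool:bash", "file:src/main.rs"], "tokens": 800},
    {"stack": ["prompt:review", "call:llm/code"], "tokens": 3000},
    {"stack": ["prompt:debug", "call:llm/analysis"], "tokens": 500},
]

def fold(events, weight="tokens"):
    """Merge identical stacks and sum their weights."""
    folded = Counter()
    for ev in events:
        folded[";".join(ev["stack"])] += ev[weight]
    return folded

def to_folded_text(folded):
    """Render one 'frame;frame;frame weight' line per distinct stack."""
    return "\n".join(f"{stack} {w}" for stack, w in sorted(folded.items()))

def drill_down(folded, prefix):
    """Total weight of every stack that starts with the given frame prefix."""
    return sum(w for stack, w in folded.items()
               if stack == prefix or stack.startswith(prefix + ";"))

def trace_back(folded, leaf):
    """Stacks whose last frame is the given effect, i.e. the chains that led to it."""
    return sorted(stack for stack in folded if stack.split(";")[-1] == leaf)

folded = fold(events)
print(to_folded_text(folded))
print(drill_down(folded, "prompt:debug"))
print(trace_back(folded, "file:src/main.rs"))
```

The two identical debug chains collapse into one wider entry, which is the merge that tagging makes possible. Drilling down is a prefix sum and tracing back is a leaf lookup.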

Multiple Views, Different Questions

agentpprof exposes several projections over the same data, each answering a different question:

| View | Width means | Primary question |
| --- | --- | --- |
| `tokens` | reported token count (input/output/cache) | Which prompts consumed the most model budget? |
| `time` | duration in seconds | How long did each prompt/activity take? |
| `files` | file/path effect count | Which prompts touched which parts of the repository? |
| `network` | network/domain effect count | Which prompts contacted which domains? |

Start with tokens to find cost hotspots, use time to trace where wall-clock time went, and use files and network for security audits.
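One way to picture "several projections over the same data" is a single set of records with a different weight function per view. The sketch below assumes a hypothetical record shape. The field names `tokens`, `seconds`, `files`, and `domains` are illustrative and not agentpprof's schema.

```python
from collections import defaultdict

# Hypothetical records; field names are illustrative only.
records = [
    {"prompt": "prompt:review", "tokens": 4000, "seconds": 30.0,
     "files": ["src/a.rs", "src/b.rs"], "domains": []},
    {"prompt": "prompt:debug", "tokens": 1500, "seconds": 45.0,
     "files": ["src/a.rs"], "domains": ["crates.io"]},
    {"prompt": "prompt:review", "tokens": 1000, "seconds": 5.0,
     "files": [], "domains": []},
]

# Each view maps a record to the weight that becomes frame width.
VIEWS = {
    "tokens": lambda r: r["tokens"],
    "time": lambda r: r["seconds"],
    "files": lambda r: len(r["files"]),
    "network": lambda r: len(r["domains"]),
}

def project(records, view):
    """Aggregate the chosen weight per prompt tag."""
    weigh = VIEWS[view]
    totals = defaultdict(float)
    for r in records:
        totals[r["prompt"]] += weigh(r)
    return dict(totals)

for view in VIEWS:
    print(view, project(records, view))
```

In this toy data, review leads the tokens view while debug leads the time view, which is why the same trace can answer different questions depending on the weight chosen.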

Real Examples from AgentSight Development

The examples below were generated from AgentSight's own development traces (Claude Code). They demonstrate what insights each view provides.

Tokens View: Where Did the Model Budget Go?

[Figure: Tokens flamegraph]

The token distribution shows that code review (prompt:review) dominated the model budget, followed by git operations (prompt:git), code work (prompt:code), editing (prompt:edit), and debugging (prompt:debug). Through the stack, you can trace which LLM calls each prompt category triggered: call:llm/usage for token statistics events, call:llm/code and call:llm/test for code-related responses, call:llm/tool for tool calls, and call:llm/edit for modification responses.

Time View: Where Did Wall-Clock Time Go?

[Figure: Time flamegraph]

Wall-clock time distribution follows a similar pattern to token consumption: review (prompt:review) leads, followed by git, edit, docs, and code prompts. Continuation prompts (prompt:continue) appear frequently, reflecting a workflow pattern where complex tasks required multiple follow-up exchanges. The prompt:inspect category captures quick look-at-this requests that are common in iterative development.

Files View: Which Parts of the Codebase Were Touched?

[Figure: Files flamegraph]

File access patterns show heavy activity in collector/src/ (the Rust codebase) and collector/Cargo.toml, consistent with development work. External paths (external/tmp, external/home, external/codex) appear frequently, reflecting tool invocations that touch temporary files, home directory configs, and Codex session data. The flamegraph distinguishes between read and write effects, revealing the balance of inspection versus modification across both project and external paths.

Network View: Which External Services Were Contacted?

[Figure: Network flamegraph]

Network activity is sparse relative to file operations, confirming that most development work occurred locally. The contacted domains include anthropic.com for model inference, crates.io for Rust dependencies, github.com for version control, and various localhost ports for local development servers. Process chains visible in the upper frames show which tools initiated network requests, enabling attribution of network activity to specific agent actions.

The Tagging Problem: An Open Challenge

The core technical challenge in semantic flamegraphs is mapping natural language prompts to stable, meaningful tags. This is fundamentally harder than CPU profiling, where function names are deterministic symbols. We have working approaches, not a solved problem, and we are explicit about the limitations.

Why Tagging Is Hard

Consider real prompts from a development session:

"fix the 编译 error"          # Mixed language (编译 = "compile")
"嗯"                          # Single-character confirmation ("mm-hm")
"ok"                          # Ambiguous intent
"继续"                        # Context-dependent ("continue")
"[Session continued...]"      # System-generated
"看看 collector/src/main.rs"  # Inspection request ("take a look at ...")
"为啥 cargo test 失败了"       # Debug question ("why did cargo test fail")

These prompts exhibit properties that break naive classification:

Multilingual mixing: English and Chinese in the same prompt, sometimes in the same sentence
Extreme length variance: From 1 character to multi-paragraph context restorations
Context dependence: "继续" (continue) means nothing without knowing what preceded it
Implicit intent: "嗯" could be a confirmation, an acknowledgment, or a thinking pause
System noise: Auto-generated session continuations, tool outputs, error messages

No single approach handles all cases well. We currently provide three backends, each with different tradeoffs:

Current Approaches

Regex + Agent Iteration: The production default. Rules like prompt:debug='(?i)fix|error|bug|broken|为啥' are pattern-matched against prompt text. The workflow is iterative: run agentpprof, observe unmatched samples, write rules, repeat until coverage exceeds 95%. This typically takes 5-10 rounds for a new project.

Strengths: Deterministic, reproducible, fast, no external dependencies. Rules can be version-controlled and run in CI.

Weaknesses: Requires manual effort per project. Rules are brittle to prompt style changes. Cannot handle semantic similarity (e.g., "fix the bug" vs "resolve the issue").
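The rule-and-iterate loop can be sketched in a few lines. Only the `prompt:debug` pattern below comes from the text above. The `prompt:inspect` and `prompt:continue` rules, the first-match-wins ordering, and the `prompt:unmatched` fallback handling are assumptions for illustration, not agentpprof's actual rule format or engine.

```python
import re

# Ordered rules, first match wins. Only prompt:debug is taken from the article;
# the other two patterns are hypothetical. Treating "ok" as a continuation is
# itself a judgment call, since its intent is ambiguous.
RULES = [
    ("prompt:debug", re.compile(r"(?i)fix|error|bug|broken|为啥")),
    ("prompt:inspect", re.compile(r"(?i)看看|look at|show me")),
    ("prompt:continue", re.compile(r"(?i)^(继续|continue|ok|嗯)$")),
]

def tag(prompt):
    """Return the first matching tag, or prompt:unmatched."""
    text = prompt.strip()
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "prompt:unmatched"

def coverage(prompts):
    """Fraction of prompts that matched some rule, plus the tags themselves."""
    tags = [tag(p) for p in prompts]
    matched = sum(t != "prompt:unmatched" for t in tags)
    return matched / len(prompts), tags

PROMPTS = [
    "fix the 编译 error",
    "嗯",
    "ok",
    "继续",
    "[Session continued...]",
    "看看 collector/src/main.rs",
    "为啥 cargo test 失败了",
]

ratio, tags = coverage(PROMPTS)
for prompt, label in zip(PROMPTS, tags):
    print(f"{label:18} {prompt}")
print(f"coverage: {ratio:.0%}")
```

The unmatched samples are the input to the next round of rule writing. The sketch also shows the brittleness noted above: an unanchored `fix` matches inside "prefix", so a prompt like "add a prefix option" would be tagged `prompt:debug` unless the rule adds word boundaries.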

LLM Tagger: Local inference via llama.cpp with grammar-constrained decoding to ensure valid one-word output. We use small models (0.6B-3B parameters) with aggressive caching.

Strengths: Handles semantic similarity and multilingual prompts. No rule writing required.

Weaknesses: Non-deterministic (the same prompt may get different tags across runs). Requires local model setup. Tag quality depends on model capability. Our experiments show 285/300 exact-stable fragments with a 3B model, meaning the remaining 15 (5%) received different tags on repeated runs.
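The stability figure and the role of caching can be made concrete with a small sketch. It assumes "exact-stable" means every repeated run assigned the same tag to a fragment, and it uses a hypothetical cache keyed on a normalized prompt hash. The article does not describe how agentpprof's cache is keyed, and the tagger here is a stand-in callable rather than a real llama.cpp call.

```python
import hashlib

def cache_key(prompt):
    """Stable key: collapse whitespace, lowercase, then hash (an assumed scheme)."""
    normalized = " ".join(prompt.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class CachedTagger:
    """Wraps any prompt -> tag callable so each distinct prompt is tagged once."""

    def __init__(self, tag_fn):
        self.tag_fn = tag_fn
        self.cache = {}
        self.misses = 0

    def tag(self, prompt):
        key = cache_key(prompt)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.tag_fn(prompt)
        return self.cache[key]

def exact_stable_fraction(runs):
    """runs: per-run tag lists aligned by fragment index.
    A fragment counts as stable only if all runs agree on its tag."""
    n = len(runs[0])
    stable = sum(1 for i in range(n) if len({run[i] for run in runs}) == 1)
    return stable / n

# A deliberately flaky stand-in tagger: it answers differently on each call.
answers = iter(["debug", "review"])
tagger = CachedTagger(lambda prompt: next(answers))
first = tagger.tag("Fix  the bug")
second = tagger.tag("fix the bug")   # same normalized prompt, served from cache

run_a = ["debug", "review", "docs", "git"]
run_b = ["debug", "review", "docs", "code"]
print(first, second, tagger.misses)
print(exact_stable_fraction([run_a, run_b]))
print(285 / 300)
```

Caching of this kind only pins the tag for prompts already seen. It does not make the underlying model deterministic, so a cold cache can still produce a different flamegraph.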

TF-IDF + K-Means Clustering: Unsupervised clustering to discover natural groupings. Automatically selects cluster count (5-25) and generates tag names from cluster keywords.

Strengths: No predefined categories needed. Discovers structure you did not anticipate.

Weaknesses: Cluster boundaries are arbitrary. Tag names are keyword-derived, not semantic. Requires post-hoc interpretation.

What We Do Not Know

Several fundamental questions remain open:

Tag adequacy: We can verify that tags are syntactically valid (our R180 experiment shows 900/900 grammar-valid outputs from three model sizes) and measure how stable they are across runs. But we have no evidence that one-word tags capture enough semantic information for human understanding. "debug" might conflate bug fixing, error investigation, and performance debugging, each of which has different cost implications.

Cross-project transfer: Rules developed for one project may not transfer to another. A Rust systems project has different prompt patterns than a React frontend project. We do not yet know how much rule overlap exists across project types.

Optimal granularity: Should "code review" be one tag, or should it split into "review:style", "review:logic", "review:security"? Finer granularity preserves information but fragments the flamegraph. We have no principled way to choose.

Multilingual normalization: "Fix the bug" and "修一下这个 bug" should probably get the same tag, but regex rules cannot express this. LLM taggers can, but with stability tradeoffs.

Why We Ship Anyway

Despite these limitations, agentpprof is useful in practice. The key insight is that perfect tagging is not required for useful aggregation. Even with 20% unmatched prompts and imperfect tag boundaries, the flamegraph reveals structure that was previously invisible: which activity categories dominate, how token consumption distributes across intent types, which prompts trigger the most tool calls.

The goal is not ground-truth classification but actionable visibility. If the flamegraph shows "review" consuming 40% of tokens, the exact boundary of what counts as "review" matters less than knowing that review-like activities are the dominant cost driver.

We are actively working on:

LLM-assisted rule generation (model proposes rules from unmatched samples)
Embedding-based similarity for multilingual normalization
Human evaluation of tag adequacy (currently missing from our evidence base)

Privacy by Default

Local agent histories can contain prompts, tool outputs, paths, commands, repository names, and model responses. agentpprof is conservative by default:

SVG, pprof, and folded outputs contain stack labels and weights, not raw prompts or model responses.
Absolute paths outside the selected project root are grouped into stable buckets such as external/home, external/tmp, external/codex, and external/claude.
Private-looking domains are collapsed instead of exposing user-specific hostnames.
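The path bucketing in the second point can be sketched as a prefix lookup. The bucket names come from the list above. The matching prefixes (`~/.codex`, `~/.claude`, `/tmp`), the `home` parameter, and the `external/other` fallback are assumptions for illustration and may differ from what agentpprof actually does.

```python
from pathlib import PurePosixPath

def bucket_path(path, project_root, home="/home/user"):
    """Keep project-relative paths; collapse everything else into a stable bucket."""
    p = PurePosixPath(path)
    root = PurePosixPath(project_root)
    if not p.is_absolute():
        return str(p)  # already project-relative
    if p == root or root in p.parents:
        return str(p.relative_to(root))
    h = PurePosixPath(home)
    # More specific prefixes come first so ~/.claude is not swallowed by ~.
    buckets = [
        (h / ".codex", "external/codex"),
        (h / ".claude", "external/claude"),
        (PurePosixPath("/tmp"), "external/tmp"),
        (h, "external/home"),
    ]
    for prefix, label in buckets:
        if p == prefix or prefix in p.parents:
            return label
    return "external/other"  # hypothetical catch-all bucket

root = "/work/agentsight"
for sample in [
    "/work/agentsight/collector/src/main.rs",
    "/tmp/build-123/out.log",
    "/home/user/.claude/projects/x.jsonl",
    "/home/user/.bashrc",
    "/etc/hosts",
]:
    print(bucket_path(sample, root))
```

Comparing whole path components rather than raw string prefixes keeps a sibling directory such as `/work/agentsight-old` from being mistaken for the project root.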

Part of AgentSight

agentpprof is the offline profiling component of AgentSight, an eBPF-based observability framework for monitoring AI agent behavior. While AgentSight provides live visibility through SSL/TLS interception and process monitoring, agentpprof provides aggregate analysis of already-recorded agent traces.

A typical workflow combines both:

Record agent activity with `agentsight record`
Generate summary reports with `agentsight report`
Profile token consumption with `agentpprof --view tokens`
Audit file access patterns with `agentpprof --view files`
Check network destinations with `agentpprof --view network`

For installation and detailed usage, see the AgentSight repository and the agentpprof documentation.

From Visibility to Action: The Harder Problem

Generating a flamegraph is the easy part. The harder question is: what do you do with it?

CPU profilers lead to clear actions: find the hot function, optimize the algorithm, reduce allocations. But agent cost profiles are different:

You will not stop doing code review because it consumes 40% of tokens
You will not skip debugging because it is expensive
The flamegraph shows WHERE budget goes, not WHY it goes there or HOW to reduce it

The actionable insights require drilling deeper:

Within-category analysis: Review consumes 40% of tokens, but is that because of repeated reviews of the same file? Unnecessarily broad context windows? Verbose review prompts? The flamegraph shows the category; understanding the cause requires examining individual sessions.
Workflow pattern detection: Continuation prompts (prompt:continue) appearing frequently may indicate tasks that should be structured differently upfront. High prompt:unmatched rates may indicate prompt styles that need standardization.
Cross-session comparison: Is this month's token distribution different from last month's? Did a workflow change increase debugging costs? Trend analysis requires baseline comparison.
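The cross-session comparison above amounts to diffing per-tag budget shares between two periods. A minimal sketch, assuming each period has already been reduced to a tag-to-token total. The numbers are invented for illustration and the function names are hypothetical.

```python
def shares(totals):
    """Convert absolute per-tag totals into fractions of the whole."""
    total = sum(totals.values())
    return {tag: value / total for tag, value in totals.items()} if total else {}

def share_shift(baseline, current):
    """Percentage-point change in share per tag, largest movers first."""
    b, c = shares(baseline), shares(current)
    tags = set(b) | set(c)
    delta = {t: round(100 * (c.get(t, 0.0) - b.get(t, 0.0)), 1) for t in tags}
    return sorted(delta.items(), key=lambda kv: (-abs(kv[1]), kv[0]))

# Invented token totals for two periods.
last_month = {"prompt:review": 400_000, "prompt:debug": 200_000, "prompt:docs": 400_000}
this_month = {"prompt:review": 400_000, "prompt:debug": 500_000, "prompt:docs": 100_000}

for tag, points in share_shift(last_month, this_month):
    print(f"{tag:16} {points:+.1f} pp")
```

Comparing shares rather than raw totals keeps the comparison meaningful when overall usage changes between periods, at the cost of hiding absolute growth.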

We are working on combining agentpprof with interaction analysis to produce reports that recommend specific changes: CLAUDE.md rules to prevent repeated file reviews, prompt templates to reduce context overhead, workflow restructuring to minimize continuation churn.

Current Limitations

Agent coverage: Currently reads Codex and Claude Code local traces only. Gemini, Cursor, and other agents require parser extensions via the agent-session crate.

Tagging: As discussed above, semantic tagging remains an open challenge. Project-specific rules are required, and we do not yet have evidence that one-word tags are semantically adequate.

Validation: We have mechanism evidence (the flamegraph correctly aggregates by tag) but not user evidence (developers make better decisions with this view). The latter requires user studies we have not yet conducted.

Cost attribution: Token counts come from agent-reported usage, which may not reflect actual billing (cached tokens, batch discounts, model-specific pricing). The flamegraph shows relative distribution, not dollar amounts.

agentpprof is open source and part of the AgentSight project. Contributions and feedback are welcome.

Last updated: Jul 1, 2026
First published: Jun 24, 2026
Contributors: 云微, LinuxDev9002
