iaprof: AI/GPU Flame Graph Profiler source code analysis
Background and Motivation
The GPU Performance Visibility Problem
Modern AI and GPU workloads present unique profiling challenges:
-
Execution Disconnect: Application code runs on the CPU while compute-intensive work executes on the GPU, creating a visibility gap between what developers write and where performance bottlenecks actually occur.
-
Deep Stack Complexity: A single GPU kernel launch involves multiple software layers:
- High-level frameworks (PyTorch, TensorFlow, JAX)
- Runtime libraries (Level Zero, Vulkan, OpenCL, CUDA)
- Kernel drivers (i915, Xe)
-
GPU hardware
-
Instruction-Level Bottlenecks: GPU performance issues often manifest at specific shader instructions (memory access patterns, control flow, synchronization), not just at the kernel level.
-
Hardware Sampling Overhead: Traditional profilers either:
- Use software instrumentation (high overhead, changes behavior)
- Provide only GPU-side metrics (disconnected from CPU code)
- Require code modifications (not practical for production workloads)
Why Existing Tools Fall Short
General-Purpose Profilers (perf, VTune): - Show CPU execution well but treat GPU as a black box - Cannot attribute GPU stalls to specific CPU call paths - Miss the correlation between framework code and GPU bottlenecks
GPU-Specific Profilers (Intel GPA, nvidia-smi, rocprof): - Provide detailed GPU metrics but lack CPU context - Don't show which application code triggered slow kernels - Require manual correlation between CPU and GPU timelines
Flame Graphs (traditional): - Excellent for CPU profiling (Brendan Gregg's innovation) - Don't integrate GPU execution - Can't show instruction-level GPU performance
The iaprof Solution
iaprof bridges this gap by creating AI Flame Graphs - a unified visualization that:
✓ Connects CPU to GPU: Shows complete call stacks from Python/C++ application code down to specific GPU instructions ✓ Uses Hardware Sampling: Low-overhead EU (Execution Unit) stall sampling via Intel's Observability Architecture ✓ Requires No Code Changes: Uses eBPF kernel tracing to intercept GPU driver calls transparently ✓ Instruction-Level Attribution: Associates hardware stall samples with specific shader instructions and CPU call paths ✓ Interactive Visualization: Generates clickable, searchable flame graphs for rapid bottleneck identification
Key Innovation: The Complete Stack
Traditional profilers show either CPU or GPU. iaprof shows the complete execution path:
Python Application
↓
PyTorch Framework
↓
Level Zero Runtime
↓
Xe/i915 Kernel Driver ← eBPF tracing captures this
↓
GPU Hardware ← OA sampling captures this
↓
Shader Instructions ← Debug API captures binaries
↓
EU Stall Metrics ← Hardware counters
All of this appears as a single, unified flame graph where you can: - Click a GPU instruction to see which CPU function called it - Search for a framework function to see which GPU code it triggered - Identify the slowest shader instruction and trace back to source code
Executive Summary
iaprof is a sophisticated GPU profiling tool that creates AI Flame Graphs - interactive visualizations linking CPU execution to GPU performance bottlenecks at the instruction level. It achieves this by combining:
- eBPF kernel tracing - Intercepts GPU driver calls to track kernel launches and memory mappings without code changes
- Hardware sampling - Collects EU (Execution Unit) stall samples via Intel's Observability Architecture with <5% overhead
- Debug API - Retrieves shader binaries and detailed execution information for instruction-level analysis
- Symbol resolution - Resolves CPU and kernel stacks to human-readable symbols with file/line information
The result is an interactive flame graph showing the complete execution path from application code (Python, C++, etc.) through runtime libraries (PyTorch, Level Zero, Vulkan) and kernel driver to specific GPU shader instructions, each annotated with hardware stall metrics (memory access, control flow, synchronization, etc.).
Use Cases: - Optimize AI training/inference workloads - Identify GPU bottlenecks in rendering pipelines - Understand framework overhead in ML applications - Validate GPU compiler optimizations - Performance regression testing
Supported Hardware: Intel Data Center GPU Max (Ponte Vecchio), Intel Arc B-series (Battlemage), Xe2-based GPUs
System Overview
┌─────────────────────────────────────────────────────────────────────┐
│ User Application │
│ (e.g., PyTorch, TensorFlow, SYCL, Vulkan, etc.) │
└────────────────────────────┬────────────────────────────────────────┘
│ API calls
┌────────────────────────────▼────────────────────────────────────────┐
│ Runtime Libraries │
│ (Level Zero, Vulkan, OpenCL, oneDNN, cuDNN, etc.) │
└────────────────────────────┬────────────────────────────────────────┘
│ ioctl()
┌────────────────────────────▼────────────────────────────────────────┐
│ Linux Kernel │
│ ┌──────────────────┐ ┌─────────────────┐ ┌────────────────────┐ │
│ │ DRM Subsystem │ │ i915/Xe Driver │ │ eBPF Subsystem │ │
│ │ │◄─┤ (GPU Driver) │ │ │ │
│ │ - Device mgmt │ │ - Memory mgmt │◄─┤ - iaprof BPF │ │
│ │ - IOCTL routing │ │ - Execution │ │ programs │ │
│ │ │ │ - Batch buffers│ │ - Ringbuffer │ │
│ └──────────────────┘ └────────┬────────┘ └─────────┬──────────┘ │
│ │ │ │
│ │ Commands │ Events │
│ ▼ │ │
│ ┌────────────────────────────────────────────────────┼──────────┐ │
│ │ Intel GPU Hardware │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌─────────────────┐ │ │ │
│ │ │ EU Array │ │ L1/L3/RAM│ │ OA (Observ. │ │ │ │
│ │ │ (Compute)│ │ (Memory) │ │ Architecture) │◄─┘ │ │
│ │ │ │ │ │ │ - Sampling HW │ │ │
│ │ │ Execution│ │ Data │ │ - Perf counters│ │ │
│ │ └──────────┘ └──────────┘ └─────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────┬──────────────────────────────┘
│
│ 1. eBPF events (ringbuffer)
│ 2. EU stall samples (debug API)
│ 3. Shader uploads (debug API)
▼
┌─────────────────────────────────────────────────────────────────────┐
│ iaprof │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ BPF Collector│ │Debug Collect.│ │EUStall Coll. │ │
│ │ │ │ │ │ │ │
│ │ - Load BPF │ │ - Shader bin │ │ - HW samples │ │
│ │ - Parse BB │ │ - Context │ │ - Stall attr.│ │
│ │ - Stacks │ │ tracking │ │ -ibution │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └────────┬────────┴────────┬────────┘ │
│ ▼ ▼ │
│ ┌────────────────────────────────┐ │
│ │ GPU Kernel Store (Tree) │ │
│ │ - Shader metadata & binaries │ │
│ │ - CPU/kernel stacks │ │
│ │ - Per-offset stall counts │ │
│ └───────────────┬────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────┐ │
│ │ Stack Printer │ │
│ │ - Symbol resolution │ │
│ │ - Stack deduplication │ │
│ │ - Folded stack output │ │
│ └───────────────┬────────────────┘ │
│ │ │
└─────────────────────────┼────────────────────────────────────────────┘
│
▼
Folded Stacks
(CPU_stack;GPU_kernel;GPU_instruction stall_counts)
│
▼
FlameGraph.pl
│
▼
Interactive SVG
(AI/GPU Flame Graph)
Core Components
1. eBPF Subsystem (Kernel Space)
Location: src/collectors/bpf/bpf/*.bpf.c
Purpose: Trace GPU driver operations in real-time without modifying kernel source
Key eBPF Programs
a) Execution Buffer Tracing (i915/execbuffer.bpf.c
, xe/exec.bpf.c
)
Hooks: i915_gem_do_execbuffer
(fexit) / xe_exec_ioctl
(fexit)
Captures:
- Execution buffer ID
- VM and context IDs
- CPU user-space stack (via bpf_get_stack
with BPF_F_USER_STACK
)
- Kernel stack (via bpf_get_stack
)
- Process metadata (PID, TID, CPU, timestamp, command name)
- Batch buffer location (GPU address, CPU mapping)
Sends: struct execbuf_info
to ringbuffer
b) Batch Buffer Parser (batchbuffer.bpf.c
)
Runs in: Same hook context as execbuffer
Process:
1. Translates GPU batch buffer address to CPU address using gpu_cpu_map
2. Reads batch buffer from userspace memory (bpf_probe_read_user
)
3. Parses GPU commands DWORD-by-DWORD:
- Identifies command type from bits 31:29
- Calculates command length (static or dynamic based on command)
- Follows BATCH_BUFFER_START
to chained buffers (up to 3 levels deep)
- Extracts Kernel State Pointers (KSPs) from COMPUTE_WALKER
and 3DSTATE commands
- Extracts System Instruction Pointer (SIP) from STATE_SIP
4. If buffer incomplete (NOOPs encountered): Defers parsing with BPF timer
5. Emits KSPs and SIP to ringbuffer
Challenges: - eBPF verifier constraints (max instruction count, bounded loops) - Buffer may not yet be written when execbuffer called - Nested batch buffers require state tracking
c) Memory Mapping Tracing (i915/mmap.bpf.c
, xe/mmap.bpf.c
)
Hooks:
- i915_gem_mmap_offset_ioctl
(fexit) - Records fake offset
- i915_gem_mmap
(fexit) - Captures CPU address, associates with handle
- unmap_region
(fentry) - Cleans up mappings on munmap
Maintains: gpu_cpu_map
(GPU addr → CPU addr) and cpu_gpu_map
(reverse)
d) VM Bind Tracing (i915/vm_bind.bpf.c
, xe/vm_bind.bpf.c
)
Hooks:
- i915_gem_vm_bind_ioctl
/ xe_vm_bind
(fexit)
- i915_gem_vm_unbind_ioctl
/ xe_vm_unbind
(fentry)
Purpose: For discrete GPUs, maps GPU virtual addresses to buffer handles and CPU addresses
e) Context Tracking (i915/context.bpf.c
, xe/context.bpf.c
)
Tracks: Context ID → VM ID mapping
Needed: To resolve which VM an execbuffer belongs to
eBPF Maps
Key BPF maps:
rb
(ringbuffer): Main output channel, 512 MBgpu_cpu_map
: Hash table, GPU address → CPU address/sizecpu_gpu_map
: Reverse lookupdeferred_parse
: Batch buffers awaiting retrydeferred_timers
: BPF timer wrappers for deferred parsingbb_ksps
: Per-CPU hash table of discovered KSPscontext_create_wait_for_exec
: Context → VM ID mappings
2. Userspace BPF Collector
Location: src/collectors/bpf/bpf_collector.c
Responsibilities:
- Initialization:
- Loads compiled eBPF object file (
.bpf.o
) - Resolves BTF relocations
- Attaches programs to kernel tracepoints
-
Creates ringbuffer and maps file descriptors
-
Event Loop (runs in dedicated thread):
epoll_wait()
on ringbuffer file descriptorring_buffer__poll()
to consume events-
Dispatches to event handlers based on event type
-
Event Handlers:
- EXECBUF: Creates or updates shader in GPU kernel store
- EXECBUF_END: Marks batch buffer parsing complete
- KSP: Adds kernel pointer to shader
- SIP: Adds system routine pointer
- UPROBE_* (if enabled): User-space probe events
3. EU Stall Collector
Location: src/collectors/eustall/eustall_collector.c
Purpose: Collect and attribute hardware EU stall samples
Data Flow
GPU Hardware Sampling
↓
OA / Debug Interface
↓
struct eustall_sample {
uint32_t ip; // GPU instruction pointer (shifted by 3)
uint64_t active; // Active execution count
uint64_t control; // Control flow stall count
uint64_t send; // Memory access stall count
// ... more stall types
}
↓
handle_eustall_sample()
├─→ addr = ip << 3
├─→ shader = acquire_containing_shader(addr)
├─→ if shader found:
│ ├─→ offset = addr - shader->gpu_addr
│ └─→ shader->stall_counts[offset] += sample counts
└─→ else:
└─→ add to deferred_eustall_waitlist
Stall Types
From hardware sampling, categorized as:
- Active: EU actively executing (not stalled)
- Control: Control flow dependencies (branches, predication)
- Pipestall: Pipeline resource stalls
- Send: Memory/sampler/URB send stalls
- Dist_acc: Distributed accumulator stalls
- SBID: Scoreboard ID stalls (dependency chains)
- Sync: Synchronization (barriers, fences)
- Inst_fetch: Instruction fetch stalls
- TDR (Xe only): Thread dispatch/retire stalls
Deferred Attribution
When EU stall arrives before corresponding shader metadata:
struct deferred_eustall {
struct eustall_sample sample;
int satisfied; // 0 = still waiting, 1 = attributed
};
Separate thread periodically retries deferred samples when new shaders arrive.
4. Debug Collector
Location: src/collectors/debug/debug_collector.c
Purpose: Interface with GPU debug API for detailed execution information
Capabilities
- Shader Binary Collection:
- Receives shader upload events from driver
- Copies shader binary to shader store
-
Enables instruction-level disassembly
-
Context Tracking:
- Monitors context create/destroy events
-
Tracks VM associations
-
Execution Control (for deterministic sampling):
- Can SIGSTOP GPU processes for synchronized sampling
-
Triggers EU attention events
-
Platform-Specific Interfaces:
- i915: Uses
PRELIM_DRM_I915_DEBUG_*
ioctls (requires patched kernel) - Xe: Uses
DRM_XE_EUDEBUG_*
ioctls (mainline support)
5. OA Collector
Location: src/collectors/oa/oa_collector.c
Purpose: Initialize Observability Architecture hardware
Initialization Sequence
- Open DRM device
- Write to OA control registers:
- Enable OA unit
- Configure sampling period
- Select counter set (EU stalls)
- Enable triggers
- Start sampling
Interfaces with EU stall collector to provide sample stream.
6. GPU Kernel Store
Location: src/stores/gpu_kernel.c
Data Structure:
tree(uint64_t, shader_struct) shaders; // Global, sorted by GPU address
pthread_rwlock_t shaders_lock; // Protects tree operations
struct shader {
pthread_mutex_t lock; // Per-shader lock
// Identity
uint64_t gpu_addr;
uint64_t size;
enum shader_type type; // SHADER, SYSTEM_ROUTINE, DEBUG_AREA
// Attribution
uint32_t pid;
uint64_t proc_name_id, ustack_id, kstack_id;
uint64_t symbol_id, filename_id;
int linenum;
// Binary
unsigned char *binary;
// Profiling data
hash_table(uint64_t, offset_profile_struct) stall_counts;
};
Thread Safety:
Two-level locking: 1. RW-lock on tree (allows concurrent reads, exclusive writes) 2. Mutex per shader (protects shader fields)
Operations:
acquire_containing_shader(addr)
: Range query, finds shader whereaddr ∈ [gpu_addr, gpu_addr+size)
acquire_or_create_shader(addr)
: Gets existing or allocates newrelease_shader(shader)
: Unlocks shader mutex
7. Stack Printer
Location: src/printers/stack/stack_printer.c
Purpose: Transform collected data into folded stacks for flame graphs
Symbol Resolution
Kernel Symbols:
struct ksyms *ksyms = ksyms_load(); // Loads /proc/kallsyms
const struct ksym *sym = ksyms_search(ksyms, addr);
User-Space Symbols:
struct syms_cache *cache = syms_cache_create(); // Per-PID cache
const struct sym *sym = syms_cache_get(cache, pid, addr);
// Parses ELF + DWARF to get function, file, line
Output Generation
For each shader:
For each offset with stall counts:
1. Resolve CPU user stack to symbols: main;worker;submit_kernel
2. Append GPU kernel name: main;worker;submit_kernel;my_compute_kernel
3. Disassemble GPU instruction at offset: mad(8) r10.0<1>:f ...
4. Format stall counts: active=1234,send=567,control=89
5. Output: main;worker;submit_kernel;my_compute_kernel;mad(8) active=1234,send=567
Optimizations: - Stack string caching (hash table: stack → folded string) - Symbol caching (per-PID symbol tables) - Batch symbol lookups
Execution Flow
Startup Sequence
1. main(argc, argv)
↓
2. Parse command (record / flame / flamescope)
↓
3. record(argc, argv)
↓
4. check_permissions() - Verify root or CAP_PERFMON
↓
5. drm_open_device() - Open /dev/dri/card0
↓
6. drm_get_driver_name() - Detect i915 vs Xe
↓
7. Initialize collectors:
a. init_bpf()
├─→ bpf_object__open_file("main.bpf.o")
├─→ bpf_object__load(obj) // Loads into kernel
└─→ attach tracepoints
b. init_oa()
└─→ Configure OA hardware registers
c. init_debug()
└─→ debug_attach(pid=0) // Attach to all processes
d. init_eustall()
└─→ Initialize waitlists, start deferred thread
↓
8. Spawn collector threads:
- bpf_collect_thread()
- eustall_collect_thread()
- debug_collect_thread()
- eustall_deferred_attrib_thread()
↓
9. If command specified:
fork() + exec() child process
Else:
Wait for SIGINT (Ctrl-C)
Data Collection (Multi-threaded)
BPF Thread:
while (!should_stop) {
epoll_wait(rb_fd) // Block until events
ring_buffer__poll(rb, timeout)
↓ callback for each event
handle_execbuf_event()
├─→ shader = acquire_or_create_shader(gpu_addr)
├─→ shader->pid = execbuf->pid
├─→ shader->ustack_id = store_stack(execbuf->ustack)
└─→ release_shader(shader)
handle_ksp_event()
├─→ shader = acquire_shader(ksp->gpu_addr)
├─→ shader->type = SHADER_TYPE_SHADER
└─→ release_shader(shader)
}
EU Stall Thread:
while (!should_stop) {
debug_read_eustall_sample(&sample) // Block until sample
handle_eustall_sample(&sample)
├─→ addr = sample.ip << 3
├─→ shader = acquire_containing_shader(addr)
├─→ if shader:
│ ├─→ offset = addr - shader->gpu_addr
│ ├─→ profile = shader->stall_counts[offset]
│ ├─→ profile->active += sample.active
│ └─→ ... (other stall types)
└─→ else:
└─→ add to deferred_eustall_waitlist
}
Debug Thread:
while (!should_stop) {
debug_wait_event(&event) // Block until event
switch (event.type) {
case SHADER_UPLOAD:
shader = acquire_shader(event.addr)
shader->binary = malloc(event.size)
memcpy(shader->binary, event.data, event.size)
release_shader(shader)
signal_deferred_attrib_thread() // Wake deferred thread
break;
case CONTEXT_CREATE:
track_context_vm_mapping(event)
break;
}
}
Deferred Attribution Thread:
while (!should_stop) {
pthread_cond_wait(&deferred_cond) // Wait for signal
for each deferred_sample in waitlist:
shader = acquire_containing_shader(deferred_sample.addr)
if shader:
associate_sample(&deferred_sample, shader)
deferred_sample.satisfied = 1
release_shader(shader)
remove satisfied samples from waitlist
}
Shutdown and Output
SIGINT received
↓
collect_threads_should_stop = STOP_REQUESTED
↓
Join all collector threads
↓
Process remaining deferred samples (final retry)
↓
FOR_SHADER (shader in shaders tree):
pthread_rwlock_rdlock(&shaders_lock)
for each (offset, stall_profile) in shader->stall_counts:
cpu_stack_str = resolve_stack(shader->ustack_id, shader->pid)
kernel_name = get_kernel_name(shader)
inst_str = disassemble(shader->binary, offset)
stalls_str = format_stalls(stall_profile)
printf("%s;%s;%s %s\n", cpu_stack_str, kernel_name, inst_str, stalls_str)
pthread_rwlock_unlock(&shaders_lock)
↓
Cleanup:
deinit_bpf()
deinit_debug()
deinit_eustall()
free_profiles()
↓
Exit
Key Design Decisions
1. Why eBPF?
Pros: - No kernel module required - Safe (verified by kernel) - High performance (JIT compiled) - Comprehensive tracing capabilities
Cons: - Complexity constraints (verifier limits) - Requires BTF type information - Requires recent kernel (5.8+)
2. Why In-Kernel Batch Buffer Parsing?
Alternative: Parse in userspace after capturing buffer
Chosen: Parse in kernel with eBPF
Rationale: - Userspace copy would be expensive (buffers can be MB) - Need to capture before buffer reused/unmapped - Deferred parsing handles incomplete buffers
Trade-off: Higher kernel overhead, but more reliable
3. Why Deferred Attribution?
Problem: EU stall samples can arrive before shader metadata
Solution: Waitlist with retry mechanism
Triggered by: - New shader arrival (debug collector signals) - Periodic retry (every N seconds) - Final cleanup (at shutdown)
4. Why Two-Level Locking?
Alternative: Global lock for entire shader tree
Chosen: RW-lock on tree + per-shader mutex
Benefits: - Multiple readers can traverse concurrently - Stall attribution doesn't block other operations - Fine-grained locking reduces contention
5. Why Folded Stack Format?
Alternative: Custom binary format, JSON, protobuf
Chosen: Text-based folded stacks
Rationale: - Compatible with existing FlameGraph tools - Human-readable - Easy to process with standard tools (grep, sed, awk) - Compact (deduplication via semicolon syntax)
Performance Characteristics
Overhead
i915 Driver: - BPF Collection: 10-20% (batch buffer parsing expensive) - EU Stall: 2-5% (hardware sampling + attribution) - Debug: 1-3% (event-driven) - Total: 15-30% (workload-dependent)
Xe Driver: - BPF Collection: 3-5% (simpler batch buffer handling) - EU Stall: 2-5% - Debug: 1-3% - Total: 5-10% (generally lower)
Scalability
Number of Shaders: O(log n) lookup, O(n) iteration
Number of EU Stall Samples: O(1) attribution (hash table)
Stack Depth: Linear in depth (but capped at 512)
Multi-GPU: Currently limited to first GPU found
Memory Usage
Typical: - Shader store: 100-500 MB (depends on number/size of shaders) - eBPF ringbuffer: 512 MB (fixed) - Symbol caches: 50-200 MB (per-PID symbol tables) - Total: 1-2 GB
Platform Support
Supported Hardware
Platform | Architecture | Driver | Status |
---|---|---|---|
Ponte Vecchio (PVC) | Xe HPC | i915 (patched) | Fully supported |
Battlemage (BMG) | Xe2 | Xe | Fully supported |
Lunar Lake | Xe2 | Xe | Supported (iGPU) |
Meteor Lake | Xe | Xe | Partial support |
Older Gen9-Gen12 | Various | i915 | Limited support |
Kernel Requirements
- Linux 5.8+: eBPF CO-RE support
- BTF:
/sys/kernel/btf/vmlinux
must exist - i915 debug: Requires Intel backport kernel with
PRELIM_DRM_I915_DEBUG
patches - Xe debug: Mainline support in Linux 6.2+, full support in 6.6+
Future Enhancements
Planned Features
- Multi-GPU Profiling: Profile multiple GPUs simultaneously
- JSON Output: Structured output for analysis tools
- Differential Profiling: Compare two profile runs
- Continuous Profiling: Lower overhead, always-on mode
- TUI: Terminal UI for live profiling
- Remote Profiling: Profile remote systems
- Container Support: Profile containerized GPU workloads
Research Directions
- ML-Based Bottleneck Detection: Automatically identify performance issues
- Predictive Profiling: Suggest optimizations based on patterns
- Cross-Platform Support: NVIDIA, AMD GPU support
- Distributed Profiling: Multi-node GPU cluster profiling
References
External Resources
- FlameGraph - Original flame graph concept
- AI Flame Graphs - iaprof announcement blog
- eBPF Documentation - eBPF programming guide
- Intel GPU Architecture - Gen11 architecture (similar to PVC/Xe)
Internal Documentation
src/README.md
- Source code overviewsrc/collectors/bpf/README.md
- BPF collector detailssrc/stores/README.md
- GPU kernel store internalsdocs/README.pvc.md
- PVC-specific setup (if exists)docs/README.bmg.md
- Battlemage-specific setup (if exists)
Flame Graph Generation Pipeline
The transformation from raw profiling data to interactive flame graphs involves multiple stages, each adding value and enabling different analyses.
Stage 1: Data Collection (record command)
Input: GPU execution + hardware sampling
Output: Intermediate profile format (structured text)
Location: src/commands/record.c
, collectors
Collected Data Per Sample
For each EU stall sample, the record command collects:
- Process Context:
- PID, TID, CPU, timestamp
-
Process name (16 chars)
-
CPU Execution Context:
- User-space stack (up to 512 frames)
-
Kernel stack (up to 512 frames)
-
GPU Execution Context:
- GPU instruction pointer (IP)
- Shader GPU address
-
Shader binary (from debug collector)
-
Performance Metrics:
- EU stall type breakdown (active, control, send, sync, etc.)
- Sample counts per stall type
Output Format
The record command outputs a custom structured format:
STRING;<id>;<string_value>
PROC_NAME;<proc_name_id>;<pid>
USTACK;<ustack_id>;<frame1>;<frame2>;...
KSTACK;<kstack_id>;<frame1>;<frame2>;...
GPU_FILE;<gpu_file_id>;<filename>
GPU_SYMBOL;<gpu_symbol_id>;<symbol_name>
INSN_TEXT;<insn_text_id>;<instruction_text>
EUSTALL;<proc_name_id>;<pid>;<ustack_id>;<kstack_id>;<gpu_file_id>;<gpu_symbol_id>;<insn_text_id>;<stall_type_id>;<offset>;<count>
Key Features: - String Interning: Repeated strings stored once with ID references - Stack Deduplication: Identical stacks share same ID - Structured Events: Different event types (STRING, EUSTALL, etc.)
Example:
STRING;1;python3.11
STRING;2;libpython3.11.so.1.0
STRING;3;PyEval_EvalFrameDefault
STRING;4;libze_loader.so.1
STRING;5;zeCommandListAppendLaunchKernel
PROC_NAME;1;12345
USTACK;100;0x7f8000001234;0x7f8000005678;0x7f800000abcd
KSTACK;200;0xffffffffc0123456;0xffffffffc0234567
GPU_FILE;10;my_kernel.cpp
GPU_SYMBOL;11;my_compute_kernel
INSN_TEXT;20;mad(8) r10.0<1>:f r2.0:f r3.0:f r4.0:f
EUSTALL;1;12345;100;200;10;11;20;1;0x40;1234
This represents:
- Process: python3.11 (PID 12345)
- User stack: 3 frames (resolved later)
- Kernel stack: 2 frames (resolved later)
- GPU kernel: my_compute_kernel in my_kernel.cpp
- Instruction: mad(8) ...
at offset 0x40
- Stall count: 1234 samples
Stage 2: Intermediate Processing (flame command)
Input: Record output (structured text)
Output: Folded stacks (semicolon-separated)
Location: src/commands/flame.c
Processing Steps
-
Parse Structured Input:
-
Aggregate Samples:
- Uses hash table to deduplicate identical stacks
- Key:
(proc_name, pid, ustack_id, kstack_id, gpu_file, gpu_symbol, insn_text, stall_type, offset)
-
Value: Accumulated count
-
Resolve String IDs:
- Converts numeric IDs back to strings
-
Looks up in string table built from STRING events
-
Format Folded Stacks:
const char *flame_fmt = "%s;%u;%s%s-;%s_[G];%s_[G];%s_[g];%s_[g];0x%lx_[g]"; printf(flame_fmt, proc_name, // Process name pid, // Process ID ustack, // User stack (semicolon-separated) kstack, // Kernel stack (semicolon-separated) gpu_file, // GPU source file gpu_symbol, // GPU kernel name insn_text, // GPU instruction stall_type, // Stall type offset); // Instruction offset printf(";%lu\n", count); // Sample count
Folded Stack Format
Format: frame1;frame2;...;frameN count
Annotations (for coloring):
- _[G]
: GPU frame (uppercase G) - Major GPU components
- _[g]
: GPU instruction frame (lowercase g) - Individual instructions
- -
: Stack separator (between user and kernel)
Example Output:
python3.11;12345;PyEval_EvalFrameDefault;zeCommandListAppendLaunchKernel-;i915_gem_do_execbuffer_[G];my_kernel.cpp_[G];my_compute_kernel_[G];mad(8)_[g];active_[g];0x40_[g];1234
Breaking down this stack (bottom to top):
1. python3.11
- Process name
2. 12345
- PID
3. PyEval_EvalFrameDefault
- Python interpreter frame
4. zeCommandListAppendLaunchKernel
- Level Zero API call
5. -
- User/kernel separator
6. i915_gem_do_execbuffer
- Kernel driver function
7. my_kernel.cpp
- GPU source file
8. my_compute_kernel
- GPU kernel function
9. mad(8)
- GPU instruction (multiply-add, 8-wide SIMD)
10. active
- Stall type
11. 0x40
- Instruction offset
12. Count: 1234
samples
Stage 3: Flame Graph Rendering (flamegraph.pl)
Input: Folded stacks
Output: Interactive SVG
Location: deps/flamegraph/flamegraph.pl
This is Brendan Gregg's FlameGraph toolkit, invoked as:
Rendering Process
-
Parse Folded Stacks:
-
Build Call Tree:
- Creates hierarchical tree structure
- Each node:
{name, count, children}
-
Merges common prefixes (same call path)
-
Calculate Widths:
- Total width = image width (default 1200px)
- Each frame width = (count / total_samples) × image_width
-
Frames < minimum width omitted (default 0.1px)
-
Assign Colors:
--colors=gpu
uses custom GPU palette- Annotation-based coloring:
_[G]
: Bright blue/cyan (major GPU components)_[g]
: Light blue (GPU instructions)-
: Gray (stack separators)- Others: Warm colors (yellow/orange/red)
-
Hue varies by function name hash (consistent across runs)
-
Generate SVG:
- Rectangle per frame:
<rect x="..." y="..." width="..." height="16" fill="..." />
- Text label:
<text x="..." y="...">function_name</text>
- JavaScript for interactivity:
- Mouse-over: Tooltip with function name + percentage
- Click: Zoom to subtree
- Ctrl+F: Search and highlight
- Reset zoom button
SVG Structure
<svg xmlns="http://www.w3.org/2000/svg" width="1200" height="..." >
<!-- Title -->
<text y="24">AI Flame Graph</text>
<!-- Flame graph frames -->
<g class="func_g">
<title>python3.11 (1234 samples, 45%)</title>
<rect x="0" y="100" width="540" height="16" fill="rgb(230,150,80)" />
<text x="5" y="110">python3.11</text>
</g>
<g class="func_g">
<title>PyEval_EvalFrameDefault (1000 samples, 36%)</title>
<rect x="0" y="84" width="450" height="16" fill="rgb(240,160,90)" />
<text x="5" y="94">PyEval_EvalFrameDefault</text>
</g>
<!-- GPU frames (blue) -->
<g class="func_g">
<title>my_compute_kernel (500 samples, 18%)</title>
<rect x="0" y="36" width="225" height="16" fill="rgb(80,150,230)" />
<text x="5" y="46">my_compute_kernel</text>
</g>
<!-- GPU instruction (light blue) -->
<g class="func_g">
<title>mad(8) r10 (200 samples, 7%)</title>
<rect x="0" y="20" width="90" height="16" fill="rgb(150,200,250)" />
<text x="5" y="30">mad(8) r10</text>
</g>
<!-- JavaScript for interactivity -->
<script type="text/ecmascript">
<![CDATA[
function zoom(e) { ... }
function search() { ... }
function reset() { ... }
]]>
</script>
</svg>
Complete End-to-End Example
1. Run Profiling:
2. Profile Output (profile.txt, excerpt):
STRING;1;my_app
STRING;2;main
STRING;3;compute_kernel
STRING;4;libze_loader.so.1
STRING;5;zeCommandListAppendLaunchKernel
PROC_NAME;1;12345
USTACK;100;0x401234;0x4056ab;0x7f80001234
GPU_SYMBOL;10;compute_kernel
INSN_TEXT;20;send(8) null r20
EUSTALL;1;12345;100;0;0;10;20;0;0x80;5000
3. Generate Folded Stacks:
4. Folded Output (folded.txt):
5. Generate Flame Graph:
6. View in Browser:
Flame Graph Interpretation
Visual Encoding
- X-axis: Alphabetical ordering (NOT time!)
- Y-axis: Stack depth (bottom = entry point, top = sampled function)
- Width: Proportional to sample count (wider = more time/samples)
- Color:
- Warm (yellow/orange/red): CPU code
- Blue: GPU major components
- Light blue: GPU instructions
- Gray: Stack separators
Reading the Graph
Identify Hotspots: - Wide towers = frequently sampled = hot code paths - Look for wide boxes at the top = time spent in that function
Trace Call Paths: - Bottom to top = caller to callee - Follow a tower upward to see full call stack
Compare Alternatives: - Wide towers for function A vs narrow for function B = A is slower - Example: Three matrix multiply implementations side-by-side
Find Optimization Targets: - Wide GPU instruction frames = stall-heavy instructions - Wide CPU frames before GPU = submission overhead - Many small boxes = fragmented execution
Advanced Features
FlameScope (Time-Series Flame Graphs)
Command:
Output: Multiple flame graphs, one per 100ms interval
Use Cases: - Warm-up analysis (first intervals vs later) - Phase detection (compute phase vs memory phase) - Temporal patterns (periodic spikes)
Differential Flame Graphs
Workflow:
1. Profile baseline: iaprof record > baseline.txt
2. Apply optimization
3. Profile optimized: iaprof record > optimized.txt
4. Generate diff: difffolded.pl baseline.txt optimized.txt | flamegraph.pl --negate > diff.svg
Colors: - Red: Increased in optimized version (regression) - Blue: Decreased in optimized version (improvement)
Search and Filter
In the interactive SVG: - Ctrl+F: Search for function/pattern - Click frame: Zoom to subtree - Mouse-over: See exact sample counts and percentages
Customization Options
Color Palettes
# GPU-specific (default for iaprof)
flamegraph.pl --colors=gpu
# Memory access patterns
flamegraph.pl --colors=mem
# I/O operations
flamegraph.pl --colors=io
# Java-specific
flamegraph.pl --colors=java
Sizing
# Larger image
flamegraph.pl --width=2400 --height=24
# Filter small functions
flamegraph.pl --minwidth=0.5 # 0.5% of total time
Titles and Labels
flamegraph.pl \
--title="PyTorch Training Profile" \
--subtitle="Model: ResNet-50, Batch: 64" \
--countname="stalls" \
--nametype="Frame:"
Performance Considerations
Flame Command
- Aggregation: O(n) where n = number of samples
- Hash table: O(1) average lookup
- Memory: ~100 bytes per unique stack
- Typical: 1M samples → 5-10 seconds, <1GB RAM
FlameGraph.pl
- Parsing: O(n) where n = folded stack lines
- Tree building: O(n × d) where d = avg stack depth
- SVG generation: O(m) where m = unique frames
- Typical: 100K unique stacks → 10-30 seconds, <500MB RAM
Optimization Tips
- Filter during collection: Use
--quiet
to reduce overhead - Interval-based: Use
--interval
for time-series, reduces total data - Post-process filtering:
grep
folded stacks before flamegraph.pl - Parallel processing: Multiple
flamegraph.pl
invocations for different filters
Troubleshooting
Missing Symbols
Problem: Flame graph shows hex addresses instead of function names
Solutions:
- Install debug symbols: apt install libc6-dbg
- Compile with frame pointers: -fno-omit-frame-pointer
- Check BTF available: ls /sys/kernel/btf/vmlinux
Incomplete Stacks
Problem: Stacks end prematurely (don't reach main)
Solutions:
- Increase stack depth: sysctl kernel.perf_event_max_stack=512
- Enable frame pointers in all libraries
- Check for frame pointer optimizations
Wrong Colors
Problem: GPU frames not colored blue
Solutions:
- Ensure --colors=gpu
flag
- Check annotations in folded stacks (_[G]
, _[g]
)
- Verify flame command output format
Glossary
- eBPF: Extended Berkeley Packet Filter - Linux kernel technology for safe, JIT-compiled programs
- EU: Execution Unit - GPU compute core
- BTF: BPF Type Format - Compact type information for eBPF
- OA: Observability Architecture - Intel GPU hardware sampling infrastructure
- KSP: Kernel State Pointer - GPU address of shader/kernel binary
- SIP: System Instruction Pointer - GPU address of system routine (exception handler)
- Batch Buffer: GPU command stream
- Folded Stack: Text format for flame graphs:
func1;func2;func3 count
- VM: Virtual Memory - GPU virtual address space
- DRM: Direct Rendering Manager - Linux kernel graphics subsystem
- String Interning: Technique to store unique strings once, reference by ID