Write and Run eBPF on GPU with bpftime
bpftime provides GPU support through its CUDA/ROCm attachment implementation, allowing eBPF programs to execute inside GPU kernels on NVIDIA and AMD GPUs. This brings eBPF's programmability and observability to GPU computing workloads, enabling real-time profiling and debugging of GPU applications without source code modification.
Experimental
Why eBPF on GPU?
GPUs are widely used for ML workloads. They are typically SIMT (Single Instruction, Multiple Thread) accelerators: threads are organized in warps that execute on streaming multiprocessors (SMs), grouped into blocks, and launched as kernels, and they rely on a multi-level memory hierarchy of registers, shared memory/LDS (Local Data Share), L2 cache, and device memory. GPUs also have limited preemption capabilities compared to CPUs. This architectural complexity produces rich behavior that is difficult to observe and customize, particularly when diagnosing performance bottlenecks, memory access patterns, warp divergence, or resource contention.
The Problem with Current GPU Observability Tools
Today's GPU tracing and profiling landscape suffers from two major limitations:
1. CPU-Boundary Tools Lack Device-Side Visibility
Many tracing tools operate at the CPU boundary by placing probes on CUDA userspace libraries (such as `libcuda.so` and `libcudart.so`) or kernel drivers. While these tools can capture host-side events such as kernel launches, memory copies, and API calls, they treat the GPU device as a black box. This approach provides:
- No visibility into what happens inside a running kernel
- Weak or no linkage to device-side events like warp stalls, bank conflicts, or memory traffic patterns
- No ability to safely adapt or modify kernel behavior in-flight based on runtime conditions
- Limited correlation between host actions and device-side performance issues
2. GPU-Specific Profilers Are Siloed
Device-side profilers like NVIDIA's CUPTI, Intel's GTPin, NVBit, and Neutrino do provide detailed device-side visibility, including instruction-level profiling, memory traces, and warp execution analysis. However, they suffer from:
- Vendor lock-in: Each tool is typically tied to a specific GPU vendor (NVIDIA, AMD, Intel)
- Isolation from eBPF ecosystems: These tools don't integrate with Linux's eBPF infrastructure, making it difficult to correlate GPU events with system-wide observability data from kprobes, uprobes, tracepoints, or network events
- Limited programmability: Most provide fixed metrics rather than user-programmable instrumentation
- High overhead: Binary instrumentation tools can introduce significant performance overhead (e.g., NVBit can be 10-100x slower)
bpftime's Unified Approach
bpftime bridges this gap by offloading eBPF programs directly into GPU device contexts, bringing the same programmability model that revolutionized kernel observability to GPUs. The implementation includes:
GPU-Side Attach Points:
- Device function entry/exit for profiling kernel execution
- Block begin/end for tracking thread block lifecycle
- Barrier/synchronization points for analyzing warp coordination
- Memory operation hooks for capturing access patterns
- Stream operation events for tracking asynchronous execution
eBPF-to-GPU Compilation Pipeline:
- Compiles standard eBPF bytecode into GPU-native instruction sets (PTX for NVIDIA, SPIR-V for AMD)
- Includes full verifier support to ensure safety and prevent crashes
- Provides GPU-optimized helper functions for timing, thread identification, and map operations
- Supports standard eBPF maps (hash, array, ringbuf) with GPU-resident variants for zero-copy access
This unified approach enables:
- 3-10x faster performance than tools like NVBit for instrumentation
- Vendor-neutral design that works across NVIDIA and AMD GPUs
- Unified observability with Linux kernel eBPF programs (kprobes, uprobes)
- Fine-grained profiling at the warp or instruction level
- Adaptive GPU kernel memory optimization and programmable scheduling across SMs
- Accelerated eBPF applications by leveraging GPU compute power
Architecture
CUDA Attachment Pipeline
The GPU support is built on the `nv_attach_impl` system (`attach/nv_attach_impl/`), which implements an instrumentation pipeline:
```
┌──────────────────────────────────────────────────────────────┐
│                     Application Process                      │
│                                                               │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐  │
│  │   CUDA App   │────▶│   bpftime    │────▶│  GPU Kernel  │  │
│  │              │     │   Runtime    │     │  with eBPF   │  │
│  └──────────────┘     └──────────────┘     └──────────────┘  │
│                              │                     │         │
│                              ▼                     ▼         │
│                       ┌──────────────┐     ┌──────────────┐  │
│                       │ Shared Memory│     │  GPU Memory  │  │
│                       │  (Host-GPU)  │     │    (IPC)     │  │
│                       └──────────────┘     └──────────────┘  │
└──────────────────────────────────────────────────────────────┘
```
Examples
Complete working examples with full source code, build instructions, and READMEs are available on GitHub:
- cuda-counter: Basic probe/retprobe with timing measurements
- cuda-counter-gpu-array: Per-thread counters using GPU array maps
- cuda-counter-gpu-ringbuf: Event streaming with ringbuf maps
- rocm-counter: AMD GPU instrumentation (experimental)
Each example includes CUDA/ROCm application source, eBPF probe programs, Makefile, and detailed usage instructions.
Key Components
- CUDA Runtime Hooking: Intercepts CUDA API calls using Frida-based dynamic instrumentation
- PTX Modification: Converts eBPF bytecode to PTX (Parallel Thread Execution) assembly and injects it into GPU kernels
- Helper Trampoline: Provides GPU-accessible helper functions for map operations, timing, and context access
- Host-GPU Communication: Enables synchronous calls from GPU to host via pinned shared memory
Attachment Types
bpftime supports three attachment types for GPU kernels (defined in `attach/nv_attach_impl/nv_attach_impl.hpp:33-34`):
- `ATTACH_CUDA_PROBE` (8): Executes eBPF code at kernel entry
- `ATTACH_CUDA_RETPROBE` (9): Executes eBPF code at kernel exit
- Memory capture probe (`__memcapture`): Special probe type for capturing memory access patterns

All types support specifying target kernel functions by name (e.g., `_Z9vectorAddPKfS0_Pf` for mangled C++ names).
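As a concrete illustration, the sketch below shows what a probe/retprobe pair for the mangled `_Z9vectorAddPKfS0_Pf` kernel might look like in eBPF C. The `SEC()` section names and the declaration of the GPU timer helper by its ID (502, listed in the helper table further down) are assumptions for illustration; consult the cuda-counter example on GitHub for the exact conventions.

```c
// Hedged sketch: entry/exit probes for a CUDA kernel, written as ordinary
// eBPF C. Section names and the helper-declaration style are illustrative.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* GPU global timer helper, declared by its bpftime helper ID (see the
 * GPU Helper Functions table below); the signature is assumed. */
static __u64 (*bpf_get_globaltimer)(void) = (void *)502;

SEC("kprobe/_Z9vectorAddPKfS0_Pf")      /* ATTACH_CUDA_PROBE: kernel entry */
int vec_add_entry(void *ctx)
{
	bpf_printk("vectorAdd entered at %llu ns", bpf_get_globaltimer());
	return 0;
}

SEC("kretprobe/_Z9vectorAddPKfS0_Pf")   /* ATTACH_CUDA_RETPROBE: kernel exit */
int vec_add_exit(void *ctx)
{
	bpf_printk("vectorAdd exited at %llu ns", bpf_get_globaltimer());
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```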
GPU-Specific BPF Maps
bpftime includes specialized map types optimized for GPU operations:
`BPF_MAP_TYPE_NV_GPU_ARRAY_MAP` (1502)
GPU-resident array maps with per-thread storage for high-performance data collection.
Key Features:
- Data stored directly in GPU memory (CUDA IPC shared memory)
- Each thread gets isolated storage (`max_entries × max_thread_count × value_size`)
- Zero-copy access from GPU, DMA transfers to host
- Supports `bpf_map_lookup_elem()` and `bpf_map_update_elem()` in GPU code

Implementation: `runtime/src/bpf_map/gpu/nv_gpu_array_map.cpp:14-81`
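To make the per-thread semantics concrete, here is a minimal sketch of a per-thread launch counter kept in a GPU array map. It assumes the libbpf-style map declaration macros and uses the raw type value 1502 from above; the actual cuda-counter-gpu-array example may declare the map differently.

```c
// Hedged sketch: a per-thread counter in a GPU-resident array map.
// The map declaration style (libbpf __uint/__type macros) is assumed.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, 1502);          /* BPF_MAP_TYPE_NV_GPU_ARRAY_MAP */
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} launch_count SEC(".maps");

SEC("kprobe/_Z9vectorAddPKfS0_Pf")
int count_launches(void *ctx)
{
	__u32 key = 0;
	/* Each GPU thread sees its own slot of the map, so a plain
	 * increment is race-free here. */
	__u64 *val = bpf_map_lookup_elem(&launch_count, &key);
	if (val)
		*val += 1;
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```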
`BPF_MAP_TYPE_NV_GPU_RINGBUF_MAP` (1527)
GPU ring buffer maps for efficient per-thread event streaming to the host.
Key Features:
- Lock-free per-thread ring buffers in GPU memory
- Variable-size event records with metadata
- Asynchronous data collection with low overhead
- Compatible with the `bpf_perf_event_output()` helper

Implementation: `runtime/src/bpf_map/gpu/nv_gpu_ringbuf_map.cpp`
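The sketch below streams a small event record from a GPU probe to the host through such a map via `bpf_perf_event_output()`. The event layout, the map sizing, and the declarations of the timer and block-index helpers (IDs 502 and 503 from the table below) are illustrative assumptions.

```c
// Hedged sketch: per-thread event streaming through a GPU ringbuf map.
// Record layout and map sizing are illustrative.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

static __u64 (*bpf_get_globaltimer)(void) = (void *)502;
static long (*bpf_get_block_idx)(__u64 *x, __u64 *y, __u64 *z) = (void *)503;

struct gpu_event {
	__u64 timestamp;
	__u64 block_x;
};

struct {
	__uint(type, 1527);          /* BPF_MAP_TYPE_NV_GPU_RINGBUF_MAP */
	__uint(max_entries, 4096);   /* assumed per-thread ring capacity */
} events SEC(".maps");

SEC("kprobe/_Z9vectorAddPKfS0_Pf")
int emit_event(void *ctx)
{
	struct gpu_event e = {};
	__u64 x = 0, y = 0, z = 0;

	e.timestamp = bpf_get_globaltimer();
	bpf_get_block_idx(&x, &y, &z);
	e.block_x = x;

	/* Helper 25, routed to the GPU ringbuf fast path. */
	bpf_perf_event_output(ctx, &events, 0, &e, sizeof(e));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```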
GPU Helper Functions
bpftime provides GPU-specific eBPF helpers accessible from CUDA kernels (`attach/nv_attach_impl/trampoline/default_trampoline.cu:331-390`):
Core GPU Helpers
| Helper ID | Function Signature | Description |
|---|---|---|
| 501 | `ebpf_puts(const char *str)` | Print a string from the GPU kernel to the host console |
| 502 | `bpf_get_globaltimer(void)` | Read the GPU global timer (nanosecond precision) |
| 503 | `bpf_get_block_idx(u64 *x, u64 *y, u64 *z)` | Get CUDA block indices (`blockIdx`) |
| 504 | `bpf_get_block_dim(u64 *x, u64 *y, u64 *z)` | Get CUDA block dimensions (`blockDim`) |
| 505 | `bpf_get_thread_idx(u64 *x, u64 *y, u64 *z)` | Get CUDA thread indices (`threadIdx`) |
| 506 | `bpf_gpu_membar(void)` | Execute a GPU memory barrier (`membar.sys`) |
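One plausible way to use these from eBPF C is to declare them by ID, in the same style as classic raw eBPF helper declarations, and combine them to recover the familiar CUDA thread-indexing arithmetic. The declarations below are a sketch based on the IDs and signatures in the table; bpftime may ship its own header with the canonical definitions.

```c
// Hedged sketch: declaring the GPU helpers by ID and computing a global
// thread index the same way CUDA code would:
//   blockIdx.x * blockDim.x + threadIdx.x
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

static long  (*ebpf_puts)(const char *str) = (void *)501;
static __u64 (*bpf_get_globaltimer)(void) = (void *)502;
static long  (*bpf_get_block_idx)(__u64 *x, __u64 *y, __u64 *z) = (void *)503;
static long  (*bpf_get_block_dim)(__u64 *x, __u64 *y, __u64 *z) = (void *)504;
static long  (*bpf_get_thread_idx)(__u64 *x, __u64 *y, __u64 *z) = (void *)505;
static long  (*bpf_gpu_membar)(void) = (void *)506;

static __always_inline __u64 global_thread_x(void)
{
	__u64 bx = 0, by = 0, bz = 0;
	__u64 dx = 0, dy = 0, dz = 0;
	__u64 tx = 0, ty = 0, tz = 0;

	bpf_get_block_idx(&bx, &by, &bz);
	bpf_get_block_dim(&dx, &dy, &dz);
	bpf_get_thread_idx(&tx, &ty, &tz);
	return bx * dx + tx;
}
```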
Standard BPF Helpers (GPU-Compatible)
The following standard eBPF helpers work on GPU with special optimizations:
- `bpf_map_lookup_elem()` (1): Fast path for GPU array maps, fallback to host for other map types
- `bpf_map_update_elem()` (2): Fast path for GPU array maps, fallback to host for other map types
- `bpf_map_delete_elem()` (3): Host call via shared memory
- `bpf_trace_printk()` (6): Formatted output to the host console
- `bpf_get_current_pid_tgid()` (14): Returns the host process PID/TID
- `bpf_perf_event_output()` (25): Optimized for GPU ringbuf maps
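For example, a GPU probe can tag its output with the host process identity so that device-side events line up with kprobe/uprobe data collected on the CPU side. This is only a sketch; the print format and section name are illustrative.

```c
// Hedged sketch: correlating a GPU-side event with the host process.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/_Z9vectorAddPKfS0_Pf")
int tag_with_pid(void *ctx)
{
	__u64 pid_tgid = bpf_get_current_pid_tgid();   /* host PID/TID */

	/* bpf_trace_printk-backed output, printed on the host console */
	bpf_printk("GPU kernel launch observed in pid %u",
		   (__u32)(pid_tgid >> 32));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```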
Host-GPU Communication Protocol
For helpers requiring host interaction, bpftime uses a shared-memory protocol with spinlocks and warp-level serialization for correctness. The protocol involves:
1. The GPU thread acquires a spinlock
2. It writes the request parameters to shared memory
3. It sets a flag and waits for the host response
4. The host processes the request and signals completion
5. The GPU thread reads the response and releases the lock
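A simplified way to picture the device side of this protocol is sketched below in plain C. The structure layout, field names, and busy-wait loop are hypothetical, chosen only to mirror the five steps above; the real slot layout lives in the bpftime trampoline sources.

```c
// Hedged, hypothetical sketch of one host-call slot and the GPU-side steps.
// Names and layout are invented for illustration; real code uses CUDA
// atomics and pinned memory set up by the bpftime runtime.
struct host_call_slot {
	int lock;               /* spinlock guarding the slot */
	volatile int state;     /* 0 = idle, 1 = request posted, 2 = response ready */
	long args[6];           /* helper ID followed by its arguments */
	long retval;            /* written by the host before state becomes 2 */
};

static long gpu_call_host(struct host_call_slot *slot,
			  long helper_id, long a1, long a2, long a3)
{
	long ret;

	/* 1. acquire the spinlock */
	while (__sync_lock_test_and_set(&slot->lock, 1))
		;

	/* 2. write request parameters into the pinned shared-memory slot */
	slot->args[0] = helper_id;
	slot->args[1] = a1;
	slot->args[2] = a2;
	slot->args[3] = a3;

	/* 3. set the flag and wait for the host response */
	slot->state = 1;
	while (slot->state != 2)
		;               /* 4. host processes the request, sets state = 2 */

	/* 5. read the response and release the lock */
	ret = slot->retval;
	slot->state = 0;
	__sync_lock_release(&slot->lock);
	return ret;
}
```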
Building with GPU Support
Prerequisites
- NVIDIA CUDA Toolkit (12.x recommended) or AMD ROCm
- CMake 3.15+
- LLVM 15+ (for PTX generation)
- Frida-gum for runtime hooking
Build Configuration
```bash
# For NVIDIA CUDA
cmake -Bbuild \
    -DBPFTIME_ENABLE_CUDA_ATTACH=1 \
    -DBPFTIME_CUDA_ROOT=/usr/local/cuda-12.6 \
    -DCMAKE_BUILD_TYPE=Release

# For AMD ROCm (experimental)
cmake -Bbuild \
    -DBPFTIME_ENABLE_ROCM_ATTACH=1 \
    -DROCM_PATH=/opt/rocm

make -C build -j$(nproc)
```
References
- bpftime OSDI '25 Paper
- CUDA Runtime API
- PTX ISA
- eBPF Documentation
- eGPU: Extending eBPF Programmability and Observability to GPUs
Citation:
```bibtex
@inproceedings{yang2025egpu,
  title={eGPU: Extending eBPF Programmability and Observability to GPUs},
  author={Yang, Yiwei and Yu, Tong and Zheng, Yusheng and Quinn, Andrew},
  booktitle={Proceedings of the 4th Workshop on Heterogeneous Composable and Disaggregated Systems},
  pages={73--79},
  year={2025}
}
```
For questions or feedback, please open an issue on GitHub or contact us.