eBPF Tutorial: Tracing CUDA GPU Operations
Have you ever wondered what's happening under the hood when your CUDA application is running? GPU operations can be challenging to debug and profile because they happen in a separate device with its own memory space. In this tutorial, we'll build a powerful eBPF-based tracing tool that lets you peek into CUDA API calls in real time.
Introduction to CUDA and GPU Tracing
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose processing. When you run a CUDA application, several things happen behind the scenes:
- The host (CPU) allocates memory on the device (GPU)
- Data is transferred from host to device memory
- GPU kernels (functions) are launched to process the data
- Results are transferred back from device to host
- Device memory is freed
Each of these operations involves CUDA API calls like cudaMalloc, cudaMemcpy, and cudaLaunchKernel. Tracing these calls can provide valuable insights for debugging and performance optimization, but this isn't straightforward. GPU operations happen asynchronously, and traditional debugging tools often can't access GPU internals.
This is where eBPF comes to the rescue! By using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (libcudart.so) before they reach the GPU. This gives us visibility into:
- Memory allocation sizes and patterns
- Data transfer directions and sizes
- Kernel launch parameters
- Error codes and failures
- Timing of operations
Key CUDA Functions We Trace
Our tracer monitors several critical CUDA functions that represent the main operations in GPU computing. Understanding these functions helps you interpret the tracing results and diagnose issues in your CUDA applications:
Memory Management
- cudaMalloc: Allocates memory on the GPU device. By tracing this, we can see how much memory is being requested, when, and whether it succeeds. Memory allocation failures are a common source of problems in CUDA applications.
- cudaFree: Releases previously allocated memory on the GPU. Tracing this helps identify memory leaks (allocated memory that's never freed) and double-free errors.
Data Transfer
cudaMemcpy: Copies data between host (CPU) and device (GPU) memory, or between different locations in device memory. The direction parameter (kind) tells us whether data is moving to the GPU, from the GPU, or within the GPU.
The kind parameter can be:
- cudaMemcpyHostToDevice (1): Copying from CPU to GPU
- cudaMemcpyDeviceToHost (2): Copying from GPU to CPU
- cudaMemcpyDeviceToDevice (3): Copying within GPU memory
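On the user-space side, a tiny helper could decode the kind value into these names. This is a sketch; the numeric values match CUDA's cudaMemcpyKind enum, where cudaMemcpyHostToHost is 0 and is not traced here:
static const char *memcpy_kind_str(int kind) {
    switch (kind) {
    case 1: return "HostToDevice";   /* cudaMemcpyHostToDevice */
    case 2: return "DeviceToHost";   /* cudaMemcpyDeviceToHost */
    case 3: return "DeviceToDevice"; /* cudaMemcpyDeviceToDevice */
    default: return "Other";
    }
}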
Kernel Execution
cudaLaunchKernel: Launches a GPU kernel (function) to run on the device. This is where the actual parallel computation happens. Tracing this shows when kernels are launched and whether they succeed.
Streams and Synchronization
CUDA uses streams for managing concurrency and asynchronous operations:
- cudaStreamCreate: Creates a new stream for executing operations in order but potentially concurrently with other streams.
- cudaStreamSynchronize: Waits for all operations in a stream to complete. This is a key synchronization point that can reveal performance bottlenecks.
Events
CUDA events are used for timing and synchronization:
- cudaEventCreate: Creates an event object for timing operations.
- cudaEventRecord: Records an event in a stream, which can be used for timing or synchronization.
- cudaEventSynchronize: Waits for an event to complete, which is another synchronization point.
Device Management
- cudaGetDevice: Gets the current device being used.
- cudaSetDevice: Sets the device to be used for GPU execution.
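To show how these stream, event, and device calls typically fit together, here is a hedged host-side sketch that times work queued on a stream. Error checking is omitted, and cudaEventElapsedTime plus the Destroy calls are standard CUDA runtime APIs that this tool does not trace:
#include <cuda_runtime.h>

float time_stream_work(void) {
    cudaStream_t stream;
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaSetDevice(0);                   /* select GPU 0 */
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    /* ... enqueue kernels or async copies on the stream here ... */
    cudaEventRecord(stop, stream);

    cudaStreamSynchronize(stream);      /* traced synchronization point */
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    return ms;
}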
By tracing these functions, we gain complete visibility into the lifecycle of GPU operations, from device selection and memory allocation to data transfer, kernel execution, and synchronization. This enables us to identify bottlenecks, diagnose errors, and understand the behavior of CUDA applications.
Architecture Overview
Our CUDA events tracer consists of three main components:
- Header File (cuda_events.h): Defines data structures for communication between kernel and user space
- eBPF Program (cuda_events.bpf.c): Implements kernel-side hooks for CUDA functions using uprobes
- User-Space Application (cuda_events.c): Loads the eBPF program, processes events, and displays them to the user
The tool uses eBPF uprobes to attach to CUDA API functions in the CUDA runtime library. When a CUDA function is called, the eBPF program captures the parameters and results, sending them to user space through a ring buffer.
Key Data Structures
The central data structure for our tracer is the struct event defined in cuda_events.h:
struct event {
/* Common fields */
int pid; /* Process ID */
char comm[TASK_COMM_LEN]; /* Process name */
enum cuda_event_type type; /* Type of CUDA event */
/* Event-specific data (union to save space) */
union {
struct { size_t size; } mem; /* For malloc */
struct { void *ptr; } free_data; /* For free */
struct { size_t size; int kind; } memcpy_data; /* For memcpy */
struct { void *func; } launch; /* For kernel launch */
struct { int device; } device; /* For device operations */
struct { void *handle; } handle; /* For stream/event operations */
};
bool is_return; /* True if this is from a return probe */
int ret_val; /* Return value (for return probes) */
char details[MAX_DETAILS_LEN]; /* Additional details as string */
};
This structure is designed to efficiently capture information about different types of CUDA operations. The union is a clever space-saving technique, since each event only needs one type of data at a time. For example, a memory allocation event needs to store the size, while a free event needs to store a pointer.
The cuda_event_type enum helps us categorize different CUDA operations:
enum cuda_event_type {
CUDA_EVENT_MALLOC = 0,
CUDA_EVENT_FREE,
CUDA_EVENT_MEMCPY,
CUDA_EVENT_LAUNCH_KERNEL,
CUDA_EVENT_STREAM_CREATE,
CUDA_EVENT_STREAM_SYNC,
CUDA_EVENT_GET_DEVICE,
CUDA_EVENT_SET_DEVICE,
CUDA_EVENT_EVENT_CREATE,
CUDA_EVENT_EVENT_RECORD,
CUDA_EVENT_EVENT_SYNC
};
This enum covers the main CUDA operations we want to trace, from memory management to kernel launches and synchronization.
The eBPF Program Implementation
Let's dive into the eBPF program (cuda_events.bpf.c) that hooks into CUDA functions. The full code is available in the repository, but here are the key parts.
First, we create a ring buffer to communicate with user space.
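The map definition isn't reproduced here, but a minimal sketch of the typical libbpf declaration, using the rb name that the probes below reference and the 256 KB size described next, would look like this:
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024); /* 256 KB, sized to absorb event bursts */
} rb SEC(".maps");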
The ring buffer is a crucial component for our tracer. It acts as a high-performance queue where the eBPF program can submit events, and the user-space application can retrieve them. We set a generous size of 256KB to handle bursts of events without losing data.
For each CUDA operation, we implement a helper function to collect relevant data. Let's look at the submit_malloc_event function as an example:
static inline int submit_malloc_event(size_t size, bool is_return, int ret_val) {
struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) return 0;
/* Fill common fields */
e->pid = bpf_get_current_pid_tgid() >> 32;
bpf_get_current_comm(&e->comm, sizeof(e->comm));
e->type = CUDA_EVENT_MALLOC;
e->is_return = is_return;
/* Fill event-specific data */
if (is_return) {
e->ret_val = ret_val;
} else {
e->mem.size = size;
}
bpf_ringbuf_submit(e, 0);
return 0;
}
This function first reserves space in the ring buffer for our event. Then it fills in common fields like the process ID and name. For a malloc event, we store either the requested size (on function entry) or the return value (on function exit). Finally, we submit the event to the ring buffer.
The actual probes are attached to CUDA functions using SEC annotations. For cudaMalloc, we have:
SEC("uprobe")
int BPF_KPROBE(cuda_malloc_enter, void **ptr, size_t size) {
return submit_malloc_event(size, false, 0);
}
SEC("uretprobe")
int BPF_KRETPROBE(cuda_malloc_exit, int ret) {
return submit_malloc_event(0, true, ret);
}
The first function is called when cudaMalloc is entered, capturing the requested size. The second is called when cudaMalloc returns, capturing the error code. This pattern is repeated for each CUDA function we want to trace.
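As an illustration of that repetition, the cudaFree probes might look like the sketch below. Here submit_free_event is a hypothetical helper mirroring submit_malloc_event; the real code is in the repository:
SEC("uprobe")
int BPF_KPROBE(cuda_free_enter, void *ptr) {
    /* hypothetical helper: would fill e->free_data.ptr */
    return submit_free_event(ptr, false, 0);
}

SEC("uretprobe")
int BPF_KRETPROBE(cuda_free_exit, int ret) {
    /* capture the cudaError_t return code */
    return submit_free_event(NULL, true, ret);
}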
One interesting case is cudaMemcpy, which transfers data between host and device:
SEC("uprobe")
int BPF_KPROBE(cuda_memcpy_enter, void *dst, const void *src, size_t size, int kind) {
return submit_memcpy_event(size, kind, false, 0);
}
Here, we capture not just the size but also the "kind" parameter, which indicates the direction of the transfer (host-to-device, device-to-host, or device-to-device). This gives us valuable information about data movement patterns.
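The submit_memcpy_event helper isn't shown above; a plausible sketch, mirroring submit_malloc_event and using the memcpy_data member of struct event, would be:
static inline int submit_memcpy_event(size_t size, int kind, bool is_return, int ret_val) {
    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e) return 0;
    /* Fill common fields */
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    e->type = CUDA_EVENT_MEMCPY;
    e->is_return = is_return;
    /* Fill event-specific data */
    if (is_return) {
        e->ret_val = ret_val;
    } else {
        e->memcpy_data.size = size;
        e->memcpy_data.kind = kind; /* transfer direction */
    }
    bpf_ringbuf_submit(e, 0);
    return 0;
}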
User-Space Application Details
The user-space application (cuda_events.c) is responsible for loading the eBPF program, processing events from the ring buffer, and displaying them in a user-friendly format.
First, the program parses command-line arguments to configure its behavior:
static struct env {
bool verbose;
bool print_timestamp;
char *cuda_library_path;
bool include_returns;
int target_pid;
} env = {
.print_timestamp = true,
.include_returns = true,
.cuda_library_path = NULL,
.target_pid = -1,
};
This structure stores configuration options like whether to print timestamps or include return probes. The default values provide a sensible starting point.
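One way these flags could be mapped onto env is a simple getopt loop. This is a hedged sketch (the repository may use argp or another parser); the flags match the options documented at the end of this tutorial, and it assumes <unistd.h> and <stdlib.h> are included:
static void parse_args(int argc, char **argv) {
    int opt;
    while ((opt = getopt(argc, argv, "vtrp:d:")) != -1) {
        switch (opt) {
        case 'v': env.verbose = true; break;             /* verbose output */
        case 't': env.print_timestamp = false; break;    /* hide timestamps */
        case 'r': env.include_returns = false; break;    /* entries only */
        case 'p': env.cuda_library_path = optarg; break; /* library/app path */
        case 'd': env.target_pid = atoi(optarg); break;  /* PID filter */
        }
    }
}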
The program uses libbpf to load and attach the eBPF program to CUDA functions:
int attach_cuda_func(struct cuda_events_bpf *skel, const char *lib_path,
                     const char *func_name, struct bpf_program *prog_entry,
                     struct bpf_program *prog_exit) {
    LIBBPF_OPTS(bpf_uprobe_opts, uprobe_opts);
    /* Attach entry uprobe */
    if (prog_entry) {
        uprobe_opts.func_name = func_name;
        uprobe_opts.retprobe = false;
        struct bpf_link *link = bpf_program__attach_uprobe_opts(prog_entry,
            env.target_pid, lib_path, 0, &uprobe_opts);
        /* Error handling... */
    }
    /* Attach exit uprobe */
    if (prog_exit) {
        uprobe_opts.retprobe = true;
        /* Similar for return probe... */
    }
    return 0;
}
This function takes a function name (like "cudaMalloc") and the corresponding eBPF programs for entry and exit. It then attaches these programs as uprobes to the specified library.
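A hypothetical call site, assuming the libbpf-generated skeleton exposes the programs under skel->progs with the names defined earlier and that err, skel, and lib_path exist in the surrounding code, might look like:
/* attach the cudaMalloc entry/exit programs to the CUDA runtime library */
err = attach_cuda_func(skel, lib_path, "cudaMalloc",
                       skel->progs.cuda_malloc_enter,
                       skel->progs.cuda_malloc_exit);
if (err)
    goto cleanup;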
One of the most important functions is handle_event, which processes events from the ring buffer:
static int handle_event(void *ctx, void *data, size_t data_sz) {
const struct event *e = data;
struct tm *tm;
char ts[32];
char details[MAX_DETAILS_LEN];
time_t t;
/* Skip return probes if requested */
if (e->is_return && !env.include_returns)
return 0;
time(&t);
tm = localtime(&t);
strftime(ts, sizeof(ts), "%H:%M:%S", tm);
get_event_details(e, details, sizeof(details));
if (env.print_timestamp) {
printf("%-8s ", ts);
}
printf("%-16s %-7d %-20s %8s %s\n",
e->comm, e->pid,
event_type_str(e->type),
e->is_return ? "[EXIT]" : "[ENTER]",
details);
return 0;
}
This function formats and displays event information, including timestamps, process details, event type, and specific parameters or return values.
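The event_type_str helper used above isn't shown; a plausible sketch maps the cuda_event_type enum from cuda_events.h to the names that appear in the EVENT TYPE column:
static const char *event_type_str(enum cuda_event_type type) {
    switch (type) {
    case CUDA_EVENT_MALLOC:        return "cudaMalloc";
    case CUDA_EVENT_FREE:          return "cudaFree";
    case CUDA_EVENT_MEMCPY:        return "cudaMemcpy";
    case CUDA_EVENT_LAUNCH_KERNEL: return "cudaLaunchKernel";
    case CUDA_EVENT_STREAM_CREATE: return "cudaStreamCreate";
    case CUDA_EVENT_STREAM_SYNC:   return "cudaStreamSynchronize";
    case CUDA_EVENT_GET_DEVICE:    return "cudaGetDevice";
    case CUDA_EVENT_SET_DEVICE:    return "cudaSetDevice";
    case CUDA_EVENT_EVENT_CREATE:  return "cudaEventCreate";
    case CUDA_EVENT_EVENT_RECORD:  return "cudaEventRecord";
    case CUDA_EVENT_EVENT_SYNC:    return "cudaEventSynchronize";
    default:                       return "unknown";
    }
}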
The get_event_details function converts raw event data into human-readable form:
static void get_event_details(const struct event *e, char *details, size_t len) {
switch (e->type) {
case CUDA_EVENT_MALLOC:
if (!e->is_return)
snprintf(details, len, "size=%zu bytes", e->mem.size);
else
snprintf(details, len, "returned=%s", cuda_error_str(e->ret_val));
break;
/* Similar cases for other event types... */
}
}
This function handles each event type differently. For example, a malloc event shows the requested size on entry and the error code on exit.
The main event loop is remarkably simple.
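A minimal sketch of the loop, assuming an exiting flag set by a SIGINT handler and a ring buffer rb created with ring_buffer__new(..., handle_event, ...), would be:
while (!exiting) {
    err = ring_buffer__poll(rb, 100 /* timeout, ms */);
    if (err == -EINTR) { /* interrupted by Ctrl+C */
        err = 0;
        break;
    }
    if (err < 0) {
        fprintf(stderr, "Error polling ring buffer: %d\n", err);
        break;
    }
}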
This polls the ring buffer for events, calling handle_event for each one. The 100 ms timeout ensures the program remains responsive to signals like Ctrl+C.
CUDA Error Handling and Reporting
An important aspect of our tracer is translating CUDA error codes into human-readable messages. CUDA has over 100 different error codes, from simple ones like "out of memory" to complex ones like "unsupported PTX version."
Our tool includes a comprehensive cuda_error_str function that maps these numeric codes to string descriptions:
static const char *cuda_error_str(int error) {
switch (error) {
case 0: return "Success";
case 1: return "InvalidValue";
case 2: return "OutOfMemory";
/* Many more error codes... */
default: return "Unknown";
}
}
This makes the output much more useful for debugging. Instead of seeing "error 2", you'll see "OutOfMemory", which immediately tells you what went wrong.
Compilation and Execution
Building the tracer is straightforward: run make with the provided Makefile. This creates two binaries:
- cuda_events: The eBPF-based CUDA tracing tool
- basic02: A simple CUDA example application
The build system is smart enough to detect your GPU architecture using nvidia-smi and compile the CUDA code with the appropriate flags.
Running the tracer is just as easy:
# Start the tracing tool
sudo ./cuda_events -p ./basic02
# In another terminal, run the CUDA example
./basic02
You can also trace a specific process by PID:
# Run the CUDA example
./basic02 &
PID=$!
# Start the tracing tool with PID filtering
sudo ./cuda_events -p ./basic02 -d $PID
The example output shows detailed information about each CUDA operation:
Using CUDA library: ./basic02
TIME PROCESS PID EVENT TYPE DETAILS
17:35:41 basic02 12345 cudaMalloc [ENTER] size=4000 bytes
17:35:41 basic02 12345 cudaMalloc [EXIT] returned=Success
17:35:41 basic02 12345 cudaMalloc [ENTER] size=4000 bytes
17:35:41 basic02 12345 cudaMalloc [EXIT] returned=Success
17:35:41 basic02 12345 cudaMemcpy [ENTER] size=4000 bytes, kind=1
17:35:41 basic02 12345 cudaMemcpy [EXIT] returned=Success
17:35:41 basic02 12345 cudaLaunchKernel [ENTER] func=0x7f1234567890
17:35:41 basic02 12345 cudaLaunchKernel [EXIT] returned=Success
17:35:41 basic02 12345 cudaMemcpy [ENTER] size=4000 bytes, kind=2
17:35:41 basic02 12345 cudaMemcpy [EXIT] returned=Success
17:35:41 basic02 12345 cudaFree [ENTER] ptr=0x7f1234568000
17:35:41 basic02 12345 cudaFree [EXIT] returned=Success
17:35:41 basic02 12345 cudaFree [ENTER] ptr=0x7f1234569000
17:35:41 basic02 12345 cudaFree [EXIT] returned=Success
This output shows the typical flow of a CUDA application:
1. Allocate memory on the device
2. Copy data from host to device (kind=1)
3. Launch a kernel to process the data
4. Copy results back from device to host (kind=2)
5. Free device memory
Benchmark
We also provide a benchmark tool to test the performance of the tracer and the latency of the CUDA API calls.
When there is no tracing, the result is like this:
Data size: 1048576 bytes (1024 KB)
Iterations: 10000
Summary (average time per operation):
-----------------------------------
cudaMalloc: 113.14 µs
cudaMemcpyH2D: 365.85 µs
cudaLaunchKernel: 7.82 µs
cudaMemcpyD2H: 393.55 µs
cudaFree: 0.00 µs
When the tracer is attached, the result is like this:
Data size: 1048576 bytes (1024 KB)
Iterations: 10000
Summary (average time per operation):
-----------------------------------
cudaMalloc: 119.81 µs
cudaMemcpyH2D: 367.16 µs
cudaLaunchKernel: 8.77 µs
cudaMemcpyD2H: 383.66 µs
cudaFree: 0.00 µs
The tracer adds only a few microseconds of overhead to each CUDA API call (the cudaMemcpyD2H difference above is within run-to-run noise), which is negligible for most use cases.
Command Line Options
The cuda_events tool supports these options:
- -v: Enable verbose output for debugging
- -t: Don't print timestamps
- -r: Don't show function returns (only show function entries)
- -p PATH: Specify the path to the CUDA runtime library or application
- -d PID: Trace only the specified process ID
Next Steps
Once you're comfortable with this basic CUDA tracing tool, you could extend it to:
- Add support for more CUDA API functions
- Add timing information to analyze performance bottlenecks
- Implement correlation between related operations (e.g., matching mallocs with frees)
- Create visualizations of CUDA operations for easier analysis
- Add support for other GPU frameworks like OpenCL or ROCm
For more detail about the CUDA example and tutorial, check out our repo and the code at https://github.com/eunomia-bpf/basic-cuda-tutorial
References
- CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- NVIDIA CUDA Runtime API: https://docs.nvidia.com/cuda/cuda-runtime-api/
- libbpf Documentation: https://libbpf.readthedocs.io/
- Linux uprobes Documentation: https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt
If you'd like to dive deeper into eBPF, check out our tutorial repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or visit our website at https://eunomia.dev/tutorials/.