GPU Application Extension Mechanisms: Modifying Behavior Without Source Code Changes
This document explores the various mechanisms for extending and modifying GPU application behavior without requiring source code changes to the original application. We'll examine what aspects of GPU behavior can be modified, which approaches require GPU-side code, and how these capabilities compare to eBPF-like functionality.
Table of Contents
- Introduction
- Extension Points in GPU Computing Stack
- API Interception and Redirection
- Memory Management Extensions
- Execution Control Extensions
- Runtime Behavior Modification
- Kernel Scheduling Manipulation
- Multi-GPU Distribution
- Comparison with eBPF Capabilities
- Case Studies
- Future Directions
- References
Introduction
GPU applications often have behaviors that users might want to modify without changing the original source code:
- Resource allocation (memory, compute)
- Scheduling priorities and policies
- Error handling mechanisms
- Performance characteristics
- Monitoring and debugging capabilities
While CPUs benefit from advanced introspection tools like eBPF that allow dynamic behavior modification, GPUs have a different programming model that affects how extensions can be implemented. This document explores what's possible in the GPU ecosystem and the tradeoffs of different approaches.
Extension Points in GPU Computing Stack
The GPU computing stack offers several layers where behavior can be modified:
┌─────────────────────────────┐
│ Application │ ← Source code modification (not our focus)
├─────────────────────────────┤
│ GPU Framework/Library │ ← Library replacement/wrapper
│ (TensorFlow, PyTorch) │
├─────────────────────────────┤
│ CUDA Runtime API │ ← API interception
├─────────────────────────────┤
│ CUDA Driver API │ ← Driver API interception
├─────────────────────────────┤
│ GPU Driver │ ← Driver patches (requires privileges)
├─────────────────────────────┤
│ GPU Hardware │ ← Firmware modifications (rarely possible)
└─────────────────────────────┘
Each layer offers different extension capabilities and restrictions:
Layer | Extension Flexibility | Runtime Overhead | Implementation Complexity | Privileges Required |
---|---|---|---|---|
Framework | High | Low-Medium | Medium | None |
Runtime API | High | Low | Medium | None |
Driver API | Very High | Low | High | None |
GPU Driver | Extreme | Minimal | Very High | Root/Admin |
GPU Firmware | Limited | None | Extreme | Root + Specialized |
API Interception and Redirection
The most flexible and accessible approach for extending GPU applications is API interception, which requires no GPU-side code.
CUDA Runtime API Interception
What can be modified:
- Memory allocations and transfers
- Kernel launch parameters
- Stream and event management
- Device selection and management
Implementation approaches:
- LD_PRELOAD mechanism (Linux): preload a shared library that exports the same symbols as the CUDA runtime so the dynamic linker resolves calls to the wrapper first.
- DLL injection (Windows): inject a DLL into the target process and hook the CUDA runtime entry points.
- NVIDIA Intercept Library: a framework specifically designed for CUDA API interception.
Example: Memory Tracking Interceptor
// track_cuda_memory.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime.h>
// Function pointer types
typedef cudaError_t (*cudaMalloc_t)(void**, size_t);
typedef cudaError_t (*cudaFree_t)(void*);
// Original function pointers
static cudaMalloc_t real_cudaMalloc = NULL;
static cudaFree_t real_cudaFree = NULL;
// Track total allocated memory
static size_t total_allocated = 0;
// Intercepted cudaMalloc
cudaError_t cudaMalloc(void** devPtr, size_t size) {
if (!real_cudaMalloc)
real_cudaMalloc = (cudaMalloc_t)dlsym(RTLD_NEXT, "cudaMalloc");
cudaError_t result = real_cudaMalloc(devPtr, size);
if (result == cudaSuccess) {
total_allocated += size;
printf("CUDA Malloc: %zu bytes at %p (Total: %zu)\n",
size, *devPtr, total_allocated);
}
return result;
}
// Intercepted cudaFree
cudaError_t cudaFree(void* devPtr) {
if (!real_cudaFree)
real_cudaFree = (cudaFree_t)dlsym(RTLD_NEXT, "cudaFree");
// We would need a map to track size per pointer for accurate accounting
printf("CUDA Free: %p\n", devPtr);
return real_cudaFree(devPtr);
}
Usage: LD_PRELOAD=./libtrack_cuda_memory.so ./my_cuda_app
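To build the interposer as a preloadable shared library (the include path assumes the common /usr/local/cuda layout and may differ on other systems):
Build: gcc -shared -fPIC track_cuda_memory.c -I/usr/local/cuda/include -ldl -o libtrack_cuda_memory.so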
GPU Virtualization and API Remoting
More advanced API interception approaches can completely redirect GPU operations:
- NVIDIA CUDA vGPU: Virtualization technology that redirects API calls to a hypervisor-controlled GPU
- rCUDA: Remote CUDA execution framework that intercepts API calls and forwards them to a remote server
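As a sketch of the remoting idea (not the actual rCUDA wire protocol), the interposed functions can stop touching a local GPU entirely and instead serialize each call to a remote server. Here forward_call is a hypothetical transport helper that ships an opcode and argument block to a GPU server and waits for the reply:
// remoting_stub.c (illustrative sketch; forward_call is hypothetical)
#include <stddef.h>
#include <cuda_runtime.h>
// Hypothetical transport helper: serializes the request, sends it to a
// remote GPU server, and blocks until the server returns a status code.
extern cudaError_t forward_call(int opcode, void* args, size_t args_size);
enum { OP_MALLOC = 1, OP_FREE = 2 };
struct MallocRequest {
    size_t size;       // requested allocation size
    void*  remote_ptr; // filled in by the server with its device pointer
};
// Every intercepted call becomes a message instead of a local driver call.
cudaError_t cudaMalloc(void** devPtr, size_t size) {
    struct MallocRequest req = { size, NULL };
    cudaError_t status = forward_call(OP_MALLOC, &req, sizeof(req));
    if (status == cudaSuccess)
        *devPtr = req.remote_ptr; // opaque handle, only meaningful to the server
    return status;
}
cudaError_t cudaFree(void* devPtr) {
    return forward_call(OP_FREE, &devPtr, sizeof(devPtr));
}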
Memory Management Extensions
What Can Be Modified
- Memory Allocation Policies:
  - Custom allocation sizes (e.g., rounding up to specific boundaries)
  - Allocation pooling to reduce fragmentation
  - Device memory prioritization across multiple kernels
- Memory Transfer Optimizations:
  - Automatic pinned memory usage
  - Batching of small transfers
  - Compression during transfers
- Memory Access Patterns:
  - Memory prefetching
  - Custom caching strategies
Does It Require GPU Code?
Most memory management extensions can be implemented entirely from the CPU side through API interception. However, some advanced optimizations might require GPU-side modifications:
CPU-Side Only (No GPU Code Required):
- Allocation timing and batching
- Host-device transfer optimization
- Memory pool management

Requiring GPU Code:
- Custom memory access patterns within kernels
- Specialized caching strategies
- Data prefetching within kernels
Example: Memory Pool Interceptor
// Simple memory pool for CUDA allocations (uses STL containers, so build as C++)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <map>
#include <vector>
// Original function pointers
static cudaError_t (*real_cudaMalloc)(void**, size_t) = NULL;
static cudaError_t (*real_cudaFree)(void*) = NULL;
// Memory pool structures
struct MemBlock {
void* ptr;
size_t size;
bool in_use;
};
std::vector<MemBlock> memory_pool;
std::map<void*, size_t> allocation_map;
// Intercepted cudaMalloc with pooling
cudaError_t cudaMalloc(void** devPtr, size_t size) {
if (!real_cudaMalloc)
real_cudaMalloc = (cudaError_t(*)(void**, size_t))dlsym(RTLD_NEXT, "cudaMalloc");
// Round size up to reduce fragmentation (e.g., to 256-byte boundary)
size_t aligned_size = (size + 255) & ~255;
// Try to find a free block in the pool
for (auto& block : memory_pool) {
if (!block.in_use && block.size >= aligned_size) {
block.in_use = true;
*devPtr = block.ptr;
allocation_map[block.ptr] = aligned_size;
return cudaSuccess;
}
}
// Allocate new block if none found
void* new_ptr;
cudaError_t result = real_cudaMalloc(&new_ptr, aligned_size);
if (result == cudaSuccess) {
memory_pool.push_back({new_ptr, aligned_size, true});
allocation_map[new_ptr] = aligned_size;
*devPtr = new_ptr;
}
return result;
}
// Intercepted cudaFree with pooling
cudaError_t cudaFree(void* devPtr) {
if (!real_cudaFree)
real_cudaFree = (cudaError_t(*)(void*))dlsym(RTLD_NEXT, "cudaFree");
// Mark block as free but don't actually free memory
for (auto& block : memory_pool) {
if (block.ptr == devPtr) {
block.in_use = false;
allocation_map.erase(devPtr);
return cudaSuccess;
}
}
// If not found in pool, use regular free
return real_cudaFree(devPtr);
}
// Add function to actually free all pooled memory when app exits
__attribute__((destructor)) void cleanup_memory_pool() {
for (auto& block : memory_pool) {
real_cudaFree(block.ptr);
}
memory_pool.clear();
}
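Two caveats apply to this sketch: the pool and allocation map are not protected by any lock, so a real implementation would add a mutex for multi-threaded applications, and freed blocks are only returned to the driver at process exit, so peak device memory usage can be higher than in the unmodified application. Because it uses STL containers, it must be built as C++ and preloaded in the same way as the previous example.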
Execution Control Extensions
What Can Be Modified
- Kernel Launch Configuration:
  - Block and grid dimensions
  - Shared memory allocation
  - Stream assignment
- Kernel Execution Timing:
  - Kernel launch batching
  - Execution prioritization
  - Work distribution across multiple kernels
- Error Handling and Recovery:
  - Custom handling of CUDA errors
  - Automatic retry of failed operations
  - Graceful degradation strategies
Does It Require GPU Code?
Basic execution control can be handled through API interception, but advanced optimizations may require GPU-side code:
CPU-Side Only (No GPU Code Required):
- Launch configurations
- Stream management
- Basic error handling

Requiring GPU Code:
- Kernel fusion or splitting
- Advanced error recovery within kernels
- Dynamic workload balancing within kernels
Example: Kernel Launch Optimizer
// kernel_optimizer.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
// Original kernel launch function
typedef cudaError_t (*cudaLaunchKernel_t)(const void*, dim3, dim3, void**, size_t, cudaStream_t);
static cudaLaunchKernel_t real_cudaLaunchKernel = NULL;
// Optimized kernel launch
cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim,
void** args, size_t sharedMem, cudaStream_t stream) {
if (!real_cudaLaunchKernel)
real_cudaLaunchKernel = (cudaLaunchKernel_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");
// Heuristic: only rewrite launches that use small blocks. Note that silently
// changing launch geometry is only safe for kernels that derive all of their
// indexing from the global thread ID; device properties could refine this further.
if (blockDim.x * blockDim.y * blockDim.z <= 256) {
// Use a flat 256-thread block for better SM utilization
dim3 optimizedBlockDim;
optimizedBlockDim.x = 256;
optimizedBlockDim.y = 1;
optimizedBlockDim.z = 1;
// Adjust grid size to maintain total threads
dim3 optimizedGridDim;
int original_total = gridDim.x * gridDim.y * gridDim.z *
blockDim.x * blockDim.y * blockDim.z;
int threads_per_block = optimizedBlockDim.x * optimizedBlockDim.y *
optimizedBlockDim.z;
int num_blocks = (original_total + threads_per_block - 1) / threads_per_block;
optimizedGridDim.x = num_blocks;
optimizedGridDim.y = 1;
optimizedGridDim.z = 1;
// Launch with optimized configuration
return real_cudaLaunchKernel(func, optimizedGridDim, optimizedBlockDim,
args, sharedMem, stream);
}
// Fall back to original configuration
return real_cudaLaunchKernel(func, gridDim, blockDim, args, sharedMem, stream);
}
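The hard-coded 256-thread heuristic above is purely illustrative. A gentler variant could let the runtime suggest an occupancy-derived block size; the fragment below is a drop-in replacement for the body of the optimization branch (it reuses func, gridDim, blockDim, args, sharedMem, stream, and real_cudaLaunchKernel from the example above, and must be compiled as C++ because the occupancy helper is a template):
// Occupancy-guided block size selection (sketch)
int minGridSize = 0, suggestedBlock = 0;
if (cudaOccupancyMaxPotentialBlockSize(&minGridSize, &suggestedBlock,
                                       func, sharedMem) == cudaSuccess &&
    suggestedBlock > 0) {
    // Preserve the total thread count while switching to a flat 1-D layout.
    long long total = (long long)gridDim.x * gridDim.y * gridDim.z *
                      blockDim.x * blockDim.y * blockDim.z;
    dim3 newBlock(suggestedBlock, 1, 1);
    dim3 newGrid((unsigned int)((total + suggestedBlock - 1) / suggestedBlock), 1, 1);
    return real_cudaLaunchKernel(func, newGrid, newBlock, args, sharedMem, stream);
}
// Otherwise fall through to the original launch configuration.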
Runtime Behavior Modification
What Can Be Modified
- JIT Compilation Behavior (see the sketch after this list):
  - Optimization levels
  - Target architectures
  - Code generation options
- Error Detection and Reporting:
  - Enhanced error checking
  - Custom logging and diagnostic information
  - Performance anomaly detection
- Device Management:
  - Multi-GPU load balancing
  - Power and thermal management
  - Fault tolerance strategies
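As one concrete illustration of steering JIT behavior from outside the application, a driver-API interposer can append its own options when a module is JIT-compiled from PTX. The sketch below assumes module loads go through cuModuleLoadDataEx; note that newer CUDA runtimes obtain driver entry points via cuGetProcAddress, which a robust tool would also need to hook:
// jit_options_interceptor.c (sketch: override the JIT optimization level)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda.h>

typedef CUresult (*cuModuleLoadDataEx_t)(CUmodule*, const void*, unsigned int,
                                         CUjit_option*, void**);
static cuModuleLoadDataEx_t real_cuModuleLoadDataEx = NULL;

CUresult cuModuleLoadDataEx(CUmodule* module, const void* image,
                            unsigned int numOptions, CUjit_option* options,
                            void** optionValues) {
    if (!real_cuModuleLoadDataEx)
        real_cuModuleLoadDataEx =
            (cuModuleLoadDataEx_t)dlsym(RTLD_NEXT, "cuModuleLoadDataEx");

    // If the caller already passes many options, forward the call unchanged.
    if (numOptions > 31)
        return real_cuModuleLoadDataEx(module, image, numOptions, options, optionValues);

    // Copy the caller's options and append an optimization-level override.
    CUjit_option opts[32];
    void* vals[32];
    for (unsigned int i = 0; i < numOptions; i++) {
        opts[i] = options[i];
        vals[i] = optionValues[i];
    }
    opts[numOptions] = CU_JIT_OPTIMIZATION_LEVEL;
    vals[numOptions] = (void*)(size_t)3;  // valid levels are 0-4; 3 is only an example
    return real_cuModuleLoadDataEx(module, image, numOptions + 1, opts, vals);
}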
Does It Require GPU Code?
Many runtime behaviors can be modified through API interception and environment variables, but some advanced features require GPU-side code:
CPU-Side Only (No GPU Code Required):
- JIT compilation flags
- Device selection and configuration
- Error handling policies

Requiring GPU Code:
- Custom error checking within kernels
- Specialized fault tolerance mechanisms
- Runtime adaptive algorithms
Example: Error Resilience Extension
// error_resilience.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <stdio.h>
// Original function pointers
typedef cudaError_t (*cudaLaunchKernel_t)(const void*, dim3, dim3, void**, size_t, cudaStream_t);
static cudaLaunchKernel_t real_cudaLaunchKernel = NULL;
// Track kernel launches for retry
struct KernelInfo {
const void* func;
dim3 gridDim;
dim3 blockDim;
void** args; // Note: This is unsafe without deep-copying args
size_t sharedMem;
int retries;
};
#define MAX_TRACKED_KERNELS 100
static struct KernelInfo kernel_history[MAX_TRACKED_KERNELS];
static int kernel_count = 0;
// Enhanced kernel launch with automatic retry
cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim,
void** args, size_t sharedMem, cudaStream_t stream) {
if (!real_cudaLaunchKernel)
real_cudaLaunchKernel = (cudaLaunchKernel_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");
// Save kernel info for potential retry
if (kernel_count < MAX_TRACKED_KERNELS) {
kernel_history[kernel_count].func = func;
kernel_history[kernel_count].gridDim = gridDim;
kernel_history[kernel_count].blockDim = blockDim;
kernel_history[kernel_count].args = args; // Note: This is a shallow copy
kernel_history[kernel_count].sharedMem = sharedMem;
kernel_history[kernel_count].retries = 0;
}
// Only kernels that fit in the history buffer are eligible for retry
int current_kernel = (kernel_count < MAX_TRACKED_KERNELS) ? kernel_count++ : -1;
// Launch kernel
cudaError_t result = real_cudaLaunchKernel(func, gridDim, blockDim,
args, sharedMem, stream);
// Check for errors and retry if needed
if (result != cudaSuccess) {
printf("Kernel launch failed: %s\n", cudaGetErrorString(result));
if (current_kernel >= 0 && kernel_history[current_kernel].retries < 3) {
printf("Retrying kernel launch (attempt %d)...\n",
kernel_history[current_kernel].retries + 1);
// Reset the device to clear the error state. Note: this destroys all
// existing allocations and contexts, so a real tool would need a far
// more surgical recovery path before retrying.
cudaDeviceReset();
// Increment retry count
kernel_history[current_kernel].retries++;
// Retry the launch
result = real_cudaLaunchKernel(func, gridDim, blockDim,
args, sharedMem, stream);
}
}
return result;
}
Kernel Scheduling Manipulation
What Can Be Modified
- Kernel Prioritization (see the sketch after this list):
  - Assigning execution priorities
  - Preemption control (where supported)
  - Execution ordering
- Stream Management:
  - Custom stream creation and synchronization
  - Work distribution across streams
  - Dependency management
- Concurrent Kernel Execution:
  - Controlling parallel kernel execution
  - Resource partitioning between kernels
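For basic prioritization (the first item above), CUDA already exposes per-stream priorities, so an interposer can upgrade ordinary stream creation without building a scheduler of its own. A minimal sketch that gives every intercepted stream the highest priority the device supports:
// priority_streams.c (sketch: promote all created streams to highest priority)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

typedef cudaError_t (*cudaStreamCreate_t)(cudaStream_t*);
static cudaStreamCreate_t real_cudaStreamCreate = NULL;

cudaError_t cudaStreamCreate(cudaStream_t* stream) {
    if (!real_cudaStreamCreate)
        real_cudaStreamCreate = (cudaStreamCreate_t)dlsym(RTLD_NEXT, "cudaStreamCreate");

    // Query the device's priority range; numerically lower values mean higher priority.
    int leastPriority = 0, greatestPriority = 0;
    if (cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority) != cudaSuccess)
        return real_cudaStreamCreate(stream);

    // Keep default-stream semantics (cudaStreamDefault); only the priority changes.
    return cudaStreamCreateWithPriority(stream, cudaStreamDefault, greatestPriority);
}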
Does It Require GPU Code?
Most scheduling manipulations can be done from the CPU side, but fine-grained control may require GPU code:
CPU-Side Only (No GPU Code Required):
- Stream creation and management
- Basic priority settings
- Kernel launch ordering

Requiring GPU Code:
- Dynamic workload balancing within the GPU
- Fine-grained synchronization between kernels
- Custom scheduling algorithms within kernels
Example: Priority-Based Scheduler
// priority_scheduler.cpp (C++ source: uses STL containers and std::thread)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <queue>
#include <mutex>
#include <thread>
#include <condition_variable>
// Original function pointers
typedef cudaError_t (*cudaLaunchKernel_t)(const void*, dim3, dim3, void**, size_t, cudaStream_t);
static cudaLaunchKernel_t real_cudaLaunchKernel = NULL;
// Kernel task with priority
struct KernelTask {
const void* func;
dim3 gridDim;
dim3 blockDim;
void** args;
size_t sharedMem;
cudaStream_t stream;
int priority; // Higher number = higher priority
bool operator<(const KernelTask& other) const {
return priority < other.priority; // Priority queue is max-heap
}
};
// Priority queue for kernels
std::priority_queue<KernelTask> kernel_queue;
std::mutex queue_mutex;
std::condition_variable queue_condition;
bool scheduler_running = false;
std::thread scheduler_thread;
// Scheduler function that runs in background
void scheduler_function() {
while (scheduler_running) {
KernelTask task;
{
std::unique_lock<std::mutex> lock(queue_mutex);
queue_condition.wait(lock, []{
return !kernel_queue.empty() || !scheduler_running;
});
if (!scheduler_running) break;
task = kernel_queue.top();
kernel_queue.pop();
}
// Launch the highest priority kernel
real_cudaLaunchKernel(task.func, task.gridDim, task.blockDim,
task.args, task.sharedMem, task.stream);
}
}
// Start scheduler if not running
void ensure_scheduler_running() {
if (!scheduler_running) {
scheduler_running = true;
scheduler_thread = std::thread(scheduler_function);
}
}
// Priority-based kernel launch
cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim,
void** args, size_t sharedMem, cudaStream_t stream) {
if (!real_cudaLaunchKernel)
real_cudaLaunchKernel = (cudaLaunchKernel_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");
ensure_scheduler_running();
// Determine kernel priority (example: based on grid size)
int priority = gridDim.x * gridDim.y * gridDim.z;
// Create task and add to queue
KernelTask task = {func, gridDim, blockDim, args, sharedMem, stream, priority};
{
std::lock_guard<std::mutex> lock(queue_mutex);
kernel_queue.push(task);
}
queue_condition.notify_one();
return cudaSuccess; // Note: This returns before kernel actually launches
}
// Clean up scheduler on program exit
__attribute__((destructor)) void cleanup_scheduler() {
if (scheduler_running) {
scheduler_running = false;
queue_condition.notify_all();
scheduler_thread.join();
}
}
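Because this interceptor returns cudaSuccess before the kernel has actually been submitted, it changes the application's error and timing semantics: a subsequent cudaStreamSynchronize or cudaGetLastError may run before the queued launch happens. A production version would also have to intercept the synchronization APIs and deep-copy the kernel arguments, since the args array is only guaranteed to be valid for the duration of the original cudaLaunchKernel call.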
Multi-GPU Distribution
What Can Be Modified
- Workload Distribution:
  - Automatic work partitioning across GPUs
  - Load balancing based on GPU capabilities
  - Data locality optimization
- Memory Management Across GPUs:
  - Transparent data mirroring
  - Cross-GPU memory access optimization (see the sketch after this list)
  - Unified memory enhancements
- Synchronization Strategies:
  - Custom barriers and synchronization points
  - Communication optimization
  - Dependency management
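For the cross-GPU memory access item above, CUDA's peer-to-peer API lets a kernel on one device dereference memory allocated on another when the hardware supports it, and an extension layer could enable this once at startup. A minimal sketch:
// peer_access_setup.c (sketch: enable P2P access between all capable GPU pairs)
#include <cuda_runtime.h>
#include <stdio.h>

void enable_all_peer_access(void) {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    for (int i = 0; i < device_count; i++) {
        cudaSetDevice(i);
        for (int j = 0; j < device_count; j++) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            if (can_access) {
                // Lets kernels running on device i dereference pointers
                // allocated on device j (over NVLink or PCIe P2P).
                cudaError_t err = cudaDeviceEnablePeerAccess(j, 0);
                if (err != cudaSuccess && err != cudaErrorPeerAccessAlreadyEnabled)
                    printf("P2P %d -> %d failed: %s\n", i, j, cudaGetErrorString(err));
            }
        }
    }
}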
Does It Require GPU Code?
Basic multi-GPU support can be implemented through API interception, but efficient implementations typically require GPU-side modifications:
CPU-Side Only (No GPU Code Required):
- Basic work distribution
- Memory allocation across GPUs
- High-level synchronization

Requiring GPU Code:
- Efficient inter-GPU communication
- Custom data sharing mechanisms
- GPU-side workload balancing
Example: Simple Multi-GPU Distributor
// multi_gpu_distributor.cpp (C++ source: uses STL containers)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <vector>
#include <map>
// Original function pointers
typedef cudaError_t (*cudaLaunchKernel_t)(const void*, dim3, dim3, void**, size_t, cudaStream_t);
static cudaLaunchKernel_t real_cudaLaunchKernel = NULL;
// Track available GPUs
static int num_gpus = 0;
static std::vector<cudaStream_t> gpu_streams;
static std::map<void*, std::vector<void*>> memory_mirrors;
static int next_gpu = 0;
// Initialize multi-GPU environment
void init_multi_gpu() {
if (num_gpus > 0) return; // Already initialized
cudaGetDeviceCount(&num_gpus);
if (num_gpus <= 1) num_gpus = 1; // Fall back to single GPU
// Create a stream for each GPU
for (int i = 0; i < num_gpus; i++) {
cudaSetDevice(i);
cudaStream_t stream;
cudaStreamCreate(&stream);
gpu_streams.push_back(stream);
}
}
// Distributed kernel launch
cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim,
void** args, size_t sharedMem, cudaStream_t stream) {
if (!real_cudaLaunchKernel)
real_cudaLaunchKernel = (cudaLaunchKernel_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");
init_multi_gpu();
if (num_gpus <= 1) {
// Single GPU mode
return real_cudaLaunchKernel(func, gridDim, blockDim, args, sharedMem, stream);
}
// Simple round-robin distribution
int gpu_id = next_gpu;
next_gpu = (next_gpu + 1) % num_gpus;
cudaSetDevice(gpu_id);
// Adjust grid dimensions for multi-GPU
dim3 adjusted_grid = gridDim;
adjusted_grid.x = (gridDim.x + num_gpus - 1) / num_gpus; // Divide work
// Launch on selected GPU
return real_cudaLaunchKernel(func, adjusted_grid, blockDim,
args, sharedMem, gpu_streams[gpu_id]);
}
// Memory allocation with mirroring
cudaError_t cudaMalloc(void** devPtr, size_t size) {
static auto real_cudaMalloc = (cudaError_t(*)(void**, size_t))dlsym(RTLD_NEXT, "cudaMalloc");
init_multi_gpu();
if (num_gpus <= 1) {
return real_cudaMalloc(devPtr, size);
}
// Allocate on primary GPU
cudaSetDevice(0);
cudaError_t result = real_cudaMalloc(devPtr, size);
if (result != cudaSuccess) return result;
// Allocate mirrors on other GPUs
std::vector<void*> mirrors;
mirrors.push_back(*devPtr); // Original pointer
for (int i = 1; i < num_gpus; i++) {
cudaSetDevice(i);
void* mirror_ptr;
result = real_cudaMalloc(&mirror_ptr, size);
if (result != cudaSuccess) {
// Clean up on failure
for (void* ptr : mirrors) {
cudaFree(ptr);
}
return result;
}
mirrors.push_back(mirror_ptr);
}
// Store mirrors for later use
memory_mirrors[*devPtr] = mirrors;
return cudaSuccess;
}
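As written, this round-robin scheme is only an illustration of the interception points involved: the grid is divided but no per-GPU offset is passed to the kernel, so every device would process the same leading portion of the data, and the pointers in args still refer to device 0's allocations unless they are rewritten to point at the per-GPU mirrors. A usable distributor would have to rewrite kernel arguments, keep the mirrors synchronized, and gather results back to the primary GPU.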
Comparison with eBPF Capabilities
eBPF provides dynamic instrumentation capabilities on the CPU side that have no exact equivalent in the GPU world, but the approaches can still be compared:
eBPF Capability | GPU Equivalent | Implementation Complexity | Limitations |
---|---|---|---|
Dynamic code loading | JIT compilation | High | Requires specialized tools |
Kernel instrumentation | API interception | Medium | Limited to API boundaries |
Process monitoring | CUPTI callbacks | Medium | Limited visibility into kernels |
Network packet filtering | N/A | N/A | No direct equivalent |
Performance monitoring | NVTX, CUPTI | Low | External profiling needed |
Security enforcement | API validation | Medium | Limited enforcement points |
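The closest GPU analogue to eBPF-style instrumentation hooks is a CUPTI callback subscriber, which is invoked at the entry and exit of every CUDA runtime API call. A minimal sketch that counts kernel launches (error handling omitted; the callback-ID constant comes from CUPTI's runtime cbid header):
// cupti_launch_counter.c (sketch: count cudaLaunchKernel calls via CUPTI)
#include <cupti.h>
#include <stdio.h>

static unsigned long long launch_count = 0;

static void CUPTIAPI on_runtime_api(void* userdata, CUpti_CallbackDomain domain,
                                    CUpti_CallbackId cbid,
                                    const CUpti_CallbackData* info) {
    (void)userdata; (void)domain;
    if (cbid == CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000 &&
        info->callbackSite == CUPTI_API_EXIT) {
        launch_count++;
        printf("kernel launch #%llu (%s)\n", launch_count,
               info->symbolName ? info->symbolName : "unknown");
    }
}

void install_cupti_counter(void) {
    CUpti_SubscriberHandle subscriber;
    cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)on_runtime_api, NULL);
    // Deliver callbacks for every CUDA runtime API entry and exit.
    cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);
}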
Key Differences
- Runtime Safety Guarantees:
  - eBPF: static verification ensures program safety
  - GPU: no equivalent safety verification for dynamic code
- Scope of Observation:
  - eBPF: system-wide visibility across processes
  - GPU: limited to a single application or the driver level
- Privilege Requirements:
  - eBPF: requires varying levels of privileges
  - GPU: API interception usually requires no special privileges
- Integration with Hardware:
  - eBPF: deep integration with the CPU and OS
  - GPU: limited by vendor-provided interfaces
Case Studies
Case Study 1: Transparent Multi-GPU Acceleration
Challenge: Accelerate single-GPU applications to use multiple GPUs without code changes.
Solution: An API interception library that:
1. Intercepts memory allocation and kernel launches
2. Divides data across available GPUs
3. Rewrites kernel launches to process data partitions
4. Gathers results back to the primary GPU
Example results (illustrative, not measured):
- 1.8x speedup on 2 GPUs for memory-bound applications
- Limited scaling for compute-bound applications due to synchronization overhead
- No source code changes required
Case Study 2: Adaptive Memory Management
Challenge: Reduce memory allocation overhead and fragmentation in deep learning frameworks.
Solution: A memory pooling extension that:
1. Intercepts all CUDA memory allocations
2. Maintains pools of pre-allocated memory
3. Implements custom allocation strategies based on usage patterns
4. Defers actual freeing until memory pressure requires it
Example results (illustrative, not measured):
- 30% reduction in training time for models with many small tensor allocations
- 15% reduction in peak memory usage through better fragmentation management
- Compatible with existing frameworks without source changes
Future Directions
The landscape of GPU extension mechanisms continues to evolve:
- Hardware-Level Extensibility:
  - GPU vendors may provide more hooks for custom runtime behaviors
  - Hardware support for secure, dynamic code loading (a GPU equivalent of eBPF)
- Unified Programming Models:
  - SYCL, oneAPI, and similar frameworks may provide more extension points
  - Heterogeneous programming models that span CPU and GPU
- OS-Level GPU Resource Management:
  - Integration of GPU resources into OS scheduling frameworks
  - Fine-grained control of GPU resources at the OS level
- AI-Assisted Extensions:
  - Automated optimization systems that dynamically modify GPU application behavior
  - Machine learning models that predict and adapt to application requirements
References
- NVIDIA. "CUDA Driver API." https://docs.nvidia.com/cuda/cuda-driver-api/
- Gregg, Brendan. "BPF Performance Tools: Linux System and Application Observability." Addison-Wesley Professional, 2019.
- NVIDIA. "CUPTI: CUDA Profiling Tools Interface." https://docs.nvidia.com/cuda/cupti/