
GPU Application Extension Mechanisms: Modifying Behavior Without Source Code Changes

This document explores the various mechanisms for extending and modifying GPU application behavior without requiring source code changes to the original application. We'll examine what aspects of GPU behavior can be modified, which approaches require GPU-side code, and how these capabilities compare to eBPF-like functionality.

Table of Contents

  1. Introduction
  2. Extension Points in GPU Computing Stack
  3. API Interception and Redirection
  4. Memory Management Extensions
  5. Execution Control Extensions
  6. Runtime Behavior Modification
  7. Kernel Scheduling Manipulation
  8. Multi-GPU Distribution
  9. Comparison with eBPF Capabilities
  10. Case Studies
  11. Future Directions
  12. References

Introduction

GPU applications often have behaviors that users might want to modify without changing the original source code:

  • Resource allocation (memory, compute)
  • Scheduling priorities and policies
  • Error handling mechanisms
  • Performance characteristics
  • Monitoring and debugging capabilities

While CPUs benefit from advanced introspection tools like eBPF that allow dynamic behavior modification, GPUs have a different programming model that affects how extensions can be implemented. This document explores what's possible in the GPU ecosystem and the tradeoffs of different approaches.

Extension Points in GPU Computing Stack

The GPU computing stack offers several layers where behavior can be modified:

┌─────────────────────────────┐
│    Application              │ ← Source code modification (not our focus)
├─────────────────────────────┤
│    GPU Framework/Library    │ ← Library replacement/wrapper
│    (TensorFlow, PyTorch)    │
├─────────────────────────────┤
│    CUDA Runtime API         │ ← API interception
├─────────────────────────────┤
│    CUDA Driver API          │ ← Driver API interception
├─────────────────────────────┤
│    GPU Driver               │ ← Driver patches (requires privileges)
├─────────────────────────────┤
│    GPU Hardware             │ ← Firmware modifications (rarely possible)
└─────────────────────────────┘

Each layer offers different extension capabilities and restrictions:

| Layer | Extension Flexibility | Runtime Overhead | Implementation Complexity | Privileges Required |
|---|---|---|---|---|
| Framework | High | Low-Medium | Medium | None |
| Runtime API | High | Low | Medium | None |
| Driver API | Very High | Low | High | None |
| GPU Driver | Extreme | Minimal | Very High | Root/Admin |
| GPU Firmware | Limited | None | Extreme | Root + Specialized |

API Interception and Redirection

The most flexible and accessible approach for extending GPU applications is API interception, which requires no GPU-side code.

CUDA Runtime API Interception

What can be modified:

  • Memory allocations and transfers
  • Kernel launch parameters
  • Stream and event management
  • Device selection and management

Implementation approaches:

  1. LD_PRELOAD mechanism (Linux):

    // Example of intercepting cudaMalloc
    cudaError_t cudaMalloc(void** devPtr, size_t size) {
        // Call the real cudaMalloc (real_cudaMalloc and log_allocation are
        // helpers provided by the interception library)
        cudaError_t result = real_cudaMalloc(devPtr, size);

        // Add custom behavior
        log_allocation(*devPtr, size);

        return result;
    }
    

  2. DLL injection (Windows):

    BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved) {
        if (fdwReason == DLL_PROCESS_ATTACH) {
            // Hook CUDA functions
            HookFunction("cudaMalloc", MyCudaMalloc);
        }
        return TRUE;
    }
    

  3. NVIDIA Intercept Library: A framework specifically designed for CUDA API interception.

Example: Memory Tracking Interceptor

// track_cuda_memory.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Function pointer types
typedef cudaError_t (*cudaMalloc_t)(void**, size_t);
typedef cudaError_t (*cudaFree_t)(void*);

// Original function pointers
static cudaMalloc_t real_cudaMalloc = NULL;
static cudaFree_t real_cudaFree = NULL;

// Track total allocated memory
static size_t total_allocated = 0;

// Intercepted cudaMalloc
cudaError_t cudaMalloc(void** devPtr, size_t size) {
    if (!real_cudaMalloc)
        real_cudaMalloc = (cudaMalloc_t)dlsym(RTLD_NEXT, "cudaMalloc");

    cudaError_t result = real_cudaMalloc(devPtr, size);

    if (result == cudaSuccess) {
        total_allocated += size;
        printf("CUDA Malloc: %zu bytes at %p (Total: %zu)\n", 
               size, *devPtr, total_allocated);
    }

    return result;
}

// Intercepted cudaFree
cudaError_t cudaFree(void* devPtr) {
    if (!real_cudaFree)
        real_cudaFree = (cudaFree_t)dlsym(RTLD_NEXT, "cudaFree");

    // We would need a map to track size per pointer for accurate accounting
    printf("CUDA Free: %p\n", devPtr);

    return real_cudaFree(devPtr);
}

Usage: LD_PRELOAD=./libtrack_cuda_memory.so ./my_cuda_app
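Build (assuming a Linux system with the CUDA toolkit headers under /usr/local/cuda/include): gcc -shared -fPIC track_cuda_memory.c -o libtrack_cuda_memory.so -I/usr/local/cuda/include -ldl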

GPU Virtualization and API Remoting

More advanced API interception approaches can completely redirect GPU operations:

  • NVIDIA vGPU: Virtualization technology that redirects GPU work from virtual machines to a hypervisor-controlled physical GPU
  • rCUDA: Remote CUDA execution framework that intercepts API calls and forwards them to a remote server
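To illustrate the idea behind API remoting, the sketch below forwards an intercepted cudaMalloc to a hypothetical local daemon over TCP and returns an opaque handle in place of a device pointer. The wire format, port number, and daemon are invented for illustration; they are not part of rCUDA or any real product.

// remoting_sketch.c -- minimal illustration of API remoting (hypothetical protocol)
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cuda_runtime.h>

// Hypothetical wire format understood by an imaginary local "gpud" daemon
struct RemoteCall  { uint32_t opcode; uint64_t size; };   // opcode 1 = malloc
struct RemoteReply { uint32_t status; uint64_t handle; }; // status 0 = success

static int remote_fd = -1;

static int connect_remote(void) {
    if (remote_fd >= 0) return 0;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9999);                  // hypothetical daemon port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr*)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    remote_fd = fd;
    return 0;
}

// Intercepted cudaMalloc: instead of touching a local GPU, ask the daemon to
// allocate on a remote GPU and hand back an opaque handle.
cudaError_t cudaMalloc(void** devPtr, size_t size) {
    if (connect_remote() != 0) return cudaErrorUnknown;

    struct RemoteCall call = { 1, size };
    struct RemoteReply reply;
    if (write(remote_fd, &call, sizeof(call)) != sizeof(call) ||
        read(remote_fd, &reply, sizeof(reply)) != sizeof(reply) ||
        reply.status != 0)
        return cudaErrorMemoryAllocation;

    // The "device pointer" is only meaningful to the remote side; every other
    // intercepted API call must translate it in the same way.
    *devPtr = (void*)(uintptr_t)reply.handle;
    return cudaSuccess;
}

A complete remoting layer has to forward every API call and translate handles consistently on both sides, which is the bulk of what frameworks such as rCUDA implement.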

Memory Management Extensions

What Can Be Modified

  1. Memory Allocation Policies:
     • Custom allocation sizes (e.g., rounding up to specific boundaries)
     • Allocation pooling to reduce fragmentation
     • Device memory prioritization across multiple kernels

  2. Memory Transfer Optimizations:
     • Automatic pinned memory usage
     • Batching of small transfers
     • Compression during transfers

  3. Memory Access Patterns:
     • Memory prefetching
     • Custom caching strategies

Does It Require GPU Code?

Most memory management extensions can be implemented entirely from the CPU side through API interception. However, some advanced optimizations might require GPU-side modifications:

CPU-Side Only (No GPU Code Required):

  • Allocation timing and batching
  • Host-device transfer optimization (see the sketch below)
  • Memory pool management

Requiring GPU Code:

  • Custom memory access patterns within kernels
  • Specialized caching strategies
  • Data prefetching within kernels
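As an illustration of a CPU-side-only transfer optimization, the sketch below intercepts host-to-device cudaMemcpy calls and stages small copies through a pinned (page-locked) buffer. The staging size and the single-buffer, single-threaded design are illustrative assumptions, not a tuned implementation.

// pinned_staging.c -- route small host-to-device copies through a pinned buffer
#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <cuda_runtime.h>

typedef cudaError_t (*cudaMemcpy_t)(void*, const void*, size_t, enum cudaMemcpyKind);
static cudaMemcpy_t real_cudaMemcpy = NULL;

#define STAGING_SIZE (4UL << 20)   // 4 MiB staging buffer (illustrative choice)
static void* staging = NULL;

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count,
                       enum cudaMemcpyKind kind) {
    if (!real_cudaMemcpy)
        real_cudaMemcpy = (cudaMemcpy_t)dlsym(RTLD_NEXT, "cudaMemcpy");

    // Lazily allocate a single pinned (page-locked) staging buffer.
    if (!staging && cudaMallocHost(&staging, STAGING_SIZE) != cudaSuccess)
        staging = NULL;

    // Only small host-to-device copies are staged; everything else passes through.
    if (kind != cudaMemcpyHostToDevice || !staging || count > STAGING_SIZE)
        return real_cudaMemcpy(dst, src, count, kind);

    // Copy into pinned memory first so the real cudaMemcpy can DMA directly
    // from the pinned buffer instead of staging pageable memory internally.
    memcpy(staging, src, count);
    return real_cudaMemcpy(dst, staging, count, cudaMemcpyHostToDevice);
}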

Example: Memory Pool Interceptor

// Simple memory pool for CUDA allocations (uses C++ containers, so compile as C++)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <map>
#include <vector>

// Original function pointers
static cudaError_t (*real_cudaMalloc)(void**, size_t) = NULL;
static cudaError_t (*real_cudaFree)(void*) = NULL;

// Memory pool structures
struct MemBlock {
    void* ptr;
    size_t size;
    bool in_use;
};

std::vector<MemBlock> memory_pool;
std::map<void*, size_t> allocation_map;

// Intercepted cudaMalloc with pooling
cudaError_t cudaMalloc(void** devPtr, size_t size) {
    if (!real_cudaMalloc)
        real_cudaMalloc = (cudaError_t(*)(void**, size_t))dlsym(RTLD_NEXT, "cudaMalloc");

    // Round size up to reduce fragmentation (e.g., to 256-byte boundary)
    size_t aligned_size = (size + 255) & ~255;

    // Try to find a free block in the pool
    for (auto& block : memory_pool) {
        if (!block.in_use && block.size >= aligned_size) {
            block.in_use = true;
            *devPtr = block.ptr;
            allocation_map[block.ptr] = aligned_size;
            return cudaSuccess;
        }
    }

    // Allocate new block if none found
    void* new_ptr;
    cudaError_t result = real_cudaMalloc(&new_ptr, aligned_size);

    if (result == cudaSuccess) {
        memory_pool.push_back({new_ptr, aligned_size, true});
        allocation_map[new_ptr] = aligned_size;
        *devPtr = new_ptr;
    }

    return result;
}

// Intercepted cudaFree with pooling
cudaError_t cudaFree(void* devPtr) {
    if (!real_cudaFree)
        real_cudaFree = (cudaError_t(*)(void*))dlsym(RTLD_NEXT, "cudaFree");

    // Mark block as free but don't actually free memory
    for (auto& block : memory_pool) {
        if (block.ptr == devPtr) {
            block.in_use = false;
            allocation_map.erase(devPtr);
            return cudaSuccess;
        }
    }

    // If not found in pool, use regular free
    return real_cudaFree(devPtr);
}

// Free all pooled memory when the application exits
__attribute__((destructor)) void cleanup_memory_pool() {
    if (!real_cudaFree)
        real_cudaFree = (cudaError_t(*)(void*))dlsym(RTLD_NEXT, "cudaFree");
    for (auto& block : memory_pool) {
        real_cudaFree(block.ptr);
    }
    memory_pool.clear();
}
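Note that this pool is not thread-safe: memory_pool and allocation_map would need to be guarded by a mutex in a multithreaded application. Because it relies on C++ containers, the library must also be compiled as C++ (for example with g++ -shared -fPIC) rather than as C.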

Execution Control Extensions

What Can Be Modified

  1. Kernel Launch Configuration:
     • Block and grid dimensions
     • Shared memory allocation
     • Stream assignment

  2. Kernel Execution Timing:
     • Kernel launch batching
     • Execution prioritization
     • Work distribution across multiple kernels

  3. Error Handling and Recovery:
     • Custom handling of CUDA errors
     • Automatic retry of failed operations
     • Graceful degradation strategies

Does It Require GPU Code?

Basic execution control can be handled through API interception, but advanced optimizations may require GPU-side code:

CPU-Side Only (No GPU Code Required):

  • Launch configurations
  • Stream management (see the sketch below)
  • Basic error handling

Requiring GPU Code:

  • Kernel fusion or splitting
  • Advanced error recovery within kernels
  • Dynamic workload balancing within kernels
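As one CPU-side example of stream management (see the list above), this minimal sketch redirects kernels launched on the default stream to a private non-blocking stream, removing the default stream's implicit synchronization with other streams. Whether that is safe depends entirely on the application's synchronization assumptions, so treat it as an experiment rather than a drop-in optimization.

// default_stream_redirect.c -- launch default-stream kernels on a private stream
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

typedef cudaError_t (*cudaLaunchKernel_t)(const void*, dim3, dim3, void**, size_t, cudaStream_t);
static cudaLaunchKernel_t real_cudaLaunchKernel = NULL;

// Private non-blocking stream used in place of the legacy default stream
static cudaStream_t redirect_stream = NULL;

cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim,
                             void** args, size_t sharedMem, cudaStream_t stream) {
    if (!real_cudaLaunchKernel)
        real_cudaLaunchKernel = (cudaLaunchKernel_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    // Only launches on the default stream (stream 0) are redirected.
    if (stream == 0) {
        if (!redirect_stream)
            cudaStreamCreateWithFlags(&redirect_stream, cudaStreamNonBlocking);
        if (redirect_stream)
            stream = redirect_stream;
    }

    return real_cudaLaunchKernel(func, gridDim, blockDim, args, sharedMem, stream);
}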

Example: Kernel Launch Optimizer

// kernel_optimizer.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

// Original kernel launch function
typedef cudaError_t (*cudaLaunchKernel_t)(const void*, dim3, dim3, void**, size_t, cudaStream_t);
static cudaLaunchKernel_t real_cudaLaunchKernel = NULL;

// Optimized kernel launch
cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim, 
                            void** args, size_t sharedMem, cudaStream_t stream) {
    if (!real_cudaLaunchKernel)
        real_cudaLaunchKernel = (cudaLaunchKernel_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    // Get device properties
    cudaDeviceProp prop;
    int device;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    // Optimize block size for better occupancy.
    // Caution: silently changing the launch geometry is only safe for kernels
    // that derive their work from the total thread count (e.g., grid-stride
    // loops); kernels that assume a particular block shape will break.
    if (blockDim.x * blockDim.y * blockDim.z <= 256) {
        // Adjust block size for better SM utilization
        dim3 optimizedBlockDim;
        optimizedBlockDim.x = 256;
        optimizedBlockDim.y = 1;
        optimizedBlockDim.z = 1;

        // Adjust grid size to maintain total threads
        dim3 optimizedGridDim;
        int original_total = gridDim.x * gridDim.y * gridDim.z * 
                            blockDim.x * blockDim.y * blockDim.z;
        int threads_per_block = optimizedBlockDim.x * optimizedBlockDim.y * 
                               optimizedBlockDim.z;
        int num_blocks = (original_total + threads_per_block - 1) / threads_per_block;

        optimizedGridDim.x = num_blocks;
        optimizedGridDim.y = 1;
        optimizedGridDim.z = 1;

        // Launch with optimized configuration
        return real_cudaLaunchKernel(func, optimizedGridDim, optimizedBlockDim, 
                                   args, sharedMem, stream);
    }

    // Fall back to original configuration
    return real_cudaLaunchKernel(func, gridDim, blockDim, args, sharedMem, stream);
}

Runtime Behavior Modification

What Can Be Modified

  1. JIT Compilation Behavior:
     • Optimization levels
     • Target architectures
     • Code generation options

  2. Error Detection and Reporting:
     • Enhanced error checking
     • Custom logging and diagnostic information
     • Performance anomaly detection

  3. Device Management:
     • Multi-GPU load balancing
     • Power and thermal management
     • Fault tolerance strategies

Does It Require GPU Code?

Many runtime behaviors can be modified through API interception and environment variables, but some advanced features require GPU-side code:

CPU-Side Only (No GPU Code Required):

  • JIT compilation flags (see the sketch below)
  • Device selection and configuration
  • Error handling policies

Requiring GPU Code:

  • Custom error checking within kernels
  • Specialized fault tolerance mechanisms
  • Runtime adaptive algorithms
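As a sketch of adjusting JIT behavior from the CPU side (the "JIT compilation flags" item above), the code below intercepts the driver API call cuModuleLoadDataEx and forces the lowest optimization level, which can make JIT-compiled PTX easier to debug. Treat it as illustrative: newer CUDA runtimes resolve driver entry points through cuGetProcAddress, so a plain LD_PRELOAD hook may never see this call.

// jit_options.c -- force a JIT optimization level on driver-API module loads
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdint.h>
#include <cuda.h>

typedef CUresult (*cuModuleLoadDataEx_t)(CUmodule*, const void*, unsigned int,
                                         CUjit_option*, void**);
static cuModuleLoadDataEx_t real_cuModuleLoadDataEx = NULL;

CUresult cuModuleLoadDataEx(CUmodule* module, const void* image,
                            unsigned int numOptions, CUjit_option* options,
                            void** optionValues) {
    if (!real_cuModuleLoadDataEx)
        real_cuModuleLoadDataEx =
            (cuModuleLoadDataEx_t)dlsym(RTLD_NEXT, "cuModuleLoadDataEx");

    (void)numOptions; (void)options; (void)optionValues;

    // Replace the caller's JIT options and force optimization level 0 so the
    // JIT-compiled code is easier to correlate with the PTX while debugging.
    CUjit_option my_options[1] = { CU_JIT_OPTIMIZATION_LEVEL };
    void* my_values[1] = { (void*)(uintptr_t)0 };

    return real_cuModuleLoadDataEx(module, image, 1, my_options, my_values);
}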

Example: Error Resilience Extension

// error_resilience.c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Original function pointers
typedef cudaError_t (*cudaLaunchKernel_t)(const void*, dim3, dim3, void**, size_t, cudaStream_t);
static cudaLaunchKernel_t real_cudaLaunchKernel = NULL;

// Track kernel launches for retry
struct KernelInfo {
    const void* func;
    dim3 gridDim;
    dim3 blockDim;
    void** args;  // Note: This is unsafe without deep-copying args
    size_t sharedMem;
    int retries;
};

#define MAX_TRACKED_KERNELS 100
static struct KernelInfo kernel_history[MAX_TRACKED_KERNELS];
static int kernel_count = 0;

// Enhanced kernel launch with automatic retry
cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim, 
                            void** args, size_t sharedMem, cudaStream_t stream) {
    if (!real_cudaLaunchKernel)
        real_cudaLaunchKernel = (cudaLaunchKernel_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    // Save kernel info for potential retry (shallow copy: args would need to
    // be deep-copied for a retry to be generally safe)
    int current_kernel = -1;
    if (kernel_count < MAX_TRACKED_KERNELS) {
        current_kernel = kernel_count++;
        kernel_history[current_kernel].func = func;
        kernel_history[current_kernel].gridDim = gridDim;
        kernel_history[current_kernel].blockDim = blockDim;
        kernel_history[current_kernel].args = args;
        kernel_history[current_kernel].sharedMem = sharedMem;
        kernel_history[current_kernel].retries = 0;
    }

    // Launch kernel
    cudaError_t result = real_cudaLaunchKernel(func, gridDim, blockDim, 
                                              args, sharedMem, stream);

    // Check for errors and retry if needed
    if (result != cudaSuccess) {
        printf("Kernel launch failed: %s\n", cudaGetErrorString(result));

        if (current_kernel >= 0 && kernel_history[current_kernel].retries < 3) {
            printf("Retrying kernel launch (attempt %d)...\n", 
                   kernel_history[current_kernel].retries + 1);

            // Reset the device to try to recover. Caution: cudaDeviceReset()
            // destroys all existing allocations and contexts, so this blind
            // retry is illustrative only.
            cudaDeviceReset();

            // Increment retry count
            kernel_history[current_kernel].retries++;

            // Retry the launch
            result = real_cudaLaunchKernel(func, gridDim, blockDim, 
                                         args, sharedMem, stream);
        }
    }

    return result;
}

Kernel Scheduling Manipulation

What Can Be Modified

  1. Kernel Prioritization:
     • Assigning execution priorities
     • Preemption control (where supported)
     • Execution ordering

  2. Stream Management:
     • Custom stream creation and synchronization
     • Work distribution across streams
     • Dependency management

  3. Concurrent Kernel Execution:
     • Controlling parallel kernel execution
     • Resource partitioning between kernels

Does It Require GPU Code?

Most scheduling manipulations can be done from the CPU side, but fine-grained control may require GPU code:

CPU-Side Only (No GPU Code Required):

  • Stream creation and management
  • Basic priority settings (see the sketch below)
  • Kernel launch ordering

Requiring GPU Code:

  • Dynamic workload balancing within the GPU
  • Fine-grained synchronization between kernels
  • Custom scheduling algorithms within kernels
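As an example of a CPU-side priority setting (the "Basic priority settings" item above), this minimal sketch intercepts cudaStreamCreate and substitutes the highest-priority stream the device supports. Giving every stream the same elevated priority is pointless as a real policy; deciding which streams deserve priority is application-specific.

// priority_streams.c -- give every created stream the highest available priority
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>

typedef cudaError_t (*cudaStreamCreate_t)(cudaStream_t*);
static cudaStreamCreate_t real_cudaStreamCreate = NULL;

cudaError_t cudaStreamCreate(cudaStream_t* pStream) {
    if (!real_cudaStreamCreate)
        real_cudaStreamCreate = (cudaStreamCreate_t)dlsym(RTLD_NEXT, "cudaStreamCreate");

    // Query the valid priority range; the "greatest" priority is the
    // numerically lowest value and schedules ahead of default-priority work.
    int leastPriority = 0, greatestPriority = 0;
    if (cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority) != cudaSuccess)
        return real_cudaStreamCreate(pStream);

    // Create the stream with the highest priority instead of the default.
    return cudaStreamCreateWithPriority(pStream, cudaStreamDefault, greatestPriority);
}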

Example: Priority-Based Scheduler

// priority_scheduler.cpp (uses the C++ standard library, so compile as C++)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <queue>
#include <mutex>
#include <thread>
#include <condition_variable>

// Original function pointers
typedef cudaError_t (*cudaLaunchKernel_t)(const void*, dim3, dim3, void**, size_t, cudaStream_t);
static cudaLaunchKernel_t real_cudaLaunchKernel = NULL;

// Kernel task with priority
struct KernelTask {
    const void* func;
    dim3 gridDim;
    dim3 blockDim;
    void** args;
    size_t sharedMem;
    cudaStream_t stream;
    int priority;  // Higher number = higher priority

    bool operator<(const KernelTask& other) const {
        return priority < other.priority; // Priority queue is max-heap
    }
};

// Priority queue for kernels
std::priority_queue<KernelTask> kernel_queue;
std::mutex queue_mutex;
std::condition_variable queue_condition;
bool scheduler_running = false;
std::thread scheduler_thread;

// Scheduler function that runs in background
void scheduler_function() {
    while (scheduler_running) {
        KernelTask task;

        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            queue_condition.wait(lock, []{
                return !kernel_queue.empty() || !scheduler_running;
            });

            if (!scheduler_running) break;

            task = kernel_queue.top();
            kernel_queue.pop();
        }

        // Launch the highest priority kernel
        real_cudaLaunchKernel(task.func, task.gridDim, task.blockDim, 
                             task.args, task.sharedMem, task.stream);
    }
}

// Start scheduler if not running
void ensure_scheduler_running() {
    if (!scheduler_running) {
        scheduler_running = true;
        scheduler_thread = std::thread(scheduler_function);
    }
}

// Priority-based kernel launch
cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim, 
                            void** args, size_t sharedMem, cudaStream_t stream) {
    if (!real_cudaLaunchKernel)
        real_cudaLaunchKernel = (cudaLaunchKernel_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    ensure_scheduler_running();

    // Determine kernel priority (example: based on grid size)
    int priority = gridDim.x * gridDim.y * gridDim.z;

    // Create task and add to queue
    KernelTask task = {func, gridDim, blockDim, args, sharedMem, stream, priority};

    {
        std::lock_guard<std::mutex> lock(queue_mutex);
        kernel_queue.push(task);
    }

    queue_condition.notify_one();

    return cudaSuccess; // Note: This returns before kernel actually launches
}

// Clean up scheduler on program exit
__attribute__((destructor)) void cleanup_scheduler() {
    if (scheduler_running) {
        scheduler_running = false;
        queue_condition.notify_all();
        scheduler_thread.join();
    }
}

Multi-GPU Distribution

What Can Be Modified

  1. Workload Distribution:
     • Automatic work partitioning across GPUs
     • Load balancing based on GPU capabilities
     • Data locality optimization

  2. Memory Management Across GPUs:
     • Transparent data mirroring
     • Cross-GPU memory access optimization
     • Unified memory enhancements

  3. Synchronization Strategies:
     • Custom barriers and synchronization points
     • Communication optimization
     • Dependency management

Does It Require GPU Code?

Basic multi-GPU support can be implemented through API interception, but efficient implementations typically require GPU-side modifications:

CPU-Side Only (No GPU Code Required):

  • Basic work distribution
  • Memory allocation across GPUs (and enabling peer access between them; see the sketch below)
  • High-level synchronization

Requiring GPU Code:

  • Efficient inter-GPU communication
  • Custom data sharing mechanisms
  • GPU-side workload balancing
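One CPU-side building block for cross-GPU memory access (referenced in the list above) is enabling peer-to-peer access, which lets a kernel on one GPU dereference memory allocated on another when the hardware topology allows it. The initialization sketch below could live in an interception library's constructor; it uses nothing beyond the standard runtime API.

// peer_access_init.c -- enable peer-to-peer access between all GPU pairs
#include <stdio.h>
#include <cuda_runtime.h>

// Runs when the interception library is loaded into the process (before main).
__attribute__((constructor))
static void enable_all_peer_access(void) {
    int num_gpus = 0;
    if (cudaGetDeviceCount(&num_gpus) != cudaSuccess) return;

    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(i);
        for (int j = 0; j < num_gpus; j++) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            if (!can_access) continue;
            // Allow the current device (i) to map and access allocations on device j.
            cudaError_t err = cudaDeviceEnablePeerAccess(j, 0);
            if (err != cudaSuccess && err != cudaErrorPeerAccessAlreadyEnabled)
                fprintf(stderr, "P2P %d->%d failed: %s\n", i, j, cudaGetErrorString(err));
        }
    }
    cudaSetDevice(0);  // restore a default device for the application
}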

Example: Simple Multi-GPU Distributor

// multi_gpu_distributor.cpp (uses the C++ standard library, so compile as C++)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <vector>
#include <map>

// Original function pointers
typedef cudaError_t (*cudaLaunchKernel_t)(const void*, dim3, dim3, void**, size_t, cudaStream_t);
static cudaLaunchKernel_t real_cudaLaunchKernel = NULL;

// Track available GPUs
static int num_gpus = 0;
static std::vector<cudaStream_t> gpu_streams;
static std::map<void*, std::vector<void*>> memory_mirrors;
static int next_gpu = 0;

// Initialize multi-GPU environment
void init_multi_gpu() {
    if (num_gpus > 0) return; // Already initialized

    cudaGetDeviceCount(&num_gpus);
    if (num_gpus <= 1) num_gpus = 1; // Fall back to single GPU

    // Create a stream for each GPU
    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(i);
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        gpu_streams.push_back(stream);
    }
}

// Distributed kernel launch
cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim, dim3 blockDim, 
                            void** args, size_t sharedMem, cudaStream_t stream) {
    if (!real_cudaLaunchKernel)
        real_cudaLaunchKernel = (cudaLaunchKernel_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    init_multi_gpu();

    if (num_gpus <= 1) {
        // Single GPU mode
        return real_cudaLaunchKernel(func, gridDim, blockDim, args, sharedMem, stream);
    }

    // Simple round-robin distribution
    int gpu_id = next_gpu;
    next_gpu = (next_gpu + 1) % num_gpus;

    cudaSetDevice(gpu_id);

    // Adjust grid dimensions for multi-GPU.
    // Caution: shrinking the grid alone makes every GPU process the same
    // leading portion of the data; a real distributor would also offset the
    // kernel arguments so each GPU gets its own partition.
    dim3 adjusted_grid = gridDim;
    adjusted_grid.x = (gridDim.x + num_gpus - 1) / num_gpus; // Divide work per GPU

    // Launch on selected GPU
    return real_cudaLaunchKernel(func, adjusted_grid, blockDim, 
                               args, sharedMem, gpu_streams[gpu_id]);
}

// Memory allocation with mirroring
cudaError_t cudaMalloc(void** devPtr, size_t size) {
    static auto real_cudaMalloc = (cudaError_t(*)(void**, size_t))dlsym(RTLD_NEXT, "cudaMalloc");

    init_multi_gpu();

    if (num_gpus <= 1) {
        return real_cudaMalloc(devPtr, size);
    }

    // Allocate on primary GPU
    cudaSetDevice(0);
    cudaError_t result = real_cudaMalloc(devPtr, size);
    if (result != cudaSuccess) return result;

    // Allocate mirrors on other GPUs
    std::vector<void*> mirrors;
    mirrors.push_back(*devPtr); // Original pointer

    for (int i = 1; i < num_gpus; i++) {
        cudaSetDevice(i);
        void* mirror_ptr;
        result = real_cudaMalloc(&mirror_ptr, size);
        if (result != cudaSuccess) {
            // Clean up on failure
            for (void* ptr : mirrors) {
                cudaFree(ptr);
            }
            return result;
        }
        mirrors.push_back(mirror_ptr);
    }

    // Store mirrors for later use
    memory_mirrors[*devPtr] = mirrors;

    return cudaSuccess;
}

Comparison with eBPF Capabilities

eBPF provides dynamic instrumentation capabilities for CPUs that don't have exact equivalents in the GPU world, but we can compare approaches:

| eBPF Capability | GPU Equivalent | Implementation Complexity | Limitations |
|---|---|---|---|
| Dynamic code loading | JIT compilation | High | Requires specialized tools |
| Kernel instrumentation | API interception | Medium | Limited to API boundaries |
| Process monitoring | CUPTI callbacks | Medium | Limited visibility into kernels |
| Network packet filtering | N/A | N/A | No direct equivalent |
| Performance monitoring | NVTX, CUPTI | Low | External profiling needed |
| Security enforcement | API validation | Medium | Limited enforcement points |
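To make the "CUPTI callbacks" rows concrete, here is a minimal sketch that subscribes to runtime API callbacks and logs every CUDA runtime call the application makes, roughly the GPU analogue of attaching a BPF probe to syscall entry points. It assumes the library is injected into the process (for example via LD_PRELOAD) and linked against CUPTI.

// cupti_trace.c -- log every CUDA runtime API call via CUPTI callbacks
#include <stdio.h>
#include <cupti.h>

static void CUPTIAPI on_runtime_api(void* userdata, CUpti_CallbackDomain domain,
                                    CUpti_CallbackId cbid, const void* cbdata) {
    const CUpti_CallbackData* info = (const CUpti_CallbackData*)cbdata;
    (void)userdata; (void)domain; (void)cbid;

    // Each runtime API call triggers one callback on entry and one on exit.
    if (info->callbackSite == CUPTI_API_ENTER)
        fprintf(stderr, "[cupti] enter %s\n", info->functionName);
}

// Subscribe when the library is loaded into the target process.
__attribute__((constructor))
static void init_cupti_trace(void) {
    CUpti_SubscriberHandle subscriber;
    if (cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)on_runtime_api, NULL) != CUPTI_SUCCESS)
        return;
    // Enable callbacks for the entire runtime API domain.
    cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);
}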

Key Differences

  1. Runtime Safety Guarantees:
     • eBPF: Static verification ensures program safety
     • GPU: No equivalent safety verification for dynamic code

  2. Scope of Observation:
     • eBPF: System-wide visibility across processes
     • GPU: Limited to a single application or the driver level

  3. Privilege Requirements:
     • eBPF: Requires varying levels of privileges
     • GPU: API interception usually requires no special privileges

  4. Integration with Hardware:
     • eBPF: Deep integration with the CPU and OS
     • GPU: Limited by vendor-provided interfaces

Case Studies

Case Study 1: Transparent Multi-GPU Acceleration

Challenge: Accelerate single-GPU applications to use multiple GPUs without code changes.

Solution: An API interception library that:

  1. Intercepts memory allocation and kernel launches
  2. Divides data across available GPUs
  3. Rewrites kernel launches to process data partitions
  4. Gathers results back to the primary GPU

Example results (illustrative, not measured):

  • Speedup of 1.8x on 2 GPUs for memory-bound applications
  • Limited scaling for compute-bound applications due to synchronization overhead
  • No source code changes required

Case Study 2: Adaptive Memory Management

Challenge: Reduce memory allocation overhead and fragmentation in deep learning frameworks.

Solution: A memory pooling extension that:

  1. Intercepts all CUDA memory allocations
  2. Maintains pools of pre-allocated memory
  3. Implements custom allocation strategies based on usage patterns
  4. Defers actual freeing until memory pressure requires it

Example results (illustrative, not measured):

  • 30% reduction in training time for models with many small tensor allocations
  • 15% reduction in peak memory usage through better fragmentation management
  • Compatible with existing frameworks without source changes

Future Directions

The landscape of GPU extension mechanisms continues to evolve:

  1. Hardware-Level Extensibility:
     • GPU vendors may provide more hooks for custom runtime behaviors
     • Hardware support for secure, dynamic code loading (a GPU equivalent of eBPF)

  2. Unified Programming Models:
     • SYCL, oneAPI, and similar frameworks may provide more extension points
     • Heterogeneous programming models that span CPU and GPU

  3. OS-Level GPU Resource Management:
     • Integration of GPU resources into OS scheduling frameworks
     • Fine-grained control of GPU resources at the OS level

  4. AI-Assisted Extensions:
     • Automated optimization systems that dynamically modify GPU application behavior
     • Machine learning models that predict and adapt to application requirements

References

  1. NVIDIA. "CUDA Driver API." https://docs.nvidia.com/cuda/cuda-driver-api/
  2. Gregg, Brendan. "BPF Performance Tools: Linux System and Application Observability." Addison-Wesley Professional, 2019.
  3. NVIDIA. "CUPTI: CUDA Profiling Tools Interface." https://docs.nvidia.com/cuda/cupti/
