CUDA GPU Profiling and Tracing

This document provides a comprehensive guide to profiling and tracing CUDA applications to identify performance bottlenecks and optimize GPU code execution.

You can find the accompanying code at https://github.com/eunomia-bpf/basic-cuda-tutorial

Table of Contents

  1. Introduction to GPU Profiling
  2. Profiling Tools
  3. Key Performance Metrics
  4. Profiling Methodology
  5. Common Performance Bottlenecks
  6. Tracing Techniques
  7. Example Application
  8. Best Practices
  9. Further Reading

Introduction to GPU Profiling

GPU profiling is the process of measuring and analyzing the performance characteristics of GPU applications. It helps developers:

  • Identify performance bottlenecks
  • Optimize resource utilization
  • Understand execution patterns
  • Validate optimization decisions
  • Ensure scalability across different hardware

Effective profiling is essential for high-performance CUDA applications: GPU architectures are complex enough that intuition alone rarely identifies the right optimizations.

Profiling Tools

NVIDIA Nsight Systems

Nsight Systems is a system-wide performance analysis tool that provides insights into CPU and GPU execution:

  • System-level tracing: CPU, GPU, memory, and I/O activities
  • Timeline visualization: Shows kernel execution, memory transfers, and CPU activity
  • API trace: Captures CUDA API calls and their durations
  • Low overhead: Suitable for production code profiling
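
A typical invocation traces the whole application and writes a report file that can be opened in the GUI; ./myapp below is a placeholder for your binary:

# Trace CUDA API calls, kernels, memory transfers, and NVTX ranges
nsys profile --trace=cuda,nvtx -o report ./myapp

# Print summary statistics from the collected report
nsys stats report.nsys-rep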

NVIDIA Nsight Compute

Nsight Compute is an interactive kernel profiler for CUDA applications:

  • Detailed kernel metrics: SM utilization, memory throughput, instruction mix
  • Guided analysis: Provides optimization recommendations
  • Roofline analysis: Shows performance relative to hardware limits
  • Kernel comparison: Compare kernels across runs or hardware platforms
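
Illustrative invocations (again with ./myapp as a placeholder); narrowing the kernel selection keeps profiling overhead manageable:

# Collect the full metric set for every kernel launch
ncu --set full -o profile ./myapp

# Profile only kernels whose names match computeKernel
ncu --kernel-name computeKernel ./myapp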

NVIDIA Visual Profiler and nvprof

Legacy tools, deprecated in favor of the Nsight family but still useful with older CUDA toolkits and GPUs:

  • nvprof: Command-line profiler with low overhead
  • Visual Profiler: GUI-based analysis tool
  • CUDA profiling APIs: Allow programmatic access to profiling data

Other Tools

  • Compute Sanitizer: Memory access checking and race detection
  • CUPTI: CUDA Profiling Tools Interface for custom profilers
  • PyTorch/TensorFlow Profilers: Framework-specific profiling for deep learning

Key Performance Metrics

Execution Metrics

  1. SM Occupancy: Ratio of active warps to the maximum possible warps per SM
     • Higher values generally enable better latency hiding
     • Target: >50% for most applications (see the occupancy sketch after this list)

  2. Warp Execution Efficiency: Percentage of threads active during execution
     • Lower values indicate branch divergence
     • Target: >80% for compute-bound kernels

  3. Instruction Throughput:
     • Instructions per clock (IPC)
     • Arithmetic intensity (operations per byte)
     • Mix of instruction types
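
Occupancy can also be estimated in code via the CUDA occupancy API. A minimal sketch, assuming the computeKernel defined later in this document and a block size of 256:

int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &maxBlocksPerSM, computeKernel, 256 /* block size */, 0 /* dynamic smem */);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// Occupancy = active warps per SM / maximum warps per SM
float activeWarps = maxBlocksPerSM * 256.0f / prop.warpSize;
float maxWarps    = (float)prop.maxThreadsPerMultiProcessor / prop.warpSize;
printf("Estimated occupancy: %.1f%%\n", 100.0f * activeWarps / maxWarps);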

Memory Metrics

  1. Memory Throughput:
     • Global memory read/write bandwidth
     • Shared memory bandwidth
     • L1/L2 cache hit rates
     • Target: As close to peak hardware bandwidth as possible (see the bandwidth sketch after this list)

  2. Memory Access Patterns:
     • Load/store efficiency
     • Global memory coalescing rate
     • Shared memory bank conflicts

  3. Data Transfer:
     • Host-device transfer bandwidth
     • PCIe utilization
     • NVLink utilization (if available)
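
Effective bandwidth is easy to derive from event timing. A sketch, assuming a kernel that reads and writes n floats and an elapsed time of milliseconds as measured with CUDA events (see the Tracing section):

// One float read and one float written per element
double bytesMoved = 2.0 * n * sizeof(float);
double gbps = (bytesMoved / (milliseconds / 1000.0)) / 1e9;
printf("Effective bandwidth: %.2f GB/s\n", gbps);
// Compare against the device's peak bandwidth from its datasheet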

Compute Metrics

  1. Compute Utilization:
     • SM activity
     • Tensor/RT core utilization (if used)
     • Instruction mix (FP32, FP64, INT, etc.)

  2. Compute Efficiency:
     • Achieved vs. theoretical FLOPS
     • Resource limitations (compute-bound vs. memory-bound)
     • Roofline model position (a worked example follows this list)
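
As a worked example, the computeKernel shown later evaluates x * x + x + 1.0f, about 3 floating-point operations (2 with a fused multiply-add), while moving 8 bytes per element (one float read, one written). Its arithmetic intensity is therefore roughly 3 / 8 ≈ 0.4 FLOP/byte, far below the ridge point of any modern GPU, so the roofline model places it firmly in the memory-bound region.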

Profiling Methodology

A structured approach to profiling CUDA applications:

1. Initial Assessment

  • Start with high-level system profiling (Nsight Systems)
  • Identify time distribution between CPU, GPU, and data transfers
  • Look for obvious bottlenecks like excessive synchronization or transfers

2. Kernel Analysis

  • Profile individual kernels (Nsight Compute)
  • Identify the most time-consuming kernels
  • Collect key metrics for these kernels

3. Bottleneck Identification

  • Determine if kernels are compute-bound or memory-bound
  • Use the roofline model to understand performance limiters
  • Check for specific inefficiencies (divergence, non-coalesced access)

4. Guided Optimization

  • Address the most significant bottlenecks first
  • Make one change at a time and measure impact
  • Compare before/after profiles to validate improvements

5. Iterative Refinement

  • Repeat the process for the next bottleneck
  • Re-profile the entire application periodically
  • Continue until performance goals are met

Common Performance Bottlenecks

Kernel-Level Issues

  1. Non-coalesced Memory Access:
     • Symptoms: Low global memory load/store efficiency
     • Solution: Reorganize data layout or access patterns (see the sketch after this list)

  2. Shared Memory Bank Conflicts:
     • Symptoms: Low shared memory bandwidth
     • Solution: Adjust padding or access patterns

  3. Excessive Global Memory Access:
     • Symptoms: High memory dependency
     • Solution: Increase data reuse through shared memory or registers

  4. Warp Divergence:
     • Symptoms: Low warp execution efficiency
     • Solution: Reorganize algorithms to minimize divergent paths

  5. Low Occupancy:
     • Symptoms: SM occupancy below 50%
     • Solution: Reduce register/shared memory usage or adjust block size

  6. Kernel Launch Overhead:
     • Symptoms: Many small, short-duration kernels
     • Solution: Kernel fusion or persistent kernels
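
To make the first bottleneck concrete, here is a sketch contrasting a coalesced copy with a strided one; in the strided version each warp's accesses are scattered across many memory segments, multiplying the number of transactions:

// Coalesced: consecutive threads access consecutive addresses
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];
}

// Non-coalesced: consecutive threads access addresses 'stride' elements apart
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int src = (int)(((long long)idx * stride) % n);  // scattered source index
    if (idx < n) out[idx] = in[src];
}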

System-Level Issues

  1. Excessive Host-Device Transfers:
     • Symptoms: High PCIe utilization, many transfer operations
     • Solution: Batch transfers, use pinned memory, or unified memory

  2. CPU-GPU Synchronization:
     • Symptoms: GPU idle periods between kernels
     • Solution: Use CUDA streams and asynchronous operations (see the sketch after this list)

  3. Underutilized GPU Resources:
     • Symptoms: Low overall GPU utilization
     • Solution: Concurrent kernels, streams, or a larger problem size
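
A minimal sketch of hiding transfer latency with two streams and pinned memory; d_in, d_out, N, and computeKernel are assumed to be set up as elsewhere in this document:

cudaStream_t streams[2];
for (int i = 0; i < 2; i++) cudaStreamCreate(&streams[i]);

float *h_buf;
cudaMallocHost(&h_buf, N * sizeof(float)); // pinned memory enables async copies

int chunk = N / 2;
for (int i = 0; i < 2; i++) {
    int off = i * chunk;
    // Copy in, compute, and copy out are pipelined across the two streams
    cudaMemcpyAsync(d_in + off, h_buf + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    computeKernel<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(
        d_in + off, d_out + off, chunk);
    cudaMemcpyAsync(h_buf + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();
for (int i = 0; i < 2; i++) cudaStreamDestroy(streams[i]);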

Tracing Techniques

Tracing provides a timeline view of application execution:

CUDA Events

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                  // enqueue start timestamp on the stream
myKernel<<<grid, block>>>(data);
cudaEventRecord(stop);                   // enqueue stop timestamp after the kernel
cudaEventSynchronize(stop);              // block the host until 'stop' completes

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);

NVTX Markers and Ranges

NVIDIA Tools Extension (NVTX) allows custom annotations:

#include <nvtx3/nvToolsExt.h>

// Mark an instantaneous event
nvtxMark("Interesting point");

// Begin a range
nvtxRangePushA("Data preparation");
// ... code ...
nvtxRangePop(); // End the range

// Colored range for better visibility
nvtxEventAttributes_t eventAttrib = {0};
eventAttrib.version = NVTX_VERSION;
eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
eventAttrib.colorType = NVTX_COLOR_ARGB;
eventAttrib.color = 0xFF00FF00; // Green
eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
eventAttrib.message.ascii = "Kernel Execution";
nvtxRangePushEx(&eventAttrib);
myKernel<<<grid, block>>>(data);
nvtxRangePop();
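
Ranges and marks appear on their own row in the Nsight Systems timeline. The NVTX v3 headers used above are header-only, so no additional library needs to be linked; older nvToolsExt releases required linking against libnvToolsExt.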

Programmatic Profiling with CUPTI

CUDA Profiling Tools Interface (CUPTI) enables programmatic access to profiling data:

// Simplified CUPTI usage example
#include <cupti.h>

// Invoked on entry/exit of every subscribed CUDA runtime API call
void CUPTIAPI callbackHandler(void *userdata, CUpti_CallbackDomain domain,
                              CUpti_CallbackId cbid, const void *cbInfo) {
    // Inspect cbid / cbInfo here, e.g. record timestamps per API call
}

// Initialize CUPTI and register a callback for cudaLaunch
void initCuptiTracing() {
    CUpti_SubscriberHandle subscriber;
    cuptiSubscribe(&subscriber, callbackHandler, NULL);
    cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API,
                        CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020);
}
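
Besides callbacks, CUPTI also provides an Activity API that asynchronously buffers timestamped records of kernels, memory copies, and other GPU activity; tools such as Nsight Systems are built on top of these interfaces.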

Example Application

The accompanying basic08.cu demonstrates:

  1. Basic kernel timing: Using CUDA events
  2. NVTX annotations: Adding markers and ranges
  3. Memory transfer profiling: Analyzing host-device transfers
  4. Kernel optimization: Comparing different implementation strategies
  5. Interpreting profiling data: Making optimization decisions

Key Code Sections

Basic kernel timing:

// Note: SIZE is assumed to be defined elsewhere in basic08.cu
// (e.g., #define SIZE (1 << 20) for a standalone build)
__global__ void computeKernel(float *input, float *output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = input[idx];
        // Perform computation
        float result = x * x + x + 1.0f;
        output[idx] = result;
    }
}

void timeKernel() {
    // Allocate memory
    float *d_input, *d_output;
    cudaMalloc(&d_input, SIZE * sizeof(float));
    cudaMalloc(&d_output, SIZE * sizeof(float));

    // Initialize data
    float *h_input = new float[SIZE];
    for (int i = 0; i < SIZE; i++) h_input[i] = i;

    cudaMemcpy(d_input, h_input, SIZE * sizeof(float), cudaMemcpyHostToDevice);

    // Timing with events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up run
    computeKernel<<<(SIZE + 255) / 256, 256>>>(d_input, d_output, SIZE);

    // Timed run
    cudaEventRecord(start);
    computeKernel<<<(SIZE + 255) / 256, 256>>>(d_input, d_output, SIZE);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    printf("Kernel execution time: %f ms\n", milliseconds);

    // Cleanup
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    delete[] h_input;
    cudaFree(d_input);
    cudaFree(d_output);
}
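
For more stable measurements, a common refinement is to launch the kernel many times (e.g., 100 iterations) between the two events and divide the elapsed time by the iteration count, averaging out launch-to-launch jitter.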

Best Practices

Profiling Workflow

  1. Start with high-level profiling before diving into details
  2. Establish baselines for important kernels
  3. Profile regularly during development, not just at the end
  4. Automate profiling where possible for regression testing
  5. Compare across hardware to ensure portability

Tool Selection

  1. Nsight Systems for system-level analysis and timeline
  2. Nsight Compute for detailed kernel metrics
  3. NVTX markers for custom annotations
  4. CUDA events for lightweight timing measurements

Optimization Approach

  1. Focus on hotspots: Address the most time-consuming operations first
  2. Use roofline analysis: Understand theoretical limits
  3. Balance efforts: Don't over-optimize less critical sections
  4. Consider trade-offs: Sometimes readability outweighs minor performance gains
  5. Document insights: Record profiling discoveries for future reference

Further Reading
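
  • CUDA C++ Best Practices Guide: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
  • Nsight Systems documentation: https://docs.nvidia.com/nsight-systems/
  • Nsight Compute documentation: https://docs.nvidia.com/nsight-compute/
  • CUPTI documentation: https://docs.nvidia.com/cupti/
  • NVTX: https://github.com/NVIDIA/NVTX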
