CUPTI User-Range Profiling Tutorial

The GitHub repo and complete tutorial is available at https://github.com/eunomia-bpf/cupti-tutorial.

Introduction

When profiling CUDA applications, you often need to focus on specific sections of code rather than entire kernels or the complete application. CUPTI's User-Range Profiling feature allows you to define custom ranges in your code and collect performance metrics only within these ranges. This gives you precise control over which parts of your application are profiled, making it easier to identify and optimize performance bottlenecks in complex applications. This tutorial demonstrates how to use CUPTI's Profiler API to define custom ranges and collect performance metrics within them.

What You'll Learn

How to define and instrument user-specified ranges in your CUDA code
Setting up the CUPTI Profiler API for user-range profiling
Collecting GPU performance metrics within your defined ranges
Processing and analyzing the collected metrics for each range
Comparing performance across different parts of your application

Understanding User-Range Profiling

Unlike automatic profiling that targets individual CUDA API calls or kernels, user-range profiling lets you:

Define logical sections of your code that may include multiple CUDA operations
Give meaningful names to these sections for easier analysis
Focus profiling resources on the most important parts of your application
Compare performance metrics across different algorithmic approaches

This approach is particularly useful when: - Your application has distinct phases with different performance characteristics - You want to compare different implementations of the same algorithm - You need to profile a specific sequence of operations as a single unit - You're optimizing a particular section of a larger application

Code Walkthrough

1. Setting Up the CUPTI Profiler

First, we need to initialize the CUPTI Profiler API and configure it for user-range profiling:

CUpti_Profiler_Initialize_Params initializeParams = {CUpti_Profiler_Initialize_Params_STRUCT_SIZE};
CUPTI_API_CALL(cuptiProfilerInitialize(&initializeParams));

// Get the chip name for the current device
CUpti_Device_GetChipName_Params getChipNameParams = {CUpti_Device_GetChipName_Params_STRUCT_SIZE};
getChipNameParams.deviceIndex = 0;
CUPTI_API_CALL(cuptiDeviceGetChipName(&getChipNameParams));
const char *chipName = getChipNameParams.pChipName;

// Create the metric configuration
NVPW_InitializeHost_Params initializeHostParams = {NVPW_InitializeHost_Params_STRUCT_SIZE};
NVPW_API_CALL(NVPW_InitializeHost(&initializeHostParams));

// Set up the metrics to collect
const char *metricNames[] = {"smsp__warps_launched.avg"};
struct MetricNameList metricList;
metricList.numMetrics = 1;
metricList.metricNames = metricNames;

// Create the counter data image and configuration
CUpti_Profiler_CounterDataImageOptions counterDataImageOptions = {
    CUpti_Profiler_CounterDataImageOptions_STRUCT_SIZE};
counterDataImageOptions.pChipName = chipName;
counterDataImageOptions.counterDataImageSize = 0;
CUPTI_API_CALL(cuptiProfilerCounterDataImageCalculateSize(&counterDataImageOptions));
counterDataImage = (uint8_t *)malloc(counterDataImageOptions.counterDataImageSize);
counterDataImageOptions.pCounterDataImage = counterDataImage;
CUPTI_API_CALL(cuptiProfilerCounterDataImageInitialize(&counterDataImageOptions));

// Create the counter data configuration
CUpti_Profiler_CounterDataImageCalculateScratchBufferSize_Params scratchBufferSizeParams = {
    CUpti_Profiler_CounterDataImageCalculateScratchBufferSize_Params_STRUCT_SIZE};
scratchBufferSizeParams.pOptions = &counterDataImageOptions;
CUPTI_API_CALL(cuptiProfilerCounterDataImageCalculateScratchBufferSize(&scratchBufferSizeParams));

// Configure the profiler for user-range profiling
CUpti_Profiler_UserRange_Config_Params configParams = {
    CUpti_Profiler_UserRange_Config_Params_STRUCT_SIZE};
configParams.pCounterDataPrefixImage = counterDataImage;
configParams.counterDataPrefixImageSize = counterDataImageOptions.counterDataImageSize;
configParams.maxRangesPerPass = 1;
configParams.maxLaunchesPerPass = 1;
CUPTI_API_CALL(cuptiProfilerUserRangeConfigureScratchBuffer(&configParams));

This code: 1. Initializes the CUPTI Profiler API 2. Gets the chip name for the current device 3. Sets up the metrics we want to collect (in this case, "smsp__warps_launched.avg") 4. Creates and configures the counter data image 5. Configures the profiler specifically for user-range profiling

2. Defining User Ranges in Your Code

Next, we define ranges around the code sections we want to profile:

void profileVectorOperations(int *d_A, int *d_B, int *d_C, int numElements)
{
    // Define the range name
    const char *rangeName = "Vector Add-Subtract";

    // Start a user range
    CUpti_Profiler_BeginRange_Params beginRangeParams = {CUpti_Profiler_BeginRange_Params_STRUCT_SIZE};
    beginRangeParams.pRangeName = rangeName;
    CUPTI_API_CALL(cuptiProfilerBeginRange(&beginRangeParams));

    // Execute operations within the range
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("Launching kernel: blocks %d, thread/block %d\n", blocksPerGrid, threadsPerBlock);

    // First vector operation: Add
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // Second vector operation: Subtract
    VecSub<<<blocksPerGrid, threadsPerBlock>>>(d_C, d_B, d_C, numElements);

    // End the user range
    CUpti_Profiler_EndRange_Params endRangeParams = {CUpti_Profiler_EndRange_Params_STRUCT_SIZE};
    CUPTI_API_CALL(cuptiProfilerEndRange(&endRangeParams));
}

This function: 1. Defines a meaningful name for the range ("Vector Add-Subtract") 2. Starts a user range with cuptiProfilerBeginRange 3. Executes multiple GPU operations within the range (two kernel launches) 4. Ends the range with cuptiProfilerEndRange

3. Starting and Stopping the Profiler Session

We need to start the profiler before our ranges and stop it after:

int main(int argc, char *argv[])
{
    // Initialize CUDA and CUPTI
    initializeCuda();
    initializeProfiler();

    // Allocate and initialize data
    int *h_A, *h_B, *h_C;
    int *d_A, *d_B, *d_C;
    // ... allocation and initialization code ...

    // Start the profiler session
    CUpti_Profiler_BeginSession_Params beginSessionParams = {CUpti_Profiler_BeginSession_Params_STRUCT_SIZE};
    beginSessionParams.counterDataImageSize = counterDataImageOptions.counterDataImageSize;
    beginSessionParams.pCounterDataImage = counterDataImage;
    beginSessionParams.maxRangesPerPass = 1;
    beginSessionParams.maxLaunchesPerPass = 1;
    CUPTI_API_CALL(cuptiProfilerBeginSession(&beginSessionParams));

    // Enable profiling
    CUpti_Profiler_EnableProfiling_Params enableProfilingParams = {
        CUpti_Profiler_EnableProfiling_Params_STRUCT_SIZE};
    CUPTI_API_CALL(cuptiProfilerEnableProfiling(&enableProfilingParams));

    // Execute our profiled code with user ranges
    profileVectorOperations(d_A, d_B, d_C, numElements);

    // Disable profiling
    CUpti_Profiler_DisableProfiling_Params disableProfilingParams = {
        CUpti_Profiler_DisableProfiling_Params_STRUCT_SIZE};
    CUPTI_API_CALL(cuptiProfilerDisableProfiling(&disableProfilingParams));

    // End the profiler session
    CUpti_Profiler_EndSession_Params endSessionParams = {CUpti_Profiler_EndSession_Params_STRUCT_SIZE};
    CUPTI_API_CALL(cuptiProfilerEndSession(&endSessionParams));

    // Process the results
    processCounterData();

    // Clean up
    // ... cleanup code ...

    return 0;
}

This code: 1. Initializes CUDA and the CUPTI profiler 2. Starts a profiler session with cuptiProfilerBeginSession 3. Enables profiling with cuptiProfilerEnableProfiling 4. Executes the code containing our user ranges 5. Disables profiling and ends the session 6. Processes the collected data

4. Processing the Collected Metrics

After profiling, we need to process the counter data to calculate metrics:

void processCounterData()
{
    // Get the chip name for the current device
    CUpti_Device_GetChipName_Params getChipNameParams = {CUpti_Device_GetChipName_Params_STRUCT_SIZE};
    getChipNameParams.deviceIndex = 0;
    CUPTI_API_CALL(cuptiDeviceGetChipName(&getChipNameParams));
    const char *chipName = getChipNameParams.pChipName;

    // Set up the metric evaluation request
    NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params calculateScratchBufferSizeParams = {
        NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize_Params_STRUCT_SIZE};
    calculateScratchBufferSizeParams.pChipName = chipName;
    NVPW_API_CALL(NVPW_CUDA_MetricsEvaluator_CalculateScratchBufferSize(&calculateScratchBufferSizeParams));

    // Create the metrics evaluator
    NVPW_CUDA_MetricsEvaluator_Initialize_Params metricEvaluatorInitializeParams = {
        NVPW_CUDA_MetricsEvaluator_Initialize_Params_STRUCT_SIZE};
    metricEvaluatorInitializeParams.pChipName = chipName;
    metricEvaluatorInitializeParams.scratchBufferSize = calculateScratchBufferSizeParams.scratchBufferSize;
    metricEvaluatorInitializeParams.pScratchBuffer = malloc(calculateScratchBufferSizeParams.scratchBufferSize);
    NVPW_API_CALL(NVPW_CUDA_MetricsEvaluator_Initialize(&metricEvaluatorInitializeParams));

    // Get the number of ranges
    CUpti_Profiler_GetNumRanges_Params getNumRangesParams = {CUpti_Profiler_GetNumRanges_Params_STRUCT_SIZE};
    getNumRangesParams.pCounterDataImage = counterDataImage;
    CUPTI_API_CALL(cuptiProfilerGetNumRanges(&getNumRangesParams));

    // For each range, get the metrics
    for (size_t rangeIndex = 0; rangeIndex < getNumRangesParams.numRanges; ++rangeIndex) {
        // Get the range name
        CUpti_Profiler_GetRangeName_Params getRangeNameParams = {CUpti_Profiler_GetRangeName_Params_STRUCT_SIZE};
        getRangeNameParams.pCounterDataImage = counterDataImage;
        getRangeNameParams.rangeIndex = rangeIndex;
        CUPTI_API_CALL(cuptiProfilerGetRangeName(&getRangeNameParams));

        printf("Range %zu : %s\n", rangeIndex, getRangeNameParams.pRangeName);

        // Evaluate the metrics for this range
        NVPW_MetricsEvaluator_SetDeviceAttributes_Params setDeviceAttribParams = {
            NVPW_MetricsEvaluator_SetDeviceAttributes_Params_STRUCT_SIZE};
        setDeviceAttribParams.pMetricsEvaluator = metricEvaluatorInitializeParams.pMetricsEvaluator;
        setDeviceAttribParams.pCounterDataImage = counterDataImage;
        setDeviceAttribParams.counterDataImageSize = counterDataImageOptions.counterDataImageSize;
        setDeviceAttribParams.rangeIndex = rangeIndex;
        NVPW_API_CALL(NVPW_MetricsEvaluator_SetDeviceAttributes(&setDeviceAttribParams));

        // For each metric, get and display its value
        for (size_t metricIndex = 0; metricIndex < metricList.numMetrics; ++metricIndex) {
            const char *metricName = metricList.metricNames[metricIndex];

            NVPW_MetricsEvaluator_GetMetricValue_Params getMetricValueParams = {
                NVPW_MetricsEvaluator_GetMetricValue_Params_STRUCT_SIZE};
            getMetricValueParams.pMetricsEvaluator = metricEvaluatorInitializeParams.pMetricsEvaluator;
            getMetricValueParams.metricName = metricName;
            NVPW_API_CALL(NVPW_MetricsEvaluator_GetMetricValue(&getMetricValueParams));

            printf("  %s: %.1f\n", metricName, getMetricValueParams.metricValue);
        }
    }

    // Clean up
    free(metricEvaluatorInitializeParams.pScratchBuffer);
}

This function: 1. Sets up a metrics evaluator for the current device 2. Gets the number of ranges that were profiled 3. For each range: - Gets the range name - Sets up the device attributes for metric evaluation - Calculates and displays the value of each requested metric

5. Sample Vector Operation Kernels

Here are the simple vector operation kernels we're profiling:

__global__ void VecAdd(const int* A, const int* B, int* C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

__global__ void VecSub(const int* A, const int* B, int* C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] - B[i];
    }
}

These kernels: 1. Calculate the global thread index 2. Check if the index is within bounds 3. Perform a simple vector addition or subtraction

Running the Tutorial

Build the sample:
```
make
```
Run the user-range profiling example:
```
./userRangeSample
```

Understanding the Output

When you run the user-range profiling example, you'll see output similar to this:

Launching kernel: blocks 196, thread/block 256
Range 0 : Vector Add-Subtract
  smsp__warps_launched.avg: 392.0

Let's analyze this output:

Kernel Launch Information:
blocks 196, thread/block 256: Shows the kernel configuration
Range Information:
Range 0 : Vector Add-Subtract: Identifies the user-defined range by index and name
Metric Values:
smsp__warps_launched.avg: 392.0: Shows the average number of warps launched per streaming multiprocessor

The metric smsp__warps_launched.avg indicates the average number of warps launched per streaming multiprocessor (SM) during the execution of operations within our range. In this case, the value is 392.0, which means that on average, each SM launched 392 warps during the vector operations.

Advanced User-Range Profiling Techniques

1. Nested Ranges

You can create nested ranges to profile hierarchical code structures:

// Start outer range
cuptiProfilerBeginRange(&outerRangeParams);

// First operation
operation1();

// Start inner range
cuptiProfilerBeginRange(&innerRangeParams);
operation2();
operation3();
// End inner range
cuptiProfilerEndRange(&innerEndRangeParams);

operation4();

// End outer range
cuptiProfilerEndRange(&outerEndRangeParams);

2. Multiple Metrics

You can collect multiple metrics in a single profiling session:

const char *metricNames[] = {
    "smsp__warps_launched.avg",
    "sm__warps_active.avg.pct_of_peak_sustained_active",
    "dram__bytes_read.sum",
    "dram__bytes_write.sum"
};
metricList.numMetrics = 4;
metricList.metricNames = metricNames;

3. Comparing Implementations

You can use user ranges to compare different implementations of the same algorithm:

// Implementation 1
cuptiProfilerBeginRange(&impl1RangeParams);
algorithmImplementation1();
cuptiProfilerEndRange(&impl1EndRangeParams);

// Implementation 2
cuptiProfilerBeginRange(&impl2RangeParams);
algorithmImplementation2();
cuptiProfilerEndRange(&impl2EndRangeParams);

4. Profiling Specific Application Phases

You can profile specific phases of your application:

// Data preparation phase
cuptiProfilerBeginRange(&prepRangeParams);
prepareData();
cuptiProfilerEndRange(&prepEndRangeParams);

// Computation phase
cuptiProfilerBeginRange(&computeRangeParams);
performComputation();
cuptiProfilerEndRange(&computeEndRangeParams);

// Result processing phase
cuptiProfilerBeginRange(&processRangeParams);
processResults();
cuptiProfilerEndRange(&processEndRangeParams);

Common Metrics for User-Range Profiling

Here are some useful metrics to collect in user ranges:

Compute Utilization:
sm__warps_active.avg.pct_of_peak_sustained_active: Percentage of warps active
smsp__thread_inst_executed.avg.pct_of_peak_sustained_active: Thread instruction execution efficiency
Memory Performance:
dram__bytes_read.sum: Total bytes read from DRAM
dram__bytes_write.sum: Total bytes written to DRAM
l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum: Global memory load transactions
l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum: Global memory store transactions
Execution Statistics:
smsp__warps_launched.avg: Average warps launched per SM
smsp__cycles_active.avg: Average cycles with at least one active warp
sm__cycles_elapsed.avg: Average cycles elapsed
Instruction Mix:
smsp__sass_thread_inst_executed_op_fadd_pred_on.sum: Float add instructions
smsp__sass_thread_inst_executed_op_fmul_pred_on.sum: Float multiply instructions
smsp__sass_thread_inst_executed_op_ffma_pred_on.sum: Fused multiply-add instructions

Next Steps

Apply user-range profiling to your own CUDA applications
Experiment with different metrics to gain insights into performance
Use nested ranges to analyze hierarchical code structures
Compare different implementations of algorithms using user ranges
Combine user-range profiling with other CUPTI features for comprehensive analysis

Share on Share on