CUPTI Performance Metrics Tutorial

The GitHub repo and the complete tutorial are available at https://github.com/eunomia-bpf/cupti-tutorial.

Introduction

Understanding GPU performance requires more than raw event counts: it requires meaningful metrics that reveal how efficiently your code is running. This tutorial demonstrates how to use CUPTI callbacks to collect and calculate performance metrics during CUDA kernel execution, giving you insight into your application's GPU utilization.

What You'll Learn

  • How to collect complex performance metrics during kernel execution
  • Techniques for handling metrics that require multiple passes
  • Converting raw event counts into meaningful performance indicators
  • Interpreting metrics to identify optimization opportunities

Understanding Performance Metrics

Performance metrics in CUDA provide high-level insights by combining multiple hardware events. For example:

  • IPC (Instructions Per Cycle): Measures computational efficiency
  • Memory Throughput: Measures memory bandwidth utilization
  • Warp Execution Efficiency: Measures thread utilization within warps

These metrics are more intuitive than raw event counts and directly relate to optimization strategies.
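For instance, the ipc value computed later in this tutorial is derived from exactly two hardware events: ipc = inst_executed / active_cycles.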

Code Walkthrough

1. Setting Up Metric Collection

First, we need to identify which events are required for our target metric:

int main(int argc, char *argv[])
{
    // Default to "ipc" (instructions per cycle) or use command line argument
    const char *metricName = "ipc";
    if (argc > 1) {
        metricName = argv[1];
    }

    // Initialize CUDA and force creation of a context so that
    // cuCtxGetCurrent in createEventGroups finds a live context
    RUNTIME_API_CALL(cudaSetDevice(0));
    RUNTIME_API_CALL(cudaFree(0));

    // Get the device handle
    CUdevice device = 0;
    DRIVER_API_CALL(cuDeviceGet(&device, 0));

    // Get the metric ID for the requested metric
    CUpti_MetricID metricId;
    CUPTI_CALL(cuptiMetricGetIdFromName(device, metricName, &metricId));

    // Determine the events needed for this metric
    uint32_t numEvents = 0;
    CUPTI_CALL(cuptiMetricGetNumEvents(metricId, &numEvents));

    // Allocate space for the event IDs; note that cuptiMetricEnumEvents
    // takes the array size in bytes, not the event count
    size_t eventIdArraySize = numEvents * sizeof(CUpti_EventID);
    CUpti_EventID *eventIds = (CUpti_EventID *)malloc(eventIdArraySize);
    CUPTI_CALL(cuptiMetricEnumEvents(metricId, &eventIdArraySize, eventIds));

    // Set up the data structure shared with the callback
    MetricData_t metricData;
    metricData.device = device;
    metricData.metricName = metricName;
    metricData.eventIdArray = eventIds;
    metricData.numEvents = numEvents;
    metricData.eventValueArray = (uint64_t *)calloc(numEvents, sizeof(uint64_t));
    metricData.eventIdx = 0;
    metricData.numEventGroups = 0;

    // Create event groups for each pass
    createEventGroups(&metricData);

    printf("Collecting events for metric %s\n", metricName);
}

This code:

  1. Identifies which metric to collect (default is "ipc")
  2. Gets the list of events required for that metric
  3. Sets up data structures to hold event values
  4. Creates event groups for collecting the events
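The MetricData_t structure is used throughout the sample but isn't shown in the excerpt. A minimal sketch consistent with how the fields are used in this tutorial (the MAX_EVENT_GROUPS bound is an assumption) looks like:

#define MAX_EVENT_GROUPS 8  // assumed upper bound on event groups

typedef struct {
    CUdevice device;                 // device the events are collected on
    const char *metricName;          // metric requested on the command line
    CUpti_EventID *eventIdArray;     // events required by the metric
    uint32_t numEvents;              // number of entries in eventIdArray
    uint64_t *eventValueArray;       // collected (normalized) event values
    uint32_t eventIdx;               // next event slot to fill
    CUpti_EventGroup eventGroups[MAX_EVENT_GROUPS];  // groups for collection
    uint32_t numEventGroups;         // number of groups in use
} MetricData_t;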

2. Creating Event Groups

Some events can't be collected simultaneously due to hardware limitations, so we need to organize them into compatible groups:

void createEventGroups(MetricData_t *metricData)
{
    CUcontext context = NULL;
    DRIVER_API_CALL(cuCtxGetCurrent(&context));

    // Get the number of event domains on the device
    uint32_t numDomains = 0;
    CUPTI_CALL(cuptiDeviceGetNumEventDomains(metricData->device, &numDomains));

    // Get the event domains; the enum call takes a size in bytes
    size_t domainArraySize = numDomains * sizeof(CUpti_EventDomainID);
    CUpti_EventDomainID *domainIds = (CUpti_EventDomainID *)malloc(domainArraySize);
    CUPTI_CALL(cuptiDeviceEnumEventDomains(metricData->device, &domainArraySize, domainIds));

    // For each event, find its domain and assign it to a compatible group
    for (uint32_t i = 0; i < metricData->numEvents; i++) {
        CUpti_EventID eventId = metricData->eventIdArray[i];

        // Find the domain that owns this event by scanning each domain's
        // event list; events from different domains can't share a group
        int domainIndex = -1;
        for (uint32_t j = 0; j < numDomains && domainIndex < 0; j++) {
            uint32_t domainNumEvents = 0;
            CUPTI_CALL(cuptiEventDomainGetNumEvents(domainIds[j], &domainNumEvents));

            size_t eventArraySize = domainNumEvents * sizeof(CUpti_EventID);
            CUpti_EventID *domainEvents = (CUpti_EventID *)malloc(eventArraySize);
            CUPTI_CALL(cuptiEventDomainEnumEvents(domainIds[j], &eventArraySize, domainEvents));

            for (uint32_t k = 0; k < domainNumEvents; k++) {
                if (domainEvents[k] == eventId) {
                    domainIndex = j;
                    break;
                }
            }
            free(domainEvents);
        }
        (void)domainIndex;  // informational; compatibility is checked below

        // If this is the first event or the event can't go in the current
        // group, create a new event group
        if (metricData->numEventGroups == 0 ||
            !canAddEventToGroup(metricData, eventId)) {

            // Create a new event group
            CUpti_EventGroup eventGroup;
            CUPTI_CALL(cuptiEventGroupCreate(context, &eventGroup, 0));

            // Add the event to the group
            CUPTI_CALL(cuptiEventGroupAddEvent(eventGroup, eventId));

            // Store the event group
            metricData->eventGroups[metricData->numEventGroups++] = eventGroup;
        }
        else {
            // Add the event to the existing group
            CUPTI_CALL(cuptiEventGroupAddEvent(
                metricData->eventGroups[metricData->numEventGroups-1],
                eventId));
        }
    }

    free(domainIds);
}

This function:

  1. Gets all event domains on the device
  2. For each event, determines which domain it belongs to
  3. Creates event groups that respect hardware limitations (compatibility is checked by the canAddEventToGroup helper, sketched below)
  4. May create multiple groups if events can't be collected simultaneously
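The canAddEventToGroup helper isn't shown in the excerpt. One straightforward way to implement it, sketched here as an assumption rather than the sample's actual code, is to probe CUPTI directly: attempt the add and see whether it succeeds:

// Hypothetical helper (not from the sample source): probe whether an event
// is compatible with the most recently created group
static int canAddEventToGroup(MetricData_t *metricData, CUpti_EventID eventId)
{
    CUpti_EventGroup group = metricData->eventGroups[metricData->numEventGroups - 1];

    // CUPTI rejects the add (e.g. with CUPTI_ERROR_NOT_COMPATIBLE) when the
    // event can't be counted alongside the group's existing events
    if (cuptiEventGroupAddEvent(group, eventId) == CUPTI_SUCCESS) {
        // The probe really added the event; remove it again so the caller's
        // own cuptiEventGroupAddEvent is the one that takes effect
        CUPTI_CALL(cuptiEventGroupRemoveEvent(group, eventId));
        return 1;
    }
    return 0;
}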

3. Setting Up the Callback Function

Now we register a callback that will be called when kernels are launched:

// Subscribe to CUPTI callbacks
CUpti_SubscriberHandle subscriber;
CUPTI_CALL(cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)getMetricValueCallback, &metricData));

// Enable callbacks for CUDA runtime
CUPTI_CALL(cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API));
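The remainder of main is elided here. After subscribing, the host launches the kernel (which triggers the callback) and then unsubscribes; a minimal continuation, assuming the runVectorAdd helper sketched in section 6, would be:

// Launch the workload; the launch triggers getMetricValueCallback
runVectorAdd();

// Stop receiving callbacks once collection is done
CUPTI_CALL(cuptiUnsubscribe(subscriber));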

4. The Callback Function

The heart of the sample is the callback function that collects event data:

void CUPTIAPI getMetricValueCallback(void *userdata, CUpti_CallbackDomain domain,
                                    CUpti_CallbackId cbid, const CUpti_CallbackData *cbInfo)
{
    MetricData_t *metricData = (MetricData_t*)userdata;

    // Only process runtime API callbacks for kernel launches
    if (domain != CUPTI_CB_DOMAIN_RUNTIME_API ||
        (cbid != CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020 &&
         cbid != CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernel_v7000)) {
        return;
    }

    // Check if we're entering or exiting the function
    if (cbInfo->callbackSite == CUPTI_API_ENTER) {
        // We're about to launch a kernel

        // Collect events only while the kernel executes; the collection
        // mode is a per-context setting, so set it before enabling groups
        CUPTI_CALL(cuptiSetEventCollectionMode(cbInfo->context,
                                               CUPTI_EVENT_COLLECTION_MODE_KERNEL));

        // Enable the event groups for this pass
        for (uint32_t i = 0; i < metricData->numEventGroups; i++) {
            CUPTI_CALL(cuptiEventGroupEnable(metricData->eventGroups[i]));
        }
    }
    else if (cbInfo->callbackSite == CUPTI_API_EXIT) {
        // The kernel has completed

        // Make sure all work is done
        RUNTIME_API_CALL(cudaDeviceSynchronize());

        // For each event group, read and normalize the event values
        for (uint32_t i = 0; i < metricData->numEventGroups; i++) {
            CUpti_EventGroup group = metricData->eventGroups[i];
            size_t attrSize;

            // Get the number of events in this group
            uint32_t numEvents = 0;
            attrSize = sizeof(numEvents);
            CUPTI_CALL(cuptiEventGroupGetAttribute(group,
                                                 CUPTI_EVENT_GROUP_ATTR_NUM_EVENTS,
                                                 &attrSize, &numEvents));

            // Get the event IDs in this group
            size_t eventIdsSize = numEvents * sizeof(CUpti_EventID);
            CUpti_EventID *eventIds = (CUpti_EventID *)malloc(eventIdsSize);
            CUPTI_CALL(cuptiEventGroupGetAttribute(group,
                                                 CUPTI_EVENT_GROUP_ATTR_EVENTS,
                                                 &eventIdsSize, eventIds));

            // Get the number of instances profiled for this group
            uint32_t numInstances = 0;
            attrSize = sizeof(numInstances);
            CUPTI_CALL(cuptiEventGroupGetAttribute(group,
                                                 CUPTI_EVENT_GROUP_ATTR_INSTANCE_COUNT,
                                                 &attrSize, &numInstances));

            // Get the total number of instances of the group's domain on
            // the device; this is the scale factor for normalization
            CUpti_EventDomainID domainId;
            attrSize = sizeof(domainId);
            CUPTI_CALL(cuptiEventGroupGetAttribute(group,
                                                 CUPTI_EVENT_GROUP_ATTR_EVENT_DOMAIN_ID,
                                                 &attrSize, &domainId));
            uint32_t totalInstances = 0;
            attrSize = sizeof(totalInstances);
            CUPTI_CALL(cuptiDeviceGetEventDomainAttribute(metricData->device, domainId,
                                                 CUPTI_EVENT_DOMAIN_ATTR_TOTAL_INSTANCE_COUNT,
                                                 &attrSize, &totalInstances));

            // Read the event values for each event in this group
            for (uint32_t j = 0; j < numEvents; j++) {
                // Find the index of this event in our global list
                int index = -1;
                for (uint32_t k = 0; k < metricData->numEvents; k++) {
                    if (eventIds[j] == metricData->eventIdArray[k]) {
                        index = k;
                        break;
                    }
                }

                if (index != -1) {
                    // Read one value per profiled instance and sum them
                    size_t valueSize = numInstances * sizeof(uint64_t);
                    uint64_t *values = (uint64_t *)malloc(valueSize);
                    CUPTI_CALL(cuptiEventGroupReadEvent(group,
                                                      CUPTI_EVENT_READ_FLAG_NONE,
                                                      eventIds[j],
                                                      &valueSize, values));
                    uint64_t sum = 0;
                    for (uint32_t k = 0; k < numInstances; k++) {
                        sum += values[k];
                    }
                    free(values);

                    // Normalize: scale the summed value from the profiled
                    // instances up to all instances of the domain
                    uint64_t normalized = (sum * totalInstances) / numInstances;
                    metricData->eventValueArray[index] = normalized;

                    // Get the event name for display
                    char eventName[128];
                    size_t nameSize = sizeof(eventName);
                    CUPTI_CALL(cuptiEventGetAttribute(eventIds[j],
                                                    CUPTI_EVENT_ATTR_NAME,
                                                    &nameSize, eventName));

                    // Print the raw sum and the normalized value
                    printf("\t%s = %llu (%llu)\n", eventName,
                           (unsigned long long)sum,
                           (unsigned long long)normalized);
                    printf("\t%s (normalized) (%llu * %u) / %u = %llu\n",
                           eventName,
                           (unsigned long long)sum,
                           totalInstances,
                           numInstances,
                           (unsigned long long)normalized);
                }
            }

            free(eventIds);
        }

        // Disable the event groups
        for (uint32_t i = 0; i < metricData->numEventGroups; i++) {
            CUPTI_CALL(cuptiEventGroupDisable(metricData->eventGroups[i]));
        }

        // Calculate the metric value from the collected events
        calculateMetricValue(metricData);
    }
}

This function:

  1. Sets the collection mode and enables the event groups when a kernel is about to launch
  2. Reads event values after the kernel completes
  3. Normalizes the values by the number of counter instances
  4. Displays the raw and normalized value for each event

5. Calculating the Metric Value

After collecting all events, we calculate the actual metric value:

void calculateMetricValue(MetricData_t *metricData)
{
    // Look up the ID of the metric that was actually requested (not a
    // hardcoded name, so non-default metrics are calculated correctly)
    CUpti_MetricID metricId;
    CUPTI_CALL(cuptiMetricGetIdFromName(metricData->device,
                                        metricData->metricName, &metricId));

    // Calculate the metric value from the collected events; the argument
    // before the result is the kernel duration in nanoseconds, which ipc
    // does not need (per-second metrics do)
    CUpti_MetricValue metricValue;
    CUPTI_CALL(cuptiMetricGetValue(metricData->device,
                                  metricId,
                                  metricData->numEvents * sizeof(CUpti_EventID),
                                  metricData->eventIdArray,
                                  metricData->numEvents * sizeof(uint64_t),
                                  metricData->eventValueArray,
                                  0,
                                  &metricValue));

    // Get the metric name
    char metricName[128];
    size_t nameSize = sizeof(metricName);
    CUPTI_CALL(cuptiMetricGetAttribute(metricId,
                                      CUPTI_METRIC_ATTR_NAME,
                                      &nameSize, metricName));

    // Print the metric value; "ipc" is a double-valued metric
    printf("\nMetric %s\n", metricName);
    printf("\tValue: %f\n", metricValue.metricValueDouble);
}

This function:

  1. Takes the normalized event values we collected
  2. Uses CUPTI to calculate the metric value
  3. Displays the final metric value
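Printing metricValueDouble is correct for double-valued metrics such as ipc, but not in general. A sketch that queries the value kind first, using the metricId and metricValue from the function above:

// Query how the metric's value should be interpreted before printing
CUpti_MetricValueKind valueKind;
size_t valueKindSize = sizeof(valueKind);
CUPTI_CALL(cuptiMetricGetAttribute(metricId, CUPTI_METRIC_ATTR_VALUE_KIND,
                                   &valueKindSize, &valueKind));

switch (valueKind) {
case CUPTI_METRIC_VALUE_KIND_DOUBLE:
    printf("\tValue: %f\n", metricValue.metricValueDouble);
    break;
case CUPTI_METRIC_VALUE_KIND_UINT64:
    printf("\tValue: %llu\n",
           (unsigned long long)metricValue.metricValueUint64);
    break;
case CUPTI_METRIC_VALUE_KIND_PERCENT:
    printf("\tValue: %f%%\n", metricValue.metricValuePercent);
    break;
case CUPTI_METRIC_VALUE_KIND_THROUGHPUT:
    printf("\tValue: %llu bytes/sec\n",
           (unsigned long long)metricValue.metricValueThroughput);
    break;
default:
    printf("\tValue: (unhandled value kind)\n");
    break;
}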

6. The Test Kernel

The sample uses a simple vector addition kernel to demonstrate metric collection:

__global__ void VecAdd(const int* A, const int* B, int* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        // Do some extra work to generate more instructions
        int temp = 0;
        for (int j = 0; j < 100; j++) {
            temp += A[i] + B[i];
        }
        C[i] = temp;
    }
}
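The host-side allocation and launch code isn't reproduced in the walkthrough. A minimal launcher, sketched here with an illustrative name (runVectorAdd, not the sample's exact code), could look like:

// Illustrative host-side launcher for VecAdd
static void runVectorAdd(void)
{
    const int N = 50000;
    size_t size = N * sizeof(int);

    // Allocate and initialize host buffers
    int *h_A = (int *)malloc(size);
    int *h_B = (int *)malloc(size);
    int *h_C = (int *)malloc(size);
    for (int i = 0; i < N; i++) { h_A[i] = i; h_B[i] = 2 * i; }

    // Allocate device buffers and copy the inputs over
    int *d_A, *d_B, *d_C;
    RUNTIME_API_CALL(cudaMalloc((void **)&d_A, size));
    RUNTIME_API_CALL(cudaMalloc((void **)&d_B, size));
    RUNTIME_API_CALL(cudaMalloc((void **)&d_C, size));
    RUNTIME_API_CALL(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
    RUNTIME_API_CALL(cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice));

    // This launch is what triggers getMetricValueCallback
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result back and clean up
    RUNTIME_API_CALL(cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost));
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
}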

Running the Tutorial

  1. Build the sample:

    make
    

  2. Run with the default metric (instructions per cycle):

    ./callback_metric
    

  3. Try a different metric:

    ./callback_metric achieved_occupancy
    

Understanding the Output

When running with the default "ipc" metric, you'll see output like:

Collecting events for metric ipc
Pass 1/1 collecting 2 events using 1 event group
        inst_executed = 51200 (51200)
        inst_executed (normalized) (51200 * 1) / 1 = 51200
        active_cycles = 25600 (25600)
        active_cycles (normalized) (25600 * 1) / 1 = 25600

Metric ipc
        Value: 2.000000

Let's decode this output:

  1. Events Collected: The "ipc" metric requires two events - instructions executed and active cycles
  2. Raw Values: The actual hardware counter values read from the GPU
  3. Normalized Values: Adjusted values accounting for all instances of the counters
  4. Metric Value: The calculated IPC value (2.0), meaning on average 2 instructions are executed per clock cycle
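The calculation is easy to verify by hand: ipc = inst_executed / active_cycles = 51200 / 25600 = 2.0.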

Interpreting Common Metrics

Instructions Per Cycle (IPC)

  • What it measures: Computational efficiency
  • Ideal range: Varies by architecture, but higher is better
  • Optimization hints:
      • Low IPC (<1): May indicate memory bottlenecks or thread divergence
      • High IPC (>2): Good instruction-level parallelism

Achieved Occupancy

  • What it measures: Ratio of active warps to maximum possible warps
  • Ideal range: Closer to 1.0 is better
  • Optimization hints:
      • Low occupancy: Consider reducing register usage or shared memory
      • High occupancy doesn't guarantee better performance but provides more latency hiding

Memory Throughput

  • What it measures: Memory bandwidth utilization
  • Ideal range: Closer to peak bandwidth is better
  • Optimization hints:
      • Low throughput: Consider memory access patterns, coalescing, or caching
      • High throughput with low IPC: Application is memory-bound

Advanced Usage

Collecting Multiple Metrics

To collect multiple metrics, either:

  1. Run the application multiple times, each with a different metric
  2. Modify the code to collect all required events and calculate multiple metrics (a sketch follows below)
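A sketch of the second approach, under the assumption that eventIdArray/eventValueArray already cover every event each metric needs (cuptiMetricGetValue reports an error otherwise):

// Sketch (assumption, not the sample's code): evaluate several metrics
// against one shared set of collected event values
const char *metricNames[] = { "ipc", "achieved_occupancy" };

for (size_t m = 0; m < sizeof(metricNames) / sizeof(metricNames[0]); m++) {
    CUpti_MetricID id;
    if (cuptiMetricGetIdFromName(metricData.device, metricNames[m], &id)
            != CUPTI_SUCCESS) {
        continue;  // metric not supported on this device
    }

    CUpti_MetricValue value;
    CUPTI_CALL(cuptiMetricGetValue(metricData.device, id,
                                   metricData.numEvents * sizeof(CUpti_EventID),
                                   metricData.eventIdArray,
                                   metricData.numEvents * sizeof(uint64_t),
                                   metricData.eventValueArray,
                                   0, &value));

    // Assumes double-valued metrics; query CUPTI_METRIC_ATTR_VALUE_KIND
    // (as shown in section 5) for anything else
    printf("%s = %f\n", metricNames[m], value.metricValueDouble);
}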

Multi-Pass Collection

Some metrics require events that can't be collected simultaneously. The sample handles this by:

  1. Determining which events can be collected together
  2. Creating multiple event groups if needed
  3. Running the kernel multiple times if necessary (see the sketch after this list)
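Rather than hand-building groups, CUPTI can compute the pass partitioning for you. A minimal sketch using cuptiMetricCreateEventGroupSets, assuming the context and metricId from the earlier setup code and the runVectorAdd helper from section 6:

// Let CUPTI split the metric's events into passes; each returned
// CUpti_EventGroupSet must be collected on a separate kernel run
CUpti_EventGroupSets *passes = NULL;
CUPTI_CALL(cuptiMetricCreateEventGroupSets(context,
                                           sizeof(metricId), &metricId,
                                           &passes));

printf("%u pass(es) required\n", passes->numSets);
for (uint32_t p = 0; p < passes->numSets; p++) {
    CUpti_EventGroupSet *set = &passes->sets[p];

    // Enable every group in this pass's set
    for (uint32_t g = 0; g < set->numEventGroups; g++) {
        CUPTI_CALL(cuptiEventGroupEnable(set->eventGroups[g]));
    }

    // Replay the kernel, then read and accumulate the event values as in
    // the callback shown earlier, and disable the groups again
    runVectorAdd();

    for (uint32_t g = 0; g < set->numEventGroups; g++) {
        CUPTI_CALL(cuptiEventGroupDisable(set->eventGroups[g]));
    }
}

CUPTI_CALL(cuptiEventGroupSetsDestroy(passes));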

Next Steps

  • Try collecting different metrics to understand various aspects of your kernels
  • Apply metric collection to your own CUDA applications
  • Use metrics to identify optimization opportunities
  • Combine multiple metrics to get a complete performance picture
