CUPTI OpenACC Tracing Tutorial
The GitHub repo and complete tutorial are available at https://github.com/eunomia-bpf/cupti-tutorial.
Introduction
OpenACC is a directive-based programming model that simplifies GPU programming by allowing developers to annotate their code with pragmas that the compiler translates into GPU operations. While this simplifies development, it can make performance analysis challenging because the relationship between your directives and the actual GPU operations isn't always clear. This tutorial demonstrates how to use CUPTI to trace OpenACC API calls and correlate them with GPU activities, giving you insights into how your OpenACC code executes on the GPU.
What You'll Learn
- How to intercept and trace OpenACC API calls
- Correlating OpenACC directives with the resulting GPU operations
- Measuring performance of different OpenACC operations
- Identifying optimization opportunities in OpenACC applications
Understanding OpenACC Execution
When you write OpenACC code, the following typically happens:
- The OpenACC compiler translates your directives into GPU code
- At runtime, the OpenACC runtime library:
  - Manages memory allocations and transfers
  - Launches GPU kernels
  - Synchronizes execution between host and device
By tracing these operations, we can understand how our high-level directives translate to low-level GPU operations.
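As a rough illustration (the exact sequence depends on the compiler and runtime), a small data region like the following typically expands into a handful of runtime operations:
// One data region containing one compute region...
#pragma acc data copyin(a[0:N]) copyout(c[0:N])
{
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        c[i] = 2.0f * a[i];
    }
}
// ...typically maps to runtime operations such as:
//   1. Device allocations for a and c
//   2. A host-to-device transfer for a (the copyin clause)
//   3. A kernel launch for the parallel loop
//   4. A device-to-host transfer for c (the copyout clause)
//   5. Device deallocations when the data region ends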
Code Walkthrough
1. Setting Up OpenACC API Interception
To trace OpenACC API calls, we need to intercept the functions from the OpenACC runtime library:
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

// Define function pointer types matching the OpenACC API
// (per the OpenACC spec, acc_init, acc_delete, and acc_copyout return void)
typedef void (*acc_init_t)(acc_device_t);
typedef int (*acc_get_num_devices_t)(acc_device_t);
typedef void (*acc_set_device_num_t)(int, acc_device_t);
typedef int (*acc_get_device_num_t)(acc_device_t);
typedef void* (*acc_create_t)(void*, size_t);
typedef void* (*acc_copyin_t)(void*, size_t);
typedef void (*acc_delete_t)(void*, size_t);
typedef void (*acc_copyout_t)(void*, size_t);
// ... more function types ...

// Original function pointers
static acc_init_t real_acc_init = NULL;
static acc_get_num_devices_t real_acc_get_num_devices = NULL;
static acc_set_device_num_t real_acc_set_device_num = NULL;
// ... more function pointers ...
// Initialize function pointers using dlsym
void initOpenACCAPI()
{
    void *handle = dlopen("libopenacc.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "Error: could not load OpenACC library\n");
        exit(1);
    }
    real_acc_init = (acc_init_t)dlsym(handle, "acc_init");
    real_acc_get_num_devices = (acc_get_num_devices_t)dlsym(handle, "acc_get_num_devices");
    real_acc_set_device_num = (acc_set_device_num_t)dlsym(handle, "acc_set_device_num");
    // ... load more functions ...
}
This code:
1. Defines function pointer types matching the OpenACC API
2. Creates variables to hold the original function pointers
3. Uses dlsym to get the addresses of the actual functions in the OpenACC library
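If the tracer is built as a shared library and injected with LD_PRELOAD (a common interposition setup; the exact deployment of this sample may differ), the real functions can instead be resolved with dlsym(RTLD_NEXT, ...), which avoids hard-coding the runtime library name. A minimal sketch for one function (the helper name is illustrative):
#define _GNU_SOURCE // must precede any #include to expose RTLD_NEXT on glibc
#include <dlfcn.h>

// Resolve the next definition of acc_init in the link-map search order,
// i.e., the real OpenACC runtime when this wrapper library is preloaded.
static void resolveRealAccInit(void)
{
    if (!real_acc_init) {
        real_acc_init = (acc_init_t)dlsym(RTLD_NEXT, "acc_init");
    }
}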
2. Implementing Wrapper Functions
For each OpenACC function we want to trace, we implement a wrapper function:
// Wrapper for acc_init (which returns void per the OpenACC spec)
void acc_init(acc_device_t device_type)
{
    // Record start time
    double startTime = getCurrentTime();

    // Call the real function
    real_acc_init(device_type);

    // Record end time (the duration could feed per-API statistics)
    double endTime = getCurrentTime();
    (void)endTime;

    // Log the function call
    printf("[%.3f ms] acc_init(", (startTime - programStartTime) * 1000.0);
    printDeviceType(device_type);
    printf(")\n");
}
// Wrapper for acc_create
void* acc_create(void* host_addr, size_t size)
{
    double startTime = getCurrentTime();

    // Call the real function
    void* result = real_acc_create(host_addr, size);
    double endTime = getCurrentTime();
    (void)endTime; // the duration could feed per-API statistics

    // Log the function call with details (the element count assumes float data)
    printf("[%.3f ms] acc_create(%p, %zu) [%zu elements]\n",
           (startTime - programStartTime) * 1000.0,
           host_addr, size, size / sizeof(float));

    // Track the CUDA memory allocation
    trackMemoryAllocation(host_addr, result, size);
    return result;
}
These wrappers:
1. Record the time before and after calling the real function
2. Log information about the function call
3. Track resources like memory allocations
4. Return the result from the real function (for APIs that return one)
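The wrappers rely on two timing helpers that are not shown: getCurrentTime() and the programStartTime baseline. A minimal sketch using clock_gettime (the names match the snippets above; initTiming is an assumed initialization hook, and the real sample may implement timing differently):
#include <time.h>

static double programStartTime;

// Current time in seconds, from a monotonic clock unaffected by clock steps
double getCurrentTime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

// Call once at startup so traces can report milliseconds since program start
void initTiming(void)
{
    programStartTime = getCurrentTime();
}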
3. Correlating with CUDA Activities
To correlate OpenACC API calls with CUDA operations, we use CUPTI's Activity API:
void initCUPTI()
{
    // Register the buffer callbacks used by the CUPTI Activity API
    CUPTI_CALL(cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted));

    // Enable the activity kinds we want to track
    CUPTI_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_RUNTIME));
    CUPTI_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY));
    CUPTI_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL));
    CUPTI_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMSET));
    CUPTI_CALL(cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMORY));

    // Get the start timestamp for normalizing times (CUPTI timestamps are in ns)
    CUPTI_CALL(cuptiGetTimestamp(&startTimestamp));
}
void processActivity(CUpti_Activity *record)
{
    switch (record->kind) {
    case CUPTI_ACTIVITY_KIND_MEMCPY: {
        // Named mc to avoid shadowing the standard memcpy()
        CUpti_ActivityMemcpy *mc = (CUpti_ActivityMemcpy *)record;
        // Find the corresponding OpenACC API call by timestamp
        OpenACCMemoryOp *op = findOpenACCMemoryOp(mc->start);
        if (op) {
            // This memory copy corresponds to an OpenACC data directive.
            // CUPTI timestamps are in nanoseconds, so divide by 1e6 for ms.
            // The device address comes from our own allocation tracking
            // (the devicePtr field name is illustrative); the memcpy
            // record itself carries only a device ID.
            printf("[%.3f ms] > CUDA %s Transfer: %llu bytes %s %p\n",
                   (mc->start - startTimestamp) / 1000000.0,
                   getMemcpyKindString(mc->copyKind),
                   (unsigned long long)mc->bytes,
                   (mc->copyKind == CUPTI_ACTIVITY_MEMCPY_KIND_HTOD) ? "to" : "from",
                   op->devicePtr);
            // Update statistics (milliseconds)
            if (mc->copyKind == CUPTI_ACTIVITY_MEMCPY_KIND_HTOD) {
                hostToDeviceTime += (mc->end - mc->start) / 1000000.0;
            } else {
                deviceToHostTime += (mc->end - mc->start) / 1000000.0;
            }
        }
        break;
    }
    case CUPTI_ACTIVITY_KIND_KERNEL: {
        CUpti_ActivityKernel5 *kernel = (CUpti_ActivityKernel5 *)record;
        // Find the corresponding OpenACC compute region by timestamp
        OpenACCComputeRegion *region = findOpenACCComputeRegion(kernel->start);
        if (region) {
            // This kernel corresponds to an OpenACC compute directive
            printf("[%.3f ms] > CUDA Kernel Launch: %s [grid:%d block:%d]\n",
                   (kernel->start - startTimestamp) / 1000000.0,
                   kernel->name,
                   kernel->gridX,
                   kernel->blockX);
            // Update statistics (milliseconds)
            kernelExecutionTime += (kernel->end - kernel->start) / 1000000.0;
        }
        break;
    }
    // ... handle other activity types ...
    default:
        break;
    }
}
This code:
1. Sets up CUPTI to collect activities like memory transfers and kernel launches
2. Processes each activity record and correlates it with OpenACC operations
3. Updates performance statistics for different types of operations
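The bufferRequested and bufferCompleted functions registered in initCUPTI are the standard CUPTI activity-buffer callbacks. A minimal sketch of what they typically look like (the buffer size and the omission of alignment handling are simplifications):
#include <cupti.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_SIZE (32 * 1024)

// CUPTI calls this whenever it needs a fresh buffer for activity records
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *size = BUF_SIZE;
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    *maxNumRecords = 0; // 0 lets CUPTI pack as many records as fit
}

// CUPTI calls this when a buffer is full or flushed; drain every record
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize)
{
    CUpti_Activity *record = NULL;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        processActivity(record);
    }
    free(buffer);
}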
4. Sample OpenACC Application
The sample includes a simple OpenACC application to demonstrate the tracing:
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
#define N 4096
int main()
{
    float *a, *b, *c;
    int i;

    // Allocate host memory
    a = (float*)malloc(N * sizeof(float));
    b = (float*)malloc(N * sizeof(float));
    c = (float*)malloc(N * sizeof(float));

    // Initialize arrays
    for (i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = (float)(N - i);
    }

    // OpenACC directives for vector addition
    #pragma acc data create(a[0:N], b[0:N], c[0:N])
    {
        #pragma acc update device(a[0:N], b[0:N])

        #pragma acc parallel loop
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }

        #pragma acc update host(c[0:N])
    }
    // Verify results (a[i] + b[i] = i + (N - i) = N for every element)
    int errors = 0;
    for (i = 0; i < N; i++) {
        if (c[i] != (float)N) {
            printf("Error: c[%d] = %f, expected %f\n", i, c[i], (float)N);
            errors = 1;
            break;
        }
    }
    if (!errors) {
        printf("Vector addition completed successfully\n");
    }
    // Free memory
    free(a);
    free(b);
    free(c);
    return 0;
}
This simple application:
1. Allocates and initializes arrays on the host
2. Uses OpenACC directives to:
   - Create device memory
   - Copy data to the device
   - Execute a parallel loop for vector addition
   - Copy results back to the host
3. Verifies the results
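For comparison, the same computation can be expressed with copyin/copyout clauses instead of create plus explicit update directives. With the tracer attached, you would then expect acc_copyin and acc_copyout events in place of the acc_create and update calls (a sketch of the variant, not part of the shipped sample):
    #pragma acc data copyin(a[0:N], b[0:N]) copyout(c[0:N])
    {
        #pragma acc parallel loop
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }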
Running the Tutorial
1. Build the sample (the exact build commands are in the repository linked above)
2. Run the OpenACC tracer against the sample application
Understanding the Output
The output shows a timeline of OpenACC API calls and their corresponding GPU activities:
OpenACC API Trace:
[00.000 ms] acc_init(acc_device_nvidia)
[00.245 ms] acc_get_num_devices(acc_device_nvidia) returned 1
[00.302 ms] acc_set_device_num(0, acc_device_nvidia)
[00.356 ms] acc_get_device_num(acc_device_nvidia) returned 0
[01.234 ms] acc_create(0x7fff5a1c3000, 16384) [4096 elements]
[01.456 ms] > CUDA Memory Allocation: 16384 bytes at 0xd00000
[01.789 ms] acc_create(0x7fff5a1c7000, 16384) [4096 elements]
[01.856 ms] > CUDA Memory Allocation: 16384 bytes at 0xd04000
[02.123 ms] acc_copyin(0x7fff5a1c3000, 16384) [4096 elements]
[02.234 ms] > CUDA H2D Transfer: 16384 bytes to 0xd00000
[02.345 ms] acc_copyin(0x7fff5a1c7000, 16384) [4096 elements]
[02.456 ms] > CUDA H2D Transfer: 16384 bytes to 0xd04000
[03.012 ms] acc_parallel: entering compute region
[03.123 ms] > CUDA Kernel Launch: Vector Addition [grid:16 block:256]
[03.456 ms] acc_parallel: exiting compute region
[04.123 ms] acc_copyout(0x7fff5a1c3000, 16384) [4096 elements]
[04.234 ms] > CUDA D2H Transfer: 16384 bytes from 0xd00000
[04.345 ms] acc_delete(0x7fff5a1c3000, 16384)
[04.456 ms] > CUDA Memory Free: 0xd00000
[04.567 ms] acc_delete(0x7fff5a1c7000, 16384)
[04.678 ms] > CUDA Memory Free: 0xd04000
This output shows:
- Initialization Phase:
  - OpenACC runtime initializes and selects the NVIDIA GPU device
- Memory Allocation Phase:
  - acc_create calls allocate memory on the device
  - Each call is matched with a CUDA memory allocation
- Data Transfer Phase:
  - acc_copyin calls transfer data from host to device
  - Each call is matched with a CUDA H2D (Host-to-Device) transfer
- Computation Phase:
  - acc_parallel marks the beginning and end of a compute region
  - The region is matched with a CUDA kernel launch
- Result Retrieval Phase:
  - acc_copyout transfers results from device to host
  - Matched with a CUDA D2H (Device-to-Host) transfer
- Cleanup Phase:
  - acc_delete frees device memory
  - Matched with CUDA memory free operations
The trace also includes a performance summary:
OpenACC Performance Summary:
Data Allocation Time: 0.623 ms
Host-to-Device Transfer Time: 0.344 ms
Kernel Execution Time: 0.444 ms
Device-to-Host Transfer Time: 0.111 ms
Data Deallocation Time: 0.233 ms
Total OpenACC API Time: 1.755 ms
This summary helps identify where time is spent in your OpenACC application.
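For example, the two acc_copyin transfers above move 32768 bytes in a combined 0.344 ms, an effective rate of roughly 0.1 GB/s. That is far below typical PCIe bandwidth, which tells you these small transfers are dominated by per-transfer latency rather than bandwidth, a common finding for toy-sized OpenACC workloads.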
Performance Analysis
The trace data reveals several important insights:
- Data Movement Overhead:
  - Compare the time spent in data transfers versus computation
  - Look for unnecessary data transfers
- Memory Management:
  - Check if memory is being allocated and freed efficiently
  - Look for opportunities to reuse device memory
- Kernel Efficiency:
  - Examine the grid and block dimensions of launched kernels (see the worked example below)
  - Compare the kernel execution time with the overall compute region time
- API Overhead:
  - Measure the overhead of OpenACC runtime API calls
  - Look for cases where manual CUDA code might be more efficient
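As a concrete check on kernel efficiency, the trace above shows a launch with grid 16 and block 256, i.e. 16 × 256 = 4096 threads, exactly one per element of the N = 4096 arrays. For larger or multi-dimensional problems, verify that the compiler's chosen launch configuration still provides enough parallelism to keep the GPU occupied.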
Advanced Usage
Tracing Specific OpenACC Constructs
To focus on specific OpenACC constructs, you can modify the wrapper functions to trace only certain API calls:
// Trace only data movements above a size threshold
// (THRESHOLD is not defined elsewhere in the sample; this value is an example)
#define THRESHOLD (64 * 1024)

void* acc_copyin(void* host_addr, size_t size)
{
    double startTime = getCurrentTime();
    void* result = real_acc_copyin(host_addr, size);

    // Only log if the transfer size is above the threshold
    if (size > THRESHOLD) {
        printf("[%.3f ms] acc_copyin(%p, %zu)\n",
               (startTime - programStartTime) * 1000.0,
               host_addr, size);
    }
    return result;
}
Correlating with Source Code
To correlate GPU activities with your source code, you can add source file and line information:
// Add source information to OpenACC API traces.
// Uses standard __VA_ARGS__ and a do/while(0) wrapper so the macro
// behaves like a single statement; note it discards the call's return value.
#define ACC_TRACE(func, ...)                                     \
    do {                                                         \
        printf("[%.3f ms] %s:%d: " #func "(",                    \
               (getCurrentTime() - programStartTime) * 1000.0,   \
               __FILE__, __LINE__);                              \
        func(__VA_ARGS__);                                       \
        printf(")\n");                                           \
    } while (0)
Next Steps
- Apply OpenACC tracing to your own applications
- Identify performance bottlenecks in data movement and computation
- Use the insights to optimize your OpenACC directives
- Compare different OpenACC implementations for the same algorithm