CUPTI SASS Source Mapping Tutorial
The GitHub repo and complete tutorial is available at https://github.com/eunomia-bpf/cupti-tutorial.
Introduction
When optimizing CUDA kernels, understanding how your source code translates to actual GPU assembly instructions can provide powerful insights. SASS (Streaming Assembly) is the native assembly language executed by NVIDIA GPUs, and mapping between your CUDA C/C++ code and the resulting SASS instructions can reveal optimization opportunities that aren't apparent at the source level. This tutorial demonstrates how to use CUPTI to extract SASS code for your kernels and map it back to the original source code, helping you understand the relationship between your high-level code and the instructions that actually execute on the GPU.
What You'll Learn
- How to extract SASS assembly code for CUDA kernels
- Mapping SASS instructions back to source code lines
- Interpreting SASS to understand instruction-level behavior
- Identifying optimization opportunities by analyzing SASS
- Using SASS information to make informed optimization decisions
Understanding CUDA Compilation
When you compile CUDA code, it goes through several stages:
- CUDA C/C++: Your high-level source code
- PTX: An intermediate representation (Parallel Thread Execution)
- SASS: The final machine code executed by the GPU
Each stage represents a different level of abstraction, and understanding the final SASS code can provide insights that aren't visible at the source level.
Code Walkthrough
1. Loading a CUDA Module
First, we need to load a CUDA module to access its code:
CUmodule module;
CUfunction function;
// Load the module
DRIVER_API_CALL(cuModuleLoad(&module, "kernel.cubin"));
// Get the kernel function
DRIVER_API_CALL(cuModuleGetFunction(&function, module, "vectorAdd"));
This code: 1. Loads a compiled CUDA binary (cubin) file 2. Gets a handle to a specific kernel function within that module
2. Extracting SASS Code
Next, we extract the SASS code for the kernel:
// Get the function's code
CUdeviceptr code;
size_t codeSize;
DRIVER_API_CALL(cuFuncGetAttribute(&code, CU_FUNC_ATTRIBUTE_CODE, function));
DRIVER_API_CALL(cuFuncGetAttribute(&codeSize, CU_FUNC_ATTRIBUTE_BINARY_SIZE, function));
// Allocate memory for the code
unsigned char *sassCode = (unsigned char *)malloc(codeSize);
if (!sassCode) {
fprintf(stderr, "Failed to allocate memory for SASS code\n");
return -1;
}
// Copy the code from device memory
DRIVER_API_CALL(cuMemcpyDtoH(sassCode, code, codeSize));
This code: 1. Gets the device pointer to the function's code and its size 2. Allocates memory to hold the SASS code 3. Copies the code from device memory to host memory
3. Disassembling SASS
Now we disassemble the binary SASS code into a human-readable format:
// Create a disassembler
CUpti_Activity_DisassembleData disassembleData;
memset(&disassembleData, 0, sizeof(disassembleData));
disassembleData.size = sizeof(disassembleData);
disassembleData.cubin = sassCode;
disassembleData.cubinSize = codeSize;
disassembleData.function = (const char *)function;
// Disassemble the code
CUPTI_CALL(cuptiActivityDisassembleKernel(&disassembleData));
// Get the disassembled SASS
const char *sassText = disassembleData.sass;
This code: 1. Sets up a structure for disassembly 2. Calls CUPTI to disassemble the kernel 3. Gets the resulting SASS text
4. Mapping SASS to Source Code
To map SASS instructions to source code, we use CUPTI's line information API:
// Get the number of functions in the module
uint32_t numFunctions = 0;
CUPTI_CALL(cuptiModuleGetNumFunctions(module, &numFunctions));
// Get the function IDs
CUpti_ModuleResourceData *functionIds =
(CUpti_ModuleResourceData *)malloc(numFunctions * sizeof(CUpti_ModuleResourceData));
CUPTI_CALL(cuptiModuleGetFunctions(module, numFunctions, functionIds));
// For each function
for (uint32_t i = 0; i < numFunctions; i++) {
// Check if this is our target function
if (strcmp(functionIds[i].resourceName, "vectorAdd") == 0) {
// Get line information
uint32_t numLines = 0;
CUPTI_CALL(cuptiGetNumLines(functionIds[i].function, &numLines));
// Allocate memory for line information
CUpti_LineInfo *lineInfo =
(CUpti_LineInfo *)malloc(numLines * sizeof(CUpti_LineInfo));
// Get the line information
CUPTI_CALL(cuptiGetLineInfo(functionIds[i].function, numLines, lineInfo));
// Process line information
for (uint32_t j = 0; j < numLines; j++) {
printf("SASS instruction at offset 0x%x maps to %s:%d\n",
lineInfo[j].pcOffset, lineInfo[j].fileName, lineInfo[j].lineNumber);
}
free(lineInfo);
}
}
free(functionIds);
This code: 1. Gets the list of functions in the module 2. Finds our target function 3. Gets line information for that function 4. Maps each SASS instruction offset to a source file and line number
5. Creating a Source-Annotated SASS Listing
Now we combine the SASS code with source line information:
void printSourceAnnotatedSass(const char *sassText, CUpti_LineInfo *lineInfo, uint32_t numLines)
{
// Parse the SASS text
char *sassCopy = strdup(sassText);
char *line = strtok(sassCopy, "\n");
int currentSourceLine = -1;
const char *currentFileName = NULL;
// Process each line of SASS
while (line != NULL) {
// Extract the instruction offset
unsigned int offset;
if (sscanf(line, "/*%x*/", &offset) == 1) {
// Find the source line for this offset
for (uint32_t i = 0; i < numLines; i++) {
if (lineInfo[i].pcOffset == offset) {
// If we've moved to a new source line, print it
if (currentSourceLine != lineInfo[i].lineNumber ||
currentFileName != lineInfo[i].fileName) {
currentSourceLine = lineInfo[i].lineNumber;
currentFileName = lineInfo[i].fileName;
// Read the source file and get the line
char sourceLine[1024];
FILE *sourceFile = fopen(currentFileName, "r");
if (sourceFile) {
for (int j = 0; j < currentSourceLine; j++) {
if (!fgets(sourceLine, sizeof(sourceLine), sourceFile)) {
break;
}
}
fclose(sourceFile);
// Remove newline
sourceLine[strcspn(sourceLine, "\n")] = 0;
printf("\nSource Line %d: %s\n", currentSourceLine, sourceLine);
}
}
break;
}
}
}
// Print the SASS instruction
printf(" %s\n", line);
// Get the next line
line = strtok(NULL, "\n");
}
free(sassCopy);
}
This function: 1. Parses the SASS text line by line 2. Extracts the instruction offset from each line 3. Finds the corresponding source line for that offset 4. Prints the source line followed by the SASS instructions
6. Sample Kernel for Analysis
Here's a simple vector addition kernel we'll analyze:
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}
This kernel: 1. Calculates a global thread index 2. Checks if the index is within bounds 3. Performs a simple vector addition
Running the Tutorial
-
Build the sample:
-
Run the SASS source mapping example:
Understanding the Output
When you run the SASS source mapping example, you'll see output similar to this:
SASS Disassembly for kernel 'vectorAdd':
Source Line 42: int i = blockIdx.x * blockDim.x + threadIdx.x;
/*0008*/ MOV R1, c[0x0][0x44]; /* Source Line 42 */
/*0010*/ S2R R0, SR_CTAID.X; /* Source Line 42 */
/*0018*/ S2R R3, SR_TID.X; /* Source Line 42 */
/*0020*/ IMAD R0, R0, c[0x0][0x28], R3; /* Source Line 42 */
Source Line 43: if (i < n) {
/*0028*/ ISETP.GE.AND P0, PT, R0, R1, PT; /* Source Line 43 */
/*0030*/ @P0 EXIT; /* Source Line 43 */
Source Line 44: c[i] = a[i] + b[i];
/*0038*/ IMUL R3, R0, 0x4; /* Source Line 44 */
/*0040*/ IMAD R2, R0, 0x4, c[0x0][0x140]; /* Source Line 44 */
/*0048*/ IMAD R1, R0, 0x4, c[0x0][0x148]; /* Source Line 44 */
/*0050*/ IMAD R0, R0, 0x4, c[0x0][0x150]; /* Source Line 44 */
/*0058*/ LDG R2, [R2]; /* Source Line 44 */
/*0060*/ LDG R1, [R1]; /* Source Line 44 */
/*0068*/ IADD R1, R1, R2; /* Source Line 44 */
/*0070*/ STG [R0], R1; /* Source Line 44 */
Source Line 45: }
/*0078*/ EXIT; /* Source Line 45 */
Performance Analysis:
Line 42 (Thread index calculation): 4 instructions (20% of kernel instructions)
Line 43 (Bounds check): 2 instructions (10% of kernel instructions)
Line 44 (Array access and computation): 8 instructions (70% of kernel instructions)
Line 45 (Kernel exit): 1 instruction
Let's analyze this output:
Thread Index Calculation (Line 42)
/*0008*/ MOV R1, c[0x0][0x44]; /* Source Line 42 */
/*0010*/ S2R R0, SR_CTAID.X; /* Source Line 42 */
/*0018*/ S2R R3, SR_TID.X; /* Source Line 42 */
/*0020*/ IMAD R0, R0, c[0x0][0x28], R3; /* Source Line 42 */
These instructions:
1. MOV R1, c[0x0][0x44]
: Load the value of n
into register R1
2. S2R R0, SR_CTAID.X
: Load the block index into R0
3. S2R R3, SR_TID.X
: Load the thread index into R3
4. IMAD R0, R0, c[0x0][0x28], R3
: Calculate blockIdx.x * blockDim.x + threadIdx.x
Bounds Check (Line 43)
/*0028*/ ISETP.GE.AND P0, PT, R0, R1, PT; /* Source Line 43 */
/*0030*/ @P0 EXIT; /* Source Line 43 */
These instructions:
1. ISETP.GE.AND P0, PT, R0, R1, PT
: Set predicate P0 if i >= n
2. @P0 EXIT
: Exit the kernel if P0 is true (i.e., if i >= n
)
Vector Addition (Line 44)
/*0038*/ IMUL R3, R0, 0x4; /* Source Line 44 */
/*0040*/ IMAD R2, R0, 0x4, c[0x0][0x140]; /* Source Line 44 */
/*0048*/ IMAD R1, R0, 0x4, c[0x0][0x148]; /* Source Line 44 */
/*0050*/ IMAD R0, R0, 0x4, c[0x0][0x150]; /* Source Line 44 */
/*0058*/ LDG R2, [R2]; /* Source Line 44 */
/*0060*/ LDG R1, [R1]; /* Source Line 44 */
/*0068*/ IADD R1, R1, R2; /* Source Line 44 */
/*0070*/ STG [R0], R1; /* Source Line 44 */
These instructions:
1. IMUL R3, R0, 0x4
: Multiply index by 4 (size of float)
2. IMAD R2, R0, 0x4, c[0x0][0x140]
: Calculate address of a[i]
3. IMAD R1, R0, 0x4, c[0x0][0x148]
: Calculate address of b[i]
4. IMAD R0, R0, 0x4, c[0x0][0x150]
: Calculate address of c[i]
5. LDG R2, [R2]
: Load a[i]
into R2
6. LDG R1, [R1]
: Load b[i]
into R1
7. IADD R1, R1, R2
: Add R1 and R2, store result in R1
8. STG [R0], R1
: Store result in c[i]
Kernel Exit (Line 45)
This instruction:
1. EXIT
: Exit the kernel
Interpreting SASS for Optimization
Memory Access Patterns
In the SASS for line 44, we see:
These are global memory loads. The LDG
instruction loads from global memory. For optimal performance:
- Adjacent threads should access adjacent memory locations
- Memory accesses should be aligned to 128-byte boundaries
- Coalesced memory access is critical for performance
Instruction Mix
Looking at the instruction mix:
- 4 instructions for thread index calculation (20%)
- 2 instructions for bounds checking (10%)
- 8 instructions for the actual computation (70%)
This tells us: - The kernel has a reasonable ratio of computation to overhead - Most instructions are dedicated to the actual vector addition - The thread index calculation and bounds checking are relatively efficient
Register Usage
The SASS shows register usage:
- R0: Used for thread index and later for the address of
c[i]
- R1: Initially holds
n
, later holdsb[i]
and the final result - R2: Holds the address of
a[i]
and later the value ofa[i]
- R3: Initially holds thread ID, later used for byte offset calculation
Register usage is efficient with good reuse of registers.
Advanced SASS Analysis
Instruction Throughput
Different SASS instructions have different throughput:
MOV
,S2R
: Fast register operationsIMAD
,IMUL
: Integer arithmetic (medium throughput)LDG
,STG
: Global memory operations (slow, high latency)ISETP
: Predicate operations (medium throughput)
In performance-critical kernels, minimizing the use of slow instructions is important.
Predicated Execution
The SASS shows predicated execution:
This uses the predicate register P0 to conditionally execute the EXIT instruction. Predication can avoid branch divergence but has its own costs.
Memory Address Calculation
The SASS shows memory address calculations:
This calculates base_address + i * 4
using a multiply-add instruction. The compiler has optimized this calculation.
Optimization Strategies Based on SASS
1. Memory Coalescing
If the SASS shows many uncoalesced memory accesses, consider: - Reorganizing your data structures - Using shared memory as a staging area - Adjusting your thread block dimensions
2. Instruction Reduction
If certain source lines generate many instructions: - Simplify complex expressions - Use intrinsic functions when appropriate - Consider algorithmic changes
3. Register Pressure
If the SASS shows high register usage: - Break complex functions into smaller ones - Reduce the number of variables in flight - Consider using shared memory instead of registers for some data
Next Steps
- Apply SASS analysis to your own CUDA kernels
- Look for patterns in the generated code that might indicate inefficiencies
- Compare SASS across different compiler optimization levels
- Use SASS insights to guide source-level optimizations