CUPTI PC Sampling Analysis Utility Tutorial
The GitHub repo and complete tutorial are available at https://github.com/eunomia-bpf/cupti-tutorial.
Introduction
The PC Sampling Utility is a powerful post-processing tool that analyzes data collected by the pc_sampling_continuous
sample. It transforms raw PC sampling data into actionable insights by correlating assembly instructions with stall reasons and providing source-level mapping when debug information is available.
What You'll Learn
- How to analyze PC sampling data files generated by continuous sampling
- Understanding stall reason counters at the assembly instruction level
- Techniques for correlating assembly code with CUDA C source code
- Working with CUDA cubin files for detailed analysis
- Interpreting performance bottlenecks from PC sampling results
Understanding PC Sampling Data Analysis
PC sampling analysis differs from real-time monitoring because it:
- Processes collected data offline: Allows detailed analysis without runtime overhead
- Provides assembly-level insights: Shows exactly which instructions cause performance issues
- Correlates with source code: Maps performance hotspots back to your original C/C++ code
- Quantifies stall reasons: Explains why GPU execution units are idle
- Supports batch processing: Can analyze multiple sampling sessions together
Key Concepts
Stall Reasons
GPU warps can be stalled for various reasons:
- MEMORY_DEPENDENCY: Waiting for memory operations to complete
- EXECUTION_DEPENDENCY: Waiting for previous instructions in the pipeline
- NOT_SELECTED: Warp is ready but scheduler chose other warps
- MEMORY_THROTTLE: Memory subsystem is saturated
- PIPE_BUSY: Execution pipeline is fully utilized
- CONSTANT_MEMORY_DEPENDENCY: Waiting for constant memory access
- TEXTURE_MEMORY_DEPENDENCY: Waiting for texture memory access
Assembly to Source Correlation
The utility can map assembly instructions back to source code when:
- Debug/line information is compiled into the application (nvcc's -lineinfo flag, or -G for device debug builds)
- CUDA cubin files are extracted and properly named
- Source files are accessible at analysis time
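A typical compile-and-extract sequence might look like the following; the file names are illustrative, but nvcc's -lineinfo flag is what embeds the line tables the utility needs for source correlation:
# Compile with line information for source-level correlation
nvcc -lineinfo -o vector_add vector_add.cu
# Extract the embedded cubin(s) for later analysis
cuobjdump -xelf all vector_add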
Building the Utility
Prerequisites
Ensure you have:
- CUDA Toolkit installed
- CUPTI libraries available
- Access to cubin files from your target application
Build Process
- Navigate to the pc_sampling_utility directory:
cd pc_sampling_utility
- Build using the provided Makefile:
make
This creates the pc_sampling_utility executable.
Preparing Input Data
Generating PC Sampling Data
First, collect PC sampling data using the continuous sampling library:
# Using the continuous sampling library
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/cupti/lib64:/path/to/pc_sampling_continuous
./libpc_sampling_continuous.pl --app ./your_cuda_application --output samples.data
Extracting CUDA Cubin Files
For source correlation, extract cubin files from your application:
# Extract all cubin files from executable
cuobjdump -xelf all your_cuda_application
# Extract from library files
cuobjdump -xelf all libmy_cuda_library.so
Important: The cuobjdump version must match the CUDA Toolkit version used to build your application.
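You can confirm that the versions line up by comparing what both tools report:
# Both should report the same CUDA Toolkit version
cuobjdump --version
nvcc --version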
Naming Cubin Files
Rename the extracted cubin files sequentially:
# Rename cubin files in order
mv first_extracted_file.cubin 1.cubin
mv second_extracted_file.cubin 2.cubin
mv third_extracted_file.cubin 3.cubin
# ... and so on
The utility expects cubin files to be named 1.cubin, 2.cubin, 3.cubin, etc.
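Renaming by hand gets tedious with many cubins; a small shell loop handles it, with the caveat that glob order is alphabetical and should be checked against the order in which cuobjdump extracted the files:
# Rename all extracted cubins sequentially
i=1
for f in *.cubin; do
    mv "$f" "$i.cubin"
    i=$((i+1))
done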
Running the Analysis
Basic Usage
Run the utility on a collected sampling data file:
./pc_sampling_utility --input samples.data
Command Line Options
View all available options:
./pc_sampling_utility --help
Common options include:
# Specify input file
./pc_sampling_utility --input samples.data
# Set cubin directory
./pc_sampling_utility --input samples.data --cubin-path ./cubins/
# Enable verbose output
./pc_sampling_utility --input samples.data --verbose
# Filter specific kernels
./pc_sampling_utility --input samples.data --kernel vectorAdd
# Set output format
./pc_sampling_utility --input samples.data --format csv
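These options can be combined into a single run; for instance, correlating against extracted cubins, restricting to one kernel, and writing CSV, using only the flags shown in this tutorial:
./pc_sampling_utility --input samples.data --cubin-path ./cubins/ \
    --kernel vectorAdd --format csv --output vectorAdd_analysis.csv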
Understanding the Output
Sample Output Format
Kernel: vectorAdd(float*, float*, float*, int)
================================================================================
Assembly Analysis:
PC: 0x008 | INST: LDG.E.SYS R2, [R8] | Stall: MEMORY_DEPENDENCY | Count: 245 (15.3%)
PC: 0x010 | INST: LDG.E.SYS R4, [R10] | Stall: MEMORY_DEPENDENCY | Count: 198 (12.4%)
PC: 0x018 | INST: FADD R6, R2, R4 | Stall: EXECUTION_DEPENDENCY | Count: 89 (5.6%)
PC: 0x020 | INST: STG.E.SYS [R12], R6 | Stall: MEMORY_DEPENDENCY | Count: 156 (9.7%)
Source Correlation:
PC: 0x008 | File: vector_add.cu | Line: 42 | Code: float a = A[i];
PC: 0x010 | File: vector_add.cu | Line: 43 | Code: float b = B[i];
PC: 0x018 | File: vector_add.cu | Line: 44 | Code: float result = a + b;
PC: 0x020 | File: vector_add.cu | Line: 45 | Code: C[i] = result;
Performance Summary:
Total Samples: 1599
Memory Bound: 599 samples (37.5%)
Execution Bound: 234 samples (14.6%)
Scheduler Limited: 445 samples (27.8%)
Other: 321 samples (20.1%)
Key Metrics to Analyze
- Stall Distribution: Which stall reasons dominate your kernel execution
- Hotspot Instructions: Assembly instructions with the highest sample counts
- Memory Access Patterns: How memory operations contribute to stalls
- Source Line Correlation: Which source lines correspond to performance issues
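With CSV output, standard command-line tools can rank hotspots. The field number below is a placeholder; check the actual CSV header and adjust it:
# Show the ten most-sampled instructions, assuming the
# sample count is in column 4 of the CSV
sort -t, -k4 -nr analysis.csv | head -10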
Practical Analysis Workflows
Identifying Memory Bottlenecks
- Look for MEMORY_DEPENDENCY stalls: High counts indicate memory-bound kernels
- Analyze access patterns: Check if accesses are coalesced
- Consider caching strategies: Evaluate shared memory or texture memory usage
Example workflow:
# Focus on memory-related stalls
./pc_sampling_utility --input samples.data --filter-stall MEMORY_DEPENDENCY
# Analyze specific memory instructions
./pc_sampling_utility --input samples.data --filter-instruction "LDG\|STG"
Optimizing Instruction Dependencies
- Identify EXECUTION_DEPENDENCY hotspots: Shows instruction pipeline stalls
- Analyze instruction ordering: Look for opportunities to reorder operations
- Consider ILP (Instruction Level Parallelism): Find independent operations
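For example, you can isolate these pipeline stalls with the stall filter introduced earlier:
./pc_sampling_utility --input samples.data --filter-stall EXECUTION_DEPENDENCY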
Understanding Scheduler Behavior
- Monitor NOT_SELECTED stalls: Indicates scheduler pressure
- Analyze warp utilization: Check if enough warps are available
- Consider occupancy optimization: Increase warps per SM when possible
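Register and shared memory usage are the usual occupancy limiters; nvcc can report both at compile time (source file name illustrative):
# Print per-kernel register and shared memory usage
nvcc -Xptxas -v -lineinfo -o vector_add vector_add.cu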
Advanced Analysis Techniques
Comparing Multiple Runs
# Analyze baseline version
./pc_sampling_utility --input baseline.data --output baseline_analysis.txt
# Analyze optimized version
./pc_sampling_utility --input optimized.data --output optimized_analysis.txt
# Compare results
diff baseline_analysis.txt optimized_analysis.txt
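A full diff can be noisy; extracting just the summary lines from the text output shown earlier often makes the comparison clearer:
# Compare the headline numbers side by side
grep "Total Samples" baseline_analysis.txt optimized_analysis.txt
grep "Memory Bound" baseline_analysis.txt optimized_analysis.txt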
Statistical Analysis
# Generate CSV output for spreadsheet analysis
./pc_sampling_utility --input samples.data --format csv --output analysis.csv
# Create histograms of stall reasons
./pc_sampling_utility --input samples.data --histogram --bins 20
Kernel-Specific Analysis
# Analyze only specific kernels
./pc_sampling_utility --input samples.data --kernel "matrixMul.*"
# Exclude certain kernels
./pc_sampling_utility --input samples.data --exclude-kernel "memcpy.*"
Integration with Development Workflow
Performance Regression Detection
- Baseline establishment: Create performance profiles for known-good versions
- Automated analysis: Include PC sampling in CI/CD pipelines
- Threshold monitoring: Alert on significant performance changes
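As a sketch of such a threshold gate, assuming the text summary format shown earlier and a project-specific limit:
#!/bin/bash
# Fail the build if the memory-bound share exceeds the limit
THRESHOLD=40.0
./pc_sampling_utility --input samples.data --output analysis.txt
PCT=$(grep "Memory Bound" analysis.txt | sed 's/.*(\([0-9.]*\)%).*/\1/')
if awk -v p="$PCT" -v t="$THRESHOLD" 'BEGIN { exit !(p > t) }'; then
    echo "Regression: memory-bound share ${PCT}% exceeds ${THRESHOLD}%"
    exit 1
fi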
Optimization Guidance
- Hotspot identification: Focus optimization efforts on high-impact areas
- Validation: Verify that optimizations reduce relevant stall reasons
- Iteration: Use sampling data to guide successive optimization attempts
Troubleshooting
Common Issues
- Missing cubin files: Ensure cubins are extracted and properly named
- Version mismatches: Verify cuobjdump version matches CUDA Toolkit
- Missing debug info: Compile with nvcc's -lineinfo (or -G) flag for source correlation
- Path issues: Check that cubin files are in the expected location
Debug Tips
- Start with simple kernels: Test the workflow with basic examples first
- Verify cubin extraction: Check that cuobjdump produces valid files
- Test without source correlation: Ensure basic assembly analysis works
- Use verbose output: Enable detailed logging to understand processing steps
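Putting these tips together, a minimal end-to-end smoke test could look like this (file names illustrative, flags as used earlier in this tutorial):
# 1. Build a trivial kernel with line info
nvcc -lineinfo -o vector_add vector_add.cu
# 2. Collect samples with the continuous sampling library
./libpc_sampling_continuous.pl --app ./vector_add --output samples.data
# 3. Extract the cubin (assumes a single extracted file)
cuobjdump -xelf all vector_add
mv *.cubin 1.cubin
# 4. Run the utility with verbose logging
./pc_sampling_utility --input samples.data --cubin-path . --verbose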
Next Steps
- Apply PC sampling analysis to identify performance bottlenecks in your applications
- Integrate the analysis workflow into your optimization process
- Experiment with different sampling configurations to balance detail and overhead
- Combine PC sampling results with other profiling tools for comprehensive analysis
- Develop custom scripts to automate analysis for your specific use cases