CUPTI Program Counter (PC) Sampling Continuous Tutorial
The GitHub repo and complete tutorial is available at https://github.com/eunomia-bpf/cupti-tutorial.
Introduction
Program Counter (PC) sampling is a powerful profiling technique that allows you to understand where your CUDA kernels spend their execution time at the assembly instruction level. This tutorial demonstrates how to implement continuous PC sampling that can monitor any CUDA application without requiring source code modifications.
What You'll Learn
- How to build a dynamic library for PC sampling injection
- Techniques for continuous profiling of CUDA applications
- Understanding PC sampling data and stall reasons
- Cross-platform implementation (Linux and Windows)
- Using the profiling library with existing applications
Understanding PC Sampling Continuous
PC sampling continuous differs from other profiling methods because it:
- Operates at the assembly level: Provides insights into actual GPU instruction execution
- Requires no source modifications: Can profile any CUDA application
- Works via library injection: Uses dynamic loading to intercept CUDA calls
- Provides stall reason analysis: Shows why warps are not making progress
- Supports real-time monitoring: Can observe performance during execution
Architecture Overview
The continuous PC sampling system consists of:
- Dynamic Library:
libpc_sampling_continuous.so
(Linux) orpc_sampling_continuous.lib
(Windows) - Injection Mechanism: Uses
LD_PRELOAD
(Linux) or DLL injection (Windows) - CUPTI Integration: Leverages CUPTI's PC sampling APIs
- Helper Script:
libpc_sampling_continuous.pl
for easy execution
Building the Sample
Linux Build Process
-
Navigate to the pc_sampling_continuous directory:
-
Build using the provided Makefile:
This creates libpc_sampling_continuous.so
in the current directory.
Windows Build Process
For Windows, you need to build the Microsoft Detours library first:
- Download Detours source:
- From GitHub: https://github.com/microsoft/Detours
-
Or Microsoft: https://www.microsoft.com/en-us/download/details.aspx?id=52586
-
Build Detours:
-
Copy required files:
-
Build the sample:
This creates pc_sampling_continuous.lib
.
Running the Sample
Linux Execution
-
Set up library paths:
-
Use the helper script:
This shows all available options.
- Run with your application:
Windows Execution
-
Set up library paths:
-
Run with your application:
Helper Script Options
The libpc_sampling_continuous.pl
script provides various configuration options:
# Show help
./libpc_sampling_continuous.pl --help
# Basic usage
./libpc_sampling_continuous.pl --app ./my_cuda_app
# Specify sampling frequency
./libpc_sampling_continuous.pl --app ./my_cuda_app --frequency 1000
# Set output file
./libpc_sampling_continuous.pl --app ./my_cuda_app --output samples.data
# Enable verbose output
./libpc_sampling_continuous.pl --app ./my_cuda_app --verbose
Understanding the Output
Sample Data Format
The PC sampling generates data files containing:
- Function Information: Kernel names and their addresses
- PC Samples: Program counter values with timestamps
- Stall Reasons: Why warps were stalled at each sample point
- Source Correlation: Assembly to source code mapping (when debug info is available)
Example Output Structure
Kernel: vectorAdd(float*, float*, float*, int)
PC: 0x7f8b2c001000, Stall: MEMORY_DEPENDENCY, Count: 15
PC: 0x7f8b2c001008, Stall: EXECUTION_DEPENDENCY, Count: 8
PC: 0x7f8b2c001010, Stall: NOT_SELECTED, Count: 12
...
Key Features
Automatic Injection
The library automatically: - Intercepts CUDA runtime and driver API calls - Sets up PC sampling for each kernel launch - Collects samples during kernel execution - Generates detailed reports
No Source Modification Required
Benefits include: - Profile existing applications without recompilation - Analyze third-party CUDA libraries - Monitor production workloads - Compare different optimization strategies
Cross-Platform Support
The implementation handles: - Different dynamic loading mechanisms (Linux vs Windows) - Platform-specific library formats - Varying CUDA installation paths - Different debugging symbol formats
Practical Applications
Performance Hotspot Identification
Use PC sampling to: 1. Find bottleneck instructions: Identify assembly instructions where execution time is spent 2. Analyze stall patterns: Understand why warps are not making progress 3. Optimize memory access: Detect memory-bound operations 4. Improve instruction scheduling: Identify dependency stalls
Algorithm Analysis
Apply to: 1. Compare implementations: Profile different algorithmic approaches 2. Validate optimizations: Measure the impact of code changes 3. Understand GPU utilization: See how well your code uses available resources 4. Debug performance regressions: Identify when and where performance degrades
Advanced Usage
Custom Sampling Configurations
Modify sampling behavior by: 1. Adjusting sampling frequency: Balance detail vs overhead 2. Filtering specific kernels: Focus on particular functions 3. Setting duration limits: Control profiling duration 4. Configuring output formats: Choose appropriate data formats
Integration with Other Tools
Combine with: 1. NVIDIA Nsight Compute: Correlate with detailed metrics 2. NVIDIA Nsight Systems: Add timeline context 3. Custom analysis scripts: Process sampling data programmatically 4. Visualization tools: Create performance charts and graphs
Troubleshooting
Common Issues
- Library not found: Ensure all paths are correctly set in LD_LIBRARY_PATH/PATH
- Permission errors: Check that the target application has necessary permissions
- CUDA version mismatch: Verify CUPTI version matches CUDA runtime
- Missing symbols: Ensure debug information is available for source correlation
Debug Tips
- Enable verbose logging: Use
--verbose
flag to see detailed execution - Check dependencies: Verify all required libraries are accessible
- Test with simple apps: Start with basic CUDA samples before complex applications
- Monitor resource usage: Ensure sufficient memory for sampling data
Next Steps
- Experiment with different sampling frequencies to find optimal settings
- Apply continuous PC sampling to your own CUDA applications
- Combine with the
pc_sampling_utility
to analyze collected data - Explore correlation with source code using debug symbols
- Integrate PC sampling data with your performance analysis workflow