CUPTI Trace Injection Tutorial
The GitHub repo and complete tutorial are available at https://github.com/eunomia-bpf/cupti-tutorial.
Introduction
The CUPTI Trace Injection sample demonstrates how to create a lightweight tracing library that can be automatically injected into any CUDA application. This approach enables comprehensive activity tracing without requiring source code modifications, making it perfect for profiling existing applications, third-party libraries, or production workloads.
What You'll Learn
- How to build an injection library for automatic CUDA activity tracing
- Understanding the CUDA injection mechanism for seamless integration
- Implementing NVTX activity recording for enhanced timeline visualization
- Cross-platform injection techniques (Linux and Windows)
- Collecting comprehensive trace data without application modifications
Understanding Trace Injection
Trace injection provides several key advantages for CUDA profiling:
- Zero application modification: Profile any CUDA application without recompilation
- Automatic activation: CUDA runtime loads and initializes the tracing automatically
- Comprehensive coverage: Captures all CUDA operations and activities
- NVTX integration: Records user-defined ranges and markers
- Timeline visualization: Generates data suitable for timeline analysis tools
Architecture Overview
The trace injection system consists of:
- Injection Library: libcupti_trace_injection.so (Linux) or libcupti_trace_injection.dll (Windows)
- CUDA Injection Hook: Automatic loading via CUDA_INJECTION64_PATH
- NVTX Integration: Optional NVTX activity recording via NVTX_INJECTION64_PATH
- Activity Collection: Comprehensive CUDA API and GPU activity tracing
- Output Generation: Structured trace data for analysis tools
Key Features
Automatic Initialization
- No source code changes required
- Works with any CUDA application
- Supports both runtime and driver APIs
- Handles complex multi-threaded applications
Comprehensive Activity Tracing
- CUDA runtime API calls
- CUDA driver API calls
- Kernel execution activities
- Memory transfer operations
- Context and stream management
NVTX Support
- User-defined range recording
- Custom markers and annotations
- Enhanced timeline visualization
- Application phase correlation
Building the Sample
Prerequisites
Ensure you have:
- CUDA Toolkit with CUPTI
- Development tools (gcc/Visual Studio)
- For Windows: Microsoft Detours library
Linux Build Process
- Navigate to the sample directory.
- Build using the provided Makefile (see the sketch below).
This creates libcupti_trace_injection.so.
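A minimal sketch of the build commands, assuming a checkout of the repository above; the directory name is illustrative and the Makefile's variables may differ in your copy:

```bash
# Hypothetical path; use the location of the sample in your checkout
cd cupti-tutorial/cupti_trace_injection

# Build with the provided Makefile (you may need to point it at your
# CUDA Toolkit install, depending on how the Makefile is written)
make
```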
Windows Build Process
For Windows builds, you need the Microsoft Detours library:
- Download and build the Microsoft Detours library.
- Copy the required Detours files into the sample directory.
- Build the sample (see the sketch below).
This creates libcupti_trace_injection.dll.
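A hedged outline of the corresponding Windows commands, run from a Visual Studio developer command prompt; the Detours directory layout, file names, sample path, and build invocation are assumptions that depend on your setup:

```bat
:: 1. Build Detours (the build places detours.h under include\ and detours.lib under lib.X64\)
cd Detours
nmake

:: 2. Copy the header and import library next to the sample's build files
copy include\detours.h <sample_directory>
copy lib.X64\detours.lib <sample_directory>

:: 3. Build the injection DLL with the sample's build files
cd <sample_directory>
nmake
```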
Running the Sample
Linux Usage
- Set up the injection environment variables.
- Run your CUDA application as usual (see the example below).
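For example (the library paths and application name are placeholders):

```bash
# Point the CUDA driver at the injection library (use the full path)
export CUDA_INJECTION64_PATH=/full/path/to/libcupti_trace_injection.so

# Optional: enable NVTX activity recording by pointing at the CUPTI library
export NVTX_INJECTION64_PATH=/full/path/to/libcupti.so

# Run the application normally; no rebuild or code change is needed
./your_cuda_application
```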
Windows Usage
- Set up the injection environment variables.
- Run your CUDA application as usual (see the example below).
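For example, from a command prompt (the paths, the CUPTI DLL name, and the application name are placeholders):

```bat
:: Point the CUDA driver at the injection DLL (use the full path)
set CUDA_INJECTION64_PATH=C:\full\path\to\libcupti_trace_injection.dll

:: Optional: enable NVTX activity recording by pointing at the CUPTI DLL
set NVTX_INJECTION64_PATH=C:\full\path\to\cupti64_<version>.dll

:: Run the application normally
your_cuda_application.exe
```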
Environment Variables
CUDA_INJECTION64_PATH
Specifies the path to your injection library. When set, CUDA automatically:
- Loads the shared library at initialization
- Calls the InitializeInjection() function
- Enables tracing for all subsequent CUDA operations
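As a rough illustration of what happens at that point, here is a minimal sketch of an injection entry point that registers CUPTI activity-buffer callbacks and enables a few activity kinds. The callback names, buffer size, and the plain printf output are assumptions for illustration; the sample in the repository does considerably more.

```cpp
// Minimal sketch of an injection entry point (not the sample's exact code).
#include <cupti.h>
#include <cstdio>
#include <cstdlib>

// CUPTI calls this when it needs a buffer for activity records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *size = 8 * 1024 * 1024;              // 8 MB activity buffer (arbitrary choice)
    *buffer = (uint8_t *)malloc(*size);
    *maxNumRecords = 0;                   // 0 = fit as many records as the buffer holds
}

// CUPTI calls this when a buffer is full or flushed; walk the records here.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size, size_t validSize)
{
    CUpti_Activity *record = nullptr;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        // A real tracer decodes each record type; this just notes the kind.
        printf("CUPTI activity kind %d\n", (int)record->kind);
    }
    free(buffer);
}

// The CUDA driver looks up and calls this function in the library named by
// CUDA_INJECTION64_PATH. Returning non-zero signals successful initialization.
extern "C" int InitializeInjection(void)
{
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_RUNTIME);            // runtime API calls
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_DRIVER);             // driver API calls
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);  // kernel executions
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);             // memory transfers
    return 1;
}
```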
NVTX_INJECTION64_PATH
Optional path to CUPTI library for NVTX activity recording:
- Enables user-defined range collection
- Records custom markers and annotations
- Provides enhanced timeline context
Understanding the Output
Trace Data Format
The injection library generates comprehensive trace data including:
CUDA Runtime API Calls:
cudaMalloc: Start=1234567890, End=1234567925, Duration=35μs
cudaMemcpy: Start=1234567950, End=1234568100, Duration=150μs
cudaLaunchKernel: Start=1234568150, End=1234568175, Duration=25μs
GPU Activities:
Kernel: vectorAdd, Start=1234568200, End=1234568500, Duration=300μs
MemcpyHtoD: Size=4096KB, Start=1234567950, End=1234568100, Duration=150μs
MemcpyDtoH: Size=4096KB, Start=1234568600, End=1234568750, Duration=150μs
NVTX Ranges:
Range: "Data Preparation", Start=1234567800, End=1234568150, Duration=350μs
Range: "Computation", Start=1234568150, End=1234568550, Duration=400μs
Range: "Result Validation", Start=1234568600, End=1234568900, Duration=300μs
Key Metrics
- API Call Timing: Duration of CUDA runtime and driver API calls
- GPU Activity Timeline: Actual kernel execution and memory transfer times
- Memory Usage: Allocation sizes and transfer patterns
- Concurrency Analysis: Overlapping operations and stream utilization
- User-Defined Context: NVTX ranges providing application semantics
Practical Applications
Performance Analysis
Use trace injection for:
- Bottleneck identification: Find the slowest operations in your application
- Concurrency analysis: Understand how well operations overlap
- Memory bandwidth utilization: Analyze data transfer efficiency
- API overhead measurement: Quantify CUDA API call costs
Timeline Visualization
Trace data can be imported into:
- NVIDIA Nsight Systems: Comprehensive timeline analysis
- Chrome Tracing: Web-based visualization
- Custom analysis tools: Programmatic trace processing
- Performance comparison tools: Before/after optimization analysis
Production Monitoring
Deploy in production environments to:
- Monitor application performance over time
- Detect performance regressions
- Analyze real-world workload patterns
- Generate automated performance reports
Advanced Usage
Custom Activity Filtering
Modify the injection library to focus on specific activities:
#include <cstring>  // for strstr

// Filter specific API calls: trace only kernel launches and memory copies
bool shouldTraceAPI(const char* apiName) {
    return (strstr(apiName, "Launch") != nullptr ||
            strstr(apiName, "Memcpy") != nullptr);
}

// Filter kernel activities: skip kernels whose names contain "internal_"
bool shouldTraceKernel(const char* kernelName) {
    return strstr(kernelName, "internal_") == nullptr;
}
Enhanced NVTX Integration
Leverage NVTX for better application context:
#include <nvtx3/nvToolsExt.h>  // NVTX v3 header shipped with the CUDA Toolkit

// In your application (optional, but enhances tracing)
nvtxRangePush("Critical Section");
// ... CUDA operations ...
nvtxRangePop();

nvtxMark("Checkpoint A");
Multi-GPU Analysis
The injection library automatically handles:
- Multiple GPU contexts
- Cross-device memory transfers
- Peer-to-peer communications
- Device-specific activity timelines
Output Formats and Analysis
Raw Data Processing
# Convert trace data to various formats
./process_trace_data --input trace.cupti --output timeline.json --format chrome
# Generate performance summary
./analyze_trace --input trace.cupti --summary performance_report.txt
# Compare multiple traces
./compare_traces --baseline baseline.cupti --optimized optimized.cupti
Integration with Analysis Tools
# Python analysis example
import cupti_trace_parser
trace = cupti_trace_parser.load('trace.cupti')
kernel_times = trace.get_kernel_durations()
api_overhead = trace.get_api_overhead()
print(f"Total kernel time: {sum(kernel_times)}μs")
print(f"Average API overhead: {api_overhead.mean()}μs")
Troubleshooting
Common Issues
- Library not loaded: Verify the full path in environment variables
- Permission errors: Ensure proper file and directory permissions
- Missing dependencies: Check that all required libraries are available
- NVTX not working: Verify NVTX_INJECTION64_PATH points to the correct CUPTI library
Debug Tips
- Test with simple applications: Start with basic CUDA samples
- Check environment setup: Verify all paths are correct and accessible
- Enable verbose logging: Add debug output to the injection library
- Monitor library loading: Use system tools to verify injection is working (see the example below)
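On Linux, one quick way to verify the injection library is actually being loaded is to ask the dynamic loader to report the libraries it pulls in (the application name is a placeholder):

```bash
# Print every shared library the loader loads, then filter for the injection library
LD_DEBUG=libs ./your_cuda_application 2>&1 | grep cupti_trace_injection
```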
Platform-Specific Notes
Linux:
- Use ldd to check library dependencies
- Verify LD_LIBRARY_PATH includes required directories
- Check that shared libraries have execute permissions
Windows:
- Use Dependency Walker to analyze DLL dependencies
- Ensure all DLLs are in the system PATH or application directory
- Verify that Visual C++ redistributables are installed
Next Steps
- Apply trace injection to profile your own CUDA applications
- Experiment with different NVTX annotations to enhance trace context
- Develop custom analysis scripts for your specific performance metrics
- Integrate trace collection into your development and deployment workflows
- Combine with other CUPTI samples for comprehensive performance analysis