CUPTI Checkpoint API Tutorial
The GitHub repo and complete tutorial are available at https://github.com/eunomia-bpf/cupti-tutorial.
Introduction
The CUPTI Checkpoint API provides a powerful mechanism for capturing and restoring GPU device state, enabling reproducible kernel execution even when kernels modify their own input data. This tutorial demonstrates how to use checkpoints to ensure consistent results across multiple kernel invocations.
What You'll Learn
- How to use CUPTI's checkpoint API to save and restore GPU state
- Techniques for ensuring reproducible kernel execution
- Understanding when checkpoints are necessary for correctness
- Managing device memory state across kernel invocations
- Best practices for checkpoint-based debugging and testing
Understanding the Problem
Many CUDA kernels modify their input data during execution, which can lead to different results when the same kernel is run multiple times. This is particularly common in:
- Reduction operations that overwrite input arrays
- In-place transformations that modify data during processing
- Iterative algorithms that use the same buffer for input and output
- Debugging scenarios where you want to replay the exact same conditions
The Checkpoint Solution
CUPTI's checkpoint API allows you to:
1. Save the complete state of GPU memory at a specific point
2. Restore that exact state later, ensuring identical conditions
3. Replay kernel executions with guaranteed reproducibility
Code Architecture
Checkpoint Structure
// Configure a checkpoint object
CUpti_Checkpoint cp = { CUpti_Checkpoint_STRUCT_SIZE };
cp.ctx = context; // CUDA context to checkpoint
cp.optimizations = 1; // Enable optimizations
Basic Checkpoint Workflow
// 1. Save checkpoint before first kernel execution
CUPTI_API_CALL(cuptiCheckpointSave(&cp));
// 2. Run kernel (may modify input data)
MyKernel<<<blocks, threads>>>(deviceData, size);
// 3. For subsequent runs, restore checkpoint first
CUPTI_API_CALL(cuptiCheckpointRestore(&cp));
// 4. Run kernel again with identical initial conditions
MyKernel<<<blocks, threads>>>(deviceData, size);
Sample Walkthrough
The Problem Kernel
Our sample uses a reduction kernel that demonstrates the issue:
__global__ void Reduce(float *pData, size_t N)
{
    float totalSumData = 0.0;

    // Each thread sums its elements locally
    for (int i = threadIdx.x; i < N; i += blockDim.x)
    {
        totalSumData += pData[i];
    }

    // Save per-thread sum back to input array (MODIFIES INPUT!)
    pData[threadIdx.x] = totalSumData;
    __syncthreads();

    // Thread 0 reduces to final result
    if (threadIdx.x == 0)
    {
        float totalSum = 0.0;
        size_t setElements = (blockDim.x < N ? blockDim.x : N);
        for (int i = 0; i < setElements; i++)
        {
            totalSum += pData[i];
        }
        pData[0] = totalSum; // Final result
    }
}
Key Issue: This kernel overwrites the input array pData with intermediate results, so subsequent runs start from corrupted data and produce different results.
Without Checkpoints
// Initialize array with all 1.0 values
for (size_t i = 0; i < elements; i++) {
    pHostA[i] = 1.0;
}
cudaMemcpy(pDeviceA, pHostA, size, cudaMemcpyHostToDevice);

// Run kernel multiple times
for (int repeat = 0; repeat < 3; repeat++) {
    Reduce<<<1, 64>>>(pDeviceA, elements);
    float result;
    cudaMemcpy(&result, pDeviceA, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Iteration %d: result = %f\n", repeat + 1, result);
}
Output:
Iteration 1: result = 1048576.000000 // Correct sum of 1M ones
Iteration 2: result = 64.000000 // Wrong! Input was modified
Iteration 3: result = 1.000000 // Even more wrong!
With Checkpoints
// Configure checkpoint
CUpti_Checkpoint cp = { CUpti_Checkpoint_STRUCT_SIZE };
cp.ctx = context;
cp.optimizations = 1;

float expected;
for (int repeat = 0; repeat < 3; repeat++) {
    // Save or restore checkpoint
    if (repeat == 0) {
        CUPTI_API_CALL(cuptiCheckpointSave(&cp));
    } else {
        CUPTI_API_CALL(cuptiCheckpointRestore(&cp));
    }

    // Run kernel with identical initial conditions
    Reduce<<<1, 64>>>(pDeviceA, elements);

    float result;
    cudaMemcpy(&result, pDeviceA, sizeof(float), cudaMemcpyDeviceToHost);
    if (repeat == 0) {
        expected = result; // Save expected result
    }
    printf("Iteration %d: result = %f\n", repeat + 1, result);

    // Verify reproducibility
    if (result != expected) {
        printf("ERROR: Inconsistent result!\n");
        exit(1);
    }
}

// Release checkpoint resources when done
CUPTI_API_CALL(cuptiCheckpointFree(&cp));
Output:
Iteration 1: result = 1048576.000000 // Correct result
Iteration 2: result = 1048576.000000 // Same result!
Iteration 3: result = 1048576.000000 // Consistent!
Building and Running
Prerequisites
- CUDA Toolkit with CUPTI support
- C++ compiler compatible with CUDA
- GPU with compute capability 3.5 or higher
Build Process
Execution
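The sample follows the repository's usual Makefile layout; the directory name and Makefile variable below are assumptions, so adjust them to your checkout and CUDA installation:

```shell
# Assumed sample directory and Makefile variable; adjust as needed
cd checkpoint_kernels
make CUDA_INSTALL_PATH=/usr/local/cuda

# Run the sample (requires an NVIDIA GPU with CUPTI available)
./checkpoint_kernels
```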
Advanced Checkpoint Techniques
Checkpoint Optimization
// Enable optimizations for better performance
CUpti_Checkpoint cp = { CUpti_Checkpoint_STRUCT_SIZE };
cp.ctx = context;
cp.optimizations = 1; // Enable all optimizations
// Alternative: Disable optimizations for debugging
cp.optimizations = 0; // Slower but more thorough
Selective Memory Checkpointing
class SelectiveCheckpoint {
private:
    std::vector<CUpti_Checkpoint> checkpoints;
    std::vector<void*> criticalPointers;

public:
    void addCriticalMemory(void* ptr, size_t size) {
        criticalPointers.push_back(ptr);
        // Configure checkpoint for specific memory regions
    }

    void saveSelectiveState() {
        for (auto& cp : checkpoints) {
            CUPTI_API_CALL(cuptiCheckpointSave(&cp));
        }
    }

    void restoreSelectiveState() {
        for (auto& cp : checkpoints) {
            CUPTI_API_CALL(cuptiCheckpointRestore(&cp));
        }
    }
};
Checkpoint-Based Debugging
class CheckpointDebugger {
private:
    CUpti_Checkpoint debugCheckpoint;
    std::vector<float> expectedResults;

public:
    void setDebugPoint(CUcontext context) {
        debugCheckpoint = { CUpti_Checkpoint_STRUCT_SIZE };
        debugCheckpoint.ctx = context;
        debugCheckpoint.optimizations = 0; // Full state capture
        CUPTI_API_CALL(cuptiCheckpointSave(&debugCheckpoint));
    }

    bool validateReproducibility(KernelFunction kernel, void* args) {
        // Run kernel multiple times and verify identical results
        std::vector<float> results;
        for (int run = 0; run < 5; run++) {
            if (run > 0) {
                CUPTI_API_CALL(cuptiCheckpointRestore(&debugCheckpoint));
            }
            kernel(args);
            float result = extractResult(args);
            results.push_back(result);
            if (run > 0 && results[run] != results[0]) {
                printf("Non-deterministic behavior detected at run %d\n", run);
                return false;
            }
        }
        printf("Kernel behavior is reproducible\n");
        return true;
    }
};
Performance Considerations
Checkpoint Overhead
class CheckpointProfiler {
public:
    struct ProfileData {
        double saveTime;
        double restoreTime;
        size_t memorySize;
        double overhead;
    };

    ProfileData profileCheckpoint(CUpti_Checkpoint& cp) {
        ProfileData profile;

        // Measure save time
        auto start = std::chrono::high_resolution_clock::now();
        CUPTI_API_CALL(cuptiCheckpointSave(&cp));
        auto end = std::chrono::high_resolution_clock::now();
        profile.saveTime = std::chrono::duration<double>(end - start).count();

        // Measure restore time
        start = std::chrono::high_resolution_clock::now();
        CUPTI_API_CALL(cuptiCheckpointRestore(&cp));
        end = std::chrono::high_resolution_clock::now();
        profile.restoreTime = std::chrono::duration<double>(end - start).count();

        // Estimate memory size (implementation-dependent)
        profile.memorySize = estimateCheckpointSize(cp);

        // Total checkpoint cost in seconds; divide by the wrapped kernel's
        // runtime to express it as a percentage overhead
        profile.overhead = profile.saveTime + profile.restoreTime;
        return profile;
    }
};
Optimization Strategies
- Minimize Checkpoint Frequency: Only checkpoint when necessary
- Use Selective Checkpointing: Only save critical memory regions
- Enable Optimizations: Use cp.optimizations = 1 for better performance
- Batch Operations: Group multiple kernel calls between checkpoints
Real-World Use Cases
Scientific Computing
class IterativeSolver {
private:
    CUpti_Checkpoint convergenceCheckpoint;
    float* deviceVector;

public:
    explicit IterativeSolver(CUcontext context) {
        convergenceCheckpoint = { CUpti_Checkpoint_STRUCT_SIZE };
        convergenceCheckpoint.ctx = context;
        convergenceCheckpoint.optimizations = 1;
    }

    void solveProblem() {
        // Save initial state for potential restart
        CUPTI_API_CALL(cuptiCheckpointSave(&convergenceCheckpoint));

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Run potentially destructive iteration
            iterativeStep<<<blocks, threads>>>(deviceVector, size);

            if (checkConvergence(deviceVector)) {
                printf("Converged after %d iterations\n", iteration);
                break;
            }

            // If diverging, restart from checkpoint
            if (isDiverging(deviceVector)) {
                printf("Restarting from checkpoint\n");
                CUPTI_API_CALL(cuptiCheckpointRestore(&convergenceCheckpoint));
                adjustParameters(); // Try different parameters
            }
        }
    }
};
Machine Learning Training
class TrainingCheckpoint {
private:
    CUpti_Checkpoint epochCheckpoint;
    std::vector<float> lossHistory;

public:
    explicit TrainingCheckpoint(CUcontext context) {
        epochCheckpoint = { CUpti_Checkpoint_STRUCT_SIZE };
        epochCheckpoint.ctx = context;
        epochCheckpoint.optimizations = 1;
    }

    void trainWithCheckpoints(Model& model, Dataset& data) {
        for (int epoch = 0; epoch < totalEpochs; epoch++) {
            // Save state at the beginning of the epoch
            CUPTI_API_CALL(cuptiCheckpointSave(&epochCheckpoint));

            // Train one epoch
            float loss = trainEpoch(model, data);
            lossHistory.push_back(loss);

            // If loss exploded, restore and try a different learning rate
            if (loss > explosionThreshold) {
                printf("Loss explosion detected, restoring checkpoint\n");
                CUPTI_API_CALL(cuptiCheckpointRestore(&epochCheckpoint));
                model.reduceLearningRate();
                lossHistory.pop_back(); // Remove bad result
                epoch--; // Retry this epoch
            }
        }
    }
};
Error Handling and Best Practices
Robust Checkpoint Management
class CheckpointManager {
private:
    // std::deque keeps pointers to elements stable across push_back; a
    // std::vector could reallocate and invalidate the pointers handed out
    // by createCheckpoint()
    std::deque<CUpti_Checkpoint> activeCheckpoints;

public:
    ~CheckpointManager() {
        // Ensure all checkpoints are cleaned up
        for (auto& cp : activeCheckpoints) {
            CUPTI_API_CALL(cuptiCheckpointFree(&cp));
        }
    }

    CUpti_Checkpoint* createCheckpoint(CUcontext context) {
        CUpti_Checkpoint cp = { CUpti_Checkpoint_STRUCT_SIZE };
        cp.ctx = context;
        cp.optimizations = 1;
        activeCheckpoints.push_back(cp);
        return &activeCheckpoints.back();
    }

    void validateCheckpoint(const CUpti_Checkpoint& cp) {
        // Verify checkpoint is valid before use
        if (cp.ctx == nullptr) {
            throw std::runtime_error("Invalid checkpoint context");
        }
        // Additional validation logic
    }
};
Common Pitfalls to Avoid
- Forgetting to Free Checkpoints: Always call cuptiCheckpointFree() when a checkpoint is no longer needed
- Context Mismatches: Ensure the checkpoint's context matches the current context
- Incomplete State Capture: Some GPU state may not be captured
- Performance Impact: Checkpoints have overhead; use them judiciously
- Memory Pressure: Large checkpoints can consume significant memory
Integration with Testing Frameworks
Unit Testing with Checkpoints
class CheckpointTest {
public:
    void testKernelReproducibility() {
        // Setup test data
        float* testData = setupTestData();

        // Create checkpoint
        CUpti_Checkpoint cp = createCheckpoint();

        // Run test multiple times
        for (int run = 0; run < 10; run++) {
            if (run == 0) {
                CUPTI_API_CALL(cuptiCheckpointSave(&cp));
            } else {
                CUPTI_API_CALL(cuptiCheckpointRestore(&cp));
            }
            testKernel<<<1, 64>>>(testData, size);

            // Verify result consistency
            float result = extractResult(testData);
            ASSERT_EQ(result, expectedResult);
        }
        CUPTI_API_CALL(cuptiCheckpointFree(&cp));
    }
};
Next Steps
- Experiment with different types of kernels to understand when checkpoints are needed
- Implement checkpoint-based debugging in your own applications
- Explore checkpoint optimizations for your specific use cases
- Combine checkpoints with other CUPTI profiling tools for comprehensive analysis
- Consider integrating checkpoint validation into your testing workflow
The checkpoint API is a powerful tool for ensuring reproducible GPU computations and can significantly improve the reliability of CUDA applications that modify their input data.