CUDA Memory Tracing Tutorial
The GitHub repo and complete tutorial are available at https://github.com/eunomia-bpf/cupti-tutorial.
Introduction
The CUDA Memory Tracing sample demonstrates how to track and analyze memory operations in CUDA applications using CUPTI's activity tracing capabilities. This tutorial focuses specifically on memory management, transfer patterns, and memory usage optimization through detailed tracing and analysis.
What You'll Learn
- How to trace all types of CUDA memory operations
- How to identify memory transfer patterns and bottlenecks
- How to analyze memory allocation and deallocation patterns
- How to detect memory leaks and usage inefficiencies
- How to optimize memory bandwidth utilization
Understanding CUDA Memory Operations
CUDA applications involve various types of memory operations:
- Memory Allocation: cudaMalloc, cudaMallocPitch, cudaMallocManaged
- Memory Transfer: cudaMemcpy, cudaMemcpyAsync, peer-to-peer transfers
- Memory Mapping: cudaHostAlloc, cudaHostRegister
- Unified Memory: cudaMallocManaged, automatic migration
- Memory Deallocation: cudaFree, cudaFreeHost
Key Concepts
Memory Domains
Device Memory
- Global memory on GPU
- Texture and surface memory
- Constant memory
- Local memory (per-thread; resides in device memory, unlike on-chip registers and shared memory)
Host Memory
- Pageable system memory
- Pinned (page-locked) memory
- Unified memory regions
Transfer Types
- Host-to-Device (H2D)
- Device-to-Host (D2H)
- Device-to-Device (D2D)
- Peer-to-Peer (P2P)
Building the Sample
Prerequisites
- CUDA Toolkit with CUPTI
- Application with diverse memory operations
- Sufficient memory for tracing buffers
Build Process
This creates the cuda_memory_trace executable, which demonstrates memory operation tracing.
Code Architecture
Memory Activity Tracking
class MemoryTracer {
private:
struct MemoryActivity {
CUpti_ActivityKind kind;
uint64_t start;
uint64_t end;
size_t bytes;
CUdeviceptr srcPtr;
CUdeviceptr dstPtr;
int srcDevice;
int dstDevice;
cudaMemcpyKind copyKind;
};
std::vector<MemoryActivity> activities;
std::map<CUdeviceptr, AllocationInfo> allocations;
public:
void processActivity(CUpti_Activity* record);
void analyzeMemoryPatterns();
void generateMemoryReport();
};
Memory Allocation Tracking
class AllocationTracker {
private:
struct AllocationInfo {
size_t size;
uint64_t allocTime;
uint64_t freeTime;
bool isActive;
std::string allocationType;
};
std::map<void*, AllocationInfo> hostAllocations;
std::map<CUdeviceptr, AllocationInfo> deviceAllocations;
size_t peakMemoryUsage;
size_t currentMemoryUsage;
public:
void recordAllocation(void* ptr, size_t size, const std::string& type, uint64_t timestamp);
void recordDeallocation(void* ptr, uint64_t timestamp);
void detectMemoryLeaks();
void calculateMemoryStatistics();
};
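The class above declares recordAllocation and recordDeallocation without showing their bodies. The following is a minimal sketch of how current and peak usage could be maintained. SimpleAllocationTracker, AllocationRecord, and the use of std::uintptr_t as the address key are assumptions made for illustration; they are not taken from the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the tutorial's AllocationInfo.
struct AllocationRecord {
    std::size_t size = 0;
    std::uint64_t allocTime = 0;
    std::uint64_t freeTime = 0;
    bool isActive = false;
    std::string allocationType;
};

// Hypothetical tracker: keeps one record per address and running totals.
class SimpleAllocationTracker {
public:
    void recordAllocation(std::uintptr_t ptr, std::size_t size,
                          const std::string& type, std::uint64_t timestamp) {
        AllocationRecord& rec = allocations_[ptr];
        if (rec.isActive) {
            // Same address reported twice without a free: drop the stale size.
            assert(current_ >= rec.size);
            current_ -= rec.size;
        }
        rec.size = size;
        rec.allocTime = timestamp;
        rec.freeTime = 0;
        rec.isActive = true;
        rec.allocationType = type;
        current_ += size;
        if (current_ > peak_) peak_ = current_;
    }

    // Returns false when ptr was never recorded or has already been freed.
    bool recordDeallocation(std::uintptr_t ptr, std::uint64_t timestamp) {
        auto it = allocations_.find(ptr);
        if (it == allocations_.end() || !it->second.isActive) return false;
        it->second.isActive = false;
        it->second.freeTime = timestamp;
        assert(current_ >= it->second.size);
        current_ -= it->second.size;
        return true;
    }

    std::size_t currentUsage() const { return current_; }
    std::size_t peakUsage() const { return peak_; }

private:
    std::map<std::uintptr_t, AllocationRecord> allocations_;
    std::size_t current_ = 0;
    std::size_t peak_ = 0;
};
```

For example, recording allocations of 256 and 512 bytes and then freeing the first leaves currentUsage() at 512 while peakUsage() stays at 768. Returning false from recordDeallocation makes double frees and frees of unknown pointers visible to the caller.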
Running the Sample
Basic Execution
Sample Output
=== CUDA Memory Tracing Analysis ===
Memory Allocation Summary:
Total Device Allocations: 1,024 MB
Total Host Allocations: 512 MB
Peak Memory Usage: 1,536 MB
Active Allocations: 768 MB
Memory Leaks Detected: 0
Memory Transfer Analysis:
Host-to-Device: 2,048 MB (avg: 8.5 GB/s)
Device-to-Host: 1,024 MB (avg: 7.2 GB/s)
Device-to-Device: 512 MB (avg: 450 GB/s)
Peer-to-Peer: 256 MB (avg: 28.5 GB/s)
Transfer Patterns:
Sequential Transfers: 75.5%
Concurrent Transfers: 24.5%
Optimal Coalescing: 89.3%
Bandwidth Efficiency: 85.7%
Memory Hotspots:
Large Transfers (>100MB): 12 transfers, 2.1 GB total
Frequent Small Transfers (<1MB): 847 transfers, 45 MB total
Redundant Transfers: 23 transfers, 128 MB total
Performance Issues:
- Uncoalesced transfers detected: 18 instances
- Memory fragmentation: 12.3% overhead
- Synchronous transfers on default stream: 156 instances
Advanced Memory Analysis
Bandwidth Analysis
class BandwidthAnalyzer {
private:
struct TransferMetrics {
double achievedBandwidth;
double theoreticalBandwidth;
double efficiency;
size_t transferSize;
cudaMemcpyKind direction;
};
std::vector<TransferMetrics> transferHistory;
public:
void analyzeTransfer(const MemoryActivity& activity) {
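// Timestamps are in nanoseconds. Skip records without a positive duration
// so the bandwidth calculation never divides by zero.
if (activity.end <= activity.start) return;
double duration = (activity.end - activity.start) * 1e-9; // Convert to seconds
double achievedBandwidth = activity.bytes / duration / 1e9; // GB/s
TransferMetrics metrics;
metrics.achievedBandwidth = achievedBandwidth;
metrics.theoreticalBandwidth = getTheoreticalBandwidth(activity.copyKind);
// Guard against an unknown (zero) theoretical bandwidth
metrics.efficiency = metrics.theoreticalBandwidth > 0.0
? achievedBandwidth / metrics.theoreticalBandwidth : 0.0;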
metrics.transferSize = activity.bytes;
metrics.direction = activity.copyKind;
transferHistory.push_back(metrics);
}
void generateBandwidthReport() {
std::map<cudaMemcpyKind, std::vector<double>> bandwidthByType;
for (const auto& metrics : transferHistory) {
bandwidthByType[metrics.direction].push_back(metrics.achievedBandwidth);
}
for (const auto& [direction, bandwidths] : bandwidthByType) {
double avgBandwidth = std::accumulate(bandwidths.begin(), bandwidths.end(), 0.0) / bandwidths.size();
std::cout << "Average bandwidth for " << getDirectionName(direction)
<< ": " << avgBandwidth << " GB/s" << std::endl;
}
}
};
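The bandwidth arithmetic used by analyzeTransfer can be isolated into two small functions. This is a sketch under the assumption that timestamps are in nanoseconds and that 1 GB means 1e9 bytes; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Achieved bandwidth in GB/s (1 GB = 1e9 bytes).
// One byte per nanosecond equals 1e9 bytes per second, so bytes / ns is GB/s.
// Returns 0.0 when the timestamps do not describe a positive duration.
double achievedBandwidthGBps(std::size_t bytes, std::uint64_t startNs,
                             std::uint64_t endNs) {
    if (endNs <= startNs) return 0.0;
    return static_cast<double>(bytes) / static_cast<double>(endNs - startNs);
}

// Fraction of the theoretical bandwidth that was achieved.
// Returns 0.0 when the theoretical value is unknown (zero or negative).
double bandwidthEfficiency(double achievedGBps, double theoreticalGBps) {
    assert(achievedGBps >= 0.0);
    if (theoreticalGBps <= 0.0) return 0.0;
    return achievedGBps / theoreticalGBps;
}
```

As a worked example, a 268,435,456-byte (256 MiB) copy that takes 32,000,000 ns yields 268435456 / 32000000 = 8.388608 GB/s. Against an assumed theoretical limit of 16 GB/s, that is an efficiency of about 0.52.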
Memory Leak Detection
class MemoryLeakDetector {
private:
struct LeakInfo {
void* address;
size_t size;
uint64_t allocTime;
std::string allocationType;
std::string stackTrace;
};
std::vector<LeakInfo> detectedLeaks;
public:
void checkForLeaks(const AllocationTracker& tracker) {
for (const auto& [ptr, info] : tracker.getActiveAllocations()) {
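// An allocation that is still marked active but whose pointer is no longer
// usable is reported as a leak. isValidPointer is an application-supplied check.
if (info.isActive && !isValidPointer(ptr)) {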
LeakInfo leak;
leak.address = ptr;
leak.size = info.size;
leak.allocTime = info.allocTime;
leak.allocationType = info.allocationType;
leak.stackTrace = getStackTrace(info.allocTime);
detectedLeaks.push_back(leak);
}
}
}
void reportLeaks() {
if (detectedLeaks.empty()) {
std::cout << "No memory leaks detected!" << std::endl;
return;
}
std::cout << "Memory Leaks Detected:" << std::endl;
size_t totalLeaked = 0;
for (const auto& leak : detectedLeaks) {
std::cout << " Address: " << leak.address
<< ", Size: " << leak.size << " bytes"
<< ", Type: " << leak.allocationType << std::endl;
totalLeaked += leak.size;
}
std::cout << "Total leaked memory: " << totalLeaked << " bytes" << std::endl;
}
};
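A simpler way to find leaks, when the full sequence of allocation and free events is available, is to replay the events and report whatever is still live at the end. The sketch below assumes a hypothetical MemEvent record; it is not the sample's data structure and does not capture stack traces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical event record: one entry per allocation or free.
struct MemEvent {
    bool isAlloc;            // true for an allocation, false for a free
    std::uintptr_t address;  // address returned by or passed to the API
    std::size_t size;        // allocation size in bytes (ignored for frees)
    std::uint64_t timestamp; // event time in nanoseconds
};

struct Leak {
    std::uintptr_t address;
    std::size_t size;
    std::uint64_t allocTime;
};

// Replays the events in order and returns every allocation without a matching free.
std::vector<Leak> findLeaks(const std::vector<MemEvent>& events) {
    std::map<std::uintptr_t, Leak> live;
    for (const auto& e : events) {
        if (e.isAlloc) {
            live[e.address] = Leak{e.address, e.size, e.timestamp};
        } else {
            live.erase(e.address); // freeing an unknown address is ignored here
        }
    }
    std::vector<Leak> leaks;
    for (const auto& entry : live) {
        leaks.push_back(entry.second);
    }
    return leaks;
}
```

Because the map is keyed by address, an address that is freed and later reused is handled correctly: only the most recent unfreed allocation at that address is reported.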
Memory Access Pattern Analysis
class AccessPatternAnalyzer {
private:
struct AccessPattern {
CUdeviceptr baseAddress;
size_t stride;
size_t accessCount;
bool isCoalesced;
double coalescingEfficiency;
};
public:
void analyzeMemoryAccess(const std::vector<MemoryActivity>& activities) {
std::map<CUdeviceptr, std::vector<size_t>> accessOffsets;
// Group accesses by base address
for (const auto& activity : activities) {
if (activity.kind == CUPTI_ACTIVITY_KIND_MEMCPY) {
size_t offset = activity.dstPtr - getBaseAddress(activity.dstPtr);
accessOffsets[getBaseAddress(activity.dstPtr)].push_back(offset);
}
}
// Analyze patterns for each memory region
for (const auto& [baseAddr, offsets] : accessOffsets) {
AccessPattern pattern = analyzePattern(baseAddr, offsets);
if (!pattern.isCoalesced) {
std::cout << "Warning: Uncoalesced memory access detected at "
<< std::hex << baseAddr << std::dec
<< " (efficiency: " << pattern.coalescingEfficiency * 100 << "%)" << std::endl;
}
}
}
};
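The tutorial does not show analyzePattern. One possible ingredient is a check for a constant stride between successive offsets, sketched below. StrideResult and detectStride are hypothetical names, and a regular stride in copy destinations is only a rough proxy: it says nothing definitive about how kernels coalesce their accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for the stride check.
struct StrideResult {
    bool isRegular;     // true when every step between offsets is identical
    std::size_t stride; // the common step in bytes, 0 when not regular
};

// Checks whether offsets (in access order) increase by one constant step.
StrideResult detectStride(const std::vector<std::size_t>& offsets) {
    if (offsets.size() < 2 || offsets[1] <= offsets[0]) {
        return StrideResult{false, 0};
    }
    const std::size_t stride = offsets[1] - offsets[0];
    for (std::size_t i = 2; i < offsets.size(); ++i) {
        if (offsets[i] <= offsets[i - 1] ||
            offsets[i] - offsets[i - 1] != stride) {
            return StrideResult{false, 0};
        }
    }
    return StrideResult{true, stride};
}
```

Offsets of 0, 256, 512, 768 are reported as regular with a stride of 256, while 0, 256, 300 are reported as irregular.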
Memory Optimization Insights
Transfer Optimization
class TransferOptimizer {
public:
struct OptimizationSuggestion {
std::string issue;
std::string suggestion;
double potentialSpeedup;
};
std::vector<OptimizationSuggestion> analyzeTransfers(const std::vector<MemoryActivity>& activities) {
std::vector<OptimizationSuggestion> suggestions;
// Check for small, frequent transfers
int smallTransferCount = 0;
size_t totalSmallBytes = 0;
for (const auto& activity : activities) {
if (activity.bytes < 1024) { // Less than 1KB
smallTransferCount++;
totalSmallBytes += activity.bytes;
}
}
if (smallTransferCount > 100) {
OptimizationSuggestion suggestion;
suggestion.issue = "Many small memory transfers detected";
suggestion.suggestion = "Consider batching small transfers or using unified memory";
suggestion.potentialSpeedup = estimateSpeedup(smallTransferCount, totalSmallBytes);
suggestions.push_back(suggestion);
}
// Check for synchronous transfers
int syncTransferCount = 0;
for (const auto& activity : activities) {
if (isSynchronousTransfer(activity)) {
syncTransferCount++;
}
}
if (syncTransferCount > 50) {
OptimizationSuggestion suggestion;
suggestion.issue = "Many synchronous memory transfers";
suggestion.suggestion = "Use asynchronous transfers with streams for better overlap";
suggestion.potentialSpeedup = 1.2 + (syncTransferCount * 0.01);
suggestions.push_back(suggestion);
}
return suggestions;
}
};
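estimateSpeedup is called above but not defined in the tutorial. A simple cost model, assumed here purely for illustration, charges each transfer a fixed launch latency plus the time to move its bytes; batching N transfers into one then saves N - 1 latencies. The function name, the latency parameter, and the bandwidth parameter are all hypothetical and would need to be measured on the target system.

```cpp
#include <cassert>
#include <cstddef>

// Estimated speedup from merging transferCount small copies into a single copy.
// latencyUs:     assumed fixed per-transfer overhead in microseconds
// bandwidthGBps: assumed sustained bandwidth in GB/s (1 GB = 1e9 bytes)
double estimateBatchingSpeedup(int transferCount, std::size_t totalBytes,
                               double latencyUs, double bandwidthGBps) {
    if (transferCount <= 0 || bandwidthGBps <= 0.0 || latencyUs < 0.0) {
        return 1.0; // no meaningful estimate
    }
    // 1 GB/s equals 1e3 bytes per microsecond.
    const double copyUs = static_cast<double>(totalBytes) / (bandwidthGBps * 1e3);
    const double unbatchedUs = transferCount * latencyUs + copyUs;
    const double batchedUs = latencyUs + copyUs;
    if (batchedUs <= 0.0) return 1.0;
    return unbatchedUs / batchedUs;
}
```

With 100 transfers totalling 100,000 bytes, an assumed latency of 10 microseconds, and an assumed bandwidth of 10 GB/s, the copy time is 10 microseconds, the unbatched cost is 1010 microseconds, the batched cost is 20 microseconds, and the estimate is 50.5x. The model ignores overlap between transfers and any cost of staging data into one buffer, so it is an upper bound at best.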
Memory Pool Analysis
class MemoryPoolAnalyzer {
private:
struct PoolStatistics {
size_t totalAllocated;
size_t peakUsage;
size_t fragmentationWaste;
double utilizationEfficiency;
int allocationCount;
int deallocationCount;
};
public:
PoolStatistics analyzeMemoryPool(const std::vector<AllocationInfo>& allocations) {
PoolStatistics stats = {};
// Calculate fragmentation
std::map<size_t, int> sizeBuckets;
for (const auto& alloc : allocations) {
size_t bucket = roundToPowerOfTwo(alloc.size);
sizeBuckets[bucket]++;
stats.totalAllocated += alloc.size;
}
// Estimate fragmentation waste
for (const auto& [bucketSize, count] : sizeBuckets) {
size_t avgWaste = bucketSize / 4; // Estimate internal fragmentation
stats.fragmentationWaste += avgWaste * count;
}
// Avoid dividing by zero when no allocations were recorded
stats.utilizationEfficiency = stats.totalAllocated > 0
? 1.0 - (double(stats.fragmentationWaste) / stats.totalAllocated)
: 1.0;
return stats;
}
};
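roundToPowerOfTwo is used above without a definition, and the waste is estimated as a quarter of each bucket. If allocation sizes are known, the internal fragmentation of a power-of-two bucket scheme can instead be computed exactly, as in this sketch. It assumes a hypothetical allocator that rounds every request up to the next power of two; the CUDA driver's real allocation granularity is different and is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Smallest power of two that is >= n (returns 1 for n == 0).
std::size_t roundUpToPowerOfTwo(std::size_t n) {
    // Larger inputs would overflow the shift below.
    assert(n <= (std::numeric_limits<std::size_t>::max() >> 1) + 1);
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Bytes reserved but not requested, summed over all allocations.
std::size_t internalWaste(const std::vector<std::size_t>& sizes) {
    std::size_t waste = 0;
    for (std::size_t s : sizes) {
        waste += roundUpToPowerOfTwo(s) - s;
    }
    return waste;
}

// Requested bytes divided by reserved bytes; 1.0 when nothing was allocated.
double utilization(const std::vector<std::size_t>& sizes) {
    std::size_t requested = 0;
    std::size_t reserved = 0;
    for (std::size_t s : sizes) {
        requested += s;
        reserved += roundUpToPowerOfTwo(s);
    }
    if (reserved == 0) return 1.0;
    return static_cast<double>(requested) / static_cast<double>(reserved);
}
```

For requests of 1000 and 3000 bytes, the buckets are 1024 and 4096 bytes, so the waste is 24 + 1096 = 1120 bytes and the utilization is 4000 / 5120, or about 0.78.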
Real-World Applications
Deep Learning Memory Profiling
class DLMemoryProfiler {
public:
void profileTrainingStep(const std::vector<MemoryActivity>& activities) {
std::map<std::string, size_t> phaseMemory;
// Categorize memory operations by training phase
for (const auto& activity : activities) {
std::string phase = classifyTrainingPhase(activity);
phaseMemory[phase] += activity.bytes;
}
std::cout << "Memory usage by training phase:" << std::endl;
for (const auto& [phase, bytes] : phaseMemory) {
std::cout << " " << phase << ": " << bytes / (1024*1024) << " MB" << std::endl;
}
// Detect gradient accumulation patterns
detectGradientAccumulation(activities);
// Analyze batch size impact
analyzeBatchSizeEfficiency(activities);
}
private:
std::string classifyTrainingPhase(const MemoryActivity& activity) {
// Use heuristics to classify memory operations
if (activity.copyKind == cudaMemcpyHostToDevice) {
return "Data Loading";
} else if (isGradientOperation(activity)) {
return "Gradient Computation";
} else if (isWeightUpdate(activity)) {
return "Parameter Update";
} else {
return "Forward Pass";
}
}
};
Scientific Computing Memory Analysis
class ScientificMemoryAnalyzer {
public:
void analyzeComputationPattern(const std::vector<MemoryActivity>& activities) {
// Detect stencil computation patterns
detectStencilPatterns(activities);
// Analyze temporal locality
analyzeTemporalLocality(activities);
// Check for memory streaming patterns
analyzeStreamingPatterns(activities);
// Evaluate cache efficiency
evaluateCacheEfficiency(activities);
}
private:
void detectStencilPatterns(const std::vector<MemoryActivity>& activities) {
// Look for regular access patterns characteristic of stencil computations
std::map<CUdeviceptr, std::vector<size_t>> accessSequences;
for (const auto& activity : activities) {
CUdeviceptr baseAddr = getBaseAddress(activity.srcPtr);
size_t offset = activity.srcPtr - baseAddr;
accessSequences[baseAddr].push_back(offset);
}
for (const auto& [baseAddr, sequence] : accessSequences) {
if (isStencilPattern(sequence)) {
std::cout << "Stencil pattern detected at " << std::hex << baseAddr << std::dec << std::endl;
suggestStencilOptimizations(sequence);
}
}
}
};
Integration with Performance Tools
NVIDIA Nsight Integration
class NsightIntegration {
public:
void exportMemoryTrace(const std::vector<MemoryActivity>& activities, const std::string& filename) {
// Export as CSV so the trace can be inspected alongside Nsight Systems/Compute results
std::ofstream file(filename);
file << "timestamp,operation,size,bandwidth,efficiency\n";
for (const auto& activity : activities) {
double bandwidth = calculateBandwidth(activity);
double efficiency = calculateEfficiency(activity);
file << activity.start << ","
<< getOperationName(activity.kind) << ","
<< activity.bytes << ","
<< bandwidth << ","
<< efficiency << "\n";
}
}
};
Custom Visualization
class MemoryVisualizer {
public:
void generateTimelineChart(const std::vector<MemoryActivity>& activities) {
// Generate data for memory timeline visualization.
// The "ph"/"ts"/"dur" fields follow the Chrome Trace Event Format, whose
// JSON object form expects the event list under the "traceEvents" key.
json timeline;
timeline["traceEvents"] = json::array();
for (const auto& activity : activities) {
json event;
event["name"] = getOperationName(activity.kind);
event["cat"] = "memory";
event["ph"] = "X"; // Complete event
event["ts"] = activity.start / 1000; // Nanoseconds to microseconds
event["dur"] = (activity.end - activity.start) / 1000;
event["args"]["size"] = activity.bytes;
event["args"]["bandwidth"] = calculateBandwidth(activity);
timeline["traceEvents"].push_back(event);
}
std::ofstream file("memory_timeline.json");
file << timeline.dump(2);
}
void generateMemoryMap(const AllocationTracker& tracker) {
// Create visual representation of memory layout
auto allocations = tracker.getActiveAllocations();
std::cout << "Memory Map:" << std::endl;
std::cout << "Address Range | Size | Type" << std::endl;
std::cout << "----------------------|----------|----------" << std::endl;
for (const auto& [ptr, info] : allocations) {
std::cout << std::hex << ptr << "-" << (ptr + info.size) << std::dec
<< " | " << formatSize(info.size)
<< " | " << info.allocationType << std::endl;
}
}
};
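generateMemoryMap relies on a formatSize helper that the tutorial does not define. A minimal version might look like the sketch below, which assumes binary units (1 KB = 1024 bytes) and one decimal place; both choices are assumptions, not the sample's behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a byte count with a binary unit suffix, e.g. 1536 -> "1.5 KB".
std::string formatSize(std::size_t bytes) {
    static const char* const kUnits[] = {"B", "KB", "MB", "GB", "TB"};
    const int kUnitCount = 5;
    double value = static_cast<double>(bytes);
    int unit = 0;
    while (value >= 1024.0 && unit < kUnitCount - 1) {
        value /= 1024.0;
        ++unit;
    }
    assert(unit < kUnitCount);
    char buf[32];
    if (unit == 0) {
        std::snprintf(buf, sizeof buf, "%.0f %s", value, kUnits[unit]);
    } else {
        std::snprintf(buf, sizeof buf, "%.1f %s", value, kUnits[unit]);
    }
    return std::string(buf);
}
```

With these choices, 512 is rendered as "512 B", 1536 as "1.5 KB", and 1048576 as "1.0 MB".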
Troubleshooting Memory Issues
Common Memory Problems
- Memory Leaks: Unfreed allocations
- Fragmentation: Inefficient memory usage
- Bandwidth Underutilization: Poor transfer patterns
- Excessive Synchronization: Blocking memory operations
Debug Strategies
class MemoryDebugger {
public:
void validateMemoryOperations(const std::vector<MemoryActivity>& activities) {
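// Check for invalid memory accesses.
// validPointers must be filled with known allocation addresses (for example
// from AllocationTracker) before the loop; left empty, every copy is flagged.
std::set<CUdeviceptr> validPointers;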
for (const auto& activity : activities) {
if (activity.kind == CUPTI_ACTIVITY_KIND_MEMCPY) {
if (validPointers.find(activity.srcPtr) == validPointers.end() &&
validPointers.find(activity.dstPtr) == validPointers.end()) {
std::cerr << "Warning: Memory operation on potentially invalid pointer" << std::endl;
}
}
}
}
void checkMemoryAlignment(const std::vector<MemoryActivity>& activities) {
for (const auto& activity : activities) {
if (activity.srcPtr % 256 != 0 || activity.dstPtr % 256 != 0) {
std::cout << "Warning: Unaligned memory access detected" << std::endl;
}
}
}
};
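Exact-match lookups only recognize pointers that equal an allocation's base address, so a copy into the middle of a buffer would be flagged. A range-based lookup avoids this, and is also one way a helper such as getBaseAddress could be written. The sketch below is an assumption for illustration: AllocationMap and findContainingAllocation are hypothetical names, and addresses are modelled as std::uintptr_t rather than CUdeviceptr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical allocation table: base address -> size in bytes.
using AllocationMap = std::map<std::uintptr_t, std::size_t>;

// Returns true when addr lies inside some [base, base + size) range.
// If baseOut is not null, it receives the base address of that range.
bool findContainingAllocation(const AllocationMap& allocs, std::uintptr_t addr,
                              std::uintptr_t* baseOut) {
    // upper_bound finds the first base strictly greater than addr,
    // so the candidate range is the entry just before it.
    auto it = allocs.upper_bound(addr);
    if (it == allocs.begin()) return false;
    --it;
    if (addr - it->first >= it->second) return false;
    if (baseOut != nullptr) *baseOut = it->first;
    return true;
}
```

With allocations of 0x100 bytes at 0x1000 and 0x80 bytes at 0x2000, the address 0x1080 resolves to base 0x1000, while 0x1100 and 0x0fff are rejected. The lookup is logarithmic in the number of allocations.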
Next Steps
- Apply memory tracing to identify bottlenecks in your applications
- Experiment with different memory optimization strategies
- Integrate memory analysis into your development workflow
- Develop custom memory management patterns based on trace analysis
- Combine with other CUPTI features for comprehensive performance profiling