SASS Metrics Collection Tutorial
The GitHub repo and complete tutorial is available at https://github.com/eunomia-bpf/cupti-tutorial.
Introduction
The SASS Metrics Collection sample demonstrates how to collect detailed performance metrics at the SASS (Shader ASSembly) level using CUPTI. This tutorial shows you how to gather low-level performance data about GPU instruction execution, providing insights into the actual assembly code performance of your CUDA kernels.
What You'll Learn
- How to collect SASS-level performance metrics
- Understanding GPU instruction-level performance analysis
- Analyzing assembly code efficiency and bottlenecks
- Correlating high-level CUDA code with SASS instructions
- Optimizing kernels based on assembly-level insights
Understanding SASS Metrics
SASS (Shader ASSembly) metrics provide the lowest level of performance insight available through CUPTI:
- Instruction-Level Analysis: Performance of individual GPU assembly instructions
- Execution Efficiency: How efficiently instructions are executed by the hardware
- Resource Utilization: How well kernels use available GPU execution units
- Pipeline Analysis: Understanding instruction pipeline behavior
- Bottleneck Identification: Pinpointing specific instruction-level performance issues
Key SASS Metrics
Instruction Execution Metrics
- inst_executed: Total instructions executed
- inst_issued: Instructions issued to execution units
- inst_fp_32/64: Floating-point instruction counts
- inst_integer: Integer instruction counts
- inst_bit_convert: Bit manipulation instructions
- inst_control: Control flow instructions
Memory Instruction Metrics
- inst_compute_ld_st: Load/store compute instructions
- global_load/store: Global memory access instructions
- shared_load/store: Shared memory access instructions
- local_load/store: Local memory access instructions
Execution Efficiency Metrics
- thread_inst_executed: Instructions executed per thread
- warp_execution_efficiency: Percentage of active threads in warps
- branch_efficiency: Percentage of non-divergent branches
Building the Sample
Prerequisites
- CUDA Toolkit with CUPTI
- GPU with compute capability 5.0 or higher
- Access to detailed GPU performance counters
Build Process
This creates the sass_metrics
executable for collecting SASS-level performance data.
Running the Sample
Basic Execution
Sample Output
=== SASS Metrics Analysis ===
Kernel: vectorAdd
SASS Instruction Analysis:
Total Instructions: 2,048,576
FP32 Instructions: 1,024,288 (50.0%)
Integer Instructions: 512,144 (25.0%)
Memory Instructions: 409,716 (20.0%)
Control Instructions: 102,428 (5.0%)
Execution Efficiency:
Warp Execution Efficiency: 87.5%
Branch Efficiency: 94.2%
Thread Instruction Efficiency: 91.8%
Memory Access Patterns:
Global Load Instructions: 204,858
Global Store Instructions: 204,858
Shared Memory Access: 0
Local Memory Access: 0
Pipeline Utilization:
ALU Pipeline: 85.3%
FPU Pipeline: 78.9%
Memory Pipeline: 92.1%
Control Pipeline: 15.2%
Performance Bottlenecks:
- Memory bandwidth limited: 45.2% of execution time
- ALU utilization could be improved
- No significant control divergence detected
Code Architecture
SASS Metrics Collector
class SASSMetricsCollector {
private:
struct SASSMetrics {
uint64_t totalInstructions;
uint64_t fp32Instructions;
uint64_t integerInstructions;
uint64_t memoryInstructions;
uint64_t controlInstructions;
double warpEfficiency;
double branchEfficiency;
double threadEfficiency;
};
std::map<std::string, SASSMetrics> kernelMetrics;
std::vector<CUpti_EventID> sassEventIds;
CUpti_EventGroup eventGroup;
public:
void setupSASSMetrics(CUcontext context, CUdevice device);
void startCollection();
void stopCollection();
void analyzeKernelSASS(const std::string& kernelName);
void generateSASSReport();
};
void SASSMetricsCollector::setupSASSMetrics(CUcontext context, CUdevice device) {
// Set up SASS-level event collection
CUPTI_CALL(cuptiEventGroupCreate(context, &eventGroup, 0));
// Add SASS instruction counting events
std::vector<std::string> sassEvents = {
"inst_executed",
"inst_fp_32",
"inst_integer",
"inst_compute_ld_st",
"inst_control",
"thread_inst_executed",
"warp_execution_efficiency",
"branch_efficiency"
};
for (const auto& eventName : sassEvents) {
CUpti_EventID eventId;
CUptiResult result = cuptiEventGetIdFromName(device, eventName.c_str(), &eventId);
if (result == CUPTI_SUCCESS) {
CUPTI_CALL(cuptiEventGroupAddEvent(eventGroup, eventId));
sassEventIds.push_back(eventId);
}
}
}
Assembly Code Analysis
class AssemblyAnalyzer {
private:
struct InstructionProfile {
std::string opcode;
uint64_t executionCount;
double averageLatency;
double throughput;
std::string category;
};
std::map<std::string, InstructionProfile> instructionProfiles;
public:
void analyzeInstructionMix(const SASSMetrics& metrics) {
std::cout << "Instruction Mix Analysis:" << std::endl;
uint64_t totalInsts = metrics.totalInstructions;
double fpPercentage = (double)metrics.fp32Instructions / totalInsts * 100;
double intPercentage = (double)metrics.integerInstructions / totalInsts * 100;
double memPercentage = (double)metrics.memoryInstructions / totalInsts * 100;
double ctrlPercentage = (double)metrics.controlInstructions / totalInsts * 100;
std::cout << " Floating Point: " << fpPercentage << "%" << std::endl;
std::cout << " Integer: " << intPercentage << "%" << std::endl;
std::cout << " Memory: " << memPercentage << "%" << std::endl;
std::cout << " Control: " << ctrlPercentage << "%" << std::endl;
// Analyze instruction balance
if (memPercentage > 40) {
std::cout << " WARNING: Memory-bound kernel detected" << std::endl;
}
if (ctrlPercentage > 20) {
std::cout << " WARNING: High control overhead detected" << std::endl;
}
}
void suggestOptimizations(const SASSMetrics& metrics) {
std::cout << "Optimization Suggestions:" << std::endl;
if (metrics.warpEfficiency < 0.8) {
std::cout << " - Consider optimizing warp utilization" << std::endl;
std::cout << " - Check for thread divergence patterns" << std::endl;
}
if (metrics.branchEfficiency < 0.9) {
std::cout << " - Optimize branch patterns to reduce divergence" << std::endl;
std::cout << " - Consider predicated execution for simple branches" << std::endl;
}
double memRatio = (double)metrics.memoryInstructions / metrics.totalInstructions;
if (memRatio > 0.3) {
std::cout << " - Consider memory access optimization" << std::endl;
std::cout << " - Evaluate shared memory usage opportunities" << std::endl;
}
}
};
Advanced SASS Analysis
Instruction Pipeline Analysis
class PipelineAnalyzer {
private:
struct PipelineMetrics {
double aluUtilization;
double fpuUtilization;
double memoryUtilization;
double controlUtilization;
std::vector<double> stallReasons;
};
public:
PipelineMetrics analyzePipelineUtilization(const SASSMetrics& metrics) {
PipelineMetrics pipelineMetrics;
// Calculate pipeline utilization based on instruction mix
uint64_t totalInsts = metrics.totalInstructions;
pipelineMetrics.aluUtilization =
(double)(metrics.integerInstructions) / totalInsts;
pipelineMetrics.fpuUtilization =
(double)(metrics.fp32Instructions) / totalInsts;
pipelineMetrics.memoryUtilization =
(double)(metrics.memoryInstructions) / totalInsts;
pipelineMetrics.controlUtilization =
(double)(metrics.controlInstructions) / totalInsts;
return pipelineMetrics;
}
void identifyBottlenecks(const PipelineMetrics& pipeline) {
std::cout << "Pipeline Bottleneck Analysis:" << std::endl;
double maxUtilization = std::max({
pipeline.aluUtilization,
pipeline.fpuUtilization,
pipeline.memoryUtilization,
pipeline.controlUtilization
});
if (maxUtilization == pipeline.memoryUtilization) {
std::cout << " Primary bottleneck: Memory pipeline" << std::endl;
std::cout << " Suggestions: Optimize memory access patterns, use shared memory" << std::endl;
} else if (maxUtilization == pipeline.fpuUtilization) {
std::cout << " Primary bottleneck: FP pipeline" << std::endl;
std::cout << " Suggestions: Balance compute with memory operations" << std::endl;
} else if (maxUtilization == pipeline.controlUtilization) {
std::cout << " Primary bottleneck: Control pipeline" << std::endl;
std::cout << " Suggestions: Reduce branch divergence, simplify control flow" << std::endl;
}
}
};
Performance Model Validation
class PerformanceModel {
private:
struct ModelParameters {
double peakALUThroughput;
double peakFPUThroughput;
double memoryBandwidth;
double instructionIssueRate;
};
ModelParameters getGPUParameters(int deviceId) {
ModelParameters params;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, deviceId);
// Estimate peak throughput based on GPU architecture
params.peakALUThroughput = prop.multiProcessorCount * prop.clockRate * 2; // Simplified
params.peakFPUThroughput = prop.multiProcessorCount * prop.clockRate;
params.memoryBandwidth = prop.memoryBusWidth * prop.memoryClockRate * 2 / 8; // GB/s
params.instructionIssueRate = prop.multiProcessorCount * prop.clockRate;
return params;
}
public:
void validatePerformanceModel(const SASSMetrics& metrics, int deviceId) {
ModelParameters params = getGPUParameters(deviceId);
// Calculate achieved vs theoretical performance
double achievedALUUtilization =
(double)metrics.integerInstructions / metrics.totalInstructions;
double achievedFPUUtilization =
(double)metrics.fp32Instructions / metrics.totalInstructions;
std::cout << "Performance Model Validation:" << std::endl;
std::cout << " ALU Utilization: " << achievedALUUtilization * 100 << "%" << std::endl;
std::cout << " FPU Utilization: " << achievedFPUUtilization * 100 << "%" << std::endl;
// Compare with theoretical limits
double efficiencyGap = 1.0 - std::max(achievedALUUtilization, achievedFPUUtilization);
if (efficiencyGap > 0.3) {
std::cout << " Significant efficiency gap detected: " << efficiencyGap * 100 << "%" << std::endl;
}
}
};
Real-World Applications
Matrix Multiplication Analysis
void analyzeMatrixMultiplication() {
SASSMetricsCollector collector;
AssemblyAnalyzer analyzer;
// Set up metrics collection
collector.setupSASSMetrics(context, device);
// Launch matrix multiplication kernel
collector.startCollection();
dim3 grid(32, 32);
dim3 block(16, 16);
matrixMulKernel<<<grid, block>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();
collector.stopCollection();
// Analyze results
SASSMetrics metrics = collector.getMetrics("matrixMulKernel");
analyzer.analyzeInstructionMix(metrics);
analyzer.suggestOptimizations(metrics);
std::cout << "Matrix Multiplication SASS Analysis:" << std::endl;
std::cout << " Compute Intensity: " << calculateComputeIntensity(metrics) << std::endl;
std::cout << " Memory Efficiency: " << calculateMemoryEfficiency(metrics) << std::endl;
}
Reduction Operation Profiling
void analyzeReductionKernel() {
SASSMetricsCollector collector;
collector.setupSASSMetrics(context, device);
collector.startCollection();
// Launch reduction kernel
reductionKernel<<<grid, block, sharedMemSize>>>(d_input, d_output, N);
cudaDeviceSynchronize();
collector.stopCollection();
SASSMetrics metrics = collector.getMetrics("reductionKernel");
std::cout << "Reduction Kernel Analysis:" << std::endl;
std::cout << " Shared Memory Usage Efficiency: " <<
analyzeSharedMemoryUsage(metrics) << std::endl;
std::cout << " Synchronization Overhead: " <<
analyzeSynchronizationOverhead(metrics) << std::endl;
std::cout << " Warp Divergence: " <<
(100.0 - metrics.branchEfficiency) << "%" << std::endl;
}
Integration with Development Tools
Compiler Optimization Analysis
class CompilerAnalyzer {
public:
void compareOptimizationLevels() {
std::vector<std::string> optimizationFlags = {"-O0", "-O1", "-O2", "-O3"};
for (const auto& flag : optimizationFlags) {
std::cout << "Testing with " << flag << ":" << std::endl;
// Recompile with different optimization level
recompileKernel(flag);
// Collect SASS metrics
SASSMetrics metrics = collectMetricsForOptimization(flag);
std::cout << " Instructions: " << metrics.totalInstructions << std::endl;
std::cout << " Efficiency: " << metrics.warpEfficiency << std::endl;
analyzeOptimizationImpact(metrics, flag);
}
}
private:
void analyzeOptimizationImpact(const SASSMetrics& metrics, const std::string& flag) {
// Analyze how compiler optimizations affect SASS metrics
if (flag == "-O3" && metrics.controlInstructions > baseline.controlInstructions) {
std::cout << " Warning: High optimization may have increased control overhead" << std::endl;
}
if (metrics.warpEfficiency < 0.8) {
std::cout << " Consider different optimization strategies" << std::endl;
}
}
};
Assembly Code Correlation
class CodeCorrelator {
public:
void correlateCUDAtoSASS(const std::string& cudaFile, const std::string& sassFile) {
// Parse CUDA source code
auto cudaLines = parseCUDASource(cudaFile);
// Parse SASS assembly
auto sassInstructions = parseSASSAssembly(sassFile);
// Correlate performance metrics with source lines
for (const auto& [lineNum, sassInsts] : sassInstructions) {
if (lineNum < cudaLines.size()) {
std::cout << "CUDA Line " << lineNum << ": " << cudaLines[lineNum] << std::endl;
std::cout << " SASS Instructions: " << sassInsts.size() << std::endl;
for (const auto& inst : sassInsts) {
std::cout << " " << inst.opcode << " (executed "
<< inst.executionCount << " times)" << std::endl;
}
// Identify performance hotspots
if (getTotalExecutionCount(sassInsts) > hotspotThreshold) {
std::cout << " >>> PERFORMANCE HOTSPOT DETECTED <<<" << std::endl;
}
}
}
}
};
Troubleshooting SASS Metrics
Common Issues
- Metric Availability: Not all GPUs support all SASS metrics
- Event Multiplexing: Some metrics cannot be collected simultaneously
- Precision Limitations: SASS counters may have limited precision
- Context Sensitivity: Metrics may vary with different execution contexts
Debug Strategies
class SASSDebugger {
public:
void validateMetricSupport(CUdevice device) {
uint32_t numEvents;
CUPTI_CALL(cuptiDeviceGetNumEventDomains(device, &numEvents));
std::cout << "Available event domains: " << numEvents << std::endl;
// Check specific SASS events
std::vector<std::string> requiredEvents = {
"inst_executed", "inst_fp_32", "warp_execution_efficiency"
};
for (const auto& eventName : requiredEvents) {
CUpti_EventID eventId;
CUptiResult result = cuptiEventGetIdFromName(device, eventName.c_str(), &eventId);
if (result == CUPTI_SUCCESS) {
std::cout << "✓ " << eventName << " supported" << std::endl;
} else {
std::cout << "✗ " << eventName << " not supported" << std::endl;
}
}
}
void checkEventGroupCompatibility(const std::vector<CUpti_EventID>& events, CUcontext context) {
// Test if events can be collected together
CUpti_EventGroup testGroup;
CUptiResult result = cuptiEventGroupCreate(context, &testGroup, 0);
for (auto eventId : events) {
result = cuptiEventGroupAddEvent(testGroup, eventId);
if (result != CUPTI_SUCCESS) {
std::cout << "Event " << eventId << " cannot be added to group" << std::endl;
}
}
cuptiEventGroupDestroy(testGroup);
}
};
Next Steps
- Apply SASS metrics analysis to understand your kernel performance at the lowest level
- Experiment with different compiler optimizations and measure their SASS-level impact
- Correlate SASS metrics with high-level algorithm performance
- Develop custom analysis tools for your specific instruction patterns
- Integrate SASS analysis into your optimization workflow for detailed performance tuning