SASS Metrics Collection Tutorial

The GitHub repo and complete tutorial is available at https://github.com/eunomia-bpf/cupti-tutorial.

Introduction

The SASS Metrics Collection sample demonstrates how to collect detailed performance metrics at the SASS (Shader ASSembly) level using CUPTI. This tutorial shows you how to gather low-level performance data about GPU instruction execution, providing insights into the actual assembly code performance of your CUDA kernels.

What You'll Learn

How to collect SASS-level performance metrics
Understanding GPU instruction-level performance analysis
Analyzing assembly code efficiency and bottlenecks
Correlating high-level CUDA code with SASS instructions
Optimizing kernels based on assembly-level insights

Understanding SASS Metrics

SASS (Shader ASSembly) metrics provide the lowest level of performance insight available through CUPTI:

Instruction-Level Analysis: Performance of individual GPU assembly instructions
Execution Efficiency: How efficiently instructions are executed by the hardware
Resource Utilization: How well kernels use available GPU execution units
Pipeline Analysis: Understanding instruction pipeline behavior
Bottleneck Identification: Pinpointing specific instruction-level performance issues

Key SASS Metrics

Instruction Execution Metrics

inst_executed: Total instructions executed
inst_issued: Instructions issued to execution units
inst_fp_32/64: Floating-point instruction counts
inst_integer: Integer instruction counts
inst_bit_convert: Bit manipulation instructions
inst_control: Control flow instructions

Memory Instruction Metrics

inst_compute_ld_st: Load/store compute instructions
global_load/store: Global memory access instructions
shared_load/store: Shared memory access instructions
local_load/store: Local memory access instructions

Execution Efficiency Metrics

thread_inst_executed: Instructions executed per thread
warp_execution_efficiency: Percentage of active threads in warps
branch_efficiency: Percentage of non-divergent branches

Building the Sample

Prerequisites

CUDA Toolkit with CUPTI
GPU with compute capability 5.0 or higher
Access to detailed GPU performance counters

Build Process

cd sass_metrics
make

This creates the sass_metrics executable for collecting SASS-level performance data.

Running the Sample

Basic Execution

./sass_metrics

Sample Output

evice Num: 0
Lazy Patching Enabled
Device Name: NVIDIA H100 MIG 1c.4g.48gb
Device compute capability: 9.0
Metric Name: smsp__sass_inst_executed, MetricID: 8913560818632243504, Metric Description: # of warp instructions executed
Enable SASS Patching
Launching VectorAdd

Module cubinCrc: 2917979105
Kernel Name: _Z9VectorAddPKiS0_Pii
metric Name: smsp__sass_inst_executed
                [Inst] pcOffset: 0x0    metricValue:    [0]: 1000
                [Inst] pcOffset: 0x10   metricValue:    [0]: 1000
                [Inst] pcOffset: 0x20   metricValue:    [0]: 1000
                [Inst] pcOffset: 0x30   metricValue:    [0]: 1000
                [Inst] pcOffset: 0x40   metricValue:    [0]: 1000
                [Inst] pcOffset: 0x50   metricValue:    [0]: 1000
                [Inst] pcOffset: 0x60   metricValue:    [0]: 1000
                [Inst] pcOffset: 0x70   metricValue:    [0]: 1000
                [Inst] pcOffset: 0x80   metricValue:    [0]: 1000
                [Inst] pcOffset: 0x90   metricValue:    [0]: 1000
                [Inst] pcOffset: 0xa0   metricValue:    [0]: 1000
                [Inst] pcOffset: 0xb0   metricValue:    [0]: 1000
                [Inst] pcOffset: 0xc0   metricValue:    [0]: 1000
                [Inst] pcOffset: 0xd0   metricValue:    [0]: 1000
                [Inst] pcOffset: 0xe0   metricValue:    [0]: 1000
                [Inst] pcOffset: 0xf0   metricValue:    [0]: 1000
                [Inst] pcOffset: 0x100  metricValue:    [0]: 1000
                [Inst] pcOffset: 0x110  metricValue:    [0]: 1000
                [Inst] pcOffset: 0x120  metricValue:    [0]: 1000
                [Inst] pcOffset: 0x130  metricValue:    [0]: 1000

Launching VectorSubtract
Disable SASS Patching
Launching VectorMultiply

Code Architecture

SASS Metrics Collector

class SASSMetricsCollector {
private:
    struct SASSMetrics {
        uint64_t totalInstructions;
        uint64_t fp32Instructions;
        uint64_t integerInstructions;
        uint64_t memoryInstructions;
        uint64_t controlInstructions;
        double warpEfficiency;
        double branchEfficiency;
        double threadEfficiency;
    };
    
    std::map<std::string, SASSMetrics> kernelMetrics;
    std::vector<CUpti_EventID> sassEventIds;
    CUpti_EventGroup eventGroup;
 
public:
    void setupSASSMetrics(CUcontext context, CUdevice device);
    void startCollection();
    void stopCollection();
    void analyzeKernelSASS(const std::string& kernelName);
    void generateSASSReport();
};
 
void SASSMetricsCollector::setupSASSMetrics(CUcontext context, CUdevice device) {
    // Set up SASS-level event collection
    CUPTI_CALL(cuptiEventGroupCreate(context, &eventGroup, 0));
    
    // Add SASS instruction counting events
    std::vector<std::string> sassEvents = {
        "inst_executed",
        "inst_fp_32", 
        "inst_integer",
        "inst_compute_ld_st",
        "inst_control",
        "thread_inst_executed",
        "warp_execution_efficiency",
        "branch_efficiency"
    };
    
    for (const auto& eventName : sassEvents) {
        CUpti_EventID eventId;
        CUptiResult result = cuptiEventGetIdFromName(device, eventName.c_str(), &eventId);
        if (result == CUPTI_SUCCESS) {
            CUPTI_CALL(cuptiEventGroupAddEvent(eventGroup, eventId));
            sassEventIds.push_back(eventId);
        }
    }
}

Assembly Code Analysis

class AssemblyAnalyzer {
private:
    struct InstructionProfile {
        std::string opcode;
        uint64_t executionCount;
        double averageLatency;
        double throughput;
        std::string category;
    };
    
    std::map<std::string, InstructionProfile> instructionProfiles;
 
public:
    void analyzeInstructionMix(const SASSMetrics& metrics) {
        std::cout << "Instruction Mix Analysis:" << std::endl;
        
        uint64_t totalInsts = metrics.totalInstructions;
        
        double fpPercentage = (double)metrics.fp32Instructions / totalInsts * 100;
        double intPercentage = (double)metrics.integerInstructions / totalInsts * 100;
        double memPercentage = (double)metrics.memoryInstructions / totalInsts * 100;
        double ctrlPercentage = (double)metrics.controlInstructions / totalInsts * 100;
        
        std::cout << "  Floating Point: " << fpPercentage << "%" << std::endl;
        std::cout << "  Integer: " << intPercentage << "%" << std::endl;
        std::cout << "  Memory: " << memPercentage << "%" << std::endl;
        std::cout << "  Control: " << ctrlPercentage << "%" << std::endl;
        
        // Analyze instruction balance
        if (memPercentage > 40) {
            std::cout << "  WARNING: Memory-bound kernel detected" << std::endl;
        }
        if (ctrlPercentage > 20) {
            std::cout << "  WARNING: High control overhead detected" << std::endl;
        }
    }
    
    void suggestOptimizations(const SASSMetrics& metrics) {
        std::cout << "Optimization Suggestions:" << std::endl;
        
        if (metrics.warpEfficiency < 0.8) {
            std::cout << "  - Consider optimizing warp utilization" << std::endl;
            std::cout << "  - Check for thread divergence patterns" << std::endl;
        }
        
        if (metrics.branchEfficiency < 0.9) {
            std::cout << "  - Optimize branch patterns to reduce divergence" << std::endl;
            std::cout << "  - Consider predicated execution for simple branches" << std::endl;
        }
        
        double memRatio = (double)metrics.memoryInstructions / metrics.totalInstructions;
        if (memRatio > 0.3) {
            std::cout << "  - Consider memory access optimization" << std::endl;
            std::cout << "  - Evaluate shared memory usage opportunities" << std::endl;
        }
    }
};

Advanced SASS Analysis

Instruction Pipeline Analysis

class PipelineAnalyzer {
private:
    struct PipelineMetrics {
        double aluUtilization;
        double fpuUtilization;
        double memoryUtilization;
        double controlUtilization;
        std::vector<double> stallReasons;
    };
 
public:
    PipelineMetrics analyzePipelineUtilization(const SASSMetrics& metrics) {
        PipelineMetrics pipelineMetrics;
        
        // Calculate pipeline utilization based on instruction mix
        uint64_t totalInsts = metrics.totalInstructions;
        
        pipelineMetrics.aluUtilization = 
            (double)(metrics.integerInstructions) / totalInsts;
        pipelineMetrics.fpuUtilization = 
            (double)(metrics.fp32Instructions) / totalInsts;
        pipelineMetrics.memoryUtilization = 
            (double)(metrics.memoryInstructions) / totalInsts;
        pipelineMetrics.controlUtilization = 
            (double)(metrics.controlInstructions) / totalInsts;
        
        return pipelineMetrics;
    }
    
    void identifyBottlenecks(const PipelineMetrics& pipeline) {
        std::cout << "Pipeline Bottleneck Analysis:" << std::endl;
        
        double maxUtilization = std::max({
            pipeline.aluUtilization,
            pipeline.fpuUtilization, 
            pipeline.memoryUtilization,
            pipeline.controlUtilization
        });
        
        if (maxUtilization == pipeline.memoryUtilization) {
            std::cout << "  Primary bottleneck: Memory pipeline" << std::endl;
            std::cout << "  Suggestions: Optimize memory access patterns, use shared memory" << std::endl;
        } else if (maxUtilization == pipeline.fpuUtilization) {
            std::cout << "  Primary bottleneck: FP pipeline" << std::endl;
            std::cout << "  Suggestions: Balance compute with memory operations" << std::endl;
        } else if (maxUtilization == pipeline.controlUtilization) {
            std::cout << "  Primary bottleneck: Control pipeline" << std::endl;
            std::cout << "  Suggestions: Reduce branch divergence, simplify control flow" << std::endl;
        }
    }
};

Performance Model Validation

class PerformanceModel {
private:
    struct ModelParameters {
        double peakALUThroughput;
        double peakFPUThroughput;
        double memoryBandwidth;
        double instructionIssueRate;
    };
    
    ModelParameters getGPUParameters(int deviceId) {
        ModelParameters params;
        
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, deviceId);
        
        // Estimate peak throughput based on GPU architecture
        params.peakALUThroughput = prop.multiProcessorCount * prop.clockRate * 2; // Simplified
        params.peakFPUThroughput = prop.multiProcessorCount * prop.clockRate;
        params.memoryBandwidth = prop.memoryBusWidth * prop.memoryClockRate * 2 / 8; // GB/s
        params.instructionIssueRate = prop.multiProcessorCount * prop.clockRate;
        
        return params;
    }
 
public:
    void validatePerformanceModel(const SASSMetrics& metrics, int deviceId) {
        ModelParameters params = getGPUParameters(deviceId);
        
        // Calculate achieved vs theoretical performance
        double achievedALUUtilization = 
            (double)metrics.integerInstructions / metrics.totalInstructions;
        double achievedFPUUtilization = 
            (double)metrics.fp32Instructions / metrics.totalInstructions;
        
        std::cout << "Performance Model Validation:" << std::endl;
        std::cout << "  ALU Utilization: " << achievedALUUtilization * 100 << "%" << std::endl;
        std::cout << "  FPU Utilization: " << achievedFPUUtilization * 100 << "%" << std::endl;
        
        // Compare with theoretical limits
        double efficiencyGap = 1.0 - std::max(achievedALUUtilization, achievedFPUUtilization);
        if (efficiencyGap > 0.3) {
            std::cout << "  Significant efficiency gap detected: " << efficiencyGap * 100 << "%" << std::endl;
        }
    }
};

Real-World Applications

Matrix Multiplication Analysis

void analyzeMatrixMultiplication() {
    SASSMetricsCollector collector;
    AssemblyAnalyzer analyzer;
    
    // Set up metrics collection
    collector.setupSASSMetrics(context, device);
    
    // Launch matrix multiplication kernel
    collector.startCollection();
    
    dim3 grid(32, 32);
    dim3 block(16, 16);
    matrixMulKernel<<<grid, block>>>(d_A, d_B, d_C, N);
    
    cudaDeviceSynchronize();
    collector.stopCollection();
    
    // Analyze results
    SASSMetrics metrics = collector.getMetrics("matrixMulKernel");
    analyzer.analyzeInstructionMix(metrics);
    analyzer.suggestOptimizations(metrics);
    
    std::cout << "Matrix Multiplication SASS Analysis:" << std::endl;
    std::cout << "  Compute Intensity: " << calculateComputeIntensity(metrics) << std::endl;
    std::cout << "  Memory Efficiency: " << calculateMemoryEfficiency(metrics) << std::endl;
}

Reduction Operation Profiling

void analyzeReductionKernel() {
    SASSMetricsCollector collector;
    
    collector.setupSASSMetrics(context, device);
    collector.startCollection();
    
    // Launch reduction kernel
    reductionKernel<<<grid, block, sharedMemSize>>>(d_input, d_output, N);
    cudaDeviceSynchronize();
    
    collector.stopCollection();
    
    SASSMetrics metrics = collector.getMetrics("reductionKernel");
    
    std::cout << "Reduction Kernel Analysis:" << std::endl;
    std::cout << "  Shared Memory Usage Efficiency: " << 
        analyzeSharedMemoryUsage(metrics) << std::endl;
    std::cout << "  Synchronization Overhead: " << 
        analyzeSynchronizationOverhead(metrics) << std::endl;
    std::cout << "  Warp Divergence: " << 
        (100.0 - metrics.branchEfficiency) << "%" << std::endl;
}

Integration with Development Tools

Compiler Optimization Analysis

class CompilerAnalyzer {
public:
    void compareOptimizationLevels() {
        std::vector<std::string> optimizationFlags = {"-O0", "-O1", "-O2", "-O3"};
        
        for (const auto& flag : optimizationFlags) {
            std::cout << "Testing with " << flag << ":" << std::endl;
            
            // Recompile with different optimization level
            recompileKernel(flag);
            
            // Collect SASS metrics
            SASSMetrics metrics = collectMetricsForOptimization(flag);
            
            std::cout << "  Instructions: " << metrics.totalInstructions << std::endl;
            std::cout << "  Efficiency: " << metrics.warpEfficiency << std::endl;
            
            analyzeOptimizationImpact(metrics, flag);
        }
    }
    
private:
    void analyzeOptimizationImpact(const SASSMetrics& metrics, const std::string& flag) {
        // Analyze how compiler optimizations affect SASS metrics
        if (flag == "-O3" && metrics.controlInstructions > baseline.controlInstructions) {
            std::cout << "    Warning: High optimization may have increased control overhead" << std::endl;
        }
        
        if (metrics.warpEfficiency < 0.8) {
            std::cout << "    Consider different optimization strategies" << std::endl;
        }
    }
};

Assembly Code Correlation

class CodeCorrelator {
public:
    void correlateCUDAtoSASS(const std::string& cudaFile, const std::string& sassFile) {
        // Parse CUDA source code
        auto cudaLines = parseCUDASource(cudaFile);
        
        // Parse SASS assembly
        auto sassInstructions = parseSASSAssembly(sassFile);
        
        // Correlate performance metrics with source lines
        for (const auto& [lineNum, sassInsts] : sassInstructions) {
            if (lineNum < cudaLines.size()) {
                std::cout << "CUDA Line " << lineNum << ": " << cudaLines[lineNum] << std::endl;
                std::cout << "  SASS Instructions: " << sassInsts.size() << std::endl;
                
                for (const auto& inst : sassInsts) {
                    std::cout << "    " << inst.opcode << " (executed " 
                             << inst.executionCount << " times)" << std::endl;
                }
                
                // Identify performance hotspots
                if (getTotalExecutionCount(sassInsts) > hotspotThreshold) {
                    std::cout << "  >>> PERFORMANCE HOTSPOT DETECTED <<<" << std::endl;
                }
            }
        }
    }
};

Troubleshooting SASS Metrics

Common Issues

Metric Availability: Not all GPUs support all SASS metrics
Event Multiplexing: Some metrics cannot be collected simultaneously
Precision Limitations: SASS counters may have limited precision
Context Sensitivity: Metrics may vary with different execution contexts

Debug Strategies

class SASSDebugger {
public:
    void validateMetricSupport(CUdevice device) {
        uint32_t numEvents;
        CUPTI_CALL(cuptiDeviceGetNumEventDomains(device, &numEvents));
        
        std::cout << "Available event domains: " << numEvents << std::endl;
        
        // Check specific SASS events
        std::vector<std::string> requiredEvents = {
            "inst_executed", "inst_fp_32", "warp_execution_efficiency"
        };
        
        for (const auto& eventName : requiredEvents) {
            CUpti_EventID eventId;
            CUptiResult result = cuptiEventGetIdFromName(device, eventName.c_str(), &eventId);
            
            if (result == CUPTI_SUCCESS) {
                std::cout << "✓ " << eventName << " supported" << std::endl;
            } else {
                std::cout << "✗ " << eventName << " not supported" << std::endl;
            }
        }
    }
    
    void checkEventGroupCompatibility(const std::vector<CUpti_EventID>& events, CUcontext context) {
        // Test if events can be collected together
        CUpti_EventGroup testGroup;
        CUptiResult result = cuptiEventGroupCreate(context, &testGroup, 0);
        
        for (auto eventId : events) {
            result = cuptiEventGroupAddEvent(testGroup, eventId);
            if (result != CUPTI_SUCCESS) {
                std::cout << "Event " << eventId << " cannot be added to group" << std::endl;
            }
        }
        
        cuptiEventGroupDestroy(testGroup);
    }
};

Next Steps

Apply SASS metrics analysis to understand your kernel performance at the lowest level
Experiment with different compiler optimizations and measure their SASS-level impact
Correlate SASS metrics with high-level algorithm performance
Develop custom analysis tools for your specific instruction patterns
Integrate SASS analysis into your optimization workflow for detailed performance tuning

Continue exploring

Back to index

About Eunomia

The broader Eunomia project family, including GPU tutorials, research notes, experiments, and related open-source tools.

NVIDIA CUPTI Tutorials

The GitHub repo and complete tutorial is available at <https://github.com/eunomia-bpf/cupti-tutorial.

CUPTI SASS Source Mapping Tutorial

The GitHub repo and complete tutorial is available at <https://github.com/eunomia-bpf/cupti-tutorial.

Last updated: Sep 1, 2025
First published: Jun 17, 2025
Contributors: yunwei37, github-actions[bot]

Edit this page Share on X Share on Facebook Join discussion RSS feed

Was this page helpful?