
NVBit Tutorial: Basic Block Instruction Counting

GitHub repo: https://github.com/eunomia-bpf/nvbit-tutorial

TL;DR: Like instr_count but 10-50x faster. Use this for production profiling.

Quick Start:

LD_PRELOAD=./tools/instr_count_bb/instr_count_bb.so ./your_app
# Same output as instr_count, but much faster!

Overview

A basic block is a sequence of instructions with a single entry point (the first instruction) and a single exit point (the last instruction). No branches exist in the middle of a basic block, and no branches target the middle of a basic block. This property makes basic blocks ideal for efficient instrumentation.

Instead of instrumenting every instruction (as in the basic instr_count tool), instr_count_bb identifies all basic blocks in a function, instruments only the first instruction of each basic block, passes the basic block size to the instrumentation function, and updates the counter based on the entire block size. This approach drastically reduces the number of instrumentation points and function calls during execution while still providing an accurate instruction count.
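
To make the trade-off concrete, here is an illustrative host-only sketch (plain C, not NVBit code) of the two counting strategies; the block and instruction counts are made-up round numbers:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: both strategies arrive at the same total, but
 * basic-block counting needs one counter update per block instead of
 * one per instruction. The numbers below are hypothetical. */
int main(void) {
    const int num_blocks = 25;          /* executed basic blocks */
    const int instrs_per_block = 2000;  /* instructions in each block */

    uint64_t per_instr = 0, per_block = 0;

    for (int b = 0; b < num_blocks; b++) {
        for (int i = 0; i < instrs_per_block; i++)
            per_instr += 1;             /* 50,000 counter updates */
        per_block += instrs_per_block;  /* 25 counter updates */
    }

    printf("per-instruction total: %llu, per-block total: %llu\n",
           (unsigned long long)per_instr, (unsigned long long)per_block);
    return 0;
}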

Code Structure

  • instr_count.cu – Host code that analyzes kernel code to identify basic blocks, instruments the first instruction of each basic block, tracks predicated instructions separately (optional), and reports results after kernel execution
  • inject_funcs.cu – Device code that provides count_instrs (increments the counter by the basic block size) and count_pred_off (tracks predicated-off instructions)

How It Works: Host Side (instr_count.cu)

The host-side implementation differs from instr_count in how it analyzes and instruments the code:

1. Basic Block Analysis

void instrument_function_if_needed(CUcontext ctx, CUfunction func) {
    // ... setup similar to instr_count ...

    /* Get the static control flow graph of the function */
    const CFG_t &cfg = nvbit_get_CFG(ctx, f);
    if (cfg.is_degenerate) {
        printf("Warning: Function %s is degenerate, basic blocks can't be computed statically\n",
               nvbit_get_func_name(ctx, f));
    }

    /* Iterate on basic blocks and inject at the first instruction */
    for (auto &bb : cfg.bbs) {
        Instr *i = bb->instrs[0];
        /* inject device function */
        nvbit_insert_call(i, "count_instrs", IPOINT_BEFORE);
        /* add size of basic block in number of instructions */
        nvbit_add_call_arg_const_val32(i, bb->instrs.size());
        /* add count warp level option */
        nvbit_add_call_arg_const_val32(i, count_warp_level);
        /* add pointer to counter location */
        nvbit_add_call_arg_const_val64(i, (uint64_t)&counter);
    }

    /* Handle predicated instructions if required */
    if (exclude_pred_off) {
        /* iterate on instructions */
        for (auto i : nvbit_get_instrs(ctx, f)) {
            /* inject only if instruction has predicate */
            if (i->hasPred()) {
                /* inject function */
                nvbit_insert_call(i, "count_pred_off", IPOINT_BEFORE);
                /* add guard predicate as argument */
                nvbit_add_call_arg_guard_pred_val(i);
                /* add count warp level option */
                nvbit_add_call_arg_const_val32(i, count_warp_level);
                /* add pointer to counter predicate off location */
                nvbit_add_call_arg_const_val64(i, (uint64_t)&counter_pred_off);
            }
        }
    }
}

Key differences from instr_count:

  • Get the static Control Flow Graph (CFG) with nvbit_get_CFG instead of a flat instruction list
  • Iterate over basic blocks (cfg.bbs) rather than over individual instructions
  • Instrument only the first instruction of each basic block
  • Pass the basic block size (in instructions) as a constant argument to the injected function
  • If EXCLUDE_PRED_OFF is set, additionally instrument predicated instructions to track predicated-off executions
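
If you want to see exactly which blocks the CFG analysis found, a verbose dump can be inserted right after nvbit_get_CFG. This is only a sketch, assuming the CFG_t/basic_block_t layout used above (a bbs vector of blocks, each holding an instrs vector of Instr*) and NVBit's Instr::printDecoded() helper:

/* Sketch: dump the basic blocks found by nvbit_get_CFG (guarded by the
 * tool's verbose flag); useful for sanity-checking the CFG analysis. */
if (verbose) {
    int bb_id = 0;
    for (auto &bb : cfg.bbs) {
        printf("Basic block %d: %zu instructions\n", bb_id++, bb->instrs.size());
        for (auto i : bb->instrs) {
            i->printDecoded();   /* print the decoded SASS instruction */
        }
    }
}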

2. Result Reporting

/* After kernel completion */
uint64_t kernel_instrs = counter - counter_pred_off;
tot_app_instrs += kernel_instrs;

printf("kernel %d - %s - #thread-blocks %d, kernel instructions %ld, total instructions %ld\n",
       kernel_id++, nvbit_get_func_name(ctx, func, mangled), num_ctas,
       kernel_instrs, tot_app_instrs);

When reporting results, we subtract the predicated-off instruction count from the total if that tracking is enabled.
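
For context, this reporting code typically runs in the tool's CUDA event callback, after the kernel has finished. The sketch below shows roughly where it could sit, assuming the standard nvbit_at_cuda_event callback and the cuLaunchKernel exit path used by the instr_count samples; error handling and the pre-launch instrumentation step are omitted:

/* Sketch: report after the kernel completes, then reset per-kernel counters. */
void nvbit_at_cuda_event(CUcontext ctx, int is_exit, nvbit_api_cuda_t cbid,
                         const char *name, void *params, CUresult *pStatus) {
    if (cbid == API_CUDA_cuLaunchKernel && is_exit) {
        cuCtxSynchronize();           /* wait so the device counters are final */

        uint64_t kernel_instrs = counter - counter_pred_off;
        tot_app_instrs += kernel_instrs;
        /* ... printf as shown above, using the launch parameters in params ... */

        counter = 0;                  /* reset for the next kernel */
        counter_pred_off = 0;
    }
}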

How It Works: Device Side (inject_funcs.cu)

The device code provides two functions:

1. Basic Block Counter

extern "C" __device__ __noinline__ void count_instrs(int num_instrs,
                                                    int count_warp_level,
                                                    uint64_t pcounter) {
    /* all the active threads will compute the active mask */
    const int active_mask = __ballot_sync(__activemask(), 1);

    /* each thread will get a lane id */
    const int laneid = get_laneid();

    /* get the id of the first active thread */
    const int first_laneid = __ffs(active_mask) - 1;

    /* count all the active threads */
    const int num_threads = __popc(active_mask);

    /* only the first active thread will perform the atomic */
    if (first_laneid == laneid) {
        if (count_warp_level) {
            atomicAdd((unsigned long long*)pcounter, 1 * num_instrs);
        } else {
            atomicAdd((unsigned long long*)pcounter, num_threads * num_instrs);
        }
    }
}

The key difference from instr_count is the num_instrs parameter, which contains the number of instructions in the basic block. We multiply this by either 1 (warp-level counting) or by the number of active threads (thread-level counting).
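
For example, if a basic block of 6 instructions is executed by a warp with all 32 threads active, the counter increases by 6 with COUNT_WARP_LEVEL=1 (warp-level), and by 6 × 32 = 192 with COUNT_WARP_LEVEL=0 (thread-level).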

2. Predicate Tracking

extern "C" __device__ __noinline__ void count_pred_off(int predicate,
                                                      int count_warp_level,
                                                      uint64_t pcounter) {
    /* all the active threads will compute the active mask */
    const int active_mask = __ballot_sync(__activemask(), 1);

    /* get predicate mask */
    const int predicate_mask = __ballot_sync(__activemask(), predicate);

    /* get mask of threads that have their predicate off */
    const int mask_off = active_mask ^ predicate_mask;

    /* count the number of threads that have their predicate off */
    const int num_threads_off = __popc(mask_off);

    /* only the first active thread updates the counter */
    if (get_laneid() == __ffs(active_mask) - 1) {
        if (count_warp_level) {
            if (predicate_mask == 0) {
                atomicAdd((unsigned long long*)pcounter, 1);
            }
        } else {
            atomicAdd((unsigned long long*)pcounter, num_threads_off);
        }
    }
}

This function tracks instructions that would be skipped due to predicates being false. It uses a bitwise XOR between the active mask and predicate mask to identify threads where the instruction would be skipped.
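
For example, if all 32 lanes are active (active_mask = 0xFFFFFFFF) but only the lower 16 lanes evaluate the predicate as true (predicate_mask = 0x0000FFFF), then mask_off = 0xFFFF0000 and __popc(mask_off) reports 16 predicated-off threads. At warp level the counter is only incremented when predicate_mask is 0, i.e. when the entire warp skips the instruction.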

Building the Tool

The build process follows the same pattern as other NVBit tools:

  1. Compile the host code:

    $(NVCC) -dc -c -std=c++11 $(INCLUDES) ... instr_count.cu -o instr_count.o
    

  2. Compile the device functions with the necessary flags:

    $(NVCC) $(INCLUDES) $(MAXRREGCOUNT_FLAG) -Xptxas -astoolspatch --keep-device-functions ... inject_funcs.cu -o inject_funcs.o
    

  3. Link into a shared library:

    $(NVCC) -arch=$(ARCH) $(NVCC_OPT) $(OBJECTS) $(LIBS) $(NVCC_PATH) -lcuda -lcudart_static -shared -o instr_count_bb.so
    

Running the Tool

Launch your CUDA application with the tool preloaded:

LD_PRELOAD=./tools/instr_count_bb/instr_count_bb.so ./your_cuda_application

Environment Variables

The tool supports the same environment variables as instr_count:

  • START_GRID_NUM/END_GRID_NUM: Kernel launch range to instrument
  • COUNT_WARP_LEVEL: Count at warp or thread level (default: 1)
  • EXCLUDE_PRED_OFF: Track predicated-off instructions (default: 0)
  • ACTIVE_FROM_START: Instrument from start or wait for profiler commands (default: 1)
  • MANGLED_NAMES: Print mangled kernel names (default: 1)
  • TOOL_VERBOSE: Enable verbose output (default: 0)
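
For example, to count at the thread level and also track predicated-off instructions:

COUNT_WARP_LEVEL=0 EXCLUDE_PRED_OFF=1 \
LD_PRELOAD=./tools/instr_count_bb/instr_count_bb.so ./your_cuda_application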

Sample Output

The output format is the same as instr_count:

------------- NVBit (NVidia Binary Instrumentation Tool) Loaded --------------
[Environment variables and settings shown here]
----------------------------------------------------------------------------------------------------
kernel 0 - vecAdd(double*, double*, double*, int) - #thread-blocks 98, kernel instructions 50077, total instructions 50077
Final sum = 100000.000000; sum/n = 1.000000 (should be ~1)

What is a Basic Block?

┌──────────────────────────────────────────────────────────────────┐
│  Basic Block Example                                              │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│  Function: vec_add                                                │
│                                                                   │
│  Basic Block 1 (Entry):                                           │
│  ┌────────────────────────────────┐                              │
│  │  LDG R0, [global_addr]        │ ◄─── Instrument HERE (once)   │
│  │  LDG R1, [global_addr + 8]    │                               │
│  │  FADD R2, R0, R1              │                               │
│  │  STG [result_addr], R2        │                               │
│  │  IADD R3, R3, 1               │ <-- 7 instructions            │
│  │  ISETP.LT P0, R3, limit       │                               │
│  │  @P0 BRA loop_start           │ ◄─── Branch (ends BB)         │
│  └────────────────────────────────┘                              │
│                                                                   │
│  Instead of 7 instrumentation calls → Just 1 call!               │
│  Pass "7" as argument to count_instrs()                           │
│                                                                   │
└──────────────────────────────────────────────────────────────────┘

Key properties: Single entry (first instruction), single exit (last instruction, usually branch), no branches in the middle, and no jumps targeting the middle.

Performance Benefits

The key advantage of basic block instrumentation is performance:

Real Performance Comparison

Kernel Type        Instructions   Basic Blocks   instr_count Overhead   instr_count_bb Overhead   Speedup
Simple vec_add     50,000         25             95x                    2.5x                      38x faster
Matrix multiply    500,000        300            180x                   4.2x                      43x faster
Complex ML         5,000,000      2,000          250x                   5.8x                      43x faster

Why such a big difference? In the vec_add example above, instr_count performs roughly 50,000 instrumentation calls and atomic updates, while instr_count_bb performs only about 25 — one per executed basic block rather than one per instruction, a 2,000x reduction in calls. In general, the reduction is proportional to the average basic block size.

When to Use Each Tool

Tool             Use Case
instr_count      Learning, small kernels, instruction-specific analysis
instr_count_bb   Production profiling, large kernels, performance-critical workloads

The basic block approach makes dynamic instrumentation practical for production-sized applications.

How CFG Analysis Works

NVBit provides a static Control Flow Graph (CFG) analysis through nvbit_get_CFG(). This analyzes the instruction sequence and branch targets to partition code into basic blocks: each function entry point starts a new basic block, each branch instruction ends a basic block, each branch target starts a new basic block, and instructions between these boundaries form a linear sequence in a single basic block.

The CFG represents control flow as a directed graph where nodes are basic blocks and edges represent possible flow paths. A "degenerate" CFG (indicated by cfg.is_degenerate) means that the static analysis couldn't properly identify all basic blocks, usually due to complex control flow patterns or indirect branches.
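
As a sketch of how this CFG can be used beyond instrumentation, the helper below sums a function's static instruction count from its basic blocks. It assumes the same CFG_t/basic_block_t layout used earlier in this tutorial and is meant to be called from instrument_function_if_needed() for each related function f:

/* Sketch: summarize a function's static CFG. */
static void print_cfg_summary(CUcontext ctx, CUfunction f) {
    const CFG_t &cfg = nvbit_get_CFG(ctx, f);
    size_t num_blocks = 0, static_instrs = 0;
    for (auto &bb : cfg.bbs) {
        num_blocks++;
        static_instrs += bb->instrs.size();  /* static count, not executions */
    }
    printf("%s: %zu basic blocks, %zu static instructions\n",
           nvbit_get_func_name(ctx, f), num_blocks, static_instrs);
}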

Extending the Tool

The basic block approach can be extended for other analyses: path profiling (count how many times each control flow path is taken), hot spot identification (identify the most frequently executed basic blocks), dynamic CFG construction (build a runtime CFG based on actual execution), and block-level memory tracking (analyze memory access patterns per basic block).
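
As one concrete illustration, hot spot identification could be sketched by giving every basic block its own counter and passing the block index to a new injected function. None of this is part of the tool: count_bb_exec, bb_counters, and MAX_BASIC_BLOCKS are hypothetical names, and the per-function block indexing is simplified.

/* Host side (instr_count.cu) -- hypothetical per-block counters */
#define MAX_BASIC_BLOCKS 4096                       /* assumed upper bound */
__managed__ uint64_t bb_counters[MAX_BASIC_BLOCKS]; /* one slot per block */

/* Inside instrument_function_if_needed(), instead of the single counter: */
int bb_idx = 0;
for (auto &bb : cfg.bbs) {
    Instr *i = bb->instrs[0];
    nvbit_insert_call(i, "count_bb_exec", IPOINT_BEFORE);
    nvbit_add_call_arg_const_val32(i, bb_idx++);               /* block index */
    nvbit_add_call_arg_const_val64(i, (uint64_t)bb_counters);  /* counter array */
}

/* Device side (inject_funcs.cu) -- one atomic add per executed block */
extern "C" __device__ __noinline__ void count_bb_exec(int bb_idx,
                                                      uint64_t pcounters) {
    const int active_mask = __ballot_sync(__activemask(), 1);
    if (get_laneid() == __ffs(active_mask) - 1) {
        atomicAdd((unsigned long long *)pcounters + bb_idx, 1ULL);
    }
}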

When to Use Basic Block Instrumentation

Basic block instrumentation is ideal for large, complex kernels where instrumentation overhead matters, production profiling where minimizing slowdown is important, and tools that need to collect aggregate statistics rather than per-instruction data. For analyses that need instruction-specific details (like the opcode histogram), you'll still need to instrument individual instructions.

Limitations

  • Cannot capture instruction-specific behavior within a basic block
  • The static CFG analysis may miss some control flow paths in complex code
  • Some GPU architectures may have special cases that affect basic block identification

Next Steps

After understanding basic block instrumentation, consider combining this approach with mem_trace for efficient memory access analysis, using the CFG information to build visualization tools for kernel control flow, or implementing more sophisticated analyses like critical path identification. Basic block instrumentation is a powerful technique that balances detail and performance, making it ideal for many CUDA analysis tools.
