
GPU Profiling Under the Hood: An Implementation-Focused Survey of Modern Accelerator Tracing Tools

Profiling and tracing heterogeneous accelerators (GPUs, DPUs, and APUs) is crucial for optimizing performance in modern systems. This survey provides a deep implementation-oriented review of how state-of-the-art tools capture low-level execution details. We target two audiences: (1) tool developers seeking insight into how existing profilers are built – including what interfaces they hook and how they trace both CPUs and accelerators – and (2) system engineers deciding which tools to integrate, based on capabilities and overhead. We explore internal architectures of key profiling tools, the runtime libraries and driver APIs they intercept, whether they trace events on CPU, GPU, or both, and how they handle cross-device correlation. We then examine implementation strategies (instrumentation vs. sampling, use of performance counters, dynamic binary instrumentation, kernel hooks like eBPF), as well as GPU-specific techniques (e.g. warp-level instruction sampling). Data handling and visualization approaches are compared, including trace data formats, storage and export methods, and GUI/CLI/Web interfaces (with integration into dashboards like TensorBoard or Grafana). We catalog the performance metrics these tools collect – from kernel latencies and throughput to SM occupancy, cache misses, and interconnect utilization – explaining how raw hardware counters are translated into derived metrics and the granularity/accuracy trade-offs involved. Next, we discuss what insights users can glean from visualizations (timelines, flame charts, dependency graphs, etc.), such as detecting pipeline stalls, memory bottlenecks, or CPU–GPU desynchronization. Tool extensibility (support for plugins or user-defined instrumentation) is reviewed, as are the relative overheads and intrusiveness of different measurement techniques. We also address security and sandboxing concerns – e.g. profiling in containerized or multi-tenant environments – and outline known limitations plus future directions (including support for emerging DPUs and integrated CPU-GPU APUs). The goal is a comprehensive systems-level survey that goes beyond feature lists to compare how these profilers work under the hood, including any reverse-engineered or undocumented methods used by open-source tools.

Internal Architectures of Key Profiling Tools

Understanding each tool’s internal design reveals what levels of the software/hardware stack it can observe. Below we review representative GPU profiling and tracing tools – both vendor-provided and open-source – focusing on their architecture, data collection mechanisms, and target platforms.

NVIDIA Nsight Systems

NVIDIA Nsight Systems is a system-wide profiler that provides a timeline view of CPU and GPU activities for CUDA applications. Internally, Nsight Systems combines lightweight CPU sampling with GPU activity tracing. It leverages the Linux Performance Events interface under the hood to capture CPU thread scheduling and sample CPU instruction pointers. In practice, this means it relies on perf_event_open for minimal-overhead sampling, requiring an appropriate perf_event_paranoid setting on Linux. For GPU tracing, Nsight Systems interfaces with NVIDIA’s CUDA Profiling Tools Interface (CUPTI) to receive callback “activity” records whenever GPU operations occur (kernel launches, memory copies, etc.). By hooking into the CUDA driver/runtime via CUPTI, Nsight Systems records timestamps for each GPU kernel execution and memory transfer. It synchronizes CPU and GPU timelines through a common timestamp base (CUPTI events are timestamped and merged with CPU events) so that events on both processors can be correlated in a unified timeline. The tool traces a variety of runtimes beyond CUDA as well: it can capture OS events, storage I/O, and GPU-accelerated libraries (e.g. cuDNN, cuBLAS) when instrumented via NVTX annotations. Nsight Systems primarily uses event-based instrumentation for GPU (via CUPTI callbacks) and statistical sampling for CPU, making it a hybrid approach. This design achieves low overhead—usually only a few percent—sufficient for profiling full applications. Data is recorded to a proprietary trace file (e.g. .nsys-rep) which can be viewed in the Nsight Systems GUI or exported (the CLI can generate JSON or CSV summary stats). It supports multi-threaded and multi-GPU tracing, showing context switches and concurrency on a timeline. However, being vendor-specific, Nsight Systems focuses on NVIDIA GPUs and CUDA or related NVIDIA APIs; it does not directly support AMD or Intel GPU runtimes.
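
Because Nsight Systems picks up NVTX ranges, applications commonly add a few annotations so that logical phases appear as named spans on the timeline. Below is a minimal, hedged sketch using the NVTX C API; the two helper functions are placeholders for application work, not library calls:

    /* Minimal NVTX annotation sketch: named ranges pushed here show up as
     * labeled spans on Nsight Systems' CPU rows and can be read alongside
     * the GPU rows. Assumes the classic nvToolsExt.h header (NVTX v3 ships
     * the same API under nvtx3/); link with -lnvToolsExt if needed. */
    #include <nvToolsExt.h>

    static void prepare_inputs(void)  { /* placeholder for host-side work  */ }
    static void submit_gpu_work(void) { /* placeholder for kernel launches */ }

    int main(void)
    {
        nvtxRangePushA("data_preparation");   /* open a named CPU range */
        prepare_inputs();
        nvtxRangePop();                       /* close it */

        nvtxRangePushA("gpu_submit");
        submit_gpu_work();
        nvtxRangePop();
        return 0;
    }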

NVIDIA Nsight Compute

Nsight Compute is NVIDIA’s kernel-level profiler for deep analysis of individual GPU kernels. Instead of a timeline, it provides a detailed report of metrics for each kernel launch. Internally, Nsight Compute instruments and measures GPU hardware performance counters extensively. It uses NVIDIA’s profiling interfaces (such as CUPTI’s Event and Metric API or the newer CUDA Profiling API) to program GPU performance monitors to count events like instructions executed, memory transactions, warp occupancy, etc. Because GPUs have limited counters available per run, Nsight Compute often replays each kernel multiple times to gather all requested metrics. For example, a kernel might run dozens of times (e.g. 46 passes) to collect different counter sets, which are later aggregated. This post-mortem instrumentation approach yields comprehensive metrics (including derived ones like achieved occupancy or memory throughput), but incurs significant runtime overhead – thus Nsight Compute is used typically on small kernels or in selective profiling sessions. The tool’s architecture includes a kernel injection mechanism: it pauses the target program at each kernel launch, sets up counters (or even inserts timing instruments around the kernel), then resumes execution and collects results. It supports only NVIDIA GPUs (starting from Volta architecture) and primarily CUDA kernels (though it can also profile kernels from GPU-accelerated libraries or CUDA Graphs by hooking those calls). Results are stored in a report file or shown in an interactive GUI/CLI with “Sections” grouping metrics. Nsight Compute’s metrics calculation system is advanced – it computes high-level stats (e.g. achieved occupancy, stall reasons breakdown) from low-level counters using formulas embedded in the tool. The output can be exported to text or CSV for further analysis. While extremely detailed, Nsight Compute’s tight integration with NVIDIA’s driver means it is not extensible to other vendors and must be used offline (it’s too intrusive for continuous monitoring).

AMD Omnitrace

Omnitrace is an open-source profiling and tracing tool developed by AMD Research. Its design emphasizes comprehensive tracing of both CPU and GPU execution via dynamic binary instrumentation and sampling. Omnitrace can attach to unmodified applications and instrument function calls (including HIP and OpenCL API calls) at runtime. Under the hood, it employs a dynamic binary instrumentation framework to insert probes into the target program’s functions (the documentation suggests it may use tools like Dyninst or similar). Users can configure which functions or modules to instrument prior to execution, allowing them to limit overhead by selecting only specific regions. At runtime, the instrumentation logs entry/exit events and timestamps, while a parallel call-stack sampling component periodically samples all threads to capture CPU execution hotspots. For GPU work, Omnitrace hooks into AMD’s ROCm stack: it uses ROCm’s roctracer callbacks to trace HIP runtime events and GPU kernel executions on AMD GPUs. (Currently, Omnitrace supports AMD GPUs via the HIP and HSA APIs only – NVIDIA GPUs are not yet supported, reflecting its AMD focus.) Collected trace data is written in the open Perfetto format (protobuf-based) which can be visualized in a Chrome/Perfetto web viewer. This choice of format enables rich timeline views in any browser (Perfetto UI). In addition, Omnitrace produces aggregated profiles (e.g. total time per function, call graphs) in JSON and text files. The JSON outputs integrate with analysis tools like Hatchet (which loads the data into pandas for custom analysis). Omnitrace’s architecture is highly extensible due to its open-source nature: users can modify it to support new events or use its user API to mark custom intervals. It effectively merges capabilities of a statistical profiler and a tracer, giving both high-level summaries and full execution traces. Because it instruments at the binary level, there is some overhead, but this is mitigated by letting users choose sampling frequency and which functions to instrument. Overall, Omnitrace provides a blueprint for building cross-platform profilers (it is designed to work on CPU-only apps as well, supporting any CPU vendor, while GPU support is currently AMD-specific).
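
As one illustration of that user API, an application can mark custom intervals that then appear in the Perfetto timeline. The sketch below assumes push/pop region entry points along the lines of those in Omnitrace's user header; the header path and function names are assumptions and should be checked against the installed omnitrace/user.h:

    /* Hedged sketch of marking a custom interval with Omnitrace's user API.
     * The header and function names here are assumptions based on the
     * documented user API, not verified signatures. */
    #include <omnitrace/user.h>

    void training_step(void)
    {
        omnitrace_user_push_region("training_step");   /* assumed: open a named interval */
        /* ... HIP kernel launches, host-side work ... */
        omnitrace_user_pop_region("training_step");    /* assumed: close the interval */
    }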

Rice HPCToolkit

HPCToolkit is a longstanding open-source performance analysis toolkit, notable for its ability to profile large HPC codes with low overhead. Its recent extensions support GPU-accelerated programs by combining CPU sampling with GPU activity tracing. The internal architecture of HPCToolkit uses a vendor-neutral GPU monitoring substrate. This substrate connects to vendor-specific drivers: on NVIDIA, it uses CUPTI; on AMD, it uses ROC-tracer; and for Intel GPUs, it wraps the Level Zero API. Essentially, HPCToolkit registers callback functions so that whenever the GPU runtime (CUDA, HIP, Level Zero, OpenCL) launches a kernel or performs a data transfer, HPCToolkit is notified. In addition, HPCToolkit installs its own intercepting wrappers for OpenCL and Level Zero to catch events on platforms where a built-in tracer isn’t available. All GPU events (kernel launch begin/end, etc.) are time-stamped and recorded in a GPU operations trace. Meanwhile, the CPU side of the application is typically profiled via sampling – HPCToolkit’s hpcrun uses timer or hardware-counter interrupts to sample call stacks of CPU threads, attributing costs to code regions. These two data sources are merged during analysis to correlate CPU call paths with GPU activity. HPCToolkit goes further by providing fine-grained GPU profiling: on NVIDIA GPUs, it can perform PC sampling on the GPU itself. NVIDIA hardware since Maxwell can periodically sample a warp’s program counter during execution. HPCToolkit leverages this via CUPTI to collect a statistical profile of GPU instruction addresses, which are later mapped back to source lines and functions (HPCToolkit performs binary analysis of CUDA CUBIN files to recover function names and line mappings). This yields a hierarchical profile of GPU code, identifying hotspots at the source level similar to how CPU sampling does. (On AMD GPUs, such fine-grained sampling is currently not available due to lack of hardware support, so HPCToolkit only traces kernel-level timings there.) The tool can also trace GPU memory allocation calls and other API usage by wrapping those APIs. Data collected by HPCToolkit is stored in a structured format (experiment directory with binary performance data files). Visualization is done via HPCToolkit’s dedicated viewers: HPCViewer for aggregated profiles and HPCTraceViewer for timelines. The design emphasizes post-mortem analysis – data is collected with low interference and later analyzed to produce call path profiles, mixing tracing and profiling. HPCToolkit’s open design allows adding support for new platforms (recent versions added Intel oneAPI Level Zero and even OpenMP offload tracing). Its strengths are in correlating GPU work back to CPU context (reconstructing a GPU calling context tree) and doing so in a vendor-agnostic way (e.g., presenting warps and wavefronts uniformly as abstract threads). One limitation is that configuring and processing the traces can be complex, and fine-grained GPU sampling (especially on NVIDIA) can perturb execution (earlier CUDA versions serialized kernels with PC sampling enabled, causing overhead). Nonetheless, HPCToolkit illustrates a hybrid approach: sampling + tracing, with heavy use of vendor interfaces but also static analysis to attribute performance metrics to code.

GPUprobe (eBPF-based CUDA Tracing)

GPUprobe is a recent open-source tool that demonstrates zero-instrumentation GPU monitoring using Linux eBPF. Instead of modifying the target application or requiring vendor libraries, GPUprobe uses dynamic tracing at the OS level. Internally, it attaches eBPF uprobes (user-level probes) to functions in the NVIDIA CUDA runtime library (libcudart.so). For example, it places probes on cudaMalloc, cudaFree, and cudaLaunchKernel to intercept when the application calls these functions. When such a function is invoked, the uprobe triggers a small BPF program in the kernel that can log parameters (e.g., size of an allocation) and timestamps. This data is stored in a kernel buffer or queue. GPUprobe’s user-space component then retrieves events from this buffer periodically. The tool thus records GPU memory allocation patterns, kernel launch timings, and other CUDA API usage without any code changes to, or recompilation of, the target process. This approach leverages the Linux kernel’s ability to snoop on user-level function calls with minimal overhead – essentially inserting “micro breakpoints” via BPF. The overhead is low in relative terms because calls like cudaMalloc() or launching a kernel already have significant cost; the additional trap into the kernel and the logging (which BPF does efficiently) add only a small fraction of delay. In benchmarks, GPUprobe reports under 4% overhead for typical usage. Collected metrics can be displayed live or exported. GPUprobe includes a mode to output to stdout periodically and an HTTP Prometheus metrics endpoint (:9000/metrics) for integration with monitoring systems. In essence, GPUprobe fills a niche between full-featured profilers and minimal monitoring: it doesn’t capture full GPU timeline details or hardware counters, but it gives developers quick visibility into memory leaks, allocation sizes, kernel invocation counts, and durations with zero intrusiveness. Being open-source (written in Rust with BPF C snippets), it can be extended to hook additional CUDA APIs or even other GPU libraries. Notably, GPUprobe is specific to CUDA (NVIDIA) and requires Linux with BPF; its concept could theoretically be extended to AMD or other accelerators if similar user-level hooks and symbols are available. It’s a prime example of using modern kernel tracing (eBPF) for GPU observability without vendor-specific SDKs.
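
To illustrate the mechanism (this is a sketch of the technique, not GPUprobe's actual source), the BPF-side C program below attaches a uprobe to cudaMalloc and pushes each requested allocation size into a ring buffer for a user-space loader to drain; the struct, map layout, and symbol binding are illustrative assumptions:

    /* uprobe sketch (libbpf CO-RE style): log every cudaMalloc request size.
     * The user-space loader is expected to attach this program to the
     * cudaMalloc symbol in libcudart.so and poll the ring buffer. */
    #include "vmlinux.h"              /* generated kernel type definitions */
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    struct alloc_event {
        __u32 pid;
        __u64 size;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 1 << 20);
    } events SEC(".maps");

    SEC("uprobe")                     /* attached at load time to cudaMalloc */
    int BPF_KPROBE(on_cuda_malloc, void **dev_ptr, unsigned long size)
    {
        struct alloc_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
            return 0;
        e->pid  = bpf_get_current_pid_tgid() >> 32;   /* calling process        */
        e->size = size;                               /* requested bytes (arg 2) */
        bpf_ringbuf_submit(e, 0);
        return 0;
    }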

NVIDIA DCGM (Data Center GPU Manager)

DCGM is not a profiler for individual applications so much as a GPU telemetry and monitoring suite at the cluster/node level. Its architecture is service-oriented: DCGM runs as a background host engine service (nv-hostengine) or can be embedded in management applications via its libdcgm.so API. The core DCGM library interacts with NVIDIA driver interfaces to gather a wide range of metrics about GPUs – including utilization (SM busy %, memory usage), temperatures, power draw, clock speeds, memory errors, and even throughput stats like PCIe bytes transferred. Many of these metrics come from the same underlying counters as used by nvidia-smi (via NVML, the NVIDIA Management Library). DCGM aggregates this data for all GPUs on a server. It can be configured to record metrics periodically and trigger alerts or apply power caps based on policies. The open-core design means the core library and APIs are open source, while some diagnostic modules remain proprietary. DCGM’s host engine can accept commands via CLI (dcgmi) or serve data via the DCGM-Exporter, which integrates with Kubernetes. The DCGM-Exporter provides an HTTP endpoint that continuously exports GPU metrics in Prometheus format, allowing easy scraping and visualization in Grafana dashboards. This makes DCGM ideal for long-running monitoring in containerized environments. In terms of internal tracing, DCGM mostly uses polling of driver counters and event callbacks for certain health events (like double-bit ECC errors). It is less about tracing individual kernel launches and more about aggregated performance and health over time. For example, DCGM can report average SM occupancy or memory bandwidth of a job, but not which function caused a cache miss. It operates at a coarser granularity than tools like Nsight or HPCToolkit – often at timescales of seconds – and is designed to have very low overhead (“low-overhead, non-invasive health monitoring while jobs run”). Security-wise, DCGM can be run with limited privileges: by default, GPU performance counters that might leak details are restricted to admin, but configurations (like setting NVreg_RestrictProfilingToAdminUsers=0) can allow non-root DCGM use. In summary, DCGM’s architecture serves system administrators: it provides a unified API and service for GPU metrics collection across nodes, with integration points to cluster managers and cloud monitoring. It complements the per-application profilers by focusing on global GPU behavior and health rather than code-specific performance events.

TAU (Tuning and Analysis Utilities)

TAU is a portable performance profiling/tracing toolkit widely used in HPC, which has evolved to support GPU offloading in addition to CPU and network tracing. TAU’s architecture is modular: it can use either source-level instrumentation, binary rewriting, or interception wrappers to instrument an application, and it supports many programming models (MPI, OpenMP, Pthreads, etc.). For GPU support, TAU leverages existing performance tool interfaces: it supports NVIDIA GPUs via CUPTI, AMD GPUs via ROCm’s ROCprofiler or roctracer, and Intel GPUs via Level Zero tools. Specifically, TAU’s runtime library can subscribe to CUPTI’s callback API to track CUDA kernel launches, memory copies, etc., similar to HPCToolkit’s approach. It also supports the MPI Tools Interface (MPI_T) and Kokkos profiling hooks to correlate high-level parallel runtime events with GPU activity. TAU is often used by launching the program with a wrapper (tau_exec), which can automatically inject instrumentation. For example, tau_exec -cuda would enable CUPTI-based instrumentation of all CUDA API calls and kernels. TAU records timers for each function, so if a GPU kernel launch is intercepted, TAU will log the time spent in that launch call on the CPU as well as the kernel execution time on the GPU (through CUPTI’s asynchronous activity records). The tool can also track data transfer sizes and timings (e.g. cudaMemcpy). Beyond explicit instrumentation, TAU supports event-based sampling as well. It can configure hardware counters or timers to periodically sample the program counter, and then use binary debugging info (DWARF via libdwarf/BFD) to map those samples to source lines. This is useful to profile codes without function instrumentation or to capture system-level anomalies. TAU stores profile results in either a packed binary format or a database (it has a PerfDMF framework for data management). Visualization can be done with TAU’s ParaProf GUI (which shows tables of metrics per function, timelines per thread, call graphs, etc.), or by converting traces to formats like OTF2 for use with tools such as Vampir. TAU is extensible – users can use the TAU API to mark custom events or phases in their code, which then show up in profiles. Plugin-like functionality exists in that TAU can interface with new back-end measurement sources relatively easily (the TAU team added support for new GPUs and even OMPT without changing the user-facing API). TAU’s design philosophy is to intercept as much as possible through standard tool interfaces (CUDA/CUPTI, ROCm tools, MPI PMPI, OpenMP OMPT) so that it doesn’t require modifying the target code. This makes it a multi-platform “glue” profiler but also means TAU’s capabilities on a given platform are bounded by what the platform’s APIs expose. For instance, TAU can trace time spent in GPU kernels and data moves, but for in-kernel details on NVIDIA it would rely on CUPTI metrics (which TAU can collect similarly to Nsight Compute, though TAU typically focuses on timelines and aggregate times rather than every hardware metric). In summary, TAU’s internal architecture combines instrumentation wrappers (automatically inserted) with sampling, and unifies the data in a performance database. It provides a holistic view across CPU, GPU, and MPI, which is valuable in HPC settings where an application spans all those layers.
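
Beyond the automatic wrappers, TAU's user-level API lets developers bracket logical phases explicitly so they appear as named timers in ParaProf next to the intercepted MPI/CUDA events. A minimal sketch (built with TAU's compiler wrappers or flags) might look like:

    /* Hedged sketch of TAU's user-defined timer API. */
    #include <TAU.h>

    void solver_iteration(void)
    {
        TAU_START("solver_iteration");   /* user-defined timer begins */
        /* ... application work: kernel launches, MPI exchanges ... */
        TAU_STOP("solver_iteration");    /* timer ends */
    }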

(Other tools: In addition to the above, there are vendor-specific GPU profilers such as AMD’s Radeon GPU Profiler (RGP) for low-level AMD GPU traces, Intel’s VTune Profiler and Graphics Performance Analyzers for Intel GPUs, and research tools like rocProf (the ROCm command-line profiler on AMD). These generally follow similar patterns: e.g. RGP uses driver-level instrumentation and GPU shader performance counters to produce a timeline of graphics/compute waves, VTune uses Intel’s drivers to trace Level0 or OpenCL kernels on Intel GPUs. Due to space, we focus on the major tools above, which cover a range of implementation strategies.)

APIs and Runtimes Traced by These Tools

Profiling tools achieve coverage by hooking into various layers of the software stack. The list below summarizes which runtimes, libraries, or driver APIs each tool can trace:

  • CUDA Driver/Runtime (NVIDIA): Nearly all tools (Nsight Systems/Compute, HPCToolkit, TAU, GPUprobe) support CUDA. Nsight and TAU use CUPTI to intercept CUDA calls, HPCToolkit uses CUPTI for activities, GPUprobe hooks libcudart via uprobes. Open-source Omnitrace currently does not support CUDA (focusing on AMD), and DCGM monitors at a higher level via NVML (it sees overall CUDA usage but not individual calls).
  • HIP/ROCm (AMD): Omnitrace and TAU support AMD’s HIP runtime via ROCtracer/ROCprofiler. HPCToolkit also uses ROC-tracer for AMD GPU kernels. These allow tracing of HIP kernel launches, memory copies, etc. Tools like Nsight (NVIDIA) do not support AMD APIs. AMD’s RGP and rocProf of course target ROCm/HIP natively.
  • OpenCL: HPCToolkit has function wrappers to intercept OpenCL API calls for GPUs that lack a better interface. TAU also supports OpenCL tracing. Nsight Systems can trace OpenCL on NVIDIA platforms (through CUPTI if the OpenCL implementation routes through CUDA driver). Omnitrace likely can handle OpenCL on AMD via its binary instrumentation (if configured to instrument those calls).
  • Vulkan/OpenGL (Graphics APIs): Nsight Systems and NVIDIA tools can trace GPU work originating from graphics APIs (showing GPU work such as Vulkan launches on the timeline), but our focus is compute. Tools like RGP target graphics as well. In general, HPC-focused tools (HPCToolkit, TAU) do not trace graphics APIs by default.
  • Intel oneAPI Level Zero / DPC++: HPCToolkit and TAU have added support for Intel GPU offload. HPCToolkit wraps the Level Zero (oneAPI) API to catch kernel submissions. TAU supports oneAPI through the OMPT (OpenMP offload) interface and Level Zero interceptors. Intel’s VTune uses a driver to trace these as well, but third-party support is emerging.
  • Driver-level Hooks: Some tools integrate at the driver level: DCGM interfaces with the GPU driver via NVML for metrics, and possibly directly with kernel driver for things like throttling or health events. Most tracing of individual ops is done at the user API level (driver-level instrumentation by third-parties is rare since GPU drivers are proprietary closed-source, except for AMD’s open driver where ROCtracer plugs in).

In summary, NVIDIA’s CUPTI is a common interface used by Nsight, HPCToolkit, and TAU (and formerly by nvprof and the Visual Profiler) to trace CUDA at the driver and runtime API levels and to collect counters. AMD’s ROCm tools (rocTracer/rocProfiler) play a similar role for HIP/HSA. Intel’s Level Zero is newer; support is available in HPCToolkit and TAU. Tools also cover CPU threading libraries (pthreads, OpenMP via OMPT, MPI via PMPI/MPI_T) to correlate CPU-side phases with GPU activity.

Tracing on CPU vs GPU and Correlation of Events

An important aspect of these profilers is whether they capture events on the CPU side, the GPU side, or both – and how they merge these timelines. CPU-side tracing typically involves recording when the application thread invokes certain API calls or when certain functions execute. For example, instrumentation-based tools like TAU or Omnitrace will log a timestamp when a CUDA kernel launch function is called on the CPU. Sampling-based tools like Nsight Systems or HPCToolkit also record CPU-side samples (which can indicate where the CPU was busy or idle). GPU-side tracing means capturing events that occur on the GPU device, such as the actual execution interval of a kernel on the GPU, or other GPU engine activities (memory copies on GPU engines, etc.).

Most tools rely on vendor support to get GPU-side event timing. For instance, CUPTI on NVIDIA provides timestamps for kernel start and stop on the GPU, which Nsight, HPCToolkit, and TAU use. Similarly, on AMD, ROCm’s activity callbacks give GPU kernel durations. By collecting these, the tools can plot actual GPU utilization timelines. GPU hardware often has a separate clock domain, but the profiling interfaces ensure timestamps can be translated to the host clock (often by sampling a common clock at start of profiling).

Correlation is then achieved by merging CPU and GPU event streams by time. Tools like Nsight Systems do this in their timeline: one can see a CPU thread launching a kernel and later (on the same timeline) the GPU activity occurring, aligned by timestamps. HPCToolkit takes a different approach by correlating via call context: it records that a particular CPU call path launched a kernel X, and in the profile attributes the kernel’s GPU time to that call path, effectively linking the GPU time into the CPU calling context tree. This answers questions like “which CPU function caused this GPU work and how long did it take?”. TAU similarly attributes GPU kernel time to the CPU code region that launched it (with features like TAU_CALLSITE to pinpoint the source line).

Handling context switches and concurrency is another challenge. Nsight Systems explicitly traces context switch events on CPU (via perf or kernel tracepoints) so it can show which thread was running vs preempted. When correlating with GPU events, tools must consider that GPU kernels run asynchronously: e.g., the CPU might enqueue a kernel and continue, while the GPU executes it later. Profilers often include stream synchronization info so they can show GPU work still in progress after the CPU dispatch.

If multiple CPU threads launch work to multiple GPU streams, correlation ensures the correct matching. Unique IDs (like CUDA stream IDs or HIP queue IDs) are used internally to tag events. For example, CUPTI will report a kernel launch with an ID that corresponds to a specific CPU thread and stream, allowing the tool to match it with the CPU-side launch API call record.
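
A simplified sketch of this matching step is shown below. The record layouts are modeled loosely on CUPTI's activity records (they are not the real structs); the join on the shared correlation ID is what lets a tool compute, for each kernel, how long it sat in the queue between the CPU-side launch call and its start on the GPU:

    /* Conceptual correlation-ID join between CPU-side API records and
     * GPU-side kernel records (simplified; timestamps in nanoseconds). */
    #include <stdint.h>
    #include <stdio.h>

    struct api_record    { uint32_t correlation_id; uint64_t cpu_start_ns, cpu_end_ns; };
    struct kernel_record { uint32_t correlation_id; uint64_t gpu_start_ns, gpu_end_ns; };

    void report_queue_delay(const struct api_record *api, size_t n_api,
                            const struct kernel_record *krn, size_t n_krn)
    {
        for (size_t i = 0; i < n_krn; i++)
            for (size_t j = 0; j < n_api; j++)
                if (krn[i].correlation_id == api[j].correlation_id)
                    printf("corr %u: queued %llu ns, ran %llu ns\n",
                           (unsigned)krn[i].correlation_id,
                           (unsigned long long)(krn[i].gpu_start_ns - api[j].cpu_end_ns),
                           (unsigned long long)(krn[i].gpu_end_ns - krn[i].gpu_start_ns));
    }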

In summary, advanced profilers trace both CPU events (function calls, thread scheduling, etc.) and GPU events (kernel execution intervals, device memory transfers). They align these by time and metadata, enabling a coherent view. This is critical to identify pipeline delays (e.g., if CPU is waiting for GPU or vice versa). Without GPU-side data, one might only know when the CPU launched a kernel, not when it finished on the GPU. Conversely, GPU-only traces would lack the context of what the CPU was doing. Thus, all the reviewed tools use some form of combined tracing: for instance, HPCToolkit’s profile shows combined CPU+GPU call paths, and Nsight Systems uses a timeline covering both CPU cores and GPU timelines in parallel. Ensuring the clocks are synced (CUPTI handles this for CUDA) and handling asynchrony is a key part of their implementation.

Instrumentation vs. Sampling: Strategies for Data Collection

Profiling tools generally employ one (or a mix) of two strategies to collect performance data: instrumentation (event tracing) or statistical sampling. Each strategy has implications for overhead, accuracy, and intrusiveness.

  • Instrumentation-based Profiling: This involves inserting hooks or probes to record events of interest each time they occur. For example, wrapping a function with start/stop timers (as TAU does), or using CUPTI callbacks that fire on every kernel launch. Instrumentation can be done at compile-time (inserting code in the source or binary), runtime injection (using LD_PRELOAD to intercept library calls, or dynamic binary instrumentation like Omnitrace), or at the OS level (using kprobes/uprobes as GPUprobe does). The advantage is complete information: every event is logged, giving exact counts and timings. The disadvantage is overhead can be high if events are frequent. For instance, instrumenting every function call in a fine-grained application can slow it dramatically. Tools like TAU and Omnitrace use instrumentation to get detailed traces (Omnitrace through binary instrumentation, TAU through wrapper libraries and compiler-based instrumentation). GPU driver APIs are often instrumented via official callback interfaces (CUPTI, ROCtracer) to trace GPU events without modifying vendor code. Instrumentation can also use dynamic runtime methods: e.g., LD_PRELOAD is used to intercept malloc or CUDA calls by overriding symbols at load time; binary rewriting can permanently add probes in the code; and uprobes/kprobes allow inserting breakpoints at function entry (GPUprobe attaches at runtime to cudaLaunchKernel etc. using uprobes). Each approach has trade-offs in flexibility vs. performance.

  • Sampling-based Profiling: This technique gathers information at intervals (time-based or event-based) rather than logging every event. For CPU, this often means using hardware timers or performance counter overflow interrupts to periodically sample the program counter (PC) and call stack. For GPU, sampling can mean periodically querying utilization or, in NVIDIA’s case, using PC sampling on GPU as described. Nsight Systems uses timed sampling for CPU threads (e.g., capturing stack traces at intervals to statistically infer hot code). HPCToolkit heavily relies on sampling for CPU (and optionally uses GPU PC sampling for fine-grained GPU profiles). The advantage of sampling is much lower overhead – e.g., capturing 1000 samples per second imposes negligible slowdown, yet provides statistical coverage of where time is spent. It also automatically filters out very fast events (if a function executes 100ns, it might never be sampled, effectively focusing on major costs). The downside is reduced precision: you may miss infrequent events or introduce sampling error. Also, for event timings (like exact kernel duration), sampling is not suitable – it’s better for aggregate metrics like “which code consumed how many cycles on average.”

  • Hybrid Approaches: Many tools combine both. For example, Nsight Systems instruments GPU events (for precise timing of kernels) but samples CPU activity (to reduce overhead). HPCToolkit instruments at a coarse level (GPU kernels) and uses sampling for everything else. Omnitrace instruments function entry/exit and also does sampling of call stacks. This hybrid strategy allows capturing key events while keeping overhead manageable.

In modern GPU profiling, event tracing is typically used for GPU operations because the GPU work is often sparse (kernels, memcopies) and well-defined by the runtime APIs – it’s feasible to log each kernel launch. Meanwhile, sampling is often used for CPU computation or within long GPU kernels to understand their internal behavior (like PC sampling inside a kernel, which essentially samples an event within the GPU execution). A special case is Hardware counter sampling: some tools set hardware PMUs to sample specific events, e.g. counting cache misses and sampling an instruction address on overflow (this merges both worlds: instrumentation to set up the counter, sampling to catch events).

Summary of strategies: If a developer needs exact event timelines and counts, instrumentation is preferred (used by tracing tools, timeline visualizers, etc.). If needing low overhead profiling of large applications to find hotspots, sampling is preferred (used by HPC profilers, statistical profilers). Many tools provide modes for both. The implementation must manage the enabling/disabling of instrumentation carefully (e.g., support for selective instrumentation—Omnitrace allows configuring which functions to instrument, TAU allows toggling instrumenting MPI, I/O, GPUs via flags). This helps target the analysis and reduce data volume.
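
To make the sampling strategy concrete, the sketch below shows the classic timer-driven call-stack sampler that CPU-side profilers build on: a SIGPROF timer interrupts the program periodically and the handler captures the current stack. (A real profiler would append the frames to a lock-free buffer and resolve symbols later; printing from a signal handler is only for illustration.)

    /* Minimal timer-based sampling sketch (Linux/glibc): SIGPROF fires every
     * 10 ms of CPU time and the handler records the interrupted call stack. */
    #define _GNU_SOURCE
    #include <execinfo.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    static void on_sample(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);           /* capture the current stack */
        fprintf(stderr, "sample: %d frames\n", n);
        (void)sig;
    }

    int main(void)
    {
        signal(SIGPROF, on_sample);

        struct itimerval t = { {0, 10000}, {0, 10000} };  /* 10 ms interval and initial delay */
        setitimer(ITIMER_PROF, &t, NULL);

        volatile double x = 0;                   /* busy loop to generate samples */
        for (long i = 0; i < 200000000L; i++) x += i * 0.5;
        return 0;
    }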

Use of Hardware Performance Counters (PMUs)

Hardware Performance Monitoring Units (PMUs) are essential for collecting low-level metrics on both CPUs and GPUs. Profiling tools tap into these counters to measure events like instruction counts, cache misses, memory throughput, etc., which can’t be obtained by high-level timing alone.

CPU PMUs: Most CPUs expose counters via interfaces like Linux perf. Tools like HPCToolkit and TAU can use the PAPI library to configure CPU counters (cycles, cache misses, FLOPs, etc.) and sample on overflow. For instance, HPCToolkit’s command hpcrun -e PAPI_TOT_CYC@500000 -e PAPI_L2_TCM@200000 would sample on total cycles and L2 cache misses after certain counts. By using such counters as sampling triggers, the tools attribute hardware events to code locations. TAU also integrates with PAPI, and can record counters per function invocation if instrumented (adding up counts inside each region). In kernel-level analyses (like OS scheduling), one might use CPU PMUs via ftrace or BPF too, but that’s less common for the user-facing profilers.
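
A hedged sketch of the aggregate-counting style of PAPI use is shown below; sampling-on-overflow would instead register a handler with PAPI_overflow, and event availability depends on the CPU:

    /* Count total cycles and L2 cache misses around a region using PAPI. */
    #include <papi.h>
    #include <stdio.h>

    int main(void)
    {
        long long counts[2];
        int evset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&evset);
        PAPI_add_named_event(evset, "PAPI_TOT_CYC");   /* total cycles    */
        PAPI_add_named_event(evset, "PAPI_L2_TCM");    /* L2 cache misses */

        PAPI_start(evset);
        volatile double x = 0;                         /* region of interest */
        for (long i = 0; i < 10000000L; i++) x += i;
        PAPI_stop(evset, counts);

        printf("cycles=%lld  L2 misses=%lld\n", counts[0], counts[1]);
        return 0;
    }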

GPU PMUs: Modern GPUs have extensive performance counters (for each Streaming Multiprocessor on NVIDIA, or each Compute Unit on AMD). Access to these is through vendor libraries:

  • NVIDIA provides CUPTI/Event and Metric APIs that allow tools to list available counters (e.g., “dram_read_transactions”, “active_warps”) and read them. Tools like Nsight Compute program these counters and read them after kernel execution. Some counters can be read after a kernel (giving aggregated counts for that kernel) using CUDA’s profiling modes. Others might be sampled continuously (NVIDIA supports a “streaming performance monitor” mode for continuous collection, but that is typically internal).
  • AMD provides a similar interface via ROCprofiler or the GPUPerfAPI (GPA) for performance counters. Radeon GPU Profiler (RGP) and rocprof can gather counters for AMD GPUs, such as wavefront occupancies, cache hits, etc. Omnitrace likely uses ROCprofiler under the hood (since it’s built for AMD) to get counters in addition to trace callbacks.
  • Intel GPUs (Xe) expose counters through their VTune or Metrics Discovery API, which HPC tools are only beginning to support. HPCToolkit mentions support for Intel GPU metrics via Level Zero in a limited fashion (perhaps linking to Intel’s GT-Pin or using the oneAPI VTune backend).

In-Kernel Counters and PMUs: Some profiling approaches also use OS-level counters. For example, Linux has tracepoints for GPU scheduler events (on some open-source drivers) that could be used to track context switches on GPU engines. Also, NVIDIA’s driver tracks things like memory page migrations, NVLink usage, etc., which DCGM can retrieve. DCGM uses a subset of counters for health (like memory ECC counts) and throughput (PCIe bytes) not typically exposed to app profilers.

PMU Data Usage: The raw counter values often need post-processing to be meaningful:

  • Derived metrics: Many tools provide computed ratios or percentages. E.g., SM occupancy is derived from “active warps” vs. “theoretical max warps” counters. Cache hit rate is derived from cache hit and miss counters. Nsight Compute explicitly computes such metrics and even offers guided analysis rules (like if a certain stall reason counter is high, suggest optimizations). HPCToolkit presents raw metrics in a neutral way (counts or percentages) but doesn’t focus on guided hints.
  • Sampling vs. aggregate: When counters are sampled (like setting a counter to trigger an interrupt), you get sample-based attribution (like “20 samples of L2 misses fell in function X”). When counters are aggregated per kernel (as Nsight Compute does), you get exact counts for that kernel but have to attribute to the whole kernel invocation. Some tools might break down within a kernel by source line using PC sampling combined with counters (e.g., correlate PC samples with an “issued instructions” count metric to estimate how many instructions were executed in each part of the code).

Overheads: Accessing GPU counters often requires exclusive control of the GPU’s performance monitor hardware, which is why only one profiling session can run at a time. On NVIDIA, if a counter profiler (Nsight Compute or CUPTI metrics) is running, another cannot, and by default only privileged users can use counters because of potential side-channel leaks. AMD’s ROCm might allow more concurrent use but still has overhead. Tools have to stagger counter reads to avoid perturbing the GPU pipeline (hence kernel replays for Nsight Compute ensure the original execution isn’t slowed by reading too many counters at once).

In summary, PMUs are the foundation for low-level metrics. Tool implementers work with these via vendor APIs (CUPTI, ROCprofiler, etc.), which abstract away direct register fiddling. In some cases, open-source tools have reverse-engineered parts of this: e.g., reading NVIDIA GPU performance counters via undocumented perf events – but generally using official APIs is the norm due to complexity. The use of PMUs enables insights into memory bandwidth (by counting bytes transferred), execution efficiency (issued vs. retired instructions), stalls (hardware can count stall reasons by category on NVIDIA since Volta), and more. These metrics greatly enrich a profiler’s output beyond simple timings, at the cost of more complex data collection.
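
To make the derived-metric step concrete, the sketch below turns raw counter values into the occupancy and cache-hit-rate figures discussed above; the counter names and numbers are generic placeholders rather than any specific vendor API:

    /* Raw counters -> derived metrics, following the formulas in the text. */
    #include <stdio.h>

    int main(void)
    {
        /* Example raw values as a counter API might report them (made up). */
        double active_warps_per_cycle = 38.4;   /* averaged over the kernel */
        double max_warps_per_sm       = 64.0;   /* hardware limit per SM    */
        double l2_hits   = 9.1e6;
        double l2_misses = 2.3e6;

        double achieved_occupancy = active_warps_per_cycle / max_warps_per_sm;
        double l2_hit_rate        = l2_hits / (l2_hits + l2_misses);

        printf("achieved occupancy: %.1f%%\n", 100.0 * achieved_occupancy);
        printf("L2 hit rate:        %.1f%%\n", 100.0 * l2_hit_rate);
        return 0;
    }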

Hooks and Techniques for Instrumentation

To implement instrumentation, tools may hook into different layers of the software stack. Here we outline common techniques:

  • User-Mode API Wrapping: Many tools create wrapper libraries that intercept calls to standard APIs. Examples include TAU’s intercept of MPI (via PMPI), or wrapping CUDA runtime calls. This can be done by providing a library that defines the same symbols (e.g., a cudaLaunchKernel that calls the real one after logging) and using LD_PRELOAD or link order to ensure the wrapper is used. HPCToolkit’s substrate has function wrappers for OpenCL and Level Zero, which likely operate this way. This approach is straightforward and requires no special privileges, but you must implement a wrapper for each function of interest. It’s used for high-level events (e.g., log that a kernel was launched, then call through to actual launch).

  • Compiler/Source Instrumentation: TAU can instrument source code or use compiler flags (like using -finstrument-functions or using Clang/OMPT for OpenMP). This compiles probes into the program. The advantage is rich context (you can instrument at loop or function entry with minimal overhead for timers), but it requires recompiling or at least adding an extra compilation step.

  • Dynamic Binary Instrumentation (DBI): Tools like Omnitrace perform DBI, meaning they inject instrumentation into the binary at runtime or ahead-of-time without source code. Frameworks for DBI include Dyninst, Intel PIN, or DynamoRIO. Omnitrace is described as doing dynamic instrumentation for functions. This is powerful because it can instrument even third-party library code (for which source is not available) and be configured on the fly. The cost is complexity and some overhead due to code patching and running under a DBI environment.

  • Kernel Probes (kprobes) and eBPF: For system-wide or kernel-level events, profilers can use kprobes (instrument kernel functions) or tracepoints. For example, to measure GPU driver events, one could insert kprobes in the GPU driver if you know the symbol (not common outside debugging). More practically, eBPF is used by GPUprobe to attach to user-level functions (uprobes) and could also attach to kernel tracepoints (like scheduling or GPU interrupts). eBPF’s advantage is it’s dynamic (no app rebuild) and safe to run in production (sandboxed in kernel). We saw GPUprobe’s use of uprobes; similarly, one could attach kprobes to, say, the GPU scheduler’s job submission function to know when a kernel is scheduled on hardware. Few profiling tools currently use eBPF aside from GPUprobe, but this technique is gaining interest for low-overhead instrumentation of both kernel and user events. For example, one might imagine an eBPF-based profiler that samples GPU utilization via polling a device file every few ms, or traces OS events that affect GPU (like GPU interrupts, DMA completion events).

  • Driver/Kernel Instrumentation Hooks: Some vendors allow inserting hooks at the driver level. NVIDIA’s CUPTI essentially registers hooks inside the CUDA driver – NVIDIA built this in for profilers. AMD’s ROCtracer does similarly by interfacing with the ROCm driver stack. If source is available (AMD’s runtime is open source), one could even modify or extend it with custom probes, but generally the provided hooks suffice. On CPUs, the OS provides tracepoints (like context switch events, system call events) that tools like Nsight Systems use to track scheduling. These are essentially instrumentation points in the kernel (e.g., sched:sched_switch tracepoint). Tools may enable these via the perf subsystem or ftrace interface.

  • GPU Instruction Injection: In some cases, instrumentation can be inserted into GPU code itself. This is not common in commercial tools (due to risk of altering behavior), but research tools or certain debug modes do this. For example, one could modify PTX (CUDA assembly) to call a counter increment at certain points. In practice, NVIDIA favors hardware PC sampling (via CUPTI) over explicit instruction injection, although in principle instrumentation could be inserted via warp-trap instructions. AMD’s shader profiling might insert “wave timestamps” in code. Typically, however, vendors avoid modifying user kernels for profiling, instead running them under special modes (like single-warp replay or serializing execution for analysis).

  • Uprobe/Uretprobe for arguments/returns: As seen with GPUprobe, sometimes both entry and exit of a function need to be instrumented to get arguments and return values. Uretprobes catch the function return, but since registers may be clobbered, a combination of an entry uprobe (to stash arguments) and a return probe is used. This technique is needed for measuring things like allocation sizes (argument at entry) and allocated pointer (return at exit) without altering the function.

  • Event Buffers and Async Logging: Instrumentation often logs data into a buffer to minimize interference. CUPTI, for instance, writes activity records into a buffer which the profiler reads later. eBPF writes events into a BPF ring buffer or map, consumed by user-space asynchronously. This decoupling is critical for performance: the instrumented function triggers a quick log write (e.g., just drop an event in a lock-free queue) and then continues, while another thread or process handles heavy work like writing to disk or network.

Overall, implementing instrumentation is about carefully choosing where to intercept (user API, library, or kernel), how to inject the probe (source vs binary vs dynamic), and how to handle the data (immediate print vs buffered). The tools we discussed illustrate these choices: e.g., TAU uses mostly user API wrapping, Omnitrace uses DBI, GPUprobe uses kernel uprobes, Nsight relies on driver-provided hooks, HPCToolkit uses a mix of driver hooks and its own wrappers.
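
As a concrete illustration of the user-mode wrapping approach, the sketch below interposes on cudaMalloc via LD_PRELOAD: the wrapper logs the call, then forwards it to the real symbol located with dlsym(RTLD_NEXT, ...). The int return type stands in for cudaError_t so the sketch needs no CUDA headers; build it as a shared object and set LD_PRELOAD to its path before launching the target:

    /* LD_PRELOAD interposer sketch for cudaMalloc (compile with -shared -fPIC -ldl). */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef int (*cuda_malloc_fn)(void **devPtr, size_t size);  /* int stands in for cudaError_t */

    int cudaMalloc(void **devPtr, size_t size)
    {
        static cuda_malloc_fn real = NULL;
        if (!real)
            real = (cuda_malloc_fn)dlsym(RTLD_NEXT, "cudaMalloc");  /* find the real symbol */
        if (!real)
            return 1;                                               /* bail out if lookup failed */

        fprintf(stderr, "[wrap] cudaMalloc(%zu bytes)\n", size);    /* log the event */
        int rc = real(devPtr, size);                                /* forward the call */
        fprintf(stderr, "[wrap] -> ptr=%p rc=%d\n", *devPtr, rc);
        return rc;
    }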

GPU Instruction-Level Tracing Techniques

Profiling at the level of individual GPU instructions or warps is extremely challenging due to the volume of operations and limited observability. However, some approaches exist:

  • GPU PC Sampling: As mentioned, NVIDIA GPUs support periodic sampling of the program counter on each SM. This is analogous to CPU PC sampling and provides a statistical view of which instructions (or source lines) are executing frequently on the GPU. HPCToolkit uses this to attribute time or cycles to lines in GPU kernels. The CUPTI PC sampling mechanism can also attribute stall reasons for sampled warps (e.g., whether the warp was stalled on memory, execution dependency, etc.) – giving insight into why performance is lost. This effectively provides instruction-level insight (which parts of the code are hotspots, and what stalls they incur). The overhead can be non-trivial and early implementations serialized execution, but newer GPUs handle PC sampling more efficiently by offloading sample collection to hardware buffers.
  • Hardware Pipeline Statistics: Rather than tracing every instruction, GPUs have counters for things like “number of warp instructions issued per cycle” or “number of stalled cycles on X reason”. Nsight Compute reads these to produce an instruction pipeline utilization analysis (e.g., showing IPC, issue stall breakdown). While not a literal trace, it’s an aggregated view at instruction-level categories. Some of these counters essentially bucket the cycles of warps into categories (memory, compute, etc.), which is almost like tracing the cause of pipeline bubbles over time.
  • Wavefront/warp trace debug modes: In debugging contexts, GPUs can be put into modes where a single warp runs to completion or logs certain events. AMD’s GPU PerfStudio had low-level modes, and NVIDIA has tools like Nsight Graphics that can capture per-draw call timing down to shader instructions (for graphics). But for general compute, such instruction traces are rarely used due to overhead.
  • Reverse-engineering and Undocumented Methods: Open-source developers sometimes reverse-engineer shader ISA to get at instruction info. For example, HPCToolkit’s developers wrote their own CUBIN analyzer to map instruction addresses to line numbers, since NVIDIA doesn’t provide source-line mapping for SASS. Similarly, on AMD, one might parse the code object (their ISA) to correlate with HSAIL or source. These are offline analyses but crucial for instruction-level profiling because the tool must attribute performance metrics to individual instructions or lines. Without vendor documentation, this involves a lot of custom tooling.

In summary, instruction-level tracing on GPU is done via smart sampling and counters rather than literal event-by-event tracing (which would overwhelm any system). The combination of PC sampling and specialized counters gives an approximation of an instruction trace, sufficient to highlight inefficient instructions or stalled portions of code. This level of detail is typically only needed when optimizing a specific kernel (hence tools like Nsight Compute or academic projects like GPU performance simulators). Most general-purpose profilers don’t emit a log of every instruction executed (that’s more the domain of simulators or very special debug modes).

Data Collection, Transfer, and Storage

The profilers gather large amounts of data – how they manage it is as important as the collection itself. Key considerations are: how data moves from GPU to host, how it’s buffered, what file formats are used, and how the user ultimately accesses it.

  • On-GPU to CPU Data Transfer: When collecting GPU performance data (counters or traces), the data must be transferred to the host. CUPTI’s Activity API, for example, accumulates records in a buffer and then flushes them to the host memory via driver callbacks. This typically happens after a kernel completes or when a buffer fills. For continuous metrics, DCGM or others periodically query the GPU via driver calls (which under the hood read registers over PCIe). The overhead of data transfer is mitigated by batching: profilers use large buffers and only occasionally interrupt the GPU to drain data. Some advanced usage might pin a buffer and have the GPU DMA out the data (like NVIDIA’s NVPW for streaming counters does something akin to that). But details are often abstracted by the vendor library.

  • Buffering and Streaming: Most tools implement a double-buffer or queue so that data collection and writing to disk don’t block the application. For example, HPCToolkit logs GPU events into an in-memory buffer and writes to a file after the run. Nsight Systems writes to a memory-mapped file incrementally (it can even stream to the UI if connected). GPUprobe uses a BPF queue that the user-space reads every few seconds. Streaming trace in real-time is tricky due to volume, but some tools (like certain debug modes or Intel GPA) allow live viewing by sending events over a socket to a GUI. In HPC, typically data is written to local storage for post-mortem analysis because the volume is high (multi-GB traces).

  • File Formats: There are a variety of formats:
    • Proprietary Binary: Nsight Systems and Compute use their own binary formats (.nsys-rep is actually a SQLite database in Nsight Systems; Nsight Compute .ncu-rep is a structured binary with sections). These are optimized for their GUIs and not intended to be hand-edited.
    • Open Trace Formats: Some open tools use standard formats like OTF2 (Open Trace Format) or CTF. TAU can export to OTF2 for example. These formats are designed for merging traces from multiple ranks in HPC.
    • Perfetto/Chrome JSON: Perfetto’s format (protobuf trace) or the older Chrome Trace Event format (JSON) is used by several tools because Chrome’s tracing infrastructure is a de-facto standard. PyTorch’s profiler, for instance, outputs a Chrome JSON trace that TensorBoard or chrome://tracing can display. Omnitrace writes Perfetto .proto files viewable in Perfetto UI. GPUprobe doesn’t produce a full timeline, but it could be extended to output a Chrome trace JSON of events.
    • Plain text/CSV: Many tools also generate summary reports in human-readable form. Nsight Compute CLI can output CSV of metrics per kernel. HPCToolkit produces text reports of top hotspots. DCGM’s dcgmi can print current metrics to console, and DCGM exporter outputs text for Prometheus.
    • Database: Some tools store data in a SQL or custom database. As mentioned, Nsight Systems uses SQLite under the hood; TAU’s PerfDMF historically used a DB for storing profile data for query. Using a database can facilitate querying specific metrics after the fact.

  • Export Paths and Telemetry: Tools differ in how you retrieve the data.
    • File-based workflow: e.g., run nsys profile ... which produces a file to open later; run hpcrun then hpcstruct/hpcprof to generate an analysis database.
    • Live UI: e.g., Nsight Systems can also be run with the GUI attached to a remote app, streaming data. Intel VTune can attach to a running process and periodically update stats in the UI.
    • Web Dashboard: Some modern profilers have web interfaces. For instance, NVIDIA launched Nsight Systems in a web app for their cloud, and Omnitrace results can be loaded into a Perfetto web viewer easily. DCGM’s Kubernetes integration is specifically for dashboards: DCGM exporter feeds Prometheus, and you visualize in Grafana or others.
    • Integration into TensorBoard: Tools aimed at ML (like PyTorch’s Kineto or TensorFlow’s profiler) export data that TensorBoard’s profiling plugin can display. These often use the Chrome trace format. As an example, PyTorch with Kineto can capture CUDA kernel timelines and display them in TensorBoard, effectively providing a Nsight-like view but within the ML tooling context. Underneath, Kineto uses CUPTI (so similar data as Nsight Systems) but converts to Chrome format.
    • Custom Telemetry APIs: A few tools allow programmatic access. DCGM has a C and Python API so you can query metrics from within your application or a script. CUPTI provides a programmatic way to start/stop tracing and get records (so one could embed CUPTI in an app to collect its own traces). Some HPC monitoring systems integrate profiling by hooking these APIs.

  • Data Volume and Reduction: A single run can produce enormous data (traces of every kernel launch, etc.). Tools implement filtering and levels of detail. Nsight Systems has options to limit collection (e.g., only trace CUDA, ignore OS scheduling if not needed). HPCToolkit can be configured to only trace GPU and not collect CPU samples, etc. They also have options to include/exclude specific kernels or API calls. This is important because writing out every event in a long run might be infeasible. Some tools, like Score-P (used with TAU in large-scale scenarios), allow switching between profiling (aggregating counts) and tracing (detailed event log) modes, because full tracing doesn’t scale beyond a point.

To sum up, data handling in these tools involves careful buffering of events, using efficient file formats (often binary or compressed), and providing ways to either visualize directly or export to other analysis frameworks. The trends are toward standardizing on formats like Perfetto/Chromium trace for timeline data and JSON/CSV for summary metrics to ease integration. Tools like Omnitrace explicitly embrace web visualization by producing Perfetto traces, while HPC tools remain more custom but offer conversion utilities. In cluster monitoring, streaming export (Prometheus via DCGM exporter) is key for real-time observability. So we see both post-mortem analysis workflows (HPC, desktop profiling) and live monitoring (datacenter, cloud) depending on the use case.
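
As a concrete illustration of the Chrome Trace Event format mentioned above, the sketch below writes two complete ("ph":"X") events that chrome://tracing or the Perfetto UI can open directly; the event names, IDs, and timestamps (microseconds) are invented for the example:

    /* Emit a tiny Chrome Trace Event JSON file with two complete events. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("trace.json", "w");
        if (!f) return 1;
        fprintf(f, "[\n");
        fprintf(f, "{\"name\":\"cudaMemcpy H2D\",\"ph\":\"X\",\"pid\":1,"
                   "\"tid\":1,\"ts\":1000,\"dur\":250},\n");
        fprintf(f, "{\"name\":\"vector_add_kernel\",\"ph\":\"X\",\"pid\":1,"
                   "\"tid\":2,\"ts\":1300,\"dur\":900}\n");
        fprintf(f, "]\n");
        fclose(f);
        return 0;
    }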

Visualization and Analysis Interfaces

The way profiling data is presented is crucial for deriving insights. Different visualizations help answer different questions:

  • Timeline Graphs: These are offered by Nsight Systems, Chrome/Perfetto trace viewer, HPCToolkit’s TraceViewer, etc. A timeline shows time on the horizontal axis and different activities (threads, GPU streams, network transfers) on the vertical axis. This is excellent for identifying concurrency issues and idle gaps. For example, one can see if the GPU is idle while the CPU does something (indicating a potential CPU bottleneck or insufficient work queued), or if CPU threads are idle waiting for GPU (maybe synchronization delays). In a timeline, you might see overlapping rectangles where one kernel is executed while another CPU thread runs – which is good utilization – versus large gaps which signal pipeline stalls. Dependency or event links can be drawn (Nsight shows arrows from a CPU launch event to the GPU kernel execution event it triggered), so you see relationships and any queuing delay. Timelines can also show context switches on CPU (with different thread colors before/after a switch), helping identify if too many context switches are hurting performance. Overall, timeline visualization yields insights like pipeline bottlenecks (e.g., GPU waiting on data from disk/CPU), overlapping computation/communication, and load balance across devices.

  • Flame Charts / Call Stack Graphs: A flame chart is typically produced from sampling data – it visualizes the call stack profile over time or aggregate. HPCToolkit’s HPCViewer and TAU’s ParaProf can display something akin to a flame graph: basically, which functions called which, and how much time was spent in each (often drawn as a stacked bar or “flame” where width corresponds to time). This helps identify hotspots in terms of call context: e.g., you can see that function foo() takes 30% of time, mostly when called from main() vs when called from init(). It also can expose recursion or expensive call chains. In GPU context, HPCToolkit can show a flame graph of GPU kernel execution attributed to the CPU call path that launched them – which is unique, as it connects device time into the flame graph. Flame charts (like those generated by Brendan Gregg’s flamegraph tool) are usually static visualizations of sample data; many performance engineers use them to quickly see where time is going.

  • Directed Acyclic Graphs (Dependency Graphs): Some tools (especially in a parallel runtime context) can visualize a task graph or dependencies. For example, in a pipeline of GPU operations, a graph view might show that Kernel A and B run in parallel, then feed into C. NVIDIA’s Nsight Compute doesn’t do this, but Nsight Systems implicitly shows dependencies via timeline. Other analysis tools might reconstruct a graph of computation and data movement (this is more niche, often done manually or via custom analysis, but not a primary feature of the listed tools except perhaps as part of visualizing asynchronous dependencies with arrows).

  • Scatter Plots: These can be useful when you have many similar events and want to see their distribution. For instance, plotting kernel execution time vs. kernel launch index can show if later kernels got slower (maybe due to thermal throttling or contention). Scatter plots might also show correlation between two metrics – e.g., one could plot achieved occupancy vs. execution time for kernels to see if low occupancy correlates with longer runtime. Tools don’t typically include scatter plots in their UI, but users export data to do such analysis (a small export-and-plot sketch appears after this list). An exception is some research/professional tools that let you plot one metric against another across kernels.

  • Heatmaps: A heatmap might be used to show, for example, GPU unit utilization over time (some graphics-oriented tools show per-SM utilization over time as a heatmap). Another use is visualizing memory access patterns – e.g., GPU memory addresses vs. time. These are specialized views not found in general profilers, but one could generate a heatmap of, say, cache miss rate over time slices to spot phases of computation. In the profiling literature, a “GPU utilization heatmap” could refer to showing how busy each engine was over time.

  • Metrics Dashboards: Some visualizations are simply tables or charts of metrics. Nsight Compute’s GUI shows a table of metrics and may highlight in red those that are problematic. DCGM with Grafana will show line charts of metrics over time (like GPU temperature or utilization %). Those help with monitoring and trend analysis (did a job saturate memory bandwidth? Did GPU utilization drop at some point?).
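Returning to the scatter-plot bullet above, the usual workflow is to export per-kernel statistics to CSV and plot them offline. Below is a minimal sketch assuming a hypothetical kernels.csv with name, start_ns, and duration_ns columns; the file and column names are illustrative, not any specific tool’s export schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export: one row per kernel launch with its name and duration.
df = pd.read_csv("kernels.csv")  # assumed columns: name, start_ns, duration_ns
df = df.sort_values("start_ns").reset_index(drop=True)

# Duration vs. launch index: a rising trend can hint at throttling or contention.
plt.scatter(df.index, df["duration_ns"] / 1e3, s=8)
plt.xlabel("kernel launch index")
plt.ylabel("duration (us)")
plt.title("Per-kernel duration over the run")
plt.savefig("kernel_durations.png", dpi=150)
```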

Each visualization provides different insights:

  • From a timeline, one can observe CPU–GPU desynchronization: e.g., the CPU launches a kernel then sits idle until the result comes back – indicating the CPU could have done other work or the kernel is a bottleneck. Or the opposite: the GPU is idle because the CPU isn’t feeding it (perhaps single-threaded CPU code causing GPU starvation). These scenarios show up clearly on a timeline as big gaps (a gap-finding sketch follows this list).
  • Pipeline bottlenecks: In deep learning, for instance, a timeline might show that data loading on CPU is slow, making GPU wait (common issue). Or if multiple GPUs, one might see one GPU finishes early and then waits at a synchronization – indicating load imbalance.
  • Memory bandwidth contention: If the profiler captures memory throughput, one might see it maxed out (near 100% utilization) whenever certain kernels run, explaining why those kernels don’t speed up further (memory-bound). DCGM and Nsight can report PCIe or NVLink usage; high usage there might explain delays if the job moves a lot of data. If multiple processes run, a timeline might even show overlapping data transfers that saturate the bus.
  • Cache misses and stalls: Tools like Nsight Compute list cache hit rates and stall reasons. They might not visualize them, but the report effectively highlights memory contention (e.g., a high L2 miss rate with many stall cycles on memory). A user can infer that memory latency is hurting performance from those metrics.
  • CPU thread issues: A flame graph or timeline may show excessive context switching or lock contention (if instrumented). Nsight Systems can show mutex wait times for CPU threads, helping find concurrency issues on CPU that indirectly slow GPU feeding.
  • Dependency graphs (if manually analyzed) could highlight a critical path in asynchronous work – e.g., maybe kernels could run concurrently but are scheduled sequentially due to a dependency that could be optimized.
  • Flame charts on GPU code (via PC sampling) can reveal which part of a kernel’s code is the hotspot. For example, HPCToolkit’s GPU PC sampling might show that 40% of time in a kernel is spent in a particular loop (source line X), hinting that that loop is memory-bound or compute-heavy and ripe for optimization.
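As referenced in the first bullet, the “big gaps” can be quantified instead of eyeballed. A minimal sketch, assuming a list of non-overlapping kernel (start, end) intervals in microseconds parsed from an exported trace (single-stream case; multi-stream traces would need interval merging first), with made-up values:

```python
# Quantify GPU idle gaps from kernel (start_us, end_us) intervals, e.g. parsed
# out of an exported timeline. The intervals below are invented and assumed
# non-overlapping (one stream); merge intervals first for multi-stream traces.
kernels = [(0, 850), (900, 1700), (5200, 6100), (6150, 7000)]
kernels.sort()

gaps = []
for (s0, e0), (s1, e1) in zip(kernels, kernels[1:]):
    if s1 > e0:  # GPU idle between two consecutive kernels
        gaps.append((e0, s1, s1 - e0))

busy = sum(e - s for s, e in kernels)
span = kernels[-1][1] - kernels[0][0]
print(f"GPU busy {busy}/{span} us ({100 * busy / span:.1f}%)")
for start, end, length in sorted(gaps, key=lambda g: -g[2])[:3]:
    print(f"idle gap of {length} us at [{start}, {end}] -- check what the CPU was doing there")
```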

In practice, many of these tools integrate multiple views. Nsight Systems has a timeline plus the ability to zoom in and see per-thread call stacks at an instant. TAU’s ParaProf can show both profile tables and a timeline (if tracing is enabled). HPCToolkit has separate viewers for the aggregate profile (a top-down call tree with metrics) and the event timeline (if one wants the fine detail). The combination is powerful: one might use a profile to find the main culprit function, then examine the timeline to see why it’s slow (perhaps concurrency issues), then use the hardware metrics view to see whether it’s compute- or memory-bound, etc.

Extensibility and Custom Instrumentation

Different tools offer varying degrees of extensibility – the ability to incorporate new event sources or define custom metrics and plugins:

  • User Markers and Annotations: Many profilers allow the user to mark sections of code. NVIDIA provides NVTX (NVIDIA Tools Extension) – a library where developers can annotate code regions or events. Nsight Systems will show these NVTX ranges on the timeline (useful for labeling phases of an app). Similarly, for CPU, there are annotation APIs like Intel’s ITT (Instrumentation and Tracing Technology) that some tools (like VTune or oneAPI tools) use; TAU has its own API (e.g., its timer start/stop macros). These are not extensions to the tool per se, but they allow the user to inject custom events that the tool will record and visualize. This is crucial for understanding high-level phases (like “Data Preprocessing” or “Training step”) on a timeline; a minimal NVTX sketch appears after this list.

  • Plugin Architecture: Some tools have a modular design where new modules can be added. For example, DCGM’s diagnostics are modular – one can add a new diagnostic plugin (though writing one may require NVIDIA’s cooperation, since the plugin interface is not fully open). HPCToolkit is not plugin-based as such, but being open source, users have added features; for example, support for a new GPU could be added by writing a new connector for its APIs. TAU is fairly extensible; it is part of the academic performance-tools community and aligned with standards like MPI_T and OMPT. If a new runtime comes out (say, a new tasking library), TAU could add an interface for it.

  • Defining New Events: In tools like TAU and HPCToolkit, you can instrument new events by either using the tool’s API (for an app developer) or by the tool maintainers adding wrappers for new library functions. For instance, when OpenACC and OMPT interfaces appeared, TAU quickly integrated those, meaning they effectively “taught” TAU to handle new event types (OpenACC offload events, OpenMP runtime events). Similarly, HPCToolkit adding Level Zero support was a matter of implementing new callback handlers and wrappers. Because HPCToolkit’s design is vendor-neutral, adding an API meant extending its substrate with another connector. Omnitrace, as an AMD research project, could potentially be extended to support CUDA by adding a CUPTI integration (if someone implemented it) – the framework is there (binary instrumentation and Perfetto output), it just needs event sources from NVIDIA.

  • Customization of Data Collection: Some profilers allow scripting or configuration of what to collect. Nsight Compute, for example, has “sections” and rules: advanced users can define custom metric sets, or even custom formulas for derived metrics, in its section files, and there is a mechanism for writing custom analysis rules. This is a limited form of extensibility targeted at metrics. Similarly, one can choose which counters to collect (reducing overhead). Tools like rocprof for AMD allow users to specify which counters and which API callbacks to enable via a config file.

  • Output Integration: Extensibility also means how easily the data can be used elsewhere. Tools providing standard formats (JSON, CSV, Prometheus, etc.) are effectively extensible because they allow plugging the data into custom pipelines (like a custom ML performance dashboard or automated regression tester). Omnitrace’s compatibility with Hatchet (an analysis library) means users can extend analysis by writing Python to filter or compare profiles (sketched after this list). TAU has an interface to merge profiles from multiple runs, which can be used to automate testing and analysis.

  • Limitations on Extensibility: Notably, closed-source tools like Nsight Systems/Compute do not support plugins – you cannot add a new type of event to Nsight beyond what NVIDIA provided. You are limited to NVTX markers for custom annotations. In contrast, open tools (HPCToolkit, TAU, Omnitrace, GPUprobe) allow you to modify or extend them, but it requires programming. GPUprobe, for example, could be extended to trace new CUDA functions or even other libraries by adding more BPF probes in its source.

  • Scripting and Automation: Some profiling frameworks (especially in HPC) support automated profiling experiments. For example, using TAU’s Python interface or scripting TAU commander to collect different metrics on different runs, then synthesizing results. This isn’t a plugin, but it shows the tool can be integrated into larger workflows (like auto-tuning loops or continuous integration to catch performance regressions).
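The NVTX sketch promised under User Markers and Annotations: a minimal example assuming NVIDIA’s nvtx Python package is installed (the C API offers equivalent nvtxRangePush/nvtxRangePop calls). The phase names, colors, and sleeps are placeholders for real work; the ranges simply appear as labeled bars on an Nsight Systems timeline when the script runs under the profiler.

```python
import time
import nvtx  # NVIDIA's NVTX bindings for Python (assumed installed via pip install nvtx)

# Phase names are arbitrary labels; they show up as named ranges on the
# Nsight Systems timeline when run under the profiler.
@nvtx.annotate("training_step", color="green")
def training_step(batch):
    time.sleep(0.01)  # stand-in for real GPU work

def main():
    for epoch in range(3):
        with nvtx.annotate(f"epoch_{epoch}", color="blue"):
            with nvtx.annotate("data_preprocessing"):
                time.sleep(0.005)  # stand-in for the input pipeline
            for batch in range(10):
                training_step(batch)

if __name__ == "__main__":
    main()
# Example invocation: nsys profile -t nvtx,cuda python annotated.py
```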

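And the Hatchet sketch referenced under Output Integration: a minimal example assuming an HPCToolkit measurement database as input. The path is a placeholder, and the reader functions and metric column names can vary by Hatchet version, so treat the specifics as assumptions.

```python
import hatchet as ht  # LLNL's Hatchet analysis library (packaged as llnl-hatchet on PyPI)

# Load a profile into a GraphFrame: a call graph plus a pandas DataFrame of
# per-node metrics. The database directory below is a placeholder.
gf = ht.GraphFrame.from_hpctoolkit("hpctoolkit-app-database")

# The DataFrame can be filtered and sorted like any pandas table; column names
# depend on the reader and the metrics collected, so inspect them first.
print(gf.dataframe.columns.tolist())
print(gf.dataframe.head())

# Render an indented call-tree view annotated with the default metric.
print(gf.tree())
```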
In summary, extensibility ranges from simple in-app annotations (for end users to mark events) to the ability to integrate new back-ends (for tool developers to support new hardware). The surveyed tools show that open frameworks and HPC tools prioritize flexibility (because they need to target evolving platforms), whereas vendor GUIs prioritize a polished experience within a fixed scope. For someone building a new profiler, leveraging standards (like CUPTI, ROCtracer, OMPT) is a good starting point because it immediately grants support for various events, and designing the tool to accept plugin modules or config files for new metrics can future-proof it.

Overhead and Intrusiveness Trade-offs

Profiling inherently perturbs the target program – the key is to minimize and understand this overhead. Different tools have different footprints:

  • Lightweight Sampling – Low Overhead: Sampling at a low frequency (e.g., 100-1000 samples per second) typically incurs under 5% overhead, often unnoticeable. HPCToolkit’s CPU sampling falls in this category, making it suitable for large HPC jobs where adding even 10% overhead would be too much. HPCToolkit explicitly aims for a few hundred samples/sec per thread to keep overhead reasonable. Similarly, Nsight Systems’ CPU sampling is designed to be very low overhead, using OS-level perf which is optimized for this use. The trade-off is that you might not catch very short-lived functions or get exact counts.

  • Function Instrumentation – Moderate to High Overhead: Wrapping or inserting code on every function entry/exit or every CUDA call can add significant time, especially if those functions are frequent. TAU’s overhead depends on how fine-grained the instrumentation is: instrumenting only outer loops or MPI calls is cheap, but instrumenting every C++ inline function could be huge. TAU allows choosing the level (e.g., you might not instrument every single GPU kernel launch, only the overall region, if trying to reduce overhead). Omnitrace’s dynamic instrumentation will have some startup overhead to instrument code and runtime overhead for each instrumented function call (they try to minimize it, but it’s there). The Omnitrace docs emphasize using it for overall execution profiling rather than fine-grained detail in every kernel – in fact, they recommend a different tool for deep insight into individual kernels – implying Omnitrace focuses on broad profiling rather than micro-optimizing one kernel.

  • GPU Trace Overheads: Simply tracing kernel events via CUPTI has low overhead – a few microseconds to log an event per kernel. If kernels are numerous (thousands per second), this can start to matter. CUPTI’s activity buffering can typically handle many events efficiently. But when you turn on more detailed GPU profiling (like counters or PC sampling), overhead increases. CUPTI’s PC sampling can degrade application performance, especially on older GPUs, where it prevented kernels from overlapping. Tools usually warn about this; e.g., Nsight Compute replays kernels because gathering all metrics in one pass would slow the kernel drastically or is impossible due to counter limits. That replay itself is overhead (often huge – 46x the runtime for one kernel in the earlier example). So Nsight Compute is intentionally intrusive – it sacrifices real-time performance for data.

  • Memory and Storage Overhead: Profiling can also increase memory usage (buffers) and produce large output files. If an application is memory-sensitive, large trace buffers might perturb cache usage or memory allocation patterns. For example, enabling detailed tracing might allocate hundreds of MB for buffers, possibly causing more cache misses for the application. Most tools let you configure buffer sizes to trade off lost events against memory overhead.

  • Perturbation of concurrency: Intrusive instrumentation can change scheduling. If profiling causes a function to take 2x longer, that could alter how threads interleave or how the GPU stream gets work. For instance, heavy instrumentation on CPU could feed work slower to GPU, making the GPU appear underutilized when in reality it’s the profiler’s overhead causing it. Good profilers try to quantify or bound their overhead to avoid misleading the user. Some provide an “overhead compensation” or at least measure how much time was spent in the profiling code itself (TAU sometimes includes metrics for “profiler overhead”).

  • Accuracy vs Intrusiveness: The more accuracy/detail you want, typically the more overhead you incur. E.g., logging every single memory allocation (as GPUprobe does) can slow down an application that allocates frequently, but GPUprobe found that in typical CUDA codes the relative overhead is small, because those calls are not extremely frequent and already take time. If someone ran a micro-benchmark that calls cudaMalloc thousands of times a second, GPUprobe’s overhead might become more noticeable (due to a context switch on each call). Similarly, sampling has an adjustable frequency: if you sample at 10 kHz on every CPU core, you might start seeing a few percent overhead or more, and you might also slightly perturb CPU turbo frequencies or power usage. Usually, a safe default is chosen (like 1 kHz).

  • Parallel Scalability of Overhead: In distributed runs (MPI with GPUs), one also considers that a profiling tool might increase communication or I/O overhead. E.g., if each rank tries to write a large trace file, the I/O load can slow things down. Some tools merge data to avoid too many outputs (Score-P merges traces, etc.), but that can cause synchronization overhead at program end. Overheads can multiply with scale if one is not careful, which is why sampling (with fixed overhead per process) is often preferred in HPC at scale.

Comparison of the tools:

  • Nsight Systems: Designed for low-overhead system profiling, typically <5% overhead with default settings (more if the CPU sampling frequency is increased or if tracing extremely many events). It’s generally acceptable to run in production for a short period.
  • Nsight Compute: Very high overhead (can be orders of magnitude) for target kernels, not meant for production runs – it’s a diagnostics tool for devs.
  • HPCToolkit: Very low overhead for sampling (suitable for long HPC runs). If GPU PC sampling is used, overhead on GPU could be moderate (maybe 2x slowdown for that kernel depending on sampling rate). But HPCToolkit’s default of only tracing kernels adds little overhead.
  • Omnitrace: Medium overhead – dynamic instrumentation of potentially many functions and sampling combined. It’s optimized for use on large apps but it’s still doing a lot. They likely achieve overhead in the low tens of percent for moderate instrumentation, but the user should choose narrower focus to keep overhead manageable.
  • GPUprobe: Very low overhead for what it does (<4%), because it only hooks a few slow calls. It’s explicitly touted as lightweight.
  • DCGM: Extremely low overhead. It’s running in the background and only polling metrics (which is usually negligible on GPU – reading a counter register isn’t free but it’s minor, and DCGM is optimized for minimal impact). Unless you set it to sample at an insanely high frequency, it won’t bother the GPU or CPU much. It’s safe for always-on monitoring.
  • TAU: Varies widely. If using only sampling or coarse instrumentation (MPI calls, high-level loops), overhead is low. If using automatic source instrumentation at fine granularity, overhead can be high (possibly 2x or more runtime). TAU provides many options to adjust this, like filtering out small functions from instrumentation.

Trade-off summary: For day-to-day performance engineering, one often starts with low-overhead broad tools (like Nsight Systems or HPCToolkit sampling) to find major issues, then uses higher-overhead targeted tools (like Nsight Compute or adding more instrumentation in TAU) on specific kernels or code regions to dive deeper. The implementer of a new tool must decide where on this spectrum to sit – a new always-on monitoring tool should be low overhead (perhaps sampling-based, e.g. streaming GPU performance counters), whereas a new deep-analysis tool can be intrusive but used sparingly. The user needs to be aware of the overhead and not confuse profiler-induced slowdowns with real behavior (which is why cross-checking with multiple tools or running at different sampling rates is good practice).
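One practical way to follow that advice is to time the same workload with and without the profiler attached. Below is a minimal sketch assuming a placeholder ./my_app workload and wrapping it with nsys profile as the example profiler; the repeat count and the use of wall-clock time as the metric are choices of this illustration.

```python
import statistics
import subprocess
import time

def timed_runs(make_cmd, repeats=3):
    """Run the command several times and return the median wall-clock seconds."""
    durations = []
    for i in range(repeats):
        start = time.perf_counter()
        subprocess.run(make_cmd(i), check=True)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

app = ["./my_app", "--iters", "100"]  # placeholder workload

baseline = timed_runs(lambda i: app)
# Same workload wrapped by Nsight Systems; a distinct report name per run
# avoids clobbering previous output files.
profiled = timed_runs(lambda i: ["nsys", "profile", "-o", f"overhead_check_{i}"] + app)

print(f"baseline:       {baseline:.2f} s")
print(f"under profiler: {profiled:.2f} s "
      f"({100 * (profiled - baseline) / baseline:+.1f}% overhead)")
```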

Security and Sandboxing Considerations

Profiling tools often need low-level access, which raises security and isolation concerns, especially on shared systems or in the cloud:

  • Privilege Requirements: As noted, accessing hardware counters on GPUs is often restricted to administrators by default (NVIDIA’s driver default NVreg_RestrictProfilingToAdminUsers=1 prevents non-root users from using CUPTI metrics or Nsight). Many tools (Nsight, HPCToolkit) will simply not capture certain metrics unless that setting is disabled or the tool is run as root. The reason is that performance counters could potentially be used as side channels (e.g., timing another process’s GPU usage). DCGM similarly might require elevated privileges for certain operations, but its open-source core runs in user space with proper configuration. A small preflight check for these restrictions is sketched after this list.
  • Containerization: Running profilers inside containers (Docker/Kubernetes) is tricky. By design, containers restrict access to performance monitoring. For instance, the perf_event_paranoid sysctl applies system-wide – inside a container you might not be allowed to profile unless the host setting is permissive. Tools like Nsight can run in containers if given the right privileges (the NVIDIA container toolkit can pass through GPU profiling capability). There is an official DCGM container for monitoring GPUs in Kubernetes, which runs privileged to collect data and then exposes metrics safely to unprivileged consumers (via a network endpoint). If you try to run HPCToolkit or TAU in a container on a cluster without privileges, you may not be able to attach to GPU events. Some cloud providers explicitly disable profiling on their GPUs for security.
  • Multi-tenant GPUs: If two users share a GPU (via MPS or virtualization), profiling could either fail or potentially expose the other user’s activity. Typically, vendor drivers prevent that by the admin-only restriction. AMD’s ROCm might allow multiple user processes but rocm profiling likely only sees your own queues (still, usually root access is needed for perf counters). In multi-tenant scenarios, often the operator will allow monitoring (like DCGM aggregate stats) but not detailed tracing by arbitrary users. This is something a new profiling tool must consider – you may have to integrate with permission models. For example, a site might run the profiler as a privileged service that users can request data from, rather than giving the binary to users to run.
  • Security of instrumentation methods: Techniques like eBPF and ptrace (which some tools use for attaching to processes) are restricted by Linux capabilities. By default, a process cannot ptrace another process of a different user, and unprivileged BPF might be limited (recent kernels allow some unprivileged BPF with restrictions, but uprobes might need CAP_SYS_ADMIN depending on config). Therefore, GPUprobe likely needs root or at least the BPF capability to attach uprobes to another process’s libcudart. This limits using such a tool in production unless proper privileges are arranged.
  • Data Sanitization: Profiling data might inadvertently capture sensitive information (e.g., addresses, sizes of allocations that hint at data sizes, or even kernel names that might be proprietary algorithms). So sharing profiles must be done carefully. Some tools anonymize certain data when needed (e.g., one could filter out kernel names or memory addresses).
  • DPUs and isolated environments: In a DPU (like Nvidia BlueField, which has an Arm CPU and runs its own OS), you might profile the DPU similar to any Linux server (since BlueField runs Ubuntu). But if the DPU is running in a mode where the host offloads tasks to it, profiling the interplay is complex. One might need to profile the host and DPU separately then correlate. For security, DPUs often isolate the host view – the host might not easily introspect the DPU’s processes without permission. So profiling a DPU might require logging into the DPU’s OS. This is an emerging area; tools like DCGM currently don’t cover DPUs (DCGM is GPU specific). A future “DPU profiler” might need to consider network security (ensuring tracing network packets doesn’t expose data contents) and multi-tenant (a DPU might serve multiple VMs).
  • Side-effects: Some instrumentation (like writing to /proc or loading kernel modules) could be disallowed in hardened environments. For example, to profile a GPU, you might load a kernel module (Linux nvidia driver has some profiling hooks). In HPC centers, admins often pre-install and enable needed configs. In cloud, one might need container privileges or special orchestration.
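The preflight check mentioned under Privilege Requirements might look like the following: it reads the perf_event_paranoid sysctl and, if present, the NVIDIA driver’s parameters in /proc/driver/nvidia/params. The parameter name it looks for (RmProfilingAdminOnly) is an assumption and may differ across driver versions.

```python
from pathlib import Path

def read_if_exists(path):
    p = Path(path)
    return p.read_text().strip() if p.exists() else None

# CPU-side sampling: values <= 1 generally permit unprivileged perf profiling;
# 2 and above restrict it (see perf_event_open(2) for the exact semantics).
paranoid = read_if_exists("/proc/sys/kernel/perf_event_paranoid")
print(f"perf_event_paranoid = {paranoid}")

# GPU-side counters: the NVIDIA driver lists its module parameters here.
# "RmProfilingAdminOnly" is an assumed name and may differ across driver versions.
params = read_if_exists("/proc/driver/nvidia/params") or ""
profiling_lines = [l.strip() for l in params.splitlines() if "Profiling" in l]
if profiling_lines:
    for line in profiling_lines:
        print(f"nvidia driver: {line}")
else:
    print("nvidia driver params not found or no profiling restriction listed")
```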

In short, profiling in sandboxed environments often requires deliberate enabling by administrators. An in-house observability platform might use a privileged agent (like DCGM exporter) to gather metrics and expose them in a controlled way to users, rather than letting users run arbitrary profilers on shared hardware. Many vendors are working on making profiling more accessible in containers (NVIDIA’s support for NVTX and certain CUPTI features in containers improved with the NVIDIA Docker integration).

For someone building new tools, it’s wise to be aware of these restrictions and possibly design the tool so it can either:

  • operate in a reduced functionality mode without special privileges (e.g., only use user-space timing, not hardware counters), or
  • be deployed as a privileged service that multiple users can safely use (maybe via an authenticated API).

Limitations, Caveats, and Future Directions

Despite their power, current GPU/DPU profiling tools have limitations. Being aware of these helps identify areas for future work:

  • Limited Vendor Support in One Tool: No single tool covers all accelerators well. Nsight is NVIDIA-only; Omnitrace is currently AMD-only; Intel’s tools are Intel-only. Tools like HPCToolkit and TAU try to be multi-platform, but even they require different backends per vendor and may not expose every vendor-specific metric. A truly unified profiling standard is still lacking. oneAPI is a step in that direction for CPUs/GPUs (with Level Zero and an instrumentation interface), but adoption is ongoing. Future work could develop a universal trace format or API so that one tool (or a plugin to it) can handle GPUs from multiple vendors seamlessly.

  • Depth vs. Overhead Trade-off: As discussed, capturing everything (a full trace of a long run) can be infeasible. So tools either limit duration, frequency, or level of detail. Users need to be mindful that, for example, Nsight Systems might drop events if the buffer overflows, or HPCToolkit might not record every single tiny kernel if they occur too frequently (there are often filters for very short kernels to avoid drowning in data). This means profiles might not have 100% fidelity. A further caveat is that enabling profiling can change performance; e.g., the act of profiling might serialize some GPU operations or change timing, so the user must treat results as approximations. Future profilers might leverage more hardware support to reduce overhead (e.g., hardware trace buffers, as seen on CPUs with Intel PT; perhaps someday GPUs will have a hardware trace that tools can harvest).

  • DPU and Heterogeneous Workload Profiling: Profiling DPUs (data processing units) and tightly integrated systems (APUs where CPU and GPU share memory) is an evolving area. For DPUs, since they combine network, storage, and compute offloads, a profiler needs to capture network events (packets or protocol processing) along with CPU events on the DPU’s Arm cores. Today, one might use standard Linux tools on the DPU (like perf for CPU and maybe specialized NIC counters via ethtool). There isn’t yet a widely-used dedicated DPU profiler that shows, say, “this flow in the NIC caused this CPU processing on the DPU.” Future work could integrate NIC telemetry with CPU profiling – for instance, using something like DPDK or DOCA libraries to instrument packet processing on DPUs.

  • For APUs (CPU+GPU on one die, sharing memory), one limitation is that many profiling tools still treat CPU and GPU separately. But an APU could enable novel profiling: since they share memory, one could trace memory accesses system-wide. AMD’s HSA was meant to unify the CPU/GPU memory space, and HSA’s tools could measure things like whether CPU and GPU accesses to the same buffer interfere. We don’t yet see mainstream tools showing a combined CPU+GPU cache profile on an APU. Future research might allow tracking cache coherence events or memory traffic between CPU and GPU on an APU to expose contention (e.g., if the CPU and GPU fight over memory bandwidth, how would that show up in a profile? Today you might see both report high memory use, but correlating them is manual).

  • Another APU-specific challenge is timeline alignment – on an APU the CPU and GPU clocks might be more closely coupled, possibly even the same oscillator. This could simplify correlation (no PCIe latency to consider), but tools haven’t exploited any special APU features yet.

  • Scalability: As HPC moves to exascale, profiling hundreds or thousands of GPUs simultaneously becomes difficult. HPCToolkit and TAU address this by focusing on sampling and profiles (reducing data) rather than giant traces. But even collecting one sample per second from 100,000 nodes is a lot of data. Future directions include online analysis (profilers that analyze and throw away data on the fly, keeping only summaries or anomalies) and streaming approaches where only interesting events are sent to a central collector. Perhaps AI techniques will be applied to performance data to smartly decide what to log.

  • Automation and Insight Generation: Right now, profilers present data, and it’s up to the user to infer insights (with some hints from the tool). Nsight Compute has a rules engine to hint at common issues (“memory bound, try increasing concurrency” etc.), but most tools do not automatically pinpoint root causes. A future direction is more automated performance diagnosis. For example, correlating patterns: “Every time GPU is idle, CPU thread X is in a mutex – likely a synchronization bottleneck.” Some research tools or prototype features aim for this kind of expert system. TAU, integrated with tuning frameworks, can perhaps auto-apply instrumentation to suspected hotspots. We might see more performance anomaly detection built into profilers (especially for always-on monitoring, flagging when an iteration is slower than usual and giving possible reasons).

  • Energy and Efficiency Metrics: As accelerators proliferate, power and energy profiling is important. DCGM provides power usage, but few tools integrate that with performance. A limitation in many profilers is they don’t attribute energy usage per kernel easily. In the future, tools could use power telemetry to, for instance, identify phases where energy is wasted (GPU at high power but low utilization). AMD and Intel support some power metrics via RAPL or ROCm-smi; integration of those into performance analysis is ongoing.

  • New Accelerator Types: DPUs aside, there are also TPUs (Google’s), FPGAs, and custom ASICs. Each tends to have its own tools (Google’s TPU profiler is integrated into TensorBoard). A challenge for the community is to generalize profiling to any accelerator with a standard interface. One project in this direction is the open-source Kokkos Tools interface for node-level profiling, which abstracts the device and allows any backend (CUDA, HIP, etc.) to be profiled through a common API. Tools like HPCToolkit and TAU participate in such initiatives to be ready for new devices (if an accelerator vendor provides an OMPT-like or CUPTI-like API, these tools can hook in).

  • Reverse-engineering Required: Open-source tools often have to chase undocumented features. E.g., until NVIDIA provided official support, HPCToolkit devs had to figure out how to map GPU PCs to source by parsing ELF symbols in CUBINs, and how to collect GPU metrics via CUPTI’s evolving API. This means open tools lag behind vendor tools when new hardware arrives (the vendors might not update CUPTI immediately, etc.). A future ideal scenario is vendors open up more of their driver’s instrumentation hooks or standardize them. Meanwhile, one can expect that for bleeding-edge features (like new GPU memory types or new DPU offload engines), early profiling might involve creative hacks or limited info until official support catches up.

In conclusion, future work in this field will likely focus on unifying and simplifying the toolchain for heterogeneous profiling, reducing overhead further (perhaps with hardware support), integrating profiling with monitoring (to catch issues in production continuously), and smarter analysis to help developers pinpoint problems faster. The landscape is moving toward open standards (like Perfetto for traces, and maybe something akin to PAPI for GPUs one day) which will make it easier to build custom profilers that plug in. The insights from current tools’ implementations – their successes and pain points – will guide these developments. As hardware continues to evolve (with things like chiplet GPUs, unified memory hierarchies, etc.), profiling tools will need to adapt, but the fundamental techniques (instrumentation vs sampling, hardware counter use, multi-stream correlation) will remain cornerstones of any performance analysis toolkit.
