
The Accelerator Toolkit: A Review of Profiling and Tracing for GPUs and Other Co-processors

Modern computing increasingly relies on specialized accelerators – notably GPUs, DPUs, and APUs – to handle diverse workloads. A graphics processing unit (GPU) is a massively parallel processor originally designed for graphics, now essential to high-performance computing (HPC) and AI. A data processing unit (DPU) is a newer class of programmable processor combining CPU cores with high-performance network/storage engines. DPUs offload networking, security, and storage tasks from the host CPU, and are considered the "third pillar" of computing alongside CPUs and GPUs. Meanwhile, accelerated processing units (APUs) integrate CPU and GPU components on one chip – an approach pioneered by AMD's Fusion architecture – enabling unified memory and high throughput for HPC and AI workloads. These accelerators run a range of workloads: GPUs excel at parallel math (HPC simulation, deep learning training/inference, data analytics) and graphics rendering; DPUs focus on data-centric tasks (network packet processing, encryption, storage offload, virtualization); and APUs target heterogeneous workloads needing tight CPU-GPU coupling (e.g. sharing memory for AI or multimedia applications).

Profiling and tracing tools are crucial for optimizing performance on these accelerators. Such tools collect low-level hardware telemetry (e.g. counters for utilization, memory throughput, SM occupancy, cache misses) and can perform instruction or event-level tracing (capturing timelines of kernel executions, memory copies, network packet flows, etc.). The goal is to identify bottlenecks and inefficiencies in both general-purpose code and domain-specific pipelines (like ML model training, network function processing, or graphics rendering). However, profiling highly parallel, heterogeneous systems presents challenges of overhead, data volume, and cross-platform compatibility. This review categorizes current tracing/profiling tools by hardware type and domain, compares open-source and commercial solutions, and highlights major projects and recent research. We also discuss specialized toolsets (including eBPF-based approaches akin to Linux's BCC) adapted for GPUs/DPUs/APUs, typical workloads and matching toolchains, and the limitations and emerging directions in this landscape.

Profiling and Tracing Tools for GPUs

GPU profiling has matured over years of graphics and GPGPU development, yielding a rich ecosystem of vendor tools and open-source frameworks. Broadly, GPU profilers fall into two categories: development-time profilers that provide fine-grained insight for optimization (often with high overhead and GUI analysis), and lightweight monitors suitable for runtime or production use (low overhead, focusing on high-level metrics).

  • NVIDIA's Profiling Suite: NVIDIA's GPUs dominate HPC/AI, and their tools are considered industry-standard. Nsight Systems provides system-wide timeline tracing – capturing CPU threads, CUDA kernel launches, memory copies, etc. – to pinpoint bottlenecks across the CPU-GPU boundary. It uses the CUDA Profiling Tools Interface (CUPTI) to gather detailed metrics and events, and developers commonly add NVTX ranges to make its timelines easier to read (a minimal sketch follows this list). Nsight Systems is extremely powerful for deep analysis of GPU-accelerated applications, but a profiling session must be explicitly started and can incur significant overhead (often slowing programs by 2×–10× during profiling); it is intended for development and tuning rather than continuous use. Nsight Compute, on the other hand, focuses on per-kernel deep dives: it profiles individual GPU kernels, providing instruction-level stats (e.g. instruction mix, memory transactions, warp occupancy, execution stalls) and can even associate performance metrics with source lines or PTX/SASS assembly. NVIDIA also offers Visual Profiler (an older GUI, now largely replaced by Nsight) and command-line profiling via nvprof/nsys. For graphics workloads (DirectX, OpenGL, Vulkan), NVIDIA's Nsight Graphics captures frame rendering pipelines, shader timings, and GPU state to assist game developers. All these NVIDIA tools are free but proprietary (closed-source). They are tightly optimized for NVIDIA hardware and support domain-specific analysis modes (graphics vs. compute vs. AI).

  • AMD's GPU Profiling Tools: AMD provides both open-source and proprietary tools as part of the ROCm and GPUOpen ecosystems. Historically, CodeXL was AMD's all-in-one profiler and debugger for CPUs, GPUs, and APUs. CodeXL (now discontinued) could profile OpenCL, HIP, and HSA applications on AMD APUs/GPUs, collecting kernel execution times and hardware counters on integrated devices. In recent years, AMD shifted to ROCm-based tooling: the rocProfiler and rocTracer libraries (analogous to NVIDIA's CUPTI) enable profiling and tracing of HIP and OpenCL applications. In 2024, AMD introduced new tools in ROCm 6.2: Omniperf and Omnitrace. Omniperf is a kernel-level profiler for machine learning and HPC workloads on AMD Instinct GPUs, offering detailed performance-counter analysis via a CLI or a GUI dashboard. Omnitrace is a multi-purpose profiling/tracing tool for CPU and GPU, supporting dynamic binary instrumentation, call-stack sampling, and even causal profiling to pinpoint which functions consume time in heterogeneous CPU-GPU executions. Both ship as part of AMD's open ROCm stack. For graphics and game development, AMD provides Radeon GPU Profiler (RGP) and Radeon Developer Panel under GPUOpen. RGP offers low-level timeline and wavefront occupancy data for Vulkan/DX12 applications on Radeon GPUs, while Radeon Memory Visualizer helps track memory usage. AMD's tools can be used on their APUs as well, benefiting from unified memory (e.g., profiling an APU means GPU kernels can be traced without PCIe transfer overhead). Overall, AMD's approach emphasizes open interfaces (e.g., the GPUPerfAPI library gives developers direct access to GPU performance counters) and integration with generic profilers.

  • Intel and Other GPU Tools: Intel's GPUs (integrated Iris Xe and data-center GPUs like Ponte Vecchio) can be profiled by Intel's oneAPI toolset. Intel VTune Profiler supports GPU offload analysis, providing kernel timelines, EU (execution unit) occupancy, and memory bandwidth for Intel GPUs. Intel also offers Graphics Performance Analyzers (GPA) for game/graphics profiling on Intel hardware. Notably, Intel has open-sourced a suite called Profiling Tools Interface (PTI) for GPU, which includes lightweight tracing tools for oneAPI Level Zero and OpenCL applications. These command-line tools (available on GitHub) can trace GPU kernel submissions, memory operations, etc., on Intel GPUs, reflecting Intel's push for an open profiling ecosystem. Beyond the big three vendors, there are domain-specific GPU profilers: e.g., ARM's Mali GPUs (for mobile) have ARM Mobile Studio with tools like Streamline for profiling mobile GPU workloads; Qualcomm Adreno GPUs can be analyzed with Qualcomm's Snapdragon Profiler. These are more specialized but underscore that across vendors, profiling often requires proprietary SDKs or tools unique to each architecture, with little standardization – a pain point if one needs to support multiple GPU vendors.

  • HPC and Cross-Platform Profiling: Outside vendor-specific utilities, the HPC community has developed powerful open-source profiling frameworks that work across CPUs and accelerators. HPCToolkit is a prominent example: it uses statistical sampling to profile both CPU and GPU execution with minimal overhead (often <5%). HPCToolkit can trace GPU operations (kernels, memcopies, sync) on NVIDIA, AMD, and Intel GPUs, and on NVIDIA it even leverages hardware PC sampling to measure instruction-level execution and stall cycles. The tool correlates GPU activity back to CPU call stacks, allowing a unified profile attributing GPU costs to the calling CPU code context. This is invaluable for heterogeneous applications, e.g. identifying which CPU-side function launched a slow GPU kernel. Other cross-platform tools include TAU and Score-P/Extrae, which can instrument MPI+GPU programs and produce integrated traces. For instance, the Extrae tracer (from BSC) records CPU events and CUDA runtime events, enabling visualization in Paraver of both CPU timeline and GPU kernel timelines. These OSS tools typically support multiple accelerators via plugins (CUPTI for CUDA, ROCm tools for AMD, etc.), providing vendor-neutral analysis. They may not always expose the full depth of vendor tools' metrics, but they excel in coordinating multi-node, multi-accelerator traces. Academic efforts like Paraver/Extrae, Vampir, and Allinea MAP (Arm Forge) have evolved to handle GPU-accelerated HPC codes, indicating a trend toward unified performance analysis for heterogeneous systems.

  • Graphics and Game Profiling: In graphics domains, tracing GPU workloads often involves capturing API calls and GPU command streams. Open-source projects like RenderDoc allow frame-by-frame capture of Vulkan/OpenGL/Direct3D calls, with introspection of GPU draw call timing (helpful for graphics debugging/performance). While RenderDoc is more of a debugger, it can give insights on whether the GPU is bound by certain draw calls or shaders. Platform-specific tools exist too: Microsoft's PIX (for DirectX on Windows/Xbox) provides detailed GPU timing for each render pass, and GPUView (part of the Windows Performance Toolkit) traces kernel-level GPU scheduling events for graphics workloads; on Linux, Valve's gpuvis offers a comparable view of kernel GPU scheduling traces. These are highly domain-specific (targeting graphics pipelines rather than general compute). They complement general GPU profilers by focusing on frame rendering latency rather than kernel throughput.
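As referenced in the NVIDIA item above, timelines from tools like Nsight Systems become far easier to read when the application marks its own phases with NVTX ranges. The sketch below is a minimal example of that annotation style; it assumes the `nvtx` Python package and a CUDA-enabled PyTorch build, and the program would be launched under `nsys profile` to capture the ranges on the timeline.

```python
# Minimal sketch: annotate application phases with NVTX ranges so that
# Nsight Systems (e.g. run as: nsys profile python train_step.py) shows them
# as named regions on its timeline. Assumes the `nvtx` package (pip install nvtx)
# and a CUDA-enabled PyTorch build; model and shapes are placeholders.
import nvtx
import torch

model = torch.nn.Linear(4096, 4096).cuda()
data = torch.randn(256, 4096, device="cuda")

for step in range(10):
    with nvtx.annotate(f"step {step}", color="green"):
        with nvtx.annotate("forward"):
            out = model(data)
        with nvtx.annotate("loss+backward"):
            out.sum().backward()
    torch.cuda.synchronize()  # make the end of each step visible on the timeline
```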

Table 1 below summarizes a sample of representative GPU profiling/tracing tools across vendors and domains, highlighting their availability and focus:

| Tool | Vendor / Origin | Open-Source? | Primary Focus | Supported Domain |
|---|---|---|---|---|
| Nsight Systems | NVIDIA | No (free) | Timeline tracing (CPU & GPU), deep metrics | CUDA Compute, AI, some Graphics |
| Nsight Compute | NVIDIA | No (free) | Kernel microarchitecture profiling | CUDA/HPC/AI (per-kernel) |
| NVIDIA PerfKit/DCGM | NVIDIA | No (free) | Low-level HW counters (PerfKit); datacenter GPU monitoring (DCGM) | System GPU telemetry |
| Radeon GPU Profiler | AMD (GPUOpen) | Yes (free) | Low-level GPU trace (wavefront, ISA) | Graphics (Vulkan/DX12), GPGPU |
| ROCm Omniperf | AMD ROCm (Instinct MI GPUs) | Yes | Kernel profiling (counters & analysis) | HPC/AI compute |
| ROCm Omnitrace | AMD ROCm | Yes | CPU-GPU tracing, call-stack profiling | HPC/AI, heterogeneous apps |
| Intel VTune & GPA | Intel oneAPI | Partial (VTune closed, PTI open) | VTune: GPU offload analysis; GPA: frame analysis | Compute (oneAPI) & Graphics |
| HPCToolkit | Rice Univ. (HPC tool) | Yes | Sampling-based profiling (CPU & GPU) | HPC/AI (CPU+GPU) |
| RenderDoc | Community / Baldur Karlsson | Yes | Frame capture & API trace | Graphics debugging |
| PyTorch Kineto | FB/Intel (via PyTorch) | Yes | In-framework profiler (CPU+GPU) | AI/ML (deep learning) |

Sources: GPU vendor documentation and tool websites.

As the table suggests, open-source solutions are prominent in research and HPC (e.g. HPCToolkit, Omniperf/Omnitrace, RenderDoc), while commercial/vendor tools (Nsight, VTune, etc.) often provide the most optimized access to proprietary hardware features (like NVIDIA's profilers using NVIDIA's own CUPTI interfaces). Open tools may trade some low-level detail for broader applicability – for example, HPCToolkit can profile across NVIDIA, AMD, and Intel GPUs in a uniform way, but for the deepest NVIDIA-specific metrics (e.g. SM warp stall reasons), developers still rely on Nsight Compute. Conversely, vendor tools are typically free of cost but closed-source, and each vendor's toolchain is distinct, leading to fragmentation. A developer targeting multiple GPU platforms might need to juggle multiple profilers (one for CUDA, one for ROCm, one for Intel), since there is no single standard interface for GPU performance counters across vendors. The lack of standardization has led to projects like ARM's HWCPipe library that attempt to abstract GPU counters behind one API, but such efforts are still evolving.

GPU Tracing with Low-Overhead and Continuous Monitoring

Classic GPU profilers (as described above) are extremely useful during development, but their overhead and intrusive workflows make them unsuitable for always-on monitoring in production (for instance, you can't afford a 5× slowdown on a live AI inference server just to collect traces). To fill this gap, recent tools leverage low-overhead tracing techniques inspired by systems like Linux's eBPF. One notable approach is Meta's deployment of eBPF for fleet-wide GPU profiling, as presented by Selim (2023) at the eBPF Summit. Instead of instrumenting GPU code, Meta's tool attaches to GPU driver events via eBPF, enabling continuous collection of GPU metrics across thousands of machines with negligible overhead.
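Before reaching for eBPF at all, the simplest form of always-on GPU monitoring is to poll the driver's management interface (NVML, the same library behind nvidia-smi and DCGM) and export the readings to an observability stack. The following is a minimal sketch of that pattern; the pynvml and prometheus_client packages, the port, and the metric names are assumptions for illustration, not a particular tool's implementation.

```python
# Sketch of an always-on GPU telemetry exporter: poll NVML and expose readings
# for Prometheus to scrape. Assumes the pynvml and prometheus_client packages;
# metric names and the port are illustrative only.
import time
import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU SM utilization", ["gpu"])
GPU_MEM = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])

def main(port=9400, interval_s=5):
    pynvml.nvmlInit()
    count = pynvml.nvmlDeviceGetCount()
    start_http_server(port)  # serves /metrics for Prometheus
    while True:
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
            GPU_MEM.labels(gpu=str(i)).set(mem.used)
        time.sleep(interval_s)

if __name__ == "__main__":
    main()
```

This gives fleet-level utilization and memory trends with negligible overhead, but no per-application detail; that is the gap the eBPF-based tools below address.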

An example of an open project in this vein is GPUprobe, a Linux eBPF-based GPU observability tool. GPUprobe uses eBPF uprobes (user-level probes) to hook into NVIDIA's CUDA runtime library functions, with the probe handlers running in the kernel. In doing so, it can monitor events like GPU memory allocations (cudaMalloc/free) and kernel launches in real time – without requiring any modifications or instrumentation in the target application code (a BCC-style sketch of this hooking pattern appears below). The overhead is very low (measured under 4% in benchmarks), so it's feasible to run in production continuously. GPUprobe fills a middle ground between heavy profilers and coarse monitoring: it provides richer, per-application insights than NVIDIA's Data Center GPU Manager (DCGM) – such as tracking memory leaks per process and kernel launch frequencies – but with far less overhead than Nsight's full profiling. As the GPUprobe authors note, Nsight Systems is like a "GPU-specific debugger" that's great for deep dives but not for continuous use, while DCGM gives high-level stats (utilization, temps, health) and misses app-specific details. Tools like GPUprobe bridge this gap, exporting metrics to standard observability systems (e.g. Prometheus/Grafana) for integration into data center dashboards. In fact, GPUprobe's design allows scraping of its metrics (memory usage maps, kernel launch counts, bandwidth usage) in OpenMetrics format, so operators can visualize GPU behavior over time in Grafana alongside CPU, network, and other metrics.
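To make the mechanism concrete, here is a small BCC-style sketch of the same uprobe idea: attach an eBPF probe to cudaMalloc in the CUDA runtime library and aggregate the requested bytes per process, with no change to the target application. This is not GPUprobe's own code; the library path, map layout, and report format are assumptions, and it only catches applications that link libcudart dynamically.

```python
# BCC-style sketch of the uprobe idea behind tools like GPUprobe: probe
# cudaMalloc in libcudart and sum requested bytes per PID. The libcudart path
# and map layout are assumptions; statically linked runtimes are not covered.
from bcc import BPF
import time

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(alloc_bytes, u32, u64);

int on_cuda_malloc(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 size = PT_REGS_PARM2(ctx);   /* cudaMalloc(void **devPtr, size_t size) */
    u64 zero = 0;
    u64 *total = alloc_bytes.lookup_or_try_init(&pid, &zero);
    if (total)
        __sync_fetch_and_add(total, size);
    return 0;
}
"""

b = BPF(text=prog)
# Adjust for the local CUDA installation; this path is an assumption.
b.attach_uprobe(name="/usr/local/cuda/lib64/libcudart.so",
                sym="cudaMalloc", fn_name="on_cuda_malloc")

while True:
    time.sleep(10)
    for pid, total in b["alloc_bytes"].items():
        print(f"pid {pid.value}: {total.value / 1e6:.1f} MB requested via cudaMalloc")
```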

This BCC/eBPF-inspired approach is an emerging trend for GPU profiling. It aims to bring the powerful methodology of kernel tracing (pioneered on CPUs by tools like perf, bcc, and eBPF) into the GPU realm. Research prototypes have even explored running eBPF programs on the GPU itself (for instance, an academic project "eGPU" offloaded BPF bytecode to GPUs via PTX injection), though such techniques are not yet mainstream. At present, the more practical uses involve hooking GPU driver or runtime events from the CPU side. The result is a non-intrusive peek into GPU operations: for example, detecting if a GPU job is launch-bound (many small kernel launches) or memory-leak-prone, without recompiling the application. This is particularly valuable for cloud providers or large-scale AI deployments, where continuous profiling can catch performance regressions or resource leaks in long-running GPU services.

Profiling and Tracing Tools for DPUs (SmartNICs)

DPUs (data processing units), often manifested as SmartNICs, combine general-purpose cores with specialized packet-processing hardware. They are used to offload networking (packet switching, virtualization), storage (NVMe-oF, encryption), and security tasks from the main CPU. Profiling DPUs presents a distinct challenge: one must consider both the on-board CPU (usually an Arm SoC running Linux) and the networking data plane which may involve FPGA logic or ASIC accelerators on the NIC.

In general, profiling on a DPU can leverage many of the standard Linux tools for the embedded CPU portion. For example, NVIDIA's BlueField DPUs run Ubuntu, so one can use Linux perf, standard CPU profilers, or even eBPF-based monitors on the DPU's Arm cores to profile software running locally (e.g., an offloaded software switch). If a user application or agent runs on the DPU's OS, it's profiled much like on any Linux server – albeit with an awareness of the limited cores and unique tasks (often packet-handling threads). In fact, one could run BCC tools on a BlueField to measure syscalls, or use perf to sample cache misses in the DPU's code. This is the general-purpose workload profiling on DPUs (similar to any Linux host, but constrained resources).

However, much of a DPU's workload is domain-specific (networking and storage) and handled by specialized hardware blocks. For instance, a DPU may accelerate an Open vSwitch (OVS) datapath in hardware, or perform RDMA and NVMe operations via dedicated engines. Profiling these aspects often relies on vendor-provided telemetry and counters. Vendors like NVIDIA (Mellanox) and Broadcom supply tools to monitor packet throughput, latency, and offload engine stats on their SmartNICs. NVIDIA's DOCA SDK for BlueField includes profiling APIs and performance monitors for accelerated functions (e.g., crypto, RDMA). The BlueField DPU exposes metrics such as packets per second, drops, and queue depths via standard interfaces (perhaps through DPDK or /sys counters). In the case of DPDK (a common user-space packet I/O library used with SmartNICs), developers can profile their packet processing pipeline on CPU using Intel VTune or perf, and measure NIC throughput using DPDK's built-in event counters. Intel's documentation even covers using VTune to analyze DPDK event scheduling on their infrastructure processing units (IPUs).
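For the software-visible part of the data plane, basic packet and byte rates can be read on the DPU's Arm cores from the standard Linux statistics files, without any vendor SDK. The sketch below shows that approach; the interface name is an assumption, and the internals of hardware offload engines remain invisible at this level, which is where vendor telemetry is still required.

```python
# Minimal sketch: derive packet/byte rates for a NIC or DPU port from the
# standard Linux counters (the same data ip -s and ethtool summarize).
# The interface name is an assumption; offload-engine internals are not visible here.
import time
from pathlib import Path

IFACE = "enp3s0f0"  # assumed port name on the DPU's Arm Linux
STATS = Path(f"/sys/class/net/{IFACE}/statistics")

def read(counter: str) -> int:
    return int((STATS / counter).read_text())

counters = ("rx_packets", "tx_packets", "rx_bytes", "tx_bytes", "rx_dropped")
prev = {c: read(c) for c in counters}
while True:
    time.sleep(1)
    cur = {c: read(c) for c in counters}
    rate = {c: cur[c] - prev[c] for c in counters}
    print(f"rx {rate['rx_packets']/1e6:.2f} Mpps  "
          f"tx {rate['tx_packets']/1e6:.2f} Mpps  "
          f"rx {rate['rx_bytes']*8/1e9:.2f} Gbps  "
          f"drops/s {rate['rx_dropped']}")
    prev = cur
```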

Commercial SmartNIC vendors offer their own monitoring suites. For example, Napatech (a SmartNIC manufacturer) distributes profiling tools that report port throughput, packet counters (RMON statistics), and even host application interaction metrics. These tools often come as command-line monitors or GUI dashboards. Napatech's monitoring CLI (shown in their docs) can live-update line rate (e.g., ~48 Gbps Rx/Tx on a 100G NIC) and various packet-size counters. Such vendor tools are usually proprietary (bundled with the NIC), highlighting a similarity with the GPU space: to get full visibility into hardware accelerators on the DPU, you typically use the vendor's API or utility. Another example: Pensando DPUs (now part of AMD) have an SDK that likely includes telemetry for their packet processors, though details are often behind NDAs. Broadcom, Cisco, and Marvell likewise provide manageability interfaces for their SmartNICs (often as part of a network OS or NIC firmware), focusing on throughput and latency metrics rather than instruction-level traces.

That said, open-source efforts are emerging for SmartNIC performance analysis. The P4 language, used to program some NICs, has debugging tools which can simulate or log packet flow through the pipeline (though not exactly a profiler in the traditional sense). Academic research has produced tools like Clara and Pipefuse to analyze or even predict network function performance on SmartNICs. These aim to answer questions like "if I offload this function to a SmartNIC, what throughput can I expect?" by modeling the NIC's resources. While not runtime profilers, they address the broader performance tuning of DPU workloads. Another research example is LogNIC, which provides a performance model for SmartNIC pipelines. Such tools are largely experimental but point toward future high-level profilers for networking tasks.

In summary, DPU profiling today is a patchwork of general CPU profiling on one hand, and specialized network telemetry on the other. One might profile the software control plane on the DPU using familiar tools (to ensure the DPU's CPU isn't a bottleneck) while simultaneously using NIC counters or synthetic traffic tests to profile the data plane throughput. Coordinating these is often manual. For instance, to profile an Open vSwitch offloaded to a DPU, you'd measure the DPU's CPU usage (for control tasks, flow setup) and gather NIC stats for packet rate and latency, possibly by generating test traffic and measuring end-to-end latency. Standardized performance profilers for such network workloads (e.g., tracing a P4 program as it executes on hardware) are still nascent. We expect that as DPUs become more common, vendor-agnostic profiling standards may emerge – perhaps an extension of eBPF/XDP to trace through a SmartNIC, or an open telemetry schema for NICs – but currently much is vendor-specific.

Profiling and Tracing Tools for APUs (CPU–GPU Integrated Platforms)

Accelerated Processing Units (APUs) blend CPU and GPU on a single die, sharing memory and interconnect. AMD's latest Instinct MI300A is a prime example: a data-center APU combining 24 Zen4 CPU cores with 128 GB of HBM memory and a CDNA3 GPU in one package. Profiling APUs involves understanding the interaction between the on-chip CPU and GPU, which can be both a blessing and a challenge. On one hand, unified memory means developers don't need to profile PCIe transfer bottlenecks – CPU and GPU can access the same HBM pool, and data movement is via pointers rather than explicit copies. On the other hand, an APU's GPU shares power/thermal budgets with CPUs, which can introduce contention that profiling tools should reveal (e.g., if the GPU is throttling when CPU is maxed out).

Tools for APUs largely overlap with the CPU and GPU tools discussed, with added emphasis on integrated analysis. AMD's toolchain, for example, is APU-aware: AMD uProf is a profiling suite that covers CPU performance (PMU events, cache misses, etc.) and can also correlate with GPU activity on supported APUs. AMD uProf and CodeXL historically allowed profiling OpenCL kernels on an APU's iGPU, reporting each kernel's performance counters. The new ROCm Omnitrace (mentioned earlier) explicitly supports profiling both CPU and GPU in one timeline, which is ideal for APUs where CPU threads launch GPU work frequently. Omnitrace's ability to use binary instrumentation and call-stack sampling on the CPU side, while tracing GPU kernels, helps map performance holistically across the APU. This means if a CPU function on the APU calls a GPU kernel, the tool can show the time in the CPU function and the nested time in the GPU kernel as part of the same call tree – a critical capability for optimizing heterogeneous code.

For consumer APUs (like AMD Ryzen processors with Radeon graphics), developers commonly use graphics profilers (for gaming use-cases) or OpenCL/Vulkan profilers for compute. AMD's Radeon GPU Profiler, for instance, works on integrated GPUs the same as discrete. The unified memory also allows use of standard OS performance counters: on Linux, AMD's GPU drivers expose certain GPU utilization metrics via drm/sysfs, so one could even use system monitors or custom scripts to log GPU activity alongside CPU. Windows developers with APUs might use Microsoft's PIX or AMD's Radeon Developer tools to profile DirectX12 games running on the integrated GPU – these tools show CPU and GPU timelines and could highlight if the CPU is starving the GPU or vice versa. Essentially, APU profiling doesn't require an entirely new class of tools, but it benefits from tools that can correlate CPU and GPU performance tightly. This is similar to profiling on a discrete GPU system, except the latency between CPU-GPU is lower and memory is shared, which tools need to account for (e.g., a cache coherency effect might appear where CPU and GPU contend on memory).
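As a concrete instance of the "custom scripts" approach mentioned above, a few lines of Python can log the integrated GPU's busy percentage from the amdgpu driver's sysfs interface next to CPU load. The card index and the gpu_busy_percent file are assumptions (exposed by recent amdgpu kernels), and this yields only coarse utilization, not per-kernel data.

```python
# Quick sketch: log integrated-GPU load next to CPU load on Linux via the
# amdgpu driver's sysfs counters. card0 and gpu_busy_percent are assumptions;
# this is coarse utilization, not a substitute for a real profiler.
import os
import time
from pathlib import Path

GPU_BUSY = Path("/sys/class/drm/card0/device/gpu_busy_percent")  # assumed: card0 is the APU iGPU

while True:
    gpu_pct = int(GPU_BUSY.read_text())
    load1, _, _ = os.getloadavg()  # 1-minute CPU load average
    print(f"iGPU busy {gpu_pct:3d}%   CPU load(1m) {load1:.2f}")
    time.sleep(2)
```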

It's worth noting that integrated architectures spurred the development of HSA (Heterogeneous System Architecture) in the mid-2010s, and AMD's tools had HSA-specific profiling modes. For example, CodeXL included an HSA profiler for AMD APUs which could trace HSA kernel dispatches and HSAIL instructions. Much of that functionality has been absorbed into ROCm tools now. The essence remains: APUs require profiling of the whole system rather than just "CPU vs GPU" in isolation. Tools like Omnitrace, VTune, or HPCToolkit (with its heterogeneous call path analysis) are particularly apt for APU-based workloads because they naturally mix CPU and GPU metrics.

Open-Source vs Commercial Solutions – A Comparison

There is a healthy mix of open-source (OSS) and commercial/proprietary solutions in accelerator profiling, each with pros and cons. Here we compare key aspects:

  • Feature Depth and Hardware Access: Vendor-provided tools (usually closed-source, but often free of charge) tend to have the most comprehensive access to hardware performance counters and features. For example, NVIDIA's Nsight can report SM warp stall reasons and texture cache hit rates – metrics exposed by NVIDIA's secret sauce interfaces that open tools generally don't get. Similarly, AMD's proprietary driver might expose GPU wavefront occupancy details to RGP that generic tools can't obtain. OSS tools rely on published or reverse-engineered interfaces; for instance, AMD's GPUPerfAPI (open library) provides cross-platform counter access, which is why AMD's own tools could be open-sourced. Open-source projects sometimes lack the very latest hardware support until vendors release documentation, whereas commercial tools are ready on Day 1 for new GPUs (since the vendor builds them). On the other hand, open tools like HPCToolkit have innovated features like fully automated call-stack unwinding and statistical GPU instruction sampling that are not available in vendor GUIs, showing that OSS can lead in certain capabilities (particularly around integration and low overhead).

  • Extensibility and Customization: Open-source profilers (HPCToolkit, TAU, etc.) allow users to modify or script them, enabling custom analyses or integration into automated pipelines. For instance, you can instrument code with Score-P and emit traces in Open Trace Format (OTF2), then post-process with custom analytics – all because the formats and code are open. In contrast, commercial tools often lock data in proprietary formats (e.g., Nsight's .nsys-rep trace files) that require the vendor's viewer, though some export options exist (like CSV exports). The OSS approach also fosters community contributions – e.g., support for new programming models (OpenMP offload, Kokkos, etc.) often appears first in tools like TAU or Score-P via community patches.

  • Cross-Vendor Support: As noted, open-source tools are generally more vendor-neutral. A single tool like Vampir or HPCToolkit can handle multiple accelerator types in one run, whereas vendor tools are siloed (Nsight won't profile an AMD GPU, and AMD's rocprof won't work on NVIDIA). For a heterogeneous environment (say, an Intel CPU, an NVIDIA GPU, and maybe a Xilinx FPGA in one system), your only hope for a unified trace might be an open tool that supports all via plugins or standard APIs (OpenCL, oneAPI, etc.). This is a strong point in favor of OSS solutions in research or multi-vendor shops. The downside is that vendor tools are often better optimized for their own hardware – they may offer a more polished UI, or more stable data collection on that platform. For example, an open tool using unofficial GPU counters might be brittle or less accurate if drivers change.

  • Cost and Support: Most vendor tools for GPUs/DPUs are free (as in beer) but closed. There are a few truly commercial (for-purchase) performance tools in HPC, such as the Arm Forge suite (which includes the MAP profiler) – these come with professional support. Open-source tools are free (as in speech/beer) but support comes from community or self-expertise. Companies with mission-critical needs sometimes prefer tools backed by vendor support (to help interpret results or get bug fixes). That said, big vendors (NVIDIA, Intel) do provide support forums even for their free tools. In niche domains like networking, some commercial analyzers (e.g., deep packet inspection performance profilers) might come from specialized firms and require licenses – but these are relatively rare.

In practice, environments often use a combination: e.g., an HPC center might use vendor profilers to optimize code on a specific GPU, but then integrate HPCToolkit or IPM (an MPI profiler) for regression testing and cross-system comparisons. Notably, open and closed tools can complement each other. A developer might run an OSS tracing tool for a high-level overview and cross-check specific kernels with the vendor's detailed profiler. An example from the GPU domain: a user could run a Score-P instrumented program to get an overall MPI+GPU timeline, then zoom into a particular GPU kernel of interest with Nsight Compute to inspect its memory throughput. This layered approach plays to each tool's strengths.

Workload-Specific Tool Mappings

Different workloads stress accelerators in different ways, and accordingly, certain tools are favored in each domain:

  • General-Purpose Computing / HPC: These workloads (scientific simulations, linear algebra, data analytics) use GPUs for throughput. Profiling focuses on kernel efficiency and GPU utilization. Tools like Nsight Compute (for compute kernel analysis) or HPCToolkit/Omniperf are well-suited. HPC codes also run on thousands of GPUs in parallel; tracing each in detail is impractical, so tools like HPCToolkit that add only ~1-5% overhead via sampling are invaluable for profiling large-scale runs. HPC workloads often use MPI + GPU, so tools that can correlate communication and computation (e.g., timeline traces via Extrae, or MPI profiles via mpiP combined with GPU profiles) map well. On DPUs in HPC (e.g., using SmartNICs for RDMA), the "workload" is typically just networking – here one cares about throughput and overlap (profiling ensures that the DPU handles data transfers while GPUs compute, for example). Tools: network benchmarks (like IB Profiler for InfiniBand) plus GPU profilers to see if communication overlaps with computation.

  • Machine Learning / AI: ML training combines heavy GPU compute with data pipeline overhead. Profiling an ML workload might involve framework-level profilers: e.g., PyTorch Profiler (built on Kineto), which internally uses CUPTI to record each op's GPU time (a minimal sketch follows this list), or TensorFlow Profiler, which similarly captures timelines of ops and streams. These produce high-level views (which model layer took time) as well as low-level kernel details. NVIDIA has a DLProf tool that integrates with TensorBoard to show GPU kernel metrics in the context of neural network operations. For multi-GPU training, Nsight Systems can trace activity across GPUs (especially if using NCCL for communication – Nsight can show the NCCL call timeline). An emerging challenge is profiling distributed training: tools like PyTorch Profiler now have distributed traces, but it's still an area of active development to seamlessly profile hundreds of GPUs training one model. On the DPU side, AI may use DPUs for preprocessing or moving data – profiling the DPU's effect (say, using it to do data filtering) would involve monitoring how much the DPU speeds up data ingestion (tracked via throughput) and ensuring the GPU is not starved. In the future, AI accelerators on DPUs (some DPUs might include tiny ML cores) could require new profilers, but currently most AI work is GPU-centric.

  • Networking and I/O Workloads: For pure networking tasks on DPUs or GPUs (yes, GPUs can also do packet processing in some cases using CUDA or OpenCL), the profiling is about latency and throughput. Tools here include packet generators (to measure how many Mpps a pipeline can handle) and tracing tools for code paths. For instance, if using a GPU to accelerate packet encryption, one might use Nsight to ensure kernel launches overlap with data transfers. If using a DPU to run, say, an IDS (intrusion detection) in software, one might profile it with perf to see if it's CPU-bound and use NIC counters for drops. Networking workloads often demand real-time tracing (to catch jitter spikes), so eBPF-based monitors or even hardware telemetry (like P4 runtime logs) could be employed. There's also interest in using GPU for networking (GPU-accelerated NIC offloads via CUDA pipelines), which would involve both GPU and network profiling – but that's fairly niche and experimental.

  • Graphics and Visualization: For game engines or VR apps on GPUs (especially APUs in consoles or laptops), tools like RenderDoc, Nsight Graphics, and platform profilers (e.g., Apple's Xcode Instruments for Metal) are tailored to measure frame times, GPU pipeline stages, and CPU-GPU synchronization. A graphics workload is typically limited by either the GPU shader throughput or the CPU draw-call submission rate. Profiling maps to checking if the GPU's frame time budget is being exceeded and why (which stage – vertex shading? fragment? memory?). These tools often provide specialized visualizations (HUD overlays, frame scrubbers) that general compute profilers don't. For APUs handling graphics, one must also consider that the CPU and GPU share memory bandwidth – graphics debuggers can show if CPU memory traffic (e.g., updating textures) is affecting the GPU. Additionally, in professional visualization (CAD, etc.), GPU memory usage can be limiting; tools like NVIDIA's Nsight or AMD's Radeon developer tools (RGP, Radeon Memory Visualizer) can profile VRAM usage and cache behavior to optimize large models.
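Referring back to the ML item above, the following is a minimal torch.profiler sketch (Kineto/CUPTI underneath) that records a few training steps and prints the hottest CUDA kernels. The model, shapes, and output directory are placeholders, and a CUDA-enabled PyTorch install is assumed.

```python
# Minimal sketch of framework-level ML profiling with torch.profiler: capture
# CPU + CUDA activity for a few steps and print the hottest kernels. Model,
# shapes, and the TensorBoard log directory are placeholders.
import torch
from torch.profiler import profile, schedule, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).cuda()
data = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),            # skip, warm up, then record
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./tb_logs"),
    record_shapes=True,
) as prof:
    for _ in range(5):
        model(data).sum().backward()
        prof.step()                                            # advance the profiler schedule

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```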

In essence, each processing unit type sees use in particular domains, and the profiling solutions have evolved accordingly – but there is also convergence. A modern AI application may involve GPUs for compute, CPUs for orchestration, DPUs for data loading, all in one pipeline. This raises the need for multi-accelerator profiling – the ability to trace an operation as it moves through CPU, DPU, and GPU. Today this often means running multiple tools and correlating timestamps manually. For instance, one might use Nsight to trace the GPU and simultaneously run tcpdump or NIC counters on the network side, then align the logs. Such multi-component workflows are cumbersome, pointing to an opportunity for more integrated profiling of heterogeneous workflows.
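Once each tool's events are exported with timestamps on a common clock, the manual correlation just described can at least be scripted. The sketch below flags GPU idle gaps that overlap NIC saturation periods; the event formats and thresholds are hypothetical, and real traces would first need clock alignment (e.g., PTP-synchronized hosts).

```python
# Sketch of cross-device correlation: given GPU idle gaps (from a profiler
# export) and NIC queue-saturation windows (from counters), flag overlaps.
# Event formats and thresholds are hypothetical; clocks are assumed aligned.
from dataclasses import dataclass

@dataclass
class Interval:
    start: float  # seconds on a shared wall clock
    end: float

def overlaps(a: Interval, b: Interval) -> bool:
    return a.start < b.end and b.start < a.end

# Hypothetical inputs: GPU idle gaps and windows where the NIC RX queue was >90% full.
gpu_idle = [Interval(12.40, 12.55), Interval(13.10, 13.12)]
nic_saturated = [Interval(12.38, 12.60)]

for gap in gpu_idle:
    if any(overlaps(gap, s) for s in nic_saturated):
        print(f"GPU idle {gap.start:.2f}-{gap.end:.2f}s coincides with NIC saturation "
              f"-> likely network-bound")
```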

Emerging Directions and Research Trends

The field of accelerator tracing/profiling is actively evolving. Several key research and development directions are apparent:

  • Low-Overhead, Always-On Profiling: As discussed, techniques borrowing from OS telemetry (eBPF, hardware performance monitors) are being adapted for accelerators. Meta's continuous GPU profiler and tools like GPUprobe demonstrate that one can collect useful data in production with minimal overhead. We anticipate more work in this area: for example, continuous DPU monitoring integrated into data center observability stacks (similar to DCGM for GPUs) – perhaps using eBPF to monitor DPU NIC drivers or using in-hardware telemetry (many NICs have telemetry streams for packet pacing, queue occupancy, etc., which could be tapped into). For GPUs, researchers are exploring sampling-based profiling to reduce overhead further. NVIDIA's latest architectures support PC sampling, where the GPU periodically samples its program counter and reports which instructions are executing and if they stalled. This can be done in the background with little interference. HPCToolkit already leverages this on NVIDIA GPUs to sample instructions and derive stall breakdowns. Future tools might extend such sampling to gather a statistical trace of GPU activity without instrumenting each kernel launch.

  • Unified and Standardized Interfaces: A recurring theme is the lack of standardization across vendors. There are calls for a vendor-neutral GPU profiling API – for instance, a past suggestion in the OpenGL community proposed a unified API for GPU performance counters across vendors. In the compute realm, something analogous to the CPU's PAPI (Performance API) for GPUs would be valuable. We see early steps: the Khronos Group's OpenCL and Vulkan APIs have performance query extensions (like Vulkan's VK_KHR_performance_query) that let applications gather some counters in a standardized way. Also, Intel's oneAPI aims to provide a uniform interface (Level Zero) for accelerators, including tools support. While oneAPI is Intel-centric, it sets a precedent for abstracting profiling: an application could, in theory, use oneAPI to profile code on CPUs, GPUs, and FPGAs with a single tool – but only if other vendors adopt or adapt to it. Another push is in OpenTelemetry for hardware – currently OpenTelemetry (popular in cloud environments for tracing requests) doesn't cover internal hardware events, but conceivably it could be extended so that trace spans cover CPU, accelerator, and network events for distributed tracing of heterogeneous workloads.

  • Integration of AI in Performance Analysis: With the complexity of performance data (especially from fine-grained traces), there's interest in using machine learning to assist performance analysis. Research projects have looked at learning models to predict performance from partial traces or to automatically classify bottlenecks from counter signatures. For example, an ML model might learn the patterns in counters that indicate memory bandwidth bottleneck vs computation-bound, automating what human experts do manually with roofline models. While not mainstream in tools yet, some profilers (like Intel Advisor's automated roofline analysis, or NVIDIA's Guided Analysis in Nsight Compute) incorporate heuristic guidance that could evolve into ML-driven suggestions. The goal is to help developers interpret the deluge of profiling data more easily.

  • Causal Tracing and Causal Profiling: Traditional profiling observes performance passively. A newer research direction is causal profiling, where the profiler experiments by perturbing execution to gauge impact on performance (e.g., artificially slow down one component to see if others idle, determining causality of bottlenecks). Omnitrace's mention of causal profiling support hints that such techniques are being implemented for GPU/CPU combos. Causal tracing could identify, for instance, that GPU kernel X finishing late is what delays CPU task Y, by seeing how timings change if X is made faster or slower. This is an advanced capability with potential to untangle complex dependencies in asynchronous pipelines.

  • Multi-Accelerator and Distributed Coordination: As systems integrate CPUs + various accelerators (GPUs, DPUs, FPGAs, TPUs, etc.), one research challenge is coordinating profiling across them. How do we get a coherent timeline or profile when parts of the work happen on different chips with their own clocks and trace buffers? Tools like Extrae/Paraver in HPC can merge traces from CPU and GPU by aligning timestamps (assuming synchronized clocks) and allow analyzing them together. We expect further development here, possibly with standard timestamping (PTP – precision time protocol – could be used to sync time between host and DPU, for example). Also, orchestrating triggers across tools – e.g., start a GPU profiler when a network event happens – is being explored. Some current tools allow triggers (Nsight can start/stop based on CUDA API calls or markers); extending this across devices (start GPU trace when DPU's packet queue overflows) could be extremely useful for diagnosing cross-stack performance issues (like a slow GPU causing packet backlog on a SmartNIC).

  • Better Profiling for New Accelerator Types: While this report focuses on GPUs, DPUs, and APUs, the universe of accelerators includes FPGAs, AI ASICs (e.g., Google's TPUs), and more. Each is spawning its own tooling – Xilinx (AMD) FPGAs have the Vivado and Vitis analyzer for hardware kernels, Google TPUs have a profiler in TensorBoard, etc. A clear direction is to bring these together. If a cloud has CPUs, GPUs, DPUs, TPUs all working in tandem, the dream is a single pane of glass to observe them. Industry consortia may eventually collaborate on open standards for accelerator telemetry (analogous to how OpenCL was a standard for compute). Until then, research often steps in: for example, academic work on monitoring FPGAs in datacenters via in-fabric monitors, or using eBPF-like techniques on other devices.

In summary, profiling/tracing research is moving toward making these tools more pervasive, intelligent, and unified. The aim is to reduce the burden on developers to manually instrument and correlate performance data from disparate sources, and instead provide smarter tools that work across the complex heterogeneous systems of today's data centers.

Limitations and Challenges of Existing Toolchains

Despite the plethora of tools discussed, practitioners face several persistent challenges when profiling GPUs, DPUs, and APUs:

  • Fragmentation and Vendor Lock-In: As noted, each vendor's accelerators often come with a siloed toolchain. This means expertise in one doesn't translate easily to another, and mixing hardware leads to multiple tools. It also risks lock-in: optimizations done with a proprietary tool might rely on vendor-specific features. There is no universal standard like "perf" that universally covers all accelerator types (though on CPUs, perf itself is limited to Linux). The lack of common performance counter interfaces across GPU vendors is a prime example – developers must use CUDA-specific or ROCm-specific APIs, making portable performance analysis difficult. For DPUs, which are relatively new, there isn't even a widely adopted third-party profiler – you use whatever the DPU vendor provides. This fragmentation not only complicates life for developers, but also for researchers trying to compare platforms fairly.

  • Limited Visibility ("Black Box" issues): Some aspects of accelerator performance are effectively black boxes to current profilers. For instance, GPU internal scheduling (how warps are scheduled, how L2 cache lines are evicted) might not be fully exposed by counters. Similarly, if a DPU's packet accelerator is an FPGA or ASIC, external tools might only see a throughput number, not why it saturates at X Gbps (e.g., a microarchitecture detail inside the NIC). Even with something like NVIDIA's counters, there are often undocumented metrics or ones that are hard to interpret. This limited visibility is often by design (IP protection), but hampers optimization. Another visibility issue arises in closed-source workloads: if you are profiling a third-party library on the GPU, you might see a kernel name and time but not what it did internally. Tools like HPCToolkit help by attributing costs to instructions even without source, but generally it's hard to optimize what you can't inspect.

  • Overhead vs. Intrusiveness: Many precise tracing tools perturb the very performance they measure. Instrumentation-based tracing (inserting hooks on each kernel launch or each packet) can cause Heisenberg effects where the act of measuring changes timing. While sampling-based methods alleviate this, they trade detail for lower overhead. There is always a challenge to find the right balance: how much data to collect and at what frequency to get a representative profile without drowning in overhead or data volume. High-overhead tools must be limited to test environments, meaning you might not catch issues that only manifest at scale or in production. Conversely, ultra-light monitors might only flag "GPU utilization low" but not explain the cause. Bridging this gap remains an ongoing challenge.

  • Concurrency and Ordering Issues: Profiling multi-threaded, multi-stream workloads can run into issues aligning events. For example, timeline traces from different GPUs or different devices may have clock skew, making it tricky to know the true sequence of events. Even on one GPU, the hardware can execute many kernels concurrently (on different SMs or using async streams), and visualizing or understanding overlapping activities is complex. Tools try (Nsight Systems, for instance, shows overlapping kernels on a timeline), but as systems scale, one confronts what we might call the "explosion of events" problem: too many events to reason about. Filtering and focusing on the right subset is difficult. Tools are only beginning to introduce smarter filtering (like only trace kernels longer than X microseconds, or only trace GPU activity when GPU utilization drops, etc.).

  • Lack of Multi-Accelerator Coordination: Today's tools largely operate in isolation per device. If you have a heterogeneous node with CPU, GPU, DPU, FPGA, each might be profiled separately, yielding separate reports that the user must correlate. Suppose a performance issue is due to a mismatch between GPU throughput and NIC throughput – a GPU profiler might just show the GPU is idle 20% of time (waiting for data), and a NIC monitor shows 100% utilization on a queue – it's up to the engineer to correlate those and deduce the cause (network-bound). Ideally, a profiler would capture such cross-device dependency automatically (e.g., a visual cue that GPU idleness correlates with NIC saturation). Achieving that requires a holistic view and perhaps standardized trace events that can link across devices (like an event on NIC "frame delivered" could be tied to an event "frame processed on GPU"). Without common standards, such correlation is mostly manual or via ad-hoc instrumentation (inserting timestamps in app code).

  • Scaling and Big Data Problems: When profiling large-scale workloads (think 1000 GPUs or a DPU handling millions of packets per second), the volume of profiling data can be enormous. Storing and analyzing trace logs from even a few seconds of operation may be non-trivial. There is a challenge in data reduction – how to summarize performance data meaningfully. Current tools offer some summaries (like average kernel time, top 10 memory consumers, etc.), but more automated summarization is needed, possibly with hierarchical or statistical techniques to condense traces. HPC centers have dealt with this by selective tracing (capturing only on a few ranks, etc.) or using sampling. Future tools might incorporate on-line analysis, where the tool itself does some processing of data as it's collected (for example, computing distributions instead of logging every event).

  • Education and Usability: Finally, it's worth noting a practical challenge: the learning curve. Each tool often comes with its own GUI or output format, and understanding metrics like "warp serialize" or "bank conflict" or "DPU cache hit" requires some architecture knowledge. Performance analysis on these accelerators is somewhat a dark art, and although tools provide data, making sense of it is not always straightforward. Efforts like roofline models are attempts to simplify interpretation, but users still struggle to go from profiler output to concrete optimizations. This is partially an educational challenge – documentation and training need to accompany tools. It's also a design challenge for tool builders to present data in intuitive ways (e.g., Nvidia now often integrates AI performance metrics like "utilization of Tensor Cores" directly, to tell ML engineers how well they used the GPU).

In summary, while current tracing and profiling tools for GPUs, DPUs, and APUs are powerful and essential, they operate in a landscape that is highly specialized and fragmented, with visibility gaps and integration shortcomings. Overcoming these limitations will require collaborative efforts – between hardware vendors (to open up interfaces), tool developers (to create smarter, standard tools), and the research community (to pioneer new techniques for low-overhead and combined tracing).

Conclusion

Tracing and profiling accelerators has become as critical as profiling CPUs was in earlier eras. We now have a broad arsenal of tools tailored to different hardware and use cases: from NVIDIA's Nsight suite and AMD's ROCm tools for deep GPU analysis, to open-source frameworks like HPCToolkit for holistic profiling, to emerging eBPF-based approaches enabling continuous monitoring. GPUs enjoy the most mature tool support (reflecting their longer history in computing), while DPUs and other domain-specific accelerators are catching up with their own nascent profilers and telemetry systems. APUs illustrate the need for tools that can seamlessly profile across traditional processor boundaries, as computing moves toward tightly integrated heterogeneous designs.

This review has highlighted that both open-source and commercial solutions play important roles: open tools foster cross-platform agility and innovation, whereas vendor tools leverage proprietary knowledge for maximum insight on their hardware. The best outcomes often arise from using them in combination. In specialized domains like ML, networking, and graphics, domain-specific profiling capabilities augment general tools to provide the needed perspective (e.g., viewing a GPU timeline in terms of neural network layers, or network throughput in context of CPU cycles).

Looking ahead, the trends point to more integration (unified timelines across accelerators), automation (intelligent analysis), and low-overhead observability becoming standard. Research is actively addressing many current gaps, from standardizing performance metrics to using novel techniques like causal profiling and ML-driven analysis to interpret performance data. At the same time, challenges like vendor lock-in and black-box hardware will require industry collaboration and perhaps a push for more open hardware telemetry interfaces.

In conclusion, developers and engineers aiming to optimize accelerator-powered systems should take a hybrid approach: leverage the rich features of vendor-specific profilers for detailed analysis, use open-source and cross-platform tools to get the "big picture" across diverse hardware, and keep an eye on emerging tools that can be adopted to improve continuous performance monitoring. By combining these tools and techniques, one can obtain a comprehensive understanding of performance for GPUs, DPUs, and APUs across any workload – from a single GPU kernel's instruction stalls up to the end-to-end behavior of an entire heterogeneous pipeline. Such deep and broad profiling capability will be essential to fully exploit the computational power of modern accelerators in general-purpose and domain-specific applications alike.
