Profiling and Tracing Tools Across System Layers and Architectures
Profiling and tracing are complementary techniques for analyzing software performance and behavior. Profiling typically measures where a program spends its time or resources, aggregating data (e.g. CPU time per function, memory usage per module) to identify performance bottlenecks and hot spots. In contrast, tracing records a timeline of events or operations (e.g. function calls, kernel events, network requests) to reconstruct execution flows. Both are motivated by the need to understand and optimize complex systems: using these tools is “vital for understanding program behavior, performance bottlenecks and optimisation potentials”. Common use cases include finding CPU hotspots in code, diagnosing memory leaks or I/O delays, understanding concurrency issues (e.g. lock contention), and tracking requests across distributed services. By applying profiling or tracing, developers and researchers can observe the internal state and performance of software, leading to faster troubleshooting and more efficient, reliable systems.
In modern environments, profiling/tracing spans multiple layers – from low-level CPU and OS events up to application, runtime, and distributed system behaviors. The sections below present a comprehensive catalog of tools and frameworks at each level (covering both open-source and commercial solutions), organized by context. We compare their capabilities across major architectures (x86, ARM, RISC-V, GPUs, TPUs, DSPs, etc.), and discuss the design challenges (overhead, observability, correlation, etc.) that shape current research. Typical scenarios and use cases are highlighted to illustrate why these tools are indispensable – for example, improving end-user experience by optimizing slow code paths, catching sporadic memory leaks via continuous monitoring, or tracing microservice interactions to pinpoint latency in a complex transaction. The goal is to give both a broad overview and deep insights into the state of the art in profiling and tracing technology.
OS-Level Tracing and Kernel Profiling Tools
At the operating system level, tracing tools capture events from the kernel and drivers (e.g. system calls, scheduler context switches, IRQs) to diagnose system-level performance. These tools often need to run with minimal overhead and high privileges, as they interact with the kernel internals. Key OS-level tracers include:
- ftrace (Linux) – The built-in function tracer in Linux, which can hook kernel functions (via static tracepoints or dynamic instrumentation like kprobes) and record events to in-memory buffers. Ftrace is considered “a kernel hacker’s best friend” for debugging; it can trace functions, record event timestamps, and even measure durations, all configurable via the `/sys/kernel/tracing` interface (a small sketch of driving this interface appears after this list). However, ftrace by itself is driven by the kernel and lacks a high-level language for complex logic (it cannot do arbitrary computations on events without additional help). Tools like trace-cmd (a front-end by ftrace’s author) and perf (described below) can consume ftrace data and make it easier to use. Ftrace works on any architecture supported by Linux (x86, ARM, etc.), and support continues to be added for newer architectures such as RISC-V, which now supports kernel probes and tracepoints.
- perf (Linux Perf Events) – The Linux performance events subsystem, accessed via the `perf` tool, is “the official tracer/profiler for Linux users”. It is built into the kernel (no external modules needed) and provides a wide range of capabilities: a sampling CPU profiler, a hardware performance counter interface, a tracepoint collector, and more. For example, `perf` can sample the program counter at intervals to produce a CPU usage profile (which can be visualized as a flame graph). It can also record kernel tracepoint events (similar to ftrace) into a `perf.data` file for post-analysis. Because it runs in-kernel (with efficient buffering) and uses hardware support, perf is relatively low-overhead for profiling purposes. It supports profiling and tracing on multiple architectures – Linux’s perf subsystem has drivers for x86 (with its complex performance monitoring units), ARM, POWER, and more, and as of Linux 5.x even RISC-V has basic perf counter support and eBPF JIT support for dynamic tracing. Perf is generally regarded as a safe, multi-user tool (it enforces permissions for accessing certain events) and is suitable for production use in many cases. On Linux, perf has largely replaced older profiling tools like OProfile. It cannot do everything natively (e.g. custom in-kernel analysis), but it’s a versatile foundation and is rapidly evolving with each kernel release.
- eBPF-based Tools (Linux) – Modern Linux has the extended Berkeley Packet Filter (eBPF) as a core building block for tracing. eBPF is an in-kernel virtual machine that can run user-defined programs safely and efficiently in response to events (with JIT compilation). It has essentially made Linux programmable for tracing: instead of just dumping events, eBPF programs can filter, aggregate, and analyze events in kernel context with minimal overhead. This enabled a new generation of tools. For instance, BCC (BPF Compiler Collection) provides libraries and ready-made scripts to trace events via eBPF (a BCC sketch appears after this list), and bpftrace offers a high-level language to write tracing one-liners (inspired by DTrace’s syntax). These can attach to kernel tracepoints, kprobes (dynamic function probes), and even user-level probes, capturing everything from filesystem latency to TCP packet drops. The advantage of eBPF is that it runs in-kernel without needing custom modules: it’s “a unified tracing interface for both kernel and userspace”, using dynamic instrumentation (kprobes/uprobes) so you don’t need to rebuild your kernel or apps to add tracepoints. eBPF’s safety is enforced by the kernel verifier to prevent crashes or unbounded loops. Over the past few years, eBPF has grown to cover not just tracing but also networking, security monitoring, and more, making it a broad observability technology. Tools like BCC and bpftrace ship many pre-built tracing tools (profilers, syscall tracers, etc.) and allow custom scripts. Major cloud vendors and monitoring systems (e.g. Pixie, Datadog) leverage eBPF under the hood for low-impact production tracing. eBPF is supported on all major architectures in Linux now (x86_64, arm64, s390x, etc., and recently the RISC-V JIT was marked “OK” in Linux 6.x).
- LTTng (Linux Trace Toolkit Next Generation) – LTTng is an open-source tracer designed for efficient, low-overhead tracing of both kernel and user space. It uses static instrumentation points: the kernel module hooks into static tracepoints (and can also hook dynamic events), and user applications can be compiled with USDT (Userland Statically Defined Tracepoints) that LTTng can collect. LTTng records events to binary trace files which can be analyzed with tooling (e.g. the Trace Compass GUI). Its design emphasizes minimal disturbance to the system – Mathieu Desnoyers (the author) has published benchmarks showing very low per-event logging latency. LTTng requires loading a kernel module on Linux (unlike perf, which is built in, or eBPF, which is native to the kernel). This means a bit of setup, but it also means LTTng doesn’t have the safety guardrails that eBPF has (a buggy LTTng probe could theoretically crash the system). LTTng is targeted purely at tracing (not profiling aggregates) and is often used in scenarios like tracing rare concurrency bugs or timing issues, or in telecom/embedded Linux where a lightweight trace of many events is needed. Companies have successfully used LTTng for production tracing where extremely low overhead and high-throughput logging were required. Compared to eBPF, LTTng’s scope is narrower (focused on trace collection rather than general in-kernel programming), but it is a mature solution with support for multi-core systems and high-frequency events. It supports multiple architectures as long as the kernel module is available; it is commonly used on x86 and ARM Linux systems.
- SystemTap – SystemTap is another powerful tracer for Linux, providing a scripting language to write custom instrumentation (SystemTap scripts are compiled into kernel modules). SystemTap can tap into tracepoints, kprobes/uprobes, and even perform programmable analysis in-kernel (it was effectively doing some of what eBPF does now, but by compiling to modules). It is often described as “the most powerful tracer… [it] can do everything: profiling, tracepoints, kprobes, uprobes, USDT, in-kernel programming, etc.”. However, SystemTap’s kernel-module approach historically had stability issues (kernel panics in early versions), and it requires kernel debug symbols for full use, which can be a barrier. SystemTap remains a strong tool for experts who need deep custom tracing on Linux (especially on systems where eBPF may not be available, or when doing extremely advanced probing). It is used in some enterprise environments, but its adoption has been somewhat superseded by eBPF-based frameworks that are easier to deploy. Like LTTng, SystemTap is out-of-tree; it works on the major architectures it supports, but needs specific tapsets and often root access with debug info.
- DTrace – Originally developed for Solaris, DTrace is a dynamic tracing framework that allows powerful scripted probes on both the kernel and user land. DTrace was revolutionary in the mid-2000s for enabling safe, on-the-fly tracing of production systems. It has since been ported to other OSes: FreeBSD and Mac OS X (and an older port to Linux by Oracle, though Linux support remains limited compared to native tools). DTrace provides a high-level language (D) to specify probes and actions, with built-in aggregations. For example, one can write a D script to count which functions are called most often or to trace every file open with its timestamp. It is heavily used on Solaris systems and was integrated into macOS (though Apple’s recent versions favor their own Instruments and have restricted DTrace usage under SIP). DTrace is known for a large toolkit (the DTraceToolkit by Brendan Gregg) of ready-made scripts for common problems (CPU, disk, networking, etc.). In terms of architecture support: DTrace works on Solaris (SPARC, x86), and ports exist for x86 Linux and macOS (x86 and Apple ARM). It is “not a profiler but an excellent tracing tool” useful for real-time diagnosis of live systems. On Linux, eBPF/bpftrace has largely taken the role DTrace played, but DTrace’s influence is seen in all modern tracers.
- Windows ETW (Event Tracing for Windows) – On Windows, the analog of kernel tracing is ETW, a built-in high-performance tracing framework. The Windows kernel and drivers have many instrumented events (for scheduling, I/O, registry access, etc.), and user applications or frameworks can log events to ETW as well. These events can be captured via tools like Xperf/WPA (Windows Performance Analyzer) or PerfView (for .NET), giving a timeline view of system and application events. ETW is extremely efficient (it uses per-CPU buffers and binary event schemas defined by providers), efficient enough that it is used in production for diagnosing issues. For profiling, Windows also provides Performance Counters and sampling via tools like the Visual Studio Profiler. While its ecosystem is not as open as Linux’s, Windows has a solid tracing infrastructure for those in that environment. ETW works on x86 and ARM editions of Windows (e.g. Windows 10 on ARM can still log ETW events). Windows also has KD/WinDbg trace capabilities, and Intel VTune supports Windows for low-level profiling on Intel CPUs.
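As a concrete illustration of the ftrace interface mentioned above, the following hypothetical Python sketch drives the function tracer directly through tracefs. It assumes root privileges and that tracefs is mounted at /sys/kernel/tracing; in practice one would usually use trace-cmd or perf instead.

```python
# Hypothetical sketch: enable ftrace's "function" tracer for one kernel function
# and stream a few events from trace_pipe. Requires root; assumes tracefs is
# mounted at /sys/kernel/tracing.
TRACEFS = "/sys/kernel/tracing"

def tracefs_write(name: str, value: str) -> None:
    with open(f"{TRACEFS}/{name}", "w") as f:
        f.write(value)

tracefs_write("current_tracer", "function")     # select the function tracer
tracefs_write("set_ftrace_filter", "vfs_read")  # only trace vfs_read()
tracefs_write("tracing_on", "1")                # start recording

with open(f"{TRACEFS}/trace_pipe") as pipe:     # consume events as they occur
    for _ in range(20):
        print(pipe.readline(), end="")

tracefs_write("tracing_on", "0")                # stop recording
```

Similarly, here is a minimal, hypothetical sketch using the BCC Python bindings referenced above: it counts calls to the kernel's vfs_read() per process with a kprobe, aggregating entirely in kernel context. It assumes the bcc package and kernel headers are installed and that it runs as root; the probe target (vfs_read) is just an illustrative choice.

```python
# Hypothetical BCC sketch: attach a kprobe to vfs_read() and count calls per
# PID in an in-kernel hash map, then read the aggregated map from user space.
import time
from bcc import BPF

prog = r"""
BPF_HASH(counts, u32, u64);                    // in-kernel map: PID -> count

int trace_vfs_read(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);                     // aggregation stays in the kernel
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="vfs_read", fn_name="trace_vfs_read")

time.sleep(5)                                  # let the probe collect data

top = sorted(b["counts"].items(), key=lambda kv: kv[1].value, reverse=True)[:10]
for pid, count in top:
    print(f"pid {pid.value:6d}: {count.value} vfs_read() calls")
```

Both sketches share the pattern stressed above: event generation (and, in the eBPF case, aggregation) stays in the kernel, and user space only reads out compact results.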
Comparison: In summary, OS-level tracing tools range from built-in facilities (ftrace, perf, ETW) to add-on frameworks (LTTng, SystemTap, DTrace). Linux’s perf and eBPF represent the current state-of-the-art in combining low overhead with flexibility – perf for general profiling and event collection (sampling-based) and eBPF for programmable custom tracing. Research shows that dynamic instrumentation via eBPF provides a powerful alternative to static instrumentation, avoiding the need to recompile kernels or applications. Table 1 compares a few key OS-level tracers on select criteria:
| Tracer | Instrumentation | Overhead | Platforms | Notes |
|---|---|---|---|---|
| Linux perf | Static & dynamic (tracepoints, sampling, HW counters) | Low (in-kernel buffering, sampling at configurable Hz rates) | Linux (x86, ARM, RISC-V, etc.) | Built-in profiler/tracer; good for CPU profiles and hardware events. |
| Linux ftrace | Static & dynamic (tracepoints, kprobes) | Low-Med (~microseconds per event) | Linux (most architectures) | In-kernel trace framework, single-user focus; raw but powerful (trace-cmd as front-end). |
| eBPF/BCC/bpftrace | Dynamic (kprobes/uprobes, perf events) | Low (JIT compiled, runs in kernel) | Linux (x86, ARM, etc.; eBPF JIT on RISC-V by 2023) | Safe in-kernel programmable tracing; broad use (also networking). |
| LTTng | Static (kernel module + static tracepoints; user USDT) | Very low (designed for minimal perturbation) | Linux (x86, ARM; kernel 2.6+ with module) | High-throughput tracing to disk; often used in production debugging. |
| SystemTap | Dynamic (module with probes & scripts) | Low-Med (compiled code in kernel) | Linux (x86, ARM, others with debuginfo) | Powerful (can trace anything, run computations) but requires care (a buggy module can crash the system). |
| DTrace | Dynamic (provider-based probes, D scripts) | Low (ring buffers in kernel) | Solaris, FreeBSD, macOS; Linux (limited) | Pioneering tracer with a safe design; not native on Linux, but its concepts carried into eBPF. |
| Windows ETW | Static (instrumented events from registered providers) | Low (binary logging, asynchronous) | Windows (x86, ARM) | Used for system-wide tracing via tools like WPA; integrates with .NET, etc. |
Table 1: Comparison of OS-level tracing tools (kernel-oriented). All aim for minimal interference, but their mechanisms differ (static vs dynamic instrumentation, in-kernel analysis vs post-processing). Modern Linux leans toward in-kernel dynamic tracing with safety (eBPF), whereas other systems rely on either static tracepoints (Windows, LTTng) or loading custom modules (SystemTap, DTrace on some OS).
Application-Level Profiling (User-Space Performance Tools)
Moving up the stack, application-level profiling tools focus on measuring performance within user-mode software: how much CPU time each function consumes, memory allocations, I/O operations, etc. These tools often instrument or sample the application code. Unlike OS tracers that see system events, application profilers attribute costs to specific lines or functions in the program, helping developers optimize their code.
Key categories and examples include:
- CPU Profilers (Statistical): These use periodic sampling to attribute execution time to program locations with low overhead. For example, the classic GNU `gprof` (from Binutils) uses a timer interrupt to sample the program counter and runtime instrumentation to count function calls. Programs compiled with `-pg` generate a profile at exit showing which functions used the most time. Modern successors use system profilers: on Linux, one can use `perf` on a single process or its threads, or Google’s CPU profiler in gperftools (for C++), etc. Sampling profilers are popular because they impose very little overhead (they don’t measure every single call, just sample, say, 1000 times per second). Research and practice show that capturing 1000 samples/sec imposes negligible slowdown yet yields a representative profile of hotspots (a minimal sketch of this technique appears after this list). Many runtime environments (JVM, Go, etc.) have sampling profilers built in (e.g. Go’s pprof, discussed later). On Windows, tools like the Visual Studio Profiler or AMD’s CodeXL (when it existed) could do sampling. The advantage is that these can often run continuously in production. For instance, Google’s production continuous-profiling system (Google-Wide Profiling) uses sampling across thousands of machines to build profiles with very low overhead. Continuous profilers (like Google’s and open-source efforts such as Parca) allow always-on insight into CPU and memory usage over time, a big shift from ad-hoc profiling runs.
- CPU Profilers (Instrumenting/Tracing): These record every function entry/exit or every call, providing precise measurements at the cost of high overhead. Examples: gprof in instrumentation mode (on some systems it inserts call counters), function tracing with tools like DTrace’s function-boundary probes, or Intel’s Pin (a dynamic instrumentation framework). Intel’s VTune Amplifier, a commercial profiler, can instrument code to measure microarchitectural events and provides very detailed per-function metrics. Instrumentation gives complete call counts and exact timings but can slow programs dramatically – e.g. Valgrind’s callgrind (an instrumentation-based profiler) can impose a 30x or greater slowdown in execution. Because of this, full instrumentation is usually used in development builds or short runs, not in production. There are some hybrid approaches, like Intel VTune’s user-mode sampling, which interrupts threads at high frequency and unwinds their stacks – this gives nearly full call graphs with lower overhead. Another approach is tracing profilers that log each function call to a buffer (for example, Chrome’s built-in tracer or Java’s old -Xtrace). These produce a trace of execution which can be visualized (e.g. flame charts or sequence diagrams), but again, overhead and data volume are concerns.
- Memory and I/O Profilers: Tools focusing on memory usage (heap profiling, leak detection) or I/O patterns. Valgrind’s Memcheck is a famous memory-error detector and leak checker – it uses binary instrumentation to intercept every memory access and allocation, finding leaks or invalid uses. Memcheck can slow a program by ~10x or more, but it provides invaluable insight for C/C++ programs (checking each memory reference). There are faster heap profilers, like Google’s tcmalloc heap profiler, which samples allocations to record a subset and approximate memory usage by call stack with low overhead. Another example is Darshan for HPC I/O profiling, which intercepts file I/O calls in MPI applications to summarize I/O behavior. For disk I/O tracing on a single system, one might use OS tools (like strace or eBPF) to log calls, but user-space I/O profilers focus on logical operations (like “which part of the code is doing the most disk writes”). IOProfiler and others exist for specialized needs. Generally, profiling I/O leverages tracing (to see all operations) rather than sampling, because I/O events may be infrequent.
- Dynamic Binary Instrumentation frameworks: These are research/advanced tools that allow writing custom profilers or analyses by instrumenting every instruction or function dynamically. Examples: Intel Pin, DynamoRIO, Frida, and Valgrind’s core. These frameworks let you script “when this function runs, record X” at a very granular level. They enable powerful analyses (memory checking, cache simulation, etc.), but as noted, the overhead is very high (often 10x–50x slowdown). They’re used for detailed analysis in research or for building domain-specific tools (for instance, Pin was used to build cache profilers, data race detectors, etc.). In practice, most users rely on higher-level tools built on these frameworks (like Pin-based profilers) rather than coding directly against them.
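To illustrate the statistical sampling idea described in the first category above, here is a minimal, hypothetical sketch of a self-sampling profiler for a single-threaded Python program. It is Unix-only (it relies on SIGPROF/ITIMER_PROF), samples only the main thread, and the busy() workload is a made-up hotspot; real samplers such as perf or py-spy capture stacks from outside the process instead.

```python
# Hypothetical self-sampling profiler: a SIGPROF timer fires ~100 times per
# CPU-second and the handler records which function was executing; hot code
# dominates the sample counts. Unix-only, single-threaded, illustration only.
import collections, signal, traceback

samples = collections.Counter()

def on_sample(signum, frame):
    leaf = traceback.extract_stack(frame)[-1]          # innermost frame
    samples[f"{leaf.name} ({leaf.filename}:{leaf.lineno})"] += 1

signal.signal(signal.SIGPROF, on_sample)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)       # ~100 Hz of CPU time

def busy():                                            # made-up hotspot
    return sum(i * i for i in range(2_000_000))

for _ in range(20):
    busy()

signal.setitimer(signal.ITIMER_PROF, 0, 0)             # stop sampling
for location, n in samples.most_common(5):
    print(f"{n:5d} samples  {location}")
```

The key property is that the cost scales with the sampling rate rather than with the number of function calls the program makes, which is why sampling profilers can stay within a few percent of overhead.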
Notable tools and frameworks:
- GNU gprof & gprofng: The original gprof (GNU Profiler) is a basic tool that instruments function entries (via the compiler) and uses timer interrupts to sample, producing a flat profile and call graph at program end. It is somewhat outdated (it doesn’t handle multithreading well, etc.). gprofng (“next generation”), introduced recently, is a newer GNU tool that supports multi-threaded apps and multiple languages (C/C++/Java) and has a GUI. gprofng can profile on Linux and is being extended to support more event types – it has even been suggested as a profiling solution for RISC-V Linux in the absence of vendor tools. While not yet as popular as other tools, it represents a push toward modern open-source application profilers.
- Intel VTune Profiler: A commercial tool supporting primarily Intel architectures (CPU, GPU, FPGA). VTune can do both sampling (using hardware PMUs to profile CPU, memory, and thread contention) and instrumentation (to measure specific code blocks). It is very powerful for low-level performance tuning, providing insights like CPU pipeline stalls, cache misses, and vectorization efficiency. VTune supports Linux and Windows and can profile apps on x86 CPUs and Intel GPUs; it also has limited support for ARM/Linux in recent oneAPI releases. For x86 CPU code, VTune is unmatched in detail – but as a commercial product, not everyone has access. It is used heavily in HPC and performance engineering. Analogous vendor-specific profilers exist too (AMD’s old CodeXL, which included CPU profiling and has been replaced by AMD uProf for CPUs, and NVIDIA Nsight for GPUs – see the next sections).
- OProfile (legacy): An older system-wide profiler for Linux that sampled CPU performance counters across the whole system. It predates perf and has now largely been replaced by it, since perf subsumed its functionality in Linux 2.6+. OProfile could profile kernel and user code by sampling, without needing instrumentation, and is today mostly of historical interest. Similarly, on Solaris, tools like `trapstat` and the older `prof(1)` existed.
- PerfInsights, Windows Performance Analyzer: On Windows, beyond ETW, developers use Visual Studio’s profiling tools or the Windows Performance Toolkit (WPT). WPT’s Xperf can sample CPU and trace OS events, and the WPA GUI allows correlating them. For .NET applications, PerfView (an open-source tool from Microsoft) is widely used: it collects ETW events, including CLR events, to profile .NET code (CPU samples, GC pauses, lock times). We will discuss runtime-specific tools in the next section. There are also specialized user-space profilers for graphics (like NVIDIA’s Nsight for DirectX/OpenGL, which profiles at the API level), but those fit into the architecture-specific discussion.
Application Profiling vs OS Tracing: It is worth noting that the boundary is blurred – many OS-level tools also profile apps (e.g. `perf` profiles user code, and DTrace can aggregate user stacks). The difference is that application-level profilers attribute cost within the program’s source (function names, lines), whereas OS tracers often deal with system-centric events. In practice, one often uses them together: e.g. `perf` might show that a function is hot, then one could instrument it further with manual logging or use Valgrind to dig into memory usage. The trend, however, is toward integrated observability – as evidenced by Google’s continuous profiler, which ties into their monitoring, or products that combine traces and profiles. In fact, linking profiling data with tracing data can greatly ease diagnosis: for example, Grafana’s recent continuous profiling can automatically link a slow request trace to a flame graph of CPU time for that request. This combined approach is becoming more common.
Runtime and Language-Specific Profilers
Many programming language runtimes and frameworks provide their own profiling and tracing facilities, tailored to the abstractions of that environment (e.g. objects, managed memory, JIT compilation). We cover a few major ones:
- Java (JVM) – The Java Virtual Machine has a rich set of profiling tools. Historically, Java could be profiled with JVM TI (Tool Interface) agents or the older HProf, but the modern approach is Java Flight Recorder (JFR). JFR is a built-in low-overhead profiler and event recorder in the JVM. It “is integrated into the JVM and causes almost no performance overhead…even in heavily loaded production” (typically <1% overhead). This is achieved by highly efficient logging to thread-local buffers and binary encoding of data. JFR can continuously record thread CPU samples, lock contention, garbage collection events, etc., to a circular buffer, and you can dump the data on demand if an issue occurs (much like a black box). It is designed for always-on production use, something traditional profilers couldn’t do. JFR’s data can be analyzed in Java Mission Control (JMC) to see flame graphs, GC pauses, allocation hotspots, etc. Aside from JFR, Java developers also use sampling profilers like Async Profiler (which uses perf events under the hood to sample Java stacks with minimal safepoint bias) for more detailed CPU profiles, and various commercial tools (YourKit, JProfiler) that instrument or use JVMTI to get method timings (these can be heavier). The JVM also exposes method entry/exit tracing hooks (via JVMTI) and logging for things like JIT compiler decisions (for advanced tuning). But JFR is the flagship – its low overhead stems from careful design (e.g. events are aggregated in binary form and post-processed; there are thresholds to ignore short events, etc.). This shows how runtime-specific knowledge (the JVM knows its threads, GC, etc.) can enable efficient profiling targeted to that environment.
- .NET (CLR) – The .NET runtime similarly provides profiling APIs and ETW events. Tools like PerfView use ETW under the hood to collect .NET events – for example, CPU samples with .NET managed stack traces, or GC collection durations, JIT events, and exceptions. This gives a timeline of what the CLR is doing. Microsoft Visual Studio includes profilers (sampling and instrumentation) for .NET code as well, and JetBrains offers dotTrace for deep dives (it can attach and record detailed call counts). In .NET Core, there is an EventPipe mechanism and a tool, `dotnet-trace`, to collect runtime events cross-platform (similar to ETW but not tied to Windows). Generally, .NET relies on sampling as well for low overhead – one can run a continuous CPU profiler in production via the OpenTelemetry .NET auto-instrumentation, which uses EventPipe to sample. Memory profiling in .NET can leverage built-in GC stats or heap dumps. The CLR Profiling API allows building custom profilers that intercept every method call or allocation (but this incurs significant overhead and is used mainly by specialized tools or APM agents in sampling modes). Microsoft’s documentation often encourages ETW-based tracing because it is highly optimized. So .NET’s story is an interplay of OS-level events (ETW) and runtime instrumentation. An interesting development is that .NET’s profiling APIs allow attaching profilers after startup (unlike older versions, which required launching the process under a profiler), giving more flexibility to profile long-running services on demand.
- Python – Python, being an interpreted language, has its own profiling modules. The built-in cProfile (and profile) modules use instrumented function calls – essentially, every function entry/exit in Python can invoke a callback in the profiler, allowing it to count calls and time spent. This is easy to use but can slow Python code drastically (5x slowdown or more), especially because Python is already not very fast. For smoother results, developers use sampling profilers for Python. One popular one is Py-Spy, which runs as a separate process and samples the Python program’s call stacks by reading its memory. Py-Spy is written in Rust and uses OS APIs to suspend the process’s threads and walk the stack, so it can profile Python code with negligible overhead while the program runs normally. Py-Spy’s sampling (100 Hz by default) doesn’t require modifying the code or interpreter, making it safe for production use. Others, such as Austin or Pyinstrument, are similar low-overhead samplers. There is also Scalene, a profiler that uses sampling and bytecode counters to profile CPU and memory usage in Python, trying to attribute time either to Python code or to native code (useful in mixed Python/C scenarios). For tracing long-running Python programs (e.g. web servers), some APM tools use instrumented sampling – periodically interrupting the interpreter to record what each thread is doing, then aggregating. Python’s logging can also serve as tracing (for instance, enabling debug logs in Django to trace each query). The challenge with Python profiling is the GIL (global interpreter lock) – only one thread runs Python bytecode at a time, so CPU profiling is easier (one active thread), but it also means multithreaded programs have many idle threads that need consideration. Py-Spy cleverly looks at the GIL owner to know which thread to profile. In sum, Python developers often start with cProfile for quick checks (e.g. in development, to get a simple report – see the cProfile sketch after this list), but for real-time or production profiling, sampling tools like Py-Spy are preferred due to their minimal impact on the app’s performance.
- Other Languages:
  - Go: The Go language has pprof built in – it can sample CPU, track allocations, etc., with very low overhead (using OS timers and stack unwinding in the runtime). Go’s philosophy embraces continuous profiling – it’s common to run a Go program with the pprof HTTP endpoint enabled, so you can fetch a profile at any time. This has led to Go being at the forefront of always-on profiling in production.
  - JavaScript/Node.js: V8 (the JS engine) has profiling hooks; Node can be run with `--prof` to produce a tick-based CPU profile. There are also diagnostic reports and tracing in Node (the Chrome DevTools inspector protocol can collect performance data). In production, Node profiling is tricky due to the event loop – often people use Linux perf or, again, eBPF to sample across native and JS frames (with some support for translating JITed code symbols).
  - Ruby, PHP, etc.: These dynamic languages have their own profilers (e.g. XHProf for PHP; rbtrace or StackProf for Ruby, the latter doing sampling). Many are inspired by the same techniques – either instrument every function (slow) or sample regularly (fast).
  - Managed runtime tracing: Beyond profiling for performance, many runtimes also offer tracing of requests/transactions at a higher level. For instance, Java has logging frameworks with tracing, .NET has `System.Diagnostics.Activity` for distributed tracing, and Python has decorators for tracing calls. These feed into distributed tracing systems (discussed next). Language-specific APM (Application Performance Monitoring) libraries instrument common frameworks (like web request handlers and database calls) to emit traces. This blurs into the distributed tracing topic, but it is worth noting under the runtime context: a lot of observability code is language-specific (instrumentation libraries that hook into the runtime or framework to emit events).
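As a small illustration of the deterministic (instrumenting) approach that cProfile takes, described in the Python item above, the following sketch profiles a deliberately inefficient function and prints the top entries by cumulative time; slow_concat is just a made-up example workload.

```python
# Hypothetical example: deterministic profiling with the built-in cProfile,
# then printing the ten most expensive entries by cumulative time.
import cProfile, io, pstats

def slow_concat(n):
    s = ""
    for i in range(n):
        s += str(i)              # quadratic string building: an easy hotspot
    return s

profiler = cProfile.Profile()
profiler.enable()
slow_concat(50_000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats(pstats.SortKey.CUMULATIVE).print_stats(10)
print(out.getvalue())
```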
Distributed Systems Tracing (Microservices Observability)
In distributed systems (microservices, cloud services), the performance of a single process is only part of the picture – a single user request might flow through dozens of services. Distributed tracing tools are designed to follow a request across process and network boundaries, providing an end-to-end timeline of what happened in a transaction. This is crucial for microservice architectures where slowdowns or errors need to be traced to the responsible service or communication link.
The standard approach is to instrument each service to emit trace spans (with a trace ID propagated through calls). Each span records the service name, operation, start and end time, and metadata (e.g. customer ID or payload size). A trace backend collects these spans and reconstructs the trace graph of the request. Key technologies and tools include:
- OpenTelemetry – This is a CNCF-backed open standard and toolkit that has become the umbrella for all things tracing (and also metrics and logs) in cloud-native systems. OpenTelemetry provides language SDKs to instrument applications (or auto-instrument them via hooks for common frameworks), generating telemetry data (a minimal SDK sketch appears after this list). Crucially, OpenTelemetry is vendor-agnostic: it defines the format for traces, metrics, etc., and you can export that data to various backends. It essentially covers the instrumentation and data-collection aspect. OpenTelemetry encompasses what was formerly OpenTracing and OpenCensus. By using OpenTelemetry APIs, developers can mark spans in their code (like “service A received request”, “called service B”, “made DB query”) and the library handles context propagation (passing trace IDs to downstream calls). OpenTelemetry also defines resources and attributes to standardize common data (HTTP method, status codes, etc.). It supports all major languages and is considered the future-proof way to add tracing to an app.
- Jaeger and Zipkin – These are popular open-source distributed tracing backends. Jaeger, originally by Uber, and Zipkin, originally by Twitter, both store and visualize traces. For example, Jaeger provides a UI to search traces by operation or duration and view a Gantt chart of the services involved. Jaeger and Zipkin have their own wire formats, but nowadays both can ingest OpenTelemetry data (OpenTelemetry can export to Jaeger or Zipkin format). Importantly, these backends rely on sampling of traces (since logging every request might be too much data). Typically, a sampler might keep, say, 1% of traces or adjust the rate dynamically. Jaeger focuses purely on tracing (no metrics or logs integration in the UI). It provides a simple UI and uses Cassandra or Elasticsearch as storage backends for trace data. Zipkin is similar, often using Elasticsearch or MySQL for storage. Many cloud vendors also offer managed equivalents (AWS X-Ray, etc.). In comparison: “Jaeger is specialized in distributed tracing, while OpenTelemetry covers generating all telemetry (logs, metrics, traces)”. One might use the OpenTelemetry SDK in apps and Jaeger as the backend to view traces.
- SigNoz, LightStep, etc.: Newer tools build on OpenTelemetry – for instance, SigNoz is an open-source alternative to Jaeger that provides a more advanced UI and also shows metrics. LightStep (since acquired by ServiceNow) was a pioneer in high-scale tracing (focusing on sampling the most interesting traces). There is also OpenTracing (now merged into OTel), which was an API standard, and OpenCensus (also merged), which had libraries and a collector.
- Trace Context and Propagation: A critical aspect is passing trace IDs over the network. This is standardized by W3C Trace Context now (so a service adds an HTTP header with trace-id, span-id, etc.). Without this, each service’s logs or traces would be siloed. All major tracing systems adhere to a propagation format so that, for example, a Node.js service can call a Go service and the trace still links up.
- Metrics and Logging Integration: Distributed tracing is one pillar of observability (often cited alongside metrics and logs). While traces show one example flow, metrics (like request rate, error rate) show aggregated behavior, and logs capture details. Tools like OpenTelemetry encourage collecting all three. For instance, an OpenTelemetry SDK might also record a span’s duration as a metric or allow linking logs to the current trace context. The industry is moving toward unifying these – e.g. Splunk, Datadog, New Relic offer one agent that gathers logs, metrics, traces. This helps in correlation: you can go from an anomaly in metrics to a relevant trace to detailed logs.
- Distributed Profiling: A nascent area is applying profiling in distributed contexts. Companies like Google and Facebook have done continuous profiling across fleets (Google’s aforementioned Google-Wide Profiling samples all servers’ stacks to find system-wide hotspots). More directly, projects like Parca (Polar Signals) aim to have an agent on every node doing CPU sampling and storing profiles over time (as a sort of fourth pillar of observability). Grafana’s Phlare (in preview) does something similar. These profiles are not exactly “traces”, but they can be linked with traces by context (e.g. capturing which service and which pod). Some APM vendors now let you go from a service trace to the profiles of that service during that time – this is powerful for diagnosing, say, that a certain request is slow because a particular function is running excessively on the CPU.
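To show what span instrumentation looks like in practice, here is a minimal, hypothetical example using the OpenTelemetry Python SDK (the opentelemetry-api and opentelemetry-sdk packages). It exports spans to the console for simplicity; a real deployment would configure an OTLP or Jaeger exporter instead, and the service and attribute names are made up.

```python
# Hypothetical sketch with the OpenTelemetry Python SDK: two nested spans,
# exported to the console. Service and attribute names are made up; a real
# setup would use an OTLP/Jaeger exporter and auto-instrumentation libraries.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def handle_checkout(order_id: str) -> None:
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge-card"):
            pass  # call the payment service here; the trace context would be
                  # propagated downstream via W3C traceparent headers

handle_checkout("A-1001")
provider.shutdown()   # flush any spans still buffered in the batch processor
```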
In practice, a typical cloud-native observability stack might use: OpenTelemetry SDK for instrumentation, Jaeger (or SigNoz, etc.) as the tracing backend, Prometheus for metrics, and ELK or Loki for logs. Emerging all-in-one solutions and standards are trying to reduce the burden of running many tools.
One challenge in distributed tracing is sampling and data volume. Capturing every trace in a high-traffic microservice system can produce enormous data. So smart sampling is used (e.g. capture all traces for requests that had an error or were extremely slow, but sample 1/100 of the rest). Another challenge is overhead: instrumenting each inter-service call adds some latency (usually negligible, a few microseconds) and CPU work to log the event. OpenTelemetry has been designed to be efficient, and in-process span creation is typically very fast, but network export of spans is often done asynchronously to minimize impact.
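The sampling policy just described can be sketched as a simple decision function; the thresholds and rate below are illustrative assumptions, and real systems (e.g. tail-based samplers in collector pipelines) apply such policies only after a whole trace has been assembled.

```python
# Hypothetical tail-based sampling decision mirroring the policy described
# above: keep every trace that errored or was slow, and ~1% of the rest.
import random

def keep_trace(duration_ms: float, had_error: bool,
               slow_threshold_ms: float = 1000.0, base_rate: float = 0.01) -> bool:
    if had_error or duration_ms >= slow_threshold_ms:
        return True                        # always keep "interesting" traces
    return random.random() < base_rate     # uniform sample of the remainder
```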
Use cases: Distributed tracing is essential for root cause analysis in microservices. For example, if an e-commerce “checkout” request involves 5 services and the user sees a timeout, a trace can show that service #3 was slow and that within #3, a database query took most of the time. It’s also used for dependency analysis (understanding which services talk to which) and performance optimization across services (finding the longest critical path in a transaction). In scheduling or orchestration, tracing can even help visualize how jobs move through a system.
Distributed tracing doesn’t replace local profiling – they serve different scopes. In fact, as mentioned, the trend is to combine them: use tracing to find where in the system a slowdown occurs (which service, which endpoint), then use profiling to dig into why that service is slow (which line of code or function). This “zoom in” approach is being embraced in modern performance engineering.
Embedded Systems and RTOS Tracing
Embedded systems (including those running on microcontrollers, DSPs, and real-time operating systems) have their own profiling/tracing needs and constraints. These systems often have limited resources (CPU, memory) and sometimes real-time constraints that make intrusive instrumentation risky. Tools in this domain focus heavily on timeline tracing (recording task scheduling, interrupts, etc.) to debug real-time behavior, and utilize both software and hardware methods.
- RTOS Tracing Tools: Many RTOSes (FreeRTOS, ThreadX, Zephyr, etc.) provide trace hooks to log context switches, ISR events, and user-defined events. A leading example is Percepio Tracealyzer, a tool that supports FreeRTOS, Zephyr and others. It uses software instrumentation: the RTOS is instrumented with trace points (often via macros) that record events (task start/stop, mutex lock/unlock, etc.) into a buffer. This buffer can be periodically sent out (streaming via USB/serial) or saved for snapshot. Tracealyzer then visualizes a timeline of tasks, CPU usage, and resource conflicts. This kind of RTOS-aware trace is crucial because debugging a concurrent embedded system by only looking at source code is insufficient – “to fully understand the runtime behavior of an RTOS-based system, you need to observe it at the RTOS level…a tracing tool with RTOS awareness provides a timeline that greatly aids debugging, validation, and optimization”. Without it, one is effectively blind to how tasks interleave in time. Tracealyzer and similar tools (Segger’s SystemView, Microsoft’s TraceX for ThreadX/Azure RTOS) give developers a “slow-motion” view of their system’s real-time execution.
- Hardware Trace (ETM, ITM): Many modern microcontrollers and processors have built-in trace hardware. ARM’s CoreSight technology, for instance, includes an ETM (Embedded Trace Macrocell) that can output instruction execution trace, and an ITM (Instrumentation Trace Macrocell) for sending software-defined trace messages and RTOS events out through a debug interface. These hardware traces are non-intrusive – the CPU emits packets on a special debug interface (like JTAG/SWD) that can be captured with a debug probe (e.g. Lauterbach, Segger J-Trace). Hardware trace is extremely detailed (you can capture every branch taken), but it requires specialized tools and can generate huge volumes of data, so typically it’s used sparingly (for hard-to-debug issues or code coverage analysis). It’s noted that hardware tracing produces vast amounts of low-level data and often lacks high-level context like task names. It’s great for deep debugging (e.g. tracking down a memory corruption by seeing every instruction executed up to the crash) and for coverage or profiling in lab settings. However, due to the data volume and need for a probe, hardware trace isn’t usually used in-field or for long-running profiling. Instead, RTOS-level software trace (as above) is used to log just important events. Still, hardware trace has its place: in safety-critical systems, one might use it to verify timing (e.g. that an interrupt handler always executed within deadline by analyzing trace). The overhead of hardware trace on the system is near-zero (since it offloads data externally), but the challenge is managing that data. Often a circular buffer and filtering (like only tracing specific functions) is used.
- Low-overhead RTOS instrumentation: Because embedded CPUs may be slow and have tight real-time budgets, any instrumentation must be efficient. The RTOS trace macros are typically designed to be minimal – e.g. just writing a few integers to a buffer in RAM (a conceptual sketch of such a ring buffer appears after this list). Studies and vendor claims indicate this can be quite low overhead: “RTOS-level tracing on a modern 32-bit MCU requires only a few percent of CPU time” if done intelligently (and only key events are traced). For instance, if you log only task switches and not every single API call, you get a good view for only a 2-5% hit. And if needed, tracing can be turned off in performance-critical builds.
- Embedded Linux: On larger embedded devices running Linux, the same Linux tools (perf, LTTng, eBPF, etc.) are applicable. Often, embedded products use LTTng because it was designed by and for the embedded community (Montreal’s EfficiOS) to have low overhead. Trace Compass (an Eclipse project) is a GUI that can merge kernel traces (e.g. from LTTng) and possibly some userspace traces to show a unified view – helpful in systems like an embedded Linux controlling machinery, to correlate sensor I/O events with kernel scheduling, for example. Additionally, one might use kgdb or JTAG debuggers for low-level profiling if needed. But on resource-constrained embedded Linux, developers prefer capturing traces to uploading gigabytes of logs; hence, tools like LTTng which can filter and reduce data are popular.
- DSPs and specialized processors: DSP cores (like TI’s C6000 series, Qualcomm Hexagon, etc.) often run either bare-metal or a small RTOS. They have vendor-specific profilers – for example, TI Code Composer Studio includes profilers that use hardware counters on the DSP to measure cycles per function. When DSP code is statically scheduled (typical in DSP pipelines), profiling often happens offline or via simulation. Some DSP vendors provide trace units similar to ARM’s, to capture instruction traces. The challenge is again managing overhead: if a DSP is processing real-time data (audio, video), halting it for profiling may break real-time behavior. So methods like statistical PC sampling via debug hardware are used (e.g. sampling the program counter on a running DSP without stopping it, if supported). Also, code coverage and execution profiling via simulators is common in DSP development, since cycle-accurate simulators can tell exactly how many cycles each function uses (but those are offline, not on the device).
- Automotive and Safety Systems: These often use tracing for both debugging and safety monitoring. The AUTOSAR standard even defines some tracing requirements. Tools like Lauterbach’s Trace32 are used in automotive to capture program flow via the Nexus or ETM interfaces for profiling and timing analysis. They can show measured worst-case execution times, which is important for real-time guarantees. Because instrumentation in such systems (say, a braking ECU) may not be acceptable, reliance on external hardware trace is higher.
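As a purely conceptual model of the ring-buffer logging described in the RTOS-instrumentation item above (written in Python only for readability; a real recorder would be a few lines of C writing into a static array), each event is just a timestamp plus a couple of small integers:

```python
# Conceptual model of an RTOS trace recorder's fixed-size ring buffer: each
# event is a timestamp plus two small integers, so logging one costs only a
# handful of instructions on the target (Python is used here only to illustrate).
import time

BUF_SIZE = 1024
EV_TASK_SWITCH_IN, EV_TASK_SWITCH_OUT, EV_ISR_ENTER = 1, 2, 3

trace_buf = [(0, 0, 0)] * BUF_SIZE     # (timestamp_ns, event_code, argument)
head = 0

def trace_event(event_code: int, arg: int) -> None:
    """Record one event, overwriting the oldest entry when the buffer is full."""
    global head
    trace_buf[head] = (time.monotonic_ns(), event_code, arg)
    head = (head + 1) % BUF_SIZE

# An RTOS port would call this from its context-switch and ISR hooks, e.g.:
trace_event(EV_TASK_SWITCH_IN, 7)      # task with ID 7 starts running
```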
In summary, embedded and RTOS profiling/tracing tools prioritize observability of real-time behavior with minimal interference. Whether through software logs or hardware trace, they give developers insight into issues that would be impossible to catch with a debugger or static analysis alone – e.g. sporadic timing glitches, priority inversions, or missed deadlines. The future is likely more integration of these traces with higher-level tools (for instance, bridging an RTOS trace with an IoT cloud backend trace, to see end-to-end from device to cloud). For now, it’s often siloed: embedded engineers use specialized tools (Tracealyzer, etc.) locally, while cloud engineers use OpenTelemetry – but efforts are ongoing to unify these views for full-system observability.
HPC and Data Center Performance Tools
High-Performance Computing (HPC) and large-scale data center systems present unique profiling challenges: applications running on thousands of cores or nodes, using parallel libraries (MPI, OpenMP, CUDA), and the need to optimize at scale. HPC performance tools often distinguish between profiling (collecting aggregated metrics for each function or process) and tracing (collecting time-stamped events for fine-grained analysis). Both are used: a common workflow is to profile first to find hotspots, then trace a smaller run to analyze communication or synchronization issues in detail.
Notable HPC tools and frameworks include:
- HPCToolkit (Rice University) – An open-source toolkit designed for profiling large parallel applications with low overhead. HPCToolkit uses a sampling-based approach for CPUs: each process/thread is interrupted at intervals to sample the call stack, which is recorded. This yields a statistical calling-context profile across the entire run. Uniquely, HPCToolkit can also trace GPU usage and attribute it back to CPU code: it hooks into GPU runtimes (CUDA, HIP, Level Zero) to record events like kernel launches and completions. All this data (CPU samples + GPU event traces) is merged post-mortem to correlate, for example, that function `foo()` on the CPU launched a GPU kernel that took 5 ms, and that time is charged to `foo()` in the profile. HPCToolkit also supports GPU PC sampling – using NVIDIA’s CUPTI to sample the program counters of running GPU warps, yielding an instruction-level profile of GPU kernels. This is important because instrumenting GPU code is expensive (you can’t easily insert probes into massively parallel code without slowing it down 10x). By sampling, HPCToolkit gets insight with low overhead. The tool emits data into an “experiment directory” which one analyzes with its hpcviewer GUI or CLI to see inclusive/exclusive times per function across the whole program, plus GPU metrics. It is vendor-neutral – it works with NVIDIA, AMD, and Intel GPUs via their respective tracing APIs. HPC practitioners like HPCToolkit for its scalability (you can profile a 1000-rank MPI program and slow it only minimally, since sampling might be 100 Hz per process) and for the rich data (combined CPU+GPU profiles, plus memory and I/O data if configured).
- MPI Tracing/Profiling Tools (Scalasca, TAU, Score-P): The MPI (Message Passing Interface) library (used for multi-node parallelism) offers hooks (PMPI and MPI_T) to intercept calls. Tools like Scalasca and TAU use either source instrumentation or PMPI wrappers to record each MPI call (e.g. when rank 5 sends a message to rank 7, log the size and time) – a conceptual sketch of this interception idea appears after this list. This allows analyzing communication patterns, load imbalance, etc. Scalasca (from Jülich) focuses on automated analysis of traces to find wait states (e.g. where processes spent time waiting). TAU (Tuning and Analysis Utilities) is a comprehensive performance system from the University of Oregon that can do profiling or tracing; it supports multiple languages (C/C++/Fortran, Python) and has a large set of plugins. TAU can use sampling or instrumentation (or both). It also integrates with Score-P, a community measurement infrastructure that many of these tools share: Score-P provides common instrumentation, tracing, and data I/O, which tools like Scalasca, TAU, and Vampir can utilize. Vampir is a visualization tool (timeline viewer) often used in conjunction with a Score-P trace. So a typical HPC workflow is: use Score-P to instrument the code (automatically via compiler wrappers), run the app, and get either profiles (summaries) or a giant event trace file – then use Scalasca to detect patterns or Vampir to manually examine the timeline. Because HPC runs can be huge, these tools support selective tracing: e.g., you might profile the whole run to pick out a slow phase, then trace only a subset of ranks or a time window, to limit data. Also, HPC traces are often so large that tools provide automatic aggregation (like merging events, or using statistical tracing). Automatic Program Analysis (APA) in Cray’s tools, for example, profiled first to find hot MPI calls, then traced only those calls in a subsequent run.
- Vendor HPC Tools: Each supercomputer vendor has its own suite. Intel’s VTune and Advisor help with node-level performance (vectorization, memory bandwidth, threading). NVIDIA’s Nsight Systems and Nsight Compute are heavily used when GPU acceleration is involved: Nsight Systems gives a timeline of CPU and GPU across nodes (somewhat like a distributed trace, but for a parallel job), and Nsight Compute gives deep GPU kernel profiles. On IBM systems there were tools like perfexplorer for POWER; on Cray systems, CrayPat (the Cray Performance Analysis Toolkit) was used. CrayPat allowed profiling and tracing and had an interesting “lite” mode that would use some instrumentation to automatically suggest what to trace next. AMD’s new HPC GPUs (Instinct MI series) come with rocProfiler/rocTracer for collecting GPU kernel traces and counters, and AMD’s Linux CPUs can use uProf. These vendor tools are often optimized for their hardware (e.g. VTune can tell you if an Intel CPU is not turbo-boosting due to power limits, and Nsight can show if GPU warps are stalled on memory). The downside is that they might not integrate across the entire heterogeneous system, whereas tools like HPCToolkit or TAU try to cover the whole thing.
- Data center “warehouse-scale” profiling: In large-scale web/datacenter environments (think Google, Facebook), continuous profiling is used (as mentioned, Google’s GWP). Google’s system sampled CPU and other events across all servers continuously, and then aggregated the results per function across the fleet. The benefit is finding inefficiencies that only manifest at scale (e.g. a slight misuse of a library might only be 1% slower, but across 10,000 machines that is a lot of CPU). Facebook and others have similar profilers (often built on Linux perf or BPF). Some publish findings – for example, an OSCON talk described profiling all of Facebook’s servers to identify performance regressions. Cloud providers may also use tracing/profiling on the infrastructure itself, e.g. profiling hypervisor or network stacks in production to tune them. Tools like perf and eBPF are used here because they can be deployed at scale with low overhead. There is also research on sampling at the CPU microarchitecture level in data centers to feed auto-tuners (for example, identifying that many cycles are lost to cache misses and reconfiguring accordingly).
- Performance Counters and Monitoring: HPC and data centers both rely on hardware counters (PMUs). A common library is PAPI (Performance API), which abstracts CPU counters across platforms. Many HPC tools use PAPI underneath to get data like FLOPs, cache misses, etc., from each rank. However, not all counters can be recorded at once (there are limited registers), so HPC tools might run the program multiple times to gather different metrics (or use multiplexing). NVIDIA GPUs’ equivalent counters (through CUPTI) and AMD’s (through ROCm) are also used to get FLOPs, memory throughput, etc. Tools like LIKWID (Lightweight Performance Tools) in HPC allow profiling of specific regions by reading counters directly. These low-level metrics complement the higher-level profiling by answering “why is this code slow? ah, it’s memory-bound due to cache misses”.
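The PMPI-style interception mentioned in the MPI tools item above can be illustrated, loosely, in Python with mpi4py: a wrapper around send records the peer, payload size, and duration of each call into an in-memory trace buffer. This is only an analogy sketch (real tools wrap the C MPI library via PMPI and write compact binary traces); it assumes mpi4py is installed and that the script is launched under an MPI runner with two ranks.

```python
# Loose analogy of a PMPI wrapper, sketched with mpi4py: every traced send is
# logged with its destination, payload size, and duration. Run under an MPI
# launcher with two ranks, e.g.: mpirun -n 2 python trace_send.py
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
events = []                                     # per-rank in-memory trace buffer

def traced_send(payload: bytes, dest: int, tag: int = 0) -> None:
    t0 = time.perf_counter()
    comm.send(payload, dest=dest, tag=tag)      # the "real" MPI call
    events.append((comm.rank, dest, len(payload), time.perf_counter() - t0))

if comm.rank == 0:
    traced_send(b"x" * 4096, dest=1)
elif comm.rank == 1:
    comm.recv(source=0)

for rank, dest, nbytes, seconds in events:
    print(f"rank {rank} -> rank {dest}: {nbytes} bytes in {seconds * 1e6:.1f} us")
```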
Profiling at Scale Challenges: HPC programs can produce huge trace files if every event of every process is logged. E.g., a 4096-core run logging every MPI send can generate gigabytes of trace data in seconds. Tools mitigate this with trace compression, clustering, and parallel analysis. For example, the trace might be post-processed by parallel programs to compute metrics instead of loading it into memory. In large data centers, the challenge is storing continuous profiles or traces – solutions include storing only aggregated data or time-series of metrics extracted from profiles, rather than raw profiles. For instance, Google’s GWP stores histograms of CPU samples rather than every sample event.
Trend: HPC and big data environments are converging with cloud techniques. There is interest in using OpenTelemetry in HPC (for example, to trace an HPC workflow that involves services, or to instrument HPC codes with spans for each phase). Conversely, HPC profiling tools are adopting modern UIs and analysis techniques (like machine learning to detect anomalies in performance). Also, as AI and ML workloads become common in HPC centers, profiling tools are adapting to cover new accelerators (GPUs, TPUs, etc.). In fact, HPCToolkit, TAU, etc., now support profiling GPU-accelerated codes thoroughly, and even libraries like Kokkos (an HPC C++ library) have hooks to tie into profiling tools.
In HPC, performance is often part of correctness – scaling inefficiencies or load imbalance are bugs to fix. Thus, profiling/tracing is part of the development cycle. Data center operators likewise treat continuous profiling as essential for optimizing cost and performance at scale. In short, whether it’s a supercomputer simulation or a fleet of microservices, profiling and tracing provide the feedback loop to tune the system’s efficiency.
Architecture-Specific Considerations (x86, ARM, RISC-V, GPUs, TPUs, DSPs)
The landscape of tools can differ significantly across CPU architectures and specialized processors. Each architecture offers different features for observability (performance counters, trace mechanisms), and tool support varies by ecosystem and vendor. Here we highlight some architecture-specific factors:
- x86 (Intel & AMD CPUs): The x86 architecture (Intel/AMD 64-bit) is very well supported by profiling tools. Intel in particular has invested in hardware features like PEBS and rich PMUs (Performance Monitoring Units with dozens of counters) and Intel Processor Trace (Intel PT). Intel PT can capture the control flow (branches taken) of a program with low overhead by compressing it into trace packets – used primarily for debugging, but it can also be used for profiling code coverage or hotspots when combined with analysis tools. Most open-source tools (perf, OProfile, etc.) originated on x86 and were then ported elsewhere, so they tend to have the most features on x86. For example, Linux `perf` on x86 can use precise event sampling (PEBS), last branch records, etc., whereas on some other architectures these features might be limited. Intel VTune is mainly for x86 and leverages all these features, giving detailed platform-specific metrics (like memory bandwidth utilization on a particular Intel microarchitecture, or vectorization efficiency using the AVX units). AMD x86 CPUs similarly have counters, and AMD’s uProf tool provides access to them on both Linux and Windows. The complexity of out-of-order superscalar pipelines in x86 makes these tools important – e.g., for distinguishing time lost to branch mispredicts vs. memory stalls. On the software side, x86 has the benefit of decades of tooling: even niche tools like cachegrind (in Valgrind) model an x86 cache, and Intel’s Pin is specifically for x86. For Windows, x86 is historically the primary architecture, so all Windows profiling tools (Visual Studio, ETW, VTune for Windows) are robust on x86. In summary, x86 enjoys the richest tooling ecosystem for both user and kernel profiling. Challenges include the sheer volume of data from features like Intel PT – profiling every branch can overwhelm storage, so using such features requires good filtering or sample-based tracing.
- ARM (ARM32, ARM64): ARM, especially 64-bit ARM (AArch64), has become prominent in mobile and even in servers. ARM’s performance monitoring support is solid, though until recently it was not exercised as extensively as Intel’s. Linux perf works on the ARM architecture (both 32- and 64-bit) for sampling and counters. Many eBPF tools also work (provided the ARM kernel has the eBPF JIT, which it does). ARM’s CoreSight provides hardware trace (ETM/PTM for instruction trace, and ITM for software events). On development boards, one can use that with tools like ARM DS-5 (now Arm Development Studio) to trace code. For higher-level profiling, ARM acquired Allinea a few years back: Arm Forge (formerly Allinea MAP) is an HPC tool that profiles codes on ARM and other CPUs, focusing on MPI/OpenMP – it does sampling to give a timeline/profile of an HPC code on ARM-based supercomputers. For mobile (smartphone/tablet), both Android and iOS have profiling tools: Android’s Perfetto/Systrace uses ftrace under the hood to trace apps, and can use CPU counter sampling on ARM SoCs. Qualcomm has tools for their Snapdragon (which has custom ARM cores plus DSPs) – e.g., Snapdragon Profiler to profile the CPU/GPU usage of an app. Apple’s ARM chips are profiled with their Instruments tool (which uses DTrace and custom sampling internally) – for instance, Time Profiler in Instruments samples code on the Apple M1/M2. The challenges on ARM often involve proprietary cores – e.g. if an ARM core implements custom microarchitectural events, only vendor tools know them. Also, enabling hardware trace (ETM) requires specific board support and can be finicky with high-speed trace ports. But overall, ARM support has caught up immensely – with major OS and tool support. We also see efforts like running eBPF on Android (which is ARM), and even Windows on ARM now has some tooling (though not as much as x86 Windows). A challenge compared to x86 is that fewer well-known commercial profilers target ARM, but that is changing as servers like AWS Graviton (ARM-based) become common – now tools like Datadog’s profiler or Google’s Cloud Profiler support ARM instances, usually via eBPF or sampling.
-
RISC-V: As an emerging open ISA, RISC-V's ecosystem is still developing. Linux runs on RISC-V, so Linux-based tools (perf, eBPF, etc.) are being ported. As of Linux 6.x, RISC-V supports the eBPF JIT and core perf events (it has a simple PMU for basic counters). However, advanced features are still sparse – for instance, the set of performance counter events on RISC-V is not yet as rich as on x86/ARM (and tools like PAPI were only recently extended to know about RISC-V events). For tracing, some RISC-V cores (especially those aimed at embedded use) have their own trace modules (for example, the RISC-V Nexus trace specification for certain designs, or various vendor-specific JTAG trace). There is also an initiative to standardize counter access through OpenSBI and the RISC-V performance-monitoring ISA. The GNU gprofng team has explicitly mentioned adding RISC-V support, which suggests it will be among the early user-space profilers available on the platform. For now, RISC-V developers often rely on simulators (such as spike or gem5) to profile, since those can introspect everything. On hardware running Linux, basic profiling works fine; on bare metal, one might use a JTAG probe and read the CPU's machine-mode counters manually. The fact that RISC-V is open could enable interesting research – e.g., customizing a RISC-V core with additional tracing hardware or new counters for particular events – but such extensions are not mainstream yet. In summary, RISC-V is catching up: expect full support in open-source tools soon, but currently it is not as plug-and-play as x86/ARM. Also, with less large-vendor commercial involvement (beyond companies like SiFive), there are few proprietary profilers, which makes open solutions (such as gprofng, or ports of Valgrind) all the more important.
-
GPUs (Graphics Processing Units): GPUs are essential for HPC and AI, and profiling them is a field of its own. Each GPU vendor has its own tools. NVIDIA offers the Nsight suite – Nsight Systems for timeline tracing and Nsight Compute for kernel profiling. Nsight Systems records a unified timeline of CPU threads and GPU events (kernels, memcopies) using NVIDIA's CUPTI interface in the driver; it correlates CPU and GPU via timestamps, showing concurrency and how GPU work overlaps with CPU work. Nsight Compute is more fine-grained – it measures individual GPU kernel launches in detail, often replaying them multiple times to gather all hardware counters (since GPUs allow only a subset of counters per run). This yields metrics such as occupancy, memory throughput, and stall reasons. Its overhead is large (because of kernel replay), so it is meant for offline analysis and tuning of specific kernels. NVIDIA also had tools like Visual Profiler (nvvp) and the older nvprof, but those have been superseded by Nsight. For AMD GPUs, as noted earlier, the ROCm platform provides rocProfiler and rocTracer, plus newer tools such as Omniperf (for deep dives on kernels) and Omnitrace (which does CPU+GPU tracing, akin to Nsight Systems). Omnitrace is notable for being open source (developed with AMD Research); it supports instrumentation and sampling on CPU and GPU, and even "causal profiling" to estimate how speeding up one function would affect overall runtime. On the Intel GPU side (the Xe GPUs and integrated graphics), Intel's VTune (now part of oneAPI) can profile those, and Intel also offers Graphics Performance Analyzers (GPA) for graphics workloads. oneAPI also provides a common interface for GPU counters.
GPU profiling has to consider kernel-level parallelism and different metrics (warps, shared memory use, etc.). Tools often split into development profilers (lots of detail, high overhead, used on small runs) and monitoring tools (low overhead, continuous). For example, NVIDIA's DCGM (Data Center GPU Manager) provides high-level GPU metrics continuously (utilization, memory usage) with negligible overhead – good for monitoring a cluster – whereas a detailed single-kernel analysis calls for Nsight Compute, which may slow that kernel drastically. Vendor interfaces such as AMD's GPUPerfAPI and NVIDIA's CUPTI allow third-party tools to access counters; HPC tools like HPCToolkit and TAU use them to integrate GPU data.
A challenge for GPU tracing is correlating it with the CPU; Nsight Systems and HPCToolkit handle this by synchronizing clocks and merging timelines. Another challenge is volume: tracing every GPU kernel launch in a long HPC run could produce millions of events, so tools have to allow filtering or summarizing (e.g., grouping by kernel name). GPU memory behavior (host–device transfers) is also crucial; tools like Nsight show those transfers, and even interconnect (NVLink) utilization, on the timeline.
GPUs also have graphics-oriented profilers – e.g., AMD's Radeon GPU Profiler (RGP) for graphics pipelines, focusing on frame rendering with APIs such as Vulkan and DirectX. Those are specialized to game developers' needs. Similarly, NVIDIA Nsight Graphics profiles frame rendering in games. While not the focus of this survey, they underscore that "profiling" a GPU can mean different things in different domains (graphics vs. compute).
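To make the development-vs-monitoring split concrete, the sketch below polls coarse GPU utilization and memory use in the spirit of DCGM-style continuous monitoring. It assumes the nvidia-ml-py (pynvml) bindings and an NVIDIA driver are present; it is an illustrative sketch, not a substitute for DCGM or Nsight.

```python
# Minimal sketch of low-overhead GPU monitoring via NVML, in the spirit of
# DCGM-style continuous metrics (assumes the nvidia-ml-py "pynvml" package
# and an NVIDIA driver are installed).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU

try:
    for _ in range(10):                         # poll at ~1 Hz for 10 s
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"gpu={util.gpu}% mem_used={mem.used / 2**20:.0f} MiB")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```

Polling at this granularity costs almost nothing, which is exactly why such counters can be scraped continuously across a cluster, while replay-based kernel profilers stay reserved for targeted tuning runs.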
-
TPUs (Tensor Processing Units) and AI Accelerators: Google’s TPUs are specialized for neural network workloads. They are often accessed via high-level frameworks (TensorFlow, JAX) which provide integrated profiling. For example, TensorFlow Profiler can profile model training on TPUs by instrumenting the TensorFlow runtime. It collects stats like how busy each TPU core is, memory use, and even traces of operations. These show up in TensorBoard (Google’s visualization tool) along with Python timelines. Google’s Cloud TPU tools allow capturing profiles from a TPU pod over a period (a few seconds) and visualizing the step time breakdown (computation vs data transfer, etc.). Because TPUs are not user-programmable outside the ML framework, the profiling is tightly integrated with the framework. Google published some info on their TPU profiling in the past – essentially, the compiler inserts instrumentation counters around ops, and the hardware has some tracing for their interconnect (MXU utilization, etc.). Outside Google, OpenXLA (open source compilation for accelerators) has a project called Xprof, aiming to provide profiling for XLA-supported devices (like TPUs via JAX, or other accelerators). For other AI chips (Graphcore IPUs, Amazon Inferentia, etc.), each has its own tool: e.g., Graphcore has PopVision, which gives detailed timelines of program execution on their IPUs. These often look like a mix of hardware counter dumps and timeline visualization. The field is young, and tools often lag the hardware. One big challenge is observability in multi-tenant AI accelerators – if multiple users share a big accelerator, profiling must be sandboxed and not reveal others’ work, which is an area of active development (security concerns).
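As an illustration of how framework-integrated profiling is driven from user code, the sketch below wraps a toy training loop with the TensorFlow 2.x profiler API (tf.profiler.experimental). The API names follow TensorFlow's documented interface; the log directory and the trivial train_step are placeholders, and on a TPU the same calls would capture device traces for TensorBoard's Profile tab.

```python
# Sketch of framework-level profiling around a training loop, assuming the
# TensorFlow 2.x profiler API (tf.profiler.experimental). The resulting
# trace is inspected in TensorBoard's Profile tab.
import tensorflow as tf

logdir = "/tmp/tb_profile"              # hypothetical output directory

@tf.function
def train_step():
    # Stand-in for a real training step: a small matmul.
    x = tf.random.normal([256, 256])
    return tf.matmul(x, x)

tf.profiler.experimental.start(logdir)
for step in range(100):
    # Annotate each step so the profiler can compute a step-time breakdown.
    with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
        train_step()
tf.profiler.experimental.stop()
```

The key point is that the user never touches the accelerator's counters directly: the framework and its compiler insert the instrumentation and export the trace.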
-
DSPs and FPGAs: We touched on DSPs earlier in the embedded context, but as accelerators they show up in phones (audio DSPs, sensor hubs) and sometimes in servers (SmartNICs with DSP cores). Profiling them typically relies on vendor tools – e.g., Qualcomm's audio DSP can be profiled with QMIC (where available), or simply by measuring that DSP's load via logs. FPGAs used for computation (as in Microsoft's datacenters for Bing) do not have "profiles" in the same sense; you measure their throughput or add performance counters in the logic. With high-level synthesis and FPGAs used as CPU offload engines (e.g., Xilinx Alveo cards), tools such as Xilinx's Vitis Analyzer can show kernel execution times on the FPGA and data transfer times (akin to GPU profilers). This is not standardized, because FPGA designs are custom by nature: you instrument your design with whatever counters you need.
-
Data Processing Units (DPUs) and SmartNICs: These are a class of accelerators combining CPUs and fixed-function logic for networking/storage (e.g., the NVIDIA BlueField DPU). Profiling/tracing them involves both standard CPU profiling (they typically run embedded ARM cores on which Linux perf works) and tracing packet flows. NVIDIA's Nsight Systems has early support for BlueField (treating it as another system to trace, possibly with NVTX annotations). For now, DPU observability is mostly an ad-hoc mix of existing tools (ethtool for the networking side, perf for the on-DPU CPUs), but expect new tools focused on profiling network functions.
Summary of challenges by architecture: On CPUs (x86/ARM/RISC-V), the main differences are in the available hardware events and the maturity of tools: x86 leads, ARM is close behind, RISC-V is upcoming. On GPUs, each vendor's closed ecosystem means cross-platform tools are limited (though HPCToolkit and some academic tools try to support them all via vendor APIs). On heterogeneous systems, correlating across CPU and GPU (or other devices) is a big challenge, requiring aligned timestamps and combined UIs, as discussed. Another issue is the calibration of metrics across architectures: a CPU's "utilization" is not the same as a GPU's utilization (a GPU can have thousands of threads "idle" waiting for data while showing 100% busy on a couple of compute units). Understanding performance therefore requires architecture-specific knowledge, which is why vendor tools still dominate the GPU/TPU space with domain-specific metrics (such as "SM occupancy" for NVIDIA or "MAC utilization" for a TPU).
The trend is toward unifying these where possible: one example is the integration of CUPTI (NVIDIA's GPU trace API) data into standard timelines (Nsight, HPCToolkit), and OpenTelemetry is considering support for GPU spans/events in traces. Likewise, there are efforts to bring eBPF to Windows and other operating systems, which could allow one common approach to instrumenting kernels across architectures.
In architectural terms, performance analysis is always a two-layer cake – part generic (OS and language tools that work similarly on all CPUs), part specific (taking advantage of hardware counters, special trace IP, etc.). The combination of the two yields the best results; a small sketch of the generic layer follows.
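A minimal sketch of the generic layer, assuming a Linux machine with perf installed and permission to open counters: a small Python wrapper shells out to perf stat and lets the kernel map generic event names onto whatever PMU the architecture provides. The CSV field layout parsed here matches recent perf versions but may vary.

```python
# Hedged sketch: drive the generic `perf stat` interface from Python and let
# the kernel map generic event names (cycles, instructions) onto the
# underlying PMU, whether x86, ARM, or RISC-V. Assumes Linux with perf
# installed and permission to open counters.
import subprocess

def perf_stat(cmd, events=("cycles", "instructions")):
    """Run `cmd` under perf stat and return {event: raw count string}."""
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + cmd,
        capture_output=True, text=True,
    )
    counts = {}
    # perf stat writes CSV rows to stderr; in recent versions the first
    # field is the count and the third is the event name.
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[2] in events:
            counts[fields[2]] = fields[0]
    return counts

print(perf_stat(["sleep", "1"]))
```

The architecture-specific layer then comes in through extra event names (PEBS-precise events, CoreSight trace, vendor counters) that only make sense on particular hardware.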
Research Challenges and Design Issues in Profiling/Tracing
Despite the rich set of tools available, several ongoing challenges drive research and development in profiling/tracing:
-
Overhead vs. Insight Trade-off: The fundamental challenge is to collect detailed performance data while minimizing overhead on the target system. Instrumentation can perturb timing or slow execution (the classic "observer effect"). Research therefore explores ways to reduce overhead. Sampling is the primary method (as discussed, sampling at 100–1000 Hz has negligible impact while still yielding statistically sound profiles). Another approach is specialized hardware: performance counters give counts with essentially no overhead, and hardware trace (Intel PT, ARM ETM) produces execution logs offloaded from the CPU – though using these raw hardware features requires interpreting large volumes of data afterwards. Selective tracing is another technique: record only a subset of events (e.g., every 100th occurrence, or only events longer than some threshold); tools like JFR in Java allow setting such thresholds (e.g., record only locks held longer than 10 ms) to avoid being flooded with data. There is also research on coarse-grained instrumentation that refines itself – instrument functions at a high level, find a culprit, then dynamically instrument only inside that function. A related area is compiler-assisted profiling: compilers can insert low-overhead counters or consume profile-guided optimization feedback. For dynamic profiling, ideas such as instrumentation sampling (turning instrumentation on only for short random periods during execution) have been studied to obtain detailed traces at low average overhead. The goal is to approach "always-on" profiling: systems like Linux perf and eBPF aim to be safe for production by keeping overhead low (a few percent at most) so that profiling can run continuously (Facebook has done this on production machines at roughly 1% CPU cost, which is acceptable for the value of the data). In HPC, low overhead means a 10,000-core job can be profiled without altering its scaling behavior – tools like HPCToolkit pride themselves on being "low-overhead, non-invasive" to jobs. Nevertheless, capturing extremely high-frequency events (such as every cache miss) is impossible without slowing the system, so smart filtering, sampling, and offloading (to hardware or to another core) remain active research areas. One interesting research concept is edge profiling, in which instrumentation counts how often each control-flow edge (branch) executes (used in compiler optimizations); its overhead can be reduced by statistically sampling edges rather than counting all of them, an idea from research on efficient profiling. The sampling idea is sketched below.
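A minimal sketch of the sampling idea, using only the Python standard library: a background thread walks every thread's stack about 100 times per second and aggregates hit counts, so the overhead is set by the sampling rate rather than by how hot the profiled code is.

```python
# Minimal sketch of a sampling profiler: a background thread inspects every
# thread's current stack frame ~100 times per second and aggregates hit
# counts, trading exactness for low, tunable overhead.
import collections
import sys
import threading
import time

samples = collections.Counter()
stop = threading.Event()

def sampler(hz=100):
    period = 1.0 / hz
    while not stop.is_set():
        for frame in sys._current_frames().values():
            code = frame.f_code
            samples[(code.co_filename, code.co_name)] += 1
        time.sleep(period)

def busy():
    # Workload to profile: spin for ~2 seconds.
    t0 = time.time()
    while time.time() - t0 < 2.0:
        sum(i * i for i in range(1000))

threading.Thread(target=sampler, daemon=True).start()
busy()
stop.set()

for (filename, func), hits in samples.most_common(5):
    print(f"{hits:6d}  {func}  ({filename})")
```

Production samplers (perf, py-spy, continuous profilers) follow the same principle but sample from outside the process or from the kernel, so the target is perturbed even less.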
-
Real-time and Online Profiling: Real-time systems need profiling that does not violate deadlines. This is tricky – adding instrumentation might change the scheduling enough to invalidate the real-time behavior being measured. Solutions include using spare cycles (profiling only when the CPU is idle), hardware trace (since it does not significantly perturb execution timing), or extremely lightweight logging. There is also the concept of online analysis: processing trace data on the fly to avoid storing huge logs and possibly to detect anomalies in real time. For example, an online analysis might raise an alert when execution strays from a typical profile, or when certain events (such as deadline misses) occur, without logging every event to disk. Academia has explored streaming trace processing, where a stream of events is analyzed on the fly by state machines or stream databases to identify patterns. The data-volume challenge is acute here – a continuous trace of a busy system can reach gigabytes per second; you cannot save all of it, so you either summarize as you go or use ring buffers to keep only recent history. This is akin to how JFR's circular buffer works (keeping recent data in memory and saving it only upon a trigger); the pattern is sketched below. Some systems employ cooperative profiling – for example, hypervisors can sample hardware performance counters across many VMs, making profiling "always on" at the virtualization layer with minimal overhead (Google-Wide Profiling popularized this kind of fleet-wide, always-on sampling). Real-time constraints also mean the profiler should ideally run on a separate core or have bounded execution time. eBPF helps by allowing certain analyses to run in the kernel, avoiding excessive context switches and data movement. But verifying that the profiling tasks themselves meet deadlines is tricky – you do not want the act of profiling to cause a missed deadline. Real-time systems therefore typically rely on hardware trace or very minimal periodic sampling (which can be accounted for in worst-case execution time analysis if needed).
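A minimal sketch of the ring-buffer-plus-trigger pattern, using only the standard library: events go into a bounded in-memory buffer and are persisted only when an anomaly (here an arbitrary latency threshold) fires, so steady-state disk and network cost stays near zero.

```python
# Sketch of the ring-buffer-plus-trigger pattern: keep only the most recent
# events in a bounded in-memory buffer and persist them only when an anomaly
# (here, a latency threshold) fires.
import collections
import json
import random
import time

ring = collections.deque(maxlen=10_000)    # recent events only
LATENCY_BUDGET_MS = 50.0                   # hypothetical deadline

def record(event, **fields):
    ring.append({"ts": time.time(), "event": event, **fields})

def dump(path="/tmp/trace_snapshot.json"):
    # Persist the recent history that led up to the anomaly.
    with open(path, "w") as f:
        json.dump(list(ring), f)

for i in range(100_000):
    latency_ms = random.expovariate(1 / 5.0)   # simulated request latency
    record("request", id=i, latency_ms=latency_ms)
    if latency_ms > LATENCY_BUDGET_MS:         # trigger: dump the buffer
        dump()
        break
```

The same shape appears in JFR's circular buffer and in flight-recorder modes of kernel tracers: continuous capture, bounded memory, dump on demand.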
-
Observability in Production Environments: Deploying tracing/profiling in production raises concerns of safety, performance, and security. Production systems often cannot afford downtime or crashes, so tools must be robust. eBPF was a response to this – by running in a sandbox in the kernel, it "protects the user from panicking the kernel or getting it stuck", unlike older kernel modules. Similarly, DTrace was designed to be safe to enable on live systems (with runtime checks to prevent bad probes). Another challenge is enabling profiling on demand – production issues occur unexpectedly, so one needs either the ability to retrieve recent history after the fact (flight-recorder style) or always-on lightweight monitors. Continuous profilers (Google, Datadog, etc.) address this by always collecting at low sampling rates. For tracing, logging every request is too costly, so systems instead sample distributed traces, ideally raising the sampling rate only during anomalies. Production observability also intersects with privacy and security: traces can contain sensitive data (user IDs, payloads), so tools should allow scrubbing or simply not collecting certain information; OpenTelemetry, for example, encourages not putting sensitive information in trace attributes unless necessary. eBPF is constrained by design – programs can read memory only through verifier-checked helpers, which limits what can be inadvertently leaked from the kernel. Another production concern is multi-tenancy: especially in cloud or container environments, one container's profiling should neither disturb others nor leak their data. Kubernetes and similar orchestrators want per-container observability, and technologies like eBPF are evolving to tag events by container ID and expose them only to authorized users. There is also cost: profiling at scale generates a lot of data to store or transmit. Companies weigh the cost of always profiling against the benefit, which drives development of efficient encodings (binary traces, compression) and adaptive sampling (profile less when the system is stable, more when an issue arises). The vision is to treat observability data like any other telemetry: manage it, budget it (perhaps a 1% overhead ceiling), and secure it. We see this in the rise of baked-in monitoring in infrastructure (e.g., cloud VMs often ship with an agent that does basic profiling by default). One research direction is auto-tuning observability: if a service's latency spikes, automatically increase the trace sampling rate for that service to gather more detail, then reduce it when things return to normal – an adjustment sketched below. This dynamic adaptation is still an open area.
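A hedged sketch of that adaptive policy: sample a small fraction of requests while the service looks healthy and a larger fraction when a latency percentile drifts above a target. The window size, percentile, and rates below are illustrative, not a recommended production configuration.

```python
# Hedged sketch of adaptive trace sampling: sample few requests while the
# service is healthy, more when latency drifts above a target. Thresholds
# and rates are illustrative only.
import random
import statistics

class AdaptiveSampler:
    def __init__(self, base_rate=0.01, burst_rate=0.5, target_ms=100.0):
        self.base_rate = base_rate      # healthy-state sampling probability
        self.burst_rate = burst_rate    # probability while degraded
        self.target_ms = target_ms
        self.recent = []                # sliding window of latencies

    def observe(self, latency_ms):
        self.recent.append(latency_ms)
        if len(self.recent) > 1000:
            self.recent.pop(0)

    def should_sample(self):
        if len(self.recent) >= 100:
            p95 = statistics.quantiles(self.recent, n=20)[-1]
            if p95 > self.target_ms:    # degraded: trace aggressively
                return random.random() < self.burst_rate
        return random.random() < self.base_rate

sampler = AdaptiveSampler()
for latency in (12.0, 18.0, 250.0):
    sampler.observe(latency)
print(sampler.should_sample())
```

A real deployment would feed this decision into the tracing library's sampler hook and budget the extra volume centrally.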
-
Correlation of Multi-Layer and Multi-Component Events: Modern systems span many layers – an application function call might trigger a kernel syscall, which leads to network packets, which are handled by another machine's kernel, then a database query, and so on. To truly diagnose performance, one needs to correlate events across these layers. This is non-trivial because each layer has its own timestamps, event formats, and contexts. Distributed tracing solves correlation across services by propagating trace IDs, but within a single host, correlating user-level profiles with kernel events can be tricky. Tools like LTTng address this by allowing combined kernel and user traces – if an application uses user-space tracepoints (LTTng-UST, or USDT probes), those events share a timestamp base with kernel tracepoints, so one timeline can show that "user function X was called, then a page fault occurred in the kernel", and so on. Trace Compass can take an LTTng kernel trace and a user trace and align them, showing, for example, which process was running on which core at each moment and whether it was in user or kernel mode. eBPF can also help here: you can attach one probe to a user-level function (via a uprobe) and another to a kernel event, and because both run in kernel context they can be timestamped from the same clock and their data can even be joined (e.g., recording that "when this user function is called, the current kernel stack is XYZ"); a minimal sketch appears at the end of this item. Collating these multi-layer events generally requires a common timeline (synchronized clocks, or relative timestamps from a common source such as the CPU TSC). In distributed systems, correlation means linking traces to metrics and profiles, as discussed – connecting the dots from a high-level symptom to a low-level cause. For instance, a microservice trace might show that Service A is slow because it is waiting on Service B; profiling Service B's CPU might show it is 100% busy, and a kernel trace might reveal heavy context switching due to thread thrashing – multiple layers, each giving a piece of the answer. Automating this correlation is a challenge: some research applies ML to traces and profiles together to suggest causes (e.g., "CPU saturated, likely cause of latency" or "many minor page faults correlate with slow requests").
In HPC, correlating multi-layer events might mean linking an MPI message delay to low-level network contention or OS jitter; combinations of OS-noise profilers and application traces have been studied for this. In the cloud, correlation might mean linking a spike in CPU usage (from a profile) to a deployment that just happened (from logs) – crossing fully into the realm of "observability". This is where observability platforms try to unify logs, metrics, traces, and profiles, so users can hop between them via shared identifiers (trace IDs, span IDs, etc.). Grafana's approach of linking traces and profiles is one example: when a trace shows which service instance handled a request and when, a continuous profiler can retrieve that instance's profile for the same time window to show what it was doing. We can expect more of this cross-linking in future tools.
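A minimal sketch of the eBPF approach mentioned above, using the BCC Python bindings (it assumes a Linux system with BCC installed and root privileges): a uprobe on libc's malloc and a syscall tracepoint both record timestamps from bpf_ktime_get_ns(), so user-level and kernel-level events share one clock and can be laid on a single timeline.

```python
# Hedged sketch using the BCC Python bindings: timestamp a user-level event
# (malloc in libc, via a uprobe) and a kernel event (the openat syscall, via
# a tracepoint) with the same monotonic clock. Requires root and BCC.
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

TRACEPOINT_PROBE(syscalls, sys_enter_openat) {
    bpf_trace_printk("kernel openat ts=%llu\n", bpf_ktime_get_ns());
    return 0;
}

int trace_malloc(struct pt_regs *ctx, size_t size) {
    bpf_trace_printk("user malloc ts=%llu size=%lu\n",
                     bpf_ktime_get_ns(), size);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_uprobe(name="c", sym="malloc", fn_name="trace_malloc")
print("Tracing user and kernel events on one clock... Ctrl-C to stop")
b.trace_print()
```

In a real tool the events would go to a perf buffer rather than trace_printk, and the shared timestamps would be merged into one viewer.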
-
Security and Privacy in Tracing: Profiling/tracing tools operating at the kernel level have to be careful not to open vulnerabilities. eBPF has a verifier to ensure that eBPF programs cannot crash or hang the kernel and cannot arbitrarily read or write kernel memory. DTrace likewise had safety features (such as allowing only root or suitably privileged users to use certain probes, and disallowing combinations that could be dangerous). However, enabling broad tracing can inadvertently capture sensitive information (e.g., logging all system calls might record file names that contain private data). In production use there are therefore guardrails: frameworks allow filtering out PII, and only certain trusted users can initiate a trace. There is also a risk that malware could use high-resolution timers or perf events to infer information (side-channel attacks via performance counters have been demonstrated). As a result, some secure environments disable or restrict performance counters and tracing; on shared cloud VMs, for instance, unprivileged perf usage may be disabled to prevent timing attacks. Containerization is adding features to namespace performance events (so one container cannot see another's events). Data volume is another security aspect: large traces may be too costly to store long term, so one must choose what to keep (perhaps aggregated data, which loses detail but is also less sensitive). Finally, consider intellectual property: tracing GPU drivers might reveal proprietary GPU microcode or behavior, so vendors sometimes lock down certain trace capabilities. A generic sketch of attribute scrubbing before export follows.
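A framework-agnostic sketch of scrubbing trace attributes before they leave the process; the key patterns and hashing policy are illustrative only, and a real deployment would hook this into the exporter of whichever tracing library is in use.

```python
# Framework-agnostic sketch of scrubbing span attributes before export:
# drop or hash anything that looks like PII so the trace backend never sees
# the raw values. The key patterns are illustrative only.
import hashlib
import re

SENSITIVE_KEYS = re.compile(r"(user[_.]?id|email|token|password|ssn)", re.I)

def scrub(attributes):
    clean = {}
    for key, value in attributes.items():
        if SENSITIVE_KEYS.search(key):
            # Keep correlatability without exposing the raw value.
            clean[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            clean[key] = value
    return clean

print(scrub({"http.route": "/orders", "user.email": "alice@example.com"}))
```

Hashing rather than dropping keeps the attribute usable for grouping requests by user without storing the identity itself.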
-
Instrumenting Heterogeneous and Black-Box Components: Another implementation challenge is profiling components that are not easily instrumentable – e.g., third-party binaries, or hardware blocks without interfaces. Dynamic binary instrumentation can help with closed-source binaries (at the cost of overhead). For hardware blocks (such as an accelerator ASIC), we rely on whatever counters or telemetry they expose. There is research into generic interfaces for accelerators: for example, proposals for a standard "performance counter" interface that any accelerator attached to a system could expose, so that system-level profilers (Linux perf, OpenTelemetry collectors, etc.) could scrape them. Without such standards, each new piece of hardware (TPU, DPU, etc.) needs custom tooling, which slows down observability. Industry and research are pushing for more openness in metrics – e.g., RISC-V being open might allow a more standardized profiling interface across vendors. Similarly, on the software side, observability by construction is a theme: new frameworks are built with tracing hooks from the start.
-
Debugging vs. Profiling vs. Tracing Convergence: Traditionally, debugging (for functional issues) and profiling (for performance) were separate activities. With the rise of observability, the lines blur. Slow performance might be caused by a functional bug (such as an infinite retry loop), so traces can help debug logic as well. Conversely, debug traces (such as log files) inadvertently profile an application (timestamps in logs give some sense of performance). There is interest in unifying these – e.g., using the same instrumentation to serve both debugging and profiling needs depending on sampling rate or mode. For instance, a single instrumentation point could log detailed information at development time (for debugging) but only count events in production (for profiling), as sketched below. This area is not fully resolved, but it is conceptually attractive: it reduces redundant instrumentation.
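A small sketch of that dual-mode idea: one decorator serves as the single instrumentation point, emitting verbose logs in a hypothetical "dev" mode and only incrementing counters in "prod" mode. The mode switch and names are illustrative.

```python
# Sketch of one instrumentation point serving two modes: verbose logging for
# debugging in development, cheap counters for profiling in production. The
# mode switch (an environment variable here) is illustrative.
import collections
import functools
import logging
import os
import time

MODE = os.environ.get("OBS_MODE", "prod")        # "dev" or "prod"
counters = collections.Counter()
log = logging.getLogger("obs")

def observed(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if MODE == "dev":
            # Debugging mode: full detail per call.
            log.debug("%s args=%r took %.6fs", fn.__name__, args, elapsed)
        else:
            # Profiling mode: aggregate only, negligible cost.
            counters[fn.__name__] += 1
        return result
    return wrapper

@observed
def handle_request(n):
    return sum(range(n))

handle_request(10_000)
print(counters)
```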
In summary, research is actively addressing how to make profiling/tracing lower overhead, more automated, more holistic, and safer. The ideal future scenario is that all layers of the system can be observed with negligible performance impact, and all that data can be seamlessly combined to diagnose any issue. We’re not there yet, but incremental improvements (like eBPF, OpenTelemetry, continuous profilers) are paving the way.
Future Directions and Emerging Trends
Looking ahead, both industry and academia are working on new ideas and improvements for profiling and tracing. Some key future directions include:
-
Unified Observability and Tool Consolidation: There is a clear trend toward unifying metrics, traces, logs, and profiles into a coherent observability platform. The tools of the future may not be as siloed (one for tracing, one for profiling, etc.) but rather integrated. OpenTelemetry already moves in this direction by covering metrics and traces under one standard, and ongoing work may incorporate profiling as an observability signal as well (there are discussions in the CNCF about adding "continuous profiling" as a pillar next to MELT: Metrics, Events, Logs, Traces). Products like Datadog and Dynatrace already offer combined tracing and profiling in one UI – you can capture a trace and then see which functions consumed CPU within that trace. Expect open-source stacks to catch up: Grafana's acquisition of Pyroscope and k6, for example, indicates an aim to integrate continuous profiling and load testing with tracing/metrics. The main obstacles to unification are data handling (different volumes and formats) and UIs (how to present, say, a flame graph alongside a trace timeline). But the work is active, and in a few years it may be common to run a single agent per server that collects all these forms of telemetry in a balanced way.
-
Cross-Platform and Cross-Architecture Portability: Efforts are under way to make powerful tracing capabilities available on all platforms. For example, eBPF for Windows is a project to bring eBPF-like functionality to Windows (with contributions from Microsoft Research and others to an eBPF subsystem on Windows). This could allow similar observability tools (bpftrace equivalents, for instance) to run on Windows, bringing parity with Linux. eBPF is also appearing on other operating systems such as FreeBSD, and there is talk of running eBPF in user space for certain uses. Similarly, Apple's systems currently rely on DTrace and Instruments – perhaps eBPF, or some form of it, could eventually unify tracing across Unix-like systems. On the architecture side, ensuring that new architectures (RISC-V, etc.) reach the same level of tool support is important – open-source projects are actively porting and testing on RISC-V, and the RISC-V community is likely to develop some RISC-V-specific performance tooling for its ecosystem (we may yet see a RISC-V International project on profiling).
-
Enhanced Hardware Support: Future CPUs and accelerators are expected to include more on-chip instrumentation for performance. Processors might embed always-on, low-bandwidth performance telemetry that can be queried live (some Intel Xeon servers already have Intel RDT, which gives real-time cache and memory bandwidth usage per core). RISC-V could innovate here by defining standard performance counters that measure things like energy consumption per context or detailed memory access patterns, accessible in a uniform way. Hardware could also assist tracing: there is research on using branch trace buffers and compressing them with machine learning to record execution paths with less data. Another area is power and thermal profiling – as energy efficiency becomes critical, tools focused on profiling the energy usage of code are likely to grow (some research uses performance counters to estimate energy, or external power meters correlated with code execution). Future chips might expose more internal sensors (per-core energy usage is already accessible on AMD EPYC CPUs, for example).
-
Profiling in Heterogeneous and Multi-layer Systems: The growth of heterogeneous computing (CPU + GPU + FPGA + DPU in one system) calls for profiling tools that can handle multiple types of processors in one timeline. The survey reference from eunomia.dev points out the need for tracing across CPUs, GPUs, DPUs, and even integrated combinations (APUs). We can expect frameworks that coordinate data from different vendor tools. One path is through open standards: if all vendors support output to a common trace format (such as an extension of OpenTelemetry or the Chrome Trace Event Format, illustrated below), a single viewer could show an entire timeline. There is progress – OpenCL, oneAPI Level Zero, and similar APIs have made it easier for tools to hook multiple devices – but a lot of work remains to seamlessly trace, say, an application that uses an NVIDIA GPU and then a Xilinx FPGA; today you would run separate profilers and try to merge the results manually. Future research may develop mediator layers or abstraction APIs that tool developers can use to trace any kind of coprocessor in a uniform way.
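As a small illustration of the common-format idea, the sketch below writes complete-duration events in the Chrome Trace Event Format (JSON), which viewers such as Perfetto UI or chrome://tracing can load. The events are fabricated; a real merger would convert each vendor tool's output into rows like these.

```python
# Hedged sketch: emit complete-duration ("X") events in the Chrome Trace
# Event Format so CPU and accelerator activity from different sources can be
# merged onto one timeline in a viewer such as Perfetto UI. Timestamps and
# durations are in microseconds; the events below are fabricated.
import json

events = [
    {"name": "preprocess",  "ph": "X", "ts": 0,    "dur": 1200,
     "pid": 1, "tid": 1, "args": {"device": "cpu0"}},
    {"name": "gemm_kernel", "ph": "X", "ts": 1300, "dur": 4500,
     "pid": 1, "tid": 2, "args": {"device": "gpu0"}},
    {"name": "fft_kernel",  "ph": "X", "ts": 6000, "dur": 2100,
     "pid": 1, "tid": 3, "args": {"device": "fpga0"}},
]

with open("combined_trace.json", "w") as f:
    json.dump({"traceEvents": events}, f)
```

The hard part in practice is not the output format but aligning the different devices' clocks so that "ts" values are actually comparable.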
-
Intelligent and Automated Analysis: As the volume of profiling/tracing data grows (in HPC, in microservices), there is a need for automated analysis and even AI-driven insight. We are starting to see anomaly detection in traces (e.g., identifying that a certain trace's pattern differs significantly and might indicate a bug). ML could be applied to profiles to detect performance regressions – some tools already do a simple statistical diff of profiles between software versions, as sketched below. There is also the idea of "performance copilot" tools that watch an application's profile as it runs and suggest optimizations (e.g., "function X is hot and could benefit from memoization" or "this loop has many cache misses; the access pattern may be poor"). While compilers do static optimizations, a runtime assistant could do dynamic ones. A notable research project from a few years ago is Coz, the causal profiler, which experiments by virtually slowing down parts of a program to see how overall throughput is affected, thereby finding which parts would yield the most benefit if optimized. This "what-if" analysis is a different take on profiling (it finds optimization opportunities rather than just where time is spent) and is being incorporated conceptually into some industry tools (AMD's Omnitrace claims to include causal profiling). We may see more of this: profilers that not only measure but also guide optimization by simulating improvements.
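A sketch of the simple statistical diff mentioned above: given two {function: sample count} profiles from two builds, flag the functions whose share of total samples grew the most. The input data is fabricated and the threshold is arbitrary.

```python
# Sketch of a simple profile diff: compare each function's share of samples
# between two builds and flag the largest regressions. Input profiles are
# plain {function: sample_count} maps; the data below is fabricated.
def profile_diff(before, after, threshold=0.02):
    total_b = sum(before.values()) or 1
    total_a = sum(after.values()) or 1
    regressions = []
    for fn in set(before) | set(after):
        share_b = before.get(fn, 0) / total_b
        share_a = after.get(fn, 0) / total_a
        if share_a - share_b > threshold:       # grew by more than 2 points
            regressions.append((fn, share_b, share_a))
    return sorted(regressions, key=lambda r: r[2] - r[1], reverse=True)

before = {"parse": 300, "render": 500, "gc": 200}
after = {"parse": 900, "render": 520, "gc": 210}
for fn, sb, sa in profile_diff(before, after):
    print(f"{fn}: {sb:.1%} -> {sa:.1%}")
```

Real regression detectors add statistical significance tests and symbol normalization, but the core comparison is this simple.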
-
Catering to Security and Privacy Needs: Given the increasing focus on data privacy, future tracing may incorporate data tagging and scrubbing. For instance, a trace tool might let you mark certain memory ranges or variables as sensitive, and it would then avoid recording them or record only a hash. On the security side, profiling tools can double as anomaly detectors (profiling system calls can reveal an intrusion when a process deviates from its normal profile). This dual use means profiling frameworks may integrate with security monitoring – eBPF tracing programs are already used to detect suspicious behavior by analyzing patterns of system calls, essentially profiling program behavior for security. The line between performance tracing and security tracing may therefore blur, with unified frameworks doing both (trace events can trigger security alerts and vice versa). The rise of eBPF itself is an example of a single mechanism being used for performance, networking, and security monitoring in the kernel.
-
Developer Experience and Democratization: Historically, using these tools (especially the low-level ones) required expertise. There is a push to make performance analysis more accessible. This includes better visualizations (flame graphs were a huge step in simplifying profile interpretation and are now common), and we may get new visualization paradigms such as performance maps or integrated IDE hints. It also includes simplified interfaces: instead of writing a bpftrace script, a future UI might let a developer click on a running program's function in an IDE, say "profile this for the next 10 seconds", and get results without leaving the development environment. For distributed traces, it might mean automatically instrumenting code with no user effort (some frameworks already auto-instrument a great deal, but more could be done, especially for custom in-house code). The goal is to bring these powerful tools into everyday use for all developers, not just performance specialists. Educational efforts (such as Brendan Gregg's books and community sites) are part of this, but tool creators are also trying to reduce complexity (OpenTelemetry standardizes things so developers do not have to piece together multiple different tracing APIs).
-
Profiling in New Domains: As computing moves to new domains (edge computing, IoT, AR/VR, etc.), profiling/tracing will follow. Edge and IoT devices have constraints similar to embedded systems; expect lightweight, distributed profiling where data is aggregated from many devices to a central server (with care for bandwidth). AR/VR and other real-time interactive systems may require profiling against time budgets – ensuring each frame completes within, say, 16 ms – so profilers may focus on worst-case times rather than averages. And with growing parallelism (many-core processors), profiling tools must handle hundreds of threads and expose concurrency issues; timeline views already serve that purpose, but as core counts grow (think 128-core CPUs), tools may need to summarize or cluster threads to remain understandable.
-
Open-Source and Community Innovation: We see a recurring pattern in which academic research prototypes ideas (causal profiling, sampling techniques), open-source tools integrate them, and industry eventually adopts them in products. With the healthy open-source ecosystem around eBPF, OpenTelemetry, and related projects, innovation is accelerating. Projects like Parca (continuous profiling), Pyroscope, ebb (efficient basic block profiling), and others spring up and either get folded into bigger projects or inspire features in them. The future may bring an "open profiling standard" analogous to OpenTelemetry, standardizing the profile data format and its collection (there is already the pprof format used by Go and adopted by others, including Pyroscope, which could serve as a basis). Standardization would allow profiles from different languages or sources to be combined or compared easily.
-
Performance and Energy as a First-Class Metric: With the industry's focus on energy efficiency and carbon footprint, profiling tools may incorporate energy modeling – e.g., showing the joules consumed by parts of the code. Some research tools exist (Intel's SoCWatch, ARM's energy probes), but integration into mainstream development workflows is limited. Future compilers and profilers could let developers optimize for energy, not just time. This might involve tracing power states or DVFS (frequency scaling) events and correlating them with code. For battery-powered devices, an "energy profiler" is extremely valuable (e.g., Android's Batterystats and the Trepn profiler show which processes drain the battery – essentially a profile in terms of energy). Expect more fine-grained versions of these; a crude sketch of region-level energy measurement on Linux follows.
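A crude sketch of region-level energy measurement on Linux via the powercap/RAPL sysfs interface: read the cumulative package-energy counter before and after a code region. The path is Intel-specific, typically needs elevated privileges, and the counter wraps, so the result is a rough estimate only.

```python
# Hedged sketch of coarse energy attribution on Linux via the powercap/RAPL
# interface: read the cumulative energy counter before and after a region.
# The sysfs path is Intel-specific, may require root, and the counter wraps,
# so treat this as a rough estimate.
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"   # package 0 energy, uJ

def read_uj():
    with open(RAPL) as f:
        return int(f.read())

def busy(seconds=1.0):
    # Workload whose energy we want to estimate.
    end = time.time() + seconds
    while time.time() < end:
        sum(i * i for i in range(10_000))

before = read_uj()
busy()
after = read_uj()
print(f"~{(after - before) / 1e6:.3f} J consumed by the region")
```

Tools like SoCWatch or per-core energy sensors refine the same idea with finer granularity and attribution to specific code paths.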
In summary, the future of profiling and tracing is more integration, more intelligence, more ubiquity. The ideal outcome is a world where any system, from a tiny IoT node to a massive cloud service, can be observed in detail with negligible overhead and without compromising safety – and all that data can be leveraged by automated tools to optimize and troubleshoot in ways humans alone cannot. We’re steadily moving in that direction, as evidenced by the rapid advancements in tools over the past decade (e.g. the rise of eBPF, the adoption of distributed tracing, continuous profiling in production). Profiling and tracing, once niche expert activities, are becoming mainstream aspects of the software lifecycle, empowering developers and operators to build faster and more reliable systems.
References: The information in this survey is drawn from a range of authoritative sources, including academic papers, tool documentation, and industry experts. For instance, the importance of modern profiling tools for understanding bottlenecks is highlighted by the PRACE HPC whitepaper. Technical specifics on Linux tracers and eBPF were referenced from kernel documentation and analyses, while insights on performance tools like perf and DTrace were supported by expert commentary. Profiling in managed runtimes was exemplified by Oracle’s documentation on Java Flight Recorder’s low overhead design. The discussion on distributed tracing leveraged comparisons between OpenTelemetry and Jaeger. Embedded and RTOS tracing concepts were informed by industry whitepapers on RTOS debugging. HPC tool capabilities were detailed with reference to surveys and tool docs (e.g., HPCToolkit’s GPU tracing via CUPTI). Finally, emerging trends and future directions synthesize viewpoints from recent literature and summits in observability (for example, the push for continuous profiling in observability stacks and the expansion of eBPF beyond Linux). These references, indicated throughout the text, provide further reading for those interested in the specifics of each tool or technique mentioned.
References
- Profiling and Tracing Tools for Performance Analysis of Large Scale ... (PDF)
- How profiling and tracing work together | Grafana documentation
- Choosing a Linux Tracer (2015)
- Feature status on RISC-V architecture — The Linux Kernel documentation
- The rise of eBPF for non-intrusive performance monitoring (PDF)
- What are the main differences between eBPF and LTTng? – Stack Overflow
- More than 90 Profiling Tools for Desktop to Large Supercomputers
- GPU Profiling Under the Hood: An Implementation-Focused Survey of Modern Accelerator Tracing Tools
- GOOGLE-WIDE PROFILING (PDF)
- Finding races and memory errors with LLVM instrumentation (PDF)
- gprofng: The Next Generation GNU Profiling Tool – Oracle Blogs
- Performance analysis software gap assessment
- About Java Flight Recorder
- Spying on Python with py-spy
- benfred/py-spy: Sampling profiler for Python programs – GitHub
- OpenTelemetry and Jaeger | Key Features & Differences [2025] | SigNoz
- What is Observability? Beyond Logs, Metrics, and Traces | StrongDM
- Percepio – Stop Guessing (PDF)
- Measurement tools – HPC Wiki
- The Accelerator Toolkit: A Review of Profiling and Tracing for GPUs and other co-processor
- Profile your model on Cloud TPU VMs
- How to Profile TPU Programs – JAX
- openxla/xprof: A profiling and performance analysis tool – GitHub
- Linux tracing systems & how they fit together – Julia Evans