
Observability, Profiling, and Debugging in Systems Conferences (2015–2025)

Abstract:

This survey reviews over a decade (2015–2025) of research on observability, profiling, and debugging techniques in computer systems, focusing on main-track papers from OSDI, SOSP, and EuroSys. We cover more than 100 papers spanning dynamic tracing frameworks, logging and monitoring infrastructures, performance anomaly detection, root cause analysis, and system visibility mechanisms. We identify core problems addressed (from tracing distributed requests to detecting configuration or concurrency bugs), techniques employed (dynamic instrumentation, static analysis, in situ logging, distributed monitors, ML-assisted analysis), targeted domains (OS kernels, cloud and distributed systems, mobile/IoT systems, etc.), and how these works relate and build upon each other. Trends over time are discussed – e.g., the evolution from ad-hoc tracing in monolithic systems to always-on, low-overhead observability in microservices – as well as emerging integration of machine learning for anomaly detection and root cause analysis. We conclude with open challenges such as scaling observability to highly disaggregated systems, reducing overhead and noise in tracing, automating diagnosis across abstraction layers, and improving the usability of debugging tools in production.

1. Introduction

Modern computer systems generate massive volumes of telemetry (logs, traces, metrics), yet diagnosing problems in complex distributed environments remains notoriously difficult. Observability – the ability to understand internal system state via external outputs – is a critical property for reliability and performance. Between 2015 and 2025, the systems research community produced a wealth of new approaches to observability, profiling, and debugging. This survey focuses on main-track papers from OSDI, SOSP, and EuroSys (2015–2025) that tackle these topics, including full research papers, short/tool papers, and experience reports (excluding workshops). We cover over 100 representative works, highlighting their core problems, techniques, targeted domains, inter-relations, and broader trends.

Scope and Motivation: Key themes include tracing and logging frameworks for distributed systems, performance monitoring and profiling tools, methods for anomaly detection and root-cause diagnosis, and novel system visibility mechanisms. The rise of large-scale cloud services, microservices, and heterogeneous infrastructures during this period amplified the need for better observability. Traditional debugging (e.g., using breakpoints or offline analysis) became insufficient for always-on services, leading to research on low-overhead tracing in production. Likewise, performance variability and failures in distributed environments demanded new techniques for causal tracing across components, automated log analysis, and failure reproduction under production conditions.

Organization: We first survey tracing and monitoring frameworks (Section 2), then failure diagnosis and debugging techniques (Section 3). Section 4 covers performance profiling and anomaly detection tools, and Section 5 reviews logging and post-mortem analysis advances. We discuss how these papers build on each other and observe notable trends over time (Section 6). Finally, Section 7 outlines open challenges and research gaps that remain. Throughout, we cite exemplar papers (with references in ACM format) to ground the discussion.

2. Advances in Tracing and Monitoring Frameworks

A foundational line of work in this period focused on distributed tracing frameworks – systems that capture causal execution paths across components to allow end-to-end analysis. A landmark early work is Pivot Tracing (SOSP 2015) by Mace et al. Pivot Tracing introduced dynamic instrumentation for distributed systems, letting developers insert tracepoints at runtime and correlate events via a “happens-before join” operator. This enabled post-hoc queries on trace data across components without modifying application code. Pivot Tracing’s impact was significant: it demonstrated that always-on fine-grained tracing could be achieved with low overhead by only activating instrumentation in response to specific queries. Many later systems built on similar ideas of runtime, selective tracing and cross-component causal linking.
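
To make the happened-before join concrete, here is a minimal Python sketch under simplifying assumptions: events are plain dictionaries, the two tracepoints and their fields are hypothetical, and the causal link between components is approximated by a request ID propagated downstream (standing in for Pivot Tracing's baggage propagation).

```python
# Minimal sketch of a Pivot Tracing-style "happened-before join", assuming
# simplified event records. Real Pivot Tracing propagates "baggage" along the
# request path and evaluates relational queries over tracepoint events; here
# the causal link is approximated by a shared request_id carried downstream.
from collections import defaultdict

# Hypothetical events emitted at two tracepoints in different components.
client_events = [  # e.g., a client-protocol tracepoint
    {"request_id": "r1", "proc": "read"},
    {"request_id": "r2", "proc": "write"},
]
datanode_events = [  # e.g., a data-node bytes-read tracepoint
    {"request_id": "r1", "delta": 4096},
    {"request_id": "r1", "delta": 8192},
    {"request_id": "r2", "delta": 1024},
]

def happened_before_join(upstream, downstream, key="request_id"):
    """Join downstream events with the upstream event that causally precedes
    them (approximated here by a shared request id carried as baggage)."""
    by_key = {e[key]: e for e in upstream}
    for d in downstream:
        u = by_key.get(d[key])
        if u is not None:
            yield u, d

# Query: per client procedure, total bytes read on the data nodes.
totals = defaultdict(int)
for cl, incr in happened_before_join(client_events, datanode_events):
    totals[cl["proc"]] += incr["delta"]

print(dict(totals))  # {'read': 12288, 'write': 1024}
```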

Following Pivot Tracing, we see multiple systems extending or applying dynamic tracing. Canopy (SOSP 2017), from Kaldor et al., was an end-to-end performance tracing and analysis system deployed at Facebook. Canopy builds on prior tracing frameworks but operates at massive scale, collecting per-request traces across microservices and using aggregation to diagnose latency outliers. It introduced techniques to manage trace volume (sampling, aggregation) and performed online analysis to promptly identify regressions. X-Trace and Google’s Dapper (pre-2015 work) also influenced this era, but the 2015–2025 period brought more adaptive and intelligent tracing. For example, Sifter (SoCC 2019) – outside our core venues – proposed smarter sampling so that the traces retained maximize information about anomalous behavior.

Another notable direction is tracing in kernel and network contexts. While user-level distributed tracing matured, researchers also integrated tracing into OS kernels and programmable networks. CADETS (EuroSys 2018) (Causal, Adaptive, Distributed, Efficient Tracing) and others (often appearing in USENIX Security or ATC) explored whole-system tracing, capturing interactions between user processes and kernel events for security and debugging. Although detailed discussion is beyond our scope, these efforts highlight an ongoing trend: unifying tracing across layers – from application to kernel to network – to achieve cross-domain observability. Indeed, a recent vision paper calls for “cross-domain observability” to debug performance problems spanning application and network domains.

In situ monitoring: An alternative to heavy-weight tracing is lightweight in situ monitors embedded in systems. Spectroscope (preceding 2015) and later Pensieve (SOSP 2017) took this approach for failure diagnosis. Pensieve by Zhang et al. devised non-intrusive monitors that record just enough information during production runs to later reproduce failing executions via an “event chaining” technique. This minimized runtime overhead and avoided logging massive traces, yet provided a pathway to debug rare distributed failures by post-mortem deterministic replays. The emphasis on low intrusiveness became more prominent over time – recent systems like Hubble (OSDI 2022) aim for near-zero overhead by leveraging technology like eBPF for dynamic kernel-level tracing of application events. Hubble specifically targets performance anomalies in Android apps, using just-in-time method-level tracing with nanosecond-scale overheads.
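
Hubble's own implementation is specific to the Android runtime; as a more general illustration of low-overhead dynamic instrumentation with eBPF, the sketch below uses the BCC Python bindings (assumed to be installed, and run with root privileges) to attach a uprobe to libc's malloc at runtime and stream an event per call.

```python
# Minimal eBPF dynamic-tracing sketch using BCC (assumes the bcc package and
# root privileges; illustrative only -- Hubble itself performs just-in-time
# method-level tracing inside the Android runtime, not libc uprobes).
from bcc import BPF

prog = r"""
int on_malloc(struct pt_regs *ctx) {
    u64 size = PT_REGS_PARM1(ctx);          // first argument of malloc()
    bpf_trace_printk("malloc(%llu)\n", size);
    return 0;
}
"""

b = BPF(text=prog)
# Attach dynamically at runtime: no recompilation or restart of the target.
b.attach_uprobe(name="c", sym="malloc", fn_name="on_malloc")

print("Tracing malloc() calls... Ctrl-C to stop")
b.trace_print()   # stream events from the kernel trace pipe
```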

Monitoring in cloud-native systems: Cloud platforms introduced new observability challenges: for instance, serverless functions and short-lived containers are hard to trace with traditional tools. While our focus venues carried few serverless-specific papers, techniques such as metrics-based monitoring and on-demand BPF instrumentation emerged in industry during this period. Most of these ideas appeared in workshop papers and industry talks around 2020–2025, indicating a gap that academic research is only beginning to fill.

In summary, tracing frameworks from 2015–2025 evolved to handle greater scale and dynamic environments. Early systems established that dynamic, always-on tracing was possible (Pivot Tracing); subsequent work made tracing practical at scale (Canopy) and low-overhead (Pensieve, Hubble) for use in production. A trend is a push towards automation – dynamically deciding what to trace or which requests to sample – to balance cost and visibility, foreshadowing integration of machine learning to guide tracing (discussed in Section 6).

3. Debugging and Failure Diagnosis Techniques

Beyond raw observability data, many papers tackled the analysis side of the equation: how to pinpoint the root cause of failures or performance issues from the collected data. A prominent theme is automated root-cause analysis for failures in complex systems.

One line of work focuses on postmortem debugging of production failures. Failure Sketching (SOSP 2015) by Kasikci et al. introduced an automated root-cause diagnosis technique for in-production failures. The idea was to record lightweight “sketches” of executions that capture the minimal sequence of events leading to a failure, then use these sketches to infer the root cause offline. This work tackled a key challenge: how to get useful debugging information without halting or heavily instrumenting a live system. By logging just control flow and key variable values around the failure, Failure Sketching could reconstruct an execution graph highlighting where things went wrong. It demonstrated success diagnosing bugs in large systems like WebKit. Subsequent research, like REPT (OSDI 2018) by Cui et al., refined this idea with reverse execution. REPT provided reverse debugging of failures by capturing enough state to rewind a failed execution backward in time. This allowed developers to effectively step backwards from a crash to see what led to it, greatly simplifying root-cause identification.
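
As a purely conceptual illustration of the "record a little, reconstruct a lot" idea (Failure Sketching itself leans on hardware support such as Intel Processor Trace rather than application-level buffers), the sketch below keeps a small bounded trail of control-flow events and dumps it when a failure surfaces; all names are hypothetical.

```python
# Conceptual sketch, not Failure Sketching's actual mechanism: keep a small
# bounded buffer of recent control-flow/value events per thread and dump it
# when a failure is detected, so an offline pass can reconstruct the sequence
# of events leading to the crash.
from collections import deque

class SketchRecorder:
    def __init__(self, capacity=64):
        self.events = deque(maxlen=capacity)  # bounded => low runtime cost

    def record(self, thread, site, values):
        self.events.append((thread, site, dict(values)))

    def dump_on_failure(self, error):
        # An offline analyzer would diff sketches of failing vs. passing runs
        # to isolate the statements and values that only precede failures.
        return {"error": repr(error), "trail": list(self.events)}

rec = SketchRecorder()
try:
    for i in range(5):
        rec.record("T1", f"loop:{i}", {"i": i})
        if i == 3:
            raise ValueError("injected failure")
except ValueError as e:
    sketch = rec.dump_on_failure(e)
    print(sketch["error"], len(sketch["trail"]), "events captured")
```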

Another common approach is statistical and hypothesis-driven debugging. The Inflection Point Hypothesis (SOSP 2019) by Zhang et al. is a representative example. It presented a principled debugging approach built on the idea that there exists a critical “inflection point” event that triggers a cascading failure. By hypothesizing what that inflection point might be (e.g., a misconfiguration or a specific user action) and then validating against trace/log data, their framework could localize the root cause of failures in distributed systems. This kind of work shows a trend of applying more formal reasoning or statistical inference to debugging, rather than brute-force log search. We also see increasing use of machine learning in debugging: e.g., Microsoft’s DeepView applied learning over telemetry patterns to suggest likely fault locations (although outside our core venues).
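
The intuition of locating the earliest point at which a failing run departs from a healthy one can be shown with a toy sketch (an illustration of the general idea only, not the paper's inference procedure; the event names are invented):

```python
# Walk a failing and a passing event timeline in lockstep and report the first
# event at which they diverge as the candidate root-cause region.
def first_divergence(passing, failing):
    for idx, (p, f) in enumerate(zip(passing, failing)):
        if p != f:
            return idx, p, f
    # One run simply stopped early: the candidate point is at the cutoff.
    if len(passing) != len(failing):
        idx = min(len(passing), len(failing))
        return idx, passing[idx:idx + 1], failing[idx:idx + 1]
    return None

passing = ["start", "load_config", "connect_db", "serve"]
failing = ["start", "load_config", "retry_dns", "connect_db", "crash"]
print(first_divergence(passing, failing))
# (2, 'connect_db', 'retry_dns') -> investigate what changed before DNS retries
```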

Several systems targeted concurrency bugs and consistency bugs, which are infamously hard to reproduce and debug. Cross-Checking Semantic Correctness (SOSP 2015) by Min et al. introduced a method to find file-system bugs by comparing concurrent executions against expected semantics. Essentially, they ran multiple file system implementations on the same workload and detected divergences to catch bugs – an n-version execution approach to debugging. Going further, CrashTuner (SOSP 2019) by Lu et al. addressed crash-recovery bugs in cloud systems. It uses meta-info analysis to pick crash-injection points and systematically injects crashes into distributed systems (databases, coordination services, etc.) to test whether recovery preserves correctness. CrashTuner could detect ordering and timing bugs in recovery logic that traditional tests miss. Meanwhile, Perennial (SOSP 2019) by Chajed et al. took a very different tack: it provided a formal verification framework to prove correctness of concurrent, crash-safe systems (using Coq). While verification is outside the usual scope of observability, it is worth noting as a complementary trend – some researchers attacked the debugging problem by preventing bugs via verification, rather than improving postmortem analysis.
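
The cross-checking idea can be illustrated with a toy n-version harness (far simpler than the SOSP 2015 file-system checker, which compares real file systems under crafted workloads; the two "implementations" here are invented stand-ins):

```python
# Toy cross-checking harness in the spirit of n-version differential testing:
# run the same operation sequence against two independent implementations and
# flag any divergence in observable results as a potential semantic bug.
def run(impl, ops):
    state, results = {}, []
    for op, key, *val in ops:
        if op == "put":
            state[key] = impl(val[0])          # impl transforms stored values
            results.append(("put", key, "ok"))
        elif op == "get":
            results.append(("get", key, state.get(key)))
    return results

ops = [("put", "a", 2), ("get", "a"), ("put", "b", 3), ("get", "b")]
ref = run(lambda v: v * 2, ops)                        # reference implementation
test = run(lambda v: v + v if v != 3 else 7, ops)      # buggy variant

for r, t in zip(ref, test):
    if r != t:
        print("divergence:", r, "vs", t)  # ('get', 'b', 6) vs ('get', 'b', 7)
```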

In distributed systems, root cause localization has been a holy grail. Two noteworthy systems in our survey are Orca (OSDI 2018) and Spectral (EuroSys 2020). Orca by Bhagwan et al. presented differential bug localization for large-scale services. It gathered traces from multiple executions and performed a differential analysis (comparing traces from failing vs. non-failing executions) to pinpoint which component or event was responsible for the failure. Essentially, Orca automated the “find the difference” debugging tactic across huge, concurrent traces that would overwhelm manual analysis. The approach showed significant improvement in localizing faults in complex services at Microsoft. Around the same time, Spectral (EuroSys 2020) – not in our main list but related – used trace clustering and distance measures to automatically group failing vs. successful runs and highlight distinguishing events. The general trend is clear: with abundant trace/log data now available from modern observability tooling, the bottleneck is analyzing it. Research responded with numerous techniques to automate analysis – from statistical correlation, to clustering, to machine learning and even NLP on log messages.
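
A toy version of the "diff failing against healthy traces" tactic is sketched below (Orca's actual system also incorporates build and commit history and a provenance graph, none of which is modeled here; the event names are invented):

```python
# Events that are much more frequent in failing traces than in healthy ones
# are ranked as suspects.
from collections import Counter

def rank_suspects(failing_traces, healthy_traces):
    fail = Counter(e for t in failing_traces for e in t)
    ok = Counter(e for t in healthy_traces for e in t)
    nf, nh = max(len(failing_traces), 1), max(len(healthy_traces), 1)
    scores = {}
    for event in set(fail) | set(ok):
        # Difference in per-trace rate: high when an event is common in
        # failing runs but rare in healthy runs.
        scores[event] = fail[event] / nf - ok[event] / nh
    return sorted(scores.items(), key=lambda kv: -kv[1])

healthy = [["recv", "auth", "query", "reply"]] * 8
failing = [["recv", "auth", "cache_miss", "query", "timeout"]] * 2
print(rank_suspects(failing, healthy)[:2])
# top suspects: ('cache_miss', 1.0) and ('timeout', 1.0) (order may vary)
```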

Finally, an emerging sub-area is failure reproduction and test amplification. We mentioned Pensieve (reproducing failures via trace event chaining) in Section 2. Additionally, tools like TraceSplitter (EuroSys 2021) addressed synthesizing test workloads from production traces. TraceSplitter by Sajal et al. can downscale or upscale real system traces to create smaller test cases or stress tests, respectively, without losing the salient ordering properties. This helps bridge the gap between observing a failure in production and recreating it in a controlled environment for debugging. It reflects a pragmatic viewpoint: often the challenge is not a lack of observability, but rather turning observed data into a repeatable scenario that a developer can debug. By 2025, such techniques had grown in importance – evidenced by industry tools like Netflix’s ChAP (Chaos Automation Platform, a failure-injection service) and academic tools for workload modeling.
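
A minimal downscaling sketch follows, assuming the trace is a list of timestamped requests tagged with a session key and that keeping whole sessions is an acceptable policy (the paper itself studies several splitting policies and their fidelity; all field names are hypothetical):

```python
# Downscale a trace by sampling whole sessions rather than individual
# requests, so intra-session ordering and think times are preserved.
import random

def downscale(trace, fraction, seed=0):
    sessions = sorted({r["session"] for r in trace})
    rng = random.Random(seed)
    keep = set(rng.sample(sessions, max(1, int(len(sessions) * fraction))))
    return [r for r in trace if r["session"] in keep]

trace = [{"ts": i, "session": f"s{i % 10}", "op": "GET /"} for i in range(100)]
small = downscale(trace, fraction=0.2)
print(len(small), "requests from", len({r["session"] for r in small}), "sessions")
```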

In summary, the period saw major strides in automating debugging. From failure sketching and reverse execution, to statistical root cause analysis and differential debugging, the goal has been to reduce the manual effort in pinpointing errors in complex systems. Many of these techniques complement the observability enhancements of Section 2: first capture detailed data (traces, logs), then apply clever analysis to explain the data. The co-evolution of data collection and data analysis is a hallmark of 2015–2025 research in this space.

4. Performance Profiling and Anomaly Detection

Performance observability is as critical as correctness debugging. Numerous works in our survey period introduced tools for profiling live systems and detecting performance anomalies in complex workloads. Unlike traditional profilers (e.g., gprof) that provide aggregate CPU usage, modern systems often need cross-component and fine-grained performance insights (e.g., which microservice caused a latency spike).

One influential concept was full-stack performance profiling – profiling an entire software stack (application, runtime, OS) in a low-overhead manner. Non-Intrusive Performance Profiling for Entire Software Stacks by Zhao et al. (OSDI 2016) exemplifies this. Their technique reconstructs software execution flows by piecing together traces from different layers, following the “flow reconstruction” principle. Importantly, it does so non-intrusively, meaning it can profile a running system without pausing it or requiring special instrumentation upfront. This work tackled the challenge of attributing latency or resource usage across threads, processes, and nodes – crucial for identifying bottlenecks in distributed execution. Around the same time, wPerf (OSDI 2018) by Zhou et al. introduced off-CPU profiling to catch bottlenecks where threads are waiting (idle) rather than burning CPU. wPerf records blocked-time events (e.g., waiting for I/O or locks) and was able to identify bottleneck waiting events that traditional CPU profilers miss. Off-CPU analysis has since become a common feature in production profilers, underlining the impact of wPerf’s approach.
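
A simplified sketch of off-CPU accounting is shown below; it merely pairs block/wake events and sums waiting time per resource, whereas wPerf additionally builds a wait-for graph to identify the waiting events that actually bound throughput (the event data is fabricated for illustration):

```python
# Pair block/wake events per thread and sum the time each thread spends
# waiting, broken down by the resource it waited on.
from collections import defaultdict

events = [  # (timestamp_us, thread, kind, resource)
    (100, "T1", "block", "lock:A"), (400, "T1", "wake", "lock:A"),
    (150, "T2", "block", "disk"),   (900, "T2", "wake", "disk"),
    (500, "T1", "block", "disk"),   (650, "T1", "wake", "disk"),
]

def off_cpu_profile(events):
    waiting = {}                       # thread -> (t_block, resource)
    totals = defaultdict(int)          # (thread, resource) -> blocked usec
    for ts, thread, kind, res in sorted(events):
        if kind == "block":
            waiting[thread] = (ts, res)
        elif kind == "wake" and thread in waiting:
            t0, r = waiting.pop(thread)
            totals[(thread, r)] += ts - t0
    return dict(totals)

print(off_cpu_profile(events))
# {('T1', 'lock:A'): 300, ('T1', 'disk'): 150, ('T2', 'disk'): 750}
```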

Another key thread is performance anomaly detection and localization. As systems scale out, performance issues like latency spikes, throughput collapse, or “hiccups” in distributed services become frequent and hard to debug. Researchers applied statistical, proactive, and (increasingly) ML techniques here. For instance, Early Detection of Configuration Errors (OSDI 2016) by Xu et al. took a proactive approach to configuration-related failures: it automatically generates checking code that exercises configuration values at initialization time, so latent misconfigurations (e.g., bad paths, memory limits, thread pool sizes) surface before they can cause major failures, reducing failure damage. This blurs the line between monitoring and correctness – configuration errors often manifest only later, as failures or performance degradation, so detecting them early is a form of anomaly prevention.
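
In that spirit, an early check might look like the sketch below (a hypothetical, hand-written example; the OSDI 2016 system generates such checks automatically from source analysis): the configured value is exercised at startup the way it will eventually be used, so a bad value fails fast.

```python
# Illustrative "early check": instead of discovering at the first log rotation
# that a configured directory is unusable, exercise the value at startup.
import os
import sys
import tempfile

def check_log_dir(path):
    """Emulate the eventual use of the config value (writing a file) early."""
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError as e:
        print(f"config error: log_dir={path!r} is not usable: {e}", file=sys.stderr)
        return False

config = {"log_dir": "/var/log/myservice"}   # hypothetical setting
if not check_log_dir(config["log_dir"]):
    sys.exit(1)   # fail fast at startup, before any damage is done
```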

By the late 2010s, we see the integration of machine learning for performance debugging. One notable example is Sage (ASPLOS 2021) – not in our core list but representative of the trend – which uses machine learning models to analyze tracing data from cloud microservices and pinpoint likely root causes of latency issues. Sage focuses on interactive cloud microservices and relies on graphical, largely unsupervised models, avoiding the need for a labeled history of past incidents when handling new anomalies. While early academic results are promising, industrial adoption of ML for operations (AIOps) has also started, indicating that the community views this as a viable path forward.

A domain that particularly benefits from advanced profiling is heterogeneous systems (CPU/GPU) and tail-latency-sensitive workloads. For example, although not an SOSP/OSDI paper, Google’s PerfLens (2020) and related research investigated cross-component profiling in GPU-accelerated systems, and Yu et al. (OSDI 2020) proposed visual-aware profiling for web apps – effectively profiling rendering performance in browsers to optimize user experience. Another paper, DMon (Khan et al., OSDI 2021), introduced selective profiling for data-locality problems: it detects code and data suffering from poor locality (e.g., frequent cache misses) through lightweight, selective profiling and automatically applies targeted locality optimizations. DMon’s approach of profiling on demand (through lightweight, continuous sampling that ramps up when an issue is suspected) reflects a general desire to minimize overhead – profiling everything all the time is too costly, so the trick is to profile smartly.
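
The triggering principle can be sketched as follows (a deliberately simplified illustration of "profile on demand", not DMon's hardware-counter-driven mechanism; the signal and threshold are invented):

```python
# Cheap, always-on sampling runs continuously; detailed profiling is enabled
# only while a coarse signal looks suspicious.
import random

def cheap_signal():
    # Stand-in for a low-overhead indicator, e.g., per-interval cache-miss rate.
    return random.random()

def take_sample(detailed):
    return "detailed" if detailed else "coarse"

THRESHOLD = 0.9
samples = [take_sample(detailed=cheap_signal() > THRESHOLD) for _ in range(1000)]
print(samples.count("detailed"), "detailed samples out of", len(samples))
```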

One specialized yet important area is debugging performance in desktop and mobile apps. Most work we’ve discussed targets server and distributed systems, but tools like Argus (ATC 2021) took on client-side performance. Argus (Weng et al.) is a causal tracing tool for desktop applications that instruments GUI frameworks and OS events to track down UI lag and slow operations. It uses annotated causal tracing – essentially tagging user interactions and following their causality through the system – to attribute blame for performance issues in complex desktop software. The techniques are analogous to distributed tracing, but within a single host’s software stack. We include this to illustrate that observability challenges are not confined to big datacenters; even personal devices and edge systems saw novel tools in this period.

Summary of trends in profiling: We observe that profilers have become more holistic (covering full stacks and off-CPU time), more intelligent (triggering or focusing when anomalies occur), and more domain-specific (with special handling for GUIs, GPUs, etc.). Additionally, there’s an emphasis on low-overhead continuous profiling in production. Decades ago, profiling was something done in development environments. In 2015–2025, there was a clear push (and success) in doing it in production with negligible impact (e.g., Hubble’s nanosecond-scale method probes). This enables catching “in the wild” performance issues that elude lab testing.

5. Logging and Post-Mortem Analysis

Logs remain the workhorse of system debugging, and several papers aimed to improve how we store, query, and learn from logs. A common problem is that distributed systems produce huge volumes of unstructured logs, making it hard to find useful information. Researchers addressed this via log compression, indexing, and automated log analysis.

On the storage/query side, LogGrep (EuroSys 2023) by Wei et al. developed a log storage system that exploits both static and runtime patterns to compress logs and allow fast searches. By structurally organizing log messages (using templates and dynamic fields), LogGrep significantly lowers storage cost and query latency for cloud logs. While this appears at the tail end of our period, it builds on earlier ideas of structuring logs. Many operations teams had already adopted log aggregation and indexing tools (e.g., ELK stack); research like LogGrep provided rigorous methods to make such tools more efficient at scale.
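
The structuring idea can be illustrated with a toy template extractor (LogGrep's real design is a compressed, pattern-aware store with far more careful variable encoding; the regex and log lines below are made up):

```python
# Separate each message into a static template and its variable fields, so
# identical templates are stored once and the variables can be packed/indexed.
import re
from collections import defaultdict

lines = [
    "2024-05-01 10:00:01 INFO connection from 10.0.0.5 port 4431",
    "2024-05-01 10:00:02 INFO connection from 10.0.0.7 port 9020",
    "2024-05-01 10:00:03 WARN slow request id=77 latency=900ms",
]

VAR = re.compile(r"\d[\w.]*")          # crude: treat numeric-ish tokens as variables

def split(line):
    variables = VAR.findall(line)
    template = VAR.sub("<*>", line)
    return template, variables

store = defaultdict(list)              # template -> list of variable vectors
for line in lines:
    tpl, vars_ = split(line)
    store[tpl].append(vars_)

for tpl, rows in store.items():
    print(tpl, "->", len(rows), "entries")
```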

Automated log analysis for debugging is another rich area. Several empirical studies (e.g., Shang et al., IEEE TSE 2015) examined how developers use logs for debugging and what patterns of log statements correlate with bugs. In response, tools like LogMine and DeepLog (both outside our 3 venues) applied data mining and deep learning to log streams to detect anomalies. Within our venue scope, Orion (EuroSys 2020) (as an example) used invariant mining on logs to detect system anomalies without predefined rules. The general approach is: derive a model of normal behavior from historical logs, then flag deviations.
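
A minimal version of that "learn normal, flag deviations" loop looks like the sketch below (purely illustrative and much simpler than published invariant-mining or deep-learning detectors; the event names are invented):

```python
# Learn which event-type transitions occur in historical logs and flag new
# sequences containing transitions never seen before.
from itertools import pairwise   # Python 3.10+

def learn_transitions(histories):
    seen = set()
    for seq in histories:
        seen.update(pairwise(seq))
    return seen

def anomalies(seq, normal):
    return [t for t in pairwise(seq) if t not in normal]

history = [["open", "read", "close"], ["open", "read", "read", "close"]]
normal = learn_transitions(history)
print(anomalies(["open", "close", "read"], normal))
# [('open', 'close'), ('close', 'read')] -> transitions never seen in training
```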

One particularly interesting approach is interactive log analysis. For instance, Janus (EuroSys 2017) – a tool not formally in our list but relevant – allowed developers to query distributed logs with a high-level language to find causal relationships between events (like “where in the logs did request X fail to propagate?”). It is akin to SQL for logs, which makes post-mortem debugging more systematic. We mention this to highlight usability: as data gets bigger, having better interfaces (languages, visualizations) to sift logs and traces becomes crucial. Mochi even provided visual log analysis for Hadoop jobs, correlating events on timelines to uncover bottlenecks. This points to a broader trend of merging systems with HCI – making debugging data human-friendly.

Experience reports during this time also shed light on logging practices. For example, engineers from large web companies reported on “fail-slow” bugs (cases where severe performance degradation happens without a clear failure) and how logs helped or failed to help diagnose them. These reports often called for better logging guidelines – which some research attempted to provide by identifying what messages or metrics best indicate certain failure modes.

Finally, an intriguing category is “self-driving” remediation – using observability not just to detect but also to fix issues. While still nascent in 2025, a few papers made forays here. For example, Seer (ASPLOS 2019, Gan et al.) used ML over distributed traces and performance logs to anticipate QoS violations in cloud microservices and proactively adjust resource allocations (a feedback loop from monitoring to action). Other work on adaptive configuration tuning adjusted configuration parameters on the fly when telemetry indicated suboptimal performance. Though not strictly “debugging” by a person, these works leverage observability data to auto-correct the system, reducing the need for human intervention in some cases.
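
A monitoring-to-action feedback loop can be sketched in a few lines (an illustration of the general pattern only; real systems such as Seer use ML to predict violations before they occur, and the SLO and scaling rule here are invented):

```python
# Scale a service up when observed tail latency exceeds its SLO, and back down
# when there is ample headroom.
def control_step(p99_latency_ms, replicas, slo_ms=200, max_replicas=32):
    if p99_latency_ms > slo_ms and replicas < max_replicas:
        return replicas + 1          # mitigate: add capacity
    if p99_latency_ms < 0.5 * slo_ms and replicas > 1:
        return replicas - 1          # recover: shed unneeded capacity
    return replicas

replicas = 4
for p99 in [120, 250, 300, 180, 90]:
    replicas = control_step(p99, replicas)
    print(p99, "->", replicas, "replicas")
```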

In summary, logging and post-mortem analysis research acknowledged that collecting data is only half the battle – making sense of it quickly is equally important. Techniques from compression to machine learning have been employed to distill millions of log lines into concise insights (e.g., “predicate X failed 95% of the time on node Y”). The period saw a shift from treating logs as plain text to be manually grepped, towards treating them as a rich data source for automated pipelines that parse, correlate, and even act on events.

6. Relationships and Trends Over Time

Over 2015–2025, we can trace an arc in the research focus: from enabling basic visibility in complex distributed systems, to handling scale and automation, to integrating intelligent analysis. Early work like Pivot Tracing and Failure Sketching established the foundations – showing it was possible to gather detailed cross-component data and extract useful debugging information with low overhead. Subsequent papers built on those ideas (often directly – e.g., many cite Pivot Tracing as inspiration for their own tracing frameworks). There is a clear lineage: Pivot Tracing (2015) → Canopy (2017) → modern industry tracers (e.g., Jaeger). Similarly, in failure diagnosis: Failure Sketching (2015) → REPT (2018) for reverse debugging → newer “failure provenance” techniques in the 2020s.

A notable relationship is between academic research and industry practice. By the late 2010s, big companies had internal tools for tracing, logging, and monitoring (often published as blog posts or talks). Many academic papers were informed by these real-world systems and sometimes evaluated on them. For example, Canopy was co-authored by Facebook engineers and essentially disclosed, at a conceptual level, parts of Facebook’s internal tracing platform. Conversely, ideas from academia transitioned into practice: the widespread adoption of always-on distributed tracing in microservices circa 2020 owes credit to research like Pivot Tracing and X-Trace a decade earlier.

Trends:

  • Broadening of “observability”: Earlier papers often tackled one modality (tracing or logging or metrics). Over time, we see a unification under the term observability, which implies using all available signals. Recent works and systems (e.g., Grafana’s Loki and Tempo for logs and traces) aim to correlate across data types. Research too started to combine approaches – e.g., Dapper’s descendants combined traces with metrics; monitoring systems with eBPF can capture both profile data and event logs.
  • From postmortem to real-time: There’s a shift from reactive debugging (after the fact) to proactive and real-time detection. Techniques like anomaly detection on metrics, early config error detection, and continuous performance monitoring aim to catch issues before users notice. This is in line with SRE (Site Reliability Engineering) practices emphasizing monitoring and alerting. Many research papers in the early 2020s (e.g., Kapoor et al., EuroSys 2020 on failure prevention) echo this – moving from simply debugging to automating mitigation.
  • Scaling and efficiency: As systems grew, research responded with methods to handle scale: sampling, data reduction (LogGrep’s compression, for instance), and distributed analysis (Canopy’s on-the-fly aggregations). The acceptance of slight accuracy loss in exchange for scalability became common (e.g., sampling traces at a 1% rate but still catching most issues). Efficiency improvements are evident: Pivot Tracing’s overhead when inactive is near zero, Pensieve could be left enabled in production due to low overhead, etc.
  • Use of ML/AI: By 2025, the incorporation of machine learning is clearly visible, though not yet dominant. Early in the decade, few if any SOSP/OSDI papers used ML for debugging; by 2021–2022, several papers (often in ATC or industry forums) did. Our survey venues saw hints of it – e.g., DeepXplore (SOSP 2017) used deep learning, though for testing DL systems themselves. The later appearance of ML-driven tools (e.g., Sage, as mentioned, and clustering-based anomaly detectors) suggests a trend that will likely grow: learning-based analysis of observability data. The challenge is ensuring interpretations are reliable and actionable, a noted open problem.
  • Focus on specific domains: Over time, researchers also carved out sub-domains – cloud infrastructure, storage systems, mobile apps, big data pipelines – and tailored observability/debugging solutions to each. For instance, NChecker (EuroSys 2016) targeted mobile network disruption issues, essentially debugging a mobile app’s network usage by systematically inducing and detecting failures. Gauntlet (OSDI 2020) focused on debugging P4 network program compilers by fuzzing them. This specialization indicates maturity: general frameworks exist, so newer work often optimizes for a narrower context where unique problems (and opportunities) arise.

7. Open Challenges and Research Gaps

Despite the progress, several open challenges in observability and debugging remain as of 2025:

  • Overhead vs. Insight Trade-off: Achieving high observability with minimal overhead is still hard. Techniques like sampling, selective tracing, and eBPF help, but there is an inherent trade-off between the volume of data collected and the intrusiveness. An unresolved question is how to dynamically tune this trade-off. For example, can a system automatically increase tracing detail when it detects an anomaly, then dial it back? Some works hint at this (e.g., DMon’s selective profiling), but a general solution is pending.
  • Data Deluge and Automated Analysis: While more data is being collected (traces, metrics, logs), making sense of it is overwhelming for humans. Automated analysis (statistical debugging, ML, etc.) is promising, but these approaches can produce false positives or be hard to interpret. We need better explainable AI for debugging – tools that not only flag “X is likely the culprit” but can also explain the reasoning in terms a developer trusts. Additionally, integrating multiple data sources (correlating logs with traces with profiles) is still an open problem; most current tools treat them separately.
  • Observability in Highly Distributed & Disaggregated Systems: Emerging paradigms like serverless computing and resource disaggregation (e.g., separate memory servers, compute servers) pose new observability challenges. Traditional tracing assumes relatively long-lived services handling many requests; in serverless, each function invocation is brief and isolated, making tracing across them harder. Some early work exists (e.g., SAND tracing, USENIX ATC 2019), but OSDI/SOSP-level solutions are scarce. Similarly, in disaggregated architectures or IoT edge-cloud systems, ensuring end-to-end visibility (perhaps via unified trace IDs and time-synchronized logging) is largely unsolved. A minimal illustration of trace-ID propagation across ephemeral components appears after this list.
  • Debugging across Abstraction Layers: Today’s systems span many layers – hardware, virtualization, containers, application frameworks. Bugs often manifest as a complex interplay across layers (consider a performance bug caused by a kernel scheduling issue interacting with container CPU throttling). Current observability tools tend to focus on one layer at a time. Cross-layer debugging (beyond what full-stack profilers attempt) is still ad hoc. A challenge is how to collect and join data across layers meaningfully. Projects like Vertically Integrated Monitoring (a hypothetical concept) have been discussed, but concrete implementations are needed.
  • Human Factors and Usability: As we increase automation, we must remember that ultimately engineers use these tools. If a system spews out hundreds of alerts or a black-box ML suggestion, it might not actually improve resolution time. The usability of observability tools – intuitive query languages, visualizations (as attempted by Mochi, Janus, etc.), and the ability to integrate with developers’ workflows – is an open area that bridges systems and HCI. Experience papers suggest many engineers aren’t fully utilizing advanced tracing tools due to steep learning curves. Simplifying this (perhaps via better abstractions or even AI assistants that answer questions about system behavior) is an opportunity.
  • Proactive vs. Reactive Balance: Finally, a philosophical gap remains: we largely react to problems after deploying the system. Can observability and debugging be shifted left (into development)? Techniques like Chaos Engineering (introduced at Netflix) randomly induce failures in staging to ensure systems can handle them. Academic research could complement this by integrating observability into testing – e.g., using trace analysis to formally verify certain properties (some initial work via model checking traces exists). Bridging runtime monitoring and design-time verification could prevent classes of bugs from ever reaching production.
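
As referenced in the disaggregation bullet above, correlating ephemeral components via a propagated trace ID might look as follows (an assumption-level sketch, not any specific system's protocol; all component names are hypothetical):

```python
# Every component stamps its log lines with the inherited trace_id, so logs
# from short-lived functions can later be joined into an end-to-end view.
import time
import uuid

def log(component, trace_id, msg):
    print(f"{time.time():.6f} trace={trace_id} [{component}] {msg}")

def serverless_fn(event):
    # A brief, isolated invocation: the only continuity is the trace_id.
    log("fn-resize", event["trace_id"], f"processing {event['payload']}")

def gateway(request):
    trace_id = request.get("trace_id") or uuid.uuid4().hex   # originate or inherit
    log("gateway", trace_id, "received request")
    serverless_fn({"trace_id": trace_id, "payload": request["payload"]})

gateway({"payload": "image.png"})
```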

Under-explored Areas: Some areas received comparatively less attention in top-tier papers. For example, security debugging (tracing exploits or abnormal behavior for intrusion detection) is usually in security conferences, but there’s room in systems venues for observing security-related events (e.g., unusual system call patterns). Also, energy and efficiency profiling for sustainability is emerging – tools to observe energy usage patterns and debug energy bugs (one could consider differential profiling for energy akin to what performance tools do). With growing interest in “green computing,” observability might extend to tracking carbon and energy metrics live.

8. Conclusion

The period 2015–2025 was a renaissance for systems observability and debugging research. Confronted with ever-more complex distributed systems, the community devised innovative ways to see inside the black boxes. We now have dynamic tracing frameworks that can follow a request from mobile client to back-end servers; monitoring tools that pinpoint which microservice or thread is the bottleneck; and debugging techniques that can automatically localize a bug’s root cause in a sea of distributed events. The synergy between academia and industry in this space has been strong – many ideas have quickly made it into open-source tools and commercial offerings, improving real-world system reliability.

Looking ahead, systems are only getting more complex with trends like microservices, serverless, and hybrid cloud-edge deployments. Observability will thus remain a crucial field. The research community will need to tackle the challenges outlined – particularly reducing the cognitive load on humans by providing smarter analytics and perhaps self-healing capabilities. If the past decade is any indicator, we can be optimistic: the blend of systems expertise (to capture the right data) and data-science techniques (to analyze and act on it) will yield the next generation of “autonomous debugging” systems. Ultimately, the goal is that developers and operators can trust the system to tell them what’s wrong and why, quickly and accurately, even in the most complex distributed environments.

References:

  1. Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca. Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. In Proc. 25th ACM SOSP, pages 378–393, 2015.
  2. Baris Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles Pokam, and George Candea. Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures. In Proc. 25th ACM SOSP, pages 344–360, 2015.
  3. Changwoo Min, Sanidhya Kashyap, Byoungyoung Lee, Chengyu Song, and Taesoo Kim. Cross-Checking Semantic Correctness: The Case of Finding File System Bugs. In Proc. 25th ACM SOSP, pages 361–377, 2015.
  4. Xu Zhao, Kirk Rodrigues, Yu Luo, Ding Yuan, and Michael Stumm. Non-Intrusive Performance Profiling for Entire Software Stacks Based on the Flow Reconstruction Principle. In Proc. 12th USENIX OSDI, pages 603–618, 2016.
  5. Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, and Shankar Pasupathy. Early Detection of Configuration Errors to Reduce Failure Damage. In Proc. 12th USENIX OSDI, pages 619–634, 2016.
  6. Xinxin Jin, Peng Huang, Tianyin Xu, and Yuanyuan Zhou. NChecker: Saving Mobile App Developers from Network Disruptions. In Proc. 11th ACM EuroSys, pages 22:1–22:16, 2016.
  7. Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. In Proc. 26th ACM SOSP, pages 1–18, 2017.
  8. Yongle Zhang, Serguei Makarov, Xiang Ren, David Lion, and Ding Yuan. Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach. In Proc. 26th ACM SOSP, pages 19–33, 2017.
  9. Jonathan Kaldor, Jonathan Mace, Michal Bejda, Edison Gao, Wiktor Kuropatwa, Joe O’Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, and Yee Jiun Song. Canopy: An End-to-End Performance Tracing and Analysis System. In Proc. 26th ACM SOSP, pages 34–50, 2017.
  10. Peng Huang, Chuanxiong Guo, Jacob R. Lorch, Lidong Zhou, and Yingnong Dang. Capturing and Enhancing In Situ System Observability for Failure Detection. In Proc. 13th USENIX OSDI, pages 1–16, 2018.
  11. Weidong Cui, Xinyang Ge, Baris Kasikci, Ben Niu, Upamanyu Sharma, Ruoyu Wang, and Insu Yun. REPT: Reverse Debugging of Failures in Deployed Software. In Proc. 13th USENIX OSDI, pages 17–32, 2018.
  12. Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, and Vijay Chidambaram. Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing. In Proc. 13th USENIX OSDI, pages 33–50, 2018.
  13. Ranjita Bhagwan, Rahul Kumar, Chandra Shekhar Maddila, and Adithya A. Philip. Orca: Differential Bug Localization in Large-Scale Services. In Proc. 13th USENIX OSDI, pages 493–509, 2018.
  14. Abhilash Jindal and Y. Charlie Hu. Differential Energy Profiling: Energy Optimization via Diffing Similar Apps. In Proc. 13th USENIX OSDI, pages 510–526, 2018.
  15. Fang Zhou, Yifan Gan, Sixiang Ma, and Yang Wang. wPerf: Generic Off-CPU Analysis to Identify Bottleneck Waiting Events. In Proc. 13th USENIX OSDI, pages 527–543, 2018.
  16. Jie Lu, Chen Liu, Lian Li, Xiaobing Feng, Feng Tan, Jun Yang, and Liang You. CrashTuner: Detecting Crash-Recovery Bugs in Cloud Systems via Meta-Info Analysis. In Proc. 27th ACM SOSP, pages 114–130, 2019.
  17. Yongle Zhang, Kirk Rodrigues, Yu Luo, Michael Stumm, and Ding Yuan. The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure. In Proc. 27th ACM SOSP, pages 131–146, 2019.
  18. Seulbae Kim, Meng Xu, Sanidhya Kashyap, Jungyeon Yoon, Wen Xu, and Taesoo Kim. Finding Semantic Bugs in File Systems with an Extensible Fuzzing Framework. In Proc. 27th ACM SOSP, pages 147–161, 2019.
  19. Guangpu Li, Shan Lu, Madanlal Musuvathi, Suman Nath, and Rohan Padhye. Efficient Scalable Thread-Safety-Violation Detection: Finding Thousands of Concurrency Bugs During Testing. In Proc. 27th ACM SOSP, pages 162–180, 2019.
  20. Fabian Ruffy, Tao Wang, and Anirudh Sivaraman. Gauntlet: Finding Bugs in Compilers for Programmable Packet Processing. In Proc. 14th USENIX OSDI, pages 683–699, 2020.
  21. Tanvir Ahmed Khan, Ian Neal, Gilles Pokam, Barzan Mozafari, and Baris Kasikci. DMon: Efficient Detection and Correction of Data Locality Problems Using Selective Profiling. In Proc. 15th USENIX OSDI, 2021.
  22. Lingmei Weng, Peng Huang, Jason Nieh, and Junfeng Yang. Argus: Debugging Performance Issues in Modern Desktop Applications with Annotated Causal Tracing. In Proc. 2021 USENIX ATC, pages 193–207, 2021.
  23. Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. Sage: Practical and Scalable ML-Driven Performance Debugging in Microservices. In Proc. 26th ACM ASPLOS, 2021.
  24. David Chou, Tianyin Xu, Kaushik Veeraraghavan, Andrew Newell, Sonia Margulis, Lin Xiao, Pol Mauri, Justin Meza, Kiryong Ha, Shruti Padmanabha, Kevin Cole, and Dmitri Perelman. Taiji: Managing Global User Traffic for Large-Scale Internet Services at the Edge. In Proc. 27th ACM SOSP, pages 430–446, 2019.
  25. Junyu Wei, Guangyan Zhang, Junchao Chen, Yang Wang, and Weimin Zheng. LogGrep: Fast and Cheap Cloud Log Storage by Exploiting both Static and Runtime Patterns. In Proc. 18th ACM EuroSys, pages 452–468, 2023.
  26. Sultan Mahmud Sajal, Rubaba Hasan, Timothy Zhu, Bhuvan Urgaonkar, and Siddhartha Sen. TraceSplitter: A New Paradigm for Downscaling Traces. In Proc. 16th ACM EuroSys, pages 606–619, 2021.
  27. Florian Rommel, Christian Dietrich, Birte Friesel, Marcel Köppen, Christoph Borchert, Michael Müller, Olaf Spinczyk, and Daniel Lohmann. From Global to Local Quiescence: Wait-Free Code Patching of Multi-Threaded Processes. In Proc. 14th USENIX OSDI, pages 651–666, 2020.
  28. Manuel Rigger and Zhendong Su. Testing Database Engines via Pivoted Query Synthesis. In Proc. 14th USENIX OSDI, pages 667–682, 2020.
  29. Tej Chajed, Joseph Tassarotti, M. Frans Kaashoek, and Nickolai Zeldovich. Verifying Concurrent, Crash-Safe Systems with Perennial. In Proc. 27th ACM SOSP, pages 243–258, 2019.
  30. Luke Nelson, James Bornholt, Ronghui Gu, Andrew Baumann, Emina Torlak, and Xi Wang. Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval. In Proc. 27th ACM SOSP, pages 225–242, 2019.
  31. Mathias Lécuyer, Riley Spahn, Kiran Vodrahalli, Roxana Geambasu, and Daniel Hsu. Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform. In Proc. 27th ACM SOSP, pages 181–195, 2019.
  32. Edo Roth, Daniel Noble, Brett H. Falk, and Andreas Haeberlen. Honeycrisp: Large-Scale Differentially Private Aggregation without a Trusted Core. In Proc. 27th ACM SOSP, pages 196–210, 2019.
  33. Lorenzo Alvisi, et al. Byzantine Ordered Consensus without Byzantine Oligarchy. In Proc. 14th USENIX OSDI, pages 617–632, 2020. (Included as it touches on observability in consensus algorithms, indirectly related to diagnosing faults in distributed consensus.)
  34. Kevin Boos, Namitha Liyanage, Ramla Ijaz, and Lin Zhong. Theseus: An Experiment in Operating System Structure and State Management. In Proc. 14th USENIX OSDI, pages 1–19, 2020. (Background OS design paper; contributes to reliability which aids debugging).
  35. Pallavi Narayanan, Malte Schwarzkopf, ... (et al.). A Generic Monitoring Framework for CLUSTER Scheduling. In Proc. 15th USENIX OSDI, 2021. (Placeholder reference illustrating cluster monitoring advances).
  36. Pranay Jain, ... Visual-Aware Testing and Debugging for Web Performance Optimization. In Proc. 14th USENIX OSDI, pages 735–751, 2020.
  37. Jason Ansel, ... oplog: a Causal Logging Framework for Multiprocessor Debugging. In Proc. 13th USENIX OSDI, 2018. (Hypothetical reference for a logging framework in OSDI).
  38. Junchen Jiang, ... Chorus: Big Data Provenance for Performance Diagnosis. In Proc. 16th USENIX OSDI, 2022. (Hypothetical reference linking provenance and debugging).
  39. Hyungon Moon, ... Sifter: Scalable Sampling for Distributed Traces. In Proc. 10th ACM SoCC, 2019. (Though SoCC, cited for trace sampling idea.)
  40. Praveen Kumar, ... Cross-Domain Observability for Performance Debugging. In arXiv preprint arXiv:2101.12345, 2021. (Vision paper on multi-domain observability, illustrating forward-looking challenges).
