ASPLOS 2025: Paper Summaries and Insights

The Association for Computing Machinery's Architectural Support for Programming Languages and Operating Systems (ASPLOS) conference is a premier venue where researchers present cutting-edge work spanning computer architecture, programming languages, operating systems, and their intersections. ASPLOS 2025 showcased significant advances across these domains, with particular emphasis on AI systems, heterogeneous computing, and novel memory architectures. This summary analyzes key papers and trends from the conference, highlighting their implications for both academia and industry.

Hot Research Directions and Emerging Topics: Many ASPLOS 2025 papers converge on making massive-scale AI practical – from efficient LLM serving and Mixture-of-Experts (MoE) training to new systems for privacy-preserving ML (homomorphic encryption on GPUs, accelerated zero-knowledge proofs). Another clear thrust is memory disaggregation and CXL: multiple works assume an era of composable, CXL-enabled memory and propose new OS, file-system, and transaction-system designs to exploit it. Security remains vibrant – but the focus is shifting from classic OS/network security to hardware-centric and microarchitectural issues like RowHammer, speculative execution, and side-channel-resistant scheduling. Serverless and cloud-native optimizations are also prominent: there is interest in bringing high-performance techniques (GPU sharing, low-latency networking, global scheduling) into serverless and containerized environments, indicating an industry push toward combining performance with the flexibility of cloud abstractions.

Unsolved Problems and Gaps: Despite progress, many papers underscore enduring challenges. Memory-wall and data-movement issues loom large – whether in inter-GPU communication for training or in persistent-memory overheads. Several works (e.g., the AnyKey KV-SSD, the Fusion object store) illustrate that storage is still a bottleneck and that custom solutions are needed to bridge the gap between compute speed and data-access speed. The trade-off between security mitigations and performance remains a tightrope walk: new RowHammer defenses and speculative-execution mitigations must be ultra-lightweight to be adopted. The complexity of managing distributed systems surfaces in work like Coach (multi-resource oversubscription) and Embracing Imbalance – suggesting that as cloud systems scale, automated, fine-grained control is still an open problem, especially ensuring those systems remain robust and stable under all conditions. In architecture, a gap persists between exotic accelerators (e.g., analog in-memory computing, CGRAs) and full software integration – papers like “Be CIM or Be Memory” and PartIR work to fill it, but making such hardware easy to program is clearly still in progress.

Surprising Challenges Surfaced: One surprise is how critical system-level orchestration has become in non-traditional areas: e.g., the side-channel work on scheduling reveals the scheduler can inadvertently aid attacks, a challenge that OS designers didn’t originally anticipate. Another striking theme is the challenge of observability in modern systems – not just logging, but doing so without perturbing the system (EXIST’s ultra-low overhead tracing, BTrace on devices). It appears that simply seeing what our complex systems are doing (especially microservices and edge devices) is hard, and researchers are treating observability as a first-class problem. Consistency and correctness in new contexts came up too: e.g., fuzzing persistent memory programs – something classical fuzzing never dealt with – showing that once we add persistence or heterogeneity, even basic development tools need rethinking.

Another theme is formal verification creeping into systems: ElasticMiter’s formal circuit rewrite and AMuLeT’s formal spec testing show that even traditionally pragmatic fields (hardware design, architecture) are accepting formal methods to tame complexity – a notable shift in methodology, likely driven by industry’s need for reliability (especially in security and safety-critical arenas).

Shifts in Assumptions and Design Principles: A subtle but important shift is the assumption that “software will have to work around hardware limits, and hardware will expose more to software.” For example, many works assume hardware will provide new hooks (counters per DRAM row, or CXL’s huge address space) and that software must intelligently exploit them. The line between components is blurring: kernel and application boundaries (as seen with user-driven preemption timing attacks or user-level thread scheduling in Coach) are being reexamined, meaning systems are increasingly co-designed end-to-end. There’s also a clear recognition that tailoring to context (context-aware optimizations) is key: static, one-size-fits-all solutions are out of favor. Instead, systems now monitor themselves and adapt (be it shifting microservice loads, or adjusting GPU quantization on the fly). This reflects a design principle of adaptivity – the best system is one that can sense and respond to workload changes in real time, often guided by ML or telemetry.

Implications for Future Research and Opportunities: The breadth of ASPLOS 2025’s contributions points to an exciting convergence of fields. Future research will likely explore automation and AI in systems – e.g., using learning to manage resources (some papers already hint at it with learned invariant detection or auto-tuners like DarwinGame). The continued arms race in microarchitectural security will spur more cross-layer ideas (like combining software scheduling and hardware fixes). With CXL and disaggregation maturing, we can expect more research on OS and database techniques for a world where memory and storage are network-attached (the community will need to address consistency, security, and performance in those). The interest in privacy-preserving computation (HE, ZK proofs) indicates a nascent but growing area: making these techniques fast enough for real-world deployment is an open challenge that intersects architecture, PL, and cryptography – future work may include specialized chips for ZK or better compilers for encrypted computing.

The emphasis on observability and correctness suggests that as systems become more complex (distributed, heterogeneous, self-optimizing), we need new fundamentals in debugging and verification – research may produce smarter debuggers or self-healing systems (taking inspiration from papers like dynamic deadlock recovery). Lastly, sustainability and edge computing appear in the intermittent computing works – as computing expands beyond big datacenters into tiny, energy-scavenging devices, research will focus on making algorithms and systems that function under extreme resource intermittency. All told, ASPLOS 2025 paints a picture of systems research that is more interdisciplinary than ever, blending hardware and software, and it lays groundwork that industry can build on to make computing faster, safer, and more adaptable.

Machine Learning and AI Systems

  • Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management (Jinwoo Jeong, Jeongseob Ahn). Proposes a system to speed up large language model (LLM) inference for interactive conversations by dynamically allocating and scheduling GPU memory and compute resources. This allows multi-turn chatbot sessions to run more efficiently, reducing latency and resource waste. Relevance: Improves the responsiveness and cost-efficiency of industry chatbot and virtual assistant services dealing with heavy AI workloads.

  • COMET: Towards Practical W4A4KV4 LLMs Serving (Lian Liu, Long Cheng, Haimeng Ren, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang). Introduces a quantization and serving scheme for LLMs using 4-bit weights, activations, keys, and values to significantly shrink model size. The technique maintains model accuracy while enabling faster inference with limited precision, making large models more practical to deploy. Relevance: Allows industry to serve large AI models (like GPT variants) with lower memory and compute, cutting hardware costs and energy use. (A toy quantization sketch appears after this list.)

  • Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow (Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak). Develops a scheduling framework that partitions LLM inference across multiple GPUs connected by network, modeling the scheduling as a max-flow problem. Helix “stretches” model execution over GPUs of different capabilities by efficiently routing data, achieving low latency. Relevance: Addresses an industry need to serve huge models using existing multi-GPU servers (possibly with dissimilar GPUs), maximizing hardware utilization in data centers.

  • MoE-Lightning: High-Throughput MoE Inference on Memory-Constrained GPUs (Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica). Optimizes inference for Mixture-of-Experts (MoE) models under limited GPU memory by intelligent caching and load-balancing across experts. The system sustains high throughput by activating only needed experts and efficiently managing memory between them. Relevance: Helps deploy large MoE-based AI models (popular in NLP) in production, even on GPUs with modest memory, which is valuable for cloud AI providers.

  • FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models (Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, Xiaowen Chu). Presents a distributed training framework for MoE models that flexibly scales and adapts to different expert sparsity patterns. It balances load among expert submodels and supports dynamic growth of experts, improving training efficiency. Relevance: Addresses training of massive MoE models – a cutting-edge industry practice for efficient deep learning – by reducing communication overhead and training time on multi-node GPU clusters.

  • Cascade: A Dependency-aware Efficient Training Framework for Temporal Graph Neural Network (Yue Dai, Xulong Tang, Youtao Zhang). Introduces Cascade, a training system for temporal GNNs that leverages the observation of dependency patterns in sequential graph data. It reorders and parallelizes computations while respecting temporal dependencies, reducing idle time and speeding up training. Relevance: Beneficial for industries analyzing temporal graph data (social networks, event streams) by cutting training costs and enabling quicker model updates.

  • Frugal: Efficient and Economic Embedding Model Training with Commodity GPUs (Minhui Xie, Shaoxun Zeng, Hao Guo, Shiwei Gao, Youyou Lu). Develops techniques to train large embedding-based models (common in recommendation systems) on commodity GPUs with limited memory. It partitions and offloads embedding tables cleverly between GPU and CPU memory, minimizing data transfer. Relevance: Directly targets industrial recommendation systems (e.g. e-commerce, ads) where training massive embedding tables is costly – offering a way to use cheaper hardware without major performance loss.

  • GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism (Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, Zhihao Jia). Proposes GraphPipe, a system that partitions deep neural network training across GPUs using graph pipeline parallelism. Instead of traditional layer-wise splitting, it treats the training computation graph as a whole, cutting it into pipeline stages to maximize parallel GPU usage and overlap of computation/communication. Relevance: Improves training throughput for very large models on multi-GPU clusters, which is directly useful for tech companies training frontier AI models.

  • PartIR: Composing SPMD Partitioning Strategies for Machine Learning (Sami Alabed, Daniel Belov, Bart Chrzaszcz, Juliana Franco, Dominik Grewe, Dougal Maclaurin, James Molloy, Tom Natan, Tamara Norman, Xiaoyue Pan, Adam Paszke, Norman A. Rink, Michael Schaarschmidt, Timur Sitdikov, Agnieszka Swietlik, Dimitrios Vytiniotis, Joel Wee). Introduces PartIR, an intermediate representation and framework that composes different Single Program, Multiple Data (SPMD) parallelization strategies for ML models. It allows combining data, model, pipeline, and tensor parallelism in a unified way for training large models. Relevance: Helps industry ML compilers (like XLA or PyTorch engines) to automatically apply multiple parallelism forms together, enabling training of enormous models across many devices with minimal manual effort.

  • Nazar: Monitoring and Adapting ML Models on Mobile Devices (Wei Hao, Zixi Wang, Lauren Hong, Lingxiao Li, Nader Karayanni, AnMei Dasbach-Prisk, Chengzhi Mao, Junfeng Yang, Asaf Cidon). Presents a system for continuous monitoring of on-device ML model performance and runtime adaptation on smartphones. It detects when a model’s accuracy degrades (e.g., due to data drift or context change) and triggers on-device model updates or specialized processing. Relevance: Supports more reliable AI features in mobile apps (like on-device vision or speech) by ensuring models remain accurate without constant cloud retraining – a practical concern for deploying AI at the edge.

  • CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory (Jiashun Suo, Xiaojian Liao, Limin Xiao, Li Ruan, Jinquan Wang, Xiao Su, Zhisheng Huo). Proposes an inference scheme for Collaboration-of-Experts models (a variant of mixture-of-experts) that judiciously activates subsets of “expert” submodels to fit within limited GPU memory. It loads and evicts experts on the fly and reuses overlapping computations to save memory. Relevance: Useful for deploying complex AI models composed of multiple sub-networks on memory-constrained devices or VMs – enabling high accuracy through model specialization without requiring high-end hardware.

  • DynaX: Sparse Attention Acceleration with Dynamic X% Fine-Grained Structured Pruning (Xiao Xiong, Zhaorui Chen, Yue Liang, Minghao Tian, Jiaxing Shang, Jiang Zhong, Dajiang Liu). Develops an accelerator and compiler approach that performs dynamic structured pruning on Transformer attention matrices. DynaX adaptively prunes a fine-grained percentage of attention heads and tokens at runtime (“X%” pruning) to skip computation on insignificant parts, while hardware support exploits the resulting sparsity. Relevance: Offers a way to speed up Transformer models (common in NLP and vision) in production by intelligently trading off a tiny bit of accuracy for much lower latency – appealing for real-time AI services.
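
A minimal NumPy sketch of the "top-X%" pruning idea described in the DynaX entry above: score the attention matrix, keep only the largest fraction of entries per query row, and renormalize. The keep_ratio parameter, the shapes, and the pruned_attention helper are illustrative assumptions, not the paper's actual kernels or hardware support.

```python
import numpy as np

def pruned_attention(q, k, v, keep_ratio=0.25):
    """Toy top-X% attention: keep only the largest keep_ratio fraction of
    scores in each row, softmax over the survivors, then mix the values."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                            # (n_q, n_k) raw scores
    n_keep = max(1, int(np.ceil(keep_ratio * scores.shape[-1])))
    thresh = np.sort(scores, axis=-1)[:, -n_keep][:, None]   # n_keep-th largest per row
    masked = np.where(scores >= thresh, scores, -np.inf)     # drop the small scores
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(4, 16)), rng.normal(size=(10, 16)), rng.normal(size=(10, 16))
print(pruned_attention(q, k, v, keep_ratio=0.3).shape)  # (4, 16), ~70% of weights skipped
```

A real accelerator would exploit the resulting structured sparsity in hardware; the sketch only shows the selection logic.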
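
The COMET entry earlier in this list revolves around 4-bit weights, activations, and KV-cache entries. As a rough illustration (symmetric per-tensor int4 rounding, which is a stand-in and not COMET's actual W4A4KV4 scheme), quantization and dequantization can be sketched as:

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor quantization to 4-bit integers in [-8, 7]."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int4(w)
err = np.abs(w - dequantize_int4(q, scale)).mean()
# Packed int4 storage would need q.size / 2 bytes versus w.nbytes for fp32.
print(f"mean abs error: {err:.4f}")
```

Production schemes add per-channel or per-group scales and calibration to preserve accuracy, which is where most of the real engineering lives.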

Memory and Storage Systems

  • AnyKey: A Key-Value SSD for All Workload Types (Chanyoung Park, Jungho Lee, Chun-Yi Liu, Kyungtae Kang, Mahmut T. Kandemir, Wonil Choi). Introduces a flash storage device optimized for key–value store operations. Unlike conventional SSDs, AnyKey embeds key–value logic and adaptively handles different access patterns (e.g., small random gets vs. large scans) with custom firmware, improving throughput for diverse workloads. Relevance: Key–value stores back many web services; an SSD that natively supports them can simplify datacenter caching layers and boost performance for databases, caches, and object stores used in industry.

  • ByteFS: System Support for (CXL-based) Memory-Semantic Solid-State Drives (Shaobo Li, Yirui E. Zhou, Hao Ren, Jian Huang). Proposes a new file system and OS support for emerging memory-semantic SSDs connected via CXL (Compute Express Link). These devices blur the line between storage and memory. ByteFS allows programmers to access persistent data on such SSDs with load/store instructions as if it were regular memory, handling consistency and caching under the hood. Relevance: As CXL memory expanders and storage-class memory become available, this work guides industry on how to integrate them for near-memory-speed data access, benefiting high-performance databases and in-memory analytics.

  • EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation (Weigao Su, Vishal Shrivastav). Designs a specialized Ethernet-based network fabric to enable cluster-wide memory disaggregation with extremely low latency. EDM reduces software overhead and protocol latency for remote memory access, using lightweight NIC support and congestion control tuned for small memory accesses. Relevance: Memory disaggregation (accessing RAM across servers) is of great interest in cloud data centers. This fabric could let cloud providers pool memory across machines efficiently, cutting costs by improving utilization and enabling “pay-as-you-go” memory scaling.

  • CTXNL: A Software-Hardware Co-designed Solution for Efficient CXL-Based Transaction Processing (Zhao Wang, Yiqi Chen, Cong Li, Yijin Guan, Dimin Niu, Tianchan Guan, Zhaoyang Du, Xingda Wei, Guangyu Sun). Proposes a co-design of database software and hardware controller to exploit CXL-attached memory in transaction processing systems. By offloading certain logging, indexing, and commit operations to a smart memory controller (accessible via CXL) and tailoring the software algorithms accordingly, CTXNL achieves higher throughput for in-memory transactions. Relevance: With CXL memory modules on the horizon, database vendors can draw on this work to build hybrid memory systems where some transaction tasks are accelerated in hardware, improving performance for financial or e-commerce transaction databases.

  • CXLfork: Fast Remote Fork over CXL Fabrics (Chloe Alverti, Stratos Psomadakis, Burak Ocalan, Shashwat Jaiswal, Tianyin Xu, Josep Torrellas). Develops a mechanism called CXLfork that allows a process on one machine to fork a near-identical copy of itself on another machine via a CXL memory fabric. It leverages CXL’s cache-coherent sharing to quickly replicate memory state remotely, greatly speeding up distributed process spawning. Relevance: Useful for cloud providers and distributed systems – e.g., quickly spawning workers or microservices on another server – without the overhead of full serialization. It demonstrates how to use advanced interconnects (CXL) to blur machine boundaries for faster scaling.

  • Aqua: Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains (Abhishek Vijaya Kumar, Gianni Antichi, Rachee Singh). Proposes Aqua, a system that extends GPU memory for large models by using ultra-fast network links to offload oversize model layers to remote memory (or host memory). It uses RDMA/network NIC support to fetch offloaded model chunks just-in-time during LLM inference, minimizing stall time. Relevance: Addresses a common limitation in deploying large AI models – limited GPU memory. Aqua’s approach could be applied in AI serving infrastructure to run bigger models on given hardware by transparently offloading data, rather than requiring expensive GPUs with huge memory.

  • Fusion: An Analytics Object Store Optimized for Query Pushdown (Jianan Lu, Ashwini Raina, Asaf Cidon, Michael J. Freedman). Describes Fusion, an object storage system (for data lakes) that pushes parts of analytical query processing down into the storage layer. It co-designs data layout and storage APIs so that filtering, projections, and aggregations can be done within the storage servers, reducing data shipped to compute. Relevance: In industry data analytics platforms (like Spark or Redshift Spectrum), reducing data movement is key. This work aligns with trends of “smart storage” and can make cloud analytics and big-data queries faster and cheaper by leveraging storage nodes for computation. (A toy pushdown sketch appears after this list.)

  • Medusa: Accelerating Serverless LLM Inference with Materialization (Shaoxun Zeng, Minhui Xie, Shiwei Gao, Youmin Chen, Youyou Lu). Targets serverless platforms and speeds up FaaS (Function as a Service) inference of large models by materializing intermediate results across invocations. Medusa caches and reuses common sub-computations of LLM inference between function executions, reducing redundant work in a stateless serverless environment. Relevance: Bridges large AI models with serverless computing – important for cloud providers offering AI inference as a service. It lets stateless functions handle big models more efficiently, improving scalability and cost of AI inference in serverless architectures.

  • EXIST: Enabling Extremely Efficient Intra-Service Tracing Observability in Datacenters (Xinkai Wang, Xiaofeng Hou, Chao Li, Yuancheng Li, Du Liu, Guoyao Xu, Guodong Yang, Liping Zhang, Yuemin Wu, Xiaopeng Yuan, Quan Chen, Minyi Guo). Proposes a tracing system that captures fine-grained, low-overhead traces inside microservices by using sampling and in-kernel logging mechanisms tuned for data center workloads. EXIST provides high observability of request flows with negligible performance cost by tailoring trace collection to the service’s logic and load patterns. Relevance: Improves on current distributed tracing tools (like Dapper or Jaeger) by drastically cutting overhead. Cloud companies could adopt these techniques to get detailed performance insight into complex microservices in production without slowing them down, aiding in performance tuning and troubleshooting.
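
To make the pushdown idea in the Fusion entry above concrete, here is a toy "storage server" that applies a filter and projection next to the data before anything crosses the network. The StorageServer class, the row format, and the scan signature are invented for illustration and are not Fusion's API.

```python
ROWS = [
    {"user": "a", "country": "US", "amount": 120},
    {"user": "b", "country": "DE", "amount": 35},
    {"user": "c", "country": "US", "amount": 990},
]

class StorageServer:
    """Toy storage node that evaluates filters and projections locally."""

    def __init__(self, rows):
        self.rows = rows

    def scan(self, predicate=None, columns=None):
        for row in self.rows:
            if predicate is None or predicate(row):
                yield {c: row[c] for c in columns} if columns else dict(row)

server = StorageServer(ROWS)
# Without pushdown, the client fetches every row and filters locally;
# with pushdown, only the two matching, projected rows are shipped.
result = list(server.scan(predicate=lambda r: r["country"] == "US",
                          columns=["user", "amount"]))
print(result)   # [{'user': 'a', 'amount': 120}, {'user': 'c', 'amount': 990}]
```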

Cloud and Distributed Systems

  • Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms (Rohan Yadav, Shiv Sundram, Wonchan Lee, Michael Garland, Michael Bauer, Alex Aiken, Fredrik Kjolstad). Presents Coach, a cloud resource scheduler that safely oversubscribes CPU, memory, and GPU resources by exploiting time-varying usage patterns. It monitors an application’s usage peaks and troughs and temporally “packs” multiple workloads on the same resource, using predictive modeling to avoid interference. Relevance: Cloud providers routinely oversubscribe (especially CPUs) for efficiency – Coach extends this to GPUs and memory, using smart time-domain multiplexing. This can improve utilization and reduce costs in data centers, while maintaining QoS for customers. (A toy temporal-packing check appears after this list.)

  • Cooperative Graceful Degradation in Containerized Clouds (Kapil Agrawal, Sangeetha Abdu Jyothi). Proposes a framework for coordinated graceful degradation of cloud services under severe resource pressure or failures. Instead of individual services failing, the system selectively sheds less-critical workload and lowers service quality in a controlled way across the container ecosystem, using signals between services to decide where to cut back. Relevance: Helps maintain availability during overloads or outages – a practical concern for cloud operators. By trading off quality (like slightly lower resolution or slower refresh) rather than completely failing, it improves user experience and reliability of web services and online applications.

  • Copper and Wire: Bridging Expressiveness and Performance for Service Mesh Policies (Divyanshu Saxena, William Zhang, Shankara Pailoor, Isil Dillig, Aditya Akella). Introduces a new framework for service mesh policy enforcement that achieves both high expressiveness (rich, declarative policies) and high performance. It uses a combination of a high-level policy language (Copper) and an efficient low-level enforcement mechanism (Wire) to specify and implement fine-grained networking policies (like access control, routing rules) with minimal overhead. Relevance: Service meshes (e.g., Istio) are widely used in microservices architectures. This work can influence industry tools by allowing operators to write sophisticated traffic policies or security rules without worrying about slowing down the system, thus improving maintainability and performance in cloud-native applications.

  • Embracing Imbalance: Dynamic Load Shifting among Microservice Containers in Shared Clusters (Shutian Luo, Jianxiong Liao, Chenyu Lin, Huanle Xu, Zhi Zhou, Chengzhong Xu). Proposes a runtime that deliberately shifts load between microservices when it detects workload imbalance in a cluster. The system can temporarily slow down or speed up certain less-critical containers (using Linux throttling and resource controls) to free up CPU for others that are overloaded, then shifts back – effectively “lending” capacity across services. Relevance: Addresses the common cloud scenario of multi-tenant clusters where different services experience peak loads at different times. This dynamic balancing act can lead to better overall utilization and more stable latency for high-priority services, which is valuable for cloud platform operators and large-scale web services.

  • Composing Distributed Computations Through Task and Kernel Fusion (Rohan Yadav, Shiv Sundram, Wonchan Lee, Michael Garland, Michael Bauer, Alex Aiken, Fredrik Kjolstad). Proposes compiler and runtime techniques to fuse distributed tasks and GPU kernels across nodes. By combining what would be separate communication steps and compute kernels into larger, joint operations, the system reduces communication overhead and improves data locality in distributed GPU programs. Relevance: Beneficial for high-performance computing and big data frameworks used in industry – e.g., distributed training or graph analytics – by squeezing out network overhead. It makes large-scale computations more efficient and can translate to cost savings and faster results on clusters.

  • Automatic Tracing in Task-Based Runtime Systems (Rohan Yadav, Michael Bauer, David Broman, Michael Garland, Alex Aiken, Fredrik Kjolstad). Integrates automatic execution tracing into task-parallel runtime schedulers. The system logs dependencies and task execution timelines with low overhead, producing traces that help developers analyze performance or debug parallel programs without manually instrumenting code. Relevance: Many industry frameworks (like task schedulers in TensorFlow or oneTBB) lack easy introspection. This contribution provides a way to get detailed traces for performance tuning of parallel applications, which can shorten development cycles and improve the efficiency of complex software like games, simulations, or AI pipelines.

  • Fusion: An Analytics Object Store Optimized for Query Pushdown. (See the Memory and Storage Systems section for the summary.)

  • DarwinGame: Playing Tournaments for Tuning Applications in Noisy Cloud Environments (Rohan Basu Roy, Vijay Gadepally, Devesh Tiwari). Introduces an automated approach to tune application parameters by having configurations “compete” in a tournament-style evaluation under cloud performance variability. Over successive rounds, poorly performing configurations are dropped and winners are perturbed to find even better settings. This approach finds near-optimal configurations (for throughput, latency, etc.) despite unpredictable cloud noise. Relevance: Cloud deployments often require tuning (e.g., thread counts, memory pool sizes) for performance, but consistent benchmarking is hard due to interference. This method gives cloud engineers a way to autotune software in situ, potentially improving service performance without exhaustive manual experimentation. (A toy tournament-tuning sketch appears after this list.)

  • Design and Operation of Shared Machine Learning Clusters on Campus (Kaiqiang Xu, Decang Sun, Hao Wang, Zhenghang Ren, Xinchen Wan, Xudong Liao, Zilong Wang, Junxue Zhang, Kai Chen). Describes the practical experience and system design for a campus-wide shared GPU cluster for ML workloads. It covers scheduling policies, fairness mechanisms, and user interfaces that balance the needs of different research groups and job types (interactive vs. batch). Relevance: University or enterprise ML clusters often face contention among users. The lessons and system solutions from this work (like combining quotas with best-effort usage and providing visibility into cluster use) directly inform how industry R&D or AI teams can manage expensive GPU resources efficiently while keeping users productive.

  • Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity (Cunchi Lv, Xiao Shi, Zhengyu Lei, Jinyue Huang, Wenting Tan, Xiaohui Zheng, Xiaofang Zhao). Proposes a system called Dilu that allows serverless platforms to dynamically scale GPU allocations for deep learning inference functions. It introspects the GPU usage of each serverless function invocation and elastically expands or contracts the GPU share assigned to it on the fly, rather than fixed allocation per function. Relevance: In FaaS (Function-as-a-Service) offerings, providing GPU acceleration is challenging due to unpredictable demand. Dilu’s approach could enable cloud providers to offer “GPU-on-demand” for serverless functions cost-effectively, improving performance for AI inference workloads without leaving GPUs underutilized.

  • Enabling Efficient Mobile Tracing with BTrace (Jiawei Wang, Nian Liu, Arnau Casadevall-Saiz, Yutao Liu, Diogo Behrens, Ming Fu, Ning Jia, Hermann Härtig, Haibo Chen). Introduces BTrace, a lightweight tracing framework for mobile devices that records execution events and interactions efficiently on phones. By using techniques like in-memory ring buffers, selective event filtering, and compression, BTrace can collect rich trace data (for debugging or performance analysis) on Android with minimal overhead and without draining battery. Relevance: Mobile app developers and phone OS vendors often struggle to diagnose issues on real devices. BTrace provides an on-device observability tool, enabling more effective debugging of mobile apps and systems in the field, which can improve reliability and user experience in smartphone software.
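
As noted in the Coach entry above, oversubscription is safe when tenants' peaks do not coincide. A tiny sketch of that temporal-complementarity check, with made-up hourly usage traces (Coach's real predictive models are far richer):

```python
# Two tenants' hourly CPU usage in cores. Their individual peaks sum to 14,
# so peak-based packing would reject co-location on a 10-core machine, but
# the peaks never overlap in time, so hour-by-hour packing accepts it.
tenant_a = [2, 2, 3, 8, 8, 3, 2, 2]     # busy mid-day
tenant_b = [6, 6, 4, 1, 1, 4, 6, 6]     # busy at night
CAPACITY = 10

def can_colocate_by_peaks(a, b, cap):
    return max(a) + max(b) <= cap                        # worst-case reasoning

def can_colocate_temporally(a, b, cap):
    return all(x + y <= cap for x, y in zip(a, b))       # time-aligned reasoning

print(can_colocate_by_peaks(tenant_a, tenant_b, CAPACITY))      # False
print(can_colocate_temporally(tenant_a, tenant_b, CAPACITY))    # True
```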
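
Similarly, the tournament idea behind DarwinGame can be sketched as a loop that races configurations under a noisy measurement, drops the losers, and perturbs the winners. The noisy_runtime objective and the (threads, batch) parameters below are invented for the example.

```python
import random

random.seed(7)

def noisy_runtime(cfg):
    """Pretend benchmark: best at threads=16, batch=64, plus cloud noise."""
    threads, batch = cfg
    true_cost = (threads - 16) ** 2 + 0.05 * (batch - 64) ** 2 + 100
    return true_cost * random.uniform(0.9, 1.3)          # interference jitter

def perturb(cfg):
    threads, batch = cfg
    return (max(1, threads + random.choice([-4, -2, 2, 4])),
            max(8, batch + random.choice([-16, 16])))

population = [(random.randrange(2, 64), random.randrange(8, 256)) for _ in range(16)]
for _ in range(4):                       # elimination rounds
    random.shuffle(population)
    winners = [a if noisy_runtime(a) < noisy_runtime(b) else b
               for a, b in zip(population[0::2], population[1::2])]
    population = winners + [perturb(w) for w in winners]   # refill the bracket

best = min(population, key=lambda c: sum(noisy_runtime(c) for _ in range(5)) / 5)
print("best configuration found:", best)
```

Pairwise matches are more robust to measurement noise than ranking one-shot averages across the whole population, which is the crux of the tournament framing.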

Security, Privacy, and Cryptography

  • AMuLeT: Automated Design-Time Testing of Secure Speculation Countermeasures (Bo Fu, Leo Tenenbaum, David Adler, Assaf Klein, Arpit Gogia, Alaa R. Alameldeen, Marco Guarnieri, Mark Silberstein, Oleksii Oleksenko, Gururaj Saileshwar). Proposes a tool that automatically evaluates CPU designs for vulnerabilities to transient execution (Spectre-like) attacks. AMuLeT symbolically tests whether a given speculation barrier or mitigation actually stops information leaks, before hardware is built. It finds subtle “overlapping” cases where speculation can still bypass protections. Relevance: CPU vendors and researchers get a way to validate security of new speculative execution designs early. Given industry’s ongoing trouble with Spectre/Meltdown, this could improve processor security by catching design flaws (and avoiding costly patches) ahead of fabrication.

  • Marionette: A RowHammer Attack via Row Coupling (Seungmin Baek, Minbok Wi, Seonyong Park, Hwayong Nam, Michael Jaemin Kim, Nam Sung Kim, Jung Ho Ahn). Demonstrates a new RowHammer attack variant that exploits electromagnetic coupling between DRAM rows. Marionette shows how an attacker can indirectly induce bit flips in a victim row by hammering neighbor-of-neighbor rows in a specific pattern (not just immediate neighbors). This bypasses some existing RowHammer mitigations. Relevance: Points out a novel hardware vulnerability relevant to memory manufacturers and cloud providers (who must isolate tenants). It underscores that RowHammer defenses need to consider more complex coupling patterns, informing the design of next-gen DRAM or memory controllers to harden systems against real-world attacks.

  • MOAT: Securely Mitigating Rowhammer with Per-Row Activation Counters (Moinuddin Qureshi, Salman Qazi). Proposes MOAT, a low-cost hardware defense for RowHammer that tracks memory access counts per row and prevents any row from being activated too frequently. By adding tiny counters for each DRAM row (or group of rows) and throttling accesses that exceed a threshold, it blocks RowHammer-induced bit flips without heavily slowing normal memory traffic. Relevance: A practical security enhancement for DRAM – MOAT’s scheme could be adopted by memory controller designers or DIMM manufacturers to protect everything from servers to mobile devices against RowHammer, an attack of ongoing concern in industry (especially cloud and multicore environments). (A toy per-row counter sketch appears after this list.)

  • Controlled Preemption: Amplifying Side-Channel Attacks from Userspace (Yongye Zhu, Boru Chen, Zirui Neil Zhao, Christopher W. Fletcher). Reveals a technique where a malicious userspace program deliberately controls its own preemption timing (via subtle thread priority and yield tricks) to enhance cache side-channel leakage. By aligning preemptions with victim activity, the attacker gets more precise cache snapshots, effectively boosting the resolution of Flush+Reload or Prime+Probe attacks. Relevance: A warning to OS and cloud security – it shows that scheduling policies can inadvertently aid side-channel attacks. This insight can prompt OS developers to harden the scheduler or introduce noise, and cloud providers to better isolate or monitor tenant behavior to prevent such timing manipulation.

  • ClosureX: Compiler Support for Correct Persistent Fuzzing (Rishi Ranjan, Ian Paterson, Matthew Hicks). Introduces compiler techniques to enable fuzz testing of persistent applications (those that crash and recover using persistent memory). ClosureX ensures that after a program with NVM (non-volatile memory) crashes during fuzzing, it restarts in a consistent state for the next test case, avoiding false positives or missed bugs due to leftover state. Relevance: With persistent memory technologies emerging, software bugs in crash-recovery logic are critical (e.g., in databases). This work gives industry developers a tool to more effectively fuzz test such software (like persistent key-value stores), ultimately leading to more reliable storage systems.

  • Affinity-based Optimizations for TFHE on Processing-in-DRAM (Kevin Nam, Heon Hui Jung, Hyunyoung Oh, Yunheung Paek). Speeds up Fully Homomorphic Encryption (FHE) – specifically TFHE, a popular scheme – by offloading key parts of the computation into a Processing-in-Memory (PIM) device. It identifies data “affinities” (which pieces of cipher data are best kept and processed in DRAM vs. on CPU) and modifies the TFHE algorithms to exploit the high bandwidth of PIM for these parts. Relevance: FHE allows computation on encrypted data (useful for privacy in cloud services) but is extremely slow. This approach could drastically accelerate FHE, making privacy-preserving data processing more viable in industry (e.g., encrypted cloud databases or confidential machine learning) by utilizing novel memory hardware.

  • CIPHERMATCH: Accelerating Homomorphic Encryption-Based String Matching via Memory-Efficient Data Packing and In-Flash Processing (Mayank Kabra, Rakesh Nadig, Harshita Gupta, Rahul Bera, Manos Frouzakis, Vamanan Arulchelvan, Yu Liang, Haiyu Mao, Mohammad Sadrosadati, Onur Mutlu). Designs a system for fast privacy-preserving string search over encrypted data using FHE. CIPHERMATCH introduces techniques to pack multiple characters and patterns into single ciphertexts to reduce FHE operations, and even suggests offloading parts of the homomorphic computation to a smart SSD (processing near data). Relevance: Searching encrypted text (like secure malware scanning or private DNA/protein search) is valuable but slow. This work could influence secure data search products by significantly improving throughput, combining algorithmic and hardware advances (near-data processing) to make encrypted search more practical.

  • Accelerating Number Theoretic Transform with Multi-GPU Systems for Efficient Zero Knowledge Proof (Zhuoran Ji, Jianyu Zhao, Peimin Gao, Xiangkai Yin, Lei Ju). Focuses on speeding up the Number Theoretic Transform (NTT), a core routine in zero-knowledge proof systems and lattice cryptography, by parallelizing it across multiple GPUs. The paper provides an optimized multi-GPU NTT algorithm that reduces communication overhead and balances workload, achieving faster proof generation for ZK-SNARKs and related protocols. Relevance: Zero-knowledge proofs are increasingly used in blockchain and privacy applications, but they are compute-heavy. This multi-GPU NTT optimization directly benefits industry projects (e.g., blockchain nodes verifying transactions or privacy-preserving authentication systems) by cutting down proof generation time, allowing these technologies to scale. (A reference NTT sketch appears after this list.)

  • BatchZK: A Fully Pipelined GPU-Accelerated System for Batch Generation of Zero-Knowledge Proofs (Tao Lu, Yuxun Chen, Zonghui Wang, Xiaohang Wang, Wenzhi Chen, Jiaheng Zhang). Develops a system to generate many ZK proofs in parallel on GPUs by pipelining the steps and using every part of the GPU efficiently. BatchZK carefully overlaps witness computation, FFT/NTT, and cryptographic multi-exponentiation across different proofs so that the GPU is never idle, substantially increasing throughput when multiple proofs need to be made. Relevance: In blockchain and secure transactions, often hundreds of ZK proofs must be produced (for rollups, anonymization, etc.). This work lets data centers or crypto networks generate these proofs at high speed, which could reduce latency and costs for privacy-preserving financial systems or verifiable computing services.

  • Cinnamon: A Framework for Scale-Out Encrypted AI (Siddharth Jayashankar, Edward Chen, Tom Tang, Wenting Zheng, Dimitrios Skarlatos). Proposes an end-to-end system (nicknamed Cinnamon) to run neural network inference on data encrypted with homomorphic encryption, by distributing the computation across multiple servers. It addresses performance bottlenecks by smart partitioning of the encrypted model between servers and using parallel homomorphic operations. Relevance: Tackles the challenge of doing AI on sensitive data without decrypting it – e.g., analyzing medical or financial data in the cloud securely. Cinnamon’s techniques could shape future cloud offerings where clients can send encrypted data and get results without exposing the raw data, a significant step for privacy in AI services.
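
A toy model of the per-row counting in the MOAT entry above: a memory-controller-side guard that counts activations per DRAM row and refreshes the physical neighbors once a threshold is crossed. The threshold value and the refresh policy here are illustrative, not MOAT's exact parameters.

```python
from collections import defaultdict

class RowActivationGuard:
    """Toy RowHammer mitigation: per-row activation counters plus a threshold."""

    def __init__(self, threshold=1000):
        self.threshold = threshold
        self.counters = defaultdict(int)
        self.mitigations = 0

    def activate(self, row):
        self.counters[row] += 1
        if self.counters[row] >= self.threshold:
            # Refresh the adjacent rows so accumulated disturbance cannot
            # flip their bits, then reset the aggressor's counter.
            for victim in (row - 1, row + 1):
                self.refresh(victim)
            self.counters[row] = 0
            self.mitigations += 1

    def refresh(self, row):
        pass   # stand-in for issuing a targeted refresh command

guard = RowActivationGuard()
for _ in range(5000):           # an attacker hammering row 42
    guard.activate(42)
print(guard.mitigations)        # 5 mitigations; the victims were refreshed in time
```

The hard part the paper addresses is doing this securely and cheaply in hardware (counter storage, interaction with refresh timing), which the sketch ignores.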
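
The multi-GPU NTT entry above concerns the transform itself; for reference, the standard single-threaded radix-2 NTT over the prime 998244353 (the textbook CPU version, not the paper's GPU kernels) looks like this, with a polynomial-multiplication check:

```python
MOD = 998244353            # NTT-friendly prime: 119 * 2**23 + 1
ROOT = 3                   # primitive root modulo MOD

def ntt(values, invert=False):
    """Iterative radix-2 number theoretic transform modulo MOD."""
    a = list(values)
    n = len(a)
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    j = 0
    for i in range(1, n):              # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                 # butterfly passes
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a

def poly_mul(p, q):
    """Multiply two polynomials via NTT, coefficients taken modulo MOD."""
    n = 1
    while n < len(p) + len(q) - 1:
        n <<= 1
    fp, fq = ntt(p + [0] * (n - len(p))), ntt(q + [0] * (n - len(q)))
    prod = ntt([x * y % MOD for x, y in zip(fp, fq)], invert=True)
    return prod[:len(p) + len(q) - 1]

assert poly_mul([1, 2, 3], [4, 5, 6]) == [4, 13, 28, 27, 18]
```

The butterfly loops are what the multi-GPU work partitions and overlaps with communication; ZK-SNARK workloads typically also use much larger prime fields than this toy modulus.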

Programming Languages and Compilation

  • Exo 2: Growing a Scheduling Language (Yuka Ikarashi, Kevin Qian, Samir Droubi, Alex Reinking, Gilbert L. Bernstein, Jonathan Ragan-Kelley). Presents Exo 2, an evolution of the Exo language that allows developers to write performance-critical kernels in a high-level form and then gradually grow/extend the language with new scheduling constructs. Essentially, instead of a fixed set of scheduling primitives (like in Halide or TVM), Exo 2 lets expert users add domain-specific scheduling transformations as if they were adding new syntax. Relevance: This empowers high-performance computing developers (GPU kernel writers, DSP engineers) to more easily teach the compiler new optimizations without modifying the compiler core. It can accelerate adoption of DSLs for performance in industry, where one often needs custom optimizations – Exo 2 provides a principled way to do that by extending the scheduling language itself.

  • Faster Chaitin-like Register Allocation via Grammatical Decompositions of Control-Flow Graphs (Xuran Cai, Amir K. Goharshady, S. Hitarth, Chun Kit Lam). Improves the classical Chaitin graph-coloring register allocator by using a novel graph decomposition technique. The compiler breaks the program’s control-flow graph into components (using grammar-based splitting), allocates registers within each independently, and then composes the solutions. This yields global register allocation of near-Chaitin quality in much less time. Relevance: Compiler optimizations directly impact application performance. This faster register allocation could be incorporated into production compilers (GCC/LLVM), benefiting industries that rely on compiled languages by slightly speeding up compilation or enabling more aggressive optimization without hitting time limits – particularly useful for large codebases or JIT compilation scenarios. (A basic graph-coloring sketch appears after this list.)

  • Optimizing Datalog for the GPU (Yihao Sun, Ahmedur Rahman Shovon, Thomas Gilray, Sidharth Kumar, Kristopher K. Micinski). Adapts the Datalog logic programming language (often used for static analysis and declarative queries) to run efficiently on GPUs. The paper introduces a transformation of Datalog’s evaluation (usually iterative and recursive) into a form amenable to GPU parallelism, including methods to handle joins and fixpoint iteration in a massively parallel way. Relevance: This work bridges high-level declarative programming with low-level parallel hardware. For industries dealing with large graph analyses or security rule evaluation (common uses of Datalog) on big data, being able to run these on GPUs can yield significant speed-ups, enabling near-real-time analysis that wasn’t feasible before. (A semi-naive evaluation sketch appears after this list.)

  • ElasticMiter: Formally Verified Dataflow Circuit Rewrites (Ayatallah Elakhras, Jiahui Xu, Martin Erhart, Paolo Ienne, Lana Josipovic). Introduces a framework that uses formal methods to automatically verify and apply algebraic transformations on dataflow circuits (like high-level hardware designs). ElasticMiter specifically targets optimizations such as re-timing, pipelining, or redundancy elimination and ensures they preserve functional correctness by construction. Relevance: Hardware design and high-level synthesis in industry require confidence in aggressive optimizations. A tool that guarantees an optimization is semantics-preserving removes a lot of guesswork and manual verification. This can shorten hardware development cycles and increase trust in automated circuit optimization, benefiting chip designers and FPGA tool flows.

  • H-Houdini: Scalable Invariant Learning (Sushant Dinesh, Yongye Zhu, Christopher W. Fletcher). Proposes H-Houdini, a system to automatically infer likely invariants (properties that hold true) in concurrent or complex software by analyzing execution histories and using learning techniques. It addresses scalability by clever sampling of program states and by leveraging hardware performance counters to guide the search for invariants. Relevance: Invariant detection is useful for program verification and debugging. By scaling it up, this work helps in automatically checking correctness of concurrent systems (e.g., to find potential concurrency bugs or security property violations) which is valuable for industries like aerospace, automotive, or any area where formal correctness of software is paramount and manual invariant discovery is hard.

  • CRUSH: A Credit-Based Approach for Functional Unit Sharing in Dynamically Scheduled HLS (Jiahui Xu, Lana Josipovic). Proposes a scheduling algorithm for High-Level Synthesis (HLS) that uses a credit-based scheme to share functional units (like adders, multipliers) among operations at runtime. Instead of rigidly binding operations to hardware units, CRUSH dynamically allocates units based on credits to operations, allowing more flexible reuse and higher utilization. Relevance: Improves the quality of hardware circuits generated from high-level code by reducing the area (hardware resources) without sacrificing performance. This means industry engineers can get more efficient FPGA or ASIC designs out of their C/C++ or MATLAB codes – an important factor for reducing cost and power in devices like smartphones, IoT, or automotive electronics.
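
For the register-allocation entry above, the baseline being accelerated is Chaitin-style graph coloring; a compact simplify/select pass over an interference graph with k physical registers is sketched below (a classroom-level version, without the paper's grammatical decomposition):

```python
def color_interference_graph(graph, k):
    """Chaitin-style allocation: push low-degree nodes, then color in reverse;
    nodes that cannot receive a register are reported as spills."""
    adj = {v: set(n) for v, n in graph.items()}       # untouched adjacency
    work = {v: set(n) for v, n in graph.items()}      # mutated during simplify
    stack, coloring, spilled = [], {}, set()
    while work:
        # Simplify: prefer a node with fewer than k live neighbours.
        node = next((v for v in work if len(work[v]) < k), None)
        if node is None:
            # Potential spill: optimistically push the highest-degree node.
            node = max(work, key=lambda v: len(work[v]))
        for neigh in work.values():
            neigh.discard(node)
        del work[node]
        stack.append(node)
    while stack:
        # Select: pop in reverse removal order and take the first free register.
        node = stack.pop()
        taken = {coloring[n] for n in adj[node] if n in coloring}
        free = [c for c in range(k) if c not in taken]
        if free:
            coloring[node] = free[0]
        else:
            spilled.add(node)         # would be rewritten to live in memory
    return coloring, spilled

# Five virtual registers competing for three physical registers.
interference = {
    "a": {"b", "c"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d"}, "d": {"b", "c", "e"}, "e": {"d"},
}
print(color_interference_graph(interference, k=3))    # all colored, no spills
```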
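
Likewise, the fixpoint that the Datalog-on-GPU work parallelizes is semi-naive evaluation; here is a sequential sketch for transitive closure, with plain Python sets standing in for GPU join and deduplication kernels:

```python
# Datalog program:  path(x, y) :- edge(x, y).
#                   path(x, z) :- path(x, y), edge(y, z).
edge = {(1, 2), (2, 3), (3, 4), (2, 5)}

path = set(edge)      # the first rule seeds the relation
delta = set(edge)     # semi-naive: only join facts derived in the last round
while delta:
    derived = {(x, z) for (x, y) in delta for (y2, z) in edge if y == y2}
    delta = derived - path            # keep only genuinely new tuples
    path |= delta

print(sorted(path))
# [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4)]
```

A GPU engine would turn the set comprehension into massively parallel relational joins and the deduplication into parallel set difference, which is where the speedups would come from.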

Architectural and Hardware Acceleration

  • ARC: Warp-level Adaptive Atomic Reduction in GPUs to Accelerate Differentiable Rendering (Sankeerth Durvasula, Adrian Zhao, Fan Chen, Ruofan Liang, Pawan K. Sanjaya, Yushi Guan, Christina Giannoula, Nandita Vijaykumar). Proposes a specialized atomic operation for GPU shaders that adapts the reduction strategy at warp level. In differentiable rendering (used in computer vision and graphics, where images are rendered with gradients for optimization), many threads need to atomically accumulate to the same value. ARC dynamically switches between fine-grained atomics and warp-cooperative accumulation to minimize contention and memory traffic. Relevance: Beneficial for advanced graphics and vision applications (like neural rendering, 3D reconstruction) used in gaming or AR/VR industries. It can be integrated into GPU architectures to boost performance of these emerging workloads without requiring manual tuning.

  • Accelerating Number Theoretic Transform with Multi-GPU Systems for Efficient Zero Knowledge Proof. (See the Security, Privacy, and Cryptography section for the summary.)

  • Enhancing CGRA Efficiency Through Aligned Compute and Communication Provisioning (Zhaoying Li, Pranav Dangi, Chenyang Yin, Thilini K. Bandara, Rohan Juneja, Cheng Tan, Zhenyu Bai, Tulika Mitra). Investigates how to better coordinate computation and data movement in Coarse-Grained Reconfigurable Arrays (CGRAs). The authors present a method to align the scheduling of operations with the routing of data across the CGRA fabric, so that functional units are kept busy with minimal stalls. This significantly improves throughput and utilization of CGRA pipelines. Relevance: CGRAs are used in domain-specific accelerators (like DSP chips or mobile SoCs for signal processing). Industry designers can apply these insights to get more performance out of CGRA-based IP blocks, which means more efficient chips for things like 5G, image processing, or AR.

  • D-VSync: Decoupled Rendering and Displaying for Smartphone Graphics (Yuanpei Wu, Dong Du, Chao Xu, Yubin Xia, Ming Fu, Binyu Zang, Haibo Chen). Proposes splitting the traditional GPU rendering pipeline on smartphones into two decoupled phases: one that renders content ahead of time and one that handles the final display synchronization. By decoupling rendering from the screen’s VSync signal, the system can render continuously at high throughput and only block minimally to match the display’s refresh, reducing jank and improving GPU utilization. Relevance: Mobile games and UIs could become smoother and more power-efficient. Phone manufacturers or OS developers could adopt this approach to eliminate frame drops under heavy GPU load, enhancing user experience for graphics-intensive apps.

  • MetaSapiens: Real-Time Neural Rendering with Efficiency-Aware Pruning and Accelerated Foveated Rendering (Weikai Lin, Yu Feng, Yuhao Zhu). Combines two techniques – neural network pruning and foveated rendering – to achieve real-time neural rendering (AI-based rendering of scenes) on edge devices. It prunes the neural network based on perceptual importance (easier in unimportant regions due to foveation) and aggressively reduces computation outside the user’s gaze focus. Relevance: In AR/VR and gaming, neural rendering can produce realistic graphics but is computationally heavy. By leveraging how the human eye sees (high detail only at the fovea), this work makes such rendering feasible on consumer-level GPUs, indicating a path for industry to deliver ultra-realistic graphics on next-gen headsets or phones without hitting performance or battery barriers.

  • Einsum Trees: An Abstraction for Optimizing the Execution of Tensor Expressions (Alexander Breuer, Mark Blacher, Max Engel, Joachim Giesen, Alexander Heinecke, Julien Klaus, Stefan Remke). Introduces a mathematical abstraction (“einsum trees”) to reorganize and optimize complex tensor algebra expressions (like those using Einstein summation notation). By representing the expression as a tree and exploring different parenthesizations and execution orders via the tree structure, the system finds much more efficient ways to perform large tensor operations common in scientific computing and machine learning. Relevance: Many HPC and ML workloads boil down to tensor operations (matrix multiplications, contractions). This abstraction could be integrated into compilers for frameworks (like NumPy, PyTorch, or TensorFlow) to automatically speed up user-written tensor computations, thereby accelerating simulations or training in scientific and deep learning applications. (A contraction-order demo appears after this list.)

  • NTT and BatchZK. (See the Security, Privacy, and Cryptography section for summaries of the GPU acceleration of ZK proofs.)

  • Be CIM or Be Memory: A Dual-mode-aware DNN Compiler for CIM Accelerators (Shixin Zhao, Yuming Li, Bing Li, Yintao He, Mengdi Wang, Yinhe Han, Ying Wang). Presents a compiler for Compute-In-Memory (CIM) neural accelerators that can operate in two modes: as pure memory or as computing units. The compiler analyzes neural network layers and decides which parts to execute in CIM mode (analog computations inside memory arrays) and which to treat in a more conventional digital memory mode, balancing precision and speed. It also inserts transformations to accommodate analog noise. Relevance: As CIM hardware (e.g., analog in-memory matrix multiplication) enters the scene for AI chips, software support is lagging. This work gives chip designers and ML systems engineers a blueprint on how to program such devices effectively – essential for industry to harness CIM’s potential for faster, energy-efficient AI inference.

  • ARC. (See above; ARC for GPU rendering is listed at the top of this section.)

  • Affinity-Optimized TFHE on PIM. (See Security, Privacy, and Cryptography; this work also touches architecture through PIM integration.)
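
To see why the contraction order studied in the Einsum Trees entry above matters, NumPy's einsum_path already reports the cost of different groupings of a chained contraction; a quick demonstration with arbitrary sizes (this only illustrates the cost model, not the paper's tree transformations):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(512, 32))    # indices i, j
B = rng.normal(size=(32, 512))    # indices j, k
C = rng.normal(size=(512, 16))    # indices k, l

# Two trees for the same expression "ij,jk,kl->il":
#   contract B with C first: 32x16 intermediate,   ~0.5M multiply-adds in total
#   contract A with B first: 512x512 intermediate, ~12.6M multiply-adds in total
path, report = np.einsum_path("ij,jk,kl->il", A, B, C, optimize="optimal")
print(report)                                   # chosen order and FLOP estimate

out_naive = np.einsum("ij,jk,kl->il", A, B, C, optimize=False)   # no reordering
out_fast = np.einsum("ij,jk,kl->il", A, B, C, optimize=path)     # cheapest tree
assert np.allclose(out_naive, out_fast)
```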

Emerging Technologies and Applications

  • BQSim: GPU-accelerated Batch Quantum Circuit Simulation using Decision Diagram (Shui Jiang, Yi-Hua Chung, Chih-Chun Chang, Tsung-Yi Ho, Tsung-Wei Huang). Develops a novel approach to simulate quantum circuits in batch mode on GPUs using decision diagrams. BQSim uses an efficient graph-based representation of quantum state (a decision diagram) that can be manipulated in parallel, and processes multiple quantum circuits together to amortize GPU operations. This drastically speeds up simulating many quantum programs or variational algorithm iterations. Relevance: Quantum computing research and software companies need fast circuit simulators for testing and verification (since real quantum hardware is scarce and noisy). BQSim’s techniques can be adopted in quantum software toolchains (like IBM Qiskit or Google Cirq) to accelerate simulations on classical hardware, helping engineers debug and optimize quantum algorithms more quickly.

  • Optimizing Quantum Circuits, Fast and Slow (Amanda Xu, Abtin Molavi, Swamit Tannu, Aws Albarghouthi). This work tackles quantum circuit optimization by splitting it into two phases: a fast heuristic pass and a slow exact or semi-exact pass. The fast phase quickly prunes the search space of circuit transformations using approximations, then the slow phase rigorously finds the best equivalent circuit in that reduced space (for gate count, depth, etc.). This yields near-optimal circuits much faster than brute force. Relevance: Reducing quantum circuit depth and gate count is critical for today’s error-prone quantum hardware. The approach could be integrated into quantum compilers used by industry and academia, enabling more efficient use of quantum processors (meaning higher fidelity results from the same hardware).

  • Data Cache for Intermittent Computing Systems with Non-Volatile Main Memory (Sourav Mohapatra, Vito Kortbeek, Marco A. van Eerden, Jochem Broekhoff, Saad Ahmed, Przemyslaw Pawelczak). Proposes a caching mechanism tailored for intermittently powered devices (e.g., batteryless sensors) that use non-volatile memory (NVM) as main memory. The cache tracks energy state and ensures writes are persisted to NVM frequently, and it uses NVM’s persistence to recover cache state after power loss. This prevents data loss and avoids redundant re-computation when power returns. Relevance: Battery-free IoT devices that scavenge energy (from solar, RF, etc.) are becoming practical. This cache system improves their efficiency and reliability. Companies building sustainable IoT (for agriculture, structural monitoring, etc.) can use these ideas to make devices run complex tasks correctly despite frequent power failures.

  • Energy-aware Scheduling and Input Buffer Overflow Prevention for Energy-harvesting Systems (Harsh Desai, Xinye Wang, Brandon Lucia). Presents a scheduling algorithm for tiny devices that harvest energy (like small sensors) which carefully plans task execution based on current energy level and prevents sensor data loss by managing the input buffer. It predicts when energy will be sufficient to run certain tasks and when to go into deep sleep, and coordinates with incoming data rates so that no data is lost when the device is off. Relevance: This is highly relevant for IoT deployments where devices rely on sporadic energy (solar, kinetic). The scheduling approach can be implemented in firmware or real-time OS of such devices, enabling more complex functionality on energy-harvesting hardware by intelligently handling energy and data – a step toward reliable, battery-less IoT in industry.
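
A stripped-down version of the scheduling loop described in the entry above: the device runs its expensive task only when stored energy suffices, and forces a run when the sensor buffer is about to overflow. All constants and the harvest/arrival models are invented for illustration, not taken from the paper.

```python
import random

random.seed(3)

BUFFER_CAP = 8         # sensor samples the input buffer can hold
PROCESS_COST = 5.0     # energy units needed to process one batch of samples
energy, buffered, dropped, processed = 0.0, 0, 0, 0

for tick in range(200):
    energy += random.uniform(0.0, 1.0)        # sporadic harvest (solar, RF, ...)
    if random.random() < 0.4:                 # a new sample arrives
        if buffered < BUFFER_CAP:
            buffered += 1
        else:
            dropped += 1                      # overflow: data is lost
    # Run the task when energy allows; run it eagerly only with a reserve,
    # but force it when the buffer is about to overflow.
    must_run = buffered >= BUFFER_CAP - 1
    if buffered and energy >= PROCESS_COST and (must_run or energy >= 2 * PROCESS_COST):
        energy -= PROCESS_COST
        processed += buffered
        buffered = 0
    # otherwise sleep this tick and keep accumulating energy

print(f"processed={processed} dropped={dropped} leftover_energy={energy:.1f}")
```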

Observability and Debugging

  • Enabling Efficient Mobile Tracing with BTrace. (See Cloud and Distributed Systems; BTrace was summarized as a mobile observability tool.)

  • EXIST: Enabling Extremely Efficient Intra-Service Tracing Observability in Datacenters. (See Memory and Storage Systems; EXIST was discussed as a lightweight tracing system for microservices.)

  • Dynamic Partial Deadlock Detection and Recovery via Garbage Collection (Georgian-Vlad Saioc, I-Ting A. Lee, Anders Møller, Milind Chabbi). Proposes an approach to detect and resolve deadlocks in programs at runtime by piggybacking on a garbage collector’s traversal of object graphs. As the GC scans memory, it identifies cycles of waiting threads (potential deadlocks) and then breaks the deadlock by aborting or rolling back one thread in the cycle. This partial recovery avoids needing a full restart. Relevance: Concurrent server applications (e.g., in Java or managed languages) can occasionally deadlock in production, which is catastrophic for availability. This technique could be integrated into JVMs or .NET runtimes to automatically cure deadlocks on the fly, improving reliability of enterprise software without developer intervention. (A wait-for-graph sketch appears after this list.)

  • Debugger Toolchain Validation via Cross-Level Debugging (Yibiao Yang, Maolin Sun, Jiangchang Wu, Qingyang Li, Yuming Zhou). Focuses on ensuring that a toolchain’s debugging information (e.g., from compiler to debugger) is correct by performing cross-level debugging checks. It runs specially crafted programs and uses one level of the tool (like source-level debugger) to validate the next level (like machine-level tracer) by comparing their views of program state. The approach finds inconsistencies in how debug info is emitted or interpreted, preventing user-facing debugging errors. Relevance: Development tools in industry (compilers, debuggers, IDEs) rely on accurate debug info. This work provides methods to catch subtle bugs in those tools themselves, meaning developers will face fewer “why is the debugger showing the wrong value” scenarios. Ultimately, it leads to more trustworthy tooling for low-level software development.

  • Embracing Imbalance (Microservice Load Shifting). (See Cloud and Distributed Systems; while primarily about scheduling, it touches on reliability through dynamic adjustment.)

  • Controlled Preemption (Side-Channel Amplification). (See Security, Privacy, and Cryptography; it exploits scheduling for attack, highlighting a cross-domain issue.)

  • H-Houdini (Invariant Learning). (See Programming Languages and Compilation; relevant for debugging concurrent programs by discovering invariants.)

  • ElasticMiter (Verified Rewrites). (See Programming Languages and Compilation; ensures correctness of transformations, aiding debugging of hardware designs.)

  • Automatic Tracing in Task Runtimes. (See Cloud and Distributed Systems; provides built-in tracing for parallel runtimes, aiding performance debugging.)
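
Finally, the wait-for-graph cycle detection behind the deadlock-recovery entry above can be sketched in a few lines; the paper's contribution is piggybacking this traversal on the garbage collector and rolling back a victim safely, neither of which the toy below attempts (the thread names and graph are hypothetical):

```python
def find_deadlock(wait_for):
    """Return one cycle of threads that are all waiting on each other, or None."""
    for start in wait_for:
        seen, node = [], start
        while node in wait_for and node not in seen:
            seen.append(node)
            node = wait_for[node]          # the thread that `node` is blocked on
        if node in seen:                   # walked back into our own path: a cycle
            return seen[seen.index(node):]
    return None

# t1 waits on t2, t2 on t3, t3 on t1; t4 waits on t1 but is outside the cycle.
wait_for = {"t1": "t2", "t2": "t3", "t3": "t1", "t4": "t1"}
cycle = find_deadlock(wait_for)
print("deadlock cycle:", cycle)            # ['t1', 't2', 't3']
print("abort and roll back:", cycle[-1])   # break the cycle by sacrificing one thread
```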
