
EuroSys 2025 Paper Summaries and Analysis

This post offers a detailed examination of papers accepted at EuroSys 2025, one of the premier conferences in computer systems research. I've analyzed over 40 papers spanning AI systems, cloud computing, networking, storage, and security to identify emerging trends, technical breakthroughs, and industry implications. For researchers and practitioners alike, this analysis provides a roadmap of where systems research is heading—highlighting both solved problems and remaining challenges. Each paper is summarized with its key contributions and practical relevance, followed by a synthesis of overarching themes and future research directions.

Emerging Hot Directions

This EuroSys 2025 collection underscores systems for AI/ML as a dominant theme. Many papers address the challenges of training and serving large models – from distributed training optimizations (Mist, MEPipe, HybridFlow) to efficient inference serving (HCache, Pensieve, CacheBlend, T-MAC, SpInfer, DeltaZip). The trend is towards bespoke systems support for LLMs: co-designing scheduling, memory management, and compression techniques specifically for giant models. We also see systems using AI techniques within themselves (Deep RL for VM scheduling, learned indexes like LOFT). This cross-pollination (using AI to build systems and vice-versa) is a clear trajectory for future work.

Another hot direction is resource disaggregation and tiering – multiple papers (Deft, Chrono, PET, Adios) tackle the fundamental shifts needed when memory or storage is disaggregated across the network or split into tiers. The community is preparing for a world of CXL-attached memory and pooled resources, ensuring performance doesn’t crumble even as hardware becomes more distributed. Techniques like meticulous hotness tracking (Chrono) and yield-based page fault handling (Adios) illustrate a broad effort to rethink OS services for microsecond-scale networks.

Networking and data center communication remain vibrant, with new congestion control paradigms (Introspective CC, Fork, virtual priorities) and verification tools (Atlas, Marlin) – a response to ever-larger and more complex networks. Notably, there’s an emphasis on predictability and verification: ensuring consistent performance (Introspective CC) and using formal or automated checks (Atlas) to avoid outages. This reflects industry’s need for reliability at scale.

Security and isolation innovations are present throughout: from CKI’s secure containers to Achilles’s TEE consensus and many eBPF-related insights. There’s a clear trend of making high-performance systems secure by design. Hardware features (Intel PKS, SGX) are being creatively used to reduce the overhead of security (CKI, RAKIS), indicating a shift where security is no longer an afterthought but integrated into the system’s core architecture.

Unsolved Problems and Gaps

Despite progress, unsolved challenges emerge. One gap is managing complexity: systems like HybridFlow or Pensieve achieve impressive results but at the cost of intricate designs (multi-controller RLHF, multi-tier caching). Taming this complexity – perhaps via better abstractions or automation – is still an open issue. Similarly, developer-friendly tooling is a gap: e.g., writing correct high-performance eBPF is still hard (as “unstable foundations” suggests), and programming disaggregated memory or tiered storage remains complex. We have point solutions (like eNetSTL or visual kernel tracing), but a general easing of programmability in these domains is needed.

Another unsolved problem is comprehensive performance isolation in shared environments. Various papers (Faro, GPU multiplexing, SpotHedge) tackle isolation in one dimension (CPU, GPU, spot VMs), yet the problem is far from fully solved. A general framework that guarantees SLOs across all resources and levels (CPU, memory, network, storage) remains elusive – current solutions are still domain-specific. This is a ripe area for future cross-domain schedulers or isolation techniques that can handle multi-resource contention holistically.

Verification and correctness for modern systems also present gaps. While Atlas and Seal make strides in network and kernel verification, respectively, we lack equivalent tools for, say, ML systems correctness (ensuring distributed training converges correctly under failures) or disaggregated systems correctness. As systems become more complex (with hardware acceleration, distributed components, etc.), verifying their correctness (both functional and performance) is increasingly challenging. The proceedings hint at verification (Seal, Atlas), but this is likely to grow into a bigger research thrust.

Surprising and Novel Insights

One surprising theme is how many works flip conventional approaches on their head: PET proactively demotes memory before pressure builds, Adios returns to yielding instead of busy-waiting, Faro deliberately “sloppifies” its utility functions for agility, and HybridFlow mixes controller paradigms rather than committing to one. This willingness to relax traditional strictness (in consistency, precision, or roles) in order to gain performance or simplicity is a notable mindset shift. The success of these systems suggests that carefully relaxing constraints (with feedback or hardware support to back them) can yield big wins – a possibly counterintuitive insight for system designers accustomed to rigid correctness or structure.

Another notable challenge mentioned is the “last mile” problem in various guises: e.g., the gap between a log-structured file system and persistent memory (Scatter Logging), or between user expectations of container security and reality (CKI’s motivation). These last-mile issues often require interdisciplinary thinking (combining hardware and software techniques, or merging algorithms with systems). The community is identifying these gaps and addressing them case by case; an open question is how to systematically close such gaps in general.

Across the papers, there’s a clear trend of co-designing with hardware features: be it using sparse tensor cores (Samoyeds, SpInfer), new CPU instructions (CKI with PKS, T-MAC with lookup tables), or specialized NIC capabilities (Pegasus, virtual priority CC). The methodology is to embrace hardware constraints or features rather than abstract them away. This indicates future researchers will likely need even deeper understanding of hardware – and perhaps collaborate more with hardware designers – as we venture into ML accelerators, CXL fabrics, and new non-volatile memories.

We also see a trend of building unified frameworks (AlloyStack for workflows, NeuStream bridging streams and ML, Pegasus unifying local/remote comms). This suggests a move away from siloed designs: instead of having one system for X and another for Y with costly glue between, researchers aim to provide a single solution that covers both seamlessly. This reduces overhead and complexity – a win for both performance and manageability.

Another methodological trend is extensive use of data-driven optimization: Many systems use profiling, ML, or rigorous analysis of traces (e.g., Faro’s prediction, CAPSys’s contention modeling, LOFT’s adaptive learned models, SpotHedge’s statistical approach to instance termination). The classic static heuristics are being replaced by adaptive, data-informed decisions. This aligns with the industry trend of telemetry-driven autoscaling and AIops – systems that can observe and tune themselves in a feedback loop.

Infrastructure and Hardware Assumptions Shifts

The proceedings reflect some shifting assumptions. Where once one might assume all memory is local and homogeneous, now designs assume memory might be remote or tiered (and plan accordingly). There’s an implicit assumption in many papers that latency is a multi-scale problem – nanoseconds (on-chip) to microseconds (RDMA/NVMe) to milliseconds (network/storage). Systems are being built to operate effectively at microsecond granularity, which was previously the domain of specialized HPC but is now mainstream with NVMe and RDMA. This requires rethinking blocking vs. spinning (Adios) and OS scheduling (unithreads).

Hardware-wise, it’s evident that GPUs and accelerators are first-class citizens in system design now. Many works treat GPU scheduling, sharing, and GPU-specific ops (sparsity cores, etc.) as a core part of the system, not an add-on. The presence of two-level scheduling (GPU within CPU scheduling) and handling of GPU memory as a scarce resource (in LLM serving caches, etc.) shows an assumption that heterogeneous computing is here to stay. Future infrastructures will likely consider CPU, GPU, and other accelerators in a unified resource model – a challenge raised by these works.

Advice for Future Research

Researchers should note the momentum in cross-stack optimization – breakthroughs often came from working across traditional boundaries, whether using hardware features in the OS (CKI), combining networking and OS design (Pegasus), or marrying ML and database techniques (learned indexes). The ability to traverse multiple layers (hardware, OS, runtime, application) to find global optima is increasingly valuable. Following the data-driven approach is also fruitful: systems that leverage real patterns (Faro’s traces, Seal’s mining of patches, BINGO’s analysis of graph changes) tend to achieve robust improvements by tailoring to reality rather than to the worst case alone.

Another takeaway is the emphasis on efficiency and correctness/security together. Many papers manage to improve performance while also improving isolation or correctness (CKI doesn’t slow down containers much, Achilles improves BFT efficiency and security, PET improves memory use without performance loss, etc.). This suggests future work can’t sacrifice one for the other easily – the bar is to achieve both. Research into new hardware (like secure enclaves, or programmable NICs) combined with clever system software can yield such win-wins.

Finally, a notable shift is the focus on systems scalability in non-traditional dimensions: not just more nodes, but more models (DeltaZip for many models), more workflows (AlloyStack), more eBPF programs (eNetSTL). The trend is to support “more of everything” – and do so automatically and robustly. Future researchers can build on this by designing systems that gracefully scale in problem dimensionality (models, functions, rules, etc.) using techniques like compression, caching, or parallelism, as exemplified in these papers.

Overall, EuroSys 2025’s papers illustrate a landscape where AI-centric systems, disaggregated architectures, and secure, autonomous management are at the forefront. Embracing these trends and addressing the open challenges – from simplifying complex systems to verifying their behavior – will define the next milestones for systems research. Each category of work here provides stepping stones for future exploration, whether it’s scaling ever-larger AI models, making cloud infrastructure more efficient and reliable, or leveraging new hardware to its fullest potential. The clear message is that holistic, cross-layer design and proactive, intelligent adaptation are key to building the next generation of computer systems.

AI and Machine Learning Systems

  • Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization. Zhanda Zhu, Christina Giannoula, Muralidhar Andoorveedu, Qidong Su, Karttikeya Mangalam, Bojian Zheng, Gennady Pekhimenko. This paper introduces Mist, a distributed training system that co-optimizes memory reduction techniques and parallelism strategies for large language model (LLM) training. Mist orchestrates data, tensor, and pipeline parallelism together with memory-saving optimizations (like activation checkpointing and offloading) through overlap-centric scheduling and symbolic performance modeling. By hierarchically tuning for workload imbalance, Mist finds efficient configurations automatically, achieving up to 1.73× faster training than Megatron-LM (manual baseline) and up to 2.04× speedup over prior automation (Aceso). Industry Relevance: Mist addresses the cost of training giant models by enabling organizations to train LLMs on limited hardware more efficiently, which is crucial for industry labs seeking to reduce GPU memory and time requirements in LLM training.
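
To make the co-optimization concrete, here is a deliberately tiny sketch of that kind of joint search: enumerate parallelism degrees and memory-saving switches, reject configurations that do not fit in GPU memory, and keep the one with the best modeled step time. The cost and memory formulas below are invented placeholders, not Mist's actual symbolic model.

```python
# Toy joint search over parallelism and memory optimizations (illustrative
# only; the cost and memory formulas are invented placeholders).
from itertools import product

GPUS, GPU_MEM_GB, MODEL_MEM_GB, STEP_COST = 16, 40, 320, 100.0

def step_time(dp, tp, pp, ckpt, offload):
    t = STEP_COST / (dp * tp * pp)               # ideal scaling
    t *= 1 + 0.05 * (tp - 1) + 0.02 * (pp - 1)   # comm / pipeline-bubble penalty
    if ckpt:    t *= 1.25                        # recompute overhead
    if offload: t *= 1.40                        # PCIe transfer overhead
    return t

def mem_per_gpu(dp, tp, pp, ckpt, offload):
    weights = MODEL_MEM_GB / (tp * pp)           # weights + optimizer state
    act = 8.0 / (tp * pp)                        # activations (toy number)
    if ckpt:    act *= 0.3
    if offload: weights *= 0.5
    return weights + act

best = None
for dp, tp, pp in product([1, 2, 4, 8, 16], repeat=3):
    if dp * tp * pp != GPUS:
        continue
    for ckpt, offload in product([False, True], repeat=2):
        if mem_per_gpu(dp, tp, pp, ckpt, offload) > GPU_MEM_GB:
            continue                             # does not fit on the GPU
        t = step_time(dp, tp, pp, ckpt, offload)
        if best is None or t < best[0]:
            best = (t, dict(dp=dp, tp=tp, pp=pp, ckpt=ckpt, offload=offload))

print("best config:", best)
```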

  • MEPipe: Democratizing LLM Training with Memory-Efficient Slice-Level Pipeline Scheduling on Cost-Effective Accelerators. Zhenbo Sun, Shengqi Chen, Yuanwei Wang, Jian Sha, Guanyu Feng, Wenguang Chen. MEPipe proposes a novel “slice-level” pipeline parallelism to train large models on affordable GPUs (e.g. RTX 4090) that have limited memory. It partitions each training batch into finer slices along the sequence length, carefully scheduling forward and backward passes to overlap computation and reduce memory usage. MEPipe’s design minimizes activation memory and avoids costly inter-GPU communication by using fine-grained weight gradient computation and sequence-level pipelining. In experiments on LLaMA models, MEPipe attains up to 1.68× training speedup (1.35× on average) on 4090 GPU clusters and significantly improves cost-effectiveness (2.5× more cost-efficient than A100-based clusters). Industry Relevance: MEPipe helps “democratize” LLM training by allowing companies or groups with cheaper GPU hardware to train large models efficiently, lowering the barrier for startups or academia to experiment with big models.

  • HybridFlow: A Flexible and Efficient RLHF Framework. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu. This work targets Reinforcement Learning from Human Feedback (RLHF) pipelines used to align LLMs. HybridFlow introduces a hybrid control paradigm that combines a single centralized controller with multiple distributed controllers for different phases of RLHF. It provides hierarchical APIs to decouple the complex workflow (which includes distributed policy training and generation steps) and an optimized “3D-HybridEngine” for model resharding between training and inference phases. By eliminating redundant coordination and enabling flexible execution of RLHF’s interwoven tasks, HybridFlow achieves 1.5× to 20× throughput improvements on various RLHF algorithms compared to existing frameworks. Industry Relevance: RLHF is crucial for fine-tuning AI assistants (as done at OpenAI, Anthropic, etc.). HybridFlow’s efficient orchestration can speed up alignment training for large models, benefitting industry teams that need to frequently retrain models with human feedback.

  • Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores. Chenpeng Wu, Qiqi Gu, Heng Shi, Jianguo Yao, Haibing Guan. This paper focuses on Mixture-of-Experts (MoE) LLMs, which scale up model capacity while activating only a few experts per input. Samoyeds introduces an acceleration framework that exploits structured sparsity in both model parameters and activations. By defining a custom sparse data format and developing a specialized sparse–sparse matrix multiply kernel for NVIDIA’s Sparse Tensor Cores, Samoyeds can execute both weights and activations in sparse form. It also applies system-level optimizations to coordinate two-sided sparsity throughout the MoE execution. Experiments show up to 1.99× kernel-level speedup and 1.58× end-to-end speedup on MoE models, along with substantial memory savings (enabling 4.4× larger batch sizes). Industry Relevance: For companies deploying massive MoE models (e.g. for machine translation or recommendation), Samoyeds provides a way to leverage upcoming hardware support for sparsity to get faster inference and training without sacrificing model quality.

  • Fast State Restoration in LLM Serving with HCache. Shiwei Gao, Youmin Chen, Jiwu Shu. HCache addresses a bottleneck in LLM inference: efficiently restoring cached conversational context (key–value attention states) for multi-turn dialogues and retrieval-augmented generation. Traditional systems either recompute dropped context from scratch or swap it to slow storage, incurring latency on cache miss. HCache instead restores LLM key–value states from intermediate transformer activations, using a “bubble-free” scheduler to overlap recomputation and I/O and a chunk-based storage manager to align data layout. This reduces “time-to-first-token” latency by up to 1.93× compared to pure KV offloading, while using only ~40–50% of the storage space. Versus fully recomputing from inputs, HCache cuts restoration time by up to 5.7×. Industry Relevance: For latency-critical LLM applications (chatbots, interactive agents), HCache can dramatically speed up context switching or long-dialogue handling. This is directly relevant to cloud providers and API services striving to serve LLMs with low latency and high throughput.

  • Stateful Large Language Model Serving with Pensieve. Lingfan Yu, Jinkun Lin, Jinyang Li. Pensieve is a system for multi-turn conversational LLM services that avoids re-processing the entire dialogue history on each turn. It maintains state across requests by caching previously encoded tokens’ key–value pairs in a multi-tier cache (using GPU memory for recent history and CPU memory for older, larger context). Pensieve also extends a recent paged-attention mechanism to allow cached attention over non-contiguous memory segments. By eliminating redundant computation of prior turns, Pensieve achieves 1.14×–3.0× higher throughput and significantly lower latency compared to stateless serving with frameworks like vLLM. Industry Relevance: Pensieve’s approach is valuable for chatbot and assistant services – it improves efficiency when users engage in multi-turn conversations. This translates to cost savings and better user experience for industry players deploying chat-based LLM interfaces.
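
A minimal sketch of the multi-tier caching idea: keep recently used sessions' KV state in a small fast tier (standing in for GPU memory) and demote older sessions to a larger slow tier (standing in for CPU memory), promoting them back on reuse. The class and capacity numbers are illustrative, not Pensieve's actual interfaces.

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Toy two-tier cache: a small fast tier (stand-in for GPU memory)
    backed by a larger slow tier (stand-in for host memory)."""
    def __init__(self, fast_capacity=4):
        self.fast = OrderedDict()     # session_id -> cached KV state
        self.slow = {}
        self.fast_capacity = fast_capacity

    def put(self, session_id, kv):
        self.fast[session_id] = kv
        self.fast.move_to_end(session_id)
        while len(self.fast) > self.fast_capacity:
            victim, state = self.fast.popitem(last=False)   # evict LRU entry
            self.slow[victim] = state                       # demote, don't drop

    def get(self, session_id):
        if session_id in self.fast:
            self.fast.move_to_end(session_id)
            return self.fast[session_id]
        if session_id in self.slow:                         # promote on reuse
            kv = self.slow.pop(session_id)
            self.put(session_id, kv)
            return kv
        return None    # true miss: caller must re-encode the dialogue history

cache = TwoTierKVCache()
for turn in range(6):
    cache.put(f"user-{turn}", kv=f"kv-state-{turn}")
print(cache.get("user-0"))   # served from the slow tier, then promoted
```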

  • CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang. CacheBlend tackles LLM serving in retrieval-augmented generation (RAG) scenarios, where multiple retrieved text chunks are prepended as context. It observes that naively reusing cached key–value states of common context chunks is hard when they appear in the middle of an input rather than strictly as a prefix. CacheBlend introduces a technique to fuse precomputed caches from arbitrary text chunks: it selectively recomputes a small subset of tokens at chunk boundaries to correct for cross-chunk attention, while reusing the majority of cached states. This yields the same output quality as full recomputation but substantially speeds up the “prefill” stage of LLM inference. Industry Relevance: Many production LLM applications (enterprise Q&A, search assistants) rely on retrieved documents as context. CacheBlend’s method allows faster LLM responses in RAG pipelines by reusing past computation, directly benefiting such industry use-cases with improved throughput.

  • T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge. Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang. T-MAC proposes a hardware/software co-design that enables efficient LLM inference on CPUs by eliminating expensive multiplications on quantized models. The key idea is to convert the matrix multiply of low-precision weights and high-precision activations into a series of bit-wise table lookups. T-MAC’s lookup tables precompute partial results for combinations of low-bit weight bits and activation values, thus performing mixed-precision GEMM without on-the-fly dequantization or multipliers. This yields a unified and linear-scaling kernel with respect to weight bit-width. Experiments on quantized LLaMA and BitNet models show up to 4× higher throughput and ~70% less energy use than the highly optimized llama.cpp on CPUs. Impressively, on an Apple M2 Ultra chip, T-MAC generates 30 tokens/sec (single core) and 71 tokens/sec (8 cores) for a 3B model – even a Raspberry Pi 5 can achieve 11 tokens/sec. Industry Relevance: T-MAC enables edge deployment of LLMs by dramatically improving CPU inference speed. This is important for mobile and IoT scenarios where GPUs are absent, and for companies aiming to run large models on consumer devices or in CPU-only cloud instances for cost savings.
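
The table-lookup trick is easiest to see for 1-bit weights: precompute, for every group of a few activations, the partial sums for all possible weight-bit patterns, then answer each output row with lookups and additions only. The sketch below (plain NumPy, group size of 4) is a toy version of that idea, not T-MAC's mixed-precision kernels.

```python
# Toy multiplication-free matrix-vector product for 1-bit weights via
# lookup tables over groups of 4 activations.
import numpy as np

G = 4  # activations per lookup group

def build_tables(x):
    """For each group of G activations, precompute the partial sum for
    every possible G-bit weight pattern (2**G table entries per group)."""
    x = x.reshape(-1, G)                                   # (num_groups, G)
    patterns = np.array([[(p >> i) & 1 for i in range(G)]
                         for p in range(1 << G)], dtype=x.dtype)  # (16, G)
    return x @ patterns.T                                  # (num_groups, 16)

def lut_matvec(w_bits, x):
    """w_bits: (rows, cols) binary weight matrix; computes w_bits @ x
    using only table lookups and additions."""
    tables = build_tables(x)
    rows, cols = w_bits.shape
    w_groups = w_bits.reshape(rows, cols // G, G)
    # pack each group of G weight bits into a table index
    idx = (w_groups * (1 << np.arange(G))).sum(axis=2)     # (rows, num_groups)
    return tables[np.arange(cols // G), idx].sum(axis=1)

w = np.random.randint(0, 2, size=(8, 16))
x = np.random.randn(16).astype(np.float32)
assert np.allclose(lut_matvec(w, x), w @ x, atol=1e-4)
```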

  • DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs. Xiaozhe Yao, Qinghao Hu, Ana Klimovic. DeltaZip addresses the scenario of serving many fine-tuned versions of an LLM (each fine-tuned with full model parameter updates). Deploying each model separately is memory-intensive and redundant, since fine-tuning typically results in only small weight differences from the base model. DeltaZip introduces a system that compresses and serves only the model “deltas” (the weight changes) on top of a shared base model. With a specialized compression algorithm co-designed with the serving runtime, DeltaZip can shrink fine-tune weight updates by up to 10× while preserving accuracy. At query time, it efficiently applies the compressed delta to the base model weights on the fly. The result is a multi-tenant LLM serving framework that improves throughput by 2× to 12× compared to naive per-model deployment. Industry Relevance: Modern AI services often host numerous domain-specific variants of a base model (for different customers or tasks). DeltaZip dramatically reduces memory and compute overhead for multi-model serving, which is highly relevant for cloud providers or enterprises running dozens of fine-tuned LLMs concurrently.
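
The delta idea itself is simple to sketch: store the base weights once, keep each fine-tune as a compressed difference, and reassemble on demand. The top-k sparsification below is a crude stand-in for DeltaZip's actual compression algorithm, and the function names are made up.

```python
# Toy weight-delta compression: one shared base model, per-fine-tune deltas.
import numpy as np

def compress_delta(base, finetuned, keep_frac=0.1):
    """Keep only the largest-magnitude delta entries (a stand-in for a real
    compressor), stored as sparse (indices, values, shape)."""
    delta = (finetuned - base).ravel()
    k = max(1, int(keep_frac * delta.size))
    idx = np.argsort(np.abs(delta))[-k:]
    return idx, delta[idx], finetuned.shape

def apply_delta(base, packed):
    """Reconstruct an approximate fine-tuned weight matrix on the fly."""
    idx, vals, shape = packed
    w = base.ravel().copy()
    w[idx] += vals
    return w.reshape(shape)

base = np.random.randn(256, 256).astype(np.float32)
fine = base + 0.01 * np.random.randn(256, 256).astype(np.float32)
packed = compress_delta(base, fine)
approx = apply_delta(base, packed)
print("stored values:", packed[1].size, "of", fine.size)
print("max reconstruction error:", float(np.abs(approx - fine).max()))
```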

  • SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs. Ruibo Fan, Xiangrui Yu, Peijie Dong, Zeyu Li, Gu Gong, Qiang Wang, Wei Wang, Xiaowen Chu. SpInfer is a GPU inference framework for sparsified LLMs. It exploits low-level sparsity patterns (e.g. zero values in weight matrices) to accelerate computation on modern GPUs. The framework introduces GPU Tensor Core–friendly algorithms to skip operations on zeros, and likely integrates with hardware intrinsics for structured sparsity. By customizing the CUDA kernels for sparse matrix multiplication and memory access, SpInfer achieves significant speedups on models sparsified via pruning. The system is high-performance and general, making sparse LLM inference more practical. Industry Relevance: As companies seek to compress large models (prune unimportant weights) to save cost, SpInfer enables them to actually realize latency and throughput gains from those sparse models on existing GPU hardware. It thus helps translate model compression into production performance improvements.

  • Flex: Fast, Accurate DNN Inference on Low-Cost Edges Using Heterogeneous Accelerator Execution. Tanmoy Sen, Haiying Shen, Anand Iyer. Flex targets resource-constrained edge devices (e.g. embedded boards) that have heterogeneous accelerators (CPU, small GPU, DSP, etc.). It presents a scheduler that splits and maps parts of a deep neural network to the most suitable hardware units in parallel, rather than running the whole model on a single accelerator. By doing fine-grained partitioning and handling data movement efficiently, Flex achieves low inference latency without needing expensive hardware. Industry Relevance: IoT and edge applications (smart cameras, drones, etc.) often need to run DNNs but cannot afford power-hungry GPUs. Flex allows them to utilize all available chips (GPU/CPU/NPU) to meet real-time requirements, expanding AI capabilities in edge devices with minimal cost.

  • Hourglass: Enabling Efficient Split Federated Learning with Data Parallelism. Qiang He, Kaibin Wang, Zeqian Dong, Liang Yuan, Feifei Chen, Hai Jin, Yun Yang. Hourglass improves federated learning (FL) in scenarios where models are split between edge and server (split learning) and also distributed across multiple devices (data parallelism). It introduces an efficient protocol to coordinate these two forms of parallelism: many clients jointly train a model where the lower layers are on clients and upper layers on a server. By carefully scheduling the synchronization of client-side updates and server-side model aggregation, Hourglass reduces idle time and communication overhead. This leads to faster convergence and less strain on any single client or server. Industry Relevance: For cross-device FL (e.g. training on smartphones) or cross-silo FL (hospitals training on sensitive data), Hourglass enables larger models to be trained collaboratively without compromising on speed, by splitting network layers and aggregating updates efficiently – beneficial for industries like healthcare or mobile AI.

Cloud and Systems Management

  • A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro. Beomyeol Jeon, Chen Wang, Diana Arroyo, Alaa Youssef, Indranil Gupta. This paper presents Faro, a system to manage a private ML inference cluster (on-premises cloud) under dynamic loads while meeting latency Service Level Objectives (SLOs). Faro takes high-level latency SLOs per application and converts them into utility functions, then “sloppifies” these utilities (i.e. relaxes precision) for tractable optimization. It uses probabilistic workload prediction and continuously reallocates resources across models to maximize total utility or fairness. Uniquely, Faro deliberately trades some precision for agility – simplifying models of cluster behavior to react quickly to load spikes. In a Kubernetes + Ray Serve testbed, Faro cut SLO violations by 2.3×–23× compared to state-of-the-art systems. Industry Relevance: Many enterprises run multiple ML services on shared clusters. Faro’s approach helps them maintain QoS (e.g. 99th percentile latency) for all models even as demand fluctuates, which is critical for user-facing applications and efficient cluster utilization in industry.
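
As a rough illustration of utility-driven allocation, the sketch below turns each model's predicted load into a simple "fraction of traffic served within SLO" utility and hands out replicas greedily to whichever model gains the most. Faro's real optimizer is predictive and deliberately relaxed; the capacity model, model names, and numbers here are invented.

```python
# Toy SLO-utility allocator (illustrative only).
def utility(predicted_rps, replicas, capacity_rps=100.0):
    """Fraction of predicted load a model can serve within its SLO,
    assuming each replica handles capacity_rps requests/sec."""
    if predicted_rps == 0:
        return 1.0
    return min(1.0, replicas * capacity_rps / predicted_rps)

def allocate(predictions, total_replicas):
    alloc = {m: 0 for m in predictions}
    for _ in range(total_replicas):
        # give the next replica to the model with the largest utility gain
        best = max(alloc, key=lambda m: utility(predictions[m], alloc[m] + 1)
                                      - utility(predictions[m], alloc[m]))
        alloc[best] += 1
    return alloc

predictions = {"search-ranker": 450.0, "chat-llm": 900.0, "fraud-detector": 120.0}
print(allocate(predictions, total_replicas=12))
```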

  • SpotHedge: Serving AI Models on Spot Instances. Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, Ion Stoica. SpotHedge enables reliable ML inference on spot cloud instances (cheap VMs that can be preempted). It likely employs a redundancy and hedging strategy: launching model replicas on multiple spot VMs in different pools and cleverly overlapping execution so that if one instance is reclaimed, another can take over without missing deadlines. The system probably uses predictive techniques to decide when to proactively migrate or copy state. By hedging across spot markets, SpotHedge achieves on-par tail latency with on-demand instances, but at much lower cost. Industry Relevance: This system is valuable for cost-sensitive deployment of online services (e.g. a startup serving an AI API). It allows use of preemptible VMs (70-90% cheaper) without sacrificing reliability, thereby lowering operating costs for model serving in the cloud.

  • Towards VM Rescheduling Optimization Through Deep Reinforcement Learning. Xianzhong Ding, Yunkai Zhang, Binbin Chen, Donghao Ying, Tieying Zhang, Jianjun Chen, Lei Zhang, Alberto Cerpa, Wan Du. This work treats the VM rescheduling problem (periodically re-packing VMs to hosts to mitigate hotspots and fragmentation) as a learning task. It proposes a Deep Reinforcement Learning agent that observes cluster state and decides which VMs to migrate where, aiming to improve overall resource utilization and performance. Through training on simulations or historical data, the DRL agent learns strategies to reduce contention (like CPU or memory hotspots) better than heuristic policies. Industry Relevance: Large cloud operators regularly perform VM consolidation and load balancing. A DRL-based rescheduler could adapt to complex patterns and achieve higher efficiency (fewer servers, less throttling) than static rules, potentially saving costs in data centers.

  • Eva: Cost-Efficient Cloud-Based Cluster Scheduling. Tzu-Tao Chang, Shivaram Venkataraman. Eva is a cluster scheduler that optimizes for cost efficiency in a cloud setting. Unlike traditional on-prem schedulers that consider makespan or fairness, Eva is cloud-aware: it decides when to use cheaper instance types (or even transient ones) and when to spin down nodes to save money. It likely uses a model of cloud pricing and workload urgency to make scheduling and scaling decisions that minimize dollar cost while meeting job requirements. Industry Relevance: Companies running large workloads on public clouds seek to minimize cloud bills. Eva’s scheduling approach aligns resource allocation with pricing and workload value, directly translating to cost savings for cloud-based big data or AI pipelines.

  • HyperAlloc: Efficient VM Memory De/Inflation via Hypervisor-Shared Page-Frame Allocators. Lars Wrenger, Kenny Albes, Marco Wurps, Christian Dietrich, Daniel Lohmann. HyperAlloc improves the agility of adjusting a running VM’s memory (ballooning) by introducing a shared page-frame allocator between guest and hypervisor. Traditional ballooning is slow and can cause performance hiccups, because the guest OS and hypervisor operate largely independently (leading to redundant moves and waits). HyperAlloc’s co-designed allocator allows the hypervisor to reclaim or grant pages to the VM much more directly and safely. This yields near-immediate VM “memory inflation” or deflation without guest disruption. Industry Relevance: Cloud providers use memory ballooning for overcommitment. HyperAlloc makes this faster and smoother, meaning better VM density and responsiveness – providers can run more VMs per host and rapidly mitigate memory pressure, improving utilization and customer experience.

  • Serverless Cold Starts and Where to Find Them. Artjom Joosen, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Luke Darlow, Jianfeng Wang, Qiwen Deng, Adam Barker. This study takes a deep look at cold start latency in serverless platforms. It likely characterizes the various sources of delay (container initialization, code download, JIT, etc.) across different providers or runtimes. By tracing and measuring production systems, it identifies “where” the time is spent and which factors (e.g. package size, runtime language, VM warm-up) dominate cold start. It may also propose improvements or caching mechanisms to reduce these delays. Industry Relevance: Cold starts degrade user experience and limit serverless use in low-latency apps. Cloud providers (and developers) will benefit from the insights – for instance, this research could guide optimizations in FaaS platforms or best practices (like using certain runtime configurations) to minimize cold-start impact.

  • SeBS-Flow: Benchmarking Serverless Cloud Function Workflows. Larissa Schmid, Marcin Copik, Alexandru Calotoiu, Laurin Brandner, Anne Koziolek, Torsten Hoefler. SeBS-Flow extends serverless benchmarking to function workflows, not just individual functions. It provides a suite of workflow patterns (sequence, parallel, conditionals, etc.) and metrics to evaluate end-to-end performance, failure handling, and cost across cloud providers. By benchmarking workflows, it uncovers overheads in orchestration (e.g. state passing, coordination delays) that are not seen in single-function benchmarks. Industry Relevance: As serverless adoption grows, complex applications involve multiple functions chained together. Cloud users and providers can use SeBS-Flow to identify bottlenecks in their workflow managers (like AWS Step Functions or Azure Durable Functions) and optimize throughput, ensuring that multi-step serverless applications run efficiently.

  • AlloyStack: A Library Operating System for Serverless Workflow Applications. Jianing You, Kang Chen, Laiping Zhao, Yiming Li, Yichi Chen, Yuxuan Du, Yanjie Wang, Luhang Wen, Keyang Hu, Keqiu Li. AlloyStack introduces a LibOS approach to improve performance for serverless workflows. Instead of each function in a workflow running in a separate container/VM with its own OS, AlloyStack provides a unified library OS that can host an entire workflow (multiple functions) within a single sandbox while still enforcing isolation from other workflows. By “alloying” multiple functions together, it avoids the overhead of context-switching and communication between functions through external services. The LibOS likely offers lightweight isolation and fast inter-function calls. Industry Relevance: This is aimed at workflow-heavy serverless apps (e.g. an ETL pipeline of several functions). AlloyStack can significantly cut latency and resource duplication in such apps, making serverless viable for more demanding, tightly-coupled workloads that previously suffered from function-to-function invocation overhead.

  • Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing. Shulai Zhang, Quan Chen, Weihao Cui, Han Zhao, Chunyu Xue, Zhen Zheng, Wei Lin, Minyi Guo. This work deals with GPU sharing in multi-tenant clusters. It identifies that static partitioning or naive time-slicing can leave GPU resources underutilized (creating idle “bubbles”) when workloads don’t perfectly align. The proposed system adaptively adjusts how GPUs are split spatially (SM partitions, memory slices) and temporally (time quanta) to eliminate bubbles in utilization. For example, if one job’s load drops, another can immediately seize the freed GPU capacity. The method likely uses feedback from GPU performance counters to resize partitions or scheduling slices on the fly. Industry Relevance: Cloud platforms offering fractional GPUs or multiple jobs per GPU (common in training and inferencing clusters) can achieve higher aggregate throughput and fairness by adopting such adaptive schemes, directly translating to cost savings and better performance for customers.

  • Multiplexing Dynamic Deep Learning Workloads with SLO-awareness in GPU Clusters. Wenyan Chen, Chengzhi Lu, Huanle Xu, Kejiang Ye, Chengzhong Xu. This paper also tackles multi-tenant GPU cluster scheduling but emphasizes SLO (service level objective) compliance for each job. It introduces a scheduler that dynamically multiplexes DL workloads (with time-varying resource usage) on the same GPU while ensuring each meets its training or inference latency SLO. Likely, it monitors each job’s progress and GPU usage, and when contention arises, it smartly decides which job to prioritize or temporarily throttle, so that no job violates its deadline or throughput target. Industry Relevance: In AI cloud services, guaranteeing performance isolation is crucial – e.g., an inferencing service should not spike latency because a neighbor is doing training. An SLO-aware GPU multiplexing system allows providers to safely increase GPU utilization (run more jobs per GPU) without risking SLA breaches.

  • Moko: Marrying Python with Big Data Systems. Ke Meng, Tao He, Sijie Shen, Lei Wang, Wenyuan Yu, Jingren Zhou. Moko is a system that seamlessly integrates Python’s ease of use with the performance of big data systems. It likely allows developers to write data processing logic in Python while transparently leveraging optimized engines (like Spark, Flink, or C++ backends) under the hood. Moko might use techniques like runtime specialization, intelligent caching, or distributed scheduling of Python UDFs to mitigate the usual slowdown from Python’s interpreter. Essentially, it “marries” Python’s flexibility with efficient execution by the data platform. Industry Relevance: Many data scientists prefer Python, but enterprises need the scale of distributed data systems. Moko lets teams retain Python productivity without sacrificing big data scalability, accelerating development cycles in industry for ETL, analytics, and ML pipelines.

  • Collaborative Text Editing with Eg-walker: Better, Faster, Smaller. Joseph Gentle, Martin Kleppmann. Eg-walker presents a new approach to real-time collaborative text editing, improving on prior techniques like Operational Transform (OT) or Conflict-free Replicated Data Types (CRDTs). “Better, faster, smaller” suggests Eg-walker achieves lower latency, less metadata overhead, and better consistency for syncing document edits among users. It possibly introduces an algorithm that avoids tombstones or excessive buffering (common issues in CRDT-based editors) by walking through the document state in an efficient way to integrate changes. The result is a collaboration engine that handles concurrent edits with minimal payload size and high responsiveness, even for large documents. Industry Relevance: Collaborative editing is foundational for products like Google Docs, Office 365, and code collaboration tools. Eg-walker’s improvements could reduce operational costs (less data to sync) and improve user experience with more fluid, robust real-time editing – highly relevant to any company offering collaborative applications.

Networking and Distributed Systems

  • Introspective Congestion Control for Consistent High Performance. Wanchun Jiang, Haoyang Li, Jia Wu, Kai Wang, Fengyuan Ren, Jianxin Wang. This paper proposes a congestion control (CC) algorithm that adapts to network variations by introspecting on its own performance. Traditional CC (like TCP variants) can suffer in consistency – e.g., throughput fluctuates or drops under certain patterns. The “introspective” approach likely monitors metrics like latency variation or send/ack patterns to detect suboptimal behavior (bufferbloat, incipient congestion) and then adjusts its sending strategy or parameters. The goal is to maintain more stable high throughput with low latency. Industry Relevance: Large-scale services (video streaming, data replication, etc.) require predictable network performance. A smarter CC that self-tunes in real time could provide smoother throughput and low tail latency on varied networks (data centers, 5G, Internet), benefiting cloud providers and CDN operators with more efficient network utilization.

  • Fork: A Dual Congestion Control Loop for Small and Large Flows in Datacenters. Yuan Liu, Wenxin Li, Yulong Li, Lide Suo, Xuan Gao, Xin Xie, Sheng Chen, Ziqi Fan, Wenyu Qu, Guyue Liu. Fork addresses the classic problem of co-existence of mice and elephant flows in datacenters. It implements two coupled control loops: one optimized for short flows (minimizing latency) and one for long flows (maximizing throughput). Small flows get preferential quick-start and aggressive completion, while large flows use a steadier AIMD or rate-based control. The “dual loop” likely means the algorithm can distinguish flow sizes early (via hints or initial packet trends) and apply appropriate CC behavior, possibly switching mode as flows grow. Industry Relevance: This directly benefits datacenter networking – by reducing tail latency for RPCs and bursty short transfers without hurting bulk flow throughput, Fork can improve application-level performance (e.g., web search or microservices) in cloud data centers where mixed traffic is common.
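
A toy version of the dual-loop behavior: the sender begins in a latency-oriented mode suited to short flows and hands off to a throughput-oriented AIMD loop once enough bytes have been sent. The switch threshold and constants are invented, and real datacenter congestion control reacts to much richer signals than a single boolean.

```python
# Toy dual-mode congestion window (illustrative only).
class DualLoopSender:
    SMALL_FLOW_BYTES = 100 * 1024        # switch point (invented)

    def __init__(self):
        self.sent = 0
        self.cwnd = 16.0                 # aggressive start for short flows
        self.mode = "small"

    def on_ack(self, acked_bytes, congested):
        self.sent += acked_bytes
        if self.mode == "small" and self.sent > self.SMALL_FLOW_BYTES:
            self.mode = "large"          # hand off to the throughput loop
        if self.mode == "small":
            # latency loop: ramp fast, back off hard on any congestion signal
            self.cwnd = self.cwnd / 2 if congested else self.cwnd * 2
        else:
            # throughput loop: classic additive-increase, multiplicative-decrease
            self.cwnd = self.cwnd / 2 if congested else self.cwnd + 1
        self.cwnd = max(self.cwnd, 1.0)

sender = DualLoopSender()
for i in range(40):
    sender.on_ack(acked_bytes=8 * 1024, congested=(i % 10 == 9))
print(sender.mode, round(sender.cwnd, 1))
```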

  • Enabling Virtual Priority in Data Center Congestion Control. Zhaochen Zhang, Feiyang Xue, Keqiang He, Zhimeng Yin, Gianni Antichi, Jiaqi Gao, Yizhi Wang, Rui Ning, Haixin Nan, Xu Zhang, Peirui Cao, Xiaoliang Wang, Wanchun Dou, Guihai Chen, Chen Tian. This work introduces the concept of virtual priority levels in congestion control, aiming to enforce priority handling (like PFC / priority flow control) via software CC algorithms rather than relying purely on network hardware queues. It likely modifies the congestion feedback signals (e.g., marking packets with different priorities or using multi-bit ECN) so that higher-priority flows experience less queuing when contending with lower-priority flows, without needing strict priority scheduling in switches. Industry Relevance: Many data center applications would benefit from differentiated QoS (e.g., storage traffic vs. best-effort analytics). Implementing it at the CC layer means finer control and easier deployment (no need for complex switch config). This could translate to more predictable performance for high-priority services in multi-tenant or multi-application clusters.

  • Marlin: Enabling High-Throughput Congestion Control Testing in Large-Scale Networks. Yanqing Chen, Li Wang, Jingzhi Wang, Songyue Liu, Keqiang He, Jian Wang, Xiaoliang Wang, Wanchun Dou, Guihai Chen, Chen Tian. Marlin provides a platform or methodology for rapidly testing new congestion control protocols at scale. It likely uses simulation or emulation with real traffic patterns, optimized to run experiments with many flows and nodes faster than packet-level simulators. It might incorporate techniques like abstraction of link sharing or parallel simulation to achieve high throughput testing (covering large parameter spaces). Industry Relevance: Every new CC algorithm (like BBR, HPCC, etc.) needs extensive testing under realistic conditions. Marlin can accelerate the development and evaluation cycle for industry researchers building next-gen network algorithms, ensuring robust results before deployment in production networks.

  • Phantom: Virtualizing Switch Register Resources for Accurate Sketch-based Network Measurement. Xiang Chen, Hongyan Liu, Zhengyan Zhou, Xi Sun, Wenbin Zhang, Hongyang Du, Dong Zhang, Xuan Liu, Haifeng Zhou, Dusit Niyato, Qun Huang, Chunming Wu, Kui Ren. Phantom focuses on network measurement sketches (like count-min, Bloom filters) which typically use switch memory/registers to track flow statistics. Hardware registers are scarce, so Phantom introduces a virtualization layer that gives the illusion of more counter space by swapping and sharing registers efficiently among multiple sketches or tasks. It likely uses fast algorithms to move sketch data to external memory (or among registers) without losing accuracy, perhaps by time-slicing the measurement tasks or compressing counts. This allows more comprehensive measurements to run concurrently on a switch without exceeding hardware limits. Industry Relevance: For network operators (cloud or ISP), accurate real-time traffic measurement is vital for DDoS detection, billing, and performance tuning. Phantom lets them deploy more measurement utilities on the same hardware switches, enhancing observability of network traffic without needing expensive hardware upgrades.
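
For context on what is being virtualized, here is the textbook count-min sketch, the kind of counter structure that measurement tasks keep in scarce switch register memory; Phantom's contribution is multiplexing register space across many such structures, which this snippet does not attempt.

```python
# Textbook count-min sketch: approximate per-flow packet counting.
import random

class CountMin:
    def __init__(self, width=1024, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        for salt in self.salts:
            yield hash((salt, key)) % self.width

    def add(self, key, count=1):
        for row, bucket in zip(self.rows, self._buckets(key)):
            row[bucket] += count

    def estimate(self, key):
        # never undercounts; hash collisions can only inflate the estimate
        return min(row[bucket] for row, bucket in zip(self.rows, self._buckets(key)))

cm = CountMin()
for src_ip in ["10.0.0.1"] * 500 + ["10.0.0.2"] * 20:
    cm.add(src_ip)
print(cm.estimate("10.0.0.1"), cm.estimate("10.0.0.2"))
```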

  • Pegasus: Transparent and Unified Kernel-Bypass Networking for Fast Local and Remote Communication. Dinglan Peng, Congyu Liu, Tapti Palit, Anjo Vahldiek-Oberwagner, Mona Vij, Pedro Fonseca. Pegasus proposes a unified approach to accelerate communication both within a single machine (inter-process) and across machines (network) by bypassing the kernel. It likely provides an API or layer that applications use for message-passing or RPC, which internally uses kernel-bypass (e.g. DPDK or RDMA) whether the destination is local or remote. This transparency means developers don’t have to handle IPC differently from RPC. Pegasus probably ensures that local communications short-circuit with shared memory, and remote goes over RDMA/Ethernet, but under one abstraction. Industry Relevance: In modern cloud software (especially microservices), the boundary between local and distributed communication blurs (e.g. a service might be co-located or remote). Pegasus can improve performance (low latency, high throughput) by using the fastest path available while simplifying development. This is beneficial in data centers for high-performance computing and distributed systems where both inter-thread and inter-node communications are critical.
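
The "one API, two fast paths" idea can be sketched as a small dispatch layer: callers always use the same send/recv interface, and the channel picks an in-memory path for a co-located peer or a network path for a remote one. Here an ordinary queue and a TCP socket stand in for shared memory and RDMA; none of this reflects Pegasus's actual interfaces.

```python
# Toy unified channel: same API whether the peer is local or remote.
import queue
import socket

class LocalChannel:
    """Co-located peer: messages stay in memory (a queue stands in for shm)."""
    def __init__(self):
        self.q = queue.Queue()
    def send(self, data: bytes): self.q.put(data)
    def recv(self) -> bytes: return self.q.get()

class RemoteChannel:
    """Remote peer: messages cross the network (TCP stands in for RDMA)."""
    def __init__(self, sock: socket.socket):
        self.sock = sock
    def send(self, data: bytes):
        self.sock.sendall(len(data).to_bytes(4, "big") + data)
    def recv(self) -> bytes:
        size = int.from_bytes(self._exactly(4), "big")
        return self._exactly(size)
    def _exactly(self, n):
        buf = b""
        while len(buf) < n:
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed")
            buf += chunk
        return buf

def connect(peer_host, local_hosts=frozenset({"localhost", "127.0.0.1"})):
    """Application code calls connect() and never cares which path it got."""
    if peer_host in local_hosts:
        return LocalChannel()
    return RemoteChannel(socket.create_connection((peer_host, 9999)))

ch = connect("localhost")   # local path; a remote host would use the socket path
ch.send(b"hello")
print(ch.recv())
```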

  • Byte vSwitch: A High-Performance Virtual Switch for Cloud Networking. Xin Wang, Deguo Li, Zhihong Wang, Lidong Jiang, Shubo Wen, Daxiang Kang, Engin Arslan, Peng He, Xinyu Qian, Bin Niu, Jianwen Pi, Xiaoning Ding, Ke Lin, Hao Luo. Byte vSwitch, developed by ByteDance, is a new software virtual switch (like OVS) optimized for large cloud environments. It likely achieves near line-rate packet processing in software by using techniques such as batching, lock-free data structures, NUMA-aware scheduling, and perhaps offloads (e.g. leveraging NIC hardware). It might integrate tightly with the hypervisor or container runtime to reduce context switches and memory copies (zero-copy packet paths). The result is a virtual switch that can handle very high throughput with low latency, even with many VMs/containers on a host. Industry Relevance: This directly serves cloud providers and large-scale SaaS operators – a faster vSwitch means higher network I/O performance for VMs/containers, enabling data-intensive applications and NFV (Network Function Virtualization) with less overhead. It improves the efficiency of network virtualization, which is a cornerstone of multi-tenant clouds.

  • Atlas: Towards Real-Time Verification in Large-Scale Networks via a Native Distributed Architecture. Mingxiao Ma, Yuehan Zhang, Jingyu Wang, Bo He, Chenyang Zhao, Qi Qi, Zirui Zhuang, Haifeng Sun, Lingqi Guo, Yuebin Guo, Gong Zhang, Jianxin Liao. Atlas is a distributed system for continuous network verification (checking that the network’s forwarding behavior meets policy, e.g. no loops, no blackholes, security isolation). Unlike offline verification tools, Atlas operates in real-time and at scale by partitioning the verification task across multiple nodes (e.g. each network device or each region verifies local properties, and results are aggregated). It likely uses a graph or SAT solver that is distributed, so verification computation grows with the network without central bottlenecks. This native distributed architecture means each router/switch might contribute to verifying reachability or compliance in parallel. Industry Relevance: Large networks (cloud data centers, telco networks) need fast verification of configurations to catch errors before they cause outages. Atlas’s approach could allow operators to get instant alerts to misconfigurations or policy violations even in huge networks, improving reliability and reducing downtime caused by network config errors.

  • Ladon: High-Performance Multi-BFT Consensus via Dynamic Global Ordering. Hanzheng Lyu, Shaokang Xie, Jianyu Niu, Chen Feng, Yinqian Zhang, Ivan Beschastnikh. Ladon addresses performance of Byzantine Fault Tolerant (BFT) consensus (used in blockchain or critical systems) by introducing a dynamic global ordering service. Classic BFT protocols struggle to scale in throughput and participant count. Ladon likely separates the ordering of requests from the agreement, using a smaller core (or a hierarchy) to establish a total order of transactions quickly, which the broader set of replicas then validate and execute. This dynamic ordering might adjust leaders or quorums on the fly based on load or faults, improving throughput under varying conditions. Industry Relevance: Permissioned blockchains and other BFT-replicated systems (financial ledgers, etc.) often need to handle high transaction volumes securely. Ladon’s multi-BFT approach could significantly boost throughput and lower latency for such systems, making BFT practical for industry use-cases that demand both strong fault tolerance and high performance (e.g. consortium blockchain between banks).

  • Achilles: Efficient TEE-Assisted BFT Consensus via Rollback Resilient Recovery. Jianyu Niu, Guanlong Wu, Shengqi Liu, Xiaoqing Wen, Jiangshan Yu, Yinqian Zhang. Achilles leverages Trusted Execution Environments (Intel SGX or similar) to improve BFT consensus. TEEs can reduce the number of replicas needed (since some trust is placed in secure enclaves) but come with performance costs and new failure modes (e.g. enclave crashes). Achilles introduces a rollback-resilient recovery mechanism to handle enclave resets or state rollbacks without compromising security or liveness. Essentially, it makes a TEE-based replica “sticky” to its state so an attacker can’t reset it to violate protocol assumptions. This yields a consensus protocol that uses fewer replicas (thanks to TEEs) yet remains efficient and secure. Industry Relevance: In consortium networks or databases, combining hardware security (TEE) with consensus can reduce overhead (e.g. 4 replicas instead of 3f+1). Achilles makes such hybrid approaches safer and faster, potentially encouraging adoption of TEEs for blockchain and critical consensus services where performance and security are both paramount.

  • ParallelEVM: Operation-Level Concurrent Transaction Execution for EVM-Compatible Blockchains. Haoran Lin, Hang Feng, Yajin Zhou, Lei Wu. ParallelEVM aims to speed up Ethereum-like blockchain throughput by executing smart contract operations in parallel when safe. It analyzes transactions at the operation level (within the EVM opcodes) to find independent steps or non-conflicting state accesses, so that multiple operations (possibly from different transactions) can run simultaneously on multi-core processors. It likely deals with issues of atomicity and order by careful scheduling or using speculative execution with rollback if a conflict is detected. Industry Relevance: Blockchain platforms face scalability limits. By parallelizing smart contract execution, throughput (TPS) can increase significantly. This is valuable for public and private EVM-compatible chains (Ethereum, Binance Chain, etc.), enabling them to handle more transactions without sharding or layer-2 solutions, directly benefiting the crypto industry and decentralized app performance.

  • BINGO: Radix-based Bias Factorization for Random Walk on Dynamic Graphs. Pinhuan Wang, Chengying Huan, Zhibin Wang, Chen Tian, Yuede Ji, Hang Liu. BINGO is an algorithmic systems paper focusing on random walks in dynamic graphs (graphs that change over time). It proposes a radix-based factorization technique to efficiently update and sample random walks as the graph evolves. Likely, it factorizes the probabilities or biases of walk transitions into components (radix digits) such that updates (like edge weight changes or additions) only affect a portion of the precomputed structure, enabling fast updates. This yields faster generation of random walk samples on the fly for things like graph neural networks or link analysis on dynamic networks. Industry Relevance: Random walks are used in recommendation (e.g. Pinterest’s Pixie), fraud detection, and network analysis. BINGO allows real-time graph mining on evolving data (social networks, transaction networks) by keeping the random walk process efficient even as the graph changes, which is useful for social media analytics, financial transaction monitoring, etc.

  • CAPSys: Contention-aware Task Placement for Data Stream Processing. Yuanli Wang, Lei Huang, Zikun Wang, Vasiliki Kalavri, Ibrahim Matta. CAPSys deals with placing streaming computation tasks (operators) on machines in a way that minimizes resource contention. In stream processing frameworks (Flink, Spark Streaming), multiple tasks may compete for CPU, network, or memory, causing unpredictable latency. This scheduler likely profiles or predicts contention between specific operators (for instance, two tasks both heavy on network I/O shouldn’t co-locate on the same host) and places them on different machines or cores. By being “contention-aware,” it improves overall throughput and reduces processing latency. Industry Relevance: Streaming analytics pipelines (for telemetry, click streams, etc.) are latency-sensitive. CAPSys can boost the stability and throughput of streaming jobs in cloud environments by intelligently packing tasks – valuable for any company processing big data streams in real-time where inefficient placement would cause bottlenecks.

  • Impeller: Stream Processing on Shared Logs. Zhiting Zhu, Zhipeng Jia, Newton Ni, Dixin Tang, Emmett Witchel. Impeller bridges stream processing with a shared log abstraction. It builds a stream processing engine on top of a shared log (a durable, totally ordered log of events accessible to multiple readers/writers). This design unifies historical and online processing: streaming queries can treat the log as the single source of truth for both real-time and past data. Impeller likely provides strong consistency and failure recovery by relying on the shared log (which could be implemented via Apache Kafka, or a log-based database). It enables stateful stream operators to scale out or recover by replaying from the log as needed. Industry Relevance: The idea of a “log as a database” (as seen in LinkedIn’s Samza, or Kafka Streams) is powerful. Impeller gives cloud and big-data companies a way to simplify their Lambda architectures – the same log underpins streaming and batch, ensuring no data loss or inconsistency. This can lower complexity and improve the reliability of real-time analytics pipelines used in fintech, IoT, etc.

  • NeuStream: Bridging Deep Learning Serving and Stream Processing. Haochen Yuan, Yuanqing Wang, Wenhao Xie, Yu Cheng, Ziming Miao, Lingxiao Ma, Jilong Xue, Zhi Yang. NeuStream integrates ML model inference into streaming data processing pipelines. Instead of treating model serving (e.g. image classification or anomaly detection) as an external RPC from a stream job, NeuStream makes it a native streaming operator. It likely optimizes the scheduling and batching of inference calls in the context of a streaming engine, so that throughput is maximized and backpressure is handled gracefully. Essentially, it closes the gap where streaming systems handle data transformations and then call out to a model server – NeuStream merges these, avoiding network overhead and synchronization issues by perhaps co-locating model executors within stream processors and aligning their batching with stream micro-batches. Industry Relevance: Many real-time applications involve both data stream processing and ML inference (for example, analyzing sensor data with an AI model). NeuStream’s unified approach can reduce end-to-end latency and simplify system architecture, beneficial for industries like fraud detection (stream of transactions + ML), monitoring (logs + anomaly models), etc., enabling them to act on insights faster.

Storage and Memory Systems

  • Deft: A Scalable Tree Index for Disaggregated Memory. Jing Wang, Qing Wang, Yuhao Zhang, Jiwu Shu. Deft is a custom indexing data structure designed for memory disaggregation architectures, where RAM is pooled and accessed over the network. It provides a scalable tree index (likely a B⁺-tree variant) that minimizes remote memory accesses and latency. Deft probably clusters tree nodes and uses pointer-free techniques or approximate positioning to reduce indirections, making lookups and updates efficient despite the high latency of disaggregated memory. It also likely handles partial caching of the tree in local memory to accelerate operations. Industry Relevance: In disaggregated data centers (e.g. with CXL attached memory or rack-scale memory pools), standard indexes can incur too many network fetches. Deft enables fast data lookup in a network-attached memory scenario, which is useful for distributed databases or in-memory caches that span multiple machines, a scenario emerging in modern cloud designs.

  • LOFT: A Lock-free and Adaptive Learned Index with High Scalability for Dynamic Workloads. Yuxuan Mo, Yu Hua. LOFT combines two trends: learned indexes (which use machine learning models to map keys to positions in a sorted array, as an alternative to B-trees) and lock-free concurrency. It presents a highly scalable index structure that adapts to workload changes (insertions, deletions, shifting key distributions) by adjusting the learned models on the fly, and it does so with a lock-free algorithm to avoid contention. This yields an index that is both fast (taking advantage of ML for search) and concurrent (many threads can perform operations without blocking each other). Industry Relevance: High-throughput key-value stores and databases can benefit from learned indexes to save space and potentially speed up queries. LOFT’s contributions mean these benefits can be realized even under heavy, dynamic workloads (common in real-time analytics or social network feeds) and on multicore servers without contention – improving performance of storage systems in industry scenarios.
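
The learned-index half is easy to sketch: fit a simple model from key to array position and correct the prediction with a bounded local search. The toy below uses a single linear model and omits LOFT's adaptivity and lock-free concurrency entirely.

```python
# Minimal learned-index lookup: a linear model plus bounded local search.
import bisect
import numpy as np

class TinyLearnedIndex:
    def __init__(self, sorted_keys):
        self.keys = np.asarray(sorted_keys)
        pos = np.arange(len(self.keys))
        # fit position ~ a*key + b over the sorted array
        self.a, self.b = np.polyfit(self.keys, pos, deg=1)
        pred = np.round(self.a * self.keys + self.b).astype(int)
        self.err = int(np.abs(pred - pos).max())       # worst-case model error

    def lookup(self, key):
        guess = int(round(self.a * key + self.b))
        lo = max(0, guess - self.err)                  # search only a small window
        hi = min(len(self.keys), guess + self.err + 1)
        i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        return i if i < len(self.keys) and self.keys[i] == key else None

keys = np.sort(np.random.randint(0, 1_000_000, size=10_000))
index = TinyLearnedIndex(keys)
probe = int(keys[1234])
assert index.lookup(probe) is not None and keys[index.lookup(probe)] == probe
```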

  • Chrono: Meticulous Hotness Measurement and Flexible Page Migration for Memory Tiering. Zhenlin Qi, Shengan Zheng, Ying Huang, Yifeng Hui, Bowen Zhang, Linpeng Huang, Hong Mei. Chrono improves OS/hypervisor management for systems with tiered memory (fast DRAM and slower memory like NVM or SSD used as pseudo-RAM). It provides fine-grained hotness tracking: rather than coarse counters or infrequent scans, Chrono meticulously measures page access frequency with high resolution to identify truly “hot” and “cold” pages. It also introduces a flexible migration policy that avoids hardwired thresholds – possibly using a dynamic or ML-based policy to decide when to promote/demote pages between tiers. By aligning hotness measurement with the performance characteristics of devices (e.g. NVM latency), Chrono ensures that promotions/demotions yield real performance gains. This avoids thrashing and underutilization seen in prior systems with rigid rules. Industry Relevance: As data centers incorporate NVMe SSDs or byte-addressable NVM (e.g. Intel Optane) alongside DRAM, Chrono’s approach allows them to maximize memory utilization (keep hot pages in DRAM) with minimal overhead. This improves application performance (database caching, in-memory analytics) and enables larger memory capacity at lower cost by safely using slower tiers.

  • PET: Proactive Demotion for Efficient Tiered Memory Management – Wanju Doh, Yaebin Moon, Seoyoung Ko, Seunghwan Chung, Kwanhee Kyung, Eojin Lee, Jung Ho Ahn. PET focuses on the “demotion” side of tiered memory (moving data from DRAM to slower memory) and does so proactively. It observes that many systems wait until DRAM is full or a page becomes cold to evict it, which can be suboptimal. PET instead leverages insights about application memory-allocation patterns – introducing the concept of a “PET-block”, which groups pages based on allocation context (e.g. the allocating call site or memory region). It proactively demotes entire PET-blocks that are predicted to be used less, before DRAM pressure becomes critical. This alignment with application behavior yields substantial DRAM savings with little performance impact. In evaluations, PET reduced DRAM usage by ~40% on average (up to 80%) with minimal slowdown, and it outperformed existing tiering schemes under high memory pressure. Industry Relevance: Proactively offloading infrequently used data to cheaper memory means higher effective memory capacity at lower cost. Cloud providers or enterprises running large in-memory workloads can use PET to cut hardware costs (using NVMe or remote memory for colder data) while preserving application speed – a direct economic win.
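
Below is a minimal sketch of the grouping idea, assuming a tagged allocator and using madvise(MADV_COLD) as a stand-in for demotion to a slower tier; the paper’s PET-blocks and demotion mechanism will differ.

```c
/* Sketch: tag allocations with an allocation-context id and demote a whole group at once. */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 20           /* fallback for older headers (Linux value) */
#endif

#define MAX_REGIONS 64

typedef struct { void *addr; size_t len; int ctx_id; } region_t;
static region_t regions[MAX_REGIONS];
static int nregions;

/* Allocate page-aligned memory tagged with an allocation-context id. */
static void *ctx_alloc(size_t len, int ctx_id) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    regions[nregions++] = (region_t){ p, len, ctx_id };
    return p;
}

/* Proactively demote every region belonging to one allocation context. */
static void demote_context(int ctx_id) {
    for (int i = 0; i < nregions; i++)
        if (regions[i].ctx_id == ctx_id)
            madvise(regions[i].addr, regions[i].len, MADV_COLD);
}

int main(void) {
    void *hot  = ctx_alloc(1 << 20, /*ctx=*/1);   /* frequently used data */
    void *cold = ctx_alloc(8 << 20, /*ctx=*/2);   /* rarely used after init */
    (void)hot; (void)cold;
    demote_context(2);   /* predicted-cold context demoted before DRAM pressure builds */
    printf("demoted allocation context 2\n");
    return 0;
}
```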

  • Adios to Busy-Waiting for Microsecond-scale Memory Disaggregation – Wonsup Yoon, Jisu Ok, Sue Moon, Youngjin Kwon. Adios revisits the OS page-fault-handler design in the context of ultra-fast disaggregated memory (where a page-fault fetch over RDMA can take only a few microseconds). Prior systems like Fastswap or DiLOS chose to busy-wait on page faults (spinning until data arrives) to avoid slow context switches and interrupts, but Adios shows this causes head-of-line blocking and underutilized CPU and network resources. Instead, Adios reintroduces yielding during page faults, with a twist: it places the page fault handler and scheduler in the same userspace address space and uses ultra-lightweight user-level threads (“unithreads”). This way, when a fault occurs, the thread can yield and let another ready thread run, with only a few nanoseconds of overhead – avoiding wasted CPU cycles on busy-waiting and allowing more outstanding RDMA requests. Adios also includes a new dispatch algorithm to balance RDMA queue usage across threads. Altogether, it outperforms the busy-wait baseline DiLOS by 1.07×–1.64× in throughput and cuts P99.9 tail latency by 1.99×–10.9× in real workloads. Industry Relevance: Disaggregated memory is on the horizon via CXL and RDMA pools. Adios provides the OS mechanisms to fully exploit high-speed remote memory – yielding better CPU utilization and lower tail latency, which is crucial for making disaggregated architectures practical in industry (where QoS and efficiency determine adoption).
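
To illustrate yielding instead of busy-waiting, the sketch below uses ucontext-based user-level threads: when a simulated remote page fault occurs, the faulting thread switches to a tiny round-robin scheduler so another ready thread can run while the fetch is outstanding. This is only a toy analogue of Adios’s unithreads and dispatch algorithm.

```c
/* Sketch: yield to a user-space scheduler on a (simulated) remote page fault. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t sched_ctx, worker_ctx[2];

/* Simulated asynchronous remote fetch: yield instead of spinning. */
static void remote_fault_yield(int id) {
    printf("thread %d: page fault, yielding instead of busy-waiting\n", id);
    swapcontext(&worker_ctx[id], &sched_ctx);
    printf("thread %d: fetch complete, resuming\n", id);
}

static void worker(int id) {
    remote_fault_yield(id);
    printf("thread %d: done\n", id);
}

int main(void) {
    static char stacks[2][64 * 1024];
    for (int i = 0; i < 2; i++) {
        getcontext(&worker_ctx[i]);
        worker_ctx[i].uc_stack.ss_sp = stacks[i];
        worker_ctx[i].uc_stack.ss_size = sizeof stacks[i];
        worker_ctx[i].uc_link = &sched_ctx;      /* return to scheduler when done */
        makecontext(&worker_ctx[i], (void (*)(void))worker, 1, i);
    }
    /* Round-robin scheduler: run each thread to its first yield, then resume it. */
    for (int round = 0; round < 2; round++)
        for (int i = 0; i < 2; i++)
            swapcontext(&sched_ctx, &worker_ctx[i]);
    return 0;
}
```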

  • Pre-Stores: Proactive Software-guided Movement of Data Down the Memory Hierarchy – Xiaoxiang Wu, Baptiste Lepers, Willy Zwaenepoel. Pre-Stores takes a proactive approach to moving data from CPU caches down to lower levels (e.g. DRAM or NVM) before it is evicted under pressure. It presumably allows software (perhaps the compiler or OS) to explicitly “demote” cache lines or pages that it predicts will not be reused soon, instead of waiting for hardware LRU to evict them. Doing so could reduce cache thrashing and also prepare larger chunks of data for eviction (amortizing the cost). Essentially, it is the analog of prefetching but for eviction (“pre-evict” = pre-store to a lower level), which can improve cache efficiency and memory bandwidth usage for certain access patterns. Industry Relevance: Workloads with streaming or phased behavior (like batch data processing) could benefit, as they often blow out caches. Pre-Stores could increase effective cache size and reduce interference for high-performance computing and big-data jobs, potentially giving predictable performance improvements in cloud VMs or any environment where software can be tuned with such hints.
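
One plausible mechanism for software-guided demotion is the x86 CLDEMOTE hint, shown below after a streaming pass over data that will not be reused soon; whether Pre-Stores uses this instruction, a different hint, or page-level demotion is an assumption here. CLDEMOTE executes as a NOP on CPUs that lack it; compile with -mcldemote.

```c
/* Sketch: demote cache lines after a streaming pass using the CLDEMOTE hint. */
#include <immintrin.h>
#include <stdio.h>

#define N (1 << 20)
static float data[N];

int main(void) {
    float sum = 0.0f;

    /* Streaming pass: each element is read once and will not be reused soon. */
    for (int i = 0; i < N; i++) {
        sum += data[i];
        if ((i & 15) == 15)                   /* once per 64-byte line (16 floats) */
            _mm_cldemote(&data[i & ~15]);     /* hint: push this line down the hierarchy */
    }
    printf("sum = %f\n", sum);
    return 0;
}
```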

  • RAKIS: Secure Fast I/O Primitives Across Trust Boundaries on Intel SGX – Mansour Alharthi, Fan Sang, Dmitrii Kuvaiskii, Mona Vij, Taesoo Kim. RAKIS addresses the notoriously slow I/O of Intel SGX enclaves. SGX enclaves are secure, but performing I/O (network, disk) requires exiting the enclave (OCALLs), which is slow and can weaken security if not done carefully. RAKIS provides a set of secure, optimized I/O primitives that span the inside and outside of the enclave seamlessly. It likely introduces a mechanism for encryption/decryption and data copying that minimizes transitions – perhaps by batching I/O requests or sharing memory buffers securely between the enclave and the untrusted OS. The result is a much faster way for enclave code to perform file or socket I/O without compromising the trust model. Industry Relevance: SGX and TEEs are used for confidential computing (e.g. databases, secure ML inference). RAKIS helps such applications achieve near-native I/O throughput, which broadens the range of apps that can run in TEEs without unacceptable overhead – important for cloud vendors offering confidential computing instances.
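
A minimal sketch of one way to cut enclave transitions: the enclave enqueues I/O descriptors into a shared ring in untrusted memory and an untrusted worker drains it, so no OCALL is needed per request. This is a guess at the general approach, not RAKIS’s actual primitives, and it omits the encryption and validation a real design needs.

```c
/* Sketch: single-producer/single-consumer ring shared across a trust boundary. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define RING_SIZE 64

typedef struct { int fd; size_t len; } io_desc_t;   /* offsets into a shared buffer omitted */

typedef struct {
    io_desc_t slots[RING_SIZE];
    _Atomic unsigned head;    /* written by the enclave (producer) */
    _Atomic unsigned tail;    /* written by the untrusted worker (consumer) */
} io_ring_t;

/* Enclave side: submit a request without exiting the enclave. */
static int ring_submit(io_ring_t *r, io_desc_t d) {
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE) return -1;          /* ring full */
    r->slots[head % RING_SIZE] = d;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Untrusted side: drain the ring and issue the real syscalls. */
static void ring_drain(io_ring_t *r) {
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    while (tail != head) {
        io_desc_t d = r->slots[tail % RING_SIZE];
        printf("issuing I/O on fd %d, %zu bytes\n", d.fd, d.len);
        tail++;
    }
    atomic_store_explicit(&r->tail, tail, memory_order_release);
}

int main(void) {
    static io_ring_t ring;
    ring_submit(&ring, (io_desc_t){ .fd = 3, .len = 4096 });
    ring_submit(&ring, (io_desc_t){ .fd = 4, .len = 512 });
    ring_drain(&ring);
    return 0;
}
```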

  • Efeu: Generating Efficient, Verified, Hybrid Hardware/Software Drivers for I2C Devices – Daniel Schwyn, Zikai Liu, Timothy Roscoe. Efeu is a framework for automatically generating device drivers for I²C devices, splitting functionality between hardware (FPGA or ASIC) and software. Many I²C peripherals (sensors, etc.) have timing-sensitive or frequent operations that could be offloaded to a small hardware core, while the rest remains in software. Efeu likely takes a high-level device specification and synthesizes a “hybrid” driver: part runs on an FPGA (to handle frequent polling or data movement) and part as normal driver code – and it verifies the correctness of this partition. This yields drivers that are both efficient (offloading low-level bit-banging or timing-critical loops to hardware) and reliable (formally verified against the spec). Industry Relevance: Writing drivers for the plethora of IoT devices is time-consuming, and pure software may not meet the performance needs of some low-level tasks. Efeu could reduce development effort and bugs by generating drivers automatically, and also improve performance for device-heavy systems (like sensor hubs or robotics) by leveraging simple hardware accelerators – a win for embedded systems development in industry.

  • eNetSTL: Towards an In-kernel Library for High-Performance eBPF-based Network Functions – Bin Yang, Dian Shen, Junxue Zhang, Hanlin Yang, Lunqi Zhao, Beilun Wang, Guyue Liu, Kai Chen. eNetSTL is essentially an STL-like collection (in the spirit of the C++ STL) of data structures and utilities optimized for in-kernel eBPF programs. Writing high-performance eBPF code is hard due to limited libraries and verification constraints. eNetSTL likely provides vetted, BPF-friendly implementations of common structures (maps, queues) and algorithms that can be used in eBPF hooks for packet processing. Because these building blocks are in-kernel and BPF-safe, they can execute with minimal overhead, enriching the eBPF programming environment so that complex in-kernel network functions (firewalls, telemetry, load balancers) can be built more easily without kernel-module programming. Industry Relevance: eBPF is exploding in use (for observability, networking, security). eNetSTL would allow developers to write more sophisticated eBPF programs faster, with confidence in their performance and safety. This can accelerate the pace of innovation in kernel extensions and help companies like Netflix, Facebook, and Cloudflare (big eBPF users) push more logic into the kernel datapath safely for performance gains.
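
For context, here is what a small, verifier-friendly eBPF building block looks like today: an XDP program counting packets with a per-CPU array map, built with clang -O2 -target bpf. A library like eNetSTL would presumably package richer, pre-vetted structures of this kind; the program below is just standard libbpf-style code, not part of eNetSTL.

```c
/* Sketch: minimal XDP packet counter using a per-CPU array map (libbpf style). */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&pkt_count, &key);
    if (val)                 /* the verifier requires this NULL check */
        (*val)++;
    return XDP_PASS;         /* let the packet continue up the stack */
}

char LICENSE[] SEC("license") = "GPL";
```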

  • Understanding the Linux Kernel, Visually – Hanzhi Liu, Yanyan Jiang, Chang Xu. This is a tool/system aimed at visualizing the internal behavior of the Linux kernel to aid understanding and debugging. It presumably traces kernel events (scheduling, memory allocation, syscalls) and generates intuitive visual representations (graphs, timelines) that show how the kernel is operating. This can help developers catch issues like deadlocks or long interrupt-disabled sections, or simply learn kernel behavior by seeing it in action. Industry Relevance: Kernel engineering and performance tuning are critical but difficult due to complexity. A visualization tool can shorten debug time and training time for kernel developers, SREs, and OS researchers by making low-level behavior observable. Companies working on custom kernels or diagnosing OS performance (e.g. Android OEMs, cloud OS teams) could benefit from such a tool to quickly pinpoint problems.

  • “Garbage Collection Does Not Only Collect Garbage”: Piggybacking-Style Defragmentation for Deduplicated Backup Storage – Dingbang Liu, Xiangyu Zou, Tao Lu, Philip Shilane, Wen Xia, Wenxuan Huang, Yanqi Pan, Hao Huang. In deduplicated backup systems, data fragmentation over time hurts restore performance. This paper proposes a piggybacked defragmentation approach that runs opportunistically during normal garbage collection (GC) cycles. When backup data is deleted and chunks are freed, the system uses that opportunity not only to collect garbage but also to relocate the remaining live chunks to more contiguous storage areas, defragmenting the store in the background. By integrating defragmentation into routine GC (instead of requiring separate, heavyweight defrag jobs), it reduces write amplification and space overhead. The result is a backup store that maintains high read/restore efficiency over time at little additional cost. Industry Relevance: Enterprise backup appliances (like Dell EMC Data Domain, where one author works) suffer from fragmentation as data evolves, making recovery slower. This technique can improve restore speeds and overall storage utilization in such products, giving customers faster disaster recovery and backup expiration without performance degradation.
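
The sketch below captures the piggybacking idea in a few lines: while sweeping a container whose garbage ratio is high, surviving live chunks are rewritten into a fresh, contiguous container instead of being left scattered in place. The threshold and data layout here are illustrative only.

```c
/* Sketch: relocate live chunks into a fresh container during a GC sweep. */
#include <stdio.h>

#define CHUNKS_PER_CONTAINER 8

typedef struct {
    int live[CHUNKS_PER_CONTAINER];   /* 1 = chunk still referenced by some backup */
    int used;
} container_t;

/* Sweep one container; if it is mostly garbage, relocate the survivors. */
static void gc_sweep(container_t *c, container_t *fresh) {
    int live = 0;
    for (int i = 0; i < c->used; i++) live += c->live[i];

    if (live * 2 < c->used) {                     /* more than 50% garbage */
        for (int i = 0; i < c->used; i++)
            if (c->live[i])
                fresh->live[fresh->used++] = 1;   /* copy live chunk into the new container */
        c->used = 0;                              /* whole old container reclaimed */
        printf("relocated %d live chunks, container freed\n", live);
    }
}

int main(void) {
    container_t old = { .live = {1, 0, 0, 1, 0, 0, 0, 0}, .used = 8 };
    container_t fresh = { 0 };
    gc_sweep(&old, &fresh);
    return 0;
}
```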

  • Overcoming the Last Mile between Log-Structured File Systems and Persistent Memory via Scatter Logging – Yifeng Zhang, Yanqi Pan, Hao Huang, Yuchen Shan, Wen Xia. This work adapts log-structured file systems (LFS) to better exploit persistent memory (PM). A traditional LFS writes batches of data sequentially to storage, which is great for disks, but on PM (byte-addressable, with very different performance characteristics) that model can be suboptimal (e.g. it may incur extra copies or poor cache usage). “Scatter Logging” likely writes data and metadata in a way that leverages PM’s fast random access – perhaps scattering small updates in place in PM while still maintaining a logical log structure. It may reduce the overhead of cleaning (garbage collection) or the indirection mapping that an LFS normally needs, taking advantage of PM’s persistence and low latency. Essentially, it bridges the gap by changing how the “tail of the log” and cleaning operations work on PM, eliminating inefficiencies. Industry Relevance: As persistent memory technologies re-emerge (via NVMe or new NVRAM), file systems will need to adapt. Scatter Logging can enable higher-throughput and lower-latency file operations on PM, useful for databases, caching systems, or any software using PM as storage – ensuring they are not constrained by algorithms designed for slow disks.

  • Daredevil: Rescue Your Flash Storage from Inflexible Kernel Storage Stack – Junzhe Li, Ran Shu, Jiayi Lin, Qingyu Zhang, Ziyue Yang, Jie Zhang, Yongqiang Xiong, Chenxiong Qian. Daredevil rethinks the OS storage stack for modern SSDs. Today’s kernel I/O path has multiple layers (VFS, block layer, I/O schedulers) that were designed for much slower, simpler disks. Daredevil likely introduces a more flexible or direct I/O stack that can adapt to application needs and SSD characteristics (such as internal parallelism and garbage collection). It might give applications more control over the placement or scheduling of I/O, or bypass certain layers for efficiency, while still retaining safety. Essentially, it “rescues” performance lost in translation by offering an optimized path or an extensible interface to exploit advanced SSD features (NVMe multi-queue, Zoned Namespaces, etc.). Industry Relevance: With ultra-fast NVMe drives, the OS can become the bottleneck. Daredevil’s approach can unlock additional IOPS and cut latency by eliminating needless overhead, benefiting databases, high-frequency trading systems, and any application running on fast storage. For cloud providers, a more efficient storage stack means they can deliver more performance per drive to customers.

  • Solid State Drive Targeted Memory-Efficient Indexing for Universal I/O Patterns and Fragmentation Degrees – Junsu Im, Jeonggyun Kim, Seonggyun Oh, Jinhyung Koo, Juhyung Park, Hoon Sung Chwa, Sam H. Noh, Sungjin Lee. This paper presents an indexing method designed specifically around how SSDs behave. It likely optimizes index placement and lookup with SSD characteristics in mind, such as erase-block sizes, write amplification, and fragmentation, and may switch dynamically between index strategies (e.g. hash vs. tree) depending on the workload’s I/O pattern (random vs. sequential) and the drive’s fragmentation level. By doing so, it remains memory-efficient (small index footprint) while offering consistently good performance across workloads. Industry Relevance: File systems and key-value stores on SSDs struggle to maintain performance as fragmentation increases or workloads change. A universal indexing scheme that gracefully handles these variations means more stable latency and throughput for storage engines – relevant to any system managing data on flash, from embedded devices to large-scale storage systems.

Security and Reliability

  • CKI (Container Kernel Isolation): A Hardware-Software Co-Design for Efficient Secure Containers – Jiacheng Shi, Yang Yu, Jinyu Gu, Yubin Xia. This work introduces CKI, which adds a new privilege level and monitor to strongly isolate container kernels from the host without the overhead of full VMs. Leveraging Intel’s supervisor memory protection keys (PKS), CKI runs each container’s guest kernel in a de-privileged mode and uses a thin Kernel Separation Monitor (KSM) to mediate privileged operations. Even if a container’s kernel is compromised, it cannot harm the host or other containers (unlike normal containers), achieving security comparable to VMs. Importantly, CKI avoids the costly parts of virtualization: it uses fast PKS-based switches and eliminates certain side-channel mitigations on those transitions, cutting hundreds of cycles of overhead. In bare-metal cloud tests, CKI improved memory-heavy application latencies by 18–47% versus KVM, and in nested (VM-in-VM) setups it achieved up to 6.8× higher throughput for I/O-heavy applications compared to traditional approaches. Industry Relevance: CKI provides VM-level security with container-level efficiency. This is a breakthrough for multi-tenant cloud services: providers could offer “secure containers” where tenants get strong isolation without the performance hit of full VMs, combining the best of both worlds and potentially replacing many VM use cases.
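
CKI’s PKS-based switching is a supervisor-mode feature, but the user-space analogue (PKU, exposed on Linux via pkey_alloc/pkey_mprotect) gives a feel for why protection keys are cheap: access rights are flipped by a register write rather than a page-table update. The sketch below is that user-space analogue, not CKI’s kernel mechanism; it requires a CPU and kernel with protection-key support.

```c
/* Sketch: user-space memory protection keys (PKU), the user-mode cousin of PKS. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    int pkey = pkey_alloc(0, 0);
    if (pkey < 0) { perror("pkey_alloc"); return 1; }

    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    pkey_mprotect(buf, 4096, PROT_READ | PROT_WRITE, pkey);

    buf[0] = 'x';                              /* allowed */
    pkey_set(pkey, PKEY_DISABLE_ACCESS);       /* revoke access via a register write */
    /* A read of buf[0] here would now fault, without any mprotect() call. */
    pkey_set(pkey, 0);                         /* restore access, again just a register write */
    printf("buf[0] = %c\n", buf[0]);

    pkey_free(pkey);
    return 0;
}
```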

  • Erebor: A Drop-In Sandbox Solution for Private Data Processing in Untrusted Confidential Virtual Machines – Chuqi Zhang, Rahul Priolkar, Yuancheng Jiang, Yuan Xiao, Mona Vij, Zhenkai Liang, Adil Ahmad. Erebor is a sandboxing mechanism for safely processing sensitive data in a cloud VM that the user does not fully trust (even with confidential VMs, the guest software stack may be only semi-trusted). It likely uses a combination of hardware enclave features and isolated execution to ensure that specific data or code runs in a contained environment inside the VM. “Drop-in” suggests it does not require significant modification of the guest OS or applications – it might intercept syscalls or use hypervisor tricks to compartmentalize workloads on the fly. This allows, for instance, running a third-party analytics function on private data inside a VM while ensuring that the function cannot leak the data out. Industry Relevance: As confidential computing becomes popular, fine-grained sandboxing inside those environments is needed (for multi-stage pipelines, or to protect against insider threats). Erebor could help companies safely use cloud VMs for sensitive workloads by adding an extra layer of defense around data-processing modules, even if the VM’s OS or some libraries are not fully trusted.

  • Seal: Towards Diverse Specification Inference for Linux Interfaces from Security Patches – Wei Chen, Bowen Zhang, Chengpeng Wang, Wensheng Tang, Charles Zhang. Seal automatically infers API usage rules (specifications) for Linux kernel interfaces by analyzing historical security patches. The insight is that many security fixes essentially add a check or a missing step (e.g. locking, input validation) – from these, one can deduce what should have been done originally. Seal likely mines a corpus of kernel patches, clustering them to identify implicit rules like “function X must be called after Y” or “check pointer Z before use”. By drawing on diverse sources (different bugs), it infers broader specs than a single buggy example would allow. These inferred specs can then be used to detect other bugs or to ensure new code complies. Industry Relevance: The Linux kernel is huge, and developers may unknowingly violate its rules, causing vulnerabilities. Seal’s inferred specs provide valuable documentation and automated checkers for kernel developers and static analysis tools. This helps OS vendors (Google, Red Hat, etc.) catch bugs early and improve kernel security by learning from past mistakes.

  • BESA: Extending Bugs Triggered by Runtime Testing via Static Analysis – Jia-Ju Bai. BESA is a technique that amplifies and generalizes bugs found during testing by using static analysis. When a dynamic test (a fuzzer or unit test) finds a bug, BESA performs static analysis around that code to see whether similar patterns exist in other contexts or along other paths, effectively finding extensions of the bug. For example, if a null-pointer dereference was found in one function, BESA might statically trace where else that pointer could be null without a check, exposing more instances. It essentially takes a concrete bug trace and uses static analysis to explore variants of it without needing further dynamic input. Industry Relevance: This approach maximizes the value of each test case: QA teams and fuzzers might only hit one scenario, but BESA can uncover semantically related bugs elsewhere automatically. This increases bug-finding efficiency in large software projects (OS kernels, browsers), where, once a vulnerability is seen, you want to ensure all similar vulnerabilities are found and fixed.

  • HawkSet: Automatic, Application-Agnostic, and Efficient Concurrent PM Bug Detection – João Oliveira, João Gonçalves, Miguel Matos. HawkSet targets concurrency bugs in persistent memory (PM) programs, such as missing flushes or ordering issues that could corrupt data consistency across crashes. It likely instruments and observes a PM-enabled application (e.g. one using Intel Optane with pmem libraries) and automatically detects anomalies such as unflushed or improperly ordered persistent updates. It is application-agnostic, meaning it does not need custom specifications per application – it probably relies on generic patterns of correct PM usage (e.g. every persistent write should eventually be flushed and fenced) to catch bugs. And it is efficient enough to run concurrently (possibly using parallel checking threads or core-local analysis) without huge slowdowns. Industry Relevance: As persistent memory finds its way into databases, file systems, and caches, bugs in crash consistency can be catastrophic (data loss). HawkSet gives developers and testers a way to catch PM-specific concurrency and ordering bugs automatically before deployment, increasing confidence in systems like PM databases or storage engines used by enterprises.
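
The kind of generic rule such a detector can check is easy to sketch: every store to persistent memory must be flushed (and fenced) before the program declares the data durable. The toy checker below tracks unflushed writes and flags them at the commit point; HawkSet’s actual analysis of concurrent executions is far more involved, and the function names here are invented for the example.

```c
/* Sketch: flag PM stores that reach a durability point without a flush. */
#include <stdio.h>

#define MAX_PENDING 128

static void *pending[MAX_PENDING];   /* PM addresses written but not yet flushed */
static int npending;

static void pm_store(void *addr) { pending[npending++] = addr; }

static void pm_flush(void *addr) {
    for (int i = 0; i < npending; i++)
        if (pending[i] == addr) { pending[i] = pending[--npending]; return; }
}

/* At a durability point, any still-pending store is a crash-consistency bug. */
static void pm_fence_and_commit(void) {
    if (npending > 0)
        printf("BUG: %d store(s) could be lost on crash (never flushed)\n", npending);
}

int main(void) {
    int a, b;                 /* stand-ins for persistent locations */
    pm_store(&a);
    pm_flush(&a);             /* correctly flushed */
    pm_store(&b);             /* missing flush: the checker flags it */
    pm_fence_and_commit();
    return 0;
}
```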

  • Revealing the Unstable Foundations of eBPF-Based Kernel Extensions – Shawn Zhong, Jing Liu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau. This paper analyzes the fragility and pitfalls of eBPF programs running in the kernel. It likely shows that eBPF, despite being safer than kernel modules, rests on assumptions that can break: changes in kernel-internal data structures can silently break eBPF programs, and resource exhaustion or verifier quirks can lead to instability. The authors might demonstrate issues such as an eBPF program working on one kernel version but not another (kernel-internal API instability), difficulty debugging eBPF, or performance anomalies (e.g. jitter due to JIT compilation). By “unstable foundations,” they highlight that eBPF’s promise of stability is not fully met because of these underlying issues. Industry Relevance: Many companies rely on eBPF for production monitoring and networking. This work warns practitioners of hidden risks – for instance, an innocuous kernel update could break an eBPF-based firewall. It may spur improvements in eBPF tooling, better documentation of stable vs. unstable hooks, and more robust verifier enhancements, ultimately helping industry users maintain reliable eBPF deployments.
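
One existing mitigation for this fragility is CO-RE (compile once, run everywhere): BPF_CORE_READ resolves kernel struct field offsets at load time via BTF instead of baking them in at compile time. The sketch below is a standard libbpf-style example (it assumes a vmlinux.h generated with bpftool), included to show what kernel-version-portable field access looks like; it is not from the paper.

```c
/* Sketch: CO-RE field access so the program survives kernel struct layout changes. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/do_unlinkat")
int trace_unlink(struct pt_regs *ctx)
{
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    pid_t ppid;

    /* Offsets of real_parent and tgid are resolved at load time via BTF,
     * not hard-coded at compile time. */
    ppid = BPF_CORE_READ(task, real_parent, tgid);
    bpf_printk("unlink called, parent tgid=%d", ppid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```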
