Checkpoint/Restore Systems: Evolution, Techniques, and Applications in AI Agents
Checkpoint/restore (C/R) technology, the ability to save a running program’s state to persistent storage and later resume execution from that point, has long been a cornerstone of fault tolerance and process management in computing. By capturing a snapshot of a process or group of processes, C/R enables recovery from failures, migration of computations, load balancing, and the suspension and resumption of work. Traditionally, C/R has been critical in three settings: high-performance computing (HPC), where it mitigates frequent failures in large clusters; operating systems, where it supports process migration and preemption; and virtualization platforms, where it enables live virtual machine (VM) migration with minimal downtime.
With the rise of AI-centric applications, from AI-assisted developer tools and autonomous agents to distributed machine learning pipelines, the scope of C/R is expanding. Modern AI systems often consist of long-running stateful agents, complex multi-process pipelines, and GPU-accelerated workloads, all of which introduce new requirements and challenges for checkpointing. For example, training a massive deep learning model over weeks exposes the system to many failures: one 54-day run of a 405-billion-parameter model across 16,000 GPUs experienced 419 interruptions (78% from hardware faults), an average of roughly one every three hours, potentially costing millions in lost work. Techniques such as maintaining redundant in-memory copies of state for fast recovery are used in such cases, underscoring the importance of robust checkpointing.
This survey provides a comprehensive overview of C/R systems and their evolution, spanning traditional use cases (before the advent of AI agents) and emerging applications in AI. We cover checkpointing at all levels of the software stack (OS, container, VM, application, and library), discuss stateless vs. stateful restoration strategies for AI systems, compare prominent open-source and proprietary C/R solutions, examine the technical mechanisms that enable C/R (memory snapshotting, I/O and descriptor handling, GPU state, etc.), and highlight research challenges in bringing reliable, efficient C/R to dynamic, interactive AI agent environments. We also include extensive references to both classic literature and recent work (with a focus on peer-reviewed research) and provide comparative tables summarizing the landscape of C/R tools and their capabilities. By looking at past and present developments, we aim to outline the trajectory of checkpoint/restore technology and identify opportunities for new tooling tailored to the next generation of AI-driven applications.