OS-Level Challenges in LLM Inference and Optimizations
Large Language Model (LLM) inference pushes computing systems to their limits, not only in raw compute but also in how the operating system (OS) manages resources. This report examines OS-level challenges for LLM inference and explores potential solutions. We focus first on key bottlenecks – memory management, CPU scheduling, I/O, and real-time constraints – then discuss how kernel-level techniques (like eBPF and custom scheduling) can address these issues. We also consider the impact of system calls and page faults on performance, security/isolation concerns in multi-tenant environments, best practices and emerging research in OS customization for AI, and practical considerations for implementing such optimizations in a research project. The goal is to guide a research initiative by outlining the challenges and then potential solutions and areas for further investigation.
OS-Level Bottlenecks in LLM Inference
LLM inference workloads strain various aspects of an OS. Key bottlenecks include memory management, CPU scheduling, I/O handling, and the need for real-time responsiveness. Each of these can become the limiting factor as models and workloads scale:
- Memory Management Challenges: Modern LLMs often have tens of billions of parameters, requiring many gigabytes of memory for model weights and intermediate data. Ensuring sufficient and efficient memory usage is a major challenge. If a model does not fully fit in GPU memory, a common approach is to spill data to CPU RAM or disk – but traditional GPU–CPU memory swapping incurs high latency and low throughput (1). Even when fitting in RAM, the sheer size of models means managing memory is non-trivial; LLM inference memory footprints continue to grow with larger models and longer context lengths (1). Page faults and swapping (if the OS has to page out model data) can stall inference for milliseconds at a time. The OS’s default paging might not be ideal for LLMs’ access patterns, leading to thrashing or suboptimal use of caches. Large models also benefit from large, contiguous memory allocations, but standard allocation can lead to fragmentation. In short, efficient memory management for LLM inference remains a challenge (1: Pie: Pooling CPU Memory for LLM Inference), and memory bottlenecks can severely hurt latency and throughput if not handled carefully.
- CPU Scheduling and Compute Jitter: LLM inference (especially on CPUs or when coordinating CPUs and accelerators) can suffer from OS scheduling overhead or “jitter.” The OS’s task scheduler may preempt or context-switch inference threads, introducing variability in latency. Background daemons, interrupts, or other processes on a shared system can interrupt an inference task unexpectedly. In high-performance scenarios, these interruptions cause OS noise that lengthens tail latencies and breaks real-time constraints. For example, measurements in a high-performance setting showed that without special tuning, one in 100 events could be delayed by over 2 milliseconds due to OS scheduling effects, and worst-case delays reached 11 ms (2) (2). By contrast, pinning inference threads to dedicated CPU cores and isolating them from normal scheduling can dramatically reduce latency jitter – bringing worst-case latency down to tens of microseconds (2: Strategy: Taming Linux Scheduler Jitter Using CPU Isolation and Thread Affinity - High Scalability -). This illustrates how significant the default scheduler’s overhead can be. For LLM services that need interactive response times (e.g. chatbots), such variability is problematic. The OS’s general-purpose scheduling (e.g. Linux’s CFS) doesn’t inherently prioritize real-time inference deadlines, so without adaptation, inference requests might queue behind less critical tasks. Ensuring CPU core affinity, using real-time scheduling policies, and preventing frequency scaling or power-saving-induced delays are often necessary to avoid unpredictable slowdowns (a minimal pinning sketch follows this list).
- I/O and Data Handling Bottlenecks: Although LLM inference is primarily compute-heavy, I/O can become a bottleneck in several ways. First, loading a large model from storage can take significant time (several seconds for multi-GB models) – if the OS doesn’t efficiently cache or prefetch model weights, startup or context-switching between models is slow. During inference, if the model or its parameters are memory-mapped from disk, any page fault to load data will incur disk I/O latency. Page faults on missing pages cause the OS to pause the process and fetch data from disk (or across PCIe from CPU memory to GPU memory in unified memory scenarios), leading to significant delays (3: Impact of Memory Allocation on Model Latency in TensorRT) (4). The pattern of memory access matters: irregular or random access to model weights can generate many small page faults that saturate I/O channels and degrade throughput (4). Furthermore, reading input data (for example, large text prompts or data batches) and writing outputs (streaming generated tokens to a socket) also involve OS I/O operations and system calls. If not managed, these can queue behind other I/O or suffer from kernel overhead. High-throughput inference servers must optimize how they read inputs and deliver outputs to avoid I/O becoming the slowest link. In summary, while raw computation is often the focus, the OS’s I/O subsystem (file system, disk, network) can limit performance if model data and results aren’t handled in a streaming, efficient manner.
- Real-Time Constraints and Deadline Misses: Many AI applications have real-time or interactive latency requirements – for instance, responding to a user query within a few hundred milliseconds. The OS is not traditionally designed for strict real-time guarantees on complex workloads like LLMs. The aforementioned scheduling jitter and I/O unpredictability mean an inference might occasionally take much longer than average, violating service-level objectives. Unlike a purpose-built real-time system, a general OS might schedule an unrelated background job or handle an interrupt (e.g. a network packet, or a disk flush) in the middle of an inference computation, adding unforeseen delay. Additionally, interrupt handling and kernel timers can preempt running threads at inopportune moments. Without special configuration, these sources of latency variation can derail an otherwise fast model. For instance, periodic OS housekeeping tasks or device interrupts contribute to OS noise, which HPC studies have long identified as a problem for time-sensitive workloads (5: Meet osnoise, a better tool for fine-tuning to reduce operating system ...) ([PDF] Shoot first and stop the OS noise - The Linux Kernel Archives). In an LLM serving context, “real-time” means consistently low latency per token generation or request; to approach this, practitioners often isolate CPUs, elevate thread priorities, and disable certain OS services on inference machines. The challenge is balancing a responsive system with the isolation required for deterministic performance.
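The core pinning and real-time priority mentioned in the scheduling bullet above can be applied directly from the inference process. Below is a minimal sketch, assuming a Linux host, a core already reserved for inference (the core index 3 and priority 80 are illustrative placeholders), and the privileges SCHED_FIFO requires (root or CAP_SYS_NICE).

```c
/* Minimal sketch: pin the calling thread to one reserved core and raise it to a
 * real-time priority. Core number and priority are illustrative; SCHED_FIFO
 * requires CAP_SYS_NICE or root. */
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int pin_and_prioritize(int core, int rt_priority)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);

    /* Restrict this thread to the chosen core so the scheduler cannot migrate it. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        fprintf(stderr, "sched_setaffinity: %s\n", strerror(errno));
        return -1;
    }

    /* Switch to the FIFO real-time class so ordinary CFS tasks cannot preempt us. */
    struct sched_param sp = { .sched_priority = rt_priority };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}

int main(void)
{
    if (pin_and_prioritize(3, 80) == 0)
        puts("inference thread pinned to core 3 with SCHED_FIFO priority 80");
    return 0;
}
```

In practice this is combined with isolcpus or cpusets so that no other task is even eligible to run on the reserved core.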
Why the OS Matters: These bottlenecks highlight that an OS not tuned for AI can become the performance limiter. Modern accelerators (GPUs, TPUs) are extremely fast at matrix computations, so the relative overhead of OS activities (memory paging, scheduling, syscalls) becomes more pronounced. Indeed, as one analysis notes, tasks like scheduling that were once negligible can now consume a significant fraction of inference time given the speed of optimized LLM kernels (6) (6). Similarly, context-switches and kernel-user mode transitions disrupt the execution flow, causing cache flushes and pipeline stalls that inflate latency and reduce throughput (7) (7). Overall, treating the OS as an integral part of the AI inference stack – not just an invisible layer – is crucial. Next, we explore solutions at the OS level, including customizing the kernel behavior, to alleviate these challenges.
Custom Kernel Extensions and eBPF for AI Workloads
To address OS-level bottlenecks, researchers and engineers are increasingly turning to custom kernel programming and eBPF (extended Berkeley Packet Filter) technology. These allow tailoring or extending OS behavior for specific workload needs – without reinventing a whole OS from scratch. Below, we discuss how kernel customizations can target the bottlenecks identified, and how eBPF in particular provides a flexible toolset for optimizing LLM inference in the OS.
- Optimizing Scheduling with eBPF and Custom Policies: One of the most direct OS-level optimizations is customizing how the CPU scheduler treats AI inference tasks. Rather than relying solely on the default scheduler, frameworks like Google’s ghOSt provide an API for controlling Linux scheduling from user-space processes or eBPF programs (8). In ghOSt, one can implement custom scheduling policies (even the entire scheduler logic) at the user level, or use eBPF hooks in the kernel to intercept scheduling events. This means an AI-serving system could, for example, ensure that an LLM inference thread is never preempted during critical sections, or that it gets priority on specific cores. ghOSt allows multiple “agents” (which can be user-space daemons or eBPF programs) to influence kernel scheduling decisions in a coordinated way (8). By fine-tuning time slices, core affinities, or load balancing decisions via such mechanisms, the scheduler can be taught to favor AI workloads under certain conditions (for instance, reserving cores for inference only, or accelerating wake-ups for latency-critical inference requests). The openEuler Linux distribution recently introduced a programmable scheduling framework based on eBPF, which lets the kernel scheduler dynamically extend its policies to meet different workload performance requirements (9). This kind of eBPF-based scheduling extension can be used to implement AI-specific policies – for example, making scheduling decisions that minimize interference for a high-QoS (quality of service) AI service. In summary, kernel extensibility via eBPF allows injecting domain-specific knowledge (like LLM throughput vs. latency trade-offs) into scheduling, beyond what a one-size-fits-all OS would normally do.
- Memory Management and eBPF for Paging/Prefetching: Another fruitful area is using custom kernel code to handle memory paging and prefetching in a smarter way for LLMs. The access pattern of LLM inference (especially if using techniques like memory mapping model weights or swapping between CPU and GPU memory) can be predicted or guided with additional knowledge. eBPF programs can hook into kernel events such as page faults or swapping decisions. For example, an eBPF program could monitor page fault rates on model memory and trigger prefetching of certain pages (or promote them to higher-speed memory) before they are needed. Research prototypes like FetchBPF have used eBPF to implement customizable prefetching policies in Linux ([PDF] FetchBPF: Customizable Prefetching Policies in Linux with eBPF). In the context of AI, one could imagine an eBPF program that recognizes when an inference process sequentially scans model layers and proactively faults in the next layer from disk or host memory, avoiding stalls. Similarly, eBPF could be used to implement a smarter swap strategy for GPU memory oversubscription, perhaps by integrating with GPU driver events. (In fact, NVIDIA’s Unified Memory system already does on-demand page migration between host and GPU, but one could complement it with kernel plugins that, say, lock critical pages in GPU memory or batch migrations to amortize overhead.) The key idea is that standard OS paging might be suboptimal for LLMs, but with kernel programming we can tailor it – whether through eBPF or kernel modules – to reduce page faults and improve data locality. For instance, using larger page sizes is one known practice: larger pages (hugepages) mean fewer total page faults and TLB misses. This was observed on IBM Power9 systems, where the default 64KB page size moved larger chunks of memory per fault, yielding better throughput than the 4KB pages on x86 (4). A custom kernel could enforce hugepage use for model memory or use transparent huge pages more aggressively. In summary, kernel-level tweaks like custom page fault handlers, prefetchers, or specialized allocators (possibly informed by eBPF-collected metrics) can mitigate memory bottlenecks.
- Using eBPF for Monitoring and Adaptive Optimization: eBPF isn’t only for direct control; it’s also a powerful observability tool. We can attach eBPF programs to tracepoints or kprobes to gather fine-grained metrics on inference workloads – without significant overhead or modifying user code. For example, one could trace system calls made by the inference process, measure their latency, or count page fault events in real time. Such data helps identify where the OS is bottlenecking. A research initiative called BPFTime suggests adding LLM-specific observability metrics via eBPF, to collect exactly the information needed to optimize these workloads (10: Possible ideas for the future - eunomia). By gathering metrics on cache misses, context switches, I/O wait times, etc., an intelligent controller (perhaps another eBPF program or a user-space agent) could dynamically adjust system parameters. For instance, if eBPF monitoring shows a spike in page faults when a model’s context window grows, the system might decide to lock those pages in memory (using mlock via a syscall) or increase the priority of the disk read-ahead thread. Or if it observes that an inference process is frequently yielding the CPU (e.g. waiting on GPU I/O), an eBPF program could temporarily give other tasks more CPU to improve overall utilization, then switch back. This kind of feedback-driven optimization leverages kernel programmability: we can create closed-loop controls where eBPF programs both measure and act upon kernel events relevant to AI performance (a minimal page-fault-counting sketch follows this list).
- Custom Kernel Modules for AI Scheduling and Isolation: Beyond eBPF (which is restricted in complexity by design), full custom kernel modules or patches can be deployed for AI tasks. For example, one might implement a new Linux scheduling class specifically for deep learning inference – perhaps a variant of the real-time scheduler that time-slices based on inference micro-batches or token generation steps. Modules could also improve isolation: e.g., a kernel module could enforce that an AI process gets exclusive access to certain cores or devices when active, and gracefully relinquishes them when idle. This is akin to how HPC systems use Linux’s cgroups or CPU sets to isolate jobs; a tailored module could do this dynamically in response to load. Another example is modifying the Linux memory manager to be NUMA-aware for model allocations – ensuring that if a CPU is doing part of the inference, the memory for that task stays on the local NUMA node to reduce access latency. (This yields measurable gains; keeping memory on the same NUMA node as the CPU core can improve performance by 5–10% compared to cross-node memory access (11: Optimization Practice of Deep Learning Inference Deployment on Intel®...).) Some research OSes have even proposed bypassing the kernel altogether for certain operations (user-space networking stacks, user-space drivers, etc.) to avoid kernel overhead. While that’s a broad technique, it can be applied in inference serving: e.g., using user-space I/O (like DPDK for networking) so that outgoing token streams don’t incur kernel context switches.
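As one concrete illustration of the monitoring role described in the observability bullet above, the following libbpf-style sketch counts user-space page faults per process so a user-space agent can notice paging pressure on an inference PID. The tracepoint name exceptions/page_fault_user is an x86-64 assumption, and the loader that would read the map and react (for example by calling mlock on hot regions) is not shown.

```c
/* pagefault_count.bpf.c - hedged sketch: count user page faults per PID so a
 * user-space agent can detect paging pressure on an inference process.
 * Assumes an x86-64 kernel exposing the exceptions:page_fault_user tracepoint
 * and a libbpf-based loader (not shown) that reads the map. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u32);    /* PID */
    __type(value, __u64);  /* cumulative fault count */
} fault_counts SEC(".maps");

SEC("tracepoint/exceptions/page_fault_user")
int count_user_faults(void *ctx)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 init = 1, *count;

    count = bpf_map_lookup_elem(&fault_counts, &pid);
    if (count)
        __sync_fetch_and_add(count, 1);
    else
        bpf_map_update_elem(&fault_counts, &pid, &init, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

A user-space agent could poll this map and, when the fault rate for the inference PID spikes, lock the model region in memory or adjust readahead.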
In summary, kernel-level programming (via eBPF or modules) provides a toolkit to surmount OS bottlenecks. By injecting AI-specific logic into the OS, we can significantly reduce overheads – scheduling delays, page faults, cache misses – that general-purpose OSes introduce. The next section will dive deeper into those overheads (system calls, context switches, etc.) and how they manifest in AI inference.
System Calls, Page Faults, and Kernel–User Space Interactions
Even with an optimized scheduler or memory manager, the boundary between user-space (where the model code runs) and kernel-space (where OS services run) is a critical juncture. System calls, context switches, and page faults each involve crossing that boundary, and each crossing carries performance costs that can impact LLM inference.
- System Call Overheads: A system call (such as reading input data, writing output, or allocating memory) switches the CPU from user mode to kernel mode to perform an OS service. This transition is much more expensive than a regular function call – it may involve saving CPU registers, switching to a kernel stack, invalidating certain CPU state (like memory mappings), and then later restoring everything to resume user code. Modern CPUs and OSes have optimized the path (using techniques like VDSO for certain calls), but it’s still on the order of hundreds or thousands of CPU cycles for each syscall, plus potential cache disruption (7) (7). For an LLM generating thousands of tokens, if each token write uses a syscall (e.g. writing to a socket or file) or if intermediate steps frequently call into the OS (for memory allocation, thread synchronization, etc.), those costs add up. In fact, frequent kernel-user transitions can interfere with the CPU’s instruction pipeline and caches, leading to poor caching behavior and extra latency beyond the direct measured cost of the call (7). For example, waking a thread waiting on I/O involves a syscall (to sleep) and an interrupt (to wake), plus scheduler overhead, and incurs many cache misses along the way (7). In systems with very fast hardware (like NVMe SSDs, 100Gb networks, or GPUs), these context switch overheads become a dominant factor (7). Therefore, minimizing system calls in the hot path of inference is a known best practice. Techniques include using asynchronous I/O (so calls can be batched or handled by the kernel with fewer wake-ups), memory-mapping files (to handle file I/O via page fault rather than explicit read/write calls), and reusing memory buffers (to avoid constant malloc/free calls). The fewer times we jump between user and kernel, the more the CPU can stay busy on actual model computation.
- Page Faults and Memory Access Overheads: A page fault occurs when a process accesses a virtual memory address that is not currently mapped to physical memory. This triggers a trap into the kernel’s page fault handler. In LLM inference, page faults can happen if the model or data is larger than available RAM (causing parts to be swapped to disk), or when using memory-mapped files (causing on-demand loading), or in GPU unified memory scenarios (triggering data transfer between host and GPU). Page faults are essentially unavoidable cache misses at the OS level – and they incur huge penalties because accessing disk or even just handling the fault takes orders of magnitude more time than normal memory access. For example, if an LLM model is partially on SSD, a page fault means the OS must fetch that page (4KB or 2MB, etc.) from storage, which could take milliseconds. Even in a GPU oversubscription case, servicing a page fault requires possibly evicting some GPU memory to host, and copying needed data from host to GPU, with many rounds of kernel involvement (4). The pattern of page faults greatly influences performance: if an algorithm causes random page faults all over a large file or memory region, the system will spend more time thrashing pages in and out than doing computation (4) (4: Improving GPU Memory Oversubscription Performance | NVIDIA Technical Blog). Conversely, sequential access patterns allow OS prefetchers to anticipate and load pages ahead of use. For LLM inference, key strategies are to avoid or hide page faults. This can mean using locked memory (preventing critical pages from being swapped out), using huge pages to reduce the number of total pages (thus fewer faults and TLB misses), and aligning data structures with page boundaries to improve locality. It can also involve manually pre-touching pages (e.g. reading through the model once at startup to force it into memory) or using madvise/mmap flags like MAP_POPULATE to have the OS pre-load pages (a minimal mmap/mlock sketch follows this list). Another strategy from systems like FlexGen and PagedAttention is to manage model paging at the application level: they carefully orchestrate which parts of the model reside in GPU, CPU, or disk at each step to minimize unplanned page faults. Nonetheless, when page faults do occur, their cost is very high – one miss can stall a thread for a million cycles or more if it has to go to disk. Therefore, understanding and measuring page fault frequency is important when profiling an inference system; a surprisingly slow throughput might be explained by even occasional faults that break the flow of data.
- Context Switches and Interrupts: In addition to deliberate syscalls and page faults, the kernel can interrupt a user process due to timer interrupts, hardware interrupts (e.g. network packets arriving), or to schedule another process. Each context switch, even between two user processes/threads, involves saving CPU state and loading another – which can cost on the order of a few hundred nanoseconds to a few microseconds, plus indirect costs of cache and TLB disturbances (7) (7: A Case Against (Most) Context Switches). In a busy environment, an inference thread might be context-switched out briefly and then resumed, which can disturb the continuity of data in CPU caches. Moreover, if an inference is running on multiple threads or uses GPU (which runs asynchronously), there may be frequent coordination points where one thread waits and another runs, causing context switches. One way to mitigate this is pinning threads to dedicated cores (so the OS ideally never context-switches them out for others, as long as a core per thread is available). Another is increasing the scheduling priority or using real-time scheduling, which reduces the chance of preemption by ordinary tasks. On the flip side, busy-waiting on synchronization (to avoid context switches) can lead to wasted CPU cycles, so a balance is needed. Modern OS kernels also employ techniques like interrupt coalescing (batching interrupts) and tickless kernels (to avoid frequent timer interrupts on idle cores) which can reduce unnecessary context switches. For LLM services, it is often recommended to isolate the CPU cores that run the model from those that handle interrupts or background tasks – essentially treating them as quasi real-time cores.
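To make the page-fault-avoidance strategies in the page-fault item above concrete, here is a minimal sketch that memory-maps a weights file, pre-faults it at load time, and locks it so the kernel cannot page it out mid-inference. The file name is a placeholder, and mlock assumes a sufficient RLIMIT_MEMLOCK (or the capability to raise it).

```c
/* Hedged sketch: map a weights file read-only, pre-fault it, and lock it in RAM.
 * "model.bin" is a placeholder path; mlock requires a sufficient RLIMIT_MEMLOCK. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("model.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* MAP_POPULATE asks the kernel to fault the whole file in up front,
     * turning many small demand faults into one bulk read at startup. */
    void *weights = mmap(NULL, st.st_size, PROT_READ,
                         MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint sequential access so readahead stays aggressive, then pin the
     * mapping so it is never swapped out during token generation. */
    madvise(weights, st.st_size, MADV_SEQUENTIAL);
    if (mlock(weights, st.st_size) != 0)
        perror("mlock (check RLIMIT_MEMLOCK)");

    /* ... run inference over the mapped weights here ... */

    munlock(weights, st.st_size);
    munmap(weights, st.st_size);
    close(fd);
    return 0;
}
```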
In essence, kernel-user interactions are expensive relative to the arithmetic our AI models perform. Every time we cross that boundary, we pay a cost in latency and lost CPU efficiency. A solid understanding of these costs informs why certain optimizations (like avoiding small syscalls in tight loops, or locking memory to avoid paging) can yield significant performance gains in practice. It also underlines why advanced techniques like eBPF or user-space networking exist: they try to eliminate some of these transitions by doing more work in one domain (e.g. handling a packet in kernel via eBPF instead of bouncing to user-space). For a research project, measuring how often and why your inference code enters the kernel (using tools like strace, perf, or eBPF tracepoints) can illuminate non-obvious slowdowns.
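As a lightweight starting point for that measurement, the sketch below uses getrusage to report how many page faults and context switches accumulate across a region of interest; wrapping it around a single request or a batch of token steps shows how often the kernel intervened.

```c
/* Hedged sketch: report kernel-crossing activity (faults, context switches)
 * around a region of interest, e.g. one inference request. */
#include <stdio.h>
#include <sys/resource.h>

static void report_kernel_activity(const struct rusage *before,
                                   const struct rusage *after)
{
    printf("minor faults:           %ld\n", after->ru_minflt - before->ru_minflt);
    printf("major faults (disk):    %ld\n", after->ru_majflt - before->ru_majflt);
    printf("voluntary ctx switches: %ld\n", after->ru_nvcsw  - before->ru_nvcsw);
    printf("involuntary switches:   %ld\n", after->ru_nivcsw - before->ru_nivcsw);
}

int main(void)
{
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);

    /* ... run one inference request or a batch of token steps here ... */

    getrusage(RUSAGE_SELF, &after);
    report_kernel_activity(&before, &after);
    return 0;
}
```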
Security and Isolation in AI Inference Workloads
When optimizing at the OS level, we must also consider security and isolation, especially as AI inference moves to cloud and multi-tenant environments. LLM inference often involves sensitive data (user queries) and proprietary models, so the OS must isolate workloads from each other and protect the system from any misbehavior of the AI process. Here we outline the main concerns and challenges:
- Multi-Tenancy and Isolation: In cloud or shared cluster scenarios, multiple users or applications might run inference on the same physical machine. The primary challenge is to ensure adequate isolation so that one tenant’s workload cannot interfere with or peek into another’s (12: Multi-Tenancy for AI Clusters: Enabling Scalability and Security). This includes isolating compute resources (CPU/GPU time), memory, and I/O bandwidth. The OS typically provides mechanisms like process isolation, cgroups/quotas, and namespacing (in Linux) to separate workloads. However, simply relying on OS process isolation may not be sufficient for performance isolation in AI workloads. Contention in shared resources (CPU caches, memory bandwidth, disk, network interfaces) can cause one model’s performance to degrade due to another’s activity, even if they are separate processes. Researchers note that “shared systems cannot rely on OS mechanisms for isolation between tenants” alone, and often require additional application-level resource management. For example, two inference processes sharing a GPU need cooperation beyond what the OS can enforce (since the GPU scheduling might be handled by its driver or hardware). Best practice in multi-tenant AI serving is to use containers or VMs to add stronger isolation boundaries around each model, and to employ scheduling policies that account for the unique needs of each workload (preventing a noisy neighbor from consuming all resources). The OS’s job is then to support these isolation boundaries efficiently.
- Memory and Data Security: Large models and their intermediate data reside in system memory – an OS must ensure that one process cannot read or corrupt another process’s memory. This is standard memory protection, which operating systems handle via virtual memory. The concern is if there are any vulnerabilities (e.g., a bug in a custom kernel module or in the GPU driver) that could be exploited to break this isolation. Additionally, when using shared hardware like GPUs, memory isolation is handled by the GPU’s memory management unit and drivers. There have been instances of side-channel attacks on GPUs where one process can infer details about another by measuring timing or resource usage, so high-security environments may demand strict scheduling or even dedicating GPUs to a single tenant at a time. Another vector is swap space or disk cache: if the OS swaps out model memory to disk, that data might persist on disk unless properly encrypted or wiped, creating a risk if the disk is shared or later re-used. Ensuring that sensitive model data doesn’t leak via OS-managed resources is important (for example, using encrypted filesystems for swap, or disabling swapping for confidentiality).
- System Calls and API Security: An inference server is typically a long-running process that might accept client connections, load model files, etc. This means it will make system calls that interact with the external environment (opening files, reading network sockets). Each of these interactions is a potential security risk – e.g., malformed input causing a buffer overflow in the model server, or a model file with unexpected format. While these are more application-level, the OS can provide mitigation like seccomp filters to restrict which system calls the process is allowed to make (reducing the impact if the process is compromised). For instance, an inference process likely doesn’t need to call exec() or manipulate system configuration, so a seccomp profile could lock it down to only networking and memory calls (a minimal seccomp sketch follows this list). Similarly, Linux capabilities can be dropped (e.g., no permission to load new kernel modules or use raw sockets) to limit what the process can do. If we use eBPF programs to assist the inference, note that loading eBPF requires privileges; care must be taken that only trusted code is run in kernel (the eBPF verifier helps by rejecting unsafe code, but a bug in the verifier could be disastrous).
- Running as Root vs. Least Privilege: Some OS-level optimizations must be performed in a privileged context – for example, setting real-time priorities, locking memory (via mlock), or using hugepages might require root privileges or specific capabilities. From a security standpoint, running the inference service with elevated privileges is risky: if an attacker finds a vulnerability in the model server or the model itself (perhaps through a malicious input designed to cause an overflow in a C++ runtime or a Python library), they could gain control of a privileged process, leading to a full system compromise. Therefore, a principle of least privilege is crucial. One practical way to reconcile this is: perform the privileged setup (memory locking, reserving hugepages, setting CPU affinity) at launch time under a controlled context, then drop privileges for the main serving loop. Containerization can help here by giving the container the specific rights it needs (via Linux capabilities) without making it fully root on the host. For research purposes, one might run experiments as root on a test machine (to freely tweak the kernel), but any deployment should carefully sandbox such modifications.
- Side-Channel and Microarchitectural Security: Although more of a hardware concern, it’s worth noting that certain OS decisions affect side-channel resistance. CPU features like simultaneous multithreading (SMT/Hyper-Threading) can enable side-channel attacks (e.g., cache timing attacks) between threads on the same core. For a highly secure inference environment, one might disable SMT or ensure that two different tenants never share the same physical core (the OS scheduler can enforce this if configured). Similarly, ensuring that the OS is patched for speculative execution vulnerabilities (Spectre, Meltdown) is important, because those could in theory allow inferring data from another process’s memory. Note that many of those patches (like kernel page table isolation) incur performance overhead. There is a trade-off: maximum performance might be achieved by turning off certain security mitigations, but that’s only acceptable in controlled offline environments, not multi-tenant scenarios. A research project should be aware of these trade-offs and at least document them if any OS hardening features are disabled for performance testing.
- Confidential Computing for Inference: An emerging approach to security is to run AI inference inside trusted execution environments (TEEs) or encrypted VMs to protect data and models even from the host OS itself. For example, Intel SGX or TDX, or AMD SEV, can keep model weights encrypted in memory and only accessible within an enclave. This adds additional OS-level challenges: standard OS memory management might not work the same (enclaves have limited secure memory and page faults in/out of enclaves are extremely costly ([PDF] Memory-Efficient Deep Learning Inference in Trusted Execution ...)), and scheduling an enclave thread has to preserve its security properties. While this is a bit tangential to OS optimization, it’s relevant to mention because a research initiative might consider whether the OS changes they propose are compatible with confidential computing. Running inference in an isolated VM (like AWS Nitro enclaves or similar) could be a way to achieve strong isolation at some performance cost. Container-based isolation is weaker but faster; VM-based isolation (or enclave) is stronger but introduces overhead (additional context switch or encryption cost). Thus, one must balance security needs with performance – often by running most heavy workloads on bare metal but isolating tenants by giving each their own dedicated hardware slice.
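To illustrate the seccomp idea referenced in the system-call bullet above, here is a minimal sketch using libseccomp; it assumes the library is available and that the listed syscalls are genuinely unneeded by the server. A production profile would normally be an allow-list derived from auditing the server's actual syscalls rather than a small deny-list like this.

```c
/* Hedged sketch: deny a few clearly unneeded syscalls in an inference server.
 * Build with -lseccomp. A real profile would be an allow-list derived from
 * tracing the server's actual system calls. */
#include <errno.h>
#include <seccomp.h>
#include <stdio.h>

static int install_inference_seccomp_profile(void)
{
    /* Default: allow everything, then forbid process spawning, tracing, and
     * kernel module loading. */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (!ctx)
        return -1;

    int denied[] = {
        SCMP_SYS(execve),
        SCMP_SYS(execveat),
        SCMP_SYS(fork),
        SCMP_SYS(ptrace),
        SCMP_SYS(init_module),
        SCMP_SYS(finit_module),
    };
    for (unsigned i = 0; i < sizeof(denied) / sizeof(denied[0]); i++) {
        if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), denied[i], 0) != 0) {
            seccomp_release(ctx);
            return -1;
        }
    }

    int rc = seccomp_load(ctx);  /* the kernel keeps the filter once loaded */
    seccomp_release(ctx);
    return rc;
}

int main(void)
{
    if (install_inference_seccomp_profile() != 0) {
        fprintf(stderr, "failed to install seccomp profile\n");
        return 1;
    }
    puts("seccomp profile installed; denied syscalls now return EPERM");
    return 0;
}
```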
In summary, OS-level optimizations must not compromise isolation and security. The ideal is to have performance isolation (each workload gets predictable performance unaffected by others) and security isolation (no workload can access or degrade others’ data or resources). Achieving both is non-trivial: stronger isolation (like VMs) can introduce performance overhead, while pure performance tuning (like disabling certain protections) can weaken security. Current best practices tend to use container orchestration with Kubernetes or similar to schedule AI jobs with quotas, and use monitoring to prevent any job from starving others. In research, if you develop a custom OS feature (say a new scheduler), testing it in a multi-tenant scenario and checking that one heavy inference doesn't steal CPU cycles from a neighbor beyond what’s intended is important. Additionally, any kernel changes should be scrutinized for security (e.g., avoid introducing an exploitable bug in a custom module).
OS Customization for AI: Best Practices and Emerging Research
Given the challenges above, there is a growing body of work on customizing and tuning operating systems specifically for AI and ML workloads. This ranges from simple configuration tweaks to whole new OS architectures. Here we highlight some best practices in use today, as well as emerging research directions:
- Tuning and Configuring Linux for AI: Many practitioners have a checklist of Linux settings to optimize performance for deep learning. These include enabling transparent huge pages (so the OS automatically uses 2 MB pages for large allocations, reducing page management overhead), or even using explicit huge pages for GPU memory buffers. Huge pages not only reduce the number of page faults, but also improve TLB (Translation Lookaside Buffer) efficiency – for memory-heavy workloads like LLMs, this can give a measurable boost (a hugepage allocation sketch follows this list). Other settings involve CPU scaling: disabling frequency scaling and C-states (to keep processors at a consistent high frequency), and pinning threads to specific cores (using taskset or cset shield on Linux). Isolating cores using the isolcpus kernel parameter or CPU sets ensures the OS doesn’t schedule random tasks on the cores reserved for inference. Additionally, turning off Turbo Boost in some cases can reduce jitter (more consistent performance at a slightly lower peak frequency is sometimes preferable for latency predictability). On the I/O side, using fast NVMe SSDs and high-performance filesystems (ext4 or XFS with proper mount options) for loading models is recommended, and if serving over the network, configuring adequate kernel network buffers and perhaps enabling features like busy polling to reduce latency. These are all configuration-level best practices that do not require new code but are important to get right.
- HPC and Real-Time Kernel Patches: The high-performance computing (HPC) community has long dealt with OS interference issues, and some of their solutions carry over to AI. For instance, the Linux RT (Real-Time) kernel patch can be applied to get more deterministic scheduling (preempt_rt patch makes the kernel more preemptible). HPC-oriented Linux distributions or configs (e.g., those used in supercomputers) often minimize OS background activities – running a daemonless or low-service OS on compute nodes (sometimes called a compute node kernel). This idea is appearing in AI clusters as well: specialized deployment where nodes running critical inference have a pared-down OS image to reduce noise. Some cloud providers are exploring stripped-down OS or unikernels for inference serving, especially at the edge, to maximize performance. An example is using AWS Firecracker microVMs to run models in a lightweight virtual machine that has a very small OS footprint (this improves security isolation and keeps the "OS noise" low, while still launching quickly like containers).
- ghOSt and Advanced Schedulers: We mentioned Google’s ghOSt framework earlier as a way to offload scheduling decisions to user space. This is cutting-edge research that effectively turns scheduling policy into something pluggable. With ghOSt, Google demonstrated implementing Linux’s default CFS scheduler in eBPF as a proof-of-concept, then showed you can implement completely different policies as well (8) (8: Google's Ghost Look Very Appealing For Kernel Scheduling From User-Space & eBPF Programs - Phoronix). For AI, one could implement a policy that, say, prioritizes inference tasks over background data-processing tasks during business hours, then switches at night – all without rebooting or changing the kernel binary, just by swapping out the ghOSt scheduling agent. This separation of mechanism (in kernel) and policy (in user-space or eBPF) is a trend that could benefit AI workload scheduling greatly. It allows rapid experimentation with scheduling algorithms tailored to AI (for example, scheduling based on GPU readiness or batching state). Academic research is also looking at learning-based scheduling, where an ML model (ironic, but yes) learns to predict the best scheduling decisions. With eBPF, such a model could run in a limited form in the kernel or guide a user-space scheduler.
- AI-optimized Operating Systems: There’s a vision of developing entire operating systems optimized for AI. While general-purpose OSes (Linux/Windows) are still the base, projects are investigating what an “AI-first” OS might look like. This could include first-class support for accelerators (better than today’s driver model), more direct user control over huge memory allocations, and perhaps new abstractions (like treating a sequence of GPU kernels and CPU ops as one schedulable unit to improve scheduling coherence). Some research has proposed treating neural network execution as an OS-managed pipeline, co-scheduling CPU and GPU together to minimize idle times. Others have looked at allocation of resources across distributed systems – effectively an OS for the datacenter that can allocate GPUs, storage, etc., to AI tasks on demand (Kubernetes does this at a higher level; research is ongoing to make it more fine-grained and efficient). There are also efforts like openEuler (by Huawei) which explicitly list support for LLM inference in their kernel features (9) (9) – indicating that mainstream OS vendors are paying attention to AI workload needs. In openEuler’s case, they highlight optimizations for frameworks like llama.cpp on CPUs (quantization support, optimized memory usage) and various scheduling/QoS improvements in the kernel (9) (9: key-features | openEuler documentation | v24.03_LTS).
- Kernel Bypass and User-space Networking/Storage: A relevant best practice from both HPC and high-frequency trading is bypassing the OS kernel for certain operations to cut down latency. For instance, using DPDK (Data Plane Development Kit) for networking can allow an inference server to send/receive data from the NIC directly in user-space, avoiding kernel network stack overhead. Similarly, using SPDK (Storage Performance Development Kit) allows user-space high-performance access to NVMe drives. If an inference service has extreme throughput requirements (say it streams high volumes of data), these techniques can help. They do require dedicating hardware (NIC/SSD) to the process and careful management of drivers. Another bypass approach is memory-mapped I/O combined with busy-wait polling, which can, for example, let a user-space thread poll a memory location that the NIC writes to (via RDMA or other mechanisms), again avoiding interrupts. These are advanced optimizations that trade CPU cycles for lower latency by avoiding kernel scheduling.
- Emerging Memory Systems (CXL and others): On the horizon is new hardware like CXL (Compute Express Link) which allows expansion of memory transparently across devices or even nodes. OS support for such technologies will be crucial for AI, as they enable having huge pools of memory (possibly slower, but large) accessible to CPUs/GPUs. This could change how inference engines handle models that don’t fit in VRAM: instead of swapping to SSD, they might swap to CXL-attached memory which is faster. The OS will manage this like NUMA memory or separate tiers. We might see kernel policies to automatically move rarely-used model weights to CXL memory and keep active ones in local memory. Research is ongoing in tiered memory management, some of which is inspired by AI workloads that have an order-of-magnitude difference in “hot” vs “cold” data usage.
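As a small companion to the tuning bullet at the top of this list, the sketch below tries to back a large model or KV-cache buffer with explicit 2 MB huge pages and falls back to a transparent-hugepage hint. It assumes a Linux system where huge pages have been reserved (e.g. via vm.nr_hugepages); the buffer size is an arbitrary placeholder.

```c
/* Hedged sketch: back a large weight/KV-cache buffer with 2 MB huge pages to
 * cut page-fault and TLB overhead; falls back to madvise(MADV_HUGEPAGE) if no
 * explicit huge pages are reserved. Size below is illustrative. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define BUF_SIZE (1UL << 30)  /* 1 GiB, a multiple of the 2 MB huge page size */

static void *alloc_model_buffer(size_t size)
{
    /* First try explicitly reserved huge pages (requires vm.nr_hugepages > 0). */
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf != MAP_FAILED)
        return buf;

    /* Fall back to normal pages and ask for transparent huge pages instead. */
    buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    madvise(buf, size, MADV_HUGEPAGE);
    return buf;
}

int main(void)
{
    void *buf = alloc_model_buffer(BUF_SIZE);
    if (!buf) {
        perror("mmap");
        return 1;
    }
    printf("allocated %lu bytes for model data at %p\n", BUF_SIZE, buf);
    munmap(buf, BUF_SIZE);
    return 0;
}
```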
Overall, current best practices revolve around tuning existing OS parameters, while research is pushing towards more flexible and intelligent OS behavior tailored to AI. For someone leading a research project, it’s wise to build on those best practices first (for a strong baseline) and then identify gaps that novel techniques can fill. For example, one might start by applying known tweaks (huge pages, CPU pinning, etc.), measure remaining bottlenecks, then prototype an eBPF program or custom scheduler to address a specific issue (say, uneven GPU utilization due to OS scheduling). The combination of established tuning and novel customization can yield impressive results – as seen in some case studies where throughput of LLM serving was improved significantly by eliminating scheduling inefficiencies (6) (6: MLSys @ WukLab - Can Scheduling Overhead Dominate LLM Inference Performance? A Study of CPU Scheduling Overhead on Two Popular LLM Inference Systems).
Practical Considerations for Implementing OS-Level Optimizations
When embarking on OS-level optimization as part of a research project, there are important practical points to keep in mind. Implementing kernel changes or eBPF programs is quite different from user-space programming, and ensuring that your optimizations are effective and safe requires careful planning:
- Measuring Baselines and Bottlenecks: Before changing anything, set up thorough profiling of the status quo. Use tools like perf (for CPU profiling, cache misses, branch misses), iostat/vmstat (for I/O and memory stats), and perfetto or NVIDIA Nsight if GPUs are involved. Identify where the time is going: Is the CPU fully utilized or often waiting? Are there many syscalls or page faults? Baseline measurements will guide you to the most fruitful areas to optimize. They will also provide a way to quantify improvements (or regressions) after you make OS-level changes. Without good measurements, OS tweaks amount to shooting in the dark.
- Incremental and Isolated Changes: OS-level changes can have wide-ranging effects, so try to make one change at a time and test. For example, if you want to experiment with a custom scheduler, you might start by isolating the inference process to certain CPUs and writing an eBPF program to monitor its scheduler events, before actually altering scheduling decisions. This way, if something goes wrong (e.g., performance drops or the system becomes unstable), you can pinpoint the cause. Working incrementally also means you’ll build understanding of the system piece by piece, which is valuable in research write-ups.
- Use of eBPF vs. Kernel Module: Decide whether your optimization can be done with eBPF or needs a full kernel module/patch. eBPF has the advantage that it doesn’t require rebuilding the kernel or even stopping the system – you can load and unload eBPF programs at runtime. It’s great for prototyping ideas like “what if we prioritize this process in the scheduler” or “let’s capture an event and adjust something.” However, eBPF is sandboxed and limited (the verifier restricts program complexity, allows only bounded loops, and pushes heavy logic and state out to user space via BPF maps or ring buffers). If your idea requires heavy logic or direct modification of core kernel code paths, you might need to write a kernel module or patch. That in turn means you’ll need a development environment for compiling the kernel and a way to deploy it (perhaps a VM for safety). For a research project, using eBPF is often a faster path to iterate on ideas; only dive into kernel patches if absolutely necessary.
- Stability and Reproducibility: Running custom kernel code can cause crashes if there’s a bug. It’s wise to test on non-production machines and have remote access or a fallback in case the system becomes unresponsive. Using virtual machines or containers (for eBPF, you can use unprivileged containers to some extent, but for loading programs you often need privileges) can contain any accidents. Also, document the exact kernel version and configuration you are using – OS optimizations might behave differently on different kernel versions. If you present research results, others might need to reproduce your environment. Where possible, use widely-available platforms (e.g., a mainstream Linux distribution, with your customizations on top) rather than a totally custom OS, so that your results generalize better.
- Interaction Effects: Keep in mind that various optimizations can interact. For example, if you pin a process to a CPU core (to avoid scheduler latency) but that core handles interrupts from a network card, you might actually get worse interference (because now your inference thread is getting interrupted by network IRQs it can’t escape). In this case, you’d also need to redirect interrupts (via the /proc/irq/*/smp_affinity settings) off that core. Similarly, enabling huge pages could backfire if the memory becomes too fragmented to allocate them – causing the OS to spend extra time compacting memory. So, watch out for side effects. The best approach is holistic: consider CPU, memory, and I/O together. Often an improvement in one area can surface a new bottleneck in another (classic example: speeding up CPU scheduling might make the workload now IO-bound, or vice versa). As you iterate, keep checking the whole system performance, not just one metric.
- Security and Permissions: If your research setup is in a shared lab environment or on cloud VMs, obtaining the ability to change kernel settings or load eBPF might require coordination. On a personal machine you have free rein, but on a shared cluster you may need admin rights. Always communicate with system administrators if you’re doing kernel experiments on shared infrastructure – inadvertently crashing a shared machine or opening a security hole would not be good! In cloud environments, features like eBPF may be restricted; however, some cloud providers allow custom kernel images if you use bare metal instances. Plan for the environment accordingly.
- Leverage Existing Frameworks: You don’t have to build everything from scratch. If your aim is to create a custom scheduler, consider using ghOSt (as it already provides a lot of infrastructure to intercept scheduling). If you want to manage memory, look at Linux’s cgroup v2 interface for memory (it allows setting memory limits, protections, NUMA policies, etc., which you could control programmatically). For networking and I/O, frameworks like io_uring (for asynchronous IO) provide a more efficient interface than traditional system calls – maybe your project can integrate that. By using existing tools, you reduce the amount of low-level code you need to write, and you can focus on the novel aspect (like the policy or strategy). The Linux kernel community and cloud providers are actively working on features for isolating workloads (for example, there’s ongoing development on better isolation for noisy neighbors, and new cgroup controllers). Keeping an eye on these developments can spark ideas or provide ready-made solutions to part of your problem.
- Evaluating Results Properly: When you implement an OS-level optimization, evaluate it under realistic conditions. It’s possible to inadvertently create a scenario that benefits a micro-benchmark but not a real workload. For example, if you evaluate CPU scheduling changes with a single-threaded, CPU-bound loop, you might see a huge win; but an actual LLM inference might be multithreaded and also waiting on GPU, in which case the CPU scheduler tweak might not matter as much. So, test with actual model inference runs, possibly with varying batch sizes or sequence lengths to see how your OS changes behave. Also, evaluate not just raw throughput but tail latency (95th/99th percentile latencies) if real-time response is important – OS optimizations often have the biggest effect on those tail cases (reducing jitter); a small percentile-measurement sketch follows this list.
- Documentation and Further Investigation: Finally, treat each finding as a piece of a bigger puzzle. OS-level behavior can be complex; you might fix one bottleneck and uncover another. Document everything: “We enabled huge pages and saw a 5% gain in throughput, but CPU utilization dropped, indicating we became IO-bound – next, we addressed IO by doing X…” This narrative is valuable for a research report. It also points to further investigation: maybe your project will solve some issues but not all, and that’s fine – noting what remains (e.g., “GPU utilization is still only 70%, likely due to framework overhead – could be a target for future OS-runtime co-design”) shows you understand the landscape. In particular, areas like kernel-user co-design, smarter interrupts, and integration with specialized hardware (DPUs, smart NICs) are ripe for future research, and you can suggest those as follow-ups.
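To support the evaluation advice in the “Evaluating Results Properly” item above, the sketch below records per-request latencies and reports median and tail percentiles; run_one_request() is a stand-in for an actual inference call, and the request count is arbitrary.

```c
/* Hedged sketch: measure per-request latency and report p50/p95/p99, since OS
 * tuning typically shows up in the tail rather than the mean.
 * run_one_request() is a stand-in for a real inference call. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NUM_REQUESTS 1000

static void run_one_request(void)
{
    /* Placeholder workload; replace with an actual model inference call. */
    usleep(1000);
}

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double percentile(const double *sorted, int n, double p)
{
    int idx = (int)(p * (n - 1));
    return sorted[idx];
}

int main(void)
{
    static double latency_ms[NUM_REQUESTS];
    struct timespec t0, t1;

    for (int i = 0; i < NUM_REQUESTS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_one_request();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        latency_ms[i] = (t1.tv_sec - t0.tv_sec) * 1e3 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

    qsort(latency_ms, NUM_REQUESTS, sizeof(double), cmp_double);
    printf("p50: %.3f ms  p95: %.3f ms  p99: %.3f ms\n",
           percentile(latency_ms, NUM_REQUESTS, 0.50),
           percentile(latency_ms, NUM_REQUESTS, 0.95),
           percentile(latency_ms, NUM_REQUESTS, 0.99));
    return 0;
}
```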
By considering these practical aspects, you increase the chances that your OS-level optimizations will yield meaningful, reproducible improvements and that your research insights will be applicable in real-world scenarios. The interplay between LLM inference and operating systems is an exciting frontier – with careful experimentation and design, there is ample opportunity to push the boundaries of performance while maintaining robust isolation and security.