
When CPU Noise Slows Down GPU Inference: Measuring Scheduler and IRQ Impact with eBPF

Quantitative eBPF tracing of CUDA kernel launches, scheduler context switches, and IRQs shows when CPU noise matters for GPU LLM inference and how CPU pinning recovers throughput.

GPU inference often looks like a GPU problem, but the CPU still sits on the critical path. It prepares inputs, launches CUDA kernels, manages synchronization, handles runtime calls, and shares cores with system work, interrupts, and other tenants. If that CPU-side launch path is delayed, the GPU can be left waiting even when the GPU kernels themselves are fast.

This post asks a concrete question: when an LLM inference workload is running on a GPU, how much do Linux CPU scheduling decisions and IRQ handling actually matter?

To answer it, we built an eBPF tracing tool, cuda_sched_trace, that records CUDA kernel launches, scheduler context switches, and hard/soft IRQ events with nanosecond timestamps. We then ran Qwen3 0.6B inference under clean and noisy-neighbor conditions: CPU load from stress-ng, network load from iperf3, disk load from fio, a combined heavy-load case, and a mitigation case using CPU pinning and priority adjustment.

The short version: in a clean environment, scheduler and IRQ overhead are small. Under production-like noisy-neighbor conditions, they can become very real. Combined CPU, network, and disk interference reduced throughput by 20.5%. In the CPU-contention scenario, where stress-ng alone cost 8.8% of throughput, simple CPU pinning with a priority boost reduced context switches by 96.3% and brought throughput back to within 1.9% of baseline.

Why CPU Scheduling Shows Up in GPU Inference

Modern GPU workloads, particularly LLM inference and training, require tight coordination between CPU and GPU execution. The CPU is responsible for:

  • preparing input data and kernel parameters
  • launching GPU kernels through CUDA APIs
  • managing memory transfers and synchronization

An interruption to that CPU-side workflow can delay GPU kernel submission. In the worst case, the GPU has available compute capacity but no new work to execute.

The motivation comes partly from Meta's work on sched_ext for AI training optimization, where production issues include "IRQs preempting our important tasks." Network interrupts (NET_RX/NET_TX) and block device interrupts can matter for large distributed training jobs, and custom scheduling policies can improve AI workload performance by 5-20%.

But the impact is workload-dependent. A single-node LLM inference loop is not the same as distributed training with all-reduce traffic. Before investing in custom scheduling, we wanted measurements that separate scheduler problems from normal application behavior.

The study has four goals:

  1. Measure the baseline impact of CPU scheduling on GPU kernel launches.
  2. Characterize IRQ interference patterns and their performance cost.
  3. Quantify noisy-neighbor impact under CPU, network, disk, and combined load.
  4. Evaluate how much CPU pinning and priority adjustment help.

Tracing the Launch Path

We developed cuda_sched_trace, an eBPF-based tracing tool that combines CUDA API uprobes, Linux scheduler tracepoints, and IRQ tracepoints.

CUDA API Tracing

The tool attaches uprobes to CUDA Driver and Runtime APIs:

// Attach to CUDA Driver API
SEC("uprobe/cuLaunchKernel")
int trace_cuLaunchKernel(struct pt_regs *ctx) {
    // Capture: timestamp, pid, tid, grid/block dimensions, shared memory, stream
    // Mark process as GPU process for scheduler tracking
    return 0;  // BPF programs must return a value
}
 
// Attach to CUDA Runtime API
SEC("uprobe/cudaLaunchKernel")
int trace_cudaLaunchKernel(struct pt_regs *ctx) { ... }
 
SEC("uprobe/cudaDeviceSynchronize")
int trace_cudaDeviceSynchronize_enter(struct pt_regs *ctx) { ... }
 
SEC("uretprobe/cudaDeviceSynchronize")
int trace_cudaDeviceSynchronize_exit(struct pt_regs *ctx) { ... }

Scheduler Event Tracing

Scheduler activity is captured through sched_switch, filtered to GPU-related processes:

SEC("tp_btf/sched_switch")
int BPF_PROG(sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next) {
    // Only track if prev or next is a GPU process
    // Record: timestamp, prev/next pid, off-cpu/on-cpu duration
    return 0;  // BPF programs must return a value
}
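
The off-CPU duration mentioned in the comment can also be reconstructed offline from the recorded switch events. The following is a minimal Python sketch, not the tool's own code. It assumes each event is a `(timestamp_ns, prev_pid, next_pid)` tuple, which is a hypothetical layout chosen for illustration, and that `gpu_pid` identifies a single thread.

```python
from typing import Iterable, Tuple

# Hypothetical event layout: (timestamp_ns, prev_pid, next_pid)
SwitchEvent = Tuple[int, int, int]


def off_cpu_time_ns(events: Iterable[SwitchEvent], gpu_pid: int) -> int:
    """Sum the time gpu_pid spent switched out between sched_switch events."""
    total = 0
    switched_out_at = None
    for ts, prev_pid, next_pid in sorted(events):
        if prev_pid == gpu_pid and next_pid != gpu_pid:
            # The GPU thread was just taken off the CPU.
            switched_out_at = ts
        elif next_pid == gpu_pid and switched_out_at is not None:
            # The GPU thread is back on a CPU: close the off-CPU interval.
            total += ts - switched_out_at
            switched_out_at = None
    return total
```

An interval that is still open at the end of the trace is ignored, so the result is a lower bound.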

IRQ Tracing

Hard and soft IRQs are tracked through kernel tracepoints:

SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry, int irq, struct irqaction *action) {
    // Track hard IRQ entry, record IRQ number and handler name
    return 0;
}
 
SEC("tp_btf/irq_handler_exit")
int BPF_PROG(irq_handler_exit, int irq, struct irqaction *action) {
    // Calculate IRQ duration
    return 0;
}
 
SEC("tp_btf/softirq_entry")
int BPF_PROG(softirq_entry, unsigned int vec_nr) {
    // Track soft IRQ: TIMER, NET_RX, NET_TX, BLOCK, SCHED, RCU, etc.
    return 0;
}
 
SEC("tp_btf/softirq_exit")
int BPF_PROG(softirq_exit, unsigned int vec_nr) {
    // Calculate soft IRQ duration
    return 0;
}

The data path is straightforward: the GPU application issues CUDA calls; eBPF programs observe CUDA, scheduler, and IRQ events in kernel space; events are sent through a BPF ring buffer; analysis scripts parse the resulting CSV.

┌─────────────────────────────────────────────────────────────────┐
│                         User Space                               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │ GPU App     │    │ cuda_sched  │    │ Analysis Scripts    │  │
│  │ (qwen3.cu)  │    │ _trace      │    │ (Python)            │  │
│  └──────┬──────┘    └──────┬──────┘    └──────────┬──────────┘  │
│         │                  │                       │             │
│         │ CUDA calls       │ perf_event            │ CSV parsing │
│         ▼                  ▼                       ▼             │
├─────────────────────────────────────────────────────────────────┤
│                         Kernel Space                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │ uprobes     │    │ tracepoints │    │ BPF Ring Buffer     │  │
│  │ (CUDA API)  │    │ (sched,irq) │    │ (Event Queue)       │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
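
The last stage of that path, parsing the CSV, could look roughly like the sketch below. The column names `timestamp_ns`, `event`, and `pid` are assumptions made for illustration; the actual output schema of cuda_sched_trace is not shown in this post.

```python
import csv
import io
from collections import defaultdict


def load_events(csv_file):
    """Group trace rows by event type, each group sorted by timestamp.

    Assumes hypothetical columns: timestamp_ns, event, pid.
    """
    by_type = defaultdict(list)
    for row in csv.DictReader(csv_file):
        row["timestamp_ns"] = int(row["timestamp_ns"])
        row["pid"] = int(row["pid"])
        by_type[row["event"]].append(row)
    for rows in by_type.values():
        rows.sort(key=lambda r: r["timestamp_ns"])
    return dict(by_type)


# Small in-memory sample standing in for trace.csv.
sample = io.StringIO(
    "timestamp_ns,event,pid\n"
    "300,cuLaunchKernel,42\n"
    "100,cuLaunchKernel,42\n"
    "200,sched_switch,42\n"
)
events = load_events(sample)
print({name: len(rows) for name, rows in events.items()})
# {'cuLaunchKernel': 2, 'sched_switch': 1}
```

Sorting per event type matters because ring-buffer records from different CPUs are not guaranteed to arrive in timestamp order.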

Benchmark and Environment

The benchmark is Qwen3 0.6B LLM inference using qwen3.cu.

| Property | Value |
| --- | --- |
| Model | Qwen3-0.6B-FP32 |
| Task | Single-turn Q&A |
| Input | "What is eBPF?" |
| Output | ~30-50 tokens |
| Kernel Pattern | Burst submission (~950 launches per token) |
| GPU Memory | ~3 GB |

This benchmark is useful because it resembles modern LLM inference, mixes compute-bound and memory-bound kernels, shows a clear burst submission pattern, and produces a measurable throughput metric in tokens per second.

| Component | Specification |
| --- | --- |
| CPU | 24 cores (specific model TBD) |
| GPU | NVIDIA GPU with CUDA support |
| Memory | Sufficient for model + system |
| OS | Linux 6.15.11-061511-generic |
| Kernel | BTF-enabled for CO-RE eBPF |
| CUDA | Driver API + Runtime API |

We used three interference tools:

| Tool | Purpose | Configuration |
| --- | --- | --- |
| stress-ng | CPU load | --cpu 0 --cpu-method fft (all cores) |
| iperf3 | Network I/O | Server + Client, 10 parallel streams, 60s |
| fio | Disk I/O | randwrite, bs=4k, iodepth=32, 4 jobs |

The full experiment has six scenarios:

| Scenario | Description | Interference |
| --- | --- | --- |
| Baseline | Clean environment | None |
| Noisy CPU | CPU-intensive | stress-ng on all cores |
| Noisy Network | Network I/O | iperf3 localhost loopback |
| Noisy Disk | Disk I/O | fio random write |
| Heavy Load | Combined | CPU + Network + Disk simultaneously |
| Optimized | CPU pinning | stress-ng + taskset -c 0-3 + nice -n -10 |

Data collection follows the same pattern in every run:

# Start tracing
sudo ./cuda_sched_trace > trace.csv 2> trace.log &
TRACE_PID=$!
 
# Run benchmark
cd qwen3.cu
/usr/bin/time -v ./runcu Qwen3-0.6B-FP32.gguf -q "What is eBPF?" -r 1
 
# Stop tracing
sudo kill -SIGINT $TRACE_PID
 
# Analyze results
python3 analyze_scheduler_impact.py

Analysis Method

The central analysis compares consecutive CUDA kernel launches:

Launch_i -> [interval] -> Launch_i+1
 
Group A: Launches with NO context switch in interval (normal flow)
Group B: Launches with context switch in interval (preempted)
 
Preemption Penalty = median(Group B interval) - median(Group A interval)
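
A minimal Python sketch of this grouping is shown below. It assumes the launch and context-switch timestamps are already available as sorted lists of nanosecond values; the function name and input shape are illustrative, not taken from analyze_scheduler_impact.py.

```python
from bisect import bisect_right
from statistics import median


def preemption_penalty(launch_ts, switch_ts):
    """Split consecutive-launch intervals by whether a context switch fell inside.

    launch_ts, switch_ts: sorted timestamps in nanoseconds.
    Returns (group_a, group_b, penalty_ns); penalty_ns is None if a group is empty.
    """
    group_a, group_b = [], []
    for start, end in zip(launch_ts, launch_ts[1:]):
        # Number of switches with start < timestamp <= end.
        n_switches = bisect_right(switch_ts, end) - bisect_right(switch_ts, start)
        (group_b if n_switches else group_a).append(end - start)
    if not group_a or not group_b:
        return group_a, group_b, None
    return group_a, group_b, median(group_b) - median(group_a)
```

Binary search keeps the pass over tens of thousands of launches cheap, and using medians keeps a few multi-second gaps (such as pauses between tokens) from dominating the penalty.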

To compare runs of different lengths, scheduler and IRQ counts are normalized per 1,000 kernel launches:

Sched/1K = (Total Context Switches / Total Kernel Launches) x 1000
IRQ/1K = (Total IRQs / Total Kernel Launches) x 1000

Performance impact is reported as:

Slowdown % = (Baseline tok/s - Scenario tok/s) / Baseline tok/s x 100
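
These formulas translate directly into code. The sketch below uses hypothetical helper names and plugs in figures reported elsewhere in this post (592 context switches over 51,464 launches in the clean run, and 54.77 versus 43.56 tok/s for baseline versus heavy load).

```python
def per_thousand(event_count, kernel_launches):
    """Normalize an event count to events per 1,000 kernel launches."""
    return event_count / kernel_launches * 1000


def slowdown_pct(baseline_tok_s, scenario_tok_s):
    """Throughput loss relative to baseline, in percent (negative means faster)."""
    return (baseline_tok_s - scenario_tok_s) / baseline_tok_s * 100


print(f"{per_thousand(592, 51_464):.1f} switches per 1K launches")  # 11.5
print(f"{slowdown_pct(54.77, 43.56):.1f}% slowdown")                # 20.5%
```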

RQ1: Does CPU Scheduler Significantly Impact GPU Performance in Clean Environments?

The first question is whether scheduler preemption matters when the machine is otherwise clean.

Experiment design

  • Condition: clean system, no artificial interference
  • Metrics: context switch frequency, preemption penalty, total runtime impact
  • Analysis: launch-pair comparison with and without context switches

Results

| Metric | Value |
| --- | --- |
| Total Runtime | 79.5 seconds |
| Kernel Launches | 51,464 |
| Context Switches | 592 (7.44 Hz) |
| OFF-CPU Time | 7.88 ms (0.01%) |

Launch-pair analysis shows that almost every consecutive launch pair is unaffected by context switches:

| Group | Count | Percentage | P50 Interval | P90 Interval | P99 Interval |
| --- | --- | --- | --- | --- | --- |
| No Context Switch | 51,401 | 99.9% | 2 us | 4 us | 4 us |
| With Context Switch | 62 | 0.1% | 15.3 ms | 15.5 ms | 5.0 s |

The median preemption penalty is 15.3 ms. That is large for the affected pairs, but only 62 pairs were affected.

Tail-latency attribution confirms that most outliers are not caused by scheduler preemption:

| Percentile | Total Outliers | With Context Switch | Attribution |
| --- | --- | --- | --- |
| P95+ | 2,580 | 62 (2.4%) | 97.6% application |
| P99+ | 515 | 62 (12.0%) | 88.0% application |
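
One way to compute such an attribution is sketched below in Python. It selects the tail by rank (the longest intervals above the given percentile), which is an assumption: the post does not state exactly how the outlier sets were chosen, and the function name is hypothetical.

```python
def tail_attribution(intervals, had_switch, percentile):
    """Among the longest intervals above `percentile`, count those with a context switch.

    intervals:  launch-to-launch intervals
    had_switch: parallel list of booleans, True if a context switch fell in the interval
    Returns (tail_size, with_switch, with_switch_pct).
    """
    n_tail = max(1, round(len(intervals) * (100 - percentile) / 100))
    # Rank pairs by interval length, longest first, and keep the tail.
    order = sorted(range(len(intervals)), key=lambda i: intervals[i], reverse=True)
    tail = order[:n_tail]
    with_switch = sum(1 for i in tail if had_switch[i])
    return n_tail, with_switch, 100.0 * with_switch / n_tail
```

Selecting by rank rather than by a threshold value avoids inflating the tail when many intervals tie at the same few-microsecond value.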

The total scheduler impact is:

Impact = Affected Pairs x Penalty = 62 x 15 ms = 0.93 seconds
Percentage = 0.93 s / 79.5 s = 1.2%

Finding: in clean environments, CPU scheduler impact is minimal at 1.2%. The vast majority of kernel launch pairs, 99.9%, are unaffected by context switches. Tail latency mostly comes from application behavior such as token-generation boundaries, not scheduler preemption.

RQ2: What Is the Impact of IRQ Interrupts on GPU Performance?

The second question is whether IRQs directly interfere with the CPU-side launch path.

Experiment design

  • Condition: clean system with IRQ tracing enabled
  • Metrics: IRQ frequency, duration, type distribution
  • Analysis: IRQ time as percentage of total runtime

Results

| Metric | Value |
| --- | --- |
| Total Runtime | 4.99 seconds |
| Kernel Launches | 125,236 |
| Soft IRQs | 653 events |
| Hard IRQs | 0 events |

Soft IRQ type distribution:

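| Type | Count | Total Time | Avg Time | Max Time | Share of Events |
| --- | --- | --- | --- | --- | --- |
| TIMER | 317 | 0.77 ms | 2.4 us | 30.1 us | 49% |
| RCU | 291 | 0.40 ms | 1.4 us | 17.2 us | 45% |
| NET_RX | 30 | 0.13 ms | 4.5 us | 14.0 us | 4.6% |
| SCHED | 15 | 0.07 ms | 4.9 us | 18.9 us | 2.3% |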

Total IRQ impact:

Total IRQ Time: 1.38 ms
Percentage of Runtime: 0.0276%
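
A per-type summary like the one above can be derived by pairing softirq entry and exit events on each CPU. The Python sketch below assumes events arrive as `(timestamp_ns, cpu, kind, vec_nr)` tuples, a hypothetical layout; the vector names follow the kernel's softirq numbering.

```python
from collections import defaultdict

# Softirq vector numbers as defined by the Linux kernel.
SOFTIRQ_NAMES = ["HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
                 "IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU"]


def softirq_summary(events, runtime_ns):
    """Pair softirq entry/exit events per CPU and summarize time per vector.

    events: iterable of (timestamp_ns, cpu, kind, vec_nr), kind in {"entry", "exit"}.
    Returns (per-vector stats, total softirq time as a percentage of runtime).
    """
    open_entries = {}  # (cpu, vec_nr) -> entry timestamp
    stats = defaultdict(lambda: {"count": 0, "total_ns": 0, "max_ns": 0})
    for ts, cpu, kind, vec in sorted(events):
        key = (cpu, vec)
        if kind == "entry":
            open_entries[key] = ts
        elif key in open_entries:
            duration = ts - open_entries.pop(key)
            entry = stats[SOFTIRQ_NAMES[vec]]
            entry["count"] += 1
            entry["total_ns"] += duration
            entry["max_ns"] = max(entry["max_ns"], duration)
    total_ns = sum(s["total_ns"] for s in stats.values())
    return dict(stats), 100.0 * total_ns / runtime_ns
```

Exit events without a matching entry (for example at the start of a trace) are dropped rather than guessed at.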

There are real reasons to worry about IRQs: direct handler time, cache pollution, CPU pipeline disruption, and delay accumulation on critical paths. But for this local inference workload, actual IRQ impact is small.

The reason is the workload shape. Qwen3 submits about 950 launches in a burst lasting less than 100 us, so IRQs rarely land inside the burst. Most IRQs happen between bursts during CPU compute. TIMER interrupts dominate and have a small cache footprint. There is little network I/O, so NET_RX appears only 30 times, and there are no hard IRQs from NVMe or SSD block-device interrupts.

Finding: IRQ impact is negligible for local LLM inference at 0.0276%. This does not mean IRQs never matter. Distributed training with network communication or on-the-fly data loading can see much higher IRQ impact, estimated around 5-20%.

RQ3: How Do Noisy Neighbors Affect GPU Performance?

The third question is the most production-relevant one: what happens when the GPU workload shares a machine with other CPU, network, and disk activity?

Experiment design

| Scenario | Interference | Purpose |
| --- | --- | --- |
| Baseline | None | Reference point |
| Noisy CPU | stress-ng (all cores) | CPU contention |
| Noisy Network | iperf3 (10 streams) | Network IRQ |
| Noisy Disk | fio (4 jobs, randwrite) | Block IRQ |
| Heavy Load | All three combined | Production simulation |
| Optimized | CPU stress + taskset + nice | Mitigation test |

Results

Normalized metrics per 1,000 kernel launches:

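| Scenario | Launches | Sched/1K | Soft IRQ/1K | Hard IRQ/1K | IRQ Time (ms) |
| --- | --- | --- | --- | --- | --- |
| Baseline | 56,882 | 22.8 | 5.8 | 0.0 | 0.62 |
| Noisy CPU | 61,184 | 11,932.8 | 6.4 | 0.0 | 0.33 |
| Noisy Network | 154,394 | 6.0 | 2.7 | 0.0 | 0.92 |
| Noisy Disk | 126,670 | 29.3 | 3.9 | 0.1 | 1.03 |
| Heavy Load | 99,424 | 6,044.6 | 2.4 | 0.0 | 0.37 |
| Optimized | 108,984 | 445.2 | 2.8 | 0.0 | 0.71 |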

Performance impact:

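| Scenario | tok/s | Runtime (s) | Slowdown | Context Switches vs. Baseline |
| --- | --- | --- | --- | --- |
| Baseline | 54.77 | 3.00 | - | 1x |
| Noisy CPU | 49.93 | 4.15 | 8.8% | 524x |
| Noisy Network | 53.23 | 7.22 | 2.8% | 0.26x |
| Noisy Disk | 54.95 | 5.60 | -0.3% | 1.3x |
| Heavy Load | 43.56 | 6.97 | 20.5% | 265x |
| Optimized | 53.75 | 5.10 | 1.9% | 19.5x |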

Scenario Analysis

Noisy CPU (stress-ng) causes the most direct scheduling pressure. Context switches increase 524x, from 22.8 to 11,932.8 per 1,000 launches, and throughput drops by 8.8%. The mechanism is simple: the kernel's fair scheduler (EEVDF on this 6.15 kernel, the successor to CFS) time-slices between the GPU process and the stress-ng workers.

Noisy Network (iperf3) behaves differently. Context switches per 1,000 launches actually decrease, from 22.8 to 6.0, likely because the network load changes CPU competition patterns. The normalized soft IRQ rate also falls (5.8 to 2.7 per 1,000 launches), but total IRQ time rises from 0.62 ms to 0.92 ms. Throughput drops only 2.8%. In this local setup, network I/O shows up as IRQ overhead rather than scheduler pressure.

Noisy Disk (fio) introduces the first hard IRQs, corresponding to block-device interrupts, but context switches remain low and throughput is effectively unchanged at -0.3% slowdown. Disk I/O has little impact on this workload.

Heavy Load (CPU + Network + Disk) is the worst case. Throughput drops by 20.5%, and scheduler events rise to 6,044.6 per 1,000 launches, a 265x increase over baseline. Interestingly, that is only 50.7% of the context-switch rate in the Noisy CPU case. The interference sources compete with each other, but their combined effect is still worst overall.

Heavy-load soft IRQ breakdown:

| Type | Count | Total Time | Avg Time |
| --- | --- | --- | --- |
| RCU | 213 | 217.4 us | 1.0 us |
| TIMER | 17 | 122.9 us | 7.2 us |
| SCHED | 5 | 33.3 us | 6.7 us |

Finding: noisy neighbors significantly affect GPU performance. Combined CPU, network, and disk interference causes 20.5% degradation. The signatures differ by source: CPU contention increases context switches, network I/O affects IRQ overhead, disk I/O introduces block interrupts with little throughput impact here, and combined load is worst due to cumulative effects.

RQ4: Can CPU Pinning Effectively Mitigate Scheduler Impact?

The fourth question is whether a simple deployment-level mitigation helps before reaching for a custom scheduler.

Experiment design

  • Baseline: Noisy CPU scenario with stress-ng on all cores
  • Optimized: same stress-ng load, but the GPU process runs with:
    • taskset -c 0-3 to pin it to cores 0-3
    • nice -n -10 to give it higher priority
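
The same two settings can be applied from a launcher script instead of the command line. The sketch below is a Linux-only illustration; `run_pinned` is a hypothetical helper, not part of the benchmark harness.

```python
import os
import subprocess


def run_pinned(cmd, cpus, niceness=0, **popen_kwargs):
    """Start cmd restricted to the given CPUs with an adjusted nice value (Linux only)."""
    def setup():
        # Runs in the child just before exec, so the whole process inherits it.
        os.sched_setaffinity(0, cpus)  # same effect as taskset -c
        if niceness:
            os.nice(niceness)          # negative values need root or CAP_SYS_NICE
    return subprocess.Popen(cmd, preexec_fn=setup, **popen_kwargs)


# Example matching the Optimized scenario (not executed here):
# proc = run_pinned(["./runcu", "Qwen3-0.6B-FP32.gguf", "-q", "What is eBPF?", "-r", "1"],
#                   cpus={0, 1, 2, 3}, niceness=-10)
# proc.wait()
```

Setting affinity in the child before exec means every thread the application later creates starts on the pinned cores, which is harder to guarantee when adjusting an already running process.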

Results

| Metric | Noisy CPU | Optimized | Improvement |
| --- | --- | --- | --- |
| Sched/1K | 11,932.8 | 445.2 | 96.3% reduction |
| tok/s | 49.93 | 53.75 | 7.6% improvement |
| vs. Baseline | 8.8% slower | 1.9% slower | Significant recovery |

CPU pinning and priority adjustment recover most of the lost throughput. But the optimized case still has 445.2 scheduler events per 1,000 launches, compared with 22.8 in the clean baseline. That is still 19.5x higher than baseline.

Complete elimination is hard because:

  1. stress-ng workers may still be scheduled on cores 0-3.
  2. System daemons and kernel threads cannot be fully excluded by taskset.
  3. IRQ affinity may still route interrupts to pinned cores.

For stronger isolation, the next steps are kernel-level isolation and IRQ placement:

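# 1. Isolate cores at boot: add these to the kernel command line
#    (for example via GRUB_CMDLINE_LINUX), then reboot
isolcpus=4-7 nohz_full=4-7
 
# 2. Bind GPU process to isolated cores
taskset -c 4-7 ./gpu_app
 
# 3. Steer IRQs away from GPU cores (run as root). A glob cannot be a
#    redirect target, so loop; some IRQs reject affinity changes.
for f in /proc/irq/*/smp_affinity_list; do
    echo 0-3 > "$f" 2>/dev/null || true
done
 
# 4. Alternatively, confine the process with a cpuset cgroup (libcgroup tools).
#    cpuset.cpus belongs to the cpuset controller, not cpu; cgroup v1 also
#    needs cpuset.mems (NUMA node 0 assumed here).
cgcreate -g cpuset:gpu_workload
cgset -r cpuset.cpus=4-7 gpu_workload
cgset -r cpuset.mems=0 gpu_workload
cgexec -g cpuset:gpu_workload ./gpu_app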
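
To check whether interrupts still land on the GPU cores after such changes, the counters in /proc/interrupts can be inspected. The Python sketch below parses that file's text; `irqs_on_cpus` is a hypothetical helper, and it assumes the CPU columns are contiguous starting at CPU0 (offline CPUs would shift the columns).

```python
def irqs_on_cpus(proc_interrupts_text, cpus):
    """Return {irq_label: (count_on_cpus, description)} for IRQs seen on the given CPUs."""
    lines = proc_interrupts_text.strip().splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    hits = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:n_cpus]
        if len(counts) < n_cpus or not all(c.isdigit() for c in counts):
            continue  # summary rows such as ERR/MIS carry a single counter
        on_target = sum(int(counts[c]) for c in cpus if c < n_cpus)
        if on_target:
            hits[label.strip()] = (on_target, " ".join(fields[n_cpus:]))
    return hits


# Usage on a live system:
# with open("/proc/interrupts") as f:
#     print(irqs_on_cpus(f.read(), cpus=[4, 5, 6, 7]))
```

The counters are cumulative since boot, so comparing a snapshot taken before a run with one taken after it shows which interrupts actually fired on the isolated cores during inference.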

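Finding: CPU pinning with a priority boost is highly effective against CPU contention. It reduces context switches by 96.3% and raises throughput by 7.6% over the unpinned Noisy CPU run, leaving a 1.9% gap to baseline. Closing that gap, or coping with heavier combined load, likely requires deeper isolation such as isolcpus, nohz_full, cpusets, and IRQ affinity management.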

What the Results Mean

The results point to four practical insights.

First, environment matters. Measured overhead ranges from a 1.2% scheduler impact in a clean environment to a 20.5% throughput loss under combined heavy load. Optimizing the scheduler on a quiet dedicated server may not be worth the complexity. On a shared host, it can be the difference between stable and degraded inference.

Second, workload shape matters. Qwen3 has bursty kernel submission, roughly 950 launches in less than 100 us per token burst. That shape makes it resilient to many IRQs because interrupts usually occur between bursts. A different workload with continuous network communication, streaming input, or tighter CPU-GPU handoff might behave differently.

Third, interference sources have distinct signatures:

| Interference | Primary Impact | Secondary Impact |
| --- | --- | --- |
| CPU | Context switches | None |
| Network | IRQ overhead | Slight scheduling |
| Disk | Hard IRQs | Minimal |
| Combined | All of above | Worst overall |

Fourth, simple mitigations work, but only up to a point:

  • CPU pinning: very effective, 96% context-switch reduction
  • Priority adjustment: helpful but limited
  • Full isolation: requires kernel configuration and IRQ affinity management

Comparison with Meta's sched_ext Findings

Our results differ from Meta's AI training observations because the workload is different.

| Aspect | Meta (AI Training) | Our Study (LLM Inference) |
| --- | --- | --- |
| Primary Issue | Network IRQ (NET_RX) | CPU scheduling |
| IRQ Impact | 5-20% | 0.03% (local inference) |
| Optimization | sched_ext layer | taskset + nice |
| Workload | Distributed training | Single-node inference |

The key difference is communication. Distributed training constantly exchanges data through all-reduce, making NET_RX a major bottleneck. Local inference has minimal network I/O, so the dominant issue under noise is CPU scheduling rather than network interrupts.

Limitations

There are several limits to this study:

  1. eBPF tracing itself adds 1-5% overhead.
  2. The tool only supports CUDA, not OpenCL or HIP.
  3. The trace does not include GPU-side execution timing, so it cannot directly measure actual kernel runtime.
  4. IRQ attribution is limited: the trace cannot always identify which process caused a given IRQ.
  5. The experiments use a single GPU and do not cover multi-GPU behavior.

Practical Recommendations

For production deployments:

| Environment | Recommendation | Expected Benefit |
| --- | --- | --- |
| Dedicated Server | No optimization needed | - |
| Shared Server (light) | taskset + nice | 5-10% improvement |
| Shared Server (heavy) | isolcpus + IRQ affinity | 15-20% improvement |
| Kubernetes | CPU limits + nodeSelector | Varies |

The decision tree is simple:

Is GPU workload latency-sensitive?
├── No -> No optimization needed
└── Yes -> Is server shared?
    ├── No -> Monitor only, optimize if needed
    └── Yes -> How heavy is colocated load?
        ├── Light -> taskset + nice
        └── Heavy -> isolcpus + dedicated cores
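
For teams that want to encode this tree in deployment tooling, it maps onto a small function. The sketch below is illustrative only; the function name and the "light"/"heavy" labels are assumptions, and what counts as heavy load has to be decided from measurements on the actual host.

```python
def recommend(latency_sensitive, shared_host, colocated_load="light"):
    """Return the mitigation suggested by the decision tree above."""
    if not latency_sensitive:
        return "no optimization needed"
    if not shared_host:
        return "monitor only, optimize if needed"
    if colocated_load == "heavy":
        return "isolcpus + dedicated cores"
    return "taskset + nice"


print(recommend(latency_sensitive=True, shared_host=True, colocated_load="heavy"))
# isolcpus + dedicated cores
```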

Conclusion

CPU scheduling and IRQ handling do not always matter for GPU inference, but they matter under the conditions where production systems often run: shared hosts, background load, and noisy neighbors.

The clean baseline shows minimal overhead: 1.2% scheduler impact and 0.03% IRQ impact. But combined CPU, network, and disk interference causes 20.5% throughput degradation. Under CPU contention, pinning with a priority boost cuts context switches by 96.3% and recovers most of the lost performance, but not all of it.

The practical lesson is to measure first. Use tracing to identify whether your workload is scheduler-bound, IRQ-sensitive, or mostly application-limited. Then choose the mitigation that matches the signature: CPU pinning for CPU contention, IRQ affinity for interrupt interference, I/O tuning for block-device pressure, and full CPU isolation when the workload is latency-sensitive and colocated load is heavy.

References

  1. Meta Platforms, Inc. "Accelerating AI Training with sched_ext." Linux Plumbers Conference 2025. https://lpc.events/event/19/contributions/2039/
  2. NVIDIA Corporation. "CUDA Driver API Reference." https://docs.nvidia.com/cuda/cuda-driver-api/
  3. Linux Kernel Documentation. "BPF Documentation." https://www.kernel.org/doc/html/latest/bpf/
  4. stress-ng. "A tool to load and stress a computer system." https://github.com/ColinIanKing/stress-ng
  5. iperf3. "A TCP, UDP, and SCTP network bandwidth measurement tool." https://github.com/esnet/iperf
  6. fio. "Flexible I/O Tester." https://github.com/axboe/fio