NUMA-Aware Resource Scheduling
- NUMA-aware scheduling is a technique that aligns task and memory placement with local resource characteristics to minimize remote access penalties.
- It employs heuristic, rule-based, and learning-driven methods to reduce cache thrashing, balance bandwidth, and mitigate resource contention.
- Empirical evaluations demonstrate substantial improvements in throughput, latency, and scalability across diverse many-core and disaggregated systems.
Non-Uniform Memory Access (NUMA)-Aware Resource Scheduling refers to a class of methodologies and systems that explicitly account for the spatial heterogeneity and memory-access locality characteristics of modern many-core, multiprocessor, and disaggregated hardware during the allocation, placement, and dynamic control of computational and memory resources. NUMA-aware scheduling seeks to minimize access latencies, maximize local bandwidth, and reduce harmful remote memory traffic, cache-thrashing, and resource contention, thereby improving end-to-end application throughput, latency, and system scalability in environments where memory and compute locality are non-uniformly distributed.
1. NUMA Architectures and Performance Implications
NUMA architectures are characterized by sets of cores and memory banks grouped into nodes (“NUMA domains”) that communicate with lower latency and higher bandwidth within a domain than across domains. Typical topologies include SMP systems with multiple sockets (each with local DRAM controllers), modern server CPUs with integrated memory controllers per socket, multi-chiplet accelerators, and disaggregated datacenter hardware where remote resources are accessible only with high penalty. Memory-access cost between domains $i$ and $j$ is typically captured by a latency matrix $L_{ij}$ and a bandwidth matrix $B_{ij}$, with $L_{ii} < L_{ij}$ and $B_{ii} > B_{ij}$ for $i \neq j$ (Memarzia et al., 2019, Rayhan et al., 5 Nov 2024, Lakew et al., 2 Jan 2025, Vivas et al., 25 Nov 2025, Choudhary et al., 3 Nov 2025).
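To make the cost model concrete, the sketch below (with illustrative, not measured, latency and bandwidth numbers) scores a candidate task placement against NUMA latency and bandwidth matrices in the spirit of the $L_{ij}$/$B_{ij}$ formulation above.

```python
# Sketch: scoring a task placement against a NUMA cost model.
# The latency/bandwidth numbers below are illustrative, not measured.
LATENCY_NS = [      # L[i][j]: access latency from node i to memory on node j
    [80, 140],
    [140, 80],
]
BANDWIDTH_GBS = [   # B[i][j]: sustainable bandwidth from node i to node j
    [100, 40],
    [40, 100],
]

def placement_cost(task_node, mem_node, bytes_accessed):
    """Approximate cost of a task on `task_node` touching `bytes_accessed`
    bytes resident on `mem_node`: per-access latency plus transfer time."""
    latency_penalty_ns = LATENCY_NS[task_node][mem_node]
    transfer_s = bytes_accessed / (BANDWIDTH_GBS[task_node][mem_node] * 1e9)
    return latency_penalty_ns, transfer_s

# Local placement (node 0 -> node 0) vs. remote (node 0 -> node 1):
print(placement_cost(0, 0, 1 << 30))   # lower latency, higher bandwidth
print(placement_cost(0, 1, 1 << 30))   # remote penalty on both dimensions
```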
The negative impact of NUMA-unaware scheduling includes increased memory-access latency, lower cache-coherence efficiency, bandwidth imbalance, reduced throughput, and higher tail-latency for remote-heavy workloads (Vivas et al., 25 Nov 2025, Rayhan et al., 5 Nov 2024).
2. Fundamental Principles in NUMA-Aware Scheduling
NUMA-aware resource scheduling is guided by the following central principles:
- Locality Maximization: Tasks/threads/containers/virtual machines should be placed, whenever feasible, near the physical memory holding their dominant working set, maximizing the local access ratio (LAR), i.e., the fraction of memory accesses served from the local node (Memarzia et al., 2019, Liu et al., 3 Nov 2024, Lim et al., 2021).
- Bandwidth Balancing: Hot data and compute regions (“slices,” “clusters”) should be distributed across NUMA nodes to prevent overloading individual memory controllers or network links (Rayhan et al., 5 Nov 2024).
- Contention Avoidance: Scheduling must recognize and mitigate resource contention, both for compute (core/core and socket/socket) and for bandwidth (DRAM/IMC or interconnect link) (Lakew et al., 2 Jan 2025).
- Adaptive Response: Workload and interference profiles are dynamic; reactive or learning-based approaches are favored for runtime adaptation (Chasparis et al., 2018, Liu et al., 3 Nov 2024, Abduljabbar et al., 2021).
- Hardware-Awareness: The scheduler must discover or be provisioned with the actual hardware topology: the mapping of cores to sockets/nodes, interconnect graph, per-node memory/caches/bandwidth (Choudhary et al., 3 Nov 2025, Vivas et al., 25 Nov 2025, Tahan, 2014).
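As a concrete instance of the hardware-awareness principle, the following minimal sketch (Linux-only, with minimal error handling) discovers the core-to-node mapping that the kernel exposes under /sys/devices/system/node.

```python
# Sketch: discover NUMA topology from Linux sysfs (Linux-only).
import glob
import os
import re

def discover_numa_topology():
    """Return {node_id: set_of_cpu_ids} from /sys/devices/system/node."""
    topology = {}
    for node_dir in glob.glob("/sys/devices/system/node/node[0-9]*"):
        node_id = int(re.search(r"node(\d+)$", node_dir).group(1))
        with open(os.path.join(node_dir, "cpulist")) as f:
            cpulist = f.read().strip()          # e.g. "0-15,32-47"
        cpus = set()
        for part in cpulist.split(","):
            if "-" in part:
                lo, hi = map(int, part.split("-"))
                cpus.update(range(lo, hi + 1))
            elif part:
                cpus.add(int(part))
        topology[node_id] = cpus
    return topology

if __name__ == "__main__":
    print(discover_numa_topology())
```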
3. Algorithmic and System Approaches
A spectrum of approaches has been developed and deployed:
3.1. Heuristic and Rule-Based Scheduling
- Thread and Memory Binding: Explicitly pins threads to cores and memory pages to nodes based on detected or specified access patterns (Memarzia et al., 2019, Lim et al., 2021). Thread placement can follow dense or sparse policies (packing vs. spreading across nodes); a minimal binding sketch follows this list.
- User-Level and Kernel-Level Schedulers: User-space solutions monitor per-process NUMA statistics (/proc, /sys) and conduct migration or rebinding without kernel modification; kernel-level extensions (e.g., Phoenix) integrate CPU scheduling with page allocation and direct page-table migration/replication for further optimization (Siavashi et al., 15 Feb 2025).
- Smart Thread and Task Scheduling in Runtimes: Extensions to OpenMP/NANOS and work-stealing runtimes employ topology-aware thread binding, per-core priority equations, and NUMA-priority work-stealing victim lists (Tahan, 2014, Deters et al., 2018). These target co-location of tasks and data, and mitigate “work inflation” (Deters et al., 2018).
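Returning to the thread- and memory-binding bullet above, a minimal user-space sketch is shown below: it pins the current process to the cores of one node via os.sched_setaffinity and, if libnuma is present, sets a preferred memory node through ctypes. The node number and CPU set are illustrative placeholders.

```python
# Sketch: pin the current process to one NUMA node and prefer its memory.
# Assumes Linux and, for the memory policy, an installed libnuma.
import ctypes
import ctypes.util
import os

def bind_to_node(node, node_cpus):
    # 1) CPU affinity: restrict the scheduler to this node's cores.
    os.sched_setaffinity(0, node_cpus)          # 0 = current process

    # 2) Memory policy (best effort): prefer allocations from the same node.
    libnuma_path = ctypes.util.find_library("numa")
    if libnuma_path:
        libnuma = ctypes.CDLL(libnuma_path)
        if libnuma.numa_available() != -1:
            libnuma.numa_set_preferred(node)    # new pages come from `node`

# Example usage (CPU ids would normally come from topology discovery):
# bind_to_node(0, {0, 1, 2, 3})
```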
3.2. Learning- and Metric-Driven Scheduling
- Hardware Counter–Driven Policies: Low-level performance monitoring unit (PMU) statistics are used to drive decisions, including instructions-per-cycle, stall ratios, local/remote bandwidth, and DTLB miss rates (Rayhan et al., 5 Nov 2024, Liu et al., 3 Nov 2024, Chasparis et al., 2018). MAO (Baidu) applies lightweight PMU-based monitoring and uses an XGBoost regression model (“NUMA Sensitivity Model”) for workload placement and dynamic rebinding (Liu et al., 3 Nov 2024); a toy counter-driven policy is sketched after this list.
- Reinforcement Learning and Transformer-Based Models: P-MOSS frames NUMA scheduling for DBMS index “slices” as an MDP, training a Decision Transformer on PMU trajectories to jointly decide compute and memory placement (Rayhan et al., 5 Nov 2024). SPANE recasts cloud VM allocation as an MDP over multi-NUMA physical machines, enforcing invariance under permutations of symmetric NUMA nodes via a symmetry-aware dueling DQN to optimize VM wait time across dynamic workloads (Chan et al., 21 Apr 2025).
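The sketch below illustrates the counter-driven style of policy from the first bullet above. It is a toy model, not MAO's actual NUMA Sensitivity Model: the metrics, weights, and thresholds are hypothetical.

```python
# Sketch: a toy counter-driven placement policy (hypothetical weights/thresholds).
from dataclasses import dataclass

@dataclass
class PmuSample:
    ipc: float                  # instructions per cycle
    stall_ratio: float          # fraction of cycles stalled on memory
    remote_access_ratio: float  # remote DRAM accesses / total DRAM accesses
    dtlb_miss_rate: float       # DTLB misses per 1k instructions

def numa_sensitivity(s: PmuSample) -> float:
    """Higher score = workload suffers more from remote placement."""
    return (0.5 * s.stall_ratio
            + 0.3 * s.remote_access_ratio
            + 0.2 * min(s.dtlb_miss_rate / 10.0, 1.0))

def placement_decision(s: PmuSample) -> str:
    score = numa_sensitivity(s)
    if score > 0.6:
        return "bind"          # pin threads and memory to one node
    if score > 0.3:
        return "prefer-local"  # soft affinity, allow spillover
    return "interleave"        # insensitive or bandwidth-bound workload

print(placement_decision(PmuSample(ipc=0.8, stall_ratio=0.7,
                                   remote_access_ratio=0.5, dtlb_miss_rate=12)))
```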
3.3. Cost- and Model-Based Optimization
- Mathematical Formulation: Scheduling problems are cast as mixed-integer programs (MIPs) or ILPs, with binary assignment variables tracking thread/VM/core/memory placements, and cost functions that aggregate local/remote access penalties, contention measures, and migration or fragmentation terms (Lakew et al., 2 Jan 2025, Papp et al., 23 Apr 2024); a minimal formulation appears after this list.
- Multi-level and Moldable Schedulers: ARMS models per-task performance as a function of moldable resource partition size and NUMA domain, assigning tasks at runtime to local or wide domains to minimize total CPU×time cost, dynamically adapting to data locality and parallelism (Abduljabbar et al., 2021).
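A minimal version of the assignment formulation referenced in the first bullet can be written as follows; the notation is generic rather than taken verbatim from any one cited paper.

```latex
% Minimal NUMA placement ILP (generic notation):
%   x_{t,n} = 1 iff task t is placed on node n
%   f_{t,n'} = fraction of t's memory accesses served by node n'
%   L_{n,n'} = access latency from node n to node n'
%   c_t, m_t = core and memory demand of t; C_n, M_n = node capacities
%   mig_t(x) = migration penalty if t moves, weighted by lambda
\begin{align*}
\min_{x}\;\; & \sum_{t}\sum_{n}\sum_{n'} x_{t,n}\, f_{t,n'}\, L_{n,n'}
               \;+\; \lambda \sum_{t} \mathrm{mig}_t(x) \\
\text{s.t.}\;\; & \sum_{n} x_{t,n} = 1 \quad \forall t, \qquad
  \sum_{t} c_t\, x_{t,n} \le C_n, \quad
  \sum_{t} m_t\, x_{t,n} \le M_n \quad \forall n, \\
 & x_{t,n} \in \{0,1\}.
\end{align*}
```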
4. Implementation in Operating Systems, Runtimes, and Data Systems
NUMA-aware scheduling has been realized across several layers:
- Linux Kernel Extensions: Phoenix (Linux LKM) directly coordinates thread placement, allocates page-table pages on the home node, and triggers on-demand (coherent) replication of page-tables based on real-time PMU-detected page-walk cycles and memory traffic (Siavashi et al., 15 Feb 2025).
- Runtime Libraries and Task Schedulers: NUMA extensions to task-parallel runtimes (NUMA-WS, OpenMP/NANOS) maintain per-node “place” abstractions, NUMA-aware work-stealing protocols (biasing/restricted victim selection), and integrate task/data annotation via API (Deters et al., 2018, Tahan, 2014).
- Database and Data Analytics Systems: NUMA-aware scheduling for in-memory analytics often leverages black-box strategies (thread binding, memory interleaving, alternative allocators) and may further include spatial scheduling of data and compute, exploiting dynamic PMU data or learned models (e.g., P-MOSS) (Memarzia et al., 2019, Rayhan et al., 5 Nov 2024); a numactl launch sketch follows this list.
- Cloud Orchestration: At scale, per-container or per-VM placement decisions are fed back to higher-level schedulers via monitoring, profiling, and formal sensitivity models, e.g., MAO’s integration with Matrix scheduler at Baidu, or SPANE’s symmetry-preserving policy learning in dynamic cloud traces (Liu et al., 3 Nov 2024, Chan et al., 21 Apr 2025).
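The black-box strategies mentioned for analytics engines often reduce to launching the engine under an explicit NUMA policy. The sketch below wraps numactl (standard --cpunodebind, --membind, and --interleave flags) from Python; the worker command is a placeholder.

```python
# Sketch: launch an analytics worker under an explicit NUMA policy via numactl.
# `worker_cmd` is a placeholder; the numactl flags shown are standard.
import subprocess

def launch_bound(worker_cmd, node):
    """Pin the worker's CPUs and memory to a single NUMA node."""
    return subprocess.Popen(
        ["numactl", f"--cpunodebind={node}", f"--membind={node}", *worker_cmd])

def launch_interleaved(worker_cmd):
    """Spread the worker's pages round-robin across all nodes
    (often preferable for bandwidth-bound scans)."""
    return subprocess.Popen(["numactl", "--interleave=all", *worker_cmd])

# Example usage (hypothetical command):
# p = launch_bound(["./query_engine", "--threads=16"], node=0)
```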
5. Theoretical Guarantees, Limitations, and Performance Results
Rigorous analysis and empirical evaluation confirm that NUMA-aware scheduling can guarantee:
- Existence of Equilibrium and Bounded Sub-optimality: Dynamic pinning schemes and NUMA-aware work-stealing admit pure-strategy Nash equilibria whose makespan lies within a tight, core-count-dependent bound of the global optimum (Chasparis et al., 2018).
- Provable Scalability: NUMA-WS, by adhering to the work-first principle and biasing steals toward local nodes, matches the classic $O(T_1/P + T_\infty)$ execution-time bound of work stealing, while mitigating the work inflation that can otherwise degrade parallel efficiency by a factor of 2–5 (Deters et al., 2018).
- Empirical Gains:
| System/Method | Experimental Setting | Reported Gain |
|------------------------|-----------------------------------------------|---------------------------|
| P-MOSS | B-Tree index, Milan server, YCSB-A | Throughput improvement |
| MAO | Feed service (Baidu, 12,700+ production servers) | 12.1% lower latency, 9.8% CPU saved |
| Phoenix | Real servers, page-table replication | Fewer CPU cycles and fewer page-walk cycles |
| User-Level Scheduler | PARSEC suite, Dell PE-R910, 40 cores | Speedup |
| ARMS | MatMul/Stencil, dual-socket Xeon | Speedup |
| nFlows | HEFT/Min-Min, HPC node | Avoided 20 s–100 s of extra makespan in micro-benchmarks |
NUMA-aware algorithms are most beneficial in resource-constrained, memory-intensive, or interference-prone scenarios; their benefit is smaller but non-negative when resources are ample and the workload is compute-bound (Chasparis et al., 2018, Lakew et al., 2 Jan 2025, Memarzia et al., 2019, Vivas et al., 25 Nov 2025). However, page migration and overly frequent rebinding can cause performance regressions in dynamic contexts, so careful thresholding and hybrid heuristics are essential (Liu et al., 3 Nov 2024, Siavashi et al., 15 Feb 2025).
6. Advanced Topics and Heterogeneous Environments
NUMA-effect complexity extends to modern GPUs (chiplet-based), scientific workflows, and distributed clouds:
- GPU Chiplet NUMA: AMD MI300X provides private L2 caches per chiplet (XCD); a swizzled head-first mapping for attention kernels aligns the compute for each “head” with a NUMA domain, raising cache hit rates from 1% (naive) to 80–97% and offering up to 50% throughput gain over conventional scheduling (Choudhary et al., 3 Nov 2025); a simplified mapping sketch follows this list.
- Disaggregated and Multi-Tier NUMA: VM placement and memory mapping in multi-level, disaggregated clusters require algorithms aware of multiple latency bands across interconnects and node hierarchies; migration and isolation policies avoid cross-server remote hot-spots (Lakew et al., 2 Jan 2025, Vivas et al., 25 Nov 2025, Chan et al., 21 Apr 2025).
- Workflow Scheduling: Scientific workloads modeled as DAGs (nFlows) require extension of classic heuristics (Min-Min, HEFT) to account for NUMA-specific latency and bandwidth in their cost functions, shifting scheduling outcomes in both simulation and real-machine runs (Vivas et al., 25 Nov 2025, Papp et al., 23 Apr 2024).
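As a simplified stand-in for the chiplet-aware mapping described in the GPU bullet above (not the actual kernel from Choudhary et al.), the sketch below assigns attention heads to XCDs in contiguous blocks so that each head's work stays within one chiplet's L2 domain. The head and XCD counts are illustrative.

```python
# Sketch: block-wise ("head-first") assignment of attention heads to GPU chiplets,
# keeping each head's compute and cached data within one XCD's L2 domain.
# The real kernel-level swizzle is more involved; this is the mapping idea only.
def head_to_xcd(head_id: int, num_heads: int, num_xcds: int) -> int:
    heads_per_xcd = (num_heads + num_xcds - 1) // num_xcds   # ceiling division
    return head_id // heads_per_xcd

# 32 heads on 8 XCDs: heads 0-3 -> XCD 0, heads 4-7 -> XCD 1, ...
mapping = {h: head_to_xcd(h, 32, 8) for h in range(32)}
print(mapping)
```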
7. Best Practices and Open Challenges
NUMA-aware resource scheduling best practices, as distilled from the surveyed literature (Memarzia et al., 2019, Liu et al., 3 Nov 2024, Lakew et al., 2 Jan 2025, Siavashi et al., 15 Feb 2025, Choudhary et al., 3 Nov 2025, Rayhan et al., 5 Nov 2024, Vivas et al., 25 Nov 2025), include:
- Pin threads/containers/VMs to cores and local memory regions, balancing for both load and bandwidth.
- Monitor PMU (hardware performance) counters continuously, using key statistics (memory bandwidth, stall ratio, locality ratio, DTLB misses) to guide adaptation; a monitoring sketch based on kernel NUMA counters follows this list.
- Integrate NUMA affinity and performance feedback into global orchestration (scheduler) frameworks.
- For short-lived or bandwidth-bound workloads, prefer binding over page-migration.
- Employ lightweight or learning-based cost models, retrained periodically for changing workload mixtures.
- Explicitly handle page-table placement; replicate page tables only when page-walk overhead or the remote-walk fraction exceeds a cost threshold (Siavashi et al., 15 Feb 2025).
- Develop placement policies and scheduling algorithms that are robust to topology size and asymmetry.
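For the monitoring practice above, the kernel already exports usable counters. The sketch below computes a per-node locality ratio from the standard /sys numastat files; the 0.8 rebinding threshold is hypothetical.

```python
# Sketch: compute per-node locality ratios from Linux numastat counters.
# /sys/devices/system/node/node<N>/numastat exposes numa_hit, numa_miss,
# local_node, other_node, etc. The 0.8 threshold below is illustrative.
import glob

def locality_ratios():
    ratios = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/numastat"):
        stats = {}
        with open(path) as f:
            for line in f:
                key, value = line.split()
                stats[key] = int(value)
        local, remote = stats.get("local_node", 0), stats.get("other_node", 0)
        if local + remote:
            node = path.split("/")[-2]          # e.g. "node0"
            ratios[node] = local / (local + remote)
    return ratios

for node, ratio in locality_ratios().items():
    if ratio < 0.8:                             # hypothetical rebinding trigger
        print(f"{node}: locality {ratio:.2f} below threshold, consider rebinding")
```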
Open research directions include managing heterogeneity across nodes, integrating storage and network as first-class NUMA resources, supporting live migration and preemption, enforcing multi-tenant fairness, and robust adaptation to non-stationary or unpredictable workload shifts (Chan et al., 21 Apr 2025, Vivas et al., 25 Nov 2025).
Key References:
(Chasparis et al., 2018, Rayhan et al., 5 Nov 2024, Liu et al., 3 Nov 2024, Choudhary et al., 3 Nov 2025, Siavashi et al., 15 Feb 2025, Lakew et al., 2 Jan 2025, Vivas et al., 25 Nov 2025, Tahan, 2014, Abduljabbar et al., 2021, Memarzia et al., 2019, Lim et al., 2021, Deters et al., 2018, Papp et al., 23 Apr 2024, Chan et al., 21 Apr 2025)