Phoenix LKM: NUMA Optimization in Linux

Updated 24 December 2025
  • Phoenix (Linux LKM) is a kernel module that optimizes NUMA performance by integrating CPU scheduling and memory management to enable selective on-demand page-table replication.
  • It distinguishes page-table pages from data pages, enforcing home-node affinity for the former and dynamically adjusting replication based on configurable page-walk-ratio thresholds.
  • Leveraging Intel RDT for bandwidth QoS, Phoenix mitigates contention, reduces remote memory penalties, and achieves 1.5–2× performance gains in CPU and page-walk cycles.

Phoenix is a Linux loadable kernel module (LKM) implementing a coordinated approach to NUMA (Non-Uniform Memory Access) performance optimization, integrating both CPU scheduler and memory management functions. By directly distinguishing between data and page table pages and providing on-demand page table replication or migration, Phoenix allows efficient thread and memory placement, minimizing the overhead imposed by NUMA memory hierarchies. It also leverages Intel Resource Director Technology (RDT) for bandwidth quality-of-service enforcement, addressing contention and coherency bottlenecks left unaddressed by prior solutions such as eager page table replication. Benchmarks on commodity NUMA systems indicate substantial reductions in CPU and page-walk cycles compared to state-of-the-art techniques (Siavashi et al., 15 Feb 2025).

1. System Architecture and Kernel Integration

Phoenix is architected as a self-contained LKM, interfacing with Linux kernel v5.4 via scheduler and memory management subsystem hooks. Its principal entry points are scheduler hooks—enabling thread monitoring and pinning—and page-table allocator hooks—distinguishing page table allocation from data allocation. Phoenix’s internal layering introduces per-task and per-CPU state data, callback tables, and dedicated replication, scheduler, and MBA management modules.

Kernel Integration Block Diagram:

┌───────────────────┐
│   Linux Kernel    │
│  +---------------+│
│  │ CFS scheduler │◀─── scheduler-domain hooks
│  +---------------+│
│  │ mm subsystem  │◀─── page-table alloc hooks
│  +---------------+│
│  │ libpqos (RDT) │◀─── MBA throttling API
│  +---------------+│
└───────────────────┘
             ▲
             │
        Phoenix LKM
             │
     ┌───────┴────────┐
     │  per-task state│
     │  per-cpu state │
     │  callback ptrs │
     │  replication   │
     │  scheduler     │
     │  MBA manager   │
     └────────────────┘

Phoenix minimally patches the Linux scheduler to interpose a defined callback set (e.g., on_fork, on_exec, on_cs_in/out, on_scheduler_tick, on_rebalance) and installs replacements for all page-table allocation routines to enforce page table “home-node” affinity. Data structures per task include status flags, performance counters, home node IDs, allowed NUMA nodes, and page table replica pointers; per-CPU structures track last-level cache (LLC) misses and cycles.
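
A minimal sketch of how such a callback table and per-task/per-CPU state might be laid out is shown below; the struct and field names are illustrative assumptions, not Phoenix's actual identifiers.

#include <linux/sched.h>
#include <linux/nodemask.h>
#include <linux/types.h>

/* Illustrative callback table interposed on the scheduler hooks. */
struct phx_callbacks {
    void (*on_fork)(struct task_struct *child);
    void (*on_exec)(struct task_struct *task);
    void (*on_cs_in)(struct task_struct *task);           /* context-switch in  */
    void (*on_cs_out)(struct task_struct *task);          /* context-switch out */
    void (*on_scheduler_tick)(struct task_struct *task);
    void (*on_rebalance)(int src_cpu, int dst_cpu);
};

/* Illustrative per-task state tracked by the module. */
struct phx_task_data {
    unsigned long flags;                       /* status flags                  */
    u64           cycles_total;                /* sampled total CPU cycles      */
    u64           cycles_pagewalk;             /* sampled page-walk cycles      */
    int           home_node;                   /* node holding the page tables  */
    nodemask_t    allowed_nodes;               /* nodes eligible for replicas   */
    struct page  *pgd_replicas[MAX_NUMNODES];  /* page-table replica pointers   */
};

/* Illustrative per-CPU counters used for bandwidth estimation. */
struct phx_cpu_data {
    u64 llc_misses;
    u64 cycles;
};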

2. Page Table Management: Replication and Migration Policy

Linux’s default memory management does not distinguish page-table pages from data pages; Phoenix intercepts the allocators, ensuring all page-table pages are first allocated on the designated home node for the process:

pgd = orig__pgd_alloc();                   /* call the original top-level allocator      */
set_page_node(pgd_page, task->home_node);  /* enforce the task's home-node placement     */

Policy distinctions:

  • Data pages use existing first-touch/AutoNUMA heuristics.
  • Page-table pages use enforced home-node allocation with Phoenix oversight.

On-demand Replication Algorithm:

Phoenix continuously samples, per task, the total CPU cycles $C_{tot}$ and page-walk cycles $C_{pw}$. The page-walk ratio,

$$R_{pw} \equiv \frac{C_{pw}}{C_{tot}},$$

acts as a trigger. Given user-configurable thresholds $T_{rep}$ (e.g., 10%) and $T_{unrep}$ (e.g., 5%):

  • If replication is disabled and $R_{pw} > T_{rep}$, Phoenix enables page-table replication across allowed NUMA nodes.
  • If replication is enabled and $R_{pw} < T_{unrep}$, it is disabled again (a sketch of this hysteresis check follows below).
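
A concrete, minimal sketch of this hysteresis is shown next; the function name phx_should_replicate and the integer-percentage form are illustrative, with the thresholds matching the example values above.

#include <linux/types.h>
#include <linux/math64.h>

/* Decide whether page-table replication should be active, given the sampled
 * cycle counts and the current state. Integer percentages avoid floating
 * point in kernel context. */
static bool phx_should_replicate(u64 cycles_pagewalk, u64 cycles_total,
                                 bool currently_replicated)
{
    unsigned int rpw_pct;

    if (!cycles_total)
        return currently_replicated;                /* nothing sampled yet */
    rpw_pct = (unsigned int)div64_u64(cycles_pagewalk * 100, cycles_total);

    if (!currently_replicated && rpw_pct > 10)      /* T_rep:   enable replication  */
        return true;
    if (currently_replicated && rpw_pct < 5)        /* T_unrep: disable replication */
        return false;
    return currently_replicated;                    /* hold the current decision    */
}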

Replica Coordination:

/* For every allowed node other than the home node, copy each leaf page of the
 * page-table tree onto that node and link the copy into the replica set. */
for each node in mm->allowed_nodes {
    if (node == mm->home_node) continue;
    for each leaf pgd_page in mm->pgd_tree {
        struct page *new = alloc_page_on_node(node);   /* replica on 'node'        */
        copy_page(new, pgd_page);                      /* duplicate current state  */
        link_replicas(pgd_page, new);                  /* join the circular list   */
        mm->pgd_replicas[node] = new;
    }
}
mm->replicated = true;

Updates to page-table pages (e.g., via set_pte) propagate via a circular linked list among all active replicas, minimizing coherency overhead given the typically low frequency of updates.
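
As a rough illustration of that propagation path, the sketch below mirrors a single PTE store onto every linked replica; phx_next_replica() is a hypothetical accessor for the circular list, not an actual Phoenix function.

#include <linux/mm.h>
#include <asm/pgtable.h>

/* Mirror one page-table entry update onto all replicas of 'pt_page'. */
static void phx_propagate_pte(struct page *pt_page, unsigned long idx,
                              pte_t val)
{
    /* Walk the circular replica list, starting after the page that was just
     * updated, and write the new entry into the same slot of each replica. */
    struct page *cur = phx_next_replica(pt_page);    /* hypothetical helper */

    while (cur != pt_page) {
        pte_t *slot = (pte_t *)page_address(cur) + idx;

        WRITE_ONCE(*slot, val);        /* publish the update on this replica */
        cur = phx_next_replica(cur);   /* advance along the circular list    */
    }
}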

3. CPU Scheduling and Thread Placement

Phoenix fuses page table policies with thread scheduling to ensure NUMA locality. Through periodic sampling of hardware performance monitoring counters (PMCs) during on_cs_in/on_cs_out and on every scheduler tick, Phoenix dynamically updates $C_{tot}$, $C_{pw}$, and $R_{pw}$, activating or deactivating replication accordingly.
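
A rough sketch of how that sampling might feed the counters follows; phx_read_pmc_delta() and the event identifiers are assumed placeholders for the module's PMC plumbing, and phx_should_replicate() refers to the hysteresis sketch in Section 2.

/* Illustrative tick/context-switch sampling: accumulate PMC deltas into the
 * task's running totals and re-evaluate the replication decision. */
static void phx_sample_task(u64 *cycles_total, u64 *cycles_pagewalk,
                            bool *replicated)
{
    u64 d_cycles = phx_read_pmc_delta(PHX_EV_CORE_CYCLES);  /* total cycles     */
    u64 d_walks  = phx_read_pmc_delta(PHX_EV_WALK_CYCLES);  /* page-walk cycles */

    *cycles_total    += d_cycles;
    *cycles_pagewalk += d_walks;

    *replicated = phx_should_replicate(*cycles_pagewalk, *cycles_total,
                                       *replicated);
}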

Thread Placement Policy:

During process exec, Phoenix selects the “home_node” as the NUMA node with sufficient idle cores and the least estimated memory bandwidth (as measured by $LLC_{misses,node} \times CACHELINE\_SIZE$). Threads and their children are pinned to this node as resources allow, falling back to the next-closest node if overloaded. The NUMA rebalancer is hooked via on_rebalance, and policy enforcement suppresses cross-node migrations that would violate home node affinity.
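
A minimal sketch of such a selection pass is shown below; phx_idle_cores_on() and phx_llc_misses_on() are assumed helpers over Phoenix's per-CPU counters, not actual kernel or Phoenix APIs.

#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/types.h>

#define PHX_CACHELINE_SIZE 64

/* Choose the home node: among nodes with enough idle cores, pick the one with
 * the lowest estimated bandwidth demand (LLC misses x cache-line size). */
static int phx_pick_home_node(unsigned int cores_needed)
{
    int node, best = NUMA_NO_NODE;
    u64 best_bw = U64_MAX;

    for_each_online_node(node) {
        u64 bw_est;

        if (phx_idle_cores_on(node) < cores_needed)   /* not enough idle cores */
            continue;
        bw_est = phx_llc_misses_on(node) * PHX_CACHELINE_SIZE;
        if (bw_est < best_bw) {
            best_bw = bw_est;
            best = node;
        }
    }
    return best;   /* caller falls back to the next-closest node if none fits */
}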

4. Memory Bandwidth QoS: Intel RDT (MBA) Coordination

In multi-tenant/multi-task scenarios, memory bandwidth can be saturated by “noisy neighbor” workloads, impacting critical page-walk and replica update latency. Phoenix integrates with the Intel RDT MBA (memory bandwidth allocation) via libpqos to throttle memory-intensive, low-priority tasks:

  1. Per-CPU LLC misses are used to estimate per-node bandwidth utilization.
  2. If high-priority tasks experience page-walk ratio spikes, the antagonistic class receives maximum throttling (MBA rate set to 10%).
  3. Throttling is periodically relaxed when the protected workload is idle.

API Example:

/* Assign low-priority tasks to a low class of service, then apply the maximum
 * MBA throttle to that class (simplified call forms). */
pqos_alloc_assoc(COS_LOW, pid_low_prio_list);
pqos_ctrl_ptr->mba_set(COS_LOW, MBA_MAX_THROTTLE);

The hardware-assisted reconfiguration latency for MBA is on the order of tens of microseconds, ensuring negligible runtime impact.
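
For orientation, a user-space sketch of an equivalent flow through the public libpqos interface might look as follows; the call signatures should be checked against intel-cmt-cat's pqos.h, and the class-of-service number, core list, and function name are assumptions for illustration only.

#include <pqos.h>

/* Associate the low-priority cores with COS 1, then cap COS 1 at a 10% MBA
 * rate on the given socket. Error handling is minimal; this is a sketch. */
static int phx_throttle_low_prio(const unsigned *cores, unsigned n_cores,
                                 unsigned socket_id)
{
    struct pqos_mba req = { .class_id = 1, .mb_max = 10, .ctrl = 0 };
    struct pqos_mba actual;
    unsigned i;

    for (i = 0; i < n_cores; i++)
        if (pqos_alloc_assoc_set(cores[i], 1) != PQOS_RETVAL_OK)
            return -1;

    if (pqos_mba_set(socket_id, 1, &req, &actual) != PQOS_RETVAL_OK)
        return -1;

    return 0;
}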

5. Implementation and Design Trade-offs

Phoenix is implemented with under 200 lines of kernel patching and approximately 1,200 lines for the LKM itself. Integration is achieved by patching hooks into sched_fork, sched_exec, the task context-switch path, the scheduler tick, and the top-level NUMA rebalancer, and by replacing the internal page-table allocators. Key design choices include:

  • The use of static thresholds for $R_{pw}$; adaptive or machine learning–based controls may yield further improvements.
  • Replication induces additional DRAM usage (one replica per node per level), a trade-off justified by avoided remote page walks.
  • Fine-grained page table locks (ptl) minimize, but do not eliminate, lock contention compared to global locking.
  • MBA throttling is coarse-grained (“max throttle”) and conservative; more dynamic policies may increase efficacy.
  • Legacy application compatibility is preserved; workloads not encountering high $R_{pw}$ remain unaffected.

6. Experimental Evaluation and Quantitative Results

Phoenix was evaluated on dual Intel Xeon Gold 6142 servers (16 cores/socket, 2-way HT, 384 GB DDR4), running Ubuntu 20.04, with a suite of memory- and CPU-intensive benchmarks including Redis (75 GB KV store), GUPS (64 GB), XSBench (85 GB), BTree lookup (150 GB), Graph500 (113 GB), Wrmem (190 GB), Apache static server, and the STREAM benchmark. Phoenix’s performance was contrasted against vanilla Linux and Mitosis (an eager page-table replication solution).

Summary of Experimental Results:

Scenario                      | CPU Cycles vs. Mitosis                   | Page-Walk Cycles vs. Mitosis
------------------------------|------------------------------------------|-----------------------------
Baseline (no interference)    | Phoenix −27%                             | Phoenix −3%
Co-located interference       | Phoenix ×1.87                            | Phoenix ×1.58
Bandwidth throttled           | Phoenix ×1.95 (Linux) / ×2.51 (Mitosis)  | -
Overall gains (all workloads) | Phoenix ×2.09                            | Phoenix ×1.58

Primary performance determinants:

  • Phoenix avoids unnecessary replicas for low-TLB-pressure workloads.
  • Threads and page tables are co-located, eliminating remote NUMA memory traffic for page table walks.
  • MBA throttling protects page-walk latency from adversarial workloads.

These results establish Phoenix’s efficacy in simultaneously reducing CPU and memory subsystem contention, exceeding the performance of established techniques by factors of 1.5–2× (Siavashi et al., 15 Feb 2025).

7. Significance and Research Context

Phoenix demonstrates that orchestrated, cross-layer kernel interventions—minimally invasive but semantically holistic—enable avoidance of major NUMA-induced inefficiencies, particularly under data center–scale workload mixing. A notable methodological advance is the use of a lightweight, event-driven on-demand replication scheme, in contrast to static or eager strategies, coupled with bandwidth-quality enforcement at the hardware level. Identified trade-offs include the need for more adaptive policies and the potential DRAM cost of replication. The documented gains position Phoenix as a reference point for further studies in NUMA scheduling, kernel-level data structure placement, and operating systems QoS mechanisms (Siavashi et al., 15 Feb 2025).
