CXL-Aware Allocation Techniques

Updated 20 December 2025
  • CXL-aware allocation is a memory management approach that integrates CXL-attached devices with traditional DRAM to enhance resource utilization and performance.
  • It leverages NUMA abstractions, dynamic throttling, and real-time performance metrics to minimize latency and balance workload demands.
  • Experimental results demonstrate significant throughput improvements and cost savings across applications like ANNS search, LLM fine-tuning, and MoE inference.

Compute Express Link (CXL)-aware allocation describes a suite of hardware and software techniques designed to intelligently distribute and manage memory resources across heterogeneous memory systems that include CXL-attached devices. CXL offers a coherent, high-bandwidth interconnect for pooling, sharing, and expanding memory beyond traditional DRAM models, but its distinct latency and bandwidth characteristics, device topologies, and protocol features necessitate allocation strategies that maximize resource utilization, minimize contention, and exploit architectural advantages such as full-duplex channels and near-data processing.

1. Architectural Principles for CXL-Aware Allocation

CXL-attached memory, especially in tiered-memory and pooling scenarios, appears as one or more slow NUMA nodes or HDM/FAM domains. Allocation strategies must reconcile the following:

  • Latency and bandwidth separation: CXL memory (Type 3, HDM/FAM) typically incurs a ~2× latency penalty versus host DRAM (e.g., 170–250 ns vs. 80–140 ns (Liaw et al., 4 Jul 2025)) and operates at lower or shared per-host bandwidth due to PCIe link aggregation.
  • NUMA abstraction: CXL expands the system address space; OS-level allocators expose HDM and FAM as NUMA nodes and annotate each page with policy bits (see HMM-based Cohet (Wang et al., 28 Nov 2025)).
  • Coherency domain and memory windowing: Depending on CXL protocol generation, memory can be mapped with host-managed or device-coherent semantics, further affecting allocation region granularity and forwarding requirements (Jain et al., 4 Apr 2024, Sharma et al., 2023).
  • Device topology and pool sharing: Multi-host topologies, such as Octopus pods (Berger et al., 15 Jan 2025), use regular BIBD incidence structures to provide bounded device access, allowing scalable, cost-effective memory pooling but requiring balancing algorithms for optimal allocation.

The system's allocation policy must optimize for both throughput and tail latency, especially under diverse workload mixes of memory-bandwidth- and latency-sensitive applications.
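
To make the latency/bandwidth separation concrete, the following toy placement model is a minimal sketch only: the tier parameters reuse the representative ranges cited above, and the region names and sizes are hypothetical. It keeps latency-critical regions in DRAM and lets bandwidth-bound, latency-tolerant regions spill to the CXL node when DRAM capacity runs out.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    load_latency_ns: float   # idle load-to-use latency (~80-140 ns DRAM, ~170-250 ns CXL)
    bandwidth_gbps: float    # usable per-host bandwidth
    free_bytes: int          # remaining capacity

@dataclass
class Region:
    name: str
    size_bytes: int
    latency_sensitive: bool  # e.g. pointer-chasing index vs. streaming buffer

def place(regions, dram: Tier, cxl: Tier):
    """Illustrative greedy placement: latency-sensitive regions prefer DRAM;
    others go to whichever tier has the lower estimated streaming cost,
    subject to remaining capacity."""
    plan = {}
    # Handle latency-sensitive regions first so they get DRAM capacity.
    for r in sorted(regions, key=lambda r: r.latency_sensitive, reverse=True):
        if r.latency_sensitive:
            candidates = [dram, cxl]                      # prefer DRAM, spill to CXL
        else:
            candidates = sorted([dram, cxl],
                                key=lambda t: r.size_bytes / (t.bandwidth_gbps * 1e9))
        for tier in candidates:
            if tier.free_bytes >= r.size_bytes:
                tier.free_bytes -= r.size_bytes
                plan[r.name] = tier.name
                break
        else:
            raise MemoryError(f"no tier can hold {r.name}")
    return plan

if __name__ == "__main__":
    dram = Tier("DRAM", 100, 200, 64 << 30)               # hypothetical platform numbers
    cxl = Tier("CXL", 210, 60, 256 << 30)
    regions = [Region("hash_index", 48 << 30, True),
               Region("activation_buffer", 180 << 30, False)]
    print(place(regions, dram, cxl))                      # index -> DRAM, buffer -> CXL
```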

2. Formal Models and Optimization Problems

Several representative allocation models have been deployed:

  • Adjacency-aware placement: In Cosmos, clusters of vectors are assigned to CXL devices such that queries probing neighboring clusters are maximally spread out, thus exploiting device parallelism and reducing “adjacency penalty” (Ko et al., 22 May 2025). The objective:

$$\min_{x}\ \sum_{c\in C}\sum_{c'\in \mathrm{Adj}(c)} w_{c,c'} \sum_{d\in D} x_{c,d}\,x_{c',d}$$

subject to per-device and per-cluster constraints (a small sketch evaluating this objective appears at the end of this section).

  • Capacity and contention model for CPU offloading: For LLM fine-tuning, optimizer-critical tensors (parameters, gradients, optimizer state) are prioritized to remain in local DRAM; latency-tolerant activations are shunted to CXL memory. The optimization jointly minimizes transfer times and compute latency:

$$\min_{0\le x_i\le 1} \left\{ \frac{S_P x_P + S_G x_G + S_O x_O}{B_{\mathrm{DRAM}\to \mathrm{GPU}}} + \frac{S_P(1-x_P)+S_G(1-x_G)+S_O(1-x_O)}{B_{\mathrm{CXL}\to \mathrm{GPU}}/N_g} + \alpha[\cdots] \right\}$$

(Liaw et al., 4 Jul 2025)

  • Multi-headed pooling topology allocation: Octopus assigns host requests proportionally to all physically reachable devices, balancing load and maximizing granted memory per host, under device capacity constraints (Berger et al., 15 Jan 2025).
  • Expert selection for context-aware MoE inference: Pinning “hot” experts in GPU HBM while mapping cold experts to lower-precision execution on CXL-NDP devices using knapsack optimization based on prefill statistics (Fan et al., 4 Dec 2025).
  • Best-shot Interleaving and Regulated Tiering: Performance-derived metrics steer page-level interleaving ratios and migration rates, leveraging 12 PMU counters to predict slowdowns in CXL/NUMA mixes with ≥90% accuracy (Liu et al., 22 Sep 2024).

A common theme is that allocation is subject to multi-dimensional constraints (capacity, connectivity, latency, per-device utilization), and is solved either via lightweight greedy heuristics, one-pass proportional assignment, or cost-modeling from hardware counters and workload statistics.
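
For concreteness, the Cosmos-style adjacency objective above can be evaluated for any candidate assignment with a few lines of code. This is a sketch of the objective only, not the paper's implementation: the adjacency weights w_{c,c'} and the assignment are assumed inputs (in Cosmos they derive from centroid distances and the greedy heuristic of Section 3), and each adjacent pair is listed once.

```python
from collections import defaultdict

def adjacency_penalty(assignment, adj_weights):
    """Total weight of adjacent cluster pairs co-located on the same device,
    i.e. the sum over pairs (c, c') of w_{c,c'} * x_{c,d} * x_{c',d}.

    assignment  : dict cluster -> device (x_{c,d} = 1 iff assignment[c] == d)
    adj_weights : dict (c, c') -> w_{c,c'}, one entry per adjacent pair
    """
    return sum(w for (c, c2), w in adj_weights.items()
               if c in assignment and assignment[c] == assignment.get(c2))

def per_device_load(assignment, cluster_sizes):
    """Per-device load, used to check the capacity constraint."""
    load = defaultdict(int)
    for c, d in assignment.items():
        load[d] += cluster_sizes[c]
    return dict(load)

if __name__ == "__main__":
    # Toy instance: four clusters, two devices, hypothetical adjacency weights.
    adj = {(0, 1): 0.9, (1, 2): 0.7, (2, 3): 0.4, (0, 3): 0.2}
    sizes = {0: 100, 1: 80, 2: 120, 3: 60}
    spread = {0: "dev0", 1: "dev1", 2: "dev0", 3: "dev1"}   # neighbors split across devices
    packed = {0: "dev0", 1: "dev0", 2: "dev0", 3: "dev0"}   # everything on one device
    print("spread:", adjacency_penalty(spread, adj), per_device_load(spread, sizes))
    print("packed:", adjacency_penalty(packed, adj), per_device_load(packed, sizes))
```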

3. Allocation Algorithms and Heuristics

CXL-aware allocation algorithms use workload, application, and system feedback to adapt placement:

  • Cosmos adjacency-aware heuristic: Greedy assignment ranks clusters by size, chooses devices minimizing the adjacency penalty (based on centroid distances and monotonic penalties), and breaks ties by available capacity (Ko et al., 22 May 2025). Complexity is O(NMK) for N clusters, M devices, and K nearest neighbors.
  • Greedy-Balance for Octopus: Allocates each host's request across all connected devices in proportion to available capacity, requiring only O(m) time per host (for m reachable devices) and preventing device hotspots without needing complex LP solutions (Berger et al., 15 Jan 2025); a proportional-split sketch appears at the end of this section.
  • Dynamic demotion/promotion (TPP): Transparent kernel policies use active/inactive LRU tracking and NUMA hint faults to demote cold pages to CXL nodes and promptly promote resurgent hot pages, ensuring that ≤5% of hot traffic is served remotely (Maruf et al., 2022).
  • Feedback-driven page placement (Caption): A sign-based greedy feedback loop incrementally tunes the interleaved page allocation ratio (p_CXL) between DRAM and CXL, using linear regression models of hardware counter signals to converge to empirically optimal ratios (30–40%); a simple sketch of this loop follows the list (Sun et al., 2023).
  • Time-series predictive scheduling (CXLAimPod): Cgroup hints and per-task PMU counters inform eBPF/sched_ext scheduling decisions, steering read/write-balanced, throughput-optimized tasks onto full-duplex CXL links (Yang et al., 21 Aug 2025).
  • Dynamic throttling (MIKU): Uses real-time PMU measurements and Little's Law to control CXL request injection rates via additive-increase/multiplicative-decrease rules, thus avoiding congestion and protecting DDR bandwidth (Yang et al., 22 Mar 2025).
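
As a rough illustration of the Caption-style sign-based loop above, the sketch below nudges the CXL interleave ratio by a fixed step and flips the search direction whenever the predicted slowdown worsens. This is a minimal sketch, not Caption's actual regression model: the measure callback and the toy slowdown curve are placeholders for a PMU-driven predictor.

```python
def tune_cxl_ratio(measure, p_cxl=0.20, step=0.05, lo=0.0, hi=0.5, epochs=12):
    """Sign-based greedy tuning of the fraction of pages interleaved to CXL.

    measure(p) -> estimated slowdown at interleave ratio p (in a real system
    this would come from a regression over hardware counters; here it is a
    caller-supplied placeholder).
    """
    best = prev = measure(p_cxl)
    best_p = p_cxl
    direction = +1                         # first try pushing more pages to CXL
    for _ in range(epochs):
        p_cxl = min(hi, max(lo, p_cxl + direction * step))
        cur = measure(p_cxl)
        if cur > prev:                     # got worse: flip the search direction
            direction = -direction
        if cur < best:
            best, best_p = cur, p_cxl
        prev = cur
    return best_p

if __name__ == "__main__":
    # Toy slowdown curve with a minimum near ~35% CXL (purely illustrative).
    toy = lambda p: (p - 0.35) ** 2 + 1.0
    print(f"converged ratio: {tune_cxl_ratio(toy):.2f}")
```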

These algorithms are designed to be lightweight and to respond to device- and workload-level feedback with minimal overhead.
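
The Greedy-Balance idea is simple enough to sketch end to end. The version below is a paraphrase of the proportional split described above rather than the Octopus implementation: each host's request is divided across the devices it can physically reach in proportion to their remaining capacity, so no device becomes a hotspot and each call costs O(m) for m reachable devices.

```python
def greedy_balance(request_bytes, reachable, free_bytes):
    """Split one host's request across its reachable devices in proportion to
    each device's remaining capacity.

    request_bytes : bytes requested by the host
    reachable     : list of device ids the host can physically reach
    free_bytes    : dict device id -> remaining capacity (mutated on grant)
    Returns a dict device id -> granted bytes. Integer division can leave a
    few bytes ungranted; a real allocator would round to page/chunk granularity.
    """
    total_free = sum(free_bytes[d] for d in reachable)
    if total_free == 0:
        return {}
    target = min(request_bytes, total_free)
    grant = {}
    for d in reachable:
        share = min(free_bytes[d], target * free_bytes[d] // total_free)
        grant[d] = share
        free_bytes[d] -= share
    return grant

if __name__ == "__main__":
    free = {"dev0": 64 << 30, "dev1": 32 << 30, "dev2": 96 << 30}
    # Host A reaches dev0/dev1, host B reaches dev1/dev2 (a small multi-headed pod).
    print(greedy_balance(48 << 30, ["dev0", "dev1"], free))
    print(greedy_balance(80 << 30, ["dev1", "dev2"], free))
    print({d: v >> 30 for d, v in free.items()}, "GiB remaining")
```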

4. Integration with Hardware, OS, and Application Frameworks

Proper CXL-aware allocation requires system-level integration:

  • NUMA node abstraction: CXL memory typically appears as additional NUMA nodes. OS allocators (e.g., Linux HMM, kernel >= 5.18) extend standard memory placement/migration APIs to support device selection, page migration, and lazy fault allocation (Wang et al., 28 Nov 2025, Maruf et al., 2022).
  • Memory APIs and frameworks: Abstractions such as OpenSHMEM PGAS, custom user-space verbs, and memory management calls (CXL-alloc, shmem_malloc, shmem_put/get/barrier) expose page, chunk, or region-level allocation primitives for shared, pooled, and private device memory (Jain et al., 4 Apr 2024).
  • Physical address mapping: Remap tables in CXL controllers translate HostPA to device addresses; per-page policy bits and granular snoop filters enforce coherency over shared regions (Jain et al., 4 Apr 2024, Sharma et al., 2023).
  • Performance counter-guided policies: Allocation policies leverage CPU stall cycles, L1/L2/LLC miss rates, store-buffer stalls, and tail latency metrics obtained via PMUs, allowing adaptive decisions on page migration, interleaving, and throttling (Liu et al., 22 Sep 2024, Yang et al., 22 Mar 2025).
  • Hybrid coherence models: Hardware back-invalidation combined with software-managed snooping is used for fine-grained synchronization or meta-data regions, while bulk data may be managed via chunk-level migration and remote memory access primitives (Jain et al., 4 Apr 2024, Sharma et al., 2023).

Such mechanisms ensure transparent expansion and dynamic steering of allocations between DRAM and CXL, with low overhead and compatibility with existing high-level programming models.
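
At the lowest level, this integration is visible through the standard Linux NUMA interfaces. The sketch below assumes a Linux host where the CXL expander appears as a CPU-less (memory-only) NUMA node: it discovers nodes via sysfs and launches a workload under numactl with interleaving across a chosen DRAM/CXL node set. The workload binary is a placeholder, and production systems would typically use libnuma/mbind or kernel tiering rather than a wrapper like this.

```python
import glob
import os
import subprocess

def list_numa_nodes():
    """Return {node_id: has_cpus} from sysfs; CPU-less nodes are typically
    CXL/HDM expanders exposed as memory-only NUMA nodes."""
    nodes = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*"):
        node_id = int(os.path.basename(path)[4:])
        cpulist = open(os.path.join(path, "cpulist")).read().strip()
        nodes[node_id] = bool(cpulist)        # empty cpulist -> memory-only node
    return nodes

def run_interleaved(cmd, nodes):
    """Launch `cmd` with its pages interleaved across `nodes` using numactl."""
    node_str = ",".join(str(n) for n in sorted(nodes))
    return subprocess.run(["numactl", f"--interleave={node_str}", *cmd])

if __name__ == "__main__":
    nodes = list_numa_nodes()
    dram_nodes = [n for n, has_cpu in nodes.items() if has_cpu]
    cxl_nodes = [n for n, has_cpu in nodes.items() if not has_cpu]
    print("DRAM nodes:", dram_nodes, "CXL/memory-only nodes:", cxl_nodes)
    if cxl_nodes:
        # Interleave a bandwidth-hungry workload across one DRAM node and CXL memory.
        run_interleaved(["./bandwidth_bound_app"], dram_nodes[:1] + cxl_nodes)  # placeholder binary
```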

5. Application Domains and Experimental Results

CXL-aware allocation methodologies have been evaluated in a variety of domains:

  • High-throughput approximate nearest neighbor search: Cosmos adjacency-aware allocation yields 6.72× higher throughput on SIFT1B and a 2.35× gain over the previous CXL-ANNS offload, with consistently lower load imbalance and latency (Ko et al., 22 May 2025).
  • Long-context LLM fine-tuning: Optimized partitioning of optimizer-critical and latency-tolerant tensors across DRAM and CXL stretches GPU memory limits with ≤3% drop in throughput (single AIC), nearly zero overhead with dual AICs and striping (Liaw et al., 4 Jul 2025).
  • Pooling and sharing across hosts: Octopus topology and allocation algorithm achieves 16–17% lower cost per host, up to 3× pod size scaling, and 3× lower pool latency than symmetric designs (Berger et al., 15 Jan 2025).
  • Mixture-of-Experts inference: Context-aware expert pinning with CXL-NDP placement and mixed-precision quantization achieves up to 8.7× throughput improvement with only 0.13% accuracy loss compared to state-of-the-art (Fan et al., 4 Dec 2025).
  • Transparent kernel-level page placement: TPP maintains application throughput within <1% of ideal, outperforming NUMA Balancing by 5–17% (Maruf et al., 2022); Caption’s feedback-tuned interleaving boosts throughput by up to 24% compared to static policies (Sun et al., 2023).
  • Duplex scheduling exploitation: CXLAimPod attains 55–61% bandwidth improvement at balanced R/W ratios, with up to 150% Redis improvement for sequential patterns, and 71.6% improvement for LLM text generation workloads (Yang et al., 21 Aug 2025).

Allocation policy selection—guided by workload domain (bandwidth- vs. latency-sensitive), access patterns, concurrency level, and device topology—directly controls resource efficiency and performance.

6. Design Guidelines, Limitations, and Open Challenges

Actionable guidelines based on the surveyed work include:

  • Select the allocation granularity (page, chunk, region) most suitable for access pattern and coherency overhead.
  • Use performance prediction models and PMU metrics rather than single-point latency/bandwidth statistics to steer allocation.
  • Prefer explicit interleaving ratios and cgroup hints for mixed R/W workloads; reserve local DRAM for strict latency requirements (Liu et al., 22 Sep 2024, Yang et al., 21 Aug 2025).
  • Employ regulated migration and throttling to avoid bandwidth contention or migration storms (Yang et al., 22 Mar 2025); a generic AIMD throttling sketch follows this list.
  • Integrate allocation policy with OS/hypervisor or orchestration via extended APIs (e.g., madvise, membind, cxl_alloc), exposing control to higher-level frameworks (Wang et al., 28 Nov 2025, Jain et al., 4 Apr 2024).
  • Hardware and OS integration must support dynamic pool resizing, fabric remapping, and adaptive snoop filter sizing—especially for rack-scale and multi-host deployments (Jain et al., 4 Apr 2024, Sharma et al., 2023).
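
The regulated-throttling guideline (see the MIKU entry in Section 3) reduces to an additive-increase/multiplicative-decrease loop on the CXL request injection rate. The controller below is a generic AIMD sketch, not MIKU's algorithm: the latency target, gain constants, and latency trace are placeholder values, and a real implementation would derive the measured latency from PMU counters and a Little's-Law estimate of outstanding requests.

```python
class AIMDThrottle:
    """Generic AIMD controller for the allowed CXL request injection rate."""

    def __init__(self, rate=1.0, add_step=0.05, mult_decrease=0.5,
                 latency_target_ns=300.0, min_rate=0.05, max_rate=1.0):
        self.rate = rate                      # fraction of requests allowed onto CXL
        self.add_step = add_step
        self.mult_decrease = mult_decrease
        self.latency_target_ns = latency_target_ns
        self.min_rate = min_rate
        self.max_rate = max_rate

    def update(self, measured_latency_ns):
        """Called once per control epoch with the observed CXL access latency."""
        if measured_latency_ns > self.latency_target_ns:
            # Congestion: back off multiplicatively to protect shared bandwidth.
            self.rate = max(self.min_rate, self.rate * self.mult_decrease)
        else:
            # Headroom: probe for more bandwidth with a small additive increase.
            self.rate = min(self.max_rate, self.rate + self.add_step)
        return self.rate

if __name__ == "__main__":
    ctrl = AIMDThrottle()
    # Toy latency trace: a congestion spike followed by recovery (illustrative only).
    for lat in [250, 260, 420, 510, 380, 280, 240, 230]:
        print(f"latency={lat}ns -> injection rate={ctrl.update(lat):.2f}")
```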

Current limitations include incomplete PMU coverage on some platforms, unpredictable CXL tail latency spikes due to hardware/fabric effects, migration and TLB shoot-down overhead at large scale, and granularity mismatches between hardware snoop filters and allocation regions. Open directions point to hierarchical, multi-tenant allocation schemes, deeper OS and hardware co-design, and adaptive, ML-driven policy controllers.

7. Representative Algorithms and Performance Summary

| Algorithm/Policy | Domain | Key Benefit |
| --- | --- | --- |
| Cosmos adjacency-aware (Ko et al., 22 May 2025) | ANNS search, RAG | 6.72× throughput, load balance |
| Caption feedback loop (Sun et al., 2023) | General kernel allocations | +8–24% throughput, zero migration |
| TPP (Maruf et al., 2022) | OS-level promotion/demotion | <1% gap to ideal, robust scaling |
| Octopus greedy-balance (Berger et al., 15 Jan 2025) | Pooling/topology | 3× scaling, 16–17% cost savings |
| LLM CXL-opt (Liaw et al., 4 Jul 2025) | LLM fine-tuning | Up to 21% improvement, near-DRAM performance |
| CXLAimPod (Yang et al., 21 Aug 2025) | Kernel scheduling, duplex | +7–150% performance for mixed workloads |
| Context-Aware MoE (Fan et al., 4 Dec 2025) | MoE inference | 8.7× throughput, 0.13% accuracy drop |
| SupMario bestshot/Alto (Liu et al., 22 Sep 2024) | Runtime tiering/interleave | Near-optimal performance, ≤177% over baseline |

These approaches define the state-of-the-art for CXL-aware allocation across kernel, OS, and user-space framework levels.
