CXL-enabled Memory Pooling
- CXL-enabled memory pooling is a disaggregated memory architecture that leverages cache-coherent CXL protocols to form flexible, shared pools among hosts, GPUs, and accelerators.
- The design uses regular bipartite topologies and both direct-attach and switch-based architectures, achieving 3–7× memory scaling and up to 25% cost reduction.
- Advanced allocation algorithms and integrated software stacks enable dynamic NUMA integration and efficient, balanced memory distribution for diverse workloads.
Compute Express Link (CXL)–enabled memory pooling is a system-level architectural paradigm that leverages CXL’s cache-coherent interconnect protocols to construct large, flexible, and elastic memory pools decoupled from individual compute nodes. These pools can be accessed simultaneously or dynamically partitioned among multiple host CPUs, GPUs, or accelerators, thereby enhancing both utilization and performance while reducing costs associated with system memory provisioning and stranded resources (Berger et al., 15 Jan 2025, Chen et al., 28 Dec 2024). The sections below outline the formal models, hardware architectures, management algorithms, evaluation results, and core design trade-offs of this approach, drawing exclusively from primary experimental and theoretical research.
1. Formal Model and Topological Design
A CXL-enabled memory pool can be precisely described as a bipartite graph $G = (H, D, E)$, where $H$ is the set of hosts, $D$ the set of multi-headed CXL pooling devices (MHDs), and $E \subseteq H \times D$ encodes the CXL links (Berger et al., 15 Jan 2025). Each host is connected to the same number $r$ of MHDs and each MHD to the same number $k$ of hosts, such that $r\,|H| = k\,|D| = |E|$.
A "regular Octopus" topology enforces that every unordered host pair shares exactly one MHD: $|N(h_i) \cap N(h_j)| = 1$ for all distinct $h_i, h_j \in H$, where $N(h)$ denotes the set of MHDs attached to host $h$. This structure corresponds to a balanced incomplete block design (BIBD) with $\lambda = 1$, whose standard counting identities impose $r(k-1) = |H| - 1$ and $k\,|D| = r\,|H|$.
The construction of sparser or denser graphs enables topological tuning for bisection bandwidth, redundancy, and communication reach. Large clusters or hierarchical multi-cluster networks may combine multiple such subgraphs connected via CXL switches or fabric managers (Woo et al., 16 Oct 2025, Jain et al., 4 Apr 2024).
2. Hardware and System Architecture
CXL pooling requires hosts with CXL root ports interfaced over PCIe Gen5/6 physical layers (x8 or x16) to either direct-attach MHDs, multi-port CXL memory expanders, or hierarchical switches (Berger et al., 15 Jan 2025, Arelakis et al., 4 Apr 2024). The memory expander devices implement CXL.mem, CXL.cache, and CXL.io protocols. Hardware variants include:
- Direct-Attach Pools: Point-to-point or star topologies where each host is directly wired to multiple MHDs (Berger et al., 15 Jan 2025).
- Switch-Based Pools: CXL 2.0/3.0 switches support fan-in/fan-out to scale pod sizes, with Type-3 memory expanders behind them (Jain et al., 4 Apr 2024, Chen et al., 28 Dec 2024).
- Hierarchical/Hybrid Fabrics: Multi-root, multi-tiered designs for scale-out across racks, integrating CXL and accelerator-centric links (e.g., XLink/NVLink) (Woo et al., 16 Oct 2025).
- Disaggregated Cache-Coherent Pools: Use CXL 3.0’s hardware-managed MESI coherence across hosts and devices with back-invalidate and snoop filters (Jain et al., 4 Apr 2024).
Each device exports a DRAM or PMEM region addressable as system memory. All allocations and mappings are surfaced either as new NUMA nodes, memory hot-plug regions, or as part of a shared physical address space. Memory compression hardware (e.g., Tiered Memory Expander SoCs with on-die IP) can increase effective pool capacity by 2–3× (Arelakis et al., 4 Apr 2024).
3. Memory Allocation and Pool Management
Memory allocation in CXL pools is managed to optimize for utilization, minimize stranding, and adapt to device heterogeneity:
- Greedy-Balancing Algorithm (Octopus): Given a host's request for $B$ bytes, allocation across its connected MHDs is proportional to remaining free space: MHD $i$ receives $B \cdot f_i / \sum_j f_j$ bytes, where $f_i$ is its remaining free capacity (see the sketch after this list).
- Pod-Wide Allocation Strategies: Regular topologies (every host pair sharing one MHD) are leveraged for one-hop host-to-host queues; dense or sparse designs permit redundant paths or toroidal interconnects for HPC (Berger et al., 15 Jan 2025).
- Tiering and Pool-Aware NUMA: Latency-critical pages remain in local DRAM, while latency-tolerant or capacity-bound allocations are placed in the pool, dynamically interleaved at page or sub-page granularity based on monitored hotness or policy (Liaw et al., 4 Jul 2025, Wahlgren et al., 2022).
- Software-Managed DRAM Cache and Prefetch: Local DRAM serves as a set-associative cache for pool memory, with sub-page, signature-based prefetching to hide remote access latency (Tirumalasetty et al., 20 Jun 2024).
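As referenced in the greedy-balancing item above, the following is a minimal sketch of proportional allocation across a host's reachable MHDs; the device names, capacities, and the remainder-handling rule are illustrative assumptions, not the exact Octopus implementation.

```python
def greedy_balance(request_bytes, free_bytes):
    """Split a request across MHDs proportionally to remaining free space.

    free_bytes: dict of MHD -> free capacity in bytes.
    Returns dict of MHD -> bytes to allocate; the integer-division remainder
    is assigned to the device with the most free space.
    """
    total_free = sum(free_bytes.values())
    if request_bytes > total_free:
        raise MemoryError("request exceeds remaining pool capacity")
    shares = {m: (request_bytes * f) // total_free for m, f in free_bytes.items()}
    remainder = request_bytes - sum(shares.values())
    if remainder:
        emptiest = max(free_bytes, key=free_bytes.get)
        shares[emptiest] += remainder
    return shares

# A host attached to four MHDs asks for 96 GiB; allocations track free space.
GiB = 1 << 30
print(greedy_balance(96 * GiB, {"mhd0": 200 * GiB, "mhd1": 100 * GiB,
                                "mhd2": 50 * GiB, "mhd3": 50 * GiB}))
# {'mhd0': 48 GiB, 'mhd1': 24 GiB, 'mhd2': 12 GiB, 'mhd3': 12 GiB}
```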
4. Latency, Bandwidth, and Scalability Results
Empirical evaluations quantify the trade-offs and efficiency of CXL pooling. Key metrics, as reported in the cited papers, are summarized below:
| Configuration | Single-Access Latency | Bandwidth (per link) | Pool Size (hosts) | Notable Results |
|---|---|---|---|---|
| Small MHD, direct attach | 230–350 ns | 25 GB/s (x8) | 13–57 (+3–7×) | Octopus achieves ~3–7× larger pods at equal per-host cost (Berger et al., 15 Jan 2025) |
| CXL switch, multi-hop | 500 ns+ | 20–25 GB/s | up to 100s | Each switch hop adds ~250 ns; switch cost dominates (Berger et al., 15 Jan 2025) |
| CXL with compression (OCP) | <250 ns (tail) | ≥46 GB/s | n/a | 2–3× effective capacity; 20–25% TCO reduction (Arelakis et al., 4 Apr 2024) |
| CXL pool w/ DRAM cache | ~250 ns (w/ PF) | ~42–48% BW util | n/a | Prefetch+WFQ: +7–10% IPC, 15% latency reduction (Tirumalasetty et al., 20 Jun 2024) |
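Reading the table as a first-order model, a pooled load's latency can be approximated as the device's base latency plus roughly 250 ns per switch hop. The sketch below encodes only that relationship, with the base figures taken from the rows above rather than from any datasheet.

```python
def estimated_load_latency_ns(device_base_ns, switch_hops=0, hop_penalty_ns=250):
    """First-order CXL load-latency estimate: device latency plus per-hop cost."""
    return device_base_ns + switch_hops * hop_penalty_ns

print(estimated_load_latency_ns(300))                 # direct attach: ~300 ns
print(estimated_load_latency_ns(300, switch_hops=1))  # one switch hop: ~550 ns
```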
Fine-grained page placement in hosts’ OS/runtimes and the use of feedback-controlled throttling or weighted queueing in memory nodes further enhance performance under bandwidth contention (Yang et al., 22 Mar 2025, Tirumalasetty et al., 20 Jun 2024). Application-level performance drops <10–15% for classical scientific workloads with 75% of pages in pooled memory, though graph workloads are more sensitive to added latency (Wahlgren et al., 2022).
5. Cost Models, Utilization, and Economic Implications
CXL-enabled pooling drives improved memory utilization (reducing stranded capacity from 54% to 19% with 8 hosts, by queueing-theoretical estimation (Zhong et al., 30 Mar 2025)) and favorable cost-per-host scaling, owing to small, inexpensive MHDs and the elimination of expensive CXL or PCIe switches. Example (from Table 1 of Berger et al., 15 Jan 2025):
| Device Ports $N$ | Die-Normalized Cost (relative to $N=8$) |
|---|---|
| 2 | 0.19 |
| 4 | 0.42 |
| 8 | 1.00 |
| 16 | 3.07 |
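The die-normalized costs above also imply a per-port comparison, which is the arithmetic behind preferring many small MHDs; the short sketch below simply divides the table's values by the port count.

```python
# Die-normalized cost per CXL port, using the values from the table above.
die_cost = {2: 0.19, 4: 0.42, 8: 1.00, 16: 3.07}
for ports, cost in die_cost.items():
    print(f"{ports:>2} ports: {cost / ports:.3f} per port")
# 2-port and 4-port MHDs come out roughly 2x cheaper per port than a 16-port
# device, which is why Octopus-style pods favor many small MHDs over large switches.
```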
When normalized to a device cost of approximately \$1,600, small-radix configurations such as $N=4$ ports per MHD with $X=4$ MHDs per host deliver pods roughly 3× larger at comparable per-host cost, and combined with memory compression this yields a net system TCO reduction of 20–25% for hyperscalers (Arelakis et al., 4 Apr 2024).
6. Software Stack, Programming Model, and Coherence
CXL pooling is exposed to user-space via modified Linux kernel stacks: remote memory appears as additional NUMA nodes, and allocations use standard APIs (malloc, mmap, numactl) (Jain et al., 4 Apr 2024, Chen et al., 28 Dec 2024).
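A minimal sketch of how a userspace tool might locate such pool-backed NUMA nodes: CXL memory regions are typically surfaced as CPU-less (memory-only) nodes, which can be found by scanning standard Linux sysfs. Whether a given CPU-less node is actually CXL pool memory is an assumption to verify on the target system.

```python
from pathlib import Path

def cpuless_numa_nodes():
    """Return NUMA node IDs that expose memory but no CPUs (often CXL/pooled memory)."""
    nodes = []
    for node_dir in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpulist = (node_dir / "cpulist").read_text().strip()
        if not cpulist:  # memory-only node: no CPUs attached
            nodes.append(int(node_dir.name[len("node"):]))
    return nodes

print(cpuless_numa_nodes())
# A workload can then be bound to such a node with standard tooling, e.g.:
#   numactl --membind=<node_id> ./app
```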
- Consistency and Cache Coherence:
- CXL 2.0 pools provide NUMA-like semantics with MHDs as host-owned; CXL 3.0 enables multi-host coherence via directory and snoop filter in switches (MESI, back-invalidate) (Jain et al., 4 Apr 2024, Assa et al., 23 Jul 2024).
- Precise coherence is maintained at 64 B (cacheline) granularity; coarser-grained approaches (e.g., 4 KB pages) are used to reduce metadata cost.
- Formal programming models, e.g., CXL0, define operations (LStore, RStore, MStore, LFlush, RFlush) with operational semantics and correct-by-construction transformations for durable, concurrent applications (Assa et al., 23 Jul 2024); a toy sketch of these operations follows this list.
- Resource Management: Verbs-style APIs for atomic operations and worker/context assignment; global fabric managers to partition and isolate tenants; hybrid software-hardware approaches for failover, redundancy, and sharing (Jain et al., 4 Apr 2024, Chen et al., 28 Dec 2024).
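As mentioned for the CXL0 model above, the following is a toy sketch of how the listed operations might be modeled in software for reasoning about durability and ordering; the class, its in-memory backing, and the exact semantics of each operation are illustrative assumptions, not the formal operational semantics of Assa et al. (23 Jul 2024).

```python
class CXL0Sketch:
    """Toy model: stores are volatile until the corresponding flush makes them durable."""

    def __init__(self):
        self.local, self.pool = {}, {}                  # volatile views
        self.durable_local, self.durable_pool = {}, {}  # state assumed to survive a crash

    def lstore(self, addr, value):   # LStore: volatile write to local memory
        self.local[addr] = value

    def rstore(self, addr, value):   # RStore: volatile write to pooled memory
        self.pool[addr] = value

    def mstore(self, addr, value):   # MStore: write reflected in both regions
        self.lstore(addr, value)
        self.rstore(addr, value)

    def lflush(self):                # LFlush: persist outstanding local writes
        self.durable_local.update(self.local)

    def rflush(self):                # RFlush: persist outstanding pooled writes
        self.durable_pool.update(self.pool)

    def crash(self):                 # model a failure: volatile contents are lost
        self.local = dict(self.durable_local)
        self.pool = dict(self.durable_pool)
```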
7. Limitations, Challenges, and Future Directions
Major practical and research challenges include:
- Fabric and Scaling: Link reach limited to a few meters; hardware support for multi-hop, multi-root, and cross-rack fabrics is still maturing (Arelakis et al., 4 Apr 2024, Wang et al., 2023).
- QoS and Contention: Need for hardware and OS-level QoS; interference on shared pools degrades performance for bandwidth-heavy workloads without per-host or per-job control (Wahlgren et al., 2022).
- Coherence State Explosion: Snoop filter and metadata scalability for endpoints; hybrid or software-augmented approaches are deployed (Jain et al., 4 Apr 2024).
- Programming/Consistency: Lack of fine-grained, portable flush/ordering primitives and OS/hypervisor integration for persistent and crash-tolerant applications (Assa et al., 23 Jul 2024, Jain et al., 4 Apr 2024).
- Reliability and Security: Hardware support for failover, erasure coding, and isolation is ongoing in the standards community (Arelakis et al., 4 Apr 2024).
- Large-Scale LLM/AI Pools: Tiered pooling (tier-1: coherent, low-latency; tier-2: disaggregated, cheap, bulk) is explicitly used in accelerator clusters for LLM training/inference, yielding up to 4.5× latency improvements over RDMA (Woo et al., 16 Oct 2025).
Conclusion
CXL-enabled memory pooling systems, precisely modeled as regular bipartite topologies and realized with commodity MHDs and CXL-aware OS/runtime stacks, deliver significant improvements in memory utilization, cost, and scalability. State-of-the-art designs such as Octopus, OCP Tiered Expanders, and hybrid fabric clusters (ScalePool) demonstrate practical pool sizes of tens to hundreds of hosts, stepwise TCO reductions, and system-level performance competitive with local-attached DRAM. Despite residual bottlenecks in link bandwidth, switching, and protocol costs, as well as open challenges in programmability and robustness, CXL memory pooling is becoming a canonical approach for both hyperscale and HPC environments (Berger et al., 15 Jan 2025, Arelakis et al., 4 Apr 2024, Woo et al., 16 Oct 2025).