MemoryX: Unified CXL Memory Expansion

Updated 17 November 2025
  • MemoryX is a CXL-enabled memory expansion architecture that unifies host DRAM, device DRAM, and SSD-backed memory into a single coherent address space.
  • It employs ML-driven caching policies and hardware–software co-design to mitigate the latency and granularity mismatches typical of conventional caching methods.
  • MemoryX architectures demonstrate significant improvements in latency, throughput, and resource efficiency, making them well-suited to memory-intensive HPC and ML applications.

MemoryX is a class of CXL-enabled memory expansion architectures unifying host and device memory spaces—including SSDs—under a coherent address pool, incorporating advanced policy engines for device-resident DRAM caches, and leveraging hardware–software co-design for efficiency. Recent implementations such as ICGMM (Chen et al., 10 Aug 2024), CXLMemUring (Yang, 2023), and SkyByte (Zhang et al., 18 Jan 2025) exemplify the technical diversity and evolution in this space, employing machine learning for cache management, asynchronous in-core support for parallel access, and context-sensitive OS/controller integration. The term “MemoryX” encapsulates this new generation of architectures: CXL-attached DRAM or SSD devices managed as near-seamless extensions of main memory, aiming to overcome the latency, bandwidth, and manageability barriers endemic to far-memory scaling.

1. CXL-Based Memory Expansion Fundamentals

MemoryX systems exploit the Compute Express Link (CXL), a cache-coherent protocol layered over PCIe, to present host DRAM, device DRAM, and SSD-backed memory as a unified address space. In memory-expansion mode, host load/store instructions transparently access SSD-backed memory via CXL.mem. To mask the SSD's inherent latency (e.g., 75 μs reads, 900 μs writes), device-side DRAM acts as a page-granularity cache. Host–device coherent pooling enables order-of-magnitude capacity gains relative to DRAM-only systems, though high-latency "far-memory" misses remain a primary challenge. A granularity mismatch also arises: host accesses occur at 64 B cache-line granularity, while CXL/SSD transfers and the device cache operate on 4 KB pages, leading to cache pollution and suboptimal hit rates (Chen et al., 10 Aug 2024, Zhang et al., 18 Jan 2025).
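
The effect of this mismatch can be made concrete with a small simulation. The following Python sketch is purely illustrative (cache size, access pattern, and all names are assumptions, not values from the cited papers): it fills a tiny page-granularity device cache from 64 B host accesses and reports how few lines of each fetched 4 KB page are ever touched.

```python
# Illustrative sketch: host issues 64 B accesses, device cache fills 4 KB pages.
# Shows how a page-granularity cache wastes capacity when only a few lines per
# fetched page are ever touched (cache pollution). All parameters are made up.

from collections import OrderedDict

LINE = 64          # host access granularity (bytes)
PAGE = 4096        # device cache / SSD transfer granularity (bytes)
CACHE_PAGES = 4    # tiny device-side DRAM cache, in pages

cache = OrderedDict()          # page index -> set of touched line offsets (LRU order)
hits = misses = 0

def access(addr):
    """Simulate one 64 B host load arriving over CXL.mem."""
    global hits, misses
    page, line = addr // PAGE, (addr % PAGE) // LINE
    if page in cache:
        hits += 1
        cache.move_to_end(page)
    else:
        misses += 1                      # a full 4 KB page is fetched for one 64 B line
        if len(cache) >= CACHE_PAGES:
            cache.popitem(last=False)    # evict the LRU page
        cache[page] = set()
    cache[page].add(line)

# Sparse, strided pattern: each page is fetched for a single line.
for addr in range(0, 64 * PAGE, PAGE + LINE):
    access(addr)

touched = sum(len(lines) for lines in cache.values())
print(f"hits={hits} misses={misses} "
      f"avg touched lines per cached page={touched / max(len(cache), 1):.1f} of {PAGE // LINE}")
```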

2. Caching Policy Challenges and ML-Driven Solutions

Conventional cache management (e.g., LRU) is ill-suited to the DRAM–SSD tier because DRAM-cache misses incur microsecond-scale penalties and the policies adapt poorly to the granularity mismatch. ML-based policies (notably LSTM) incur high inference latency and excessive resource usage, especially in hardware-constrained contexts. The ICGMM approach (Chen et al., 10 Aug 2024) circumvents this with a hardware-friendly Gaussian Mixture Model (GMM): each page-access record is modeled as a 2-D tuple (P = page index, T = timestamp), and the admittance/eviction score is inferred through

\mathcal{G}(x;\pi,\mu,\Sigma) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

where $\mathcal{N}$ denotes the Gaussian density and $K$ is typically 256. EM training proceeds offline; at runtime, deeply pipelined FPGA kernels compute caching/eviction decisions in ~3 μs, effecting a >10,000× speedup and sharp reductions in miss rate (up to 6.14%) and SSD access latency (up to 39.14%) versus both LRU and LSTM.
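
A minimal sketch of this scoring step is given below, assuming fabricated mixture parameters, a small $K$, and a placeholder admission threshold (none of these values come from the ICGMM paper); the real engine fits the model offline with EM and evaluates it in pipelined FPGA kernels rather than in Python.

```python
import numpy as np

# Illustrative GMM scoring for DRAM-cache admission, following
# G(x; pi, mu, Sigma) = sum_k pi_k * N(x | mu_k, Sigma_k).
# All parameters below are made up; ICGMM trains them offline via EM (K ~ 256).

K = 4
rng = np.random.default_rng(0)
pi = np.full(K, 1.0 / K)                                 # mixture weights
mu = rng.uniform(0, 1, size=(K, 2))                      # means over normalized (page, time)
Sigma = np.stack([np.eye(2) * 0.05 for _ in range(K)])   # per-component covariances

def gaussian(x, mean, cov):
    """2-D Gaussian density N(x | mean, cov)."""
    d = x - mean
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def gmm_score(x):
    """Mixture density G(x); higher means the access fits a learned hot cluster."""
    return sum(pi[k] * gaussian(x, mu[k], Sigma[k]) for k in range(K))

ADMIT_THRESHOLD = 0.5   # per-application tuning knob (placeholder value)

def should_admit(page_idx, timestamp, page_max, t_max):
    x = np.array([page_idx / page_max, timestamp / t_max])   # normalize to [0, 1]
    return gmm_score(x) > ADMIT_THRESHOLD

print(should_admit(page_idx=1234, timestamp=56789, page_max=1 << 20, t_max=1 << 20))
```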

3. Hardware–Software Co-Design for Parallel, Flexible Access

CXLMemUring (Yang, 2023) advances MemoryX through hardware–software synergy. Host-side extensions to the BOOMv3 core introduce a CXL-IO FSM, a 16-entry mailbox in the ROB for offloading marked loads, and resume logic that restores dependent instructions when the CXL.mem response returns. At the endpoint, a μCore (RISC-V, 0.3 GHz) executes MemUring Offload Kernels (MOKs), synthesized asynchronously by an MLIR JIT and guided by profiling of hot pointer chains. Simulations (FPGA, CHI-modeled 32 GB/s, 500 ns RTT) show CXL load latency dropping from ~1.5 μs to 600 ns, pointer-chase throughput scaling 3.6×, and host L1 miss rate falling 40%. Profiling/JIT overheads are modest (<2%), and μCore utilization stays well below saturation. This architecture enables highly asynchronous, flexible memory-pool access, effectively bridging memory-wall bottlenecks for dense HPC and ML workloads.
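
The offload flow can be approximated in software terms: the host posts a dependent operation (here, a pointer chase) to a mailbox, continues with independent work, and later consumes the result. The asyncio sketch below is a functional analogy only; the actual mechanism lives in the BOOMv3 ROB and the endpoint μCore, not in a Python runtime, and all data and latencies are invented.

```python
import asyncio

# Functional analogy of CXLMemUring-style offload: the host posts a pointer-chase
# "offload kernel" to a mailbox, keeps executing independent work, and later
# consumes the result. Latencies and data are illustrative only.

FAR_MEMORY = {0x10: 0x20, 0x20: 0x30, 0x30: 0x40, 0x40: None}  # toy linked list in far memory

async def endpoint_chase(head):
    """Runs 'near' the CXL memory pool, so each hop avoids a host round trip."""
    visited, node = [], head
    while node is not None:
        await asyncio.sleep(0.0005)      # ~device-local access latency per hop
        visited.append(node)
        node = FAR_MEMORY[node]
    return visited

async def host():
    # Post the chase to the endpoint (mailbox entry) and continue independent work.
    chase = asyncio.create_task(endpoint_chase(0x10))
    independent = sum(i * i for i in range(10_000))   # host work overlapped with the offload
    nodes = await chase                               # "resume" the dependent instruction
    print(f"independent work = {independent}, chased nodes = {[hex(n) for n in nodes]}")

asyncio.run(host())
```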

4. Device Controller and OS Integration: Write-Log, Migration, and Context Switches

SkyByte (Zhang et al., 18 Jan 2025) investigates co-designed OS/SSD-controller mechanisms to further optimize MemoryX implementations, particularly for flash-backed CXL-SSDs. The internal SSD DRAM buffer is split into a cacheline-level circular write log (indexed by a two-level hash) and a page-level LRU data cache; log compaction merges dirty lines into pages for efficient flash writes. SkyByte also incorporates adaptive page migration: it maintains per-page access counters, promotes hot pages to host DRAM with PTE updates, and issues context-switch hints to the host OS when long flash delays are projected. Exception-triggered context switches are activated when the controller-estimated delay $T_{\rm flash\_est}(p)$ exceeds the measured context-switch cost $T_{\rm cs}$ (~2 μs). Quantitatively, SkyByte achieves up to 6.11× speedup and ~23× I/O traffic reduction, approaching 75% of idealized DRAM-only performance, contingent on access patterns and flash latency.
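
A simplified view of two of these mechanisms is sketched below in host-side Python for readability; the real logic runs inside the SSD controller and OS, and all names, sizes, and latencies here are illustrative assumptions rather than SkyByte's actual implementation.

```python
# Illustrative model of two SkyByte-style mechanisms: a cacheline-granularity
# write log compacted into page writes, and a context-switch decision based on
# the estimated flash delay. Values and structures are placeholders.

LINE, PAGE = 64, 4096
LINES_PER_PAGE = PAGE // LINE
T_CS_US = 2.0                      # measured context-switch cost (~2 us)

write_log = {}                     # (page, line) -> dirty 64 B data; stands in for the 2-level hash

def log_write(addr, data):
    """Absorb a 64 B store into the cacheline-granularity write log."""
    page, line = addr // PAGE, (addr % PAGE) // LINE
    write_log[(page, line)] = data

def compact():
    """Merge dirty lines by page so each flash program writes a whole 4 KB page."""
    pages = {}
    for (page, line), data in write_log.items():
        pages.setdefault(page, [None] * LINES_PER_PAGE)[line] = data
    write_log.clear()
    return pages

def should_context_switch(t_flash_est_us):
    """Yield the core only when the projected flash delay exceeds the switch cost."""
    return t_flash_est_us > T_CS_US

# Example: three stores to two pages, then compaction and a long flash read.
log_write(0x0000, b"a" * LINE)
log_write(0x0040, b"b" * LINE)
log_write(0x2000, b"c" * LINE)
print({hex(p): sum(x is not None for x in lines) for p, lines in compact().items()})
print(should_context_switch(t_flash_est_us=75.0))   # 75 us flash read >> 2 us switch
```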

| Mechanism | Role in MemoryX | Quantitative Effect |
|---|---|---|
| GMM caching (ICGMM) | DRAM-cache admission/eviction policy | 0.32–6.14% miss-rate reduction |
| Asynchronous loads (CXLMemUring) | Latency-hiding parallelism | Load latency 1.5 μs → 0.6 μs |
| Write log + migration (SkyByte) | I/O reduction and stall hiding | 6.11× speedup, 23× I/O bytes reduction |

5. Prototyping, Resource Trade-offs, and Scaling Considerations

ICGMM and SkyByte are both prototyped on FPGA platforms (e.g., Xilinx Alveo U50 at 233 MHz). For ICGMM, the GMM policy engine's hardware profile is 190 BRAM (14%), 117 DSP (2%), 58,353 LUT, and 152,583 FF, compared with the much larger LSTM cache engine (339 BRAM, 145 DSP, 85,029 LUT). GMM inference is >10,000× faster with drastic resource savings: ±2% BRAM, 78% DSP, and 69% LUT reduction. Production MemoryX systems favor offline EM training for the GMM (using ≥10,000 traces with length-32 windows), per-application threshold tuning to balance hit rate against cache pollution, and full dataflow pipeline overlap (GMM scoring, tag lookup, and SSD emulation decoupled). SmartSSD integration and exposing simple score APIs facilitate large-scale deployments.
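
The offline training step can be sketched with a standard EM implementation. In the sketch below, scikit-learn's GaussianMixture stands in for ICGMM's custom hardware-friendly EM, the trace is synthetic, and the component count, windowing, and threshold percentile are all assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative offline EM training of a cache-policy GMM over (page, timestamp)
# tuples. The synthetic trace, K = 16, and threshold are placeholders; ICGMM
# trains on real page-access traces (>= 10,000 traces, length-32 windows, K ~ 256).

rng = np.random.default_rng(42)
WINDOW = 32

# Synthetic trace: a hot page region plus uniform background noise.
hot = np.stack([rng.integers(1000, 1100, 20_000), np.arange(20_000)], axis=1)
cold = np.stack([rng.integers(0, 1 << 20, 5_000),
                 rng.integers(0, 20_000, 5_000)], axis=1)
trace = np.concatenate([hot, cold]).astype(float)

page_max, t_max = trace.max(axis=0)
trace = trace / [page_max, t_max]                    # normalize (page, time) to [0, 1]
trace = trace[: (len(trace) // WINDOW) * WINDOW]     # keep only complete length-32 windows

gmm = GaussianMixture(n_components=16, covariance_type="full", random_state=0)
gmm.fit(trace)

# score_samples returns log-density; threshold it to form the admit decision.
threshold = np.percentile(gmm.score_samples(trace), 20)   # per-application tuning knob
query = np.array([[1050 / page_max, 10_000 / t_max]])     # illustrative (page, time) point
print("admit" if gmm.score_samples(query)[0] > threshold else "bypass")
```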

6. Applications, Limitations, and Open Research Issues

MemoryX architectures are immediately applicable to large-scale memory-intensive workloads—including database systems, high-throughput ML/HPC, and LLM applications—where DRAM capacity is limiting and SSD expansion offers substantial dollar-per-GB savings. The GMM-based cache engine improves cache-miss and latency metrics, and asynchronous paradigms enable flexible offload of memory chains. Scaling to multi-host, multi-tier memory pools, and supporting advanced features (log compaction acceleration, hot-page migration, NUMA-aware coherence) are active research directions.

| Limitation | Root Cause | Open Mitigation |
|---|---|---|
| OS dependency (SkyByte) | Exception-handler injection | Kernel/driver co-design |
| Coherence during migration | Page-migration races | NUMA-aware placement |
| Hardware resource ceiling | FPGA on-chip memory limits | Bloom/CAM structures |
| Security and privacy | Device→CPU hint leakage | Protocol vetting |

A plausible implication is that MemoryX architectures, with continued co-design and protocol advances, may become the backbone for next-generation disaggregated memory pools—supplying high-capacity, low-latency address spaces for AI accelerators and composable cloud systems.

7. Future Directions

Active areas for MemoryX research include hardware-accelerated log compaction, integration into open standards (GenZ, CXL 3.0), scalable page-level buffer indexing (Bloom filters, lock-free CAM), and in-memory datacenter composition under unified OS scheduling. Dynamic threshold tuning, expanding mailbox mechanisms to multi-core superscalar contexts, and defining hardware–software contracts for endpoint accelerator pooling are under investigation. For production systems, effective deployment depends on tailoring migration and cache policies to workload locality, latency profiles, and host scheduling mechanisms.
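
One of these directions, scalable page-level buffer indexing, can be illustrated with a small Bloom filter that pre-screens lookups into the write log or page cache. The hash choice, sizing, and class name below are illustrative assumptions, not a design from any of the cited systems.

```python
import hashlib

# Illustrative Bloom filter for page-level buffer indexing: a cheap membership
# pre-check in front of the write log / page cache so lookups for absent pages
# usually skip the full index. Sizing and hashing are placeholder choices.

class PageBloom:
    def __init__(self, bits=1 << 16, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.bitmap = bytearray(bits // 8)

    def _positions(self, page):
        digest = hashlib.blake2b(page.to_bytes(8, "little"), digest_size=16).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i: 4 * i + 4], "little") % self.bits

    def add(self, page):
        for pos in self._positions(page):
            self.bitmap[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, page):
        """False means definitely absent; True means 'check the full index'."""
        return all(self.bitmap[pos // 8] & (1 << (pos % 8)) for pos in self._positions(page))

bloom = PageBloom()
for page in (17, 4242, 99_001):
    bloom.add(page)
print(bloom.maybe_contains(4242), bloom.maybe_contains(123_456))   # True, (almost surely) False
```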

In summary, MemoryX, encompassing approaches such as ICGMM, CXLMemUring, and SkyByte, represents a convergence of hardware-friendly policy engines, ML-based cache optimization, and OS/controller integration. These systems demonstrate substantial gains in latency, resource footprint, and resilience to granularity and coherence mismatches, establishing a technical foundation for scalable, heterogeneous memory expansion in contemporary computing environments.
