Hyperscale CXL Tiered Memory Expander

Updated 14 March 2026
  • A hyperscale CXL tiered memory expander is a datacenter-scale architecture that builds an elastic, multi-tiered memory pool from diverse memory types attached over CXL links.
  • It pairs an FPGA-based emulation platform (HeteroBox) with a device-side management engine (HeteroMem) to emulate, and transparently manage, tiers with distinct latency and bandwidth characteristics.
  • Empirical evaluations demonstrate a 5–16% end-to-end speedup over software-only schemes, reflecting effective profiling, migration, and scalable performance on memory-intensive workloads.

A hyperscale CXL tiered memory expander is a datacenter-scale hardware and software architecture that uses the Compute Express Link (CXL) protocol and device-side accelerators to construct a multi-tiered, heterogeneous, and elastic memory pool. This system dynamically migrates data between memory tiers of varying latency and bandwidth. The HeteroBox and HeteroMem platforms exemplify the rigorous engineering required to achieve high performance, transparency, and scalability in such environments, enabling up to 16% end-to-end speedup over state-of-the-art memory management schemes while providing compatibility with unmodified software stacks (Chen et al., 26 Feb 2025).

1. System Architecture and Emulation Platform

At its core, the system sits between the host CPU and a set of heterogeneous backing buffers (e.g., DDR DRAM, Storage Class Memory) attached over CXL links, optionally via multi-stage CXL switches. The CXL memory pool is partitioned into logical tiers, such as Tier 0 (on-package HBM), Tier 1 (DDR DRAM), and Tier 2 (SCM/Optane), each presenting distinct round-trip latency and bandwidth profiles to the CPU.

The HeteroBox platform implements configurable emulation of multi-tier CXL architectures using real FPGAs. By assigning contiguous regions of on-board DRAM to emulate distinct latency and bandwidth characteristics, HeteroBox can:

  • Tag incoming reads with virtual "ready times," queuing requests to enforce region-specific delays.
  • Enforce bandwidth throttling via epoch-based counters, stalling request processing when limits are reached.

This gives the CPU's memory controller the illusion of four NUMA nodes with tunable access properties, facilitating high-fidelity testing of system software and hardware-managed tiering policies (Chen et al., 26 Feb 2025).
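
As a behavioral illustration of the ready-time and epoch-counter mechanisms described above, the following Python sketch models one emulated region. It is not the HeteroBox RTL, and all latency, epoch, and bandwidth parameters are hypothetical.

```python
# Behavioral model of HeteroBox-style latency/bandwidth emulation (a sketch,
# not the FPGA RTL). All numeric parameters below are hypothetical examples.
from collections import deque

class EmulatedRegion:
    def __init__(self, name, extra_latency_ns, epoch_ns, bytes_per_epoch):
        self.name = name
        self.extra_latency_ns = extra_latency_ns  # added delay per request
        self.epoch_ns = epoch_ns                  # bandwidth-accounting window
        self.bytes_per_epoch = bytes_per_epoch    # throttle budget per window
        self.epoch_start_ns = 0
        self.bytes_this_epoch = 0
        self.pending = deque()                    # (virtual ready time, request id)

    def submit(self, now_ns, req_id, req_bytes):
        # Roll the epoch counter forward when a new accounting window begins.
        if now_ns - self.epoch_start_ns >= self.epoch_ns:
            self.epoch_start_ns = now_ns
            self.bytes_this_epoch = 0
        ready_ns = now_ns + self.extra_latency_ns
        if self.bytes_this_epoch + req_bytes > self.bytes_per_epoch:
            # Budget exhausted: stall the request until the next epoch (simplified).
            ready_ns = max(ready_ns, self.epoch_start_ns + self.epoch_ns)
        else:
            self.bytes_this_epoch += req_bytes
        self.pending.append((ready_ns, req_id))

    def drain(self, now_ns):
        # Complete queued requests, in order, once their virtual ready time has elapsed.
        done = []
        while self.pending and self.pending[0][0] <= now_ns:
            done.append(self.pending.popleft()[1])
        return done

# Hypothetical tier parameters, roughly ordered from fastest to slowest:
tiers = [
    EmulatedRegion("tier0_hbm", extra_latency_ns=80,  epoch_ns=1_000, bytes_per_epoch=100_000),
    EmulatedRegion("tier1_ddr", extra_latency_ns=200, epoch_ns=1_000, bytes_per_epoch=40_000),
    EmulatedRegion("tier2_scm", extra_latency_ns=450, epoch_ns=1_000, bytes_per_epoch=10_000),
]
```

In hardware, the equivalent logic sits in front of contiguous regions of on-board DRAM, so each region appears to the host as a NUMA node with the configured delay and throughput ceiling.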

2. Device-Side Management and Hardware Components

HeteroMem, instantiated on the FPGA, transparently implements device-side memory tiering:

  • Remapping Unit: Translates host physical addresses (hPA) to device physical addresses (dPA) using a small page-table cache (e.g., 2 MB, typically <1% miss rate) backed by remapping tables held in device DRAM.
  • Profiling Unit: Monitors every memory access. For slow tiers, page hotness is tracked with a Count-Min Sketch (2 × 512k entries, ~128 kB BRAM); for fast tiers, a ping-pong bitmap tracks cold pages.
  • Migration Engine: Atomically swaps a hot page from a slow tier into a cold frame in a fast tier when triggered by the profiling logic, using nonblocking in-DRAM copies and synchronized updates to the remapping tables.

Host access to the address space remains flat and contiguous; migration, tiering, and profiling are entirely transparent to both OS and application code, and no page faults or SVM traps are introduced. A lightweight kernel driver exposes MMIO controls and statistics via the PCI BAR, comprising fewer than 1,000 lines of host code and adding zero runtime interrupt or trap overhead (Chen et al., 26 Feb 2025).
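
The address-translation path can be pictured with the following functional sketch of an hPA-to-dPA lookup backed by a small cache and an in-DRAM table; the page size, cache capacity, and eviction policy are assumptions chosen for illustration rather than details of the HeteroMem implementation.

```python
# Sketch of a device-side remapping lookup (hPA -> dPA) with a small cache in
# front of an in-DRAM table, in the spirit of HeteroMem's remapping unit.
# Page size, cache capacity, and eviction policy here are assumptions.

PAGE_SHIFT = 12                      # 4 KiB pages (assumed granularity)

class RemapUnit:
    def __init__(self, cache_entries=512 * 1024):
        self.cache = {}              # hot subset of the remapping table
        self.cache_entries = cache_entries
        self.table = {}              # stand-in for the full on-DRAM remap table

    def translate(self, hpa):
        hpfn = hpa >> PAGE_SHIFT
        offset = hpa & ((1 << PAGE_SHIFT) - 1)
        dpfn = self.cache.get(hpfn)
        if dpfn is None:             # cache miss: walk the in-DRAM table (~50-100 ns in hardware)
            dpfn = self.table.get(hpfn, hpfn)          # identity mapping if never migrated
            if len(self.cache) >= self.cache_entries:
                self.cache.pop(next(iter(self.cache)))  # simplistic eviction
            self.cache[hpfn] = dpfn
        return (dpfn << PAGE_SHIFT) | offset

    def remap(self, hpfn, new_dpfn):
        # Called by the migration engine once a page copy completes.
        self.table[hpfn] = new_dpfn
        self.cache[hpfn] = new_dpfn
```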

3. Monitoring, Profiling, and Migration Algorithms

Efficient tier migration hinges on accurate, low-overhead detection of "hot" and "cold" pages:

  • A Count-Min Sketch is cleared every X cycles, tracking access frequency of pages in slower tiers. When the minimum across all hash lanes for a page index crosses a HOT_THRESHOLD, and the page is not already marked hot, it is enqueued for migration.
  • The coldness bitmap records accesses to pages in fast tiers; periodically, the bitmaps are swapped and cleared for next interval sampling, allowing detection of underutilized pages.
  • Migration is triggered when a hot page in a slow tier and cold page in a fast tier are paired.
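
The sketch below mirrors this hot/cold detection flow. The sketch width and depth follow the figures quoted above, while the hash function, HOT_THRESHOLD value, and interval handling are illustrative assumptions.

```python
# Hot/cold page detection sketch: a Count-Min Sketch estimates access counts for
# slow-tier pages, and a ping-pong bitmap records fast-tier accesses per interval.
# Hash choice, HOT_THRESHOLD, and interval handling are illustrative assumptions.
import hashlib

class CountMinSketch:
    def __init__(self, width=512 * 1024, depth=2):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _slot(self, row, pfn):
        digest = hashlib.blake2b(f"{row}:{pfn}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, pfn):
        # Increment every lane; the minimum across lanes is the (over-)estimate.
        estimate = None
        for r, row in enumerate(self.rows):
            i = self._slot(r, pfn)
            row[i] += 1
            estimate = row[i] if estimate is None else min(estimate, row[i])
        return estimate

    def clear(self):
        self.rows = [[0] * self.width for _ in self.rows]

HOT_THRESHOLD = 64  # assumed value

class Profiler:
    def __init__(self, fast_tier_frames):
        self.cms = CountMinSketch()
        self.flagged_hot = set()                    # slow-tier pages already enqueued
        self.active = bytearray(fast_tier_frames)   # accesses in the current interval
        self.shadow = bytearray(fast_tier_frames)   # previous interval, being scanned

    def on_slow_tier_access(self, pfn, migrate_queue):
        if self.cms.update(pfn) >= HOT_THRESHOLD and pfn not in self.flagged_hot:
            self.flagged_hot.add(pfn)
            migrate_queue.append(pfn)               # candidate for promotion

    def on_fast_tier_access(self, frame):
        self.active[frame] = 1

    def end_interval(self):
        # Swap bitmaps: the finished interval becomes the scan copy and a cleared
        # bitmap collects the next interval (hardware ping-pongs two physical
        # bitmaps); the sketch is also reset for the next sampling window.
        self.active, self.shadow = bytearray(len(self.active)), self.active
        self.cms.clear()
        return [f for f, bit in enumerate(self.shadow) if bit == 0]  # cold frames
```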

Migration decisions are guided by a tier-cost model:

$$\text{Cost}_1 = N \times (L_1 - L_0)$$

where $N$ is the access frequency and $L_1$, $L_0$ are the latencies of Tier 1 and Tier 0, respectively. Migration proceeds if $\text{Cost}_1$ exceeds the time- and bandwidth-amortized migration cost

$$\text{MigCost} = \frac{P}{B_{\text{mig}}} + \tau$$

where $P$ is the page size, $B_{\text{mig}}$ is the observed migration bandwidth, and $\tau$ incorporates migration-controller overhead (Chen et al., 26 Feb 2025).
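
A worked instance of this decision rule is sketched below; the access count, latencies, and $\tau$ value are hypothetical, while the 12 bytes/ns migration bandwidth corresponds to the ~12 GB/s figure quoted in Section 4.

```python
# Worked example of the tier-cost migration check using the formulas above.
# Access count, latencies, and tau are hypothetical illustrations; the
# 12 bytes/ns migration bandwidth corresponds to ~12 GB/s.

def should_migrate(n_accesses, lat_slow_ns, lat_fast_ns,
                   page_bytes=4096, mig_bw_bytes_per_ns=12.0, tau_ns=500.0):
    """Promote a page when the expected latency savings exceed the migration cost."""
    benefit_ns = n_accesses * (lat_slow_ns - lat_fast_ns)    # Cost_1 = N * (L1 - L0)
    mig_cost_ns = page_bytes / mig_bw_bytes_per_ns + tau_ns  # MigCost = P / B_mig + tau
    return benefit_ns > mig_cost_ns

# Example: 200 recorded accesses, 400 ns slow-tier vs 100 ns fast-tier latency.
print(should_migrate(200, lat_slow_ns=400, lat_fast_ns=100))  # True: 60,000 ns > ~841 ns
```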

4. Resource Usage, Overheads, and Bottlenecks

The design maintains exceptional hardware efficiency:

  • The remapping cache uses ≈2 MB SRAM and 20k ALMs, adding 1–2 cycles on hit, and around 50–100 ns on miss.
  • The Count-Min Sketch profiling structure fits in ≈128 kB BRAM, updating in a single cycle.
  • The migration engine saturates at ~12 GB/s—approximately 13× the rate of CPU-driven data copying.
  • Remapping cache miss rate is consistently under 1% for standard benchmarks, translating to <3% application-level performance drag.

The primary bottleneck is internal migration bandwidth. Aggressive tier filling can trigger spontaneous migrations; this can be mitigated by enlarging the remap cache, adjusting hotness thresholds, or rate-limiting migration (Chen et al., 26 Feb 2025).
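
One straightforward way to realize the rate-limiting mitigation is a token-bucket budget on migration traffic, sketched below. This is a generic technique shown under assumed parameters, not a mechanism documented for HeteroMem.

```python
# Token-bucket sketch for rate-limiting migration traffic so internal migration
# bandwidth does not starve demand accesses. Budget numbers are assumptions.

class MigrationRateLimiter:
    def __init__(self, bytes_per_ms=2_000_000, burst_bytes=8_000_000):
        self.rate = bytes_per_ms        # budget refill rate
        self.capacity = burst_bytes     # maximum accumulated budget
        self.tokens = float(burst_bytes)
        self.last_ms = 0.0

    def allow(self, now_ms, page_bytes=4096):
        # Refill proportionally to elapsed time, then spend the budget if possible.
        self.tokens = min(self.capacity, self.tokens + (now_ms - self.last_ms) * self.rate)
        self.last_ms = now_ms
        if self.tokens >= page_bytes:
            self.tokens -= page_bytes
            return True
        return False                    # defer this migration until the budget recovers
```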

5. Evaluation Methodologies and Empirical Results

The system was validated on a comprehensive workload suite with memory footprints of 4–11 GB and a range of tier splits (e.g., 4 GB fast, 12 GB slow):

  • Micro-benchmark: GUPS (random-update test); a sketch of its access pattern follows this list
  • Applications: database (Silo), index lookup (Btree), scientific kernel (XSBench), graph analytics (BC, PR), and HPC2017 workloads (bwaves, roms)
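
GUPS issues uniformly random read-modify-write updates across a large table, which defeats caching and locality and therefore stresses tier migration. The sketch below reproduces that access pattern; the table size and update count are illustrative.

```python
# GUPS-style random-update microbenchmark sketch. Uniformly random
# read-modify-write updates defeat locality; sizes here are illustrative.
import random

def gups(table_words=1 << 22, updates=1 << 20, seed=1):
    rng = random.Random(seed)
    table = [0] * table_words
    for _ in range(updates):
        i = rng.randrange(table_words)
        table[i] ^= i                    # random read-modify-write of one word
    return sum(table) & 0xFFFFFFFF       # fold the table so the loop is not dead code

if __name__ == "__main__":
    print(gups(table_words=1 << 20, updates=1 << 18))
```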

Key results include:

| Workload | HeteroMem speedup over SW baseline | Graph-kernel gain (4 GB tier) | Additional gain over NeoMem |
|---|---|---|---|
| All (geomean) | 5.1–16.2% | 11–33% | 6–12% |

HeteroBox achieves high-fidelity emulation with ±2 ns latency accuracy and linear bandwidth scaling up to 25 GB/s. Across all real workloads, HeteroMem’s device-side migration delivers a net benefit of 15–20% after accounting for remap overhead (Chen et al., 26 Feb 2025).

6. Scalability and Future Enhancements

This approach is readily extensible to multi-host and rack-scale deployments:

  • Remapping/migration table logic may be centralized in a CXL leaf switch, exposing a unified memory pool to multiple hosts.
  • No host-side extensions are needed—control and data movement reside in programmable logic or time-multiplexed FPGAs.
  • The system extends naturally to multi-granularity pages (e.g., 2 MB huge pages), prefetch-hint mechanisms, and multi-level CXL switch topologies with finer-grained latency/bandwidth characteristics, facilitating adaptation to CXL 3.0+ (Chen et al., 26 Feb 2025).

Identified areas for further research include capacity-aware tier placement, hardware prefetch co-design, adaptive migration policies, and leveraging device-side monitoring for rack-level pooled memory orchestration.


A hyperscale CXL tiered memory expander, designed following these principles, combines a configurable FPGA-based emulation platform (HeteroBox), a transparent, high-throughput device-side tiering engine (HeteroMem), and minimal host software. It achieves robust application acceleration, transparent OS integration, and scalable multi-host support, delivering workload speedups of 5–16% over advanced software-only tiering while maintaining compatibility with existing datacenter infrastructure (Chen et al., 26 Feb 2025).
