
CXLMemUring: CXL Memory Co-Design

Updated 20 December 2025
  • CXLMemUring is a co-design paradigm enabling asynchronous, flexible parallel access to CXL memory pools in disaggregated environments.
  • It leverages modified RISC-V cores, in-core mailboxes, and near-endpoint compute offloading to dynamically reduce remote-memory latency.
  • Profiling-guided adaptive code generation and lightweight hardware extensions optimize offload decisions, yielding up to a 30% latency reduction and improved throughput.

CXLMemUring is a hardware/software co-design architecture paradigm designed for high-throughput, asynchronous, and flexible parallel access to CXL (Compute Express Link) memory pools, especially in disaggregated or heterogeneous compute environments. It introduces novel core modifications, in-core notification and asynchrony mechanisms, near-endpoint compute offloading, and profiling-guided adaptivity, aiming to hide remote-memory latency and exploit CXL’s load/store semantics more effectively than conventional tightly-coupled memory hierarchies (Yang, 2023).

1. Architectural Overview

At its core, CXLMemUring centers on a modified superscalar out-of-order RISC-V (BOOMv3) pipeline, augmented with an Async Memory Unit (AMU) that manages offloaded memory operations. The data-movement pipeline is designed for asynchrony at multiple levels:

  • Host BOOMv3 Core: Integrates the AMU next to the load/store unit. The AMU tracks “instruction distance” for loads and determines—via profiling—a threshold beyond which loads should be offloaded rather than processed synchronously through traditional memory hierarchies.
  • In-Core Mailbox: The AMU enqueues offload requests in a hardware mailbox. Upon completion, the remote CXL endpoint or near-endpoint core writes to this mailbox, waking the corresponding entry in the reorder buffer (ROB) and resuming dependent instructions.
  • CXL Endpoint/Switch: Offload requests are transmitted via a CPU-side CXL endpoint, traverse a PCIe/CHI-based fabric, and are handled by a near-endpoint RISC-V core or similar lightweight compute element residing at or near the remote memory.
  • Near-Endpoint RISC-V Core: Executes pointer chasing, metadata lookups, and DMA orchestration entirely independently of the host CPU, returning data directly to the host’s L1 with minimal synchronization overhead.

Data and control flow are dynamically partitioned: requests with a “short distance” proceed in order through the conventional memory stack, while “far” requests are speculatively offloaded and their execution overlaps with ongoing computation.
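
To make the in-core state concrete, the following is a minimal C sketch of what a mailbox entry and its FIFO might look like; the field names, widths, and FIFO depth are illustrative assumptions, not the design's actual RTL.

```c
/* Hypothetical layout of an AMU mailbox entry and FIFO, inferred from the
 * description above; all fields and sizes are illustrative assumptions. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t pc;           /* static load site (program counter) */
    uint16_t rob_idx;      /* reorder-buffer entry waiting on this load */
    uint8_t  dest_reg;     /* destination register to wake on completion */
    uint64_t remote_addr;  /* CXL memory-pool address to fetch */
    bool     completed;    /* set when the near-endpoint core writes back */
} mailbox_entry_t;

#define MAILBOX_DEPTH 32   /* assumed FIFO depth */

typedef struct {
    mailbox_entry_t slots[MAILBOX_DEPTH];
    uint8_t head, tail;    /* enqueued by the AMU, completions written remotely */
} amu_mailbox_t;
```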

2. Asynchronous Memory-Fetch Pipeline

The memory request pipeline in CXLMemUring is structured as follows:

  1. Online/Offline Profiling: Dynamic analysis records the instruction distance $d_i$ between issue and data readiness for each static load site. Sites exceeding an “async window” threshold $W_{\text{async}}$ are flagged for offload potential.
  2. Offload Decision: The AMU determines at decode/rename whether to issue a synchronous or asynchronous load based on profiling results and current window table configuration.
  3. Request Enqueue and Forwarding: Asynchronous loads are queued in the mailbox (tagged with PC, destination register, etc.), packaged into CXL.io requests, and forwarded to the CXL endpoint.
  4. Remote Handling: The near-endpoint core dequeues and processes offload requests, performing pointer traversal or address translation as required, before issuing a CXL.mem DMA.
  5. Cache Insertion/Notification: On data arrival, the remote endpoint injects the cache line into the host L1 using standard fill mechanisms, then writes completion into the mailbox.
  6. ROB Wakeup and Commit: The in-core AMU matches completion tags to outstanding loads, re-enables dependent instruction wakeups in the ROB, and commits execution flow.

This approach ensures that backend computations can proceed in parallel with remote fetches, substantially reducing effective memory access latency for distant loads.
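
The decision and completion steps above can be sketched in a few lines of C. The sketch below assumes a window-table lookup that returns the profiled distance for a load site, a mailbox enqueue primitive, and a ROB wakeup hook; all names, the threshold value, and the stub bodies are hypothetical.

```c
/* Minimal sketch of pipeline steps 2, 3, and 6; helpers are stubbed with
 * prints, and the threshold value is an assumption for illustration. */
#include <stdint.h>
#include <stdio.h>

#define W_ASYNC 64  /* assumed async-window threshold, in instructions */

/* Stub for step 1: profiled mean instruction distance for a static load site. */
static uint32_t window_table_lookup(uint64_t pc) { return (pc & 1) ? 128 : 8; }

static void issue_sync_load(uint64_t addr) {
    printf("sync load  addr=0x%llx\n", (unsigned long long)addr);
}
static void mailbox_enqueue(uint64_t pc, uint64_t addr) {
    printf("offload    pc=0x%llx addr=0x%llx\n",
           (unsigned long long)pc, (unsigned long long)addr);
}
static void rob_wakeup(uint16_t rob_idx) {
    printf("wake ROB entry %u\n", (unsigned)rob_idx);
}

/* Step 2: decode/rename-time choice between the sync and async load paths. */
static void amu_dispatch_load(uint64_t pc, uint64_t addr) {
    if (window_table_lookup(pc) >= W_ASYNC)
        mailbox_enqueue(pc, addr);   /* step 3: tagged request toward the CXL endpoint */
    else
        issue_sync_load(addr);       /* short distance: conventional memory path */
}

/* Step 6: a completion tag from the mailbox re-enables the dependent ROB entry. */
static void amu_on_completion(uint16_t rob_idx) { rob_wakeup(rob_idx); }

int main(void) {
    amu_dispatch_load(0x1000, 0x10000000);  /* short distance -> synchronous path */
    amu_dispatch_load(0x1001, 0x20000000);  /* long distance  -> offloaded */
    amu_on_completion(7);
    return 0;
}
```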

3. Profiling-Guided Code Generation and Adaptation

CXLMemUring incorporates a profiling-guided adaptive code generation and rewriting mechanism:

  • Profiling: For each static load site $i$, the distribution $H_i(d)$ of instruction distances is recorded, either during a static warmup phase or continuously at runtime.
  • Async Candidate Set: Sites with $E[H_i(d)] \geq W_{\text{async}}$ populate the async-candidate set $C$.
  • JIT Patching: An MLIR-based JIT tool rewrites binary code paths at sites $i$, inserting runtime-selected branches that choose between the normal and async load paths based on $\mathit{profile\_flag}_i$.
  • Continuous Tuning: The JIT daemon periodically updates async flags via writes to the AMU's window table. If throughput degrades ($\Delta T < 0$), flags are reverted and profiling resumes, enabling convergence to an optimal set of offload points.

This tight hardware/software feedback loop enables CXLMemUring to adapt to runtime workload characteristics and memory system dynamics without programmer intervention.
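
A hedged C sketch of one tuning epoch is shown below: the daemon trial-enables the async flag for each candidate site, keeps the change if throughput improves, and reverts it otherwise. The arrays, the threshold, and the throughput probe are assumptions for illustration, not the system's actual interface.

```c
/* Sketch of one profiling-guided tuning epoch; data structures and the
 * throughput probe are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>

#define NUM_SITES 128
#define W_ASYNC   64.0   /* assumed async-window threshold */

static bool   async_flag[NUM_SITES];     /* mirrored into the AMU window table */
static double mean_distance[NUM_SITES];  /* E[H_i(d)] from online/offline profiling */

/* Stub: a real daemon would sample a hardware throughput counter here. */
static double measure_throughput(void) { return 1.0; }

static void tuning_epoch(void) {
    double baseline = measure_throughput();
    for (size_t i = 0; i < NUM_SITES; i++) {
        if (async_flag[i] || mean_distance[i] < W_ASYNC)
            continue;                 /* already async, or not in the candidate set C */
        async_flag[i] = true;         /* trial: route site i through the async path */
        double t = measure_throughput();
        if (t < baseline)
            async_flag[i] = false;    /* throughput degraded: revert and keep profiling */
        else
            baseline = t;             /* improvement: keep the flag set */
    }
}

int main(void) { tuning_epoch(); return 0; }
```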

4. Performance Modeling and Quantitative Analysis

CXLMemUring’s performance model decomposes the service time for offloaded memory operations as $L_{\text{total}} = L_{\text{offload}} + L_{\text{transport}} + L_{\text{fill}} + L_{\text{notify}}$, where

  • $L_{\text{offload}}$: Enqueue and serialization to the endpoint,
  • $L_{\text{transport}} \approx L_{\text{fetch}}$: True CXL.mem round-trip latency,
  • $L_{\text{fill}}$: L1 insertion and cache coherence overhead,
  • $L_{\text{notify}}$: Mailbox write and ROB wakeup.

Empirical simulation using FireSim and CHI-based CXL.mem modeling yields:

  • Baseline synchronous CXL.mem per-load average: $\sim 500$ ns
  • CXLMemUring offload pipeline: $L_{\text{offload}} = 50$ ns, $L_{\text{transport}} = 250$ ns, $L_{\text{fill}} = 30$ ns, $L_{\text{notify}} = 20$ ns; $L_{\text{total}} \approx 350$ ns ($\sim$30% reduction).

Throughput improvement is parametrized as $\Delta T \approx \left(1 - f + f \cdot \frac{L_{\text{total}}}{L_{\text{sync}}}\right)^{-1}$, where $f$ is the fraction of loads offloaded and $L_{\text{sync}}$ is the native synchronous miss latency.

Reported results include up to 1.4× speedup for pointer-chase kernels, 1.2× on bandwidth-bound streaming, 2% profiling overhead, and up to +15% switch bandwidth utilization due to increased concurrency.
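
Plugging the reported component latencies into the model gives the quoted figures; the short program below does the arithmetic, with the offload fraction f = 0.5 chosen purely as an assumed example value.

```c
/* Worked instance of the latency/throughput model; component latencies are
 * from the text, the offload fraction f is an assumed example value. */
#include <stdio.h>

int main(void) {
    double L_offload = 50, L_transport = 250, L_fill = 30, L_notify = 20;  /* ns */
    double L_total   = L_offload + L_transport + L_fill + L_notify;        /* 350 ns */
    double L_sync    = 500.0;                                              /* ns */
    double f         = 0.5;   /* assumed fraction of loads offloaded */

    double speedup = 1.0 / (1.0 - f + f * (L_total / L_sync));
    printf("L_total = %.0f ns (%.0f%% below sync), speedup at f=%.1f: %.2fx\n",
           L_total, 100.0 * (1.0 - L_total / L_sync), f, speedup);
    /* Prints: L_total = 350 ns (30% below sync), speedup at f=0.5: 1.18x */
    return 0;
}
```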

5. Hardware/Software Co-Design and Scalability

CXLMemUring’s design demonstrates that minimal in-core hardware extensions (AMU, mailbox FIFO) and small near-endpoint microcontrollers suffice to enable highly asynchronous and flexible CXL memory access. Key architectural lessons:

  • Separation of Offload Logic: Keeping asynchrony and remote-synchronization logic out of the conventional MSHR/ROB critical path prevents pipeline bloat and avoids performance impact on typical workloads.
  • Profiling-Driven Adaptation: Profiling mitigates the risk of suboptimal offloading by targeting only long-distance, stall-prone memory operations.
  • Endpoint Compute Extensibility: The use of near-endpoint minicores generalizes across memory types (DRAM, flash, peer CPU) and offload functions (pointer-chasing, translation, DMA).
  • JIT-Driven Flexibility: Software control of offload policies via runtime patching enables workload- and system-adaptive operation without microcode or ISA changes.

This generalized template applies to scalable CXL-attached memory subsystems in both server- and accelerator-class systems.

6. Implications for Next-Generation MemoryX Architectures

CXLMemUring exemplifies the MemoryX paradigm—wherein memory access orchestration, movement, and computation are co-optimized for a given domain or workload. Its lessons extend to several broader principles:

  • Asynchrony and Latency Hiding: Selective offload of “distant” memory operations achieves significant latency hiding, even in the presence of high remote-memory access times.
  • Minimal Hardware, Maximal Flexibility: Lightweight mailbox and AMU architectures enable flexible software-driven policies without overhauling core pipelines.
  • Endpoint Compute as a Memory Orchestrator: Deploying small compute nodes near the CXL memory pool provides a substrate for advanced memory services with minimal host-CPU involvement.
  • Profile-Guided Adaptivity: System-wide throughput and tail-latency improvements are realizable by continuously adapting offload decisions based on real-time profile data.

These principles allow MemoryX architects to design systems that dynamically balance host-core compute, memory asynchrony, endpoint compute offload, and CXL-fabric utilization for both throughput and efficiency (Yang, 2023).

References (1)
