CXL-Attached Near-Data Processing (CXL-NDP)
- CXL-NDP is an architectural paradigm that integrates compute engines within CXL-attached memory devices, enabling direct offload of memory- and bandwidth-intensive workloads.
- It utilizes innovative offloading interfaces such as memory-mapped function regions and channel controllers to optimize latency, bandwidth, and energy efficiency.
- Advanced techniques like microthreading, asynchronous protocols, and adaptive kernel sizing drive significant throughput gains and energy reductions in various applications.
Compute Express Link–Attached Near-Data Processing (CXL-NDP) is an architectural paradigm in which computation is moved physically and logically closer to memory attached via the CXL standard. Instead of treating CXL-expander memory only as passive far memory, CXL-NDP places specialized or general-purpose compute engines in the memory device itself, allowing compute offload that directly addresses the bandwidth, latency, and data-movement bottlenecks typical of large-scale, memory-intensive systems.
1. Architectural Models and Key Mechanisms
CXL-NDP systems leverage CXL Type-2 and Type-3 device models. In Type-3 (memory expansion) configurations, the memory expander can include an NDP controller that orchestrates both conventional host-managed load/store access and compute offload via in-device cores or accelerators. These compute engines range from minimal RISC-V microcores (Yang, 2023) and ARM Neoverse-class general-purpose CPUs (Hermes et al., 3 Apr 2024) to channel-controller–like coprocessors (Liu et al., 11 Jun 2025) and custom μthread/vector NDP cores (Ham et al., 30 Apr 2024).
Physical and Logical Pathways:
- Host CPUs or GPUs issue conventional CXL.mem requests for normal memory operations, with memory appearing contiguous in the host’s virtual address space (a data-placement sketch follows this list).
- Offloading interface: The device exposes a reserved control region (e.g., memory-mapped "M²func" window (Ham et al., 30 Apr 2024), mailbox queues (Yang, 2023)) for the host to launch, monitor, and synchronize offloaded kernels.
- NDP units have direct, low-latency access to device-local DRAM and often operate behind a DRAM-side L2 cache, enabling high bandwidth and concurrent kernel execution.
- Communication: Host-device interactions occur over CXL.mem for both data and control, with some architectures also supporting asynchronous CXL.io DMA push-back for results (Lee et al., 4 Dec 2025).
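On Linux hosts, a CXL Type-3 expander is typically surfaced as a CPU-less NUMA node, so ordinary NUMA placement controls which data lands in device DRAM and is therefore directly reachable by the in-device NDP engines (the same mechanism that section 2 notes for enforcing data residency). The following minimal sketch uses libnuma for this purpose; the node id and buffer size are assumptions for illustration.

```c
/* Minimal sketch: placing a working set in CXL expander memory that Linux
 * exposes as a CPU-less NUMA node.  The node id (CXL_NODE) and buffer size
 * are assumptions for illustration; real systems discover the node via
 * sysfs or `numactl --hardware`. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CXL_NODE 1                      /* assumed NUMA node of the expander */
#define BUF_BYTES (64UL << 20)          /* 64 MiB working set                */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return EXIT_FAILURE;
    }

    /* Allocate the buffer directly on the CXL-attached node so that both
     * host CXL.mem accesses and device-side NDP kernels operate on the
     * same device-local DRAM, avoiding migration at offload time. */
    double *buf = numa_alloc_onnode(BUF_BYTES, CXL_NODE);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", CXL_NODE);
        return EXIT_FAILURE;
    }

    memset(buf, 0, BUF_BYTES);          /* touch pages to commit placement */

    /* ... host issues ordinary loads/stores, or launches an NDP kernel
     * over this region via the device's offload interface ... */

    numa_free(buf, BUF_BYTES);
    return EXIT_SUCCESS;
}
```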
Device Protocol Coordination:
A critical trade-off exists between fully device-centric polling (high latency, high overhead) and fully synchronous memory-centric offload (low latency, low overlap). To maximize pipelining and minimize idle time, advanced protocols such as "Asynchronous Back-Streaming" are deployed, in which the device streams partial results to a host-side ring buffer without synchronous round-trips, reducing device and host idle time by up to 22× and 3.8×, respectively (Lee et al., 4 Dec 2025).
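A minimal host-side sketch of such an asynchronous back-streaming arrangement is shown below: the device DMAs result records and a publishing tail index into a host-resident ring buffer, and the host drains results by polling its local copy rather than issuing per-result round-trips over the link. The structure names, record layout, and ordering discipline are illustrative assumptions, not the cited design.

```c
/* Conceptual sketch of host-side consumption in an asynchronous
 * back-streaming protocol: the device writes result records and advances
 * `tail` via DMA into host memory; the host polls its local copy and never
 * issues a synchronous round-trip per result.  All names and the record
 * layout are illustrative assumptions. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 1024               /* power of two, assumed */

struct ndp_result {
    uint64_t key;
    uint64_t value;
};

struct ndp_ring {
    _Atomic uint64_t tail;              /* written by device DMA      */
    uint64_t head;                      /* advanced by the host only  */
    struct ndp_result slots[RING_ENTRIES];
};

/* Drain all results currently visible in the ring; returns count consumed. */
static size_t drain_results(struct ndp_ring *ring,
                            void (*consume)(const struct ndp_result *))
{
    /* The acquire load pairs with the device publishing `tail` after the
     * slot contents (a simplification of real DMA ordering guarantees). */
    uint64_t tail = atomic_load_explicit(&ring->tail, memory_order_acquire);
    size_t n = 0;

    while (ring->head != tail) {
        consume(&ring->slots[ring->head % RING_ENTRIES]);
        ring->head++;
        n++;
    }
    return n;
}
```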
2. Compute Offload Programming Models
CXL-NDP exposes offload capabilities through several programming abstractions:
- Memory-mapped function regions (M²func): The host kernel or user process issues a store to a known offset, representing a control command or kernel call. Each operation is mapped to a predictable, low-latency CXL.mem store/load cycle (latency ≈140–200 ns) (Ham et al., 30 Apr 2024).
- Channel controllers (MCC): Inspired by mainframe I/O architectures, the host loads "channel programs" into device MMIO regions. Each channel program is an event-driven coroutine that operates within a full CXL.mem coherence domain, enabling bi-directional notification and load/store semantics (Liu et al., 11 Jun 2025).
- JIT/Profiling-guided Offload Transformation: Hot load-streams or pointer-chase kernels are identified through dynamic profiling and transformed into offloadable kernels with well-structured data arguments (Yang, 2023).
- General-purpose Cores (Editor’s term): The device executes offloaded (e.g., TFLite) graph operators or vector search kernels natively in a familiar ISA environment (e.g., ARM Cortex-A with full Linux toolchain (Hermes et al., 3 Apr 2024)), with data residency enforced via NUMA node affinity and explicit operator placement heuristics.
The programming model is typically zero-ISA-extension for the host: no new instructions or microcode flows are required. Instead, all offload, acknowledgment, and status operations are mediated via reserved memory or MMIO regions exposed through CXL, as the sketch below illustrates.
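As a concrete illustration of this zero-ISA-extension model, the sketch below launches a kernel through a memory-mapped function window in the spirit of M²func: a single store to a reserved offset encodes the call, and completion is observed with ordinary loads. The device path, register offsets, and command encoding are hypothetical; only the store/load launch pattern and the ~140–200 ns per-access cost reflect the description above.

```c
/* Illustrative sketch of launching an NDP kernel through a memory-mapped
 * function window.  Offsets, command encoding, and the device path are
 * hypothetical; a real M^2func-style interface defines its own layout. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define FUNC_WIN_SIZE   4096
#define REG_CMD         0x00    /* write: kernel id + argument pointer */
#define REG_STATUS      0x08    /* read: 0 = busy, 1 = done (assumed)  */

int main(void)
{
    int fd = open("/dev/cxl_ndp0", O_RDWR);      /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint64_t *win = mmap(NULL, FUNC_WIN_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) { perror("mmap"); return 1; }

    /* One CXL.mem store launches the kernel: each such store/load costs
     * roughly one link round-trip (~140-200 ns per the cited measurements). */
    uint64_t kernel_id = 3;                      /* assumed: filter kernel   */
    uint64_t args_dpa  = 0x100000;               /* assumed device address   */
    win[REG_CMD / sizeof(uint64_t)] = (kernel_id << 48) | args_dpa;

    /* Poll completion with ordinary loads to the same window. */
    while (win[REG_STATUS / sizeof(uint64_t)] == 0)
        ;                                        /* spin; a real host yields */

    munmap((void *)win, FUNC_WIN_SIZE);
    close(fd);
    return 0;
}
```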
3. Compute and Data Orchestration: μThread and Vector Models
To efficiently match device compute throughput to large memory bandwidth and hide DRAM latency, most CXL-NDP designs introduce lightweight, highly multithreaded microengines:
- μThreading: Each NDP unit provisions dozens of hardware μthreads, each with minimal state, allowing ready threads to issue while others stall on memory. This amortizes area and energy (≈0.6 mm² per unit in 7 nm) while supporting sustained fine-grained multithreading (FGMT) scheduling with zero context-switch overhead (Ham et al., 30 Apr 2024).
- Vector/DRAM granularity match: NDP vector engines fetch and process data at DRAM burst size, maximizing energy efficiency and bandwidth utilization.
- Adaptive Kernel Sizing: Profiling feedback dynamically adjusts kernel partitioning to avoid L1 cache pollution and to ensure load-use windows match the observed CXL.mem round-trip plus coherence overhead (Yang, 2023), where the minimum load-use instruction distance is D_min = ceil((RTT_CXL + t_coh) / t_clk).
This strategy keeps utilization close to peak memory bandwidth, with the concurrent μthread count hiding far-memory latency, which is critical for applications with pointer-chasing or irregular access patterns; a worked example of the sizing rule follows.
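As a worked instance of the sizing rule above, the snippet below computes D_min from an assumed CXL.mem round-trip latency, coherence overhead, and core clock period; the timing values are illustrative, not measurements from the cited work.

```c
/* Worked example of the adaptive-sizing rule
 *   D_min = ceil((RTT_CXL + t_coh) / t_clk).
 * The timing values below are assumptions for illustration only. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double rtt_cxl_ns = 300.0;   /* assumed CXL.mem round-trip latency   */
    double t_coh_ns   = 50.0;    /* assumed coherence/serialization cost */
    double t_clk_ns   = 0.5;     /* assumed 2 GHz core clock period      */

    /* Minimum number of independent instructions that must separate a far-
     * memory load from its first use so the latency is fully overlapped. */
    long d_min = (long)ceil((rtt_cxl_ns + t_coh_ns) / t_clk_ns);

    printf("D_min = %ld instructions\n", d_min);   /* 700 with these values */
    return 0;
}
```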
4. Quantitative Performance and Energy Impact
CXL-NDP platforms have demonstrated substantial improvements in both absolute throughput and resource efficiency across benchmarks:
| Application | Throughput Speedup | Energy Reduction | Remarks |
|---|---|---|---|
| In-memory OLAP/filter | up to 128× | up to 87.9% | vs. CPU+passive CXL (Ham et al., 30 Apr 2024) |
| SPMV/PGRANK/SSSP (Graphs) | up to 9.7× | up to 80.3% | vs. GPU+passive CXL |
| LLM generation (e.g. OPT-30B, OPT-2.7B) | 6–7× | — | Nearly saturating DRAM BW |
| Vector DB (HNSW indexing/query, large arm node) | 6.75–7.04× | — | Offload overhead ≤8.2% (Hermes et al., 3 Apr 2024) |
| Bandwidth Utilization (synthetic) | up to 88% | — | CXLMemUring vs. 60% baseline |
| LLM inference (compressed, quantized bit-plane) | +43% throughput | –40.3% DRAM energy | (Xie et al., 3 Sep 2025) |
| MoE decoding on GPU–CXL-NDP | up to 8.7× | — | 0.13% acc. drop (3-bit) (Fan et al., 4 Dec 2025) |
Empirical energy reductions stem both from decreased data movement (in-place computation, fewer round-trips) and from workload- or plane-aware compression, e.g., up to 25% weight-memory savings and 46.9% KV-cache footprint reduction in LLMs (Xie et al., 3 Sep 2025).
5. Advanced Data Movement: Compression, Quantization, and Scheduling
Bandwidth bottlenecks are mitigated through in-device data transformation schemes:
- Bit-Plane Disaggregation: Weights and activations are stored in precision-scalable planes per bit (sign, exponent, mantissa), allowing the device to serve only the subset needed for the dynamically requested precision (e.g., BF16→FP5) (Xie et al., 3 Sep 2025); see the sketch after this list.
- Lossless Compression: Bit-planes exhibit long runs of zero or constant bits, making them highly compressible. Hardware engines (LZ4/ZSTD) provide lossless storage with a composed compression ratio R of up to 1.88× for KV caches and 1.34× for weights, directly increasing the effective CXL bandwidth amplification α = R/β, where β is the quantization ratio and R the compression ratio (Xie et al., 3 Sep 2025).
- Mixed-Precision Quantization: Especially in MoE inference, the device can apply context-aware, expert-specific quantization to NDP-resident experts (1–4 bits, allocation via budgeted minimization of MSE), yielding nearly 9× decoding speedup at sub-0.2% accuracy loss (Fan et al., 4 Dec 2025).
- Adaptive and Context-Aware Placement: MoE hot experts are dynamically pinned to HBM on GPU; the remainder are allocated to CXL-NDP device, with activation-based transfer policy minimizing cross-module movement (Fan et al., 4 Dec 2025).
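To make the bit-plane idea concrete, the sketch below packs raw BF16 values into sixteen per-bit planes so that a reduced-precision request streams only the high-order planes and each plane can be compressed independently. The packing routine and layout are a conceptual illustration, not the cited hardware format.

```c
/* Conceptual sketch of bit-plane disaggregation for BF16 weights: bit k of
 * every value is packed into plane k, so a lower-precision request streams
 * only the high-order planes and each plane compresses well on its own.
 * The layout and function are illustrative, not the cited hardware format. */
#include <stdint.h>
#include <string.h>

#define PLANES 16                        /* BF16: 1 sign, 8 exp, 7 mantissa */

/* Pack n BF16 values (raw 16-bit patterns) into PLANES bitmaps of n bits.
 * planes[k] must hold at least (n + 7) / 8 bytes. */
static void bitplane_pack(const uint16_t *bf16, size_t n,
                          uint8_t *planes[PLANES])
{
    for (int k = 0; k < PLANES; k++)
        memset(planes[k], 0, (n + 7) / 8);

    for (size_t i = 0; i < n; i++) {
        uint16_t v = bf16[i];
        for (int k = 0; k < PLANES; k++) {
            if ((v >> k) & 1u)
                planes[k][i / 8] |= (uint8_t)(1u << (i % 8));
        }
    }
}

/* A reduced-precision request (e.g., BF16 -> FP5) would then fetch only the
 * sign and top exponent/mantissa planes; the remaining planes stay resident
 * on the device and are streamed later only if higher precision is needed. */
```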
6. OS and Virtualization Considerations
OS research in CXL-NDP proposes a channel-controller abstraction, integrating each NDP engine as a kernel-managed resource, surfaced as virtual channel controllers (MCCs) with well-defined syscalls. These are dynamically mapped to physical engines and isolated by IOMMU rules, enabling multi-tenancy and virtualization for far-memory offload (Liu et al., 11 Jun 2025). The programming model supports event-driven scheduling, full host-device cache coherence, and host-initiated DMA with fine-grained access control; however, limitations remain around scheduling fairness, address-space synchronization, and support for highly parallel kernels.
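Because the kernel interface for such virtual channel controllers is not standardized, the following is a purely hypothetical user-space sketch of submitting a channel program through an MCC device node; the ioctl numbers, structure layout, and device path are invented solely to make the control flow concrete.

```c
/* Purely hypothetical sketch of submitting a channel program to a virtual
 * channel controller (MCC).  The device node, ioctl numbers, and structure
 * layout are invented for illustration; the cited work defines its own
 * kernel interface. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct mcc_program {                     /* hypothetical descriptor          */
    uint64_t code_dpa;                   /* device address of channel program */
    uint64_t arg_dpa;                    /* device address of its arguments   */
    uint32_t flags;                      /* e.g., request a completion event  */
};

#define MCC_IOC_SUBMIT  _IOW('M', 1, struct mcc_program)   /* hypothetical */
#define MCC_IOC_WAIT    _IOR('M', 2, uint64_t)             /* hypothetical */

int main(void)
{
    int fd = open("/dev/mcc0", O_RDWR);  /* hypothetical virtual MCC node */
    if (fd < 0) { perror("open"); return 1; }

    struct mcc_program prog = {
        .code_dpa = 0x200000,            /* assumed placement of the program */
        .arg_dpa  = 0x300000,
        .flags    = 0,
    };

    /* The kernel would validate the program, map it onto a physical NDP
     * engine, and enforce IOMMU-based isolation between tenants. */
    if (ioctl(fd, MCC_IOC_SUBMIT, &prog) < 0) { perror("submit"); return 1; }

    uint64_t status = 0;
    if (ioctl(fd, MCC_IOC_WAIT, &status) < 0) { perror("wait"); return 1; }
    printf("channel program completed, status=%llu\n",
           (unsigned long long)status);

    close(fd);
    return 0;
}
```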
7. Limitations, Open Challenges, and Future Directions
- Resource Partitioning and Security: Shared CC/NDP units can introduce interference and security side-channels; static vs. dynamic partitioning trade-offs and TLB shootdown latency remain open research questions (Ham et al., 30 Apr 2024).
- Protocol Bottlenecks: While sub-μs offload is possible via CXL.mem, CXL.io-based protocols introduce μs-scale round-trips unsuitable for fine-grained NDP (Lee et al., 4 Dec 2025). Hybrid protocols that combine asynchronous DMA with local polling and minimal fences achieve near-ideal pipeline utilization.
- Programming Model Generality: General-purpose NDP cores with full Linux runtime offer broad deployment but may be suboptimal for dense, core-bound workloads. Conversely, minimal μthread/vector NDP cores are highly area/energy efficient for memory-bound tasks but require lower-level programming (Ham et al., 30 Apr 2024, Hermes et al., 3 Apr 2024).
- Application Suitability: Benefits are maximized for memory-intensive, bandwidth-constrained, or pointer-heavy workloads (graphs, in-memory analytics, sparse ML inference, vector retrieval). Compute-bound tasks and large-scale reductions may still favor host execution or require specialized orchestration (Hermes et al., 3 Apr 2024, Liu et al., 11 Jun 2025).
- Scaling and Ecosystem Integration: Multi-device scaling, data placement, and management across multiple CXL memory pools require advanced orchestration; automatic data-partitioning and page-remapping APIs are under active research (Ham et al., 30 Apr 2024). Real hardware implementations exploiting full CXL.mem coherence capabilities are only now arriving (Liu et al., 11 Jun 2025).
References
- (Yang, 2023) CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access
- (Ham et al., 30 Apr 2024) Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders
- (Xie et al., 3 Sep 2025) Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing
- (Hermes et al., 3 Apr 2024) UDON: A case for offloading to general purpose compute on CXL memory
- (Lee et al., 4 Dec 2025) Offloading to CXL-based Computational Memory
- (Liu et al., 11 Jun 2025) Mainframe-style channel controllers for modern disaggregated memory systems
- (Fan et al., 4 Dec 2025) Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems