LMB: PCIe Devices Augmented with CXL Memory
- LMB is a framework that augments PCIe devices by leveraging a CXL-linked memory pool to provide additional virtual DRAM with minimal latency impact.
- The system requires only host-resident kernel modules and standard PCIe controllers, facilitating seamless integration with SSDs, GPUs, and DPUs.
- Performance evaluations show that LMB configurations achieve near-ideal throughput and low latencies, significantly outperforming flash-backed indexing schemes.
The Linked Memory Buffer (LMB) is a framework for augmenting PCIe devices such as SSDs, GPUs, and DPUs with additional, virtually transparent, DRAM capacity by leveraging a Compute Express Link (CXL)-attached memory pool. LMB addresses the chronic problem of on-board DRAM limitations in high-performance PCIe devices by virtualizing off-chip, low-latency memory expander resources. The effect is to present devices with a significantly enlarged, high-speed working memory region with minimal impact on their transactional or bandwidth characteristics (Wang et al., 4 Jun 2024).
1. System Architecture and Functional Design
The LMB system is structured around three main actors: PCIe endpoints (non-CXL-native SSDs/GPUs), CXL-native devices that can directly utilize CXL.mem peer-to-peer (P2P), and a disaggregated pool of DRAM modules accessible via a CXL memory expander, typically positioned behind a CXL switch fabric. The heart of the system is a host-resident kernel module working in conjunction with a Fabric Manager (FM) controlling memory allocation within the expander.
- The FM aggregates multiple DRAM DIMMs behind a CXL switch, exporting APIs for dynamic allocation of Host Physical Address (HPA) ranges. Internally, HPAs are mapped to Device Physical Addresses (DPA) through an HPA→DPA indirection table.
- The kernel module consumes these FM APIs to partition memory into fixed-size chunks (e.g., 256 MB blocks), mapping them into the local physical address space for PCIe devices or exposing CXL.mem mappings for CXL-native endpoints (a minimal allocation sketch follows this list).
- From the device's software view (e.g., SSD firmware, GPU driver), LMB-provisioned memory appears as if it is local DRAM; transactions targeting this region are intercepted by the host and redirected as CXL.mem requests, with responses delivered at sub-microsecond latency.
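The following is a minimal, hedged sketch of the allocation path described above: a host-side manager carving an FM-provided HPA range into 256 MB chunks and recording the HPA→DPA indirection. The names (`lmb_pool`, `lmb_chunk`, `fm_alloc_hpa_range`) and the base addresses are illustrative assumptions, not the paper's actual interfaces.

```c
/*
 * Illustrative sketch (not the paper's code): how a host-resident LMB
 * manager might carve FM-provided HPA ranges into fixed-size chunks and
 * record the HPA->DPA indirection. All names and addresses are assumed.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LMB_CHUNK_BYTES (256ULL << 20)   /* 256 MB chunks, as in the text */

struct lmb_chunk {
    uint64_t hpa_base;   /* host physical address exposed to the PCIe device */
    uint64_t dpa_base;   /* device physical address inside the CXL expander  */
    int      owner_id;   /* which SSD/GPU/DPU endpoint holds this chunk      */
};

struct lmb_pool {
    struct lmb_chunk *chunks;
    size_t            nr_chunks;
    size_t            next_free;
};

/* Stand-in for the Fabric Manager API that hands out an HPA range backed
 * by expander DRAM; a real FM would program the HPA->DPA decoder here. */
static int fm_alloc_hpa_range(uint64_t bytes, uint64_t *hpa, uint64_t *dpa)
{
    static uint64_t next_hpa = 0x4000000000ULL;  /* assumed CXL window base */
    static uint64_t next_dpa = 0x0;
    *hpa = next_hpa; *dpa = next_dpa;
    next_hpa += bytes; next_dpa += bytes;
    return 0;
}

/* Allocate one 256 MB chunk for a device and record the indirection. */
static struct lmb_chunk *lmb_alloc_chunk(struct lmb_pool *pool, int owner_id)
{
    if (pool->next_free == pool->nr_chunks)
        return NULL;                              /* pool exhausted */

    struct lmb_chunk *c = &pool->chunks[pool->next_free++];
    if (fm_alloc_hpa_range(LMB_CHUNK_BYTES, &c->hpa_base, &c->dpa_base))
        return NULL;
    c->owner_id = owner_id;
    return c;
}

int main(void)
{
    struct lmb_pool pool = {
        .chunks = calloc(16, sizeof(struct lmb_chunk)),
        .nr_chunks = 16,
    };
    struct lmb_chunk *c = lmb_alloc_chunk(&pool, /*owner_id=*/0);
    if (c)
        printf("chunk: HPA 0x%llx -> DPA 0x%llx (%llu MB)\n",
               (unsigned long long)c->hpa_base,
               (unsigned long long)c->dpa_base,
               (unsigned long long)(LMB_CHUNK_BYTES >> 20));
    free(pool.chunks);
    return 0;
}
```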
Capacity is expanded from the device's native limit $C_{\text{dev}}$ to $C_{\text{dev}} + C_{\text{LMB}}$, where $C_{\text{LMB}}$ is the amount allocated from the expander. End-to-end memory access latency follows $t_{\text{avg}} = f_{\text{local}} \cdot t_{\text{local}} + f_{\text{remote}} \cdot t_{\text{remote}}$, where $f_{\text{local}}$ denotes the local DRAM hit fraction (90–99 %), $f_{\text{remote}}$ (1–10 %) covers remote CXL accesses, and $t_{\text{remote}}$ is 190 ns (direct CXL.mem P2P) or 880 ns–1.2 µs (PCIe fallback) (Wang et al., 4 Jun 2024).
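As a numerical illustration of this model, assume a 95 % local hit rate, a 100 ns on-board DRAM access (an assumed figure, not one from the paper), and the 190 ns CXL.mem P2P path:

$$
t_{\text{avg}} = 0.95 \times 100\,\text{ns} + 0.05 \times 190\,\text{ns} \approx 104.5\,\text{ns}
$$

so a few percent of remote accesses adds only a handful of nanoseconds to the average access time.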
2. Implementation Modalities and System Integration
Minimal device modification is a foundational principle:
- No hardware changes or silicon modifications are required for PCIe endpoints; standard PCIe Gen4/Gen5 controllers are supported unmodified.
- On the device side:
- SSDs only require a patch to the L2P (logical-to-physical) lookup module so that, when onboard DRAM is exhausted, it issues MMIO reads to the host-mapped HPA range (sketched after this list). Core FTL and NAND controller logic remain unchanged.
- GPU integration extends CUDA’s UVM kernel module, supporting LMB allocations through added lmb_CXL_alloc() APIs.
- On the host side, the required infrastructure comprises a CXL switch, a DRAM-populated CXL memory expander, and the Fabric Manager (which may reside in firmware or as host software).
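Below is a hedged sketch of the SSD-side change described above: on an L2P cache miss, the lookup falls back to a read from the host-mapped LMB window, which in real firmware would be an MMIO load over CXL.mem or PCIe. The function and variable names (`lmb_l2p_window`, `l2p_cache_lookup`) are illustrative assumptions about firmware structure, not the actual patch.

```c
/*
 * Illustrative sketch (assumed names, not the paper's actual patch) of the
 * SSD-side L2P change: the lookup first checks the onboard DRAM cache and,
 * on a miss, falls back to a read from the host-mapped LMB window.
 */
#include <stdint.h>
#include <stdio.h>

#define L2P_ENTRIES       1024u
#define L2P_ENTRY_INVALID UINT32_MAX

/* Stand-in for the host-mapped HPA window backing the spilled L2P table;
 * on the device this would be a volatile MMIO pointer, here it is plain
 * memory so the sketch runs as an ordinary program. */
static uint32_t lmb_l2p_window[L2P_ENTRIES];

/* Tiny onboard "cache": only even LPNs are resident, to force misses. */
static uint32_t l2p_cache_lookup(uint32_t lpn)
{
    return (lpn % 2 == 0) ? lpn + 1000 : L2P_ENTRY_INVALID;
}

/* Resolve a logical page number (LPN) to a physical page number (PPN). */
static uint32_t l2p_resolve(uint32_t lpn)
{
    uint32_t ppn = l2p_cache_lookup(lpn);
    if (ppn != L2P_ENTRY_INVALID)
        return ppn;              /* onboard DRAM hit (90-99 % of accesses) */

    /* Miss: read the entry from the LMB region. Over CXL.mem P2P this costs
     * ~190 ns; over the PCIe fallback path, ~880 ns-1.2 us per access. */
    return lmb_l2p_window[lpn];
}

int main(void)
{
    for (uint32_t lpn = 0; lpn < L2P_ENTRIES; lpn++)
        lmb_l2p_window[lpn] = lpn + 2000;        /* pretend spilled entries */

    printf("LPN 4 -> PPN %u (cache hit)\n",    l2p_resolve(4));
    printf("LPN 5 -> PPN %u (LMB fallback)\n", l2p_resolve(5));
    return 0;
}
```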
3. Performance Evaluation
Empirical evaluation includes FIO-based workloads using both PCIe Gen4 ×4 and Gen5 ×4 SSDs. Compared schemes:
- Ideal: All indexes reside in onboard DRAM with zero added latency.
- DFTL: Indexing is flash-backed; cache misses cost 25 μs.
- LMB-PCIe: CXL-provisioned, but index accesses pay 880–1,190 ns.
- LMB-CXL: Direct CXL.mem P2P, 190 ns per additional index access.
Key findings:
- Write throughput for LMB-PCIe and LMB-CXL matches Ideal, both outperforming DFTL by 7× (Gen4) and 20× (Gen5) due to avoidance of flash read penalties.
- Read throughput: On Gen4, LMB-CXL is within 2 % of Ideal, LMB-PCIe 13–17 % lower. On Gen5, LMB-CXL incurs 8 % (seq) and 56 % (rand) drops, while LMB-PCIe drops 62–70 %.
- Write-path IOPS for all LMB variants is >95 % of Ideal.
- An analytical model that charges each remote index access the additional CXL latency predicts a throughput penalty consistent with the observed sequential degradation (Wang et al., 4 Jun 2024).
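One plausible form of such a model, written here purely as an illustration (the symbols $f_{\text{miss}}$, $t_{\text{CXL}}$, and $t_{\text{I/O}}$ and the exact expression are assumptions, not reproduced from the paper), treats each miss as adding one fixed CXL access to the per-request service time:

$$
T_{\text{LMB}} \approx \frac{T_{\text{ideal}}}{1 + f_{\text{miss}} \cdot t_{\text{CXL}} / t_{\text{I/O}}}
$$

where $t_{\text{I/O}}$ is the baseline per-request service time. With the 190 ns CXL.mem latency and a low miss fraction, this form yields a small sequential penalty, in line with the observations above.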
4. Overheads, Trade-Offs, and Scalability
Additional CXL linkages introduce:
- Latency overhead per off-chip DRAM access (190 ns–1.2 μs).
- Power overhead from the CXL switch/expander, typically 15–30 W per loaded system.
- No additional board area is consumed on device PCBs, though a shared CXL switch/expander unit is required per device group.
Capacity via LMB enables exascale indexing (for QLC SSDs or large GPU models) to fit into fast memory, sharply reducing write amplification and throughput bottlenecks typical of slower, flash-based indexing. The effective penalty for remote LMB misses remains bounded due to strong temporal locality in practical workloads.
5. Cross-Device Applications and Case Studies
LMB’s architectural paradigm readily generalizes:
- AI/ML training with large models: GPUs can allocate weight/activation buffers within the LMB pool, bypassing capacity constraints without triggering costly software page swaps (see the sketch after this list).
- Key-value and memory-semantic SSDs: Large exascale index tables (LSM/hash) are efficiently placed off-die, maintaining sub-microsecond access.
- Near-Data Processing on DPUs: Large routing or graph indices are hosted in pooled CXL memory, freeing up local device DRAM for compute focus (Wang et al., 4 Jun 2024).
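A hedged host-side sketch of the GPU allocation case follows. Only lmb_CXL_alloc() is named in the text above; its signature, the companion lmb_CXL_free(), and the capacity-threshold logic here are assumptions for illustration, stubbed so the example compiles and runs on its own.

```c
/*
 * Illustrative sketch: allocating a model-weight buffer from the LMB pool
 * when onboard GPU DRAM would be exceeded. lmb_CXL_alloc() is named in the
 * text; its signature, lmb_CXL_free(), and the stub bodies below are
 * assumptions so the example is self-contained.
 */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* --- assumed API surface (stubbed with plain malloc for illustration) --- */
static void *lmb_CXL_alloc(size_t bytes) { return malloc(bytes); }
static void  lmb_CXL_free(void *p)       { free(p); }

/* Assumed onboard capacity budget for the illustration (e.g. 16 GB HBM);
 * real code would query the driver instead of using a constant. */
#define ONBOARD_BUDGET_BYTES (16ULL << 30)

static void *alloc_weights(size_t bytes, size_t already_committed)
{
    if (already_committed + bytes <= ONBOARD_BUDGET_BYTES) {
        /* Fits in device DRAM: a real integration would go through the
         * regular UVM allocator here; malloc stands in for it. A real
         * system would also need matching free routines per path. */
        return malloc(bytes);
    }
    /* Spill to the CXL-attached LMB pool instead of triggering page swaps. */
    return lmb_CXL_alloc(bytes);
}

int main(void)
{
    size_t committed = 15ULL << 30;             /* pretend 15 GB already used */
    void *weights = alloc_weights(4ULL << 30, committed);
    printf("weights buffer %s\n", weights ? "allocated from LMB pool"
                                          : "allocation failed");
    lmb_CXL_free(weights);
    return 0;
}
```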
Related research demonstrates device-to-device streaming facilities by augmenting PCIe/CXL endpoints with an LMB and a small Stream Engine, providing direct, device-centric streams and zero-copy DMA across endpoints to further lower latency in pipelines such as ML inference or distributed database filters (Asmussen et al., 28 Mar 2024).
6. Limitations, Future Directions, and Research Extensions
Current LMB designs mark shared PCIe regions as uncached, thus not exploiting potential CXL.cache optimizations. Future directions include:
- Fine-grained coherence via integration with CXL.cache for further reduced latency.
- Enhanced fault tolerance through expander-level redundancy or erasure coding.
- Cross-device live migration for dynamic load balancing (memory rebalancing among SSDs, GPUs, DPUs).
- Quality-of-Service guarantees via access-control tables (SAT) implemented in expanders.
The broader CXL pooling literature (Zhong et al., 30 Mar 2025) demonstrates the potential to scale to multi-host, rack-scale deployments, emphasizing cost-efficient device disaggregation, single-digit-percent latency overheads, and substantial resource-utilization improvements over traditional PCIe switch approaches.
Key references:
- "LMB: Augmenting PCIe Devices with CXL-Linked Memory Buffer" (Wang et al., 4 Jun 2024)
- "Towards Disaggregation-Native Data Streaming between Devices" (Asmussen et al., 28 Mar 2024)
- "My CXL Pool Obviates Your PCIe Switch" (Zhong et al., 30 Mar 2025)