CXL-Integrated GPU Architecture

Updated 20 December 2025
  • CXL-integrated GPU architecture is a protocol-rich design that combines GPU compute and memory subsystems with CXL to extend local memory capacity.
  • It leverages cache-coherent CXL interfaces to integrate local HBM or GDDR with remote DRAM, persistent memory, or accelerators for enhanced performance.
  • System software adaptations, including kernel driver modifications and unified memory management, enable transparent page allocation and efficient workload scaling.

CXL-integrated GPU architecture refers to the direct, protocol-rich fusion of Compute Express Link (CXL) with GPU memory and compute subsystems to address DRAM capacity limitations, enable hardware-level memory disaggregation, and enhance application-level scalability and efficiency. This approach leverages CXL’s cache-coherent and memory-coherent interfaces to extend GPU device memory—often HBM2/3 or GDDR6—with pooled remote DRAM, persistent memory, or near-data processing accelerators. The architecture targets large-scale AI, graph, and analytics workloads where local memory is exhausted rapidly and existing expansion methods (PCIe-based RDMA, local DRAM augmentation) are inadequate or costly. The following sections synthesize key design principles, protocols, performance models, integration methodologies, representative workloads, and open challenges from published sources including "LMB: Augmenting PCIe Devices with CXL-Linked Memory Buffer" (Wang et al., 4 Jun 2024), "GPU Graph Processing on CXL-Based Microsecond-Latency External Memory" (Sano et al., 2023), "CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies" (Gouk et al., 18 Jun 2025), "Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management" (Yang et al., 25 Nov 2025), and related works.

1. Architectural Components and Integration

A CXL-integrated GPU system is characterized by tightly coupled hardware blocks:

  • GPU Chip: Hosts SM/warp arrays, multi-level caches (L1/L2), a native memory controller for onboard memory (HBM2/3/GDDR), and a CXL controller block as a Type-2 endpoint. The CXL block supports CXL.mem (load/store memory operations) and, optionally, CXL.cache for cache-coherence, plus request tagging (SPID), mapping tables, and protocol-specific controllers.
  • CXL Switch/Fabric: Routes CXL.mem and CXL.cache requests to a scalable pool of memory expanders (DRAM and/or persistent memory modules) and provides management interfaces for endpoint binding via a Fabric Manager (FM).
  • CXL-Linked Expanders: DRAM, flash, or PMEM modules reside behind a CXL memory expander card, offering high-density external memory. These modules include address translation (HPA→DPA via DMP), access control tables (SAT), and optional local caches.
  • Host CPU (optional): Drives FM for memory allocation, sets up IOMMU tables, coordinates CUDA or ROCm driver modifications (e.g., lmb_CXL_alloc, cudaMallocLMB), manages coherence commands for shared access.

Data and control flows traverse the path SM/L1/L2 → Memory Controller → CXL Controller → CXL Switch → Memory Expander → DRAM/PMEM; address translation, coherence enforcement, and page mapping are handled transparently along this path.
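To make the translation and access-control steps along this path concrete, the sketch below models a single CXL.mem read that misses the GPU's local memory: the request carries an SPID, the switch routes it to the owning expander, and the expander checks its SAT before translating the host physical address (HPA) to a device physical address (DPA) through its DMP table. This is a minimal illustrative model; the class layout, page granularity, and field names are assumptions, not a hardware specification.

```python
# Illustrative model of the GPU -> CXL switch -> expander request path.
# Table layouts, page size, and names are assumptions for illustration only.

PAGE_SIZE = 4096  # assumed page granularity for the HPA -> DPA mapping

class Expander:
    """A CXL memory expander with a device mapping table (DMP) and access table (SAT)."""
    def __init__(self, name):
        self.name = name
        self.dmp = {}      # HPA page -> DPA page translation
        self.sat = set()   # SPIDs permitted to access this expander
        self.media = {}    # DPA -> data, standing in for DRAM/PMEM media

    def map_page(self, hpa_page, dpa_page, spid):
        self.dmp[hpa_page] = dpa_page
        self.sat.add(spid)

    def mem_read(self, hpa, spid):
        if spid not in self.sat:
            raise PermissionError(f"SPID {spid} rejected by SAT of {self.name}")
        dpa_page = self.dmp[hpa // PAGE_SIZE]              # HPA -> DPA translation
        dpa = dpa_page * PAGE_SIZE + (hpa % PAGE_SIZE)
        return self.media.get(dpa, 0)

class CxlSwitch:
    """Routes CXL.mem requests to the expander that owns the target HPA range."""
    def __init__(self):
        self.routes = []   # (hpa_base, hpa_limit, expander), bound by the Fabric Manager

    def bind(self, hpa_base, hpa_limit, expander):
        self.routes.append((hpa_base, hpa_limit, expander))

    def mem_read(self, hpa, spid):
        for base, limit, exp in self.routes:
            if base <= hpa < limit:
                return exp.mem_read(hpa, spid)
        raise LookupError("HPA is not claimed by any bound expander")

# A load that missed the GPU's L1/L2 and falls outside the local HBM range:
switch, exp0 = CxlSwitch(), Expander("exp0")
exp0.map_page(hpa_page=0x100, dpa_page=0x7, spid=3)
switch.bind(0x100 * PAGE_SIZE, 0x200 * PAGE_SIZE, exp0)
value = switch.mem_read(hpa=0x100 * PAGE_SIZE + 64, spid=3)   # translated and served remotely
```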

2. Memory Coherence and Protocols

CXL-integrated architectures support multiple coherence modes:

  • CXL.cache (Snoop Mode): GPU endpoints connect to the snoop bus via a cache-coherency port. Snoop requests for L1/L2 cached lines are served by expanders with "HasDirty" detection and globally ordered responses. This enables MESI-like states and allows efficient cacheline-level sharing among multiple endpoints.
  • Uncached (CXL.mem Only) Mode: CXL-resident pages bypass in-GPU caching. Software drivers enforce explicit flush/invalidate operations or issue CXL “back-invalidate” commands on writes from other agents, trading performance for implementation simplicity.
  • Persistent Memory Coherence: For PMEM expanders, device coherence engines (DCOH) extend MESI with “Shared-Modified” tracking and active snooping, enabling hardware-managed cacheline persistency and undo-logging for checkpointing (Kwon et al., 2023).

Protocols such as CXL.io (control/MMIO), CXL.mem (load/store memory access), and advanced features from CXL 3.x (accelerated back-invalidate, in-switch coherence directories) play crucial roles in system behavior and reliability.
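The practical difference between the snoop and uncached modes can be illustrated with a toy directory model. In snoop mode, a write by one agent back-invalidates stale copies held by other agents; in uncached mode, remote pages simply bypass the GPU caches and the driver is responsible for any explicit invalidation. The state handling below is a schematic assumption, not the CXL protocol encoding.

```python
# Toy model contrasting CXL.cache (snoop) mode with uncached CXL.mem-only mode.
# State handling is a schematic illustration, not the CXL specification.

class ToyCoherentMemory:
    def __init__(self, snoop_mode=True):
        self.snoop_mode = snoop_mode
        self.memory = {}    # address -> value held by the expander
        self.caches = {}    # agent -> {address: value}, modeling GPU L1/L2 copies

    def read(self, agent, addr):
        cache = self.caches.setdefault(agent, {})
        if addr in cache:
            return cache[addr]              # served from the agent's local cache
        value = self.memory.get(addr, 0)
        if self.snoop_mode:
            cache[addr] = value             # cacheable only when coherence is tracked
        return value                        # uncached mode: always go to the expander

    def write(self, agent, addr, value):
        self.memory[addr] = value
        if self.snoop_mode:
            # Back-invalidate: drop stale copies held by every other agent.
            for other, cache in self.caches.items():
                if other != agent:
                    cache.pop(addr, None)
        # In uncached mode nothing happens here; the driver would flush/invalidate
        # explicitly via invalidate() before another agent's data is consumed.

    def invalidate(self, agent, addr):
        self.caches.get(agent, {}).pop(addr, None)

# Snoop mode: gpu0 never observes a stale value after gpu1 writes the line.
mem = ToyCoherentMemory(snoop_mode=True)
mem.read("gpu0", 0x40)
mem.write("gpu1", 0x40, 7)
assert mem.read("gpu0", 0x40) == 7
```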

3. Performance Models and Latency/Throughput Analysis

Quantitative modeling of CXL-integrated GPU architectures focuses on latency, bandwidth, and queueing phenomena:

  • End-to-End Load Latency:

L_{total} = L_{HBM} + p \cdot (L_{CXL\_link} + L_{exp})

with $L_{HBM} \sim 100$ ns, $L_{CXL\_link} \sim 95$ ns, $L_{exp} \sim 50$ ns, and $p$ the fraction of accesses served from remote CXL memory (Wang et al., 4 Jun 2024). For external CXL devices, the total round-trip latency may increase to 2–3 µs for large configurations (Yang et al., 25 Nov 2025).

  • Bandwidth Aggregation:

\frac{1}{B_{eff}} = \frac{1-p}{B_{HBM}} + \frac{p}{B_{CXL}}

where $B_{HBM}$ and $B_{CXL}$ denote the local and remote peak bandwidths, respectively.

  • Queueing and Saturation: Effective queue depths and outstanding credits per port are critical. For PCIe Gen4, $N_{max} = 768$ requests; maintaining sufficient concurrency enables PCIe-bandwidth-bound operation up to several µs device latency (Sano et al., 2023).
  • Latency-Hiding Techniques: Speculative read (SR) and deterministic store (DS) mechanisms mask backend media latency by prefetching larger granularity blocks or stacking stores in local DRAM during tail events (Gouk et al., 18 Jun 2025).

Performance overheads for real-world AI workloads are modest (e.g., ResNet-50 inference: +4% frame latency, –10% bandwidth with LMB) (Wang et al., 4 Jun 2024). For graph analytics, throughput remains within 1–5% of host DRAM for microsecond-latency CXL devices, provided sufficient alignment and outstanding request space (Sano et al., 2023).
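Plugging the figures above into these models gives a feel for the regime boundaries. The short calculation below uses the quoted latencies and the $N_{max} = 768$ credit limit; the 64 B and 256 B access granularities, the ~32 GB/s PCIe Gen4 x16 link rate, and the HBM/CXL bandwidth figures are assumptions added for illustration.

```python
# Worked example of the latency, bandwidth, and concurrency models above.
# Access granularities and the bandwidth figures are illustrative assumptions.

def load_latency_ns(p, l_hbm=100.0, l_link=95.0, l_exp=50.0):
    """L_total = L_HBM + p * (L_CXL_link + L_exp), all in nanoseconds."""
    return l_hbm + p * (l_link + l_exp)

def effective_bandwidth(p, b_hbm, b_cxl):
    """Harmonic blend: 1/B_eff = (1-p)/B_HBM + p/B_CXL."""
    return 1.0 / ((1.0 - p) / b_hbm + p / b_cxl)

def required_outstanding(bandwidth_bytes_per_s, latency_s, access_bytes):
    """Little's law: requests that must be in flight to sustain the target bandwidth."""
    return bandwidth_bytes_per_s * latency_s / access_bytes

# 20% of loads spilling to CXL memory adds ~29 ns to the average load latency.
print(load_latency_ns(p=0.2))                                  # 129.0 ns

# Blended bandwidth at p = 0.2 with assumed 800 GB/s HBM and 64 GB/s CXL paths.
print(effective_bandwidth(p=0.2, b_hbm=800, b_cxl=64))         # ~242 GB/s

# Saturating an assumed ~32 GB/s Gen4 x16 link at 2 us device latency:
print(required_outstanding(32e9, 2e-6, 64))    # ~1000 requests at 64 B,  above N_max = 768
print(required_outstanding(32e9, 2e-6, 256))   # ~250 requests at 256 B, below N_max = 768
```

The last two lines illustrate why coarser access granularity (or deeper request queues) becomes necessary once device latency reaches the microsecond range.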

4. System Software, Runtime, and Application Design

Software layers must be adapted to expose and utilize a CXL-augmented memory architecture:

  • Kernel/Driver Modifications: APIs such as lmb_CXL_alloc and cudaMallocLMB provide transparent allocation of CXL-backed pages. IOMMU mapping, SPID negotiation, SAT programming, and in-driver page table extension are required (Wang et al., 4 Jun 2024).
  • Unified Memory Management: Systems like Cohet integrate CXL-attached XPU pools (including GPUs) into a shared, cache-coherent address space managed by Linux HMM, exposing malloc/mmap interfaces and transparent NUMA-style allocation (Wang et al., 28 Nov 2025).
  • Orchestration and Placement: For distributed LLM KVCache or MoE inference, index services, prefetch/eviction logic, and custom P2P CUDA kernels drive remote block access and page migration (Yang et al., 25 Nov 2025, Fan et al., 4 Dec 2025). KVCache migration may employ hash-based or round-robin placement, proactive prefetching, and LRU-based HBM eviction; a minimal placement/eviction sketch follows this list.
  • Checkpointing and Failure Recovery: For persistent memory, embedding tables and MLP parameters are actively logged via hardware-accelerated undo logic, with per-batch scheduling to maximize overlap and minimize stall time (Kwon et al., 2023).
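The placement and eviction logic mentioned above can be sketched compactly. The example below combines hash-based placement of KVCache blocks across CXL expanders with proactive prefetch and LRU-based eviction from an HBM-resident cache; class names, block keys, and capacities are assumptions for illustration and do not reproduce the Beluga implementation.

```python
# Minimal sketch of hash-based KVCache block placement across CXL expanders
# with proactive prefetch and LRU-based eviction from HBM. Names are illustrative.
from collections import OrderedDict

class KVCacheTier:
    def __init__(self, num_expanders, hbm_capacity_blocks):
        self.num_expanders = num_expanders
        self.hbm_capacity = hbm_capacity_blocks
        self.hbm = OrderedDict()                               # block_id -> data, recency ordered
        self.remote = [dict() for _ in range(num_expanders)]   # one store per CXL expander

    def _home(self, block_id):
        # Hash-based placement: each block has a fixed home expander.
        return hash(block_id) % self.num_expanders

    def put(self, block_id, data):
        # New blocks land in their home expander; hot blocks get cached in HBM on access.
        self.remote[self._home(block_id)][block_id] = data

    def get(self, block_id):
        if block_id in self.hbm:                               # HBM hit
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        data = self.remote[self._home(block_id)][block_id]     # remote CXL read
        self._cache_in_hbm(block_id, data)
        return data

    def _cache_in_hbm(self, block_id, data):
        if len(self.hbm) >= self.hbm_capacity:
            self.hbm.popitem(last=False)                       # evict least-recently-used block
        self.hbm[block_id] = data

    def prefetch(self, block_ids):
        # Proactive prefetch: pull expected blocks into HBM ahead of their use.
        for block_id in block_ids:
            home = self.remote[self._home(block_id)]
            if block_id not in self.hbm and block_id in home:
                self._cache_in_hbm(block_id, home[block_id])

tier = KVCacheTier(num_expanders=4, hbm_capacity_blocks=2)
tier.put("layer0/seq42", b"kv-block")
tier.prefetch(["layer0/seq42"])
assert tier.get("layer0/seq42") == b"kv-block"
```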

5. Representative Workloads and Empirical Characterization

CXL-integrated GPU architecture is applicable across diverse data center workloads:

  • Large Model Inference and Training: With Beluga, node-local HBM and DRAM pools are extended to rack-scale via CXL switches, allowing multi-terabyte flat address space and outpacing RDMA-based access in LLM KVCache management (>7.35× throughput improvement, ~90% TTFT reduction) (Yang et al., 25 Nov 2025).
  • Mixture-of-Experts (MoE) Inference: CXL-attached NDP offload converts parameter traffic into smaller activation traffic, leveraging context-aware placement and mixed-precision quantization for order-of-magnitude decoding throughput enhancement with negligible accuracy loss (Fan et al., 4 Dec 2025).
  • Graph Analytics: Microsecond-latency external CXL memory, particularly flash-based modules, delivers ~90–95% of DRAM performance at a substantially lower cost per GB. The critical factors are fine-grained alignment, deep concurrency, and cacheline coalescing (Sano et al., 2023).
  • Persistent Recommendation Training: CXL Type-2 cache-coherent PMEM extension supports tens-of-terabyte embedding tables, active computing/checkpointing near data, and achieves >5× performance and ~76% energy savings relative to baseline PMEM systems (Kwon et al., 2023).
  • Processing-Near-Memory for LLMs: PNM accelerators within CXL memory modules handle token-page selection, attention, and partial FC operation, lifting both throughput and energy efficiency in 1M-token, 405B-parameter regimes (Kim et al., 31 Oct 2025).
  • Fine-Grained Remote Operations: Cohet demonstrates 5.5–40.2× speedup for remote atomic ops and ~1.86× for RPC offloading compared to DMA-based PCIe designs via hardware-calibrated CXL.cache simulation (Wang et al., 28 Nov 2025).

6. Design Guidelines, Limitations, and Forward Directions

Empirically derived best practices and trade-offs have emerged:

  • Memory Tiers: Hot data and working sets remain in local HBM; cold and overflow pages are placed in CXL-extended memory buffers or persistent pools (Wang et al., 4 Jun 2024). A small policy sketch follows this list.
  • Prefetch and Stripe: Driver-side snooping of L2 misses and data striping across multiple CXL expanders maximize bandwidth and hide access latency.
  • Coherence vs Simplicity: Hardware cache-coherence (CXL.cache) is recommended for data-parallel kernels but may be disabled for bulk, read-only, or single-writer regimes (software flush/invalidate) (Wang et al., 4 Jun 2024, Yang et al., 25 Nov 2025).
  • Endpoint and Fabric Scaling: Multiple root ports and endpoints enable near-native DRAM performance and predictable scaling, with protocol selection and queue depth tuned per workload (Gouk et al., 18 Jun 2025).
  • Programming Model Simplification: Exposing CXL memory as CUDA-managed, malloc/mmap, or DAX regions reduces user-space and driver complexity, removing RDMA, bounce buffer, or explicit polling overheads (Wang et al., 28 Nov 2025).
  • Direct GPU–CXL Ports: Future switch designs should include native GPU ports to bypass host RC bottlenecks, achieving higher aggregate bandwidth and lower system-level latency (Yang et al., 25 Nov 2025).
  • Persistent Fault Tolerance: Checkpointing to PMEM via CXL with background undo-logging raises reliability and hides persistency overhead (Kwon et al., 2023).
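The tiering and striping guidelines above can be made concrete with a small policy sketch. The thresholds, epoch length, and stripe granularity below are assumptions rather than values from the cited papers: pages that stay hot over an epoch are promoted to HBM, pages that go cold are demoted, and demoted pages are interleaved across expanders to aggregate bandwidth.

```python
# Sketch of an epoch-based hot/cold tiering policy with striping across expanders.
# Thresholds, epoch length, and stripe granularity are illustrative assumptions.

HOT_THRESHOLD = 8          # accesses per epoch before a page is promoted to HBM
NUM_EXPANDERS = 4          # CXL expanders available for striping
STRIPE_BYTES = 4096        # stripe granularity across expanders

class TieringPolicy:
    def __init__(self):
        self.access_counts = {}     # page -> accesses observed this epoch
        self.hbm_pages = set()      # pages currently resident in local HBM

    def record_access(self, page):
        self.access_counts[page] = self.access_counts.get(page, 0) + 1

    def end_epoch(self):
        """Promote hot pages to HBM, demote pages that went cold, reset counters."""
        promotions, demotions = [], []
        for page, count in self.access_counts.items():
            if count >= HOT_THRESHOLD and page not in self.hbm_pages:
                self.hbm_pages.add(page)
                promotions.append(page)
        for page in list(self.hbm_pages):
            if self.access_counts.get(page, 0) < HOT_THRESHOLD:
                self.hbm_pages.discard(page)
                demotions.append(page)
        self.access_counts.clear()
        return promotions, demotions

def stripe_target(page_addr):
    """Interleave demoted pages across expanders at STRIPE_BYTES granularity."""
    return (page_addr // STRIPE_BYTES) % NUM_EXPANDERS

policy = TieringPolicy()
for _ in range(10):
    policy.record_access(page=0x2000)   # heavily reused page, promoted at epoch end
policy.record_access(page=0x9000)       # touched once; stays in CXL memory
promoted, demoted = policy.end_epoch()
assert 0x2000 in promoted
assert stripe_target(0x9000) == 1       # page 9 interleaved onto expander 1 of 4
```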

Key limitations include the lack of hardware-assisted multi-host coherence in CXL 2.0 fabrics, finite PCIe RC bandwidth (~23–33 GB/s per adapter), and single-switch scale boundaries (~8 TB, ~16 hosts per switch) (Yang et al., 25 Nov 2025).

7. Impacts, Controversies, and Prospects

CXL-integrated GPU architectures are redefining physical and logical memory hierarchy boundaries, enabling multi-terabyte, low-latency, cache-coherent memory pools for data-intensive workloads and simplifying system and application software. The approach bridges the gap between expensive HBM and slow flash, supporting disaggregated, scalable models—particularly vital for next-generation AI, graph, and recommendation systems. While the main controversies concern potential coherence storms, endpoint scaling, and the demand for advanced protocol and fabric support (multi-tier, multi-host directory, page-level interleaving), the empirical acceleration over traditional PCIe-DMA, RDMA, and host-DRAM paradigms is robust.

A plausible implication is that, with future generations such as CXL 3.x and Gen6 PCIe switches, direct GPU–CXL ports, and peer-to-peer endpoint access, the CXL-integrated GPU will serve as a foundational substrate for unified compute-memory fabrics in exascale AI and analytics clusters. Continued consolidation of OS, driver, and protocol architectures will be essential to realize this potential.
