CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies (2506.15601v1)

Published 18 Jun 2025 in cs.AR

Abstract: This work introduces a GPU storage expansion solution utilizing CXL, featuring a novel GPU system design with multiple CXL root ports for integrating diverse storage media (DRAMs and/or SSDs). We developed and siliconized a custom CXL controller integrated at the hardware RTL level, achieving two-digit nanosecond roundtrip latency, the first in the field. This study also includes speculative read and deterministic store mechanisms to efficiently manage read and write operations to hide the endpoint's backend media latency variation. Performance evaluations reveal our approach significantly outperforms existing methods, marking a substantial advancement in GPU storage technology.

Summary

  • The paper introduces a novel CXL-integrated GPU design that uses a custom silicon-based CXL controller to extend GPU memory and reduce latency.
  • It demonstrates significant performance gains over UVM and GDS, achieving a 44.2× average speedup with DRAM EPs and improved efficiency with SSD-based EPs.
  • It proposes two key optimizations—Speculative Read and Deterministic Store—to effectively prefetch data and mitigate tail latency in write operations.

The increasing size of large-scale deep learning models like LLMs and mixtures of experts presents a significant challenge: their memory demands often exceed the capacity of modern GPUs. While techniques like parallelism across multiple GPUs help, they don't fully solve the problem of fitting all necessary data (parameters, gradients, intermediate buffers) within GPU memory. Existing solutions like NVIDIA's GPUDirect Storage (GDS) [nvidia2022gpudirectstorage] allow GPUs to directly access SSDs, but they require complex low-level file system management and manual data transfers, complicating the programming model. Unified Virtual Memory (UVM) [nvidia2022cuda] simplifies programming by providing a shared virtual address space between CPU and GPU and handling automatic page migration. However, UVM suffers from high latency due to host runtime intervention for page faults.

This paper (2506.15601) proposes integrating Compute Express Link (CXL) technology directly into GPUs to expand memory capacity efficiently. CXL allows endpoint (EP) devices such as DRAM or SSDs to be mapped into a cacheable memory space accessible by the host, enabling compute units to access these resources using standard memory requests. The core challenge is that GPUs lack native CXL logic. To address this, the paper introduces a novel GPU system design incorporating multiple CXL root ports and a custom CXL controller integrated at the hardware Register-Transfer Level (RTL). The controller is fabricated in silicon and achieves two-digit nanosecond round-trip latency, significantly faster than previously reported prototypes.

The custom CXL controller is designed to support CXL 3.1 while maintaining backward compatibility with CXL 2.0/1.1. It incorporates a Flex Bus physical layer integrated with a PCIe physical coding sublayer, allowing seamless support for both the PCIe and CXL layer stacks. An arbitrator state machine manages resource allocation between PCIe and CXL tasks. The controller has been successfully integrated into hardware RTL implementations of both memory-expander and GPU/CPU prototypes. Compared with other CXL controller prototypes such as SMT [kim2023smt] and TPP [maruf2023tpp], this silicon-based controller achieves more than three times lower round-trip latency, owing to optimizations spanning the physical, link, and transaction layers.

The CXL root complex, along with a host bridge featuring multiple root ports, is integrated into the GPU architecture (demonstrated using the Vortex RISC-V-based GPU framework [tine2021vortex]). This host bridge includes an HDM decoder that maps system memory address ranges (Host Physical Addresses or HPAs) for each connected CXL EP. During initialization, firmware identifies CXL EPs, aggregates their memory spaces via HDM capability registers, and records this information in the HDM decoder. The GPU's system bus memory map is structured to include these CXL-attached memory segments. When a GPU compute unit issues a memory request targeting this segment, the CXL root complex translates it into a CXL flit, uses the HDM decoder to route it, and forwards it to the corresponding root port and controller.
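
To make the routing step concrete, here is a minimal Python sketch of how an HDM decoder could map host physical address (HPA) ranges to root ports after firmware enumeration. The class, field names, and address values are illustrative assumptions, not details taken from the paper's RTL.

```python
# Illustrative sketch (not the paper's RTL): an HDM decoder mapping host physical
# addresses (HPAs) to CXL root ports after firmware enumerates the endpoints.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HdmEntry:
    base: int       # start of the HPA range assigned to this endpoint
    size: int       # size of the endpoint's exposed memory (from HDM capability registers)
    root_port: int  # root port the endpoint sits behind

class HdmDecoder:
    def __init__(self) -> None:
        self.entries: List[HdmEntry] = []

    def register_endpoint(self, base: int, size: int, root_port: int) -> None:
        """Record an endpoint's HPA window, as firmware would at initialization."""
        self.entries.append(HdmEntry(base, size, root_port))

    def route(self, hpa: int) -> Optional[int]:
        """Return the root port whose endpoint owns this address, or None if it
        falls outside the CXL-attached segment (i.e., local GPU memory)."""
        for e in self.entries:
            if e.base <= hpa < e.base + e.size:
                return e.root_port
        return None

# Example: two endpoints (a DRAM EP and an SSD EP) behind different root ports.
decoder = HdmDecoder()
decoder.register_endpoint(base=0x1_0000_0000, size=16 << 30, root_port=0)  # 16 GiB DRAM EP
decoder.register_endpoint(base=0x5_0000_0000, size=1 << 40, root_port=1)   # 1 TiB SSD EP
assert decoder.route(0x1_0000_1000) == 0
assert decoder.route(0x6_0000_0000) == 1
```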

To mitigate the latency of backend storage media in CXL EPs (even with the low-latency controller), the paper proposes two key optimization strategies:

  1. Speculative Read (SR): This technique utilizes the MemSpecRd feature in CXL 2.0 to prefetch data likely to be accessed soon. The implementation adds queue logic consisting of an SR queue and a memory queue. Incoming load requests are added to the SR queue, generating MemSpecRd operations. The address format is modified to aggregate multiple memory requests (up to 4) into a single 256B-granular MemSpecRd, using the two least significant bits to encode the length. The SR reader module issues these requests and records their addresses in a ring buffer. If a subsequent request matches a speculative address, it proceeds as a standard memory request. To prevent overwhelming EPs, the CXL controller uses CXL Quality of Service (QoS) telemetry (specifically the DevLoad field) to monitor EP workload and dynamically adjust the frequency and granularity of SR requests. A high DevLoad reduces SR traffic, while a low DevLoad increases granularity (up to 1024B) to improve prefetching efficiency. To prevent internal DRAM pollution in SSD-based EPs due to incorrect prefetch directions (e.g., reverse array access), an address window control mechanism analyzes requests in the SR and memory queues to determine an optimal address range for SR requests, rounding to the nearest 256B boundary. (A simplified sketch of both SR and DS follows this list.)
  2. Deterministic Store (DS): This strategy addresses the variability and potential tail latency of write operations, especially in SSD-based EPs caused by internal tasks like garbage collection. When a write operation to an SSD is initiated, the request is concurrently sent to both a reserved space in GPU memory and the SSD. The request is immediately completed from the perspective of the compute unit ("fire-and-forget"). If the SSD experiences a delay (detected before the next write request arrives), the data is temporarily buffered in the GPU memory's reserved address using a stack structure. An address list in the system bus's internal SRAM tracks the location of buffered data. This stack is flushed to the SSD in the background when the SSD becomes available. This shields the GPU from write latency variations. For internal SSD tasks that temporarily reduce throughput, the controller monitors the DevLoad field and temporarily suspends issuing new write requests to the affected port, buffering them in GPU memory instead. Once DevLoad indicates capacity has returned, the buffered writes resume. Read requests for data temporarily buffered in GPU memory are served directly from GPU memory to avoid congestion at the EP ingress port.
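
The two sketches below model these mechanisms in plain Python: SR's request aggregation with DevLoad-based throttling, and DS's fire-and-forget write buffering. Class names, queue depths, and the DevLoad threshold are assumptions for illustration (DevLoad is treated as a 0-3 code from light load to severe overload); the actual design lives in the controller's RTL.

```python
# Illustrative sketch (not the paper's RTL) of Speculative Read (SR) queue logic:
# up to four pending 64B loads are packed into one 256B-granular MemSpecRd whose two
# least-significant address bits encode the length, and issued addresses are kept in
# a small ring buffer so later demand loads can be matched against them.
from collections import deque
from typing import Optional

SR_GRANULE = 256   # MemSpecRd granularity (bytes)
RING_SIZE = 64     # illustrative ring-buffer depth

class SpeculativeReader:
    def __init__(self) -> None:
        self.sr_queue = deque()            # pending load addresses awaiting SR
        self.ring = [None] * RING_SIZE     # speculated 256B-aligned addresses
        self.head = 0

    def enqueue_load(self, addr: int) -> None:
        self.sr_queue.append(addr)

    def issue_memspecrd(self, devload: int) -> Optional[int]:
        """Pack up to four queued loads into one MemSpecRd; back off entirely when
        the endpoint reports overload via the CXL QoS DevLoad field."""
        if devload >= 2 or not self.sr_queue:   # assumed threshold: 2+ means overloaded
            return None
        batch = [self.sr_queue.popleft() for _ in range(min(4, len(self.sr_queue)))]
        base = (min(batch) // SR_GRANULE) * SR_GRANULE   # round down to a 256B boundary
        length = len(batch) - 1                          # 0..3 fits in the two LSBs
        self.ring[self.head] = base
        self.head = (self.head + 1) % RING_SIZE
        return base | length                             # two LSBs carry the length code

    def hits_speculation(self, addr: int) -> bool:
        """A demand load whose 256B block was already speculated proceeds as a
        normal memory request (its data should already be staged at the EP)."""
        return ((addr // SR_GRANULE) * SR_GRANULE) in self.ring
```

A similarly simplified view of Deterministic Store (DS), again with assumed names and thresholds:

```python
# Illustrative sketch of Deterministic Store (DS): writes complete immediately from
# the compute unit's perspective; when DevLoad signals a busy SSD EP (e.g., garbage
# collection), writes are parked in a reserved GPU-memory stack plus an address list,
# flushed in the background once the EP recovers, and reads of parked data are served
# from GPU memory.
from typing import Optional

class DeterministicStore:
    def __init__(self) -> None:
        self.parked = []        # stack of (addr, data) held in reserved GPU memory
        self.addr_list = set()  # address list kept in the system bus's internal SRAM
        self.ssd = {}           # stand-in for the SSD endpoint's backing store

    def write(self, addr: int, data: bytes, devload: int) -> None:
        """Fire-and-forget store: park the write if the EP is busy, else pass it through."""
        if devload >= 2:                      # assumed "EP busy" threshold
            self.parked.append((addr, data))
            self.addr_list.add(addr)
        else:
            self.ssd[addr] = data

    def background_flush(self, devload: int) -> None:
        """Drain parked writes to the SSD once DevLoad indicates capacity has returned."""
        while self.parked and devload < 2:
            addr, data = self.parked.pop()
            self.ssd[addr] = data
            self.addr_list.discard(addr)

    def read(self, addr: int) -> Optional[bytes]:
        """Serve reads of parked data from GPU memory to avoid EP ingress congestion."""
        if addr in self.addr_list:
            for a, d in reversed(self.parked):
                if a == addr:
                    return d
        return self.ssd.get(addr)
```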

The effectiveness of the proposed CXL-integrated GPU and optimization techniques was evaluated using a simulator that models the hardware prototype's behavior, leveraging data from real workloads on the Vortex GPU and RTL simulations. Memory latencies were derived from DRAMSim3 [li2020dramsim3] and the characteristics of various backend media (DRAM, Optane, Z-NAND, NAND), while bus latencies were measured from the ASIC. Evaluation configurations included UVM, GPUDirect Storage (GDS), CXL (baseline), CXL-SR, CXL-DS, and an ideal GPU-DRAM baseline. Workloads included selected Rodinia benchmarks [che2009rodinia] and real-world gnn and mri applications, categorized by memory access pattern (compute-, load-, or store-intensive).

Performance evaluations show significant improvements. Compared to UVM, the CXL baseline (with a DRAM EP) achieves a substantial speedup (a 44.2× average improvement), dramatically reducing the host runtime overhead associated with page faults. For SSD-based EPs (using Z-NAND), CXL-SR provides an average performance improvement of 7.4× over the CXL baseline by preloading data into the EP's internal DRAM, especially benefiting workloads with sequential or locality-aware access patterns (e.g., 1D/2D array computations such as vadd, saxpy, gemm, and conv3). CXL-DS further improves performance, particularly for store-intensive workloads (e.g., bfs), by mitigating tail latency caused by internal SSD tasks. For a write-intensive workload (bfs) with Z-NAND, CXL-DS prevents the performance collapse observed in CXL-SR during garbage collection by buffering writes in GPU memory, thus avoiding ingress queue congestion and preventing the recurrence of frequent GCs. The paper shows that SR and DS effectively hide backend media latency for Optane, Z-NAND, and NAND, with SR providing average gains of 7.1×, 8.8×, and 10.1× respectively across these media types.

The implementation of these concepts involves developing the custom CXL controller RTL, integrating it into the GPU architecture at the RTL level, and developing the corresponding firmware for EP initialization and memory mapping. The SR and DS mechanisms are implemented within the CXL controller's queue logic, requiring hardware design for address window control, queue management, and DevLoad monitoring for dynamic adjustment of SR requests and handling of buffered stores. The evaluation relies on a detailed simulator calibrated with hardware measurements.

In summary, this work demonstrates a practical approach to extending GPU memory capacity using CXL by designing a CXL-integrated GPU architecture and developing a high-performance, silicon-based CXL controller. The proposed speculative read and deterministic store mechanisms address the performance challenges posed by slower backend storage media in CXL EPs, offering significant performance gains compared to existing UVM and GDS approaches. This research represents a substantial step towards enabling larger AI models and more complex workloads on GPUs by overcoming traditional memory limitations through efficient CXL integration.