Split Buffering & Cross-Memory Offloading
- Split buffering and cross-memory offloading are techniques that use alternating buffers and strategic offloading to mitigate latency and efficiently manage mixed memory hierarchies.
- They enable the overlap of data transfer and computation, which is critical for large-scale LLM training, on-device inference, and memory-intensive applications.
- Empirical benchmarks reveal that optimal buffer management and tensor partitioning can nearly restore DRAM-only performance while minimizing high-latency overhead.
Split buffering and cross-memory offloading are pivotal methodologies for overcoming memory-capacity and performance limitations in systems where large models or data-intensive workloads are mapped across heterogeneous memory hierarchies. These techniques overlap data movement with computation, or with transfers between different memory/storage regions, thus amortizing latency and maximizing hardware utilization. Deployments in LLM training/fine-tuning, on-device inference, and general-purpose CXL-attached memory systems exemplify the architectural and algorithmic principles underlying these approaches.
1. Principles and Mechanisms
Split buffering, also called double buffering, is a mechanism in which two (or more) buffer regions are alternately used for data transfer and consumption, enabling overlap of memory operations with compute or with additional memory transactions. This pipelined approach is especially useful when remote memory transactions—due to PCIe/CXL fabric, NVMe storage, or other external DRAM—incur significant access latency. Cross-memory offloading refers to the strategic partitioning and assignment of tensors, compute kernels, or entire code sections across local (DRAM) and remote (CXL, NVMe) memory domains, exploiting capacity, bandwidth, and latency trade-offs.
Practically, these concepts intertwine: split buffering enables non-blocking use of a high-latency memory region, while cross-memory offloading ensures that only latency-tolerant or capacity-heavy data is routed to slower memory, minimizing overall performance impact (Liaw et al., 4 Jul 2025, Du et al., 4 Mar 2025, Yang, 2023).
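A minimal, platform-independent sketch of the split-buffering idea (the `slow_read` and `compute` functions are hypothetical placeholders, not APIs from the cited systems):

```python
# Minimal split-buffering sketch: two buffers alternate between being filled
# from a slow memory tier and being consumed by compute.
import threading

def slow_read(chunk_id, buf):
    """Stand-in for a high-latency fetch (CXL/NVMe) into `buf`."""
    ...

def compute(buf):
    """Stand-in for local work on an already-resident buffer."""
    ...

def double_buffered_loop(num_chunks, buffers):
    slow_read(0, buffers[0])                 # prime the pipeline with chunk 0
    for i in range(num_chunks):
        cur, nxt = buffers[i % 2], buffers[(i + 1) % 2]
        t = None
        if i + 1 < num_chunks:
            # Start fetching chunk i+1 into the idle buffer...
            t = threading.Thread(target=slow_read, args=(i + 1, nxt))
            t.start()
        compute(cur)                         # ...while computing on chunk i
        if t is not None:
            t.join()                         # handoff: the next buffer is ready
```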
2. System Architectures and Data Paths
Several recent system architectures provide canonical exemplars:
- LLM fine-tuning with CXL-attached memory (Liaw et al., 4 Jul 2025):
- GPU DRAM is used for matrix and attention kernel execution.
- CPU DRAM (e.g., 512 GiB DDR5) hosts latency-critical parameters, gradients, and optimizer states.
- CXL-AIC memory (PCIe attached, high capacity) stores capacity-heavy, latency-tolerant activations and bf16 tensor copies.
- Data movement is orchestrated through concurrent streams: cudaMemcpyAsync for GPU↔CPU/CXL transfers, across tiers of memory bandwidth and access latency (local DRAM: 80–140 ns; CXL: 170–250 ns; PCIe DMA ∼64 GiB/s). A stream-level sketch follows the table below.
- On-device LLM inference with secondary storage (Du et al., 4 Mar 2025):
- On-device DRAM is tightly rationed and managed, reserving static "locked" and dynamic prefetch regions.
- Secondary storage (SSD/NVMe) holds the full parameter set, dynamically paging in required tensors using asynchronous prefetching and split buffering.
- CXL hardware codesign (Yang, 2023):
- Host CPUs (e.g., BOOMv3 RISC-V cores) are extended to offload memory accesses via split-buffered ring descriptors.
- The CXL endpoint (either an additional small core or a smart engine) issues and fulfills these memory requests, returning completions directly into L1 with hardware notification via mailbox registers and re-order buffer (ROB) bits.
| Architecture | Local Memory | Offload Target | Split Buffer Region |
|---|---|---|---|
| LLM + CXL (Liaw et al., 4 Jul 2025) | CPU DRAM (low-latency) | CXL-attached DRAM | Activations (Aᵏ) |
| On-device LLM (Du et al., 4 Mar 2025) | Device DRAM | SSD/NVMe | DRAM prefetch window |
| CXL codesign (Yang, 2023) | Host core L1 | CXL memory endpoint | L2 ring buffers |
Each architecture tunes access scheduling and buffer sizing to pipeline remote accesses and minimize the exposed access latency.
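The sketch below approximates the concurrent-stream data path described above at the PyTorch level. The pinned host tensors standing in for CXL-resident data, the two GPU staging buffers, and the `stages` callables are all illustrative assumptions, not the cited system's API.

```python
# PyTorch-level approximation of the cudaMemcpyAsync pattern: while stage i
# runs on the default stream, stage i+1's tensor is copied host-to-device on
# a dedicated copy stream.
import torch

copy_stream = torch.cuda.Stream()   # dedicated stream for async H2D transfers

def run_with_prefetch(stages, x, host_ckpts, dev_bufs):
    compute = torch.cuda.current_stream()
    # Prime the pipeline: stage 0's tensor starts copying immediately.
    with torch.cuda.stream(copy_stream):
        dev_bufs[0].copy_(host_ckpts[0], non_blocking=True)
    for i, stage in enumerate(stages):
        compute.wait_stream(copy_stream)        # tensor i has finished arriving
        if i + 1 < len(stages):
            # The staging buffer being overwritten was last read by stage i-1,
            # which is already enqueued on the compute stream.
            copy_stream.wait_stream(compute)
            with torch.cuda.stream(copy_stream):
                dev_bufs[(i + 1) % 2].copy_(host_ckpts[i + 1], non_blocking=True)
        x = stage(x, dev_bufs[i % 2])           # stage i overlaps with copy i+1
    return x
```

Pinned (page-locked) host buffers are assumed so that the `non_blocking=True` copies are truly asynchronous.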
3. Buffer Management and Overlapping Strategies
Split buffering mechanisms are tailored to the platform's memory hierarchy and workload characteristics:
- Double buffering for activation checkpoints in LLM tuning:
- Alternating between buffer arrays (A⁰, A¹) in CXL, while overlapping cudaMemcpyAsync prefetch operations with compute on the GPU.
- The forward pass can prefetch block i+1 while computing block i, hiding CXL round trip latency for each checkpoint (Liaw et al., 4 Jul 2025).
- Tensor-level split buffering in inference:
- In FlexInfer, IO threads maintain a sliding prefetch window (size k) of dynamic tensors, asynchronously issuing SSD reads into the DRAM buffer, while compute threads process each tensor as soon as it becomes ready (sketched after this list).
- The dynamic-tensor DRAM footprint shrinks to approximately (k/N) × model size, with further gains from balanced memory locking (splitting layer parameters evenly between static and dynamic allocation) (Du et al., 4 Mar 2025).
- Ring-buffered offloads in hardware:
- CXLMemUring instantiates separate request and completion buffers; while one fills, the other drains, with the host and endpoint switching roles on wraparound and overlapping remote load streams with local compute. Profiling-guided autotuning selects an optimal window size N, targeting Tcomp(N) ≈ Loff (local compute time ≈ remote access latency) (Yang, 2023).
Pseudocode patterns across these systems instantiate IO and compute as concurrent streams, with atomic counters or completion signals coordinating per-buffer or per-chunk handoff.
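As a concrete instance of this pattern, the following sketch implements a sliding prefetch window of size k with an IO thread and a bounded queue, in the spirit of the FlexInfer description; the loading and layer-execution functions are assumed placeholders, not the paper's actual interfaces.

```python
# Sliding-window prefetch sketch: an IO thread reads dynamic tensors from SSD
# up to k layers ahead of the compute thread.
import queue
import threading

def prefetch_worker(layer_ids, load_tensor_from_ssd, window):
    for lid in layer_ids:
        tensor = load_tensor_from_ssd(lid)   # high-latency SSD/NVMe read
        window.put((lid, tensor))            # blocks once k tensors are buffered

def run_inference(layer_ids, load_tensor_from_ssd, run_layer, x, k=4):
    window = queue.Queue(maxsize=k)          # bounds dynamic DRAM to ~k layers
    io = threading.Thread(target=prefetch_worker,
                          args=(layer_ids, load_tensor_from_ssd, window))
    io.start()
    for _ in layer_ids:
        lid, tensor = window.get()           # ready as soon as its IO completes
        x = run_layer(lid, tensor, x)        # compute overlaps with ongoing IO
    io.join()
    return x
```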
4. Offload Partitioning and Data Placement
Performance hinges on judicious offloading decisions; direct mapping of all large tensors or compute onto high-latency memory often incurs large penalties:
- Latency-critical vs. capacity-heavy split:
- Full-precision parameters, gradients, and optimizer states reside in local/low-latency memory (CPU DRAM), while checkpointed activations and intermediate fp16/bf16 copies are relegated to CXL or SSD (Liaw et al., 4 Jul 2025, Du et al., 4 Mar 2025).
- Greedy and heuristic assignment:
- FlexInfer's tensor-preservation problem is formalized as a knapsack: for a DRAM budget M, select tensors maximizing saved IO, subject to ∑_i size(i) ≤ M. The algorithm prefers large, compute-intensive FFN weights for locking, then attention matrices, following empirically measured per-workload benefit (Du et al., 4 Mar 2025); a greedy sketch follows at the end of this section.
- DAG-based cost minimization:
- In UDON, compute partitioning employs a weighted-sum objective: for each operation i, the placement decision is chosen to minimize C(i) = α·T_op(i) + β·T_mem(i), where T_op(i) is the op latency, T_mem(i) the memory-movement cost, and α, β are workload-specific weights (Hermes et al., 3 Apr 2024).
Judicious partitioning, informed by dynamic profiling, maximally exploits fast memory while constraining remote access to latency-tolerant workloads.
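The following greedy sketch illustrates the knapsack-style locking decision from the list above; the `Tensor` record and its fields are assumptions for illustration, not the paper's data structures.

```python
# Greedy sketch of the locking decision: within DRAM budget M, lock the tensors
# with the highest IO saved per byte, preferring FFN weights as described above.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int        # bytes occupied if locked in DRAM
    io_saved: int    # bytes of SSD traffic avoided per pass when locked
    is_ffn: bool     # FFN weights are preferred over attention matrices

def choose_locked_tensors(tensors, budget_M):
    # Rank by IO saved per byte of budget, breaking ties in favour of FFN.
    ranked = sorted(tensors,
                    key=lambda t: (t.io_saved / t.size, t.is_ffn),
                    reverse=True)
    locked, used = [], 0
    for t in ranked:
        if used + t.size <= budget_M:
            locked.append(t)
            used += t.size
    return locked    # everything else stays dynamic and is prefetched on demand
```

An exact knapsack solver could replace the greedy ranking; the value-density heuristic simply mirrors the preference order stated in the text.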
5. Performance Impact and Empirical Benchmarks
Comprehensive empirical results across platforms demonstrate the critical performance consequences of split buffering and allocation optimizations:
- LLM Fine-Tuning with CXL (Liaw et al., 4 Jul 2025):
- Naive tensor interleaving in CXL reduces throughput to 72–94% of DRAM-only baseline for 7B models.
- Application of CXL-aware allocation recovers 97–99% throughput; in multi-AIC (dual-card) striping, performance returns to or exceeds DRAM-only baseline.
- STEP-heavy (optimizer) phases are most sensitive; FWD/BWD phases less so.
- On-Device LLM Inference (FlexInfer) (Du et al., 4 Mar 2025):
- Under severe DRAM constraints, split buffering and asynchronous offload yield up to 12.5× speedups over an mmap baseline.
- Prefetching alone gives 35–60% gain; balanced locking provides a further 10–80% as memory budget increases.
- A properly chosen prefetch window k is sufficient to hide IO stalls in all but extremely IO-bound regimes.
- CXL Memory Pool Hardware Co-design (Yang, 2023):
- Hardware split buffering (ring buffers, mailbox notification) reduces average load stalls by up to 80%, with hardware area cost ~1–2%.
- Profiling-guided window sizing and autotuning converges to maximal overlap with minimal overhead, yielding up to 2× speedup on pointer-chasing benchmarks.
| Optimization | Benchmark | Performance vs. baseline | Notes |
|---|---|---|---|
| CXL-aware allocation | LLM fine-tune | 97–99% (single-AIC) | Full recovery in STEP phase |
| Multi-AIC striping | LLM fine-tune | 99–101% (dual-GPU) | Full bandwidth recovery |
| FlexInfer async prefetch | LLM inference | up to 12.5× (tight DRAM) | Prefetch + balanced locking |
| CXLMemUring (split buffering) | Pointer-chasing | 2× speedup | 80% average stall reduction |
These findings demonstrate the necessity of both scheduling overlap and memory-aware data placement for maintaining scalability in high-capacity, remote-access memory systems.
6. Design Trade-offs, Advanced Practices, and Remaining Challenges
The implementation of split buffering and cross-memory offloading involves hardware–software co-design trade-offs and ongoing algorithmic advances:
- Hardware support includes minimal extensions: mailbox registers and ROB "complete" bits allow lightweight notification with minimal in-core area overhead; endpoint processors run stripped-down kernels for address calculation near memory (Yang, 2023).
- For offload orchestration, profiling-guided code generation or autotuning (window sizing based on observed Tcomp vs. Loff) maximizes effective overlap, but introduces warm-up/tuning overhead (~5% during startup) (Yang, 2023).
- Buffer sizing is a tuning parameter: undersized buffers leave potential compute overlap unexploited, while oversized buffers waste resources and management cycles. Optimal sizes (32–64 entries) are established empirically (Yang, 2023); see the sketch after this list.
- Flexible tensor preservation and balanced locking avoid phase-driven compute/IO stalls and synchronize memory pressure across the inference pipeline (Du et al., 4 Mar 2025).
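A sketch of the profiling-guided window sizing mentioned above, assuming hypothetical measurement hooks in place of real hardware counters or microbenchmarks:

```python
# Grow the ring-buffer window N until measured local compute per window covers
# the observed remote access latency, i.e. Tcomp(N) >= Loff.
def tune_window(measure_compute_time, measure_offload_latency,
                candidates=(8, 16, 32, 64, 128)):
    L_off = measure_offload_latency()         # average remote round-trip time
    for N in candidates:
        if measure_compute_time(N) >= L_off:  # offload is now fully hidden
            return N                          # smallest fully overlapping window
    return candidates[-1]                     # cap: larger windows waste buffers
```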
Current limitations include the reliance on offline profiling (e.g., UDON (Hermes et al., 3 Apr 2024)) and, in some systems, the absence of runtime-dynamic partitioning or adaptive streaming. Further extensions under development include automatic runtime layer annotation, on-device microkernel OS, and hardware DMA for prefetching (Hermes et al., 3 Apr 2024).
7. Broader Impact and Future Directions
Split buffering and cross-memory offloading constitute generalizable patterns applicable to large-scale model training, inference, and general memory-intensive workloads. Their efficacy across LLM fine-tuning, database kernel offload, and pointer-chasing tasks highlights broad utility.
The increasing prevalence of composable memory architectures (e.g., CXL Type-2 devices with compute clusters) will further expand the potential for compute/data co-location and intelligent scheduling of split-buffered offloads. Key target areas for ongoing research include dynamic partitioning, hardware–software interface standardization, support for streaming and pipelined remote DMA engines, and the integration of these primitives into mainstream ML and data-processing frameworks (Liaw et al., 4 Jul 2025, Du et al., 4 Mar 2025, Yang, 2023, Hermes et al., 3 Apr 2024).