Linked Memory Buffer (LMB) in Learning & Hardware

Updated 4 March 2026
  • Linked Memory Buffer (LMB) is a memory structuring strategy that unifies prototype-linked sub-buffers for continual learning and CXL-attached DRAM for hardware, enhancing efficiency and resilience.
  • By leveraging Sinkhorn-based optimal transport and a Divide-and-Conquer update, LMB mitigates catastrophic forgetting and accelerates training by over 2×.
  • Empirical evaluations show that LMB expands device memory capacity with minimal IOPS loss relative to ideal on-board DRAM and maintains sub-microsecond tail latency where demand-based FTL alternatives do not.

The Linked Memory Buffer (LMB) encompasses two distinct technical constructs unified by the goal of augmenting memory efficiency and performance in dynamic computing environments. In online continual learning, LMB designates a cluster-prototype–linked sub-buffer architecture that preserves semantically rich exemplars and combats catastrophic forgetting. In hardware systems, LMB refers to decoupled DRAM modules for PCIe endpoints, accessed via CXL memory expanders to overcome on-board DRAM limitations in data-centric devices. Both approaches illustrate a strategy of structuring intermediate memory to optimize access, diversity, and capacity under system constraints (Dai et al., 23 May 2025, Wang et al., 2024).

1. Conceptual Foundations of the Linked Memory Buffer

LMB in the continual learning domain arises within a dynamic dual-buffer memory framework. Inspired by biological memory systems that distinguish a rapid, adaptive short-term component (analogous to the hippocampus) from a high-capacity, long-term store (mirroring the neocortex), LMB partitions memory into class-prototype–linked sub-buffers. In high-performance device architecture, LMB leverages the Compute Express Link (CXL) protocol to supplement limited on-board device DRAM in SSDs, GPUs, and DPUs with pooled, off-device memory expanders, effectively presenting a unified memory plane with a tunable latency overhead.

2. Architecture and Operational Principles in Continual Learning

The LMB structure, as articulated for online continual learning, is delineated by the following components and operational flow:

  • Short-Term Buffer: Maintains recent data samples through reservoir sampling and is dynamically sized to accommodate $\lambda_{\max}$ total memory slots, favoring rapid adaptation.
  • Long-Term LMB: Constitutes a set $\mathcal{M}_i^{\text{long}}$ of sub-memory buffers, each indexed by a cluster prototype $x_h^{(c_j)}$ paired to class $c_j$. Each sub-buffer $\mathcal{M}(c_j, h)$ contains up to $\lambda$ exemplars.
  • Cluster Prototype Selection: For each class, prototypes are selected by solving the K-means objective:

$$\min_{\{c_1, \dots, c_k\}} \sum_{x \in D_i(c_j)} \min_{h=1,\dots,k} \|x - c_h\|^2$$

Prototypes are updated online via a running mean:

$$c_h \leftarrow (1-\eta)\, c_h + \eta\, x_{\text{new}}$$
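To make the prototype maintenance concrete, here is a minimal NumPy sketch of K-means seeding followed by the running-mean update; the function names, learning rate `eta`, and iteration count are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def init_prototypes(feats: np.ndarray, k: int, n_iters: int = 10,
                    seed: int = 0) -> np.ndarray:
    """Seed k prototypes for one class with plain Lloyd's K-means."""
    rng = np.random.default_rng(seed)
    protos = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assign every sample to its nearest prototype ...
        assign = np.linalg.norm(feats[:, None] - protos[None], axis=-1).argmin(axis=1)
        # ... and move each prototype to the mean of its members.
        for h in range(k):
            members = feats[assign == h]
            if len(members):
                protos[h] = members.mean(axis=0)
    return protos

def update_prototype(protos: np.ndarray, x_new: np.ndarray, eta: float = 0.05) -> None:
    """Online running-mean update: c_h <- (1 - eta) c_h + eta x_new."""
    h = np.linalg.norm(protos - x_new, axis=1).argmin()  # nearest prototype
    protos[h] = (1 - eta) * protos[h] + eta * x_new
```

The in-place update lets prototypes track the class distribution as the stream drifts, without re-running K-means on every arrival.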

  • Optimal-Transport-Based Optimization: To maximize diversity, candidate exemplars are chosen via Sinkhorn-regularized optimal transport (OT) matching between samples and prototypes:

$$\min_{P \in U_\alpha(a, b)} \langle P, M \rangle$$

with the coupling matrix $P$ obtained by alternating scaling of the exponentiated cost matrix.

  • Sample Selection within Sub-Buffers: For each prototype, samples are ranked by Sinkhorn-based scores $f_\alpha(x_i, c_j)$, and the top-$\lambda$ samples are retained within $\mathcal{M}(c_j, h)$.
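The alternating-scaling solver and the top-$\lambda$ retention step can be sketched in a few lines of NumPy; `a` and `b` are the marginals over samples and prototypes, `M` the pairwise cost matrix, and `alpha` the entropic regularizer (names assumed for illustration; the paper's exact scoring may differ):

```python
import numpy as np

def sinkhorn(M: np.ndarray, a: np.ndarray, b: np.ndarray,
             alpha: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropy-regularized OT: alternately rescale K = exp(-M / alpha)
    so the coupling's row/column sums match the marginals a and b."""
    K = np.exp(-M / alpha)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)  # match prototype marginal b
        u = a / (K @ v)    # match sample marginal a
    return u[:, None] * K * v[None, :]  # coupling matrix P

def top_lambda(P: np.ndarray, h: int, lam: int) -> np.ndarray:
    """Keep the lam samples carrying the most transport mass to prototype h."""
    return np.argsort(P[:, h])[::-1][:lam]
```

Each column of $P$ then plays the role of a per-prototype score vector, so a sub-buffer $\mathcal{M}(c_j, h)$ simply retains the indices returned by `top_lambda`.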

This linked buffer design preserves the distributional breadth of encountered data and is resilient to mode collapse, enabling the system to rehearse on a semantically diverse sample mix and mitigate catastrophic forgetting (Dai et al., 23 May 2025).
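As a structural illustration of the short-term/long-term split described above, a minimal sketch might look as follows; the class name, fields, and sizing are assumptions for exposition, not the published implementation:

```python
import random
from collections import defaultdict

class DualBuffer:
    def __init__(self, lambda_max: int, lam: int):
        self.lambda_max = lambda_max  # short-term capacity (lambda_max slots)
        self.lam = lam                # per-sub-buffer capacity (lambda exemplars)
        self.short_term = []          # recent samples, reservoir-maintained
        self.n_seen = 0
        # Long-term LMB: (class, prototype index) -> up to lam exemplars.
        self.long_term = defaultdict(list)

    def observe(self, x):
        """Reservoir sampling: keep a uniform sample of the stream so far."""
        self.n_seen += 1
        if len(self.short_term) < self.lambda_max:
            self.short_term.append(x)
        else:
            j = random.randrange(self.n_seen)
            if j < self.lambda_max:
                self.short_term[j] = x

    def consolidate(self, cls, proto_idx, ranked_samples):
        """Move the top-lam exemplars (e.g. by Sinkhorn score) into one sub-buffer."""
        self.long_term[(cls, proto_idx)] = list(ranked_samples)[: self.lam]

    def rehearsal_batch(self):
        """A semantically diverse mix: all long-term exemplars plus recent samples."""
        pooled = [x for buf in self.long_term.values() for x in buf]
        return pooled + self.short_term
```

Rehearsing on `rehearsal_batch()` mixes exemplars from every prototype-linked sub-buffer with recent stream samples, which is the property the text credits for resisting mode collapse.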

3. Divide-and-Conquer Memory Update Optimization

The LMB introduces a computationally efficient Divide-and-Conquer (DAC) strategy for memory updates:

  • Baseline Complexity: Naive updates compute $O(Nk)$ distances for $N$ points to $k$ clusters, requiring $O(Nk)$ memory.
  • DAC Strategy: The DAC recursively partitions data into $K \gg k$ pre-clusters, computes inter-cluster Sinkhorn distances ($O(N^2/K^2)$), merges the closest pairs under minimum-size constraints, and recurses until $k$ clusters remain. Setting $K \approx \sqrt{N}$ yields quadratic but practically manageable complexity with recursion depth $O(\log_k N)$. Empirically, training accelerates by over $2\times$ without accuracy compromise (Dai et al., 23 May 2025).
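A schematic of the divide-and-conquer idea, simplified to a greedy merge with Euclidean centroid distances standing in for inter-cluster Sinkhorn distances, and omitting the minimum-size constraint (so a sketch of the idea, not the paper's algorithm):

```python
import numpy as np

def dac_cluster(X: np.ndarray, k: int, seed: int = 0) -> list:
    """Partition N points into K ~ sqrt(N) pre-clusters, then repeatedly
    merge the closest pair of clusters until only k remain."""
    K = max(k, int(np.sqrt(len(X))))
    rng = np.random.default_rng(seed)
    # Cheap pre-clustering by random partition; a real implementation
    # would pre-cluster more carefully before merging.
    parts = np.array_split(rng.permutation(len(X)), K)
    clusters = [X[p] for p in parts]
    while len(clusters) > k:
        cents = np.stack([c.mean(axis=0) for c in clusters])
        d = np.linalg.norm(cents[:, None] - cents[None], axis=-1)
        np.fill_diagonal(d, np.inf)           # ignore self-distances
        i, j = np.unravel_index(d.argmin(), d.shape)
        merged = np.concatenate([clusters[i], clusters[j]])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters
```

Because distances are computed between cluster summaries rather than between all $N$ points, the dominant cost scales with the number of clusters rather than the stream size, which is what drives the reported speedup.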

4. LMB for Device Memory Expansion Using CXL

In I/O device design, the LMB framework exposes CXL-attached DRAM as an extension of limited on-board device memory. The system comprises:

  • PCIe Device Endpoint: The device issues DMA/PIO accesses to memory addresses managed as if on-board DRAM.
  • Host CXL Agent and Switch Fabric: PCIe Transaction Layer Packets (TLPs) are forwarded to CXL.mem endpoints, with access managed via Port-Based Routing and SPID tables.
  • CXL Memory Expander (GFD): DRAM is pooled into 256 MB Device Media Partitions (DMPs), mapped via a Fabric Manager (FM) API to device address spaces.
  • Address and Access Management: A lightweight device agent handles address translation, while the host kernel (“lmb_kmod”) allocates and manages remote DRAM blocks and IOMMU tables.
  • Latency Model: The mean access latency is

$$T_{\text{total}} = r_{\text{local}}\, T_{\text{local}} + (1 - r_{\text{local}})\, T_{\text{remote}}$$

with $T_{\text{remote}} \approx 190\,\text{ns}$ (direct CXL) or $880$–$1190\,\text{ns}$ (PCIe-to-CXL converted), depending on the access path (Wang et al., 2024).
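Plugging illustrative numbers into this model shows the blended cost; the 0.8 local hit rate and the ~100 ns on-board DRAM latency below are assumptions, as the source quotes only the remote-path figures:

```python
def blended_latency(r_local: float, t_local_ns: float, t_remote_ns: float) -> float:
    """Mean access latency: T_total = r_local*T_local + (1 - r_local)*T_remote."""
    return r_local * t_local_ns + (1 - r_local) * t_remote_ns

# Direct CXL path:            0.8 * 100 + 0.2 * 190  = 118 ns
print(blended_latency(0.8, 100.0, 190.0))
# PCIe-to-CXL converted path: 0.8 * 100 + 0.2 * 1190 = 318 ns (worst case)
print(blended_latency(0.8, 100.0, 1190.0))
```

Even on the slower converted path, a modest local hit rate keeps the mean well below the raw remote latency, consistent with the small IOPS penalties reported below.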

5. Performance Analysis and Empirical Outcomes

For continual learning, the LMB+DAC system yields substantial improvements in class-incremental learning (Class-IL) accuracy. On CIFAR-10 (buffer = 200), LMB with DAC achieves $42.7\%$ Class-IL accuracy versus $29.7\%$ for the DER++ baseline, and maintains a $7$–$15$ percentage point advantage in larger-buffer and imbalanced scenarios. Ablation studies demonstrate that removing either the DAC update or the Sinkhorn-based OT selection yields $5$–$8$ percentage point accuracy drops and slows training by nearly $2\times$ (Dai et al., 23 May 2025).

For hardware LMB, simulated experiments on 7.68 TB SSDs reveal that:

  • LMB-CXL achieves $\approx 0$–$10\%$ IOPS loss compared to ideal DRAM on PCIe Gen4;
  • LMB-PCIe incurs up to $70\%$ IOPS loss on Gen5 for random reads;
  • Write IOPS for both LMB variants essentially match DRAM;
  • DFTL (demand-based FTL) lags by $7$–$20\times$;
  • Tail latency remains sub-microsecond for LMB but not for DFTL (Wang et al., 2024).

6. Limitations, Use Cases, and Future Directions

Limitations include sensitivity to LMB partitioning parameters ($k$, $\rho$) and a restriction to supervised settings in continual learning, as well as remote-DRAM latency ceilings in device-expansion scenarios. Scalability, while improved by DAC or DRAM pooling, may still be challenged under massive data volumes or in high-dimensional spaces; exploration of sparse OT or kernelized approximations remains open (Dai et al., 23 May 2025, Wang et al., 2024).

Primary Use Cases:

  • LMB for Continual Learning: Robust memory in task- and class-incremental regimes, especially under severe memory budgets.
  • LMB for Hardware: Scalable indirection tables for multi-TB SSDs, GPU memory expansion beyond board-limited DRAM, memory overflow relief for DPUs and accelerators.

A plausible implication is that further architectural and algorithmic advances around LMB could generalize these memory augmentation principles to broader streaming, online, and low-latency resource-constrained environments.

7. Comparative Summary

Context | Buffer Architecture | Optimization Technique | Empirical Advantage
--- | --- | --- | ---
Continual Learning | Prototype-linked sub-buffers per class | Sinkhorn OT + DAC update | +5–15 pp accuracy, $>2\times$ faster training
PCIe/CXL Devices | On-board + CXL-linked DRAM expansion | Pooled block allocation, IOMMU | Up to $20\times$ IOPS over DFTL, $<20\%$ penalty vs. ideal DRAM

*pp: percentage points

Both conceptualizations of the Linked Memory Buffer demonstrate that memory structuring—whether semantic via exemplar diversity or physical via CXL expander overlays—enables substantial performance and robustness improvements in complex, data-intensive computing systems (Dai et al., 23 May 2025, Wang et al., 2024).
