
MemoryBank Architectures

Updated 26 November 2025
  • MemoryBank is a unified construct that provides persistent, structured, and query-efficient memory for both AI models and hardware systems.
  • In LLM applications, it augments transformer models with a vector memory and dense retrieval, achieving up to 85.6% retrieval accuracy and improved contextual coherence.
  • In hardware, memory banks are independently addressable subdivisions that optimize parallel access and bandwidth, essential in DRAM and 3D-stacked architectures.

MemoryBank refers to a class of systems and architectural constructs that provide structured, persistent, and query-efficient storage and retrieval subsystems for both artificial intelligence (notably LLMs) and digital memory hardware. In the LLM domain, the MemoryBank architecture implements a persistent, externally managed vector memory used to supplement or extend the native context capacity of transformer models. In computer systems, “memory bank” is a foundational hardware abstraction: a discrete, independently addressable subdivision of main memory, DRAM, or nonvolatile memory arrays, supporting parallel access and bank-aware optimizations. The MemoryBank concept has evolved across both AI systems and digital architecture, with distinct implementations that converge on the goal of maximizing parallelism, composability, and long-horizon access within memory systems.

1. MemoryBank in LLM Architectures

MemoryBank, as formalized by Zhong et al., is a modular long-term memory substrate for LLM-based agents, introduced to address the lack of persistent and usable memory in transformer-based models over prolonged, multi-turn interactions (Zhong et al., 2023). The system combines a persistent memory vector store with dense retrieval and a biologically motivated update/forgetting heuristic.

Components

  • Memory Storage: Persistent vector store (e.g., FAISS index) of memory pieces (dialogue turns, event summaries, personality snapshots).
  • Memory Retriever: Dense similarity search over stored embeddings for top-k relevant memory retrieval.
  • Memory Updater: An Ebbinghaus-forgetting–inspired reinforcement/decay process for selective retention and pruning.

Every memory slot $m$ is represented as:

m = \{\mathrm{content},\ t_m,\ S_m\}

with a continuously updated retention score:

R_m(t) = \exp(-t_m / S_m)

Memory items are pruned when $R_m(t) < \theta$ and reinforced ($S_m \leftarrow S_m + 1$, $t_m \leftarrow 0$) upon retrieval.
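
A short sketch makes the retention rule concrete. This is a minimal illustration, not the authors' implementation; `MemorySlot`, `update_memory`, and the threshold `theta` are hypothetical names, and time units are arbitrary:

```python
import math
from dataclasses import dataclass

@dataclass
class MemorySlot:
    """One MemoryBank slot: content plus forgetting-curve state."""
    content: str
    t_m: float = 0.0   # time elapsed since last recall (arbitrary units)
    S_m: float = 1.0   # memory strength; grows with each recall

    def retention(self) -> float:
        """Ebbinghaus-style retention score R_m(t) = exp(-t_m / S_m)."""
        return math.exp(-self.t_m / self.S_m)

def update_memory(slots: list[MemorySlot], recalled: set[int],
                  dt: float, theta: float = 0.05) -> list[MemorySlot]:
    """Advance time, reinforce recalled slots, prune forgotten ones."""
    kept = []
    for i, slot in enumerate(slots):
        if i in recalled:
            slot.S_m += 1.0   # reinforcement: S_m <- S_m + 1
            slot.t_m = 0.0    # ...and t_m <- 0
        else:
            slot.t_m += dt
        if slot.retention() >= theta:   # prune when R_m(t) < theta
            kept.append(slot)
    return kept
```

Because $S_m$ grows with each recall, frequently retrieved memories decay more slowly, mirroring the spacing effect of the Ebbinghaus curve the design cites.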

Query Workflow

  1. Encode current context.
  2. Retrieve top-k relevant memory items by dense similarity search.
  3. Assemble prompt with retrieved snippets, global event/personality summaries, and current context.
  4. Pass prompt to LLM, generate response.
  5. Update memory store according to new turn and reinforcement signals.
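
A condensed sketch of this five-step loop, assuming a FAISS inner-product index as the vector store; `embed` is a stand-in for the language-specific encoders the paper mentions, and `llm` is any callable mapping a prompt to a response:

```python
import numpy as np
import faiss  # dense similarity search backend

D = 384  # embedding dimension (assumed)
index = faiss.IndexFlatIP(D)   # exact inner-product index
memories: list[str] = []       # text payloads aligned with index rows

def embed(text: str) -> np.ndarray:
    """Placeholder for a real sentence encoder (deterministic stub)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(D).astype("float32")
    return v / np.linalg.norm(v)

def chat_turn(llm, user_msg: str, summaries: str, k: int = 3) -> str:
    # 1. Encode the current context.
    q = embed(user_msg)[None, :]
    # 2. Retrieve top-k relevant memories (if any are stored).
    hits = []
    if memories:
        _, idx = index.search(q, min(k, len(memories)))
        hits = [memories[i] for i in idx[0] if i != -1]
    # 3. Assemble the prompt from memories, summaries, and context.
    prompt = "\n".join(["Relevant memories:", *hits,
                        "Summaries:", summaries,
                        "User:", user_msg])
    # 4. Generate a response.
    reply = llm(prompt)
    # 5. Write the new turn back into the memory store.
    memories.append(f"User: {user_msg} / AI: {reply}")
    index.add(embed(memories[-1])[None, :])
    return reply
```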

Plug-and-play compatibility is demonstrated with both open-source (ChatGLM, BELLE) and closed-source (ChatGPT) models, and MemoryBank supports multilingual use (English/Chinese) via language-specific embedding backends (Zhong et al., 2023).

Quantitative Performance

On simulated multi-day chat datasets, MemoryBank–augmented models (SiliconFriend) achieve retrieval accuracy up to 85.6%, substantial improvements in F1 (up to 0.716), and strong contextual coherence compared to vanilla LLM baselines.

2. Formalism, Indexing, and Retriever Integration

The large-scale MemoryBank paradigm is characterized by an external memory matrix $M \in \mathbb{R}^{n \times d}$ and content-based addressing (Mei et al., 17 Jul 2025). Each read and write follows an attention-like protocol:

  • Read: With query $Q \in \mathbb{R}^{k \times d}$,

A = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d}}\right), \qquad r = A V,

where $K$, $V$ are keys and values derived from $M$.

  • Update: A new write key $W$ and value $V$ update $M$ via

\Delta M = g(W M^T) \odot V, \qquad M \leftarrow M + \Delta M,

where $g$ is a gating nonlinearity.
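
In NumPy, the two protocols look roughly as follows. This is a sketch under two stated assumptions: keys and values are taken to be the memory slots themselves (K = V = M), and the elementwise update is read as ΔM = g(W M^T)^T V so that ΔM matches the shape of M:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def read(M: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Content-based read: A = softmax(Q K^T / sqrt(d)), r = A V,
    with K = V = M (slots serve as both keys and values here)."""
    d = M.shape[1]
    A = softmax(Q @ M.T / np.sqrt(d))   # (k, n) attention over slots
    return A @ M                        # (k, d) read vectors

def write(M: np.ndarray, W: np.ndarray, V: np.ndarray,
          g=lambda x: 1.0 / (1.0 + np.exp(-x))) -> np.ndarray:
    """Gated write: gate = g(W M^T) scores how strongly each write
    key lands on each slot; the update is distributed accordingly."""
    gate = g(W @ M.T)     # (k, n) per-slot write gates
    delta = gate.T @ V    # (n, d) update, one reading of g(W M^T) ⊙ V
    return M + delta      # M <- M + ΔM
```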

Indexing strategies include content-based attention, locality-sensitive hashing, clustering for approximate nearest neighbor search, and temporal decay scheduling for recency-sensitive access (Mei et al., 17 Jul 2025).

The retrieval-augmented generation (RAG) protocol integrates MemoryBank by:

  1. Encoding the user query to produce a read key.
  2. Performing dense attention (softmax or cosine similarity) over $M$.
  3. Concatenating top-retrieved slots to the prompt.
  4. Optionally writing summary or response fragments from the latest interaction back into $M$.
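
Combining the read step with a generator gives this loop in a few lines; `encode` and `generate` are placeholders for model-specific components, and cosine similarity stands in for the dense-attention scoring:

```python
import numpy as np

def rag_step(query: str, M: np.ndarray, texts: list[str],
             encode, generate, k: int = 3):
    """One RAG turn over a MemoryBank matrix M (n slots x d dims)."""
    q = encode(query)                                  # 1. read key
    sims = M @ q / (np.linalg.norm(M, axis=1)          # 2. dense scoring
                    * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    prompt = "\n".join([texts[i] for i in top] + [query])  # 3. concat
    answer = generate(prompt)
    new_slot = f"{query} -> {answer}"                  # 4. write back
    texts.append(new_slot)
    M = np.vstack([M, encode(new_slot)])
    return answer, M
```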

3. Taxonomy, Hierarchy, and Comparative Properties

The MemoryBank abstraction is positioned as level-2 (long-term) non-parametric memory, co-existing with parametric knowledge (within weights) and activation memory (KV cache) (Mei et al., 17 Jul 2025). Architectural variations can be classified along:

  • Persistence: Episodic, sessional, or cross-session storage.
  • Slot type: Single-turn utterances, summary fragments, high-level personality/metadata (Zhong et al., 2023).
  • Memory event scheduling: Uniform updating, selective reinforcement/decay, or LLM-driven controllers.

Comparative analysis with related systems shows MemoryBank excels in adaptive retention (Ebbinghaus decay), moderate update cost, and plug-and-play retrievability. Alternatives such as MemLLM (structured differentiable controllers), Self-Controlled Memory (LLM-managed event streams), and MemOS (OS-inspired context paging) offer distinct trade-offs in flexibility, scalability, and latency (Mei et al., 17 Jul 2025).

System        Update Scheduling        Retrieval Mechanism
MemoryBank    Ebbinghaus forgetting    Dense FAISS/similarity search
MemLLM        Differentiable control   Learned attention
REMEMBERER    LLM self-reflection      Heuristic replay

4. MemoryBank Limitations and Successors

Empirical analysis exposes three primary limitations of standard MemoryBank systems (Zhang et al., 21 Aug 2025):

  • Coarse-grained content: Predominant use of raw dialogues, loose event summaries, and generic mood assessments can induce low lexical/semantic alignment, affecting recall especially in multi-hop or factually detailed queries.
  • Low recall/answer quality: On benchmarks such as LoCoMo, MemoryBank achieves Recall@5 of 23.53% (multi-hop: 36.47%), and F1 of 17.29, significantly lower than richer multiple-fragment systems.
  • Mismatch in LLM output: Retrieved context may not match the phrasing an answer requires, leaving the model to infer or hallucinate missing details.

Contemporary approaches, such as Multiple Memory System (MMS), augment MemoryBank by splitting each dialogue act into multi-fragment units (keywords, cognitive perspectives, episodic summaries, structured semantic facts) and by structuring “retrieval” and “contextual” memory units separately, as sketched below. MMS roughly doubles Recall@5 (to 46.61%) and nearly doubles F1 and BLEU, directly addressing the limitations of coarse MemoryBank granularity (Zhang et al., 21 Aug 2025).
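
A minimal sketch of the multi-fragment split follows; the four fragment kinds mirror the paper's categories, but `Fragment`, `extract`, and the field names are illustrative stand-ins for its LLM-backed extraction step:

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    kind: str      # "keywords" | "cognitive" | "episodic" | "semantic"
    text: str      # retrieval unit: what gets embedded and searched
    context: str   # contextual unit: what gets placed in the prompt

def fragment_turn(turn: str, extract) -> list[Fragment]:
    """Split one dialogue turn into separately indexed fragments.
    `extract(turn, kind)` is a placeholder for an LLM extraction call."""
    return [
        Fragment("keywords",  extract(turn, "keywords"),    turn),
        Fragment("cognitive", extract(turn, "perspective"), turn),
        Fragment("episodic",  extract(turn, "summary"),     turn),
        Fragment("semantic",  extract(turn, "facts"),       turn),
    ]
```

Each fragment's `text` is embedded and searched independently, while the shared `context` is what gets injected into the prompt; separating the two is what lifts recall over single coarse-grained slots.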

5. Broader MemoryBank Context: Hardware and Systems

Within hardware and systems, “memory bank” fundamentally denotes a subdivided, independently addressable and serviceable array within DRAM, SRAM, PCM, or 3D-stacked memory devices. Each bank is equipped with row buffer, data queue, and command sequencer, supporting concurrent accesses and enabling architectural parallelism (Rezaei et al., 2020, Xie et al., 2021, Song et al., 2019).

Key roles and innovations:

  • Bank-level parallelism and bandwidth: Exploited by near-bank computing units such as MPU (Xie et al., 2021) and by partition-level parallelism in PCM via PALP (Song et al., 2019) to maximize bandwidth and lower latency.
  • Conflict-free algorithms: GPU memory bank conflicts are eliminated by memory layouts and transposes that ensure warp-level access patterns hit distinct banks (Sitchinava et al., 2013); a toy model follows this list.
  • Inter-bank data transfer: In 3D-stacked DRAM, bulk copy is accelerated by on-chip networks (NoM; Rezaei et al., 2020) that circumvent the bottlenecks of traditional shared-bus approaches, providing copy throughput that scales linearly with bank count.
  • Security leveraging bank-level signals: Bank contention can create covert channels for cross-VM exfiltration (e.g., Bankrupt; Ustiugov et al., 2020), highlighting security as an emergent concern in bank-aware systems.
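
To make the bank-conflict point concrete, here is a toy Python model of GPU shared-memory banking, assuming the common configuration of 32 banks of 4-byte words; the row-padding trick it demonstrates is the standard layout fix:

```python
from collections import Counter

NUM_BANKS = 32   # consecutive 4-byte words are interleaved across banks

def conflict_degree(word_addresses: list[int]) -> int:
    """Max number of threads in a warp hitting the same bank
    (1 = conflict-free, 32 = fully serialized)."""
    return max(Counter(a % NUM_BANKS for a in word_addresses).values())

# Column access of a 32x32 tile: stride 32 maps every thread to bank 0.
col = [t * 32 for t in range(32)]
print(conflict_degree(col))         # 32 -> worst-case serialization

# Padding each row to 33 words spreads the column over distinct banks.
col_padded = [t * 33 for t in range(32)]
print(conflict_degree(col_padded))  # 1 -> conflict-free
```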

Domain     Bank Role                     Impact
GPU AI     Conflict avoidance            Optimal parallel sort/merge (Sitchinava et al., 2013)
3D DRAM    Direct inter-bank transfer    O(N) scaling via NoM (Rezaei et al., 2020)
PCM        Partition-level parallelism   +51% performance via PALP (Song et al., 2019)
Security   Timing channels               50–114 Kb/s cross-node channel (Bankrupt) (Ustiugov et al., 2020)

6. Open Problems and Future Directions

Open research challenges in MemoryBank-style architectures span both LLM and hardware domains (Mei et al., 17 Jul 2025):

  • Scalability: Large $n$ improves context recall but degrades responsiveness; scalable sparse or hierarchical attention over memory banks remains unsolved.
  • Efficient forgetting: Pruning policies that balance retention of salient facts against redundancy and memory saturation remain an active area.
  • Multi-modal memory: Extending slots to accommodate embeddings of images, audio, and structured data types is an open direction.
  • Autonomous memory management: Differentiable controllers and self-supervised memory scheduling are underexplored.
  • Security and privacy: Persistent memory raises risk of data leakage and privacy violation, requiring robust access control.
  • Theory and evaluation: Formal bounds on memory size, update frequency, and task-specific utility are lacking; long-term memory benchmarks (e.g., LongMemEval, LoCoMo) are immature.

Advances are likely to combine hybrid parametric and non-parametric mechanisms, cross-agent memory sharing, multi-tiered and graph-structured memory banks, and continual learning compatibility for lifelong, context-aware AI systems.

7. Historical and Cross-Domain Perspective

MemoryBank originated as a hardware architecture concept—banked main memory for overlapping, low-latency access. In AI, the term encodes both the hardware metaphor and a non-parametric, persistent, extensible memory substrate for augmenting transformer models. Across computer systems and artificial intelligence, the common attribute remains the compositionality, parallelism, and adaptability of independently accessible memory segments, whether in the physical device fabric or in vector-embedding–backed LLM tools.

The recent emergence of attention-based, vectorized, and hierarchically managed MemoryBank systems for LLMs signals the unification of deep learning and systems design perspectives. MemoryBank and its descendants remain central to the ongoing evolution of scalable, lifelong, and context-sensitive reasoning in artificial intelligence and to the fundamental bandwidth/latency tradeoffs in large-scale digital memory subsystems.
