Mixture-of-Memories (MoM) Architecture

Updated 15 December 2025
  • Mixture-of-Memories (MoM) architecture is defined by multiple specialized memory modules that are dynamically combined to enhance recall and minimize interference.
  • It leverages routing and gating mechanisms for efficient, task-dependent activation, supporting integration with heterogeneous external and internal memory sources.
  • The design enables transparent memory editing and robust plug-in augmentation, achieving scalable, linear complexity for long-context and retrieval-intensive tasks.

Mixture-of-Memories (MoM) architecture refers to a family of neural and hybrid systems that utilize multiple, distinct memory modules or states, whose outputs are dynamically combined (mixed) according to task- or data-dependent routing, gating, or aggregation mechanisms. MoM provides expanded information capacity, reduced memory interference, transparent retrieval/editing, and flexible integration of heterogeneous external or internal memory sources. The concept has been instantiated across sequence models, retrieval-augmented language modeling, associative memory architectures, and plug-and-play zero-shot retrieval, each exhibiting distinct design principles and mathematical formalizations.

1. Core Principles and Motivations

MoM emerges from the need to address fundamental memory bottlenecks of single-memory neural architectures. Traditional architectures—transformer self-attention, linear attention, state-space models, and their variants—compress sequence history into a dense, fixed-size state, leading to memory interference, limited recall, and poor transparency in memorization (Du et al., 19 Feb 2025, Zanzotto et al., 18 Feb 2025). Inspired by biological systems, especially the ability of the hippocampus to maintain parallel, interference-minimized submemories via theta–gamma oscillatory coding, MoM adopts multiple independent or scenario-aware memory stores, each potentially specializing in distinct information types or time-scales (Du et al., 19 Feb 2025, Zhao et al., 16 Oct 2025).

In the context of retrieval-augmented generation (RAG) and dense retrieval, MoM further addresses the challenge of representing knowledge from heterogeneous or dynamically evolving external corpora. By enabling a mixture over multiple “memories”—such as document representations from different domains or sources—the architecture supports plug-and-play augmentation and robust zero-shot generalization (Ge et al., 2023, Zhao et al., 16 Oct 2025).

2. MoM Architecture in Linear Sequence Models

Linear sequence models, including linear attention and state-space architectures, often conflate all history into a single memory matrix $M \in \mathbb{R}^{d \times d}$, resulting in destructive overwriting and impaired long-range recall. MoM, as formulated by Du et al. (Du et al., 19 Feb 2025), deploys $M$ independent memory states $\{M^i\}_{i=1}^{M}$, augmented by a shared memory $M^s$. A router network projects the incoming token vector $x_t$ to a set of scores $s_t$ via $W_g$, producing a softmax distribution $\alpha_t$ over memories. Only the top-$k \ll M$ memories are activated for each token, with normalized mixture weights $g_t^{(j)}$. Each selected memory $M^{i_j}$ is updated using key/value projections, while the output at time $t$ is formed by a weighted mixture of the relevant memory states and the shared memory.
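
The routing-and-update loop admits a compact sketch. The following is a minimal, illustrative PyTorch implementation under simplifying assumptions (plain outer-product updates, no decay terms, random untrained projections); names such as `mom_step`, `num_mem`, and `top_k` are ours, not the authors' code.

```python
import torch
import torch.nn.functional as F

d, num_mem, top_k = 64, 4, 2                 # hidden size d, M memories, k active per token
W_g = torch.randn(d, num_mem) * 0.02         # router projection W_g
W_k = torch.randn(d, d) * 0.02               # key projection
W_v = torch.randn(d, d) * 0.02               # value projection
memories = [torch.zeros(d, d) for _ in range(num_mem)]   # independent states {M^i}
shared = torch.zeros(d, d)                                # shared memory M^s

def mom_step(x_t):
    """Route one token to its top-k memories, update them, and read a mixed output."""
    alpha = F.softmax(x_t @ W_g, dim=-1)         # scores s_t -> distribution alpha_t
    g, idx = torch.topk(alpha, top_k)            # keep only the top-k memories
    g = g / g.sum()                              # renormalized mixture weights g_t^(j)
    update = torch.outer(x_t @ W_k, x_t @ W_v)   # rank-1 key/value update
    shared.add_(update)                          # shared memory is always updated
    out = x_t @ shared
    for w, i in zip(g.tolist(), idx.tolist()):
        memories[i].add_(update)                 # update only the selected memories
        out = out + w * (x_t @ memories[i])      # weighted read-out mixture
    return out

print(mom_step(torch.randn(d)).shape)            # torch.Size([64])
```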

The computational process retains linear complexity in sequence length for training (each token updates at most $k+1$ memories, for $O(n d^2)$ total cost), and achieves constant per-token inference cost, $O(d^2)$, independent of $n$ (Du et al., 19 Feb 2025). This yields substantial improvements on recall-intensive tasks (e.g., SQuAD, TriviaQA, DROP) compared to single-memory linear architectures, and narrows the gap with full transformers, particularly for long-context applications.
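
A back-of-the-envelope accounting makes the stated costs concrete (the constants are illustrative; only the asymptotics follow from the description above):

```latex
\begin{aligned}
\text{per token}:&\quad
  \underbrace{(k+1)\,O(d^2)}_{\text{rank-1 updates}}
  + \underbrace{(k+1)\,O(d^2)}_{\text{memory reads}}
  \;=\; O(d^2) \quad\text{for fixed } k \ll M,\\
\text{whole sequence (training)}:&\quad n \cdot O(d^2) \;=\; O(n d^2),\\
\text{per-token inference}:&\quad O(d^2),\ \text{independent of } n.
\end{aligned}
```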

3. Plug-in Mixture-of-Memory in Dense Retrieval and RAG

In retrieval-augmented settings, MoM enables the seamless combination of multiple heterogeneous external memories (e.g., Wikipedia, news, reviews), facilitating zero-shot generalization and plug-in extensibility (Ge et al., 2023, Zhao et al., 16 Oct 2025). In the MoMA framework (Ge et al., 2023), a T5-based dual-encoder provides query and document embeddings, and for each external corpus $\mathcal{C}_m$, an Augmentation Retrieval Module (ARM) retrieves the top-$K$ candidates. A gating network $h_\psi$ computes mixture weights $\alpha(q)$ over the $M$ memories for each query $q$. The augmented query embedding $\widetilde{q}$ is assembled as the sum of the base query and a weighted average over all retrieved memory neighbors.
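
A schematic version of the gated query augmentation can be written as follows; the brute-force dot-product retrieval, softmax gating, and the function name `augment_query` are illustrative assumptions standing in for MoMA's ANN indexes and learned gating network.

```python
import numpy as np

def augment_query(q_emb, memory_corpora, gate_logits, top_k=5):
    """q_emb: (d,) query embedding; memory_corpora: list of (N_m, d) arrays of
    document embeddings, one per external memory; gate_logits: (M,) scores h_psi(q)."""
    alpha = np.exp(gate_logits - gate_logits.max())
    alpha /= alpha.sum()                                  # mixture weights alpha(q)
    augmented = q_emb.copy()
    for weight, corpus in zip(alpha, memory_corpora):
        sims = corpus @ q_emb                             # ARM: dot-product retrieval
        neighbors = corpus[np.argsort(-sims)[:top_k]]     # top-K candidates
        augmented += weight * neighbors.mean(axis=0)      # weighted average of neighbors
    return augmented

rng = np.random.default_rng(0)
d = 128
q = rng.normal(size=d)
corpora = [rng.normal(size=(1000, d)) for _ in range(3)]    # e.g. Wikipedia / news / reviews
print(augment_query(q, corpora, rng.normal(size=3)).shape)  # (128,)
```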

During training, MoMA uses a joint objective: pseudo-positives are built by union over memory top-1 retrievals, and hard negatives are pooled from the remaining top neighbors. The model is trained using an InfoNCE-style contrastive loss, regularizing both the base encoder and the gating network. Notably, plug-in augmentation at inference—adding a new corpus and corresponding ARM/gating logit—requires no model retraining. Experimental results on BEIR benchmarks confirm that MoMA outperforms both vanilla dense passage retrieval (DPR) and single-memory retrievers, with relative nDCG@10 gains up to 8.3% on TREC-COVID, validating the necessity of selective multi-memory mixture for robust generalization (Ge et al., 2023).
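
The training objective can be sketched as a standard InfoNCE loss over one pseudo-positive and the pooled hard negatives; the temperature value and tensor shapes below are assumptions for illustration, not MoMA's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, positive, negatives, tau=0.05):
    """q: (d,) augmented query; positive: (d,) pseudo-positive document embedding;
    negatives: (N, d) hard negatives pooled from the remaining top neighbors."""
    logits = torch.cat([(q @ positive).unsqueeze(0), negatives @ q]) / tau
    target = torch.zeros(1, dtype=torch.long)          # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

d = 128
q, pos, negs = torch.randn(d), torch.randn(d), torch.randn(16, d)
print(info_nce_loss(q, pos, negs).item())
```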

4. Associative Mixture-of-Memories and Transparent Memorization

In associative memory LLMs, Mixture-of-Memories is instantiated as a hierarchy or parallel stack of explicitly editable, key–value associative memory modules, each operating at different n-gram or chunk granularities (Zanzotto et al., 18 Feb 2025). The MeMo framework deploys $L$ parallel Correlation Matrix Memories (CMMs), with each memory $M^{(l)}$ storing key–value pairs constructed from sliding windows of input embeddings. Each layer's output is read using learned gates, generating a mixture $r_t = \sum_{l=1}^{L} g_l\, r_t^{(l)}$ based on context and retrieved contents.
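
A stripped-down CMM stack with a gated mixture read is sketched below; averaging window embeddings into a single key and using uniform gates are simplifying assumptions, not MeMo's exact construction.

```python
import numpy as np

class CMM:
    """Correlation Matrix Memory: superimposed outer products of key-value pairs."""
    def __init__(self, d):
        self.M = np.zeros((d, d))
    def store(self, k, v):
        self.M += np.outer(k, v)         # add the association
    def read(self, k):
        return k @ self.M                # associative retrieval

d, L = 64, 3
layers = [CMM(d) for _ in range(L)]      # one CMM per n-gram granularity (widths 1..L)
rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, d))        # toy input embeddings

# Store: layer l keys on a sliding window of width l+1 (averaged here for simplicity).
for l, mem in enumerate(layers):
    for t in range(l, len(tokens)):
        mem.store(tokens[t - l:t + 1].mean(axis=0), tokens[t])

# Read at position t: mixture r_t = sum_l g_l * r_t^(l) (uniform gates for illustration).
t, gates = 6, np.full(L, 1.0 / L)
r_t = sum(g * mem.read(tokens[t - l:t + 1].mean(axis=0))
          for l, (g, mem) in enumerate(zip(gates, layers)))
print(r_t.shape)                         # (64,)
```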

A key property of these architectures is explicit support for memory editing and forgetting: removing any $(k^*, v^*)$ pair from $M^{(l)}$ is achieved by subtracting its outer product from the CMM matrix, supporting post-hoc model editing and a degree of transparency unavailable in standard transformer memory. Orthogonality regularization and memory decay losses further control capacity and retention properties, while scalability is maintained via factorized key/value storage and attention-based retrieval. Capacity analysis indicates each CMM can store $O(d^2)$ distinct key–value pairs, with layered MoM extending the effective sequence-length coverage (Zanzotto et al., 18 Feb 2025).
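
The editing property follows directly from the linearity of the CMM: storing adds an outer product, and forgetting subtracts it. A self-contained toy check (variable names are ours):

```python
import numpy as np

d = 64
rng = np.random.default_rng(1)
M = np.zeros((d, d))                         # a single CMM matrix M^(l)
k_star, v_star = rng.normal(size=d), rng.normal(size=d)

M += np.outer(k_star, v_star)                # store the (k*, v*) association
recall_before = (k_star @ M) @ v_star        # strong recall while stored
M -= np.outer(k_star, v_star)                # post-hoc edit: subtract the outer product
recall_after = (k_star @ M) @ v_star         # association removed (exactly 0 here)
print(recall_before, recall_after)
```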

5. Scenario-Aware Document Memory Mixture in RAG

The MoM framework for scenario-aware document memories (Zhao et al., 16 Oct 2025) redefines RAG preprocessing as an active, expert-like memory extraction process. Rather than relying on static chunking, a guiding LLM generates a logical outline $O$ for each document, segments the text into atomic chunks $A$, and distills concise core content $C$. Multiple candidates are generated via diverse decoding paths; selection uses composite metrics—chunk clarity and extraction completeness—fused via Reciprocal Rank Fusion. The optimal candidate is paired with a "chain-of-memory" (CoM) path by prompting the LLM to reconstruct its reasoning process, providing a path for supervising compact SLMs.
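
Candidate selection via Reciprocal Rank Fusion can be illustrated in a few lines; the RRF constant k = 60 is the common default and an assumption here, as are the candidate names.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of candidate orderings (best first); returns the fused ordering."""
    scores = {}
    for ranking in rankings:
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

clarity_rank      = ["cand_A", "cand_B", "cand_C"]   # ordered by chunk clarity
completeness_rank = ["cand_A", "cand_C", "cand_B"]   # ordered by extraction completeness
print(reciprocal_rank_fusion([clarity_rank, completeness_rank]))  # cand_A fused to the top
```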

Unique to this MoM is a three-layer document retrieval scheme at serving time: outlines ($O$), core content ($C$), and atomic chunks ($A$) are indexed separately, and their retrievals are fused probabilistically. This approach is shown, via probabilistic modeling and tail-bound theorems, to minimize information loss relative to single-vector fusion, and it yields empirical gains for long-context question answering and semantic search. Results indicate that the clarity of atomic chunks correlates strongly with final answer quality, and information-support experiments confirm that multi-layer memories provide richer grounding contexts (Zhao et al., 16 Oct 2025).
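
The serving-time fusion can be sketched as a probabilistic mixture over per-layer retrieval distributions; the softmax scoring, the layer weights, and the brute-force search are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def three_layer_retrieve(q, indexes, layer_weights, top_k=3):
    """indexes: dict layer_name -> (doc_ids, (N, d) embeddings); returns fused doc scores."""
    fused = {}
    for (layer, (ids, embs)), w in zip(indexes.items(), layer_weights):
        probs = softmax(embs @ q)                             # P(item | q, layer)
        for doc_id, p in zip(ids, probs):
            fused[doc_id] = fused.get(doc_id, 0.0) + w * p    # mixture over layers
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

rng = np.random.default_rng(0)
d = 32
q = rng.normal(size=d)
docs = [f"doc{i}" for i in range(5)]
indexes = {
    "outline": (docs, rng.normal(size=(5, d))),             # one outline O per document
    "core":    (docs, rng.normal(size=(5, d))),             # one core content C per document
    "atomic":  ([doc for doc in docs for _ in range(2)],    # two atomic chunks A per document
                rng.normal(size=(10, d))),
}
print(three_layer_retrieve(q, indexes, layer_weights=[0.3, 0.3, 0.4]))
```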

6. Comparative Analysis and Empirical Highlights

Mixture-of-Memories architectures exhibit advantages over single-memory and standard transformer models:

  • Capacity Expansion: Parallel or hierarchical memories support storage/retrieval of more distinct patterns and longer/hierarchical dependencies.
  • Reduced Interference: Routing input tokens to orthogonal or sparsely-activated memories mitigates overwriting, paralleling theoretical mechanisms in the brain and improving recall (Du et al., 19 Feb 2025).
  • Plug-in and Dynamic Memory Adaptation: External memories or corpora can be added at inference with minimal computational or retraining burden (Ge et al., 2023).
  • Transparency and Editability: Associative MoM enables post-hoc removal or modification of memorized content, supporting forensic and compliance use-cases (Zanzotto et al., 18 Feb 2025).
  • RAG and Semantic Retrieval: Scenario-aware MoM supports expressively structured, expert-informed memories and achieves superior information retrieval in zero-shot and domain-adaptive settings (Zhao et al., 16 Oct 2025).

Performance on benchmarks consistently shows strong or state-of-the-art results compared to both parameter-matched baselines and larger models, across question answering, language modeling, and document retrieval tasks (Ge et al., 2023, Du et al., 19 Feb 2025, Zhao et al., 16 Oct 2025).

7. Limitations and Future Directions

MoM architectures introduce additional system complexity, notably in the design and tuning of routing/gating networks, management of per-memory parameters, and coherence of output mixtures. Hyperparameter sensitivity (number of memories $M$, activation sparsity $k$) requires careful tuning. Furthermore, the efficiency gains of linear sequence MoM can be offset when applied to tasks requiring dense, global attention.

A plausible implication is that future research will focus on dynamic memory creation/destruction, adaptive granularity, and learned hierarchical memory routing. Integrating MoM with neurosymbolic systems, continual learning, and privacy/editability requirements is a promising direction, especially as transparent and modular memory becomes increasingly relevant in ethical AI scenarios.

