Memory Mosaics v2: Neural Memory Architectures
- Memory Mosaics v2 are advanced neural architectures that orchestrate associative memory networks with adaptive retrieval and compositional in-context learning.
- They implement innovative design features such as adaptive kernel bandwidth, gated recurrent key extraction, and a hierarchical three-level memory system to efficiently manage both local and global dependencies.
- They match transformers on persistent-knowledge tasks while achieving more robust new-knowledge storage and over 10% higher accuracy on in-context learning tasks at the scale of LLMs and real-world datasets.
Memory Mosaics v2 are large-scale neural architectures built around the orchestration of associative memory networks with compositional and in-context learning capabilities, distinguished by architectural advances in memory representation and retrieval. Originating from the integration of associative memory concepts and kernel-based similarity mechanisms, Memory Mosaics v2 achieve performance competitive with transformer architectures on persistent-knowledge tasks and surpass them on new-knowledge storage and in-context adaptation tasks, particularly at the scale of LLMs and real-world datasets (2507.03285).
1. Architectural Principles and Innovations
Memory Mosaics v2 reformulate the associative memory paradigm through three key architectural modifications:
- Adaptive Bandwidth for Kernel Smoothing:
Unlike the original fixed-bandwidth Gaussian smoothing kernel, Memory Mosaics v2 use a bandwidth $\beta(n)$ that adapts as a function of the memory size $n$, governed by three learnable scalar parameters. This adjustment shifts the bias–variance balance of the conditional-expectation retrieval as the memory grows, keeping target-similarity estimation well calibrated without manual retuning (2507.03285); a schematic code sketch of this mechanism, together with the gated key extraction below, follows this list.
- Gated, Time-Variant Key Extraction:
The key extraction process is upgraded from a time-invariant leaky average to a weighted, data-driven, recurrent mechanism: at each time step $t$, the key $k_t$ is updated from $k_{t-1}$ and the current input through a gated recurrence whose mixing weights are themselves computed from the data rather than fixed in advance.
This structure imparts contextual sensitivity, allowing key similarity to reflect both semantic and sequential relations (2507.03285).
- Hierarchical Three-Level Memory System:
- Persistent memory: Stores global, long-lived knowledge via a two-layer feed-forward network (SwiGLU activation).
- Short-term memory: Retains very recent contextual information (windowed).
- Long-term memory: Encodes tokens from further back in the sequence, skipping over the near-past and partially overlapping with short-term memory.
- This explicit management of memory granularity replaces the role of learned positional encodings, supporting both local and distant dependency modeling (2507.03285).
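The first two modifications can be illustrated with a minimal sketch of a single memory unit. The parameterization below (the specific bandwidth schedule, the sigmoid gate, and all module and tensor names) is an assumption made for illustration, not the exact formulation of (2507.03285).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryUnit(nn.Module):
    """Illustrative associative-memory unit: gated recurrent key extraction plus
    kernel-smoothing retrieval with a memory-size-dependent bandwidth. The exact
    parameterizations used in Memory Mosaics v2 may differ."""

    def __init__(self, d_model: int, d_key: int, d_val: int):
        super().__init__()
        self.key_proj = nn.Linear(d_model, d_key)    # candidate key features
        self.gate_proj = nn.Linear(d_model, d_key)   # data-driven gate (assumed form)
        self.val_proj = nn.Linear(d_model, d_val)    # value features
        # Three learnable scalars governing the bandwidth schedule (assumed form).
        self.a = nn.Parameter(torch.tensor(1.0))
        self.b = nn.Parameter(torch.tensor(0.5))
        self.c = nn.Parameter(torch.tensor(0.0))

    def bandwidth(self, n: int) -> torch.Tensor:
        # Assumed schedule: the bandwidth grows with the number of stored pairs n,
        # shifting the bias-variance trade-off of the kernel estimate as memory fills.
        return F.softplus(self.a) * torch.log1p(torch.tensor(float(n))) ** self.b + self.c

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d_model), a single sequence for clarity.
        T = x.shape[0]
        gates = torch.sigmoid(self.gate_proj(x))     # (T, d_key), data-dependent gates
        cand = self.key_proj(x)                      # candidate keys
        vals = self.val_proj(x)                      # values

        # Gated recurrent key extraction: generalizes a fixed leaky average
        # k_t = lambda * k_{t-1} + (1 - lambda) * c_t to a data-dependent gate.
        keys, k_prev = [], torch.zeros_like(cand[0])
        for t in range(T):
            k_prev = gates[t] * k_prev + (1.0 - gates[t]) * cand[t]
            keys.append(k_prev)
        keys = torch.stack(keys)                     # (T, d_key)

        # Kernel-smoothing retrieval over the causal memory of past (key, value) pairs.
        out = torch.zeros(T, vals.shape[-1])         # position 0 has an empty memory
        for t in range(1, T):
            beta = self.bandwidth(t)                 # bandwidth adapts to memory size t
            d2 = ((keys[t] - keys[:t]) ** 2).sum(-1) # squared distances to stored keys
            w = torch.softmax(-beta * d2, dim=0)     # Gaussian-kernel weights
            out[t] = w @ vals[:t]                    # conditional-expectation estimate
        return out
```

A full Memory Mosaics v2 model stacks many such units across layers and heads, combines their outputs with mixing layers, and restricts which past positions each unit may read, which is how the short-term, long-term, and persistent levels of the hierarchy are realized.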
2. Scaling and Training Dynamics
Scaling Memory Mosaics v2 demonstrates that their principal properties persist even at the scale of modern LLMs:
- Model and Data Scale:
Models are instantiated at up to 10B effective parameters, with 32 layers, a hidden dimension of 4096, and 32 heads, matching or exceeding architectures such as Llama-8B. These models are trained on a data mix exceeding one trillion tokens and support contexts up to 32,768 tokens (2507.03285); this scale is collected in the configuration sketch after this list.
- Training Strategies:
Training is structured to manage the computational cost of the associative memory mechanisms through adaptive bandwidth, recurrent gating, and careful memory partitioning. Hyperparameters are carefully matched to transformer baselines to enable direct performance comparison (2507.03285).
- Addressed Challenges:
Key training challenges include limiting associative memory overhead in deep and wide models, ensuring the efficiency of hierarchical memory access for long contexts, and balancing memory usage with retrieval fidelity.
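For concreteness, the reported scale can be collected in a single configuration object; the field names below are illustrative placeholders, not keys from any released codebase.

```python
from dataclasses import dataclass


@dataclass
class MosaicV2ScaleConfig:
    # Scale reported for the largest Memory Mosaics v2 models (2507.03285).
    # Field names are illustrative; they are not the authors' configuration keys.
    n_layers: int = 32
    hidden_dim: int = 4096
    n_heads: int = 32
    max_context_tokens: int = 32_768
    effective_params_billions: float = 10.0   # "up to 10B effective parameters"
    training_tokens_trillions: float = 1.0    # data mix exceeding one trillion tokens


config = MosaicV2ScaleConfig()
print(config)
```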
3. Core Mechanisms of Associative Memory
The core operation within each memory unit is a form of differentiable associative retrieval, most often implemented as a learned, kernel-based conditional expectation. The general pattern is

$$\hat v_t \;=\; \frac{\sum_{i} e^{-\beta \lVert k_t - k_i\rVert^2}\, v_i}{\sum_{i} e^{-\beta \lVert k_t - k_i\rVert^2}},$$

or, for constant-norm keys,

$$\hat v_t \;=\; \sum_{i} \operatorname{softmax}_i\!\big(2\beta\, k_t^{\top} k_i\big)\, v_i,$$

an inner-product (softmax) variant directly analogous to transformer attention (2405.06394).
Memory units are equipped with separate feature-extractor networks for keys and for values. In deep networks, these associative reads are composed temporally (via stacking) and architecturally (via mixing and recombination layers).
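The inner-product reduction can be checked numerically: for keys of constant norm, the Gaussian-kernel weights coincide with a softmax over scaled dot products, because the squared distance differs from the inner product only by a constant that the softmax ignores. The snippet below is a self-contained verification of that identity, not code from either paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
beta, n, d = 0.7, 6, 8

# Stored key/value pairs and a query key; keys are normalized to unit norm,
# so ||q - k_i||^2 = 2 - 2 * (q . k_i) for every stored key (constant-norm case).
keys = F.normalize(torch.randn(n, d), dim=-1)
vals = torch.randn(n, 4)
q = F.normalize(torch.randn(d), dim=-1)

# Gaussian kernel smoothing (Nadaraya-Watson conditional-expectation estimate).
d2 = ((q - keys) ** 2).sum(-1)
w_gauss = torch.softmax(-beta * d2, dim=0)
out_gauss = w_gauss @ vals

# Softmax over scaled inner products (attention-like form).
w_dot = torch.softmax(2 * beta * (keys @ q), dim=0)
out_dot = w_dot @ vals

print(torch.allclose(out_gauss, out_dot, atol=1e-6))  # expected: True
```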
4. Evaluation and Comparative Performance
Memory Mosaics v2 are evaluated along three principal model capabilities:
- Persistent-Knowledge Storage:
Models match the performance of transformers on tasks assessing knowledge acquired during training (e.g., 19 language benchmarks including OBQA, ARC, and SQuAD). Both architectures demonstrate similar abilities to recall training data when prompted (2507.03285).
- New-Knowledge Storage and Retrieval:
In multi-document question-answering and retrieval scenarios where unseen information must be processed and stored at inference time, Memory Mosaics v2 significantly outperform transformers. This advantage persists even when transformers are trained on up to eight times more data (2507.03285).
- In-Context Learning:
On few-shot classification and meta-learning tasks, especially those with semantically anonymous labels designed to suppress transfer from training-time memorization, Memory Mosaics v2 achieve over 10% higher accuracy than transformers and retain or improve performance as context grows, whereas transformers can exhibit degradation (2507.03285).
The table below organizes the evaluation dimensions and representative findings:
| Evaluation Dimension | Transformer Performance | Memory Mosaics v2 Performance |
|---|---|---|
| Persistent-knowledge storage | Comparable | Comparable |
| New-knowledge storage | Lower; not improved by data scale | Significantly higher; robust to sample size |
| In-context learning | Prone to degradation | Retains/gains accuracy with more context |
5. Comparison with Other Mosaic Architectures
Earlier "Mosaics" and projection-based mosaics focus primarily on hardware-efficient implementation and inference resilience:
- Temporal Reuse in Neuromorphic Inference:
In neuromorphic implementations, Mosaics leverage weight and synaptic reuse across time to trade hardware for computational time, inherently stabilizing noisy crossbar networks and providing attractor-basin noise immunity. Such architectures demonstrate substantially reduced energy costs, 13–38× less per inference than CNNs, along with competitive accuracy under combined noise perturbations (2003.10396).
- Composite Projection Pruning for LLMs:
The Mosaic system introduces a fine-grained pruning paradigm combining unstructured and structured pruning at the projection level. By leveraging the Projection Outlier Distribution, Mosaic models achieve up to 84.2% lower perplexity, 31.4% higher accuracy, and up to 68% lower GPU memory use relative to coarse-grained baselines. These advances facilitate deployment on edge and resource-constrained hardware with minimal accuracy compromise (2504.06323).
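As a rough, self-contained illustration of projection-level (structured) pruning, not the actual Mosaic algorithm, the sketch below drops low-importance input columns of a single projection matrix; the outlier-aware score is a stand-in for the Projection Outlier Distribution, and all names, scores, and ratios are assumptions.

```python
import torch


def prune_projection(W: torch.Tensor, keep_ratio: float = 0.5):
    """Structured pruning of a projection matrix W (out_dim, in_dim): drop the
    input columns with the smallest scores. The outlier-aware score below is a
    stand-in, not Mosaic's Projection Outlier Distribution."""
    col_mean = W.abs().mean(dim=0)                 # average magnitude per input column
    col_max = W.abs().amax(dim=0)                  # largest (outlier) entry per column
    score = col_mean + col_max                     # crude outlier-aware importance
    k = max(1, int(keep_ratio * W.shape[1]))
    keep = torch.topk(score, k).indices.sort().values
    return W[:, keep], keep                        # smaller dense matrix + kept indices


W = torch.randn(4096, 4096)
W_pruned, kept_cols = prune_projection(W, keep_ratio=0.25)
print(W.shape, "->", W_pruned.shape)               # (4096, 4096) -> (4096, 1024)
```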
6. Significance and Future Prospects
The findings associated with Memory Mosaics v2 have substantive implications for LLM design, computational efficiency, and learning theory:
- Architectural Progress Over Data Scaling:
Evidence suggests that architectural enhancements (adaptive memory mechanisms, recurrent and hierarchical memory representations) can be more impactful for new-task adaptation and contextual generalization than scaling data or model size alone (2507.03285).
- Transparent and Compositional Generalization:
The explicit disentanglement of sub-tasks across multiple memory units ("predictive disentanglement") provides insight into the learned representations and fosters compositional behavior beyond what can be directly traced in transformer attention maps (2405.06394).
- Prospective Research Directions:
- Further scaling of memory mosaic architectures.
- Exploration of dynamic routing, multi-level memory hierarchies, and more advanced nonlinear feature extraction for memory units.
- Efficient memory retrieval for long-sequence processing, potentially by integrating methods such as fuzzy hashing.
- Application to domains beyond language, including multimodal and reinforcement learning contexts (2405.06394, 2507.03285).
These advances collectively position Memory Mosaics v2 as a promising foundation for the next generation of LLMs, balancing interpretability, adaptability, and computational efficiency.