- The paper introduces Memory Mosaics v2, replacing transformer attention with adaptive associative memory to enhance long-context model performance.
- It demonstrates significant gains in new-knowledge storage and in-context learning, outperforming transformers by 12–15 percentage points on key tasks.
- The study validates architectural innovations, including adaptive bandwidth scheduling and a three-level memory hierarchy, as a path to scalable and efficient large language models.
Memory Mosaics at Scale: An Expert Overview
The paper "Memory Mosaics at scale" (2507.03285) presents a comprehensive paper on scaling associative memory-based architectures—specifically, Memory Mosaics—to LLM regimes. The work introduces Memory Mosaics v2, a model that replaces transformer attention with associative memory modules, and demonstrates its efficacy at the scale of 10 billion parameters and one trillion training tokens. The authors provide a rigorous empirical comparison with transformer baselines, focusing on three evaluation axes: persistent-knowledge storage, new-knowledge storage, and in-context learning.
Architectural Innovations
Memory Mosaics v2 incorporates several architectural modifications over its predecessor:
- Adaptive Bandwidth in Associative Memory: The kernel bandwidth parameter β of the Gaussian kernel regression is scheduled dynamically as a function of the number of stored key-value pairs, following asymptotic results from kernel density estimation theory. This lets the model balance the bias-variance trade-off as the memory grows during training and inference (see the sketch after this list).
- Gated Time-Variant Key Feature Extractor: Inspired by advances in recurrent and state-space models, the key extractor uses input-dependent gating and time-variant averaging, so that semantically similar contexts are encoded into more invariant keys.
- Three-Level Memory Hierarchy: The architecture explicitly separates persistent memory (dense feedforward layers), long-term associative memory, and short-term associative memory. This design is motivated by empirical findings that attention patterns in transformers are highly position-dependent, whereas associative memory-based attention in Memory Mosaics is more position-invariant for distant tokens.
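As a concrete illustration of the first two bullets, the following is a minimal sketch of an associative-memory read based on Gaussian kernel regression, with a bandwidth scheduled from the memory size and a toy gated key extractor. The scheduling exponent, gating form, and tensor shapes are illustrative assumptions, not the paper's exact parameterization.

```python
# Minimal sketch (assumptions, not the paper's exact design): a Gaussian-kernel
# associative-memory read with a bandwidth scheduled from the memory size, plus
# a toy gated, time-variant key extractor.
import torch
import torch.nn.functional as F


def adaptive_beta(n_items: int, dim: int, beta0: float = 1.0) -> float:
    # Kernel density estimation asymptotics suggest shrinking the kernel width
    # as the sample count grows: h ~ n^(-1/(d+4)), hence beta = 1/h^2 ~ n^(2/(d+4)).
    return beta0 * n_items ** (2.0 / (dim + 4))


def memory_read(queries: torch.Tensor, keys: torch.Tensor,
                values: torch.Tensor, beta: float) -> torch.Tensor:
    # With unit-norm keys and queries, exp(-beta/2 * ||q - k||^2) is proportional
    # to exp(beta * <q, k>), so the kernel average reduces to a scaled softmax.
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    weights = torch.softmax(beta * q @ k.transpose(-2, -1), dim=-1)
    return weights @ values


def gated_keys(x: torch.Tensor, w_gate: torch.Tensor, w_feat: torch.Tensor) -> torch.Tensor:
    # Input-dependent, time-variant leaky average: each step blends the running
    # key with fresh features, so similar contexts yield similar keys.
    gates, feats = torch.sigmoid(x @ w_gate), x @ w_feat          # (T, d) each
    keys, running = torch.empty_like(feats), torch.zeros(feats.shape[-1])
    for t in range(feats.shape[0]):
        running = gates[t] * running + (1.0 - gates[t]) * feats[t]
        keys[t] = running
    return keys


# Toy usage: keys and values accumulate as the context is consumed, and beta is
# re-scheduled from the current memory size before each read.
T, d = 512, 64
x = torch.randn(T, d)
keys = gated_keys(x, torch.randn(d, d) / d**0.5, torch.randn(d, d) / d**0.5)
values = torch.randn(T, d)
queries = torch.randn(8, d)
out = memory_read(queries, keys, values, beta=adaptive_beta(T, d))
print(out.shape)  # torch.Size([8, 64])
```

In the full architecture, such reads coexist with the persistent feedforward layers and are split into short-term and long-term associative memories, as described in the third bullet above.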
Training and Evaluation Protocol
Memory Mosaics v2 models were trained at two scales (1.5B and 8B parameters) on up to one trillion tokens, with context lengths up to 32k. Baseline transformers with comparable parameter counts, trained under the same regime, were used for direct comparison. The evaluation protocol is notable for its explicit separation of:
- Persistent-Knowledge Storage: Standard language understanding benchmarks (e.g., ARC, MMLU, SQuAD) are used to assess the ability to store and retrieve knowledge from the training set.
- New-Knowledge Storage: Multi-document question answering tasks (from the RULER benchmark) are used to test the model's ability to store and retrieve information presented only at inference time.
- In-Context Learning: Classic multiclass classification tasks (Banking77, TACRED, GoEmotions) are used in a few-shot setup, with both semantic and anonymous label variants, to directly measure the model's ability to learn new tasks from demonstrations (a prompt-construction sketch follows this list).
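To make the anonymous-label variant concrete, here is a hedged sketch of how such a few-shot prompt might be constructed; the template, label symbols, and example texts are assumptions for illustration, not the paper's actual prompts.

```python
# Sketch (illustrative template, not the paper's): build a few-shot classification
# prompt where semantic label names are optionally replaced by arbitrary symbols,
# so the model must infer the label mapping from the demonstrations alone.
def build_prompt(demos, query, anonymize=False):
    """demos: list of (text, label) pairs; query: text to classify."""
    labels = sorted({label for _, label in demos})
    # Anonymous variant: map each label to a meaningless symbol such as "0", "1", ...
    mapping = {lab: (str(i) if anonymize else lab) for i, lab in enumerate(labels)}
    lines = [f"Input: {text}\nLabel: {mapping[label]}" for text, label in demos]
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines), mapping


demos = [
    ("I lost my card yesterday.", "card_lost"),
    ("How do I top up my account?", "top_up"),
]
prompt, mapping = build_prompt(demos, "My card has gone missing.", anonymize=True)
print(prompt)
```

Because the symbolic labels carry no semantic prior, accuracy in this setting isolates genuine in-context learning of the task from retrieval of knowledge memorized during pretraining.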
Empirical Findings
Persistent-Knowledge Storage
On 19 standard language benchmarks, Memory Mosaics v2 matches transformer performance, indicating that the persistent-memory component is functionally comparable to transformer feedforward layers for storing training knowledge. Ablation studies (removing long-term memory) show that most of these benchmarks do not require long-term memory, supporting the separation of evaluation axes.
New-Knowledge Storage
On multi-document QA tasks, Memory Mosaics v2 significantly outperforms transformers, especially as context length increases. For example, at 32k context length, Memory Mosaics v2 exceeds transformer accuracy by 12–15 percentage points. Notably, Memory Mosaics v2 can extrapolate to longer contexts without fine-tuning, whereas transformers with RoPE position encoding fail to generalize beyond their training context length.
In-Context Learning
In few-shot classification, Memory Mosaics v2 shows robust improvements as the number of demonstration shots increases, while transformer performance plateaus or degrades. The margin is especially pronounced in the anonymous-label setting, where Memory Mosaics v2 outperforms transformers by more than 10%. Increasing transformer training data by 8x (from 1T to 8T tokens) narrows the gap on some semantic-label tasks but does not close it, particularly for tasks requiring genuine new-task learning.
Data Scaling and Efficiency
A key empirical claim is that simply increasing transformer training data does not replicate the new-task learning capabilities of Memory Mosaics v2. Even with 8x more data, transformers lag behind in new-knowledge storage and in-context learning, especially in settings that minimize reliance on prior semantic knowledge. This result challenges the prevailing paradigm of scaling data and compute as the primary path to improved generalization.
Implementation Considerations
- Computational Overhead: Memory Mosaics v2 introduces a modest increase in parameter count and FLOPs per token (approximately 10–15%) due to the explicit memory hierarchy. However, the design allows for post-training pruning of long-term memory for tasks that do not require it, reducing inference cost.
- Memory Management: The associative memory modules require efficient storage and retrieval of key-value pairs, particularly for long contexts. The authors suggest future work on approximate nearest neighbor search (e.g., fuzzy hashing) and hierarchical memory to scale context length further (a retrieval sketch follows this list).
- Prompt Robustness: The evaluation protocol includes prompt ablations to control for prompt sensitivity, a known issue in transformer-based in-context learning.
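As a hypothetical illustration of the retrieval direction mentioned above (the paper does not prescribe a specific index), the snippet below approximates the full kernel read by restricting the softmax to the top-k most similar stored keys; a real system would replace the exact top-k scan with an approximate index such as locality-sensitive or fuzzy hashing.

```python
# Hypothetical sketch: approximate the full kernel read by restricting the
# softmax to the top-k most similar stored keys (exact top-k here; an actual
# deployment would use an approximate nearest-neighbor index).
import torch
import torch.nn.functional as F


def topk_memory_read(query: torch.Tensor, keys: torch.Tensor,
                     values: torch.Tensor, beta: float, k: int = 32) -> torch.Tensor:
    # Kernel-regression read over only the k nearest stored keys.
    scores = beta * (keys @ query)                        # (n,) scaled similarities
    top_scores, idx = torch.topk(scores, k=min(k, scores.numel()))
    weights = torch.softmax(top_scores, dim=-1)           # softmax over retrieved set
    return weights @ values[idx]                          # weighted average of top-k values


dim, n_stored = 64, 10_000
keys = F.normalize(torch.randn(n_stored, dim), dim=-1)
values = torch.randn(n_stored, dim)
query = F.normalize(torch.randn(dim), dim=0)
out = topk_memory_read(query, keys, values, beta=8.0)
print(out.shape)  # torch.Size([64])
```

The point of swapping in an approximate index is to make per-query retrieval cost sublinear in the number of stored pairs, which is what would allow context length to keep growing without a proportional increase in inference cost.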
Theoretical and Practical Implications
The results provide strong evidence that associative memory architectures, when properly scaled and engineered, can match transformers on standard benchmarks while substantially outperforming them on tasks requiring rapid adaptation to new information. The explicit separation of memory types offers a more interpretable and modular approach to sequence modeling, with potential benefits for continual learning, retrieval-augmented generation, and long-context reasoning.
The findings also suggest that the current transformer paradigm may be fundamentally limited in its ability to perform new-task learning via in-context adaptation, and that architectural innovation—rather than brute-force scaling—may be necessary for further progress.
Future Directions
- Efficient Memory Scaling: Research into scalable, approximate memory retrieval mechanisms will be critical for deploying Memory Mosaics at even larger scales and longer contexts.
- Hierarchical and Modular Memory: Integrating hierarchical memory structures and exploring task-specific memory allocation could further enhance adaptability and efficiency.
- Broader Task Coverage: Extending evaluation to more diverse and open-ended tasks, including program synthesis and multi-modal reasoning, will test the generality of the approach.
- Integration with Retrieval-Augmented Models: Combining Memory Mosaics with external retrieval systems may yield further gains in knowledge-intensive tasks.
Conclusion
"Memory Mosaics at scale" provides a compelling demonstration that associative memory-based architectures can be scaled to LLM regimes and deliver superior new-task learning and in-context adaptation compared to transformers, even when controlling for data and compute. The work highlights the importance of architectural transparency and modularity, and sets a new direction for research in large-scale sequence modeling beyond the transformer paradigm.