- The paper introduces Memory Mosaics v2, which replaces transformer attention with advanced associative memory modules featuring adaptive kernel bandwidth and time-variant key extraction.
- It demonstrates significant gains in long-context reasoning and new-task adaptation, outperforming transformers that were trained on substantially more data.
- The model employs a three-level memory hierarchy to efficiently allocate information, ensuring robust performance without fine-tuning across varying context lengths.
Memory Mosaics at Scale: An Expert Overview
The paper "Memory Mosaics at scale" (2507.03285) presents a comprehensive paper on scaling associative memory-based architectures—specifically, Memory Mosaics—to LLM regimes, demonstrating their efficacy on real-world datasets and tasks. The work introduces Memory Mosaics v2, a refined architecture that incorporates adaptive kernel bandwidth, a gated time-variant key extractor, and a three-level memory hierarchy. The authors systematically evaluate the model across persistent knowledge storage, new knowledge storage, and in-context learning, providing strong empirical evidence for its advantages over standard transformer architectures.
Architectural Innovations
Memory Mosaics v2 replaces the attention mechanism in transformers with associative memory modules, leveraging key-value storage and retrieval via kernel regression. The key architectural modifications are:
- Adaptive Bandwidth in Kernel Smoothing: The bandwidth parameter β in the Gaussian kernel is dynamically scheduled as a function of the number of stored key-value pairs, optimizing the bias-variance trade-off for memory-based retrieval.
- Gated Time-Variant Key Feature Extractor: Inspired by advances in recurrent and state-space models, the key extractor employs input-dependent gating and time-variant averaging, enhancing the semantic consistency of key representations across varying input patterns (a code sketch of this extractor and the adaptive-bandwidth read follows this list).
- Three-Level Memory Hierarchy: The model explicitly separates persistent memory (dense feedforward layers), long-term associative memory, and short-term associative memory. This design enables efficient allocation of information according to its temporal relevance and invariance, facilitating both knowledge retention and rapid adaptation.
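To make the mechanism concrete, the sketch below implements a single associative-memory read: a gated, time-variant key extractor followed by Gaussian-kernel regression over the stored key-value pairs, with the bandwidth β scheduled by the number of stored pairs. The gating form, the `adaptive_bandwidth` schedule, and all names here are illustrative assumptions rather than the paper's exact parameterization.

```python
import torch

def gated_keys(x, w_k, w_g):
    """Time-variant key extraction: an input-dependent gated leaky average
    over projected inputs (an illustrative stand-in for the paper's extractor)."""
    k_proj = x @ w_k               # (T, d_k) per-step key candidates
    gate = torch.sigmoid(x @ w_g)  # (T, d_k) input-dependent retention gate
    keys = torch.zeros_like(k_proj)
    state = torch.zeros(k_proj.shape[1])
    for t in range(x.shape[0]):
        # the gate decides how much past key information each step retains
        state = gate[t] * state + (1 - gate[t]) * k_proj[t]
        keys[t] = state
    return keys

def adaptive_bandwidth(n_pairs, beta0=1.0, alpha=0.5):
    """Illustrative schedule (an assumption, not the paper's formula):
    sharpen the kernel as more key-value pairs are stored."""
    return beta0 * (1.0 + n_pairs) ** alpha

def memory_read(query, keys, values):
    """Kernel-regression retrieval: a softmax over negative squared distances,
    i.e. normalized Gaussian-kernel weights over the stored values."""
    beta = adaptive_bandwidth(keys.shape[0])
    d2 = ((query[None, :] - keys) ** 2).sum(-1)  # squared distance to each stored key
    w = torch.softmax(-beta * d2, dim=0)         # kernel weights, sum to 1
    return w @ values                            # smoothed value estimate

# Toy usage: extract keys/values from a short sequence, then query the memory.
torch.manual_seed(0)
T, d, d_k = 16, 32, 16
x = torch.randn(T, d)
w_k, w_g, w_v = torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_k)
keys, values = gated_keys(x, w_k, w_g), x @ w_v
print(memory_read(keys[-1], keys[:-1], values[:-1]).shape)  # torch.Size([16])
```

A sharper kernel (larger β) behaves like nearest-neighbour lookup, while a flatter one averages over many stored values; scheduling β with memory size is what lets the architecture trade memorization against smoothing as described above.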
Empirical Evaluation
The evaluation framework is structured around three axes:
- Persistent-Knowledge Storage and Retrieval: On 19 standard language benchmarks, Memory Mosaics v2 matches transformer performance, confirming that the persistent memory component is competitive with transformer feedforward layers for storing training knowledge.
- New-Knowledge Storage and Retrieval: On multi-document question-answering tasks (e.g., the RULER benchmark), Memory Mosaics v2 significantly outperforms transformers, especially as context length increases. Notably, even when transformers are trained on 8x more data, they do not match the performance of Memory Mosaics v2 trained on 1T tokens.
- In-Context Learning: On classic multiclass classification tasks (Banking77, TACRED, GoEmotions), Memory Mosaics v2 demonstrates robust in-context learning, with accuracy improving as more demonstration shots are provided. In contrast, transformers often plateau or degrade with additional context. The margin exceeds 10% in several settings, and the advantage persists even when label semantics are anonymized (illustrated in the sketch after this list).
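Here, "anonymized label semantics" means the class names in the demonstrations are replaced with arbitrary symbols, so the model must infer the input-to-label mapping from the shots themselves rather than from informative label strings. A minimal sketch of such a prompt, with made-up examples and a placeholder `label_i` scheme rather than the paper's exact protocol:

```python
# Minimal sketch of a few-shot prompt with anonymized labels; the examples and
# the "label_i" renaming are placeholders, not the paper's evaluation code.
demos = [
    ("I want to close my account.", "account_closure"),
    ("My card was charged twice.", "duplicate_charge"),
]
anon = {label: f"label_{i}" for i, label in enumerate(sorted({l for _, l in demos}))}

prompt = "".join(f"Input: {text}\nLabel: {anon[label]}\n\n" for text, label in demos)
prompt += "Input: Why was I billed two times for one purchase?\nLabel:"
print(prompt)
```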
Numerical Results and Claims
- Context Extrapolation: Memory Mosaics v2, trained on 4k context, extrapolates to 32k and 64k without fine-tuning, maintaining strong performance. Transformers with RoPE fail to generalize in this regime.
- Data Efficiency: To match Memory Mosaics v2's new-knowledge and in-context learning performance, transformers require at least 8x more training data, and still underperform on tasks that require rapid adaptation to new label semantics.
- Computation and Parameter Efficiency: While Memory Mosaics v2 uses slightly more parameters and FLOPs than transformers (due to explicit memory modules), the trade-off is justified by the substantial gains in new-task learning and context generalization.
Practical Implications
The findings have several practical ramifications:
- Enhanced Adaptability: Memory Mosaics v2 is well-suited for applications requiring rapid adaptation to new tasks or domains, such as personalized assistants, dynamic retrieval-augmented generation, and continual learning systems.
- Long-Context Reasoning: The architecture's ability to store and retrieve information over long contexts without degradation is advantageous for document-level QA, legal and scientific analysis, and multi-turn dialogue.
- Resource Allocation: The explicit memory hierarchy allows practitioners to tune memory allocation (persistent, long-term, short-term) according to application requirements, potentially reducing inference costs by pruning unused memory modules post-training.
Theoretical and Methodological Implications
- Associative Memory as a Foundation: The work demonstrates that associative memory, when properly scaled and parameterized, can serve as a viable alternative to attention for large-scale sequence modeling, offering greater transparency and compositionality.
- Bias-Variance Control in Memory Retrieval: The adaptive bandwidth mechanism provides a principled approach to managing the trade-off between memorization and generalization in memory-based models (spelled out in the formula after this list).
- Prompt Robustness: The model's in-context learning is less sensitive to prompt engineering, suggesting a more stable mechanism for task adaptation compared to transformers.
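To spell out the bias-variance point above: with stored pairs (k_i, v_i) and query q, a kernel-regression read is a Gaussian-weighted average of the stored values, and the bandwidth β selects the operating point between the two limiting regimes. The notation below is ours, kept consistent with the β of the adaptive-bandwidth mechanism, not copied from the paper.

```latex
% Kernel-regression (Nadaraya-Watson) read over stored pairs (k_i, v_i)
\hat{v}(q) \;=\; \sum_{i=1}^{n}
  \frac{\exp\!\bigl(-\beta \,\lVert q - k_i \rVert^2\bigr)}
       {\sum_{j=1}^{n} \exp\!\bigl(-\beta \,\lVert q - k_j \rVert^2\bigr)} \; v_i
% beta -> infinity : nearest-neighbour recall (low bias, high variance; memorization)
% beta -> 0        : uniform average over stored values (high bias, low variance; smoothing)
```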
Limitations and Future Directions
- Computational Overhead: The increased parameter count and FLOPs may pose challenges for deployment in resource-constrained environments. Further work on memory compression (e.g., fuzzy hashing, hierarchical memory) is warranted.
- Scaling to Frontier Models: While the paper demonstrates results up to 10B parameters, scaling to 100B+ regimes will require advances in distributed memory management and efficient retrieval algorithms.
- Task Diversity: The evaluation focuses on classification and QA; broader assessment on generative tasks, reasoning, and multi-modal inputs would further validate the approach.
Speculation on Future Developments
- Hybrid Architectures: Integration of associative memory modules with transformer attention or state-space models could yield architectures that combine the strengths of each paradigm.
- Efficient Memory Indexing: Techniques from approximate nearest neighbor search and learned indexing may further reduce the computational cost of large-scale associative memory.
- Continual and Lifelong Learning: The explicit separation of memory types positions Memory Mosaics as a promising foundation for systems that must learn and adapt over time without catastrophic forgetting.
Conclusion
Memory Mosaics v2 establishes associative memory networks as a scalable, data-efficient, and robust alternative to transformers for language modeling. The architecture's superior performance on new-knowledge and in-context learning tasks, combined with its principled memory design, opens new avenues for research and deployment in adaptive AI systems. The work challenges the prevailing "more data, more compute" paradigm, advocating for architectural innovation as a path to more general and efficient intelligence.