Mosaic Memory (MosaicMem) in Deep Learning

Updated 2 July 2026
  • Mosaic Memory is a compositional memory architecture that replaces traditional attention with explicit associative memory blocks for scalable storage and retrieval.
  • It employs adaptive kernel bandwidth and a three-tier memory system (STM, LTM, PM) to optimize performance and maintain interpretability.
  • Empirical results show superior in-context learning and extrapolation, with efficient handling of long sequences compared to standard transformers.

Mosaic Memory (MosaicMem) encompasses a set of architectures, systems, and phenomena in deep learning and computational hardware that use “mosaic”-style memory mechanisms (compositional arrays or networks of explicit memory blocks) for storage, retrieval, or generalization. The term spans neural architectures that replace conventional attention with associative or content-addressable memory, GPU memory management techniques that coalesce small pages into large ones, and memory representations for consistent generation in high-dimensional or sequence models. Implementations vary across domains, but they share explicit partitioning, compositional retrieval, and adaptive association as alternatives to flat or implicit memory models.

1. Associative Memory Networks and the Mosaic Memory (MosaicMem) Architecture

Mosaic Memory, also called “Memory Mosaics” or “MosaicMem,” refers in machine learning primarily to neural architectures in which the conventional attention blocks of a transformer are replaced by explicit associative memory units designed as networks of key–value pairs (Zhang et al., 4 Jul 2025, Zhang et al., 2024).

Instead of implicit KV-caches or undifferentiated attention, each memory block directly stores a growing set of pairs $\{(k_i, v_i)\}$, where the $k_i$ are content-encoded keys and the $v_i$ are values derived from recent or future context. Query retrieval is realized through a kernel-based smoother, or soft lookup, operating over all stored keys, typically implemented as Gaussian-kernel regression:

$$f(k; M) = \sum_{i=1}^{n} \frac{\exp(\beta\, k^\top k_i)}{\sum_{j=1}^{n} \exp(\beta\, k^\top k_j)}\, v_i$$
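The link between the "Gaussian-kernel" description and the dot-product form of the retrieval formula can be made explicit. The short derivation below is a sketch under one assumption that is not stated in the text: all stored keys share the same norm, $\|k_i\| = c$.

```latex
\begin{aligned}
\exp\!\left(-\tfrac{\beta}{2}\,\|k - k_i\|^2\right)
  &= \exp\!\left(-\tfrac{\beta}{2}\,\|k\|^2\right)\,
     \exp\!\left(-\tfrac{\beta}{2}\,\|k_i\|^2\right)\,
     \exp\!\left(\beta\, k^\top k_i\right) \\
  &= C(k)\,\exp\!\left(\beta\, k^\top k_i\right),
  \qquad
  C(k) = \exp\!\left(-\tfrac{\beta}{2}\left(\|k\|^2 + c^2\right)\right).
\end{aligned}
```

Because $C(k)$ does not depend on the index $i$, it cancels between numerator and denominator, so under this assumption the normalized Gaussian-kernel weights coincide with the weights $\exp(\beta\, k^\top k_i) / \sum_j \exp(\beta\, k^\top k_j)$.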

This construction makes retrieval invariant to the order in which pairs are stored, exposes a clear bias–variance tradeoff through the bandwidth parameter $\beta$, and mechanistically disentangles storage from sequential position, in contrast to the entanglement that arises with standard transformer positional encoding.
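The retrieval rule and two of the properties just listed can be illustrated with a minimal NumPy sketch. The function name `retrieve` and the toy keys and values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

def retrieve(query, keys, values, beta):
    """Kernel-smoothed lookup: softmax(beta * keys @ query) applied to values."""
    scores = beta * (keys @ query)      # one similarity score per stored key
    scores = scores - scores.max()      # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()   # weights are non-negative and sum to 1
    return weights @ values, weights

# Toy memory (assumed for illustration): four orthonormal keys, 3-d values.
keys = np.eye(4)
values = np.arange(12, dtype=float).reshape(4, 3)
query = keys[2]

# Large beta: sharp retrieval, the output is close to the matching value.
sharp, _ = retrieve(query, keys, values, beta=50.0)

# beta = 0: maximal smoothing, the output is the mean of all stored values.
smooth, _ = retrieve(query, keys, values, beta=0.0)

# Permutation invariance: shuffling the stored pairs leaves the output unchanged.
perm = np.array([3, 0, 2, 1])
shuffled, _ = retrieve(query, keys[perm], values[perm], beta=50.0)

print(sharp, smooth, shuffled)
```

The two extremes of `beta` show the bias–variance tradeoff in its simplest form: a small bandwidth parameter averages over many stored values, while a large one commits to the nearest key.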

Key architectural enhancements in “Memory Mosaics v2” include:

  • Adaptive kernel bandwidth $\beta(n) = \beta_1 n^\alpha + \beta_0$ that adapts retrieval sharpness to memory size.
  • Gated, time-variant key extraction, using RNN-style gating to make recency semantics content-adaptive.
  • Three-level per-layer memory: distinct Short-Term (STM), Long-Term (LTM), and Persistent Memory (PM) modules that differentiate token-local, sequence-distant, and global persistent knowledge.
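The effect of the adaptive bandwidth in the first item can be sketched numerically. The example assumes an idealized memory in which one key matches the query (score 1) and the remaining n - 1 keys are unrelated (score 0); the parameter values `BETA0`, `BETA1`, and `ALPHA` are hypothetical and are not taken from the papers.

```python
import numpy as np

def adaptive_beta(n, beta0, beta1, alpha):
    """Bandwidth schedule beta(n) = beta1 * n**alpha + beta0."""
    return beta1 * n ** alpha + beta0

def match_weight(n, beta):
    """Softmax weight on one matching key (score 1) among n - 1 distractors (score 0)."""
    return np.exp(beta) / (np.exp(beta) + (n - 1))

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, ALPHA = 2.0, 1.0, 0.3

for n in (10, 100, 1000, 10000):
    fixed = match_weight(n, BETA0)
    adaptive = match_weight(n, adaptive_beta(n, BETA0, BETA1, ALPHA))
    print(f"n={n:>6}  fixed beta: {fixed:.3f}  adaptive beta: {adaptive:.3f}")
```

With a fixed bandwidth, the weight on the matching key shrinks as the memory fills with distractors; letting the bandwidth grow with n counteracts that dilution in this toy setting.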

Memory growth is linear in sequence length, and inference-phase writes are strictly Hebbian: each new pair $(k_T, v_T)$ is appended on observation, while all parameter optimization is handled by backpropagation during training.
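The append-on-observation behavior can be sketched as a small class. `AppendOnlyMemory` and its methods are hypothetical names; the sketch assumes keys and values are already computed by some trained extractor, which is outside its scope.

```python
import numpy as np

class AppendOnlyMemory:
    """Toy associative memory: writes append a pair, reads are kernel-smoothed."""

    def __init__(self, beta):
        self.beta = beta
        self.keys = []
        self.values = []

    def write(self, key, value):
        # Inference-time write: store the pair as-is, with no gradient step.
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(np.asarray(value, dtype=float))

    def read(self, query):
        if not self.keys:
            raise ValueError("cannot read from an empty memory")
        keys = np.stack(self.keys)
        values = np.stack(self.values)
        scores = self.beta * (keys @ np.asarray(query, dtype=float))
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()
        return weights @ values

    def __len__(self):
        return len(self.keys)

memory = AppendOnlyMemory(beta=50.0)
sequence = [([1, 0, 0], [10.0]), ([0, 1, 0], [20.0]), ([0, 0, 1], [30.0])]

sizes = []
for key, value in sequence:
    memory.write(key, value)
    sizes.append(len(memory))   # memory size tracks the number of observed tokens

recalled = memory.read([0, 1, 0])
print(sizes, recalled)
```

The list of sizes makes the linear growth visible: one stored pair per observed position, with nothing evicted or compressed.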

2. Scaling Strategies and Empirical Performance

Memory Mosaics were originally demonstrated on medium-scale datasets and architectures (e.g., GPT-2 scale, $\sim$1B params) (Zhang et al., 2024), but have since been scaled to LLaMA-8B/10B models ($\sim$9.9B params, 32 layers, context length up to 32k, and 1T tokens of training data) (Zhang et al., 4 Jul 2025).

In direct apples-to-apples comparisons against transformer models of identical size and regimen, v2 variants show:

  • Persistent-knowledge (training data) retrieval parity: On 19 standard tasks (ARC, OBQA, MMLU, BoolQ, etc.), average performance is $\approx$52.2%, with no loss when LTM is ablated, confirming that STM+PM suffice for storing and retrieving training knowledge.
  • New-Knowledge storage and extrapolation: On benchmarks requiring recall or reasoning over 32k-context multi-document QA, Memory Mosaics v2 achieve 53.4% after fine-tuning at long context, while transformers plateau at 41.1% even when trained on 8× more data (8T tokens) with advanced attention mechanisms.
  • Superior in-context (few-shot) learning: On few-shot tasks (Banking77, Tacred, GoEmotions), v2 models yield a roughly 10 percentage-point accuracy advantage over transformers as the number of shots increases, a gap unclosed even when transformer training data is octupled.

In all settings, the advantages of v2 architectures on new-knowledge and in-context tasks cannot be replicated by increasing transformer training data.

3. Analytical and Theoretical Distinctions

Several empirical and theoretical observations highlight the impact of Mosaic Memory:

  • Capacity decoupling: Explicit associative memory breaks the link between storage capacity and network depth/learned position bias, allowing uniform and direct access to arbitrarily distant context.
  • Adaptive retrieval: The learned, sequence-size-dependent bandwidth $\beta(n)$ adapts retrieval sharpness to the size of the memory set; a fixed attention temperature cannot provide a similar adaptive tradeoff.
  • Semantic recency and feature separation: Gated, content-dependent key extraction mitigates token-position entanglement, yielding semantically stable representations regardless of local token spacing.
  • Segmented context modeling: The STM/LTM/PM hierarchy ensures that early context is not drowned out, and that cross-sequence knowledge is managed separately from within-sequence context.
  • Transparency and interpretability: Each memory output is an explicit conditional expectation, making internal behavior interpretable at the slot and head level—contrasting with the nontransparent mixing of standard transformer heads and feed-forward blocks.
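The interpretability point in the last item can be illustrated by reading off the retrieval weights directly. In the sketch below the weights form a probability distribution over slots and the output is the corresponding expectation of the stored values. The helper names `slot_weights` and `explain` and the toy data are hypothetical.

```python
import numpy as np

def slot_weights(query, keys, beta):
    """Probability assigned to each memory slot for a given query."""
    scores = beta * (keys @ query)
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

def explain(query, keys, values, beta, top=2):
    """Return the memory output and the slots that contributed most to it."""
    weights = slot_weights(query, keys, beta)
    order = np.argsort(-weights)[:top]          # slots sorted by decreasing weight
    output = weights @ values                   # expectation of values under weights
    return output, [(int(i), float(weights[i])) for i in order]

# Toy memory (assumed for illustration): four unit-norm 2-d keys, scalar values.
keys = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[1.0], [2.0], [3.0], [4.0]])
query = np.array([1.0, 0.0])

output, top_slots = explain(query, keys, values, beta=5.0)
for slot, weight in top_slots:
    print(f"slot {slot}: weight {weight:.3f}")
print("output:", output)
```

Because the output is a convex combination of stored values, each prediction can be attributed to specific slots, which is the sense in which the memory is inspectable.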

4. Limitations and Open Directions

Despite its gains in generalization and context scaling, Memory Mosaics v2 has several limitations:

  • Parameter and computational cost: Three memory modules per layer increase parameter count by 1–2% and compute by 10–15% per token.
  • Memory growth and forgetting: Associative memory grows linearly with sequence length, with no compression or explicit forgetting; handling contexts beyond 100k tokens will require approximate or hierarchical memory representations.
  • Hyperparameter inheritance: Current hyperparameters are inherited from transformer tuning; v2-specific optimization may yield further gains.
  • Requirements for frontier scaling: Extending to 70B+ models or multimodal generative tasks will necessitate new software/hardware solutions and aggressive memory compression (e.g., sparse or hierarchical retrieval).
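One generic form of the sparse retrieval mentioned above is to aggregate only the k highest-scoring slots. The sketch below is an assumption about what such an approximation could look like, not a method described in the cited papers; `topk_retrieve`, the memory size, and the dimensions are hypothetical.

```python
import numpy as np

def full_retrieve(query, keys, values, beta):
    """Exact kernel-smoothed retrieval over every stored pair."""
    scores = beta * (keys @ query)
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ values

def topk_retrieve(query, keys, values, beta, k):
    """Approximate retrieval: renormalize over the k highest-scoring slots only."""
    scores = beta * (keys @ query)
    idx = np.argpartition(scores, -k)[-k:]      # indices of the k largest scores
    top = scores[idx]
    weights = np.exp(top - top.max())
    return (weights / weights.sum()) @ values[idx]

rng = np.random.default_rng(0)
n, key_dim, value_dim = 256, 16, 8              # hypothetical sizes
keys = rng.normal(size=(n, key_dim))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(n, value_dim))
query = keys[0]

exact = full_retrieve(query, keys, values, beta=20.0)
for k in (8, 32, n):
    approx = topk_retrieve(query, keys, values, beta=20.0, k=k)
    print(f"k={k:>3}  max abs error: {np.abs(approx - exact).max():.2e}")
```

This version still scores every key, so it reduces only the aggregation cost; avoiding the full scan would additionally need an index or a hierarchical structure over the keys.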

The evaluation in v2 focuses on classification and QA-type tasks; generative performance on story completion, code synthesis, or dialogue beyond long context remains to be characterized.

5. Mosaic Memory in Broader Context: Interpretive, Hardware, and Retrieval Perspectives

The “mosaic memory” motif also appears in other domains:

  • LLM Data Memorization Phenomenon: “Mosaic memory” describes the finding that LLMs memorize not by rote repetition, but by assembling fragments from partially overlapping “fuzzy duplicates” within the training corpus. Multiple sequences that each differ from a reference sequence in some tokens (under a fixed tokenizer) can collectively induce memorization almost as strong as exact duplicates (Shilov et al., 2024). Empirically, 10 fuzzy variants produced by token replacement yield a ROC-AUC of roughly 0.87, nearly matching exact duplication, and such clusters evade sequence deduplication strategies. This process is syntactic, not semantic, with privacy and copyright implications.
  • GPU Memory Management (Mosaic on GPUs): “Mosaic” in GPU memory management (Ausavarungnirun et al., 2018) refers to a system that breaks the tradeoff between large-page TLB efficiency and fine-grained demand paging. By allocating small base pages (4 KB) and coalescing them in place, through the page table, into large pages (2 MB), Mosaic achieves near-ideal address translation with minimal stall, demonstrating a hardware–software partitioning reminiscent of associative memory partitioning in neural models.
  • Diffusion Models and Video Worlds: “MosaicMem” designates hybrid explicit–implicit memory systems for world-consistent video generation in diffusion models (Yu et al., 17 Mar 2026), where persistent 2D image patches are lifted into a 3D scene mosaic, then recomposed per view using projective coordinate adjustments and patch-based retrieval. The core unifying mechanism is compositional (mosaic) memory slotting with direct, content-driven retrieval for joint stability and dynamic fidelity across sequence rollouts.
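The claim in the first item that fuzzy duplicates evade sequence deduplication can be illustrated with a toy example. The sketch builds variants of a reference sequence by random token replacement and shows that hash-based exact deduplication keeps them all. The vocabulary, sequence length, number of replacements, and function names are hypothetical and are not the experimental setup of Shilov et al.

```python
import hashlib
import random

def make_fuzzy_duplicates(tokens, n_variants, n_replacements, vocab, seed=0):
    """Create variants of `tokens`, each with `n_replacements` tokens swapped out."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        variant = list(tokens)
        for pos in rng.sample(range(len(tokens)), n_replacements):
            # Always pick a token different from the original at this position.
            variant[pos] = rng.choice([t for t in vocab if t != tokens[pos]])
        variants.append(variant)
    return variants

def exact_dedup(sequences):
    """Keep the first occurrence of each sequence, compared by content hash."""
    seen, kept = set(), []
    for seq in sequences:
        digest = hashlib.sha256(" ".join(seq).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(seq)
    return kept

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

vocab = [f"tok{i}" for i in range(50)]          # hypothetical vocabulary
reference = vocab[:20]                          # hypothetical 20-token reference
variants = make_fuzzy_duplicates(reference, n_variants=10, n_replacements=2, vocab=vocab)

print("exact copies kept:", len(exact_dedup([reference] * 10)))
print("fuzzy variants kept:", len(exact_dedup([reference] + variants)))
print("tokens changed per variant:", [hamming(reference, v) for v in variants])
```

Exact copies collapse to a single sequence, whereas each fuzzy variant shares 18 of its 20 tokens with the reference and still survives deduplication.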

6. Significance and Impact

Mosaic Memory approaches establish a general paradigm of compositional, explicit associative storage that yields:

  • Enhanced extrapolation and generalization: Decoupling storage from network structure allows scalable in-context and out-of-distribution reasoning.
  • Interpretability and transparency: The explicitly stored and retrieved key–value/slot histories allow for diagnosis and control over memory behavior.
  • Implications for deduplication and privacy: In LLMs, the mosaic memory phenomenon means that naive exact-match deduplication is insufficient for privacy protection.
  • Generalization to hardware and generative models: External memory partitioning and compositional retrieval unify the architectural philosophy of MosaicMem in neural systems and systems-level memory management.

The core mosaic principle—partition, explicit storage, content-addressable and adaptive retrieval—transcends individual models or platforms and establishes a rigorous template for memory architectures in modern computational systems.

References

  • Zhang et al., “Memory Mosaics at scale” (4 Jul 2025)
  • Shilov et al., “The Mosaic Memory of LLMs” (2024)
  • Ausavarungnirun et al., “Mosaic: An Application-Transparent Hardware-Software Cooperative Memory Manager for GPUs” (2018)
  • Yu et al., “MosaicMem: Hybrid Spatial Memory for Controllable Video World Models” (17 Mar 2026)
  • Meng et al., “ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation” (10 Jun 2026)
  • Zhang et al., “Memory Mosaics” (2024)
