Modular Generative Models with Decoupled Memory
- The paper introduces GMem as a modular generative modeling approach that separates memory operations from core generation, enhancing scalability and efficiency.
- It leverages diverse memory addressing methods, including stochastic and deterministic techniques, to support continual, few-shot, and diffusion-based learning.
- Empirical results highlight improvements such as 97.19% accuracy on Split-MNIST and state-of-the-art FID scores in image generation with significantly reduced training time.
Modular Generative Models with Decoupled Memory (GMem) designate a family of approaches in generative modeling where memory is implemented as a distinct, external module, and memory operations (read, write, address) are decoupled from the core generative computation. This modularization isolates sample-efficient semantic retention, supports scalability, and enables a functional separation between memorization, retrieval, and generation. GMem methods underpin advances in lifelong learning with explicit task-wise generative memory, ultra-efficient diffusion generation, and variational memory-augmented models, drawing a unifying thread through MyGO (Ji et al., 29 Aug 2025), variational memory addressing (Bornschein et al., 2017), generative temporal memory (Gemici et al., 2017), and scalable diffusion frameworks (Tang et al., 11 Dec 2024).
1. Architectural Foundations
At the heart of GMem architectures is the explicit bifurcation of network capacity: the generative core (e.g., a transformer, VAE decoder, or classification backbone) is disentangled from a memory system, which is typically implemented as an addressable buffer or bank of vectorized “snippets,” generators, or templates. The key properties are:
- Memory Representation: Memory may be a learned bank of feature vectors (Tang et al., 11 Dec 2024), a set of trained generative models (conditional GANs per task) (Ji et al., 29 Aug 2025), or a dynamic matrix of temporally structured latents (Gemici et al., 2017).
- Decoupling: The memory module is read/written by externalized controllers. For example, in diffusion, memory is immutable and conditions the denoising network via projections (Tang et al., 11 Dec 2024). In continual learning, each task’s conditional GAN acts as an independent generator, indexed as needed (Ji et al., 29 Aug 2025).
- Memory Addressing: Both stochastic (categorical) and deterministic (nearest-neighbor, softmax attention) addressing mechanisms are supported. Examples include variational discrete addressing (Bornschein et al., 2017) and hard-max cosine similarity (Tang et al., 11 Dec 2024).
- Integration with Generation: Memory readouts or pseudo-data condition the generative model, either by direct injection (e.g., additive or FiLM style) or by generating inputs for knowledge distillation.
This explicit modularity yields parameter efficiency, improved scalability, and enables flexible memory growth without increasing the generative model’s size.
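The decoupling can be made concrete with a minimal PyTorch sketch. The names (`SnippetBank`, `GenerativeCore`), the dimensions, and the additive fusion are illustrative assumptions rather than the exact architectures of any cited paper; the essential point is that the memory is a read-only buffer addressed by an external controller, and the core network only ever sees a projected readout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SnippetBank(nn.Module):
    """External memory: a fixed bank of unit-norm feature snippets."""
    def __init__(self, num_snippets: int, dim: int):
        super().__init__()
        bank = F.normalize(torch.randn(num_snippets, dim), dim=-1)
        # Registered as a buffer: the bank holds no trainable parameters.
        self.register_buffer("bank", bank)

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """Deterministic hard-max read by cosine similarity."""
        sims = F.normalize(query, dim=-1) @ self.bank.T   # (B, N) similarity scores
        return self.bank[sims.argmax(dim=-1)]             # (B, dim) selected snippets

class GenerativeCore(nn.Module):
    """Core generator; memory enters only through a learned projection."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, dim)              # externalized read controller
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor, snippet: torch.Tensor) -> torch.Tensor:
        # Additive injection of the projected readout (FiLM-style gating
        # would be an equally valid fusion choice).
        return self.backbone(x + self.proj(snippet))

# Memory can grow (more snippets) without touching the core's parameter count.
memory = SnippetBank(num_snippets=4096, dim=256)
core = GenerativeCore(dim=128, cond_dim=256)
out = core(torch.randn(8, 128), memory.read(torch.randn(8, 256)))
```

Because the bank is a buffer rather than a parameter, enlarging it changes storage cost but not the number of trainable weights in the core.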
2. Memory Mechanisms and Addressing
Memory addressing in GMem frameworks supports content retrieval, stochastic sampling, and, in some cases, hierarchical memory composition.
- Addressing by Similarity: Nearest-neighbor search in unit-norm vector memory banks (Tang et al., 11 Dec 2024), softmax over dot products or cosine similarities for content-based attention (Gemici et al., 2017), or learned embedding projections (Bornschein et al., 2017). In practice:
- Stochastic Addressing: Interpreted as sampling a memory index, as in variational memory addressing (Bornschein et al., 2017), with VIMCO used to estimate gradients through the discrete latent.
- Deterministic Addressing: Hard-max for speed and scaling in diffusion (Tang et al., 11 Dec 2024), with optional snippet dropout for implicit regularization.
- Hierarchical and Dynamic Memories: Multi-level addressing enables hierarchical templates (Bornschein et al., 2017), while dynamic updates permit streaming and continuous consolidation (Gemici et al., 2017).
The memory read operation directly selects or composes memory vectors, which are then projected and fused with generative network activations.
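The addressing variants above can be summarized in one small function. This is a hedged sketch in PyTorch assuming a flat bank that can be unit-normalized; the stochastic branch omits the VIMCO-style gradient estimator a variational model would need for the discrete address.

```python
import torch
import torch.nn.functional as F

def address_memory(query: torch.Tensor, bank: torch.Tensor,
                   mode: str = "hard", temperature: float = 1.0) -> torch.Tensor:
    """Retrieve one memory vector per query from a snippet bank.

    mode="hard":  deterministic nearest neighbor (hard-max cosine similarity)
    mode="soft":  softmax attention over the bank (differentiable readout)
    mode="stoch": sample an address from the induced categorical distribution
    """
    scores = F.normalize(query, dim=-1) @ F.normalize(bank, dim=-1).T  # (B, N)
    if mode == "hard":
        return bank[scores.argmax(dim=-1)]
    if mode == "soft":
        return torch.softmax(scores / temperature, dim=-1) @ bank
    if mode == "stoch":
        addr = torch.distributions.Categorical(logits=scores / temperature).sample()
        return bank[addr]
    raise ValueError(mode)

bank = torch.randn(1024, 64)
readout = address_memory(torch.randn(16, 64), bank, mode="soft")   # (16, 64)
```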
3. Learning Objectives and Optimization
Each GMem instantiation adapts the generative model’s training to reflect memory modularity:
- Generative Memory (MyGO, (Ji et al., 29 Aug 2025)): Each task $t$'s generator $G_t$ is trained as a conditional GAN on that task's data $\mathcal{D}_t$:
  $$\min_{G_t}\max_{D_t}\;\mathbb{E}_{(x,y)\sim\mathcal{D}_t}\big[\log D_t(x,y)\big]+\mathbb{E}_{z\sim p(z),\,y\sim\mathcal{D}_t}\big[\log\big(1-D_t(G_t(z,y),y)\big)\big].$$
  Only the generator $G_t$ is preserved after training on each task; the discriminator $D_t$ is discarded.
- Offline Distillation (MyGO): In the “sleep” phase, pseudo-samples from the stored G-mem modules are used for logit-MSE distillation into a fixed-size core network $f_\theta$:
  $$\mathcal{L}_{\text{sleep}}=\mathbb{E}_{(\hat{x},\,\hat{\ell})}\big[\big\|f_\theta(\hat{x})-\hat{\ell}\big\|_2^2\big],$$
  where $\hat{x}$ is a pseudo-sample drawn from the stored generators and $\hat{\ell}$ its associated target logits.
- Variational Addressing (Bornschein et al., 2017): Variational lower bound with latent memory address $a$ and latent code $z$:
  $$\log p(x \mid M)\;\ge\;\mathbb{E}_{q(a,z \mid M,x)}\!\left[\log\frac{p(a \mid M)\,p(z \mid m_a)\,p(x \mid z, m_a)}{q(a,z \mid M,x)}\right],$$
  where $m_a$ denotes the addressed memory entry.
- Diffusion-based GMem (Tang et al., 11 Dec 2024): The denoiser is conditioned on a retrieved memory snippet $s$, minimizing a memory-conditioned velocity-prediction objective (see the sketch following this list):
  $$\mathcal{L}(\theta)=\mathbb{E}_{x,\,\epsilon,\,t}\big[\big\|v_\theta(x_t,\,t,\,s)-v_t\big\|_2^2\big],$$
  where $x_t$ is the noised sample at time $t$ and $v_t$ the target velocity of the interpolation path.
- End-to-end Joint Training: Gradients propagate into network and projection layers, but memory banks are usually fixed post-initialization or after an initial pretraining/fine-tuning phase.
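For concreteness, the sketch below instantiates two of the objectives above: a MyGO-style sleep-phase logit-MSE distillation step and a memory-conditioned velocity (flow-matching) loss. The function names, the `teacher` that supplies target logits, and the linear interpolation path are assumptions for illustration, not the exact published recipes.

```python
import torch
import torch.nn.functional as F

def sleep_phase_distillation(core, gmem_modules, num_classes, batch=64, z_dim=100):
    """MyGO-style 'sleep' step (sketch): distil pseudo-samples from stored
    per-task generators into the fixed-size core via logit MSE.
    gmem_modules: list of (generator, teacher) pairs; `teacher` is whatever
    supplies target logits for a pseudo-batch (an assumption of this sketch)."""
    losses = []
    for generator, teacher in gmem_modules:
        y = torch.randint(num_classes, (batch,))          # class conditioning
        z = torch.randn(batch, z_dim)                     # latent noise
        x_fake = generator(z, y)                          # pseudo-samples; no raw data replay
        with torch.no_grad():
            target_logits = teacher(x_fake)               # frozen target logits
        losses.append(F.mse_loss(core(x_fake), target_logits))
    return torch.stack(losses).mean()

def memory_conditioned_velocity_loss(denoiser, x1, snippet):
    """Memory-conditioned flow-matching loss (sketch): x1 is a (B, D) batch of
    clean latents, `snippet` the retrieved memory vectors used as conditioning."""
    x0 = torch.randn_like(x1)                             # noise endpoint
    t = torch.rand(x1.size(0), 1)                         # uniform time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                         # linear interpolation path
    return F.mse_loss(denoiser(x_t, t, snippet), x1 - x0) # target velocity = x1 - x0
```

In both cases the memory itself (the stored generators or the snippet bank) receives no gradient; only the core network and any projection layers are updated, consistent with the fixed-memory regime described above.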
4. Application Domains and Empirical Performance
Lifelong and Few-Shot Learning: MyGO demonstrates strong performance on continual learning benchmarks (Split-MNIST, Split-AG News) (Ji et al., 29 Aug 2025). Catastrophic forgetting is substantially mitigated: on Split-MNIST, MyGO achieves 97.19% average accuracy across five tasks, whereas standard fine-tuning suffers severe forgetting. On Split-AG News, MyGO likewise retains substantially more old-task accuracy than fine-tuning.
Diffusion Models: GMem achieves state-of-the-art diffusion sample quality and efficiency. On ImageNet, GMem attains a lower FID in $160$ epochs (roughly 20 hours of training) than LightningDiT reaches in $800$ epochs. Parameter efficiency is also significant: GMem with $0.675$B parameters matches or beats SiT-XL/2 ($1.8$B) at much lower computational cost, and substantial training speedups together with sampling in far fewer function evaluations are observed (Tang et al., 11 Dec 2024).
Variational Memory Models: On Omniglot few-shot tasks, GMem architectures with stochastic memory achieve negative log-likelihoods of $75$ nats (1-shot) and $69$ nats (8-shot), outperforming comparable vanilla VAEs without external memory (Bornschein et al., 2017).
Long-term Temporal Structure: Generative temporal models with memory establish state-of-the-art results on recall, dependency, and navigation benchmarks, outperforming pure RNN/LSTM baselines due to explicit decoupling of memory and core computation (Gemici et al., 2017).
5. Scalability, Efficiency, and Privacy
Parameter Growth: GMem’s modularity enables core architectures to remain constant in size while memory can scale linearly or sublinearly:
- In MyGO, each G-mem module requires only megabytes of storage, so total memory across all tasks grows modestly and remains substantially lower than raw data storage (Ji et al., 29 Aug 2025).
- In diffusion, scaling the memory bank (about 1.8 GB of snippets) improves FID without impacting the transformer backbone size (Tang et al., 11 Dec 2024).
Query and Update Complexity: Hard-max and soft-attention addressing keep per-query retrieval cheap; snippet dropout and approximate nearest-neighbor search further enhance robustness and speed (Tang et al., 11 Dec 2024, Bornschein et al., 2017).
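As a small illustration of how snippet dropout can be folded into hard-max retrieval, the sketch below masks random bank entries during training before the nearest-neighbor lookup; the dropout placement and rate are assumptions in the spirit of the technique, not the published procedure.

```python
import torch
import torch.nn.functional as F

def retrieve_with_snippet_dropout(query, bank, drop_p=0.1, training=True):
    """Hard-max retrieval over a snippet bank with optional snippet dropout.
    During training, each bank entry is dropped with probability drop_p so the
    generator cannot over-commit to any single snippet; at inference the full
    bank is used."""
    sims = F.normalize(query, dim=-1) @ F.normalize(bank, dim=-1).T   # (B, N) cosine scores
    if training and drop_p > 0.0:
        keep = torch.rand(bank.size(0)) > drop_p                      # per-snippet keep mask
        sims = sims.masked_fill(~keep, float("-inf"))                 # exclude dropped snippets
    return bank[sims.argmax(dim=-1)]                                  # (B, dim) readout

bank = F.normalize(torch.randn(2048, 128), dim=-1)
readout = retrieve_with_snippet_dropout(torch.randn(4, 128), bank)
```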
Privacy Preservation: By storing only model parameters or feature snippets—not raw inputs—these architectures protect data privacy. Recovering training data from generator weights or bank vectors is information-theoretically infeasible for sufficiently expressive models (Ji et al., 29 Aug 2025).
Training and Sampling Acceleration: Empirical evidence indicates order-of-magnitude speedups: GMem diffusion delivers high-quality samples after hours of training rather than the days required by baselines, and sampling reaches low FID with as few as $25$ function evaluations (Tang et al., 11 Dec 2024).
6. Variants, Extensions, and Limitations
Hierarchical and Dynamic Memory: GMem frameworks are extensible to hierarchical stacks (multi-level addressing), dynamic writes (for streaming or RL), and hybrid memory addressing schemes (Bornschein et al., 2017, Gemici et al., 2017).
Applicability to Modalities: Architectures are adaptable across domains. Conditional GANs operate on embeddings for text tasks (e.g., AG News in MyGO), while variational memories are applicable to audio or time series by swapping encoders/decoders (Ji et al., 29 Aug 2025, Bornschein et al., 2017).
Limitations:
- Some architectures require fixed memory post-initialization, constraining adaptation to new data.
- Hybrid retrieval and content-based writes introduce complexity in controller design (Gemici et al., 2017).
- No single memory configuration dominates every task; task structure may favor introspective, content-based, or hybrid mechanisms (Gemici et al., 2017).
- There is a modest storage growth in sequence-of-tasks regimes, though this remains tractable in practice.
Plausible avenues for future work include approximate memory retrieval for extreme scaling, smoothing/retrofitting of memory entries for improved knowledge consolidation, and architectural advances integrating flows, discrete variables, or auxiliary inference.
7. Synthesis and Outlook
GMem architectures instantiate a growing paradigm of modular generative models with external memory, wherein learning, storing, and retrieving complex data distributions are realized via explicit and decoupled memory systems. This approach unites advances from continual learning, few-shot generative modeling, and efficient large-scale diffusion, supported by theoretically justified objectives, empirical speed and accuracy gains, and strong privacy guarantees. Emerging hybrid and hierarchical memory systems, together with efficient addressing, are expected to expand GMem’s applicability to lifelong, multi-modal, and resource-constrained generative modeling across domains.