One Pass Is Not Enough: Recursive Latent Refinement for Generative Models
Abstract: Despite remarkable progress, image generation is far from solved. The dominant metric, FID, conflates sample fidelity with mode coverage and is close to being saturated. Yet a model can still exhibit mode collapse while achieving a low FID, since a handful of sharp, near-duplicate images can outscore a model that faithfully covers the full data distribution. We argue that precision and recall are essential complements to FID, and that because FID is already saturated, the more meaningful goal is to improve diversity and coverage. Achieving high recall requires a model that explicitly prioritizes mode coverage, unlike most generative models, which optimize sample fidelity. We introduce RTM, which replaces the single-pass latent mapping in style-based generators with an iterative refinement process, and show that this consistently improves both quality and diversity. Integrated with Implicit Maximum Likelihood Estimation (IMLE), which optimizes mode coverage by design, RTM achieves the highest precision and recall among current state-of-the-art approaches while maintaining competitive FID, with improvements across CIFAR-10, CelebA-HQ at 256x256, and nine few-shot benchmarks. RTM also improves StyleGAN2 and StyleGAN2-ADA on CIFAR-10 and AFHQ-v1 at 512x512, demonstrating that the benefit is not specific to IMLE. Unlike flow-matching baselines that achieve competitive FID at the expense of coverage, recursive refinement improves both quality and diversity simultaneously.
Explain it Like I'm 14
A Simple, Teen-Friendly Explanation of “One Pass Is Not Enough: Recursive Latent Refinement for Generative Models”
1) What is this paper about?
This paper is about teaching computers to create better pictures from scratch. The authors say that many image-generating AIs focus too much on making a few super-sharp images and not enough on covering all the different kinds of images that exist (the full variety). They introduce a new idea called the Recursive Token Mapper (RTM), which helps generators make images that are both high quality and more diverse.
2) What questions are the researchers trying to answer?
They’re mainly asking:
- How can we build image generators that make many different kinds of realistic images, not just a few types?
- Can we improve both image quality and diversity at the same time?
- Is there a better way to turn random noise (the starting point of many generators) into the “style code” that guides the final picture?
To judge success, they look beyond a common score called FID. FID often rewards sharpness but can hide the problem of low variety. Alongside FID, they also report:
- Precision: What fraction of generated images look real?
- Recall: How much of the real-world variety does the model cover?
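The two questions above can be turned into numbers with a k-nearest-neighbour test. The sketch below is a minimal illustration under stated assumptions, not the paper's evaluation code: it uses raw 2-D points instead of Inception features, and the function names, `k=3`, and the toy data are all made up for the example. Around each sample it draws a ball whose radius is the distance to that sample's k-th nearest neighbour; precision counts generated samples that land inside a real ball, and recall counts real samples that land inside a generated ball.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def kth_nn_radius(x, k):
    """Distance from each row of x to its k-th nearest neighbour in x."""
    d = pairwise_dist(x, x)
    # After sorting, column 0 is the point itself (distance 0), so column k
    # is the k-th neighbour excluding the point.
    return np.sort(d, axis=1)[:, k]

def knn_precision_recall(real, fake, k=3):
    d_rf = pairwise_dist(real, fake)   # shape (n_real, n_fake)
    r_real = kth_nn_radius(real, k)    # one radius per real sample
    r_fake = kth_nn_radius(fake, k)    # one radius per generated sample
    # Precision: fraction of generated samples inside at least one real ball.
    precision = (d_rf <= r_real[:, None]).any(axis=0).mean()
    # Recall: fraction of real samples inside at least one generated ball.
    recall = (d_rf <= r_fake[None, :]).any(axis=1).mean()
    return float(precision), float(recall)

# Toy data: the "real" set has two clusters, the generator only produces one.
rng = np.random.default_rng(0)
real = np.concatenate([rng.normal(0.0, 0.1, (100, 2)),
                       rng.normal(5.0, 0.1, (100, 2))])
collapsed = rng.normal(0.0, 0.1, (200, 2))

p, r = knn_precision_recall(real, collapsed)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy setup the collapsed generator can never cover the second cluster, so its recall cannot exceed one half no matter how realistic its samples are.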
3) How do they do it? (Methods explained simply)
Many image generators, including the style-based ones this paper studies, start with a list of random numbers (noise) and convert it into a hidden instruction set (a “style code”) that the generator uses to draw a picture. Usually, this conversion is done in one quick pass through a small neural network. The authors say “one pass is not enough.”
Their idea, RTM, is like writing multiple drafts of an essay:
- First, you make a rough outline (coarse idea).
- Then you go over it again and again, each time fixing mistakes and adding details—improving structure first, then textures, then colors, and so on.
In computer terms:
- The old way: a single “mapping network” decides everything about the style code in one go.
- The new way (RTM): a small block of computation is reused many times on the hidden code, steadily polishing it. This is called “recursive refinement.”
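The contrast between the two mappers can be sketched in a few lines. This is a toy illustration only: the real RTM operates on a grid of tokens with an MLP-Mixer-style block, whereas the code below uses a single vector and a made-up `shared_block`. The width `DIM`, the random weights, and every function name are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # hypothetical latent/style width

def pixel_norm(x, eps=1e-8):
    # Rescale each vector to unit root-mean-square magnitude.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

# Old way: a stack of distinct layers, each applied exactly once.
mlp_weights = [rng.normal(0.0, DIM ** -0.5, (DIM, DIM)) for _ in range(3)]

def single_pass_mapper(z):
    h = pixel_norm(z)
    for w in mlp_weights:
        pre = h @ w
        h = np.maximum(pre, 0.2 * pre)  # leaky ReLU
    return h

# New way (toy version): ONE block whose weights are reused at every step.
w_state = rng.normal(0.0, DIM ** -0.5, (DIM, DIM))
w_input = rng.normal(0.0, DIM ** -0.5, (DIM, DIM))

def shared_block(state, z):
    # Propose a correction from the current draft and the original noise,
    # add it residually, then renormalise so the state stays bounded.
    update = np.tanh(state @ w_state + z @ w_input)
    return pixel_norm(state + update)

def recursive_mapper(z, steps=4):
    z = pixel_norm(z)
    state = np.zeros_like(z)      # start from an empty draft
    for _ in range(steps):        # same weights at every refinement step
        state = shared_block(state, z)
    return state

z = rng.normal(size=(4, DIM))
w_fast = recursive_mapper(z, steps=2)   # fewer steps: cheaper
w_slow = recursive_mapper(z, steps=6)   # more steps: more refinement
print(w_fast.shape, w_slow.shape)
```

Because the block is shared, the number of parameters does not grow with `steps`, which is what makes it possible to change the step count after training.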
They test RTM inside two kinds of generators:
- IMLE-based generators: These make sure every real training image has a nearby generated image. That means the model can’t “forget” rare types (it fights mode collapse—when the model keeps generating only a few kinds of images).
- StyleGAN-based generators: Popular, fast generators trained with an adversary (a classifier that tries to tell real from fake).
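The IMLE idea of "every real image gets a nearby generated image" can be sketched as a matching step followed by a distance loss. The code below is a minimal illustration on 2-D points, assuming plain squared Euclidean distance; the actual training uses image-space and perceptual distances and updates the generator by gradient descent, neither of which is shown. The names `imle_match` and `imle_loss` and the toy pools are hypothetical.

```python
import numpy as np

def imle_match(real, generated):
    """For each real sample, find its nearest sample in the generated pool."""
    diff = real[:, None, :] - generated[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)          # shape (n_real, pool_size)
    idx = sq_dist.argmin(axis=1)                # index of nearest generated sample
    return idx, sq_dist[np.arange(len(real)), idx]

def imle_loss(real, generated):
    """Mean squared distance from each real sample to its matched sample."""
    _, matched_sq_dist = imle_match(real, generated)
    return float(matched_sq_dist.mean())

rng = np.random.default_rng(0)
real = np.concatenate([rng.normal(0.0, 0.1, (50, 2)),
                       rng.normal(5.0, 0.1, (50, 2))])

# A pool that ignores the second cluster versus one that covers both.
collapsed_pool = rng.normal(0.0, 0.1, (200, 2))
covering_pool = np.concatenate([rng.normal(0.0, 0.1, (100, 2)),
                                rng.normal(5.0, 0.1, (100, 2))])

loss_collapsed = imle_loss(real, collapsed_pool)
loss_covering = imle_loss(real, covering_pool)
print(f"collapsed={loss_collapsed:.3f} covering={loss_covering:.3f}")
```

The loss is averaged over real samples, so any real point without a nearby generated neighbour contributes a large term. That is the sense in which the objective penalises forgotten modes directly.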
To keep the explanation concrete:
- Think of the “latent code” (noise) as a secret recipe.
- The “mapping network” is the translator that turns that recipe into a detailed cooking plan (the style code).
- RTM is like checking and refining the plan multiple times before cooking, so the final dish (image) turns out both tasty (high quality) and varied (you can cook many different dishes).
4) What did they find, and why does it matter?
Across several standard datasets (such as CIFAR-10, which contains tiny images of objects; CelebA-HQ, which contains faces; and AFHQ, which contains animal faces), they found that:
- RTM improves both quality and diversity at the same time.
- With IMLE training, RTM reached the highest Precision and Recall among strong competitors, while keeping FID competitive.
- Plugging RTM into StyleGAN2 and StyleGAN2-ADA (without changing anything else) also improved key measures, showing RTM’s benefits are general.
- You can even run more refinement steps at test time (after training) for a small extra boost, or fewer steps to go faster—with no retraining needed.
Why this matters:
- Many models look good by repeating a few sharp-looking images. RTM helps cover more of the real-world variety (better Recall) while still keeping images realistic (good Precision).
- Fast, one-step generation is preserved—useful for practical systems that need speed.
To keep the measures straight, here’s a short, plain-English guide:
- Precision: Of the pictures the model makes, how many look truly real?
- Recall: Of the kinds of real pictures out there, how many kinds can the model produce?
- FID: A single number mixing quality and variety—but it can be fooled by producing a few extra-sharp images and ignoring rare types.
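For reference, the standard definition of FID (not restated on this page) compares the mean and covariance of Inception features for real images (subscript r) and generated images (subscript g):

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Both terms are folded into one scalar, so an FID value on its own does not say whether a gap comes from unrealistic samples or from missing variety. That is the motivation for reporting Precision and Recall next to it.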
5) What’s the bigger impact?
- Better coverage of rare or unusual examples makes generators more useful for things like data augmentation, education, and scientific imaging.
- Keeping generation to one fast step is practical for apps and devices with limited compute.
- As with any image generator, there are risks (e.g., deepfakes). This work is an architectural improvement at moderate image sizes, but responsible use still matters.
In short: The paper shows that taking several small “thinking steps” to refine the hidden plan before drawing can make the final images both better-looking and more varied. One pass was not enough—multiple smart passes make a real difference.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concrete list of what remains missing, uncertain, or unexplored, distilled to guide future work:
- Scalability to large datasets: IMLE’s nearest-neighbour matching and RS-IMLE’s rejection stage scale poorly with dataset size; feasibility on ImageNet or larger remains untested and would likely require approximate search, batching, or curriculum strategies.
- Training cost and memory profiling: Only inference latency is reported; end-to-end training wall-clock, peak memory, and throughput versus the recursion cycle counts H and L, pool size m, and dataset size are not quantified.
- Sensitivity to recursion depth: There is no systematic ablation of training-time H and L across a broad grid (beyond a small inference-time sweep), so it is unclear how depth affects convergence speed, final quality, and compute–performance trade-offs in both IMLE and GAN regimes.
- Credit assignment under short gradients: The choice to backpropagate through only the final step is not ablated; it is unclear whether partial/full unrolling or gradient checkpointing improves training dynamics or coverage.
- Fairness to stronger MLP baselines: There is no controlled comparison against deeper/wider/residual MLP mappers with matched parameter count and compute, leaving the specific benefit of recursion vs. depth/width unclear.
- Role of weight tying vs. recursion: It is unknown whether gains come from weight sharing, iterative refinement, or both; comparisons to untied iterative stacks or looped feed-forward networks are missing.
- Block choice beyond few-shot: The MLP-Mixer vs. attention choice is only validated on few-shot; attention-based RTM on large datasets (CIFAR-10, CelebA-HQ, AFHQ) is not evaluated for quality/speed trade-offs.
- Tokenization design: The impact of token count, channel width, positional schemes, and initialization on quality/diversity is not ablated.
- Decoder interactions: RTM is shown with a few decoders (ConvNeXt-style, StyleGAN2/ADA); integration with other strong decoders (e.g., StyleGAN3, U-Net backbones) is untested.
- Conditional generation: No results for class-conditional, attribute-conditional, or text-to-image settings; how RTM interacts with conditioning pathways and label embeddings is unknown.
- Class-wise coverage: CIFAR-10 class-wise precision/recall (or confusion-style coverage) is not reported; whether RTM improves long-tail classes or rare poses remains unverified.
- Diversity diagnostics beyond PR: No analysis of sample uniqueness, duplicate rates, birthday paradox tests, or manifold-volume estimates; diversity claims rely solely on PR/DC metrics.
- Metric consistency: Different evaluation pipelines are used across tables (e.g., StudioGAN vs. custom Inception features), limiting cross-table comparability; sensitivity to k in PR/DC, choice of feature extractor, and reference split (train vs. test) is not examined.
- Generalization of inference-time H: Only IMLE runs are swept for inference-time H; effects in adversarial (StyleGAN) training or on AFHQ are not reported, and long-H stability is untested.
- Failure mode analysis: On AFHQ, Density decreases while Recall increases with RTM; root-cause analysis (e.g., shift toward low-density modes) is not provided.
- Privacy/memorization risk: IMLE’s nearest-neighbour pairing could increase memorization; membership inference, nearest-neighbour overlap, and reconstruction similarity analyses are absent.
- Prior design: Only Gaussian latents are considered; whether alternative priors (mixtures, learned priors, normalizing flows over z) improve coverage or controllability with RTM is unknown.
- Halting policy: Adaptive per-sample compute (learned halting) is proposed but not realized; a halting mechanism compatible with IMLE matching and its training signal remains an open design problem.
- Intermediate-state interpretability: The hypothesized coarse-to-fine refinement is not empirically validated (e.g., probing early vs. late step semantics, trajectory smoothness in w-space).
- Latent geometry and disentanglement: Effects on w-space structure (linear interpolations, style mixing, editing directions, identity/pose disentanglement) are not studied.
- Robustness and stability: Variance across seeds, sensitivity to hyperparameters, and training stability curves (especially in GAN settings) are not reported; confidence intervals are missing.
- Pool size m and RS-IMLE ε: No sensitivity analysis for pool size, matching threshold ε, or rejection rate; their impact on precision/recall and compute is unclear.
- Large-scale/high-resolution synthesis: Performance at >512×512 or megapixel resolutions is untested; whether RTM scales favorably in memory/compute and maintains coverage at high resolution is unknown.
- Cross-family applicability: Whether RTM-like recursive mapping benefits other one-step families (e.g., normalizing flows) or serves as a conditioning mapper in diffusion/flow models is unexplored.
- Downstream utility: The practical benefits of improved coverage for data augmentation or scientific imaging are not validated via downstream task metrics.
- Reproducibility artifacts: Code/weights availability, exact seeds, and full training configs are not specified here; without them, replication of gains and sensitivity analysis is hindered.
- Theoretical underpinnings: No formal analysis (e.g., optimization landscape, Lipschitz properties, fixed-point behavior) explains why recursion improves coverage/fidelity in latent mapping; establishing such theory remains open.
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that leverage the paper’s findings (recursive latent refinement with RTM) and observed benefits (higher precision and recall, one-step inference, drop-in integration with StyleGAN2/ADA and IMLE).
- RTM plug-in for existing StyleGAN pipelines
- Sectors: software, media/entertainment, e-commerce
- What: Replace the StyleGAN2/ADA mapping MLP with RTM to improve image diversity and quality without altering decoders, losses, or training schedules.
- Tools/products/workflows: “RTM-Mapper” package for StudioGAN/StyleGAN codebases; CI tests that report Precision/Recall alongside FID; A/B tests on asset libraries.
- Assumptions/dependencies: Access to training code/checkpoints; moderate retraining budget; StyleGAN2/ADA gains are reported on CIFAR-10 and AFHQ-v1 (the CelebA-HQ results use IMLE), and nothing is validated at ImageNet scale yet.
- Few-shot content generation with better coverage (RS-IMLE + RTM)
- Sectors: creative industries, gaming, advertising, small businesses
- What: Generate brand-consistent avatars, mascots, or product shots from small exemplars, while reducing mode collapse typical in few-shot settings.
- Tools/products/workflows: “Coverage-first few-shot generator” SaaS; brand onboarding flow (upload 20–200 images → generate catalog/variations); style-presets using IMLE’s one-to-one latent assignment.
- Assumptions/dependencies: Small curated exemplars; adherence to IP/consent; IMLE’s nearest-neighbour pool search scales with dataset size.
- Diversity-first synthetic data augmentation
- Sectors: robotics, manufacturing, retail, document/OCR, healthcare (low-risk, non-diagnostic)
- What: Create balanced, diverse synthetic sets that reduce class imbalance and long-tail undercoverage, improving robustness of downstream classifiers.
- Tools/products/workflows: “Coverage-certified augmentation” pipeline that logs Precision/Recall/Density/Coverage; data loaders that interleave real and IMLE+RTM samples for rare classes.
- Assumptions/dependencies: Domain shift risk must be evaluated; for medical use, limit to non-diagnostic tasks unless clinically validated.
- Real-time/on-device content generation (1-step inference)
- Sectors: mobile apps, AR/VR, gaming
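- What: Generating an image in a single network evaluation (the mapper's refinement loop runs inside that one forward pass) enables avatars, stickers, NPC portraits, textures, and backgrounds on consumer GPUs/edge devices within tight latency budgets.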
- Tools/products/workflows: Mobile SDK with adjustable H (refinement steps) knob to trade latency vs. small fidelity improvements; asset streaming for live events.
- Assumptions/dependencies: Model sizes compatible with device memory; resolution needs may exceed current benchmarks (32–512 px).
- Production evaluation upgrade: Precision/Recall as first-class metrics
- Sectors: ML platforms, MLOps, enterprise AI governance
- What: Adopt PR and PRDC (Density/Coverage) in dashboards to detect mode collapse masked by low FID.
- Tools/products/workflows: “Coverage Monitor” service; acceptance gates in CI/CD requiring Recall thresholds before deployment; periodic bias/coverage reports.
- Assumptions/dependencies: Consistent feature extractor for metrics (Inception-v3 settings); metric education across teams.
- GAN training stabilization via recursive mapping
- Sectors: media/entertainment, e-commerce
- What: Reduce mode collapse in adversarial training (StyleGAN2/ADA + RTM) while retaining or improving FID/IS.
- Tools/products/workflows: Turnkey “RTM-GAN” training template; hyperparameter sweeps that vary H and L.
- Assumptions/dependencies: Same discriminators/regularizers; improvements shown on CIFAR-10 and AFHQ-v1.
- Latency–quality knobs at inference (variable H without retraining)
- Sectors: SaaS content platforms, real-time rendering, A/B testing
- What: Dynamically dial inference steps H to meet latency SLAs or slightly boost fidelity during off-peak hours—no fine-tuning required.
- Tools/products/workflows: API parameter h_steps; autoscaler that increases H for VIP workloads.
- Assumptions/dependencies: Gains plateau beyond modest H increases; most improvements modest but practical.
- Privacy-aware dataset sharing via synthetic surrogates
- Sectors: finance (document layouts), retail (product imagery), public sector (urban scenes)
- What: Use RTM-based generators to create diverse synthetic surrogates for model pretraining where sharing raw data is restricted.
- Tools/products/workflows: “Synthetic sandbox” for partner evaluation; PR/coverage badges for shared datasets.
- Assumptions/dependencies: Does not guarantee privacy by default; requires privacy risk assessment and potential additional safeguards.
- Academic baselines and benchmarks emphasizing coverage
- Sectors: academia, open-source
- What: Use RTM as a standard mapper in StyleGAN/IMLE baselines; report PR/PRDC in addition to FID.
- Tools/products/workflows: Reproducible Colab kits; leaderboards ranking jointly on FID and Recall.
- Assumptions/dependencies: Community adoption; consistent metric protocols.
- Content moderation and bias diagnostics for generative systems
- Sectors: platforms, trust & safety
- What: Use Recall/Coverage diagnostics to spot demographic under-representation and reduce “collapsed” outputs that may reflect bias.
- Tools/products/workflows: Coverage disparity reports by attribute; threshold-based retraining triggers.
- Assumptions/dependencies: Requires labeled or attribute-annotated references; ethical review of attribute use.
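Several items above rely on Density and Coverage (the PRDC metrics) and on a deployment gate that checks them. A minimal sketch of both follows, assuming features are already extracted as NumPy arrays. The function names, `k=5`, the toy data, and the 0.8 threshold are illustrative choices, not values from the paper. Each real sample gets a ball whose radius is the distance to its k-th nearest real neighbour; Density counts how many real balls each generated sample falls into (divided by k), and Coverage is the fraction of real samples whose ball contains at least one generated sample.

```python
import numpy as np

def _dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def density_coverage(real, fake, k=5):
    # Radius of each real ball: distance to the k-th nearest real neighbour
    # (column 0 of the sorted row is the point itself).
    radii = np.sort(_dist(real, real), axis=1)[:, k]
    d_rf = _dist(real, fake)                    # shape (n_real, n_fake)
    inside = d_rf < radii[:, None]              # fake j lies in real ball i
    density = inside.sum(axis=0).mean() / k     # average ball count per fake, over k
    coverage = (d_rf.min(axis=1) < radii).mean()
    return float(density), float(coverage)

def passes_gate(coverage, min_coverage=0.8):
    """Hypothetical CI gate: block deployment when coverage is too low."""
    return coverage >= min_coverage

rng = np.random.default_rng(0)
real = np.concatenate([rng.normal(0.0, 0.1, (100, 2)),
                       rng.normal(5.0, 0.1, (100, 2))])
collapsed = rng.normal(0.0, 0.1, (200, 2))      # ignores the second cluster

d_col, c_col = density_coverage(real, collapsed)
print(f"density={d_col:.2f} coverage={c_col:.2f} deploy={passes_gate(c_col)}")
```

In production the same feature extractor and the same `k` would need to be fixed across runs, since the scores are only comparable under a consistent protocol.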
Long-Term Applications
These opportunities require further research, scaling, or engineering (e.g., higher resolutions, larger datasets, conditional control, regulatory validation).
- Scaling IMLE + RTM to ImageNet/industrial datasets
- Sectors: software, foundation models
- What: Achieve coverage-first training at large scale by accelerating nearest-neighbour matching (e.g., approximate search, cluster-based assignment).
- Tools/products/workflows: Distributed IMLE poolers; vector DB integration; memory-efficient feature encoders.
- Assumptions/dependencies: Algorithmic and systems advances to curb IMLE’s matching cost; substantial compute.
- Conditional and controllable generation with recursive mapping
- Sectors: media, e-commerce, design tooling
- What: Extend RTM to conditional settings (class-, text-, or layout-conditional) to improve coverage in controllable generators.
- Tools/products/workflows: “RTM-Conditioner” that maps noise+condition to styles; plug-ins for ControlNet-like modules.
- Assumptions/dependencies: Architecture adaptation and training recipes; evaluation protocols for conditional coverage.
- Multi-modal RTM mappers (video/audio/3D)
- Sectors: film/animation, AR/VR, robotics simulation
- What: Recursive latent refinement for temporal or spatial tokens to boost diversity in video, audio, and 3D asset generation.
- Tools/products/workflows: RTM-Video mapper in GAN/IMLE video models; token-grid RTM for NeRFs/3DGANs.
- Assumptions/dependencies: Efficient token mixing at large sequence lengths; stable decoders; new metrics for coverage in time/3D.
- Rare-event simulation for autonomy and safety
- Sectors: autonomous driving, robotics, safety research
- What: Generate long-tail scenarios (near-misses, rare weather/lighting) with better coverage to stress-test perception/planning stacks.
- Tools/products/workflows: “Long-tail scenario bank” with coverage reports; simulator-in-the-loop training.
- Assumptions/dependencies: High-resolution, conditional control, and validated realism; integration with simulators.
- Clinically validated medical augmentation for rare conditions
- Sectors: healthcare
- What: Use coverage-first synthesis to bolster training data for rare pathologies, reducing false negatives in diagnostic models.
- Tools/products/workflows: FDA/CE-compliant pipelines; clinical trials to validate benefit; traceable PR metrics.
- Assumptions/dependencies: Rigorous clinical validation and governance; privacy/ethics compliance; domain-shift assessment.
- Learned halting for compute allocation to rare modes
- Sectors: platforms, edge AI
- What: Add a halting head (as in HRM/TRM) so the model allocates more refinement steps to hard/rare latents, improving coverage under fixed budgets.
- Tools/products/workflows: Dynamic H per-sample in serving; scheduler aware of coverage deficits.
- Assumptions/dependencies: New loss design compatible with IMLE; stability and fairness analysis.
- Standards and audits that require coverage metrics
- Sectors: policy, procurement, compliance
- What: Incorporate Precision/Recall/Density/Coverage into model reporting standards for generative AI used in public or high-stakes contexts.
- Tools/products/workflows: “Coverage Statement” in model cards; third-party audits with reproducible protocols.
- Assumptions/dependencies: Regulator/industry consensus; robust, agreed-upon metric implementations.
- Hardware/software co-design for recursive mappers
- Sectors: semiconductors, edge devices
- What: Optimize kernels for repeated small-block execution (RTM) and short-gradient training, enabling low-power, on-device generation.
- Tools/products/workflows: Compiler passes that fuse inner/outer cycles; SRAM-friendly token mixing.
- Assumptions/dependencies: Vendor support; sufficient market pull from on-device generative apps.
- Enterprise “diversity-first” content pipelines
- Sectors: marketing, retail, media ops
- What: Production systems that continuously monitor asset diversity coverage and retrain with RTM-enhanced mappers when drift/collapse is detected.
- Tools/products/workflows: Coverage SLAs in content ops; automated retraining triggers; dataset curation based on under-covered modes.
- Assumptions/dependencies: Metadata/analytics maturity; willingness to invest in ML observability.
- Synthetic data marketplaces with coverage guarantees
- Sectors: data economy, analytics
- What: Offer synthetic datasets annotated with coverage metrics and “rare-mode” certifications for downstream training.
- Tools/products/workflows: Data catalogs exposing PR/PRDC; buyer-side validation kits.
- Assumptions/dependencies: Legal/ethical frameworks; trust in third-party audits.
Notes on assumptions and dependencies (cross-cutting)
- Scale and domains: Results shown for CIFAR-10 (32×32), CelebA-HQ (256×256), and AFHQ-v1 (512×512). Higher-resolution, diverse domains will need further validation.
- Training cost: IMLE’s nearest-neighbour pool matching is the main bottleneck for large datasets; GAN-based training with RTM avoids this but reintroduces adversarial dynamics.
- Metrics: Precision/Recall and PRDC depend on consistent feature extractors and evaluation protocols; teams must standardize these to compare models fairly.
- Legal/ethical: Ensure consent/IP for training data; synthetic data does not automatically confer privacy—perform dedicated privacy risk assessments.
- Integration: RTM is a “drop-in” mapper; decoders, losses, and augmentation pipelines remain as-is (eases adoption but still requires retraining to realize gains).
Glossary
- Adaptive Instance Normalization (AdaIN): A normalization layer that modulates feature statistics per instance using style-dependent affine parameters to control visual attributes. "via Adaptive Instance Normalization~\citep{huang2017adain}"
- AFHQ-v1: A high-resolution animal faces dataset commonly used for generative image modeling benchmarks. "AFHQ-v1 at "
- Consistency Models (CD/CT): One-step generative models trained to enforce consistency across noise scales, enabling fast sampling; CD and CT denote distillation and training variants. "Consistency Models (CD, 1-NFE)~\citep{song2023consistency}"
- ConvNeXt: A modern convolutional network architecture with design choices inspired by vision transformers, used here as decoder blocks. "ConvNeXt-style blocks~\citep{liu2022convnext}"
- DDPM (Denoising Diffusion Probabilistic Model): A diffusion-based generative model that iteratively denoises from noise to data. "DDPM~\citep{ho2020ddpm}"
- Density (PRDC metric): A kNN-based metric quantifying how densely generated samples populate regions around real data points, complementing precision/recall. "Density and Coverage~\citep{naeem2020prdc}"
- EDM (Elucidated Diffusion Models): A diffusion modeling framework with improved training and sampling procedures for image generation. "EDM~\citep{karras2022edm}"
- Flow Matching (FM): A family of methods that learn time-dependent vector fields transporting noise to data, trained by matching probability flows. "Flow Matching (FM)~\citep{lipman2023fm}"
- Fréchet Inception Distance (FID): A widely used metric that compares distributions of Inception features for real and generated images; lower is better. "Fréchet inception distance (FID)~\citep{heusel2017fid}"
- Generative Adversarial Networks (GANs): Generative models trained via a two-player minimax game between a generator and a discriminator. "generative adversarial networks (GANs)~\citep{goodfellow2014gan,karras2020stylegan2}"
- Hierarchical Reasoning Model (HRM): A recursive architecture with nested cycles and a halting mechanism, adapted here conceptually for recursive mapping. "the Hierarchical Reasoning Model (HRM)~\citep{wang2025hrm}"
- Implicit Maximum Likelihood Estimation (IMLE): A training paradigm that ensures every training image has a nearby generated sample, preventing mode collapse. "Implicit Maximum Likelihood Estimation (IMLE)"
- Inception Score (IS): A metric that evaluates image quality and diversity from a pretrained classifier's predictions on generated images: confident per-image predictions and a varied spread of predicted classes across images both raise the score; higher is better. "Inception Score~\citep{salimans2016is}"
- k-nearest-neighbour Precision and Recall: kNN-based measures that assess sample fidelity (precision) and data coverage (recall) separately, rather than folding them into one number as FID does. "the nearest-neighbour Precision and Recall"
- LPIPS (Learned Perceptual Image Patch Similarity): A perceptual distance metric used as part of the reconstruction loss for training. "an LPIPS~\citep{zhang2018lpips} perceptual term"
- MLP-Mixer: An all-MLP architecture that mixes information across tokens and channels; used here as the shared recursive block. "the MLP-Mixer-style token-mixing block of~\citet{tolstikhin2021mlpmixer}"
- Mode collapse: A failure mode where a generator focuses on few modes, producing low-diversity outputs while possibly maintaining high fidelity. "mode collapse"
- PixelNorm: A normalization that rescales each pixel’s feature vector to unit root-mean-square magnitude across channels, commonly used in style-based generators. "After PixelNorm,"
- Recursive Token Mapper (RTM): The proposed recursive mapping network that refines latent tokens over multiple cycles to produce the style vector. "the Recursive Token Mapper (RTM)"
- Rejection-Sampling IMLE (RS-IMLE): An IMLE variant that rejects latents whose generated images are too close to training images to better match the inference prior. "Rejection-Sampling IMLE (RS-IMLE)~\citep{vashist2024rsimle}"
- Self-attention (multi-head self-attention): A mechanism enabling tokens to attend to each other; the TRM variant replaces token mixing with attention. "multi-head self-attention on the token grid"
- SwiGLU: A gated MLP activation (SiLU-gated linear unit) that improves expressivity in feed-forward blocks. "a SwiGLU MLP"
- Tiny Recursive Model (TRM): A compact recursive architecture with nested H×L cycles and deep supervision, adapted here for latent refinement. "the Tiny Recursive Model (TRM)~\citep{jolicoeurmartineau2025trm}"
- Truncated backpropagation through time: A training technique that limits gradient propagation through the unrolled computation to save memory, used here for recursion. "analogous to truncated backpropagation through time."
- Universal Transformers: Transformer models that apply the same block recurrently across depth/steps, related to the recursive design used here. "Universal Transformers~\citep{dehghani2019universal}"
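Three of the building blocks in this glossary (PixelNorm, AdaIN, and SwiGLU) are short enough to write out. The NumPy sketch below follows their commonly used definitions; the shapes, the epsilon values, and the bias-free SwiGLU layout are assumptions for illustration and are not taken from the paper's code.

```python
import numpy as np

def pixel_norm(x, eps=1e-8):
    """Rescale each feature vector (last axis) to unit root-mean-square."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def adain(features, scale, bias, eps=1e-5):
    """Adaptive Instance Normalization for one image.

    features: (C, H, W) feature map; scale, bias: (C,) style-dependent
    parameters, typically produced from the style vector by a learned
    affine layer (not shown here).
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)   # per-channel zero mean, unit std
    return scale[:, None, None] * normalized + bias[:, None, None]

def swiglu(x, w_gate, w_up, w_down):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), then project down."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))            # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
feats = rng.normal(2.0, 3.0, (4, 5, 5))
scale = np.array([1.0, 2.0, 0.5, 3.0])
bias = np.array([0.0, 1.0, -1.0, 0.5])
styled = adain(feats, scale, bias)
print(styled.mean(axis=(1, 2)).round(3), styled.std(axis=(1, 2)).round(3))
```

After `adain`, each channel's mean and standard deviation match the supplied `bias` and `scale`, which is how a style vector controls the statistics of the feature map.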