Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 59 tok/s

Gemini 2.5 Pro 49 tok/s Pro

GPT-5 Medium 32 tok/s Pro

GPT-5 High 33 tok/s Pro

GPT-4o 127 tok/s Pro

Kimi K2 189 tok/s Pro

GPT OSS 120B 421 tok/s Pro

Claude Sonnet 4.5 36 tok/s Pro

2000 character limit reached

Scale-Wise VAR is Secretly Discrete Diffusion (2509.22636v1)

Published 26 Sep 2025 in cs.CV and cs.LG

Abstract: Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency and unified architecture with language and vision. Among them, next scale prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion. We term this reinterpretation as Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion such as iterative refinement and reduce architectural inefficiencies into VAR, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion based perspective of VAR leads to consistent gains in efficiency and generation.

Summary

The paper’s main contribution is proving that scale-wise VAR is mathematically equivalent to a deterministic discrete diffusion process.
It introduces the SRDD framework with innovations like Classifier-Free Guidance, token resampling, and scale distillation to boost image quality and reduce computational cost.
Empirical results reveal significant improvements in FID and IS scores across benchmarks, alongside superior zero-shot generalization in image editing tasks.

Scale-Wise VAR as Discrete Diffusion: Theoretical Foundations and Empirical Advances

Introduction

This paper establishes a formal equivalence between scale-wise Visual Autoregressive Generation (VAR) and discrete diffusion models, introducing the Scalable Visual Refinement with Discrete Diffusion (SRDD) framework. The authors demonstrate that, when equipped with a Markovian attention mask, VAR is mathematically identical to a structured discrete diffusion process. This reinterpretation enables the direct transfer of theoretical and practical advances from the diffusion literature to autoregressive visual generation, resulting in improved sample fidelity, efficiency, and zero-shot generalization.

Theoretical Connection: VAR and Discrete Diffusion

The central theoretical contribution is the proof that Markovian VAR—where each scale is conditioned only on its immediate predecessor—implements a deterministic discrete diffusion process. The forward process in VAR, which progressively refines images from coarse to fine scales, mirrors the signal-to-noise ratio (SNR) increase characteristic of diffusion models.

Figure 1: Scale-wise generation of VAR: The SNR increases through the generation process, similar to the diffusion process.

The loss function for Markovian VAR is shown to be equivalent to the evidence lower bound (ELBO) of discrete diffusion in the limiting case of deterministic transitions. This equivalence is established by analyzing the KL divergence between the forward posterior and model prediction, which reduces to cross-entropy at the non-deterministic token positions. The Markovian attention mask eliminates architectural inefficiencies present in the original VAR, which conditioned on all previous scales, and provides a principled explanation for observed empirical gains.

Methodological Innovations: SRDD and Diffusion-Inspired Enhancements

Building on the theoretical connection, the authors introduce SRDD, a Markovian variant of VAR that leverages discrete diffusion principles. Four key enhancements are proposed:

Classifier-Free Guidance (CFG): SRDD incorporates CFG, enabling controlled trade-offs between sample diversity and fidelity. Unlike VAR, which exhibited unstable behavior under strong guidance, SRDD demonstrates monotonic improvements in FID and IS up to a saturation point.

Figure 2: Effect of cfg on SUN397 Dataset: We Present the effect of cfg on FID and IS Score

Token Resampling (MR): Inspired by masked diffusion models, SRDD refines low-confidence tokens at each scale, leading to sharper images and faster convergence. The ablation paper reveals that a resampling threshold of $p_{\mathrm{resample}}=0.01$ optimally balances coverage and context preservation.

Figure 3: Ablation paper illustrating the effect of MR: We experiment with different threshold $p_{\mathrm{resample}$ and the number of refinement steps.

Simple Resampling (SR): Multiple refinement passes per scale further improve sample quality, analogous to increasing the number of denoising steps in diffusion models.
Scale Distillation: SRDD supports progressive distillation, allowing the removal of intermediate scales without significant loss in perceptual quality. This reduces inference cost and memory usage, achieving up to $1.75\times$ speedup and $3\times$ memory reduction.

Figure 4: Effect of distillation on reducing the number of scales

Empirical Results: Benchmarking and Qualitative Analysis

SRDD is evaluated on MiniImageNet, SUN397, FFHQ, and AFHQ datasets, outperforming LDM, DiT-L/2, and the original VAR across all metrics. Notably, SRDD achieves a 20.2% FID reduction on MiniImageNet and 31.1% IS improvement on FFHQ relative to VAR. The Markovian attention mask and diffusion-inspired enhancements yield consistent gains in both conditional and unconditional generation.

Qualitative comparisons highlight SRDD's ability to produce images with sharper edges, cleaner textures, and fewer artifacts compared to competing models.

Figure 5: Qualitative Comparison of DiT-L/2, VAR and Ours: SRDD; We do not compare with LDM because LDM model didn't converge.

Figure 6: Qualitative Comparison on AFHQ Datasets, LDM, DiT-L/2, VAR and SRDD: DiT-L/2 didn't converge on AFHQ Datasets.

Figure 7: Qualitative Comparison of FFHQ Datasets LDM, DiT-L/2, VAR and SRDD.

Figure 8: Qualitative Comparison on SUN397 Datasets, LDM, DiT-L/2, VAR and SRDD.

Non-curated generations further demonstrate SRDD's robustness and diversity.

Figure 9: Non-curated example images generated by the proposed SRDD approach for the MiniImagenet Dataset.

Figure 10: Non-curated example images generated by the proposed SRDD approach for the AFHQ Dataset.

Figure 11: Non-curated example images generated by the proposed SRDD approach for the FFHQ Dataset.

Zero-Shot Generalization and Editing

SRDD exhibits superior zero-shot performance in inpainting, outpainting, and super-resolution tasks, outperforming VAR in LPIPS, FID, PSNR, and SSIM metrics. The Markovian decoder requires no additional training for these tasks, underscoring its generality and robustness.

Implementation Details and Scaling Considerations

SRDD is implemented as a decoder-only Transformer with a Markovian attention mask, sharing a single VQ codebook across all scales. Training is performed with AdamW, a batch size of 224, and gradient clipping, on 4 NVIDIA A6000 GPUs for 200 epochs. The Markovian mask reduces memory cost and enables efficient scaling with model size, maintaining strong scaling behavior as parameters increase.

Limitations and Future Directions

The authors acknowledge limitations in compute budget, dataset scope, and codebook expressiveness. Future work may explore larger-scale pretraining, learned resampling policies, hybrid continuous-discrete diffusion pipelines, and the integration of advances in discrete diffusion theory. The SRDD framework provides a pathway for inheriting future breakthroughs in discrete diffusion modeling.

Conclusion

This work rigorously connects scale-wise VAR to discrete diffusion, providing both theoretical clarity and practical efficiency. The Markovian formulation eliminates architectural inefficiencies, enables principled adoption of diffusion techniques, and achieves state-of-the-art performance across multiple benchmarks. SRDD opens new directions for scalable, unified visual generation and offers a robust foundation for future research in autoregressive and diffusion-based generative modeling.