Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (2404.02905v1)

Published 3 Apr 2024 in cs.CV and cs.AI

Abstract: We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves the AR baseline, lowering the Fréchet inception distance (FID) from 18.65 to 1.80 and raising the inception score (IS) from 80.4 to 356.4, with around 20x faster inference. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions, including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks, including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.

Exploring Next-Scale Prediction for Scalable Image Generation with Visual AutoRegressive Modeling

Introduction to VAR

Recent advances in autoregressive (AR) models have significantly propelled natural language processing and computer vision forward. However, the traditional approach to applying AR models to images, which relies on raster-scan next-token prediction, limits both efficiency and generation quality. The paper "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" introduces a new paradigm, Visual Autoregressive (VAR) modeling, that reframes image-based AR modeling as coarse-to-fine, next-scale prediction. Key findings show that VAR models are not only more efficient but also generate higher-quality images than existing AR models and diffusion transformers.

VAR Methodology

The VAR framework pivots from conventional pixel-wise or token-wise prediction to a scale-wise (resolution-wise) prediction process. It first decomposes an image into multiple, increasingly finer scales, then sequentially generates each scale's content conditioned on all coarser scales. This hierarchical formulation aligns more closely with how natural images are formed and perceived.

  1. Tokenization and Quantization: A multi-scale quantized autoencoder converts images into a hierarchy of scale token maps, using a codebook shared across scales to keep the vocabulary consistent.
  2. Next-Scale Prediction Model: A VAR transformer, a decoder-only architecture akin to GPT-2 but equipped with Adaptive Normalization (AdaLN) for the visual domain, models the conditional distribution of finer-scale tokens given all coarser ones, generating the tokens within each scale in parallel (a minimal sketch of both stages follows this list).
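
To make these two stages concrete, here is a minimal, self-contained PyTorch sketch of multi-scale quantization and coarse-to-fine generation. The dimensions, the `quantize_to_scales` helper, and the placeholder `transformer` callable are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of VAR's two stages (toy sizes; illustrative only).
import torch
import torch.nn.functional as F

V, D = 512, 16                   # codebook size and latent dim (toy values)
codebook = torch.randn(V, D)     # one codebook shared across all scales
scales = [1, 2, 4, 8]            # token-map side lengths, coarse to fine

def quantize_to_scales(feat):
    """Encode a latent feature map (D, H, W) into per-scale token maps.

    Residual-style multi-scale quantization: at each scale, quantize the
    downsampled residual against the shared codebook, then subtract the
    upsampled reconstruction before moving to the next, finer scale.
    """
    residual = feat.unsqueeze(0)                               # (1, D, H, W)
    token_maps = []
    for s in scales:
        z = F.interpolate(residual, size=(s, s), mode="area")
        flat = z.permute(0, 2, 3, 1).reshape(-1, D)            # (s*s, D)
        idx = torch.cdist(flat, codebook).argmin(dim=1)        # nearest code
        token_maps.append(idx.view(s, s))
        zq = codebook[idx].view(1, s, s, D).permute(0, 3, 1, 2)
        residual = residual - F.interpolate(
            zq, size=feat.shape[-2:], mode="bilinear", align_corners=False)
    return token_maps

def generate(transformer):
    """Coarse-to-fine sampling: each scale's tokens are drawn in parallel,
    conditioned on all coarser scales produced so far. `transformer` is a
    hypothetical callable returning logits of shape (s*s, V); it stands in
    for the paper's GPT-2-style decoder with AdaLN."""
    history = []
    for s in scales:
        logits = transformer(history, s)
        idx = torch.multinomial(torch.softmax(logits, -1), 1).squeeze(-1)
        history.append(idx.view(s, s))
    return history

maps = quantize_to_scales(torch.randn(D, 8, 8))
print([tuple(m.shape) for m in maps])   # [(1, 1), (2, 2), (4, 4), (8, 8)]
```

Because all tokens within a scale are sampled jointly, the number of sequential decoding steps grows with the number of scales rather than with the number of tokens, which is where the paper's roughly 20x inference speedup comes from.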

Empirical Validation

Performance Benchmarking

On the ImageNet 256×256 and 512×512 benchmarks, VAR significantly outperforms baseline AR models and diffusion transformers in image quality, as evidenced by improved Fréchet Inception Distance (FID) and Inception Score (IS), as well as in inference speed. Particularly noteworthy is the inference speedup: up to 20 times faster than conventional AR models without compromising generative quality.
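
For reference, both metrics are standard and not specific to this paper: FID measures the distance between Gaussian fits to Inception-v3 features of real and generated images (lower is better), while IS rewards confident yet diverse class predictions (higher is better):

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right),
\qquad
\mathrm{IS} = \exp\!\left( \mathbb{E}_x \left[ \mathrm{KL}\!\left( p(y \mid x) \,\Vert\, p(y) \right) \right] \right)
```

Here (μ_r, Σ_r) and (μ_g, Σ_g) are the feature means and covariances of real and generated images, respectively.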

Scalability and Generalizability

  • Scaling up VAR models reveals clear power-law scaling laws: performance improves predictably with model size (a fitting sketch follows this list). This scaling behavior mirrors the desirable properties of LLMs and suggests headroom for even larger VAR models.
  • VAR's adaptability is further highlighted by its zero-shot generalization. The model handles downstream tasks such as image in-painting, out-painting, and editing without task-specific tuning, a promising sign for AR models across diverse visual generative tasks.
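
As a sketch of what a linear correlation coefficient near -0.998 means in practice, one can fit the scaling relation in log-log space and read off the Pearson coefficient. The model sizes and losses below are made-up placeholders, not the paper's measurements:

```python
# Fit loss ~ c * N^alpha by linear regression in log-log space.
import numpy as np

params = np.array([3.1e8, 6.0e8, 1.0e9, 2.0e9])   # hypothetical model sizes N
loss   = np.array([3.80, 3.52, 3.31, 3.10])       # hypothetical test losses

x, y = np.log(params), np.log(loss)
alpha, log_c = np.polyfit(x, y, 1)                # slope = power-law exponent
r = np.corrcoef(x, y)[0, 1]                       # near -1 => tight power law
print(f"loss ~ {np.exp(log_c):.3g} * N^{alpha:.3f}, Pearson r = {r:.4f}")
```

A coefficient this close to -1 indicates that the log-log relationship between compute/parameters and test loss is almost exactly linear, i.e., a tight power law.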

Discussion and Future Work

The VAR framework proposes a significant shift in how AR models are conceptualized and implemented for image generation tasks, addressing core inefficiencies and scaling limitations of prior approaches. By efficiently leveraging hierarchical, multi-scale representations of images, VAR not only improves generative performance but also opens avenues for further explorations into more complex and large-scale visual generation tasks.

Future work will explore the integration of VAR with text-prompted generation tasks and its extension to video generation, capitalizing on its scalability and efficiency. The remarkable initial results achieved by VAR underscore its potential as a cornerstone for next-generation generative models in the AI domain.

Conclusion

"Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" presents a groundbreaking approach to autoregressive image generation that surpasses existing methods in efficiency, effectiveness, and scalability. The VAR model's adeptness at generating high-quality images at accelerated speeds, its adherence to power-law scaling laws, and its zero-shot generalization capabilities across various tasks mark a significant advancement in the use of AR models for complex image generation challenges. This research opens new pathways for leveraging the power of autoregressive models in the visual domain and sets a foundation for future explorations in multi-modal artificial intelligence.

Authors (5)
  1. Keyu Tian
  2. Yi Jiang
  3. Zehuan Yuan
  4. Bingyue Peng
  5. Liwei Wang