Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (2404.02905v2)
Abstract: We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions quickly and generalize well: VAR, for the first time, makes GPT-like AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves on the AR baseline, improving the Fréchet inception distance (FID) from 18.65 to 1.73 and the inception score (IS) from 80.4 to 350.2, with around 20x faster inference. We also verify empirically that VAR outperforms the Diffusion Transformer (DiT) on multiple dimensions, including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization in downstream tasks including image in-painting, out-painting, and editing. These results suggest that VAR has begun to emulate two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.
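To make the contrast with raster-scan next-token prediction concrete, the following is a minimal toy sketch of the coarse-to-fine "next-scale prediction" loop described above: one token map is sampled per scale, and all tokens of a scale are predicted in parallel conditioned on every coarser scale already generated. The scale schedule, codebook size, and stand-in "transformer" (a mean-pooled context plus a linear head) are hypothetical placeholders for illustration, not the released VAR implementation.

```python
# Toy sketch of coarse-to-fine next-scale prediction (NOT the released VAR code).
# Scale schedule, vocabulary size, and the "model" below are hypothetical stand-ins.
import torch
import torch.nn.functional as F

VOCAB = 4096                      # assumed codebook size of the visual tokenizer
SCALES = [1, 2, 4, 8, 16]         # assumed token-map side lengths, coarse -> fine
DIM = 64

embed = torch.nn.Embedding(VOCAB, DIM)   # token embedding (stand-in)
head = torch.nn.Linear(DIM, VOCAB)       # prediction head (stand-in for a transformer)

@torch.no_grad()
def next_scale_generate():
    ctx = torch.zeros(1, 1, DIM)         # condition / start embedding
    token_maps = []
    for side in SCALES:
        # Predict all side*side tokens of the current scale in one parallel step,
        # conditioned on every coarser scale generated so far (here: pooled context).
        h = ctx.mean(dim=1, keepdim=True).expand(1, side * side, DIM)
        probs = F.softmax(head(h), dim=-1)                     # (1, side*side, VOCAB)
        tokens = torch.multinomial(probs.view(-1, VOCAB), 1).view(1, side, side)
        token_maps.append(tokens)
        ctx = torch.cat([ctx, embed(tokens.view(1, -1))], dim=1)  # feed back as context
    return token_maps                     # multi-scale maps would go to a VQ decoder

maps = next_scale_generate()
print([tuple(m.shape) for m in maps])     # [(1,1,1), (1,2,2), (1,4,4), (1,8,8), (1,16,16)]
```

In the actual method, each step emits an entire higher-resolution token map rather than a single token, which is why inference needs far fewer autoregressive steps than raster-scan next-token prediction over the same final grid.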