
Autoregressive Models in Vision: A Survey (2411.05902v2)

Published 8 Nov 2024 in cs.CV and cs.CL

Abstract: Autoregressive modeling has been a huge success in the field of NLP. Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary in different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the representation strategy. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multifaceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multimodal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a Github repository to organize the papers included in this survey at: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey.

Summary

  • The paper presents a comprehensive survey of autoregressive models in vision, detailing their evolution and diverse applications.
  • It explores pixel-level, token-level, and scale-based approaches to generate visual content while addressing computational challenges.
  • The study outlines future research directions in improving tokenization, adopting continuous representations, and enhancing model architectures.

A Survey on Autoregressive Models in Vision

The paper presents a comprehensive survey on autoregressive models in computer vision, examining their evolution, methodology, and applications. Autoregressive models, which have achieved significant success in NLP, have recently emerged as a promising approach to generating high-quality visual content. This survey navigates through the diverse landscape of autoregressive models in vision, offering insights into various methodologies and applications.

Insights into Autoregressive Models

Autoregressive models generate data by predicting each element in a sequence based on preceding elements, optimizing conditional probabilities. Historically successful in NLP, they possess inherent strengths in capturing long-range dependencies and delivering contextually relevant outputs. The extension of autoregressive models to computer vision is characterized by varied sequence representation levels: pixel-level, token-level, and scale-level, reflecting the hierarchical complexity of visual data.
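The factorization described above can be sketched in a few lines: each element is drawn from a conditional distribution over the preceding elements. The model here is a hypothetical stand-in (a fixed bigram transition table), not any specific architecture from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 4

# Hypothetical conditional distribution p(x_t | x_{t-1}),
# stored as a row-stochastic table: row i = distribution over next element.
cond = rng.dirichlet(np.ones(vocab_size), size=vocab_size)

def sample_sequence(length, start=0):
    """Sample a sequence autoregressively, one element at a time."""
    seq = [start]
    for _ in range(length - 1):
        probs = cond[seq[-1]]                      # condition on the prefix
        seq.append(int(rng.choice(vocab_size, p=probs)))
    return seq

seq = sample_sequence(8)
```

Real vision models replace the bigram table with a deep network conditioned on the full prefix, but the sampling loop has the same shape.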

  1. Pixel-based Models: These models treat an image as a flat sequence of pixels, predicting each pixel value from the pixels that precede it. While pioneering efforts like PixelRNN achieved notable success, they face computational challenges because the sequence length grows with image resolution, making generation slow and costly.
  2. Token-based Models: Inspired by NLP, these models compress images into discrete tokens via vector quantization. Models like VQ-VAE and its successors employ a discrete latent space for autoregressive generation, enabling more efficient processing of high-resolution content.
  3. Scale-based Models: VAR adopts a hierarchical, coarse-to-fine approach, predicting images across multiple scales rather than through a single-level raster scan. This method improves spatial handling and enhances computational efficiency.
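The vector-quantization step behind token-based models can be illustrated with a nearest-neighbor codebook lookup: each continuous encoder feature is replaced by the index of its closest codebook entry. This is a minimal sketch of the quantization idea used by VQ-VAE-style models; the codebook size and feature dimension are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))        # 16 code vectors of dimension 8

def quantize(features):
    """Map continuous features (N, 8) to discrete token ids (N,)
    by nearest codebook entry under squared Euclidean distance."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

tokens = quantize(rng.normal(size=(5, 8)))
```

The resulting token ids form the discrete sequence that an autoregressive model is then trained to predict; a decoder maps the selected code vectors back to pixels.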

Applications Across Domains

Autoregressive models have broad applicability across diverse tasks in computer vision:

  • Image Generation: Encompassing pixel-wise, token-wise, and scale-wise methodologies, these models excel in generating coherent, diverse images by sequentially predicting image components.
  • Video Generation: Building on spatial models, autoregressive techniques extend to temporal sequences, facilitating unconditional and conditional video content generation. Models like MoCoGAN and VideoGPT capture temporal dynamics more effectively.
  • 3D and Multimodal Generation: Beyond 2D, these models address 3D scene generation, motion prediction, and cross-modal tasks, including image-text alignment and multimodal inputs.
  • Emerging Domains: Particular attention is given to applications in embodied AI and medical imaging, where autoregressive models show promise in enhancing navigational and analytical capabilities.
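One way token-based video models extend spatial generation to the temporal axis is by linearizing a spatio-temporal token grid into a single sequence, so each token is conditioned on all tokens from earlier frames plus earlier positions in the current frame. The grid sizes below are illustrative assumptions, not taken from any specific model in the survey.

```python
import numpy as np

T, H, W = 4, 8, 8                          # frames, height, width (in tokens)
grid = np.arange(T * H * W).reshape(T, H, W)

# Raster order: frame by frame, row by row within each frame.
sequence = grid.reshape(-1)

def context_length(t, h, w):
    """Number of tokens preceding position (t, h, w) in the flattened order."""
    return t * H * W + h * W + w
```

Under this ordering, the first token of frame `t` already sees every token of frames `0..t-1` as context, which is how the autoregressive factorization captures temporal dependencies.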

Challenges and Future Directions

This survey identifies several future avenues for advancing autoregressive models:

  • Tokenization and Representation: Designing powerful tokenizers that efficiently compress visual data is crucial. Future research may extend beyond traditional vector quantization techniques to explore more sophisticated methods that enhance model performance and scalability.
  • Discrete vs. Continuous Representation: While autoregressive models traditionally employ discrete tokens, exploring continuous models may offer improvements in adaptability and integration with multimodal systems.
  • Model Architectures: Exploring architectures imbued with inductive biases that cater to the spatial complexities of vision tasks could complement the vanilla LLM frameworks currently in use.
  • Downstream Applications: Scalability in downstream tasks remains limited. Developing versatile autoregressive frameworks analogous to diffusion models may catalyze broader adaptability and application.

Conclusion

This exhaustive examination underscores the transformative potential of autoregressive vision models, bridging NLP success to visual domains. It highlights their methodological diversity and extensive applications while pointing towards challenges and future research directions essential for unlocking their full capability. As this field continues to evolve, contributions such as this are invaluable in guiding the trajectory of research and application in vision-based autoregressive modeling.