Overview of DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
The paper introduces DART, a denoising autoregressive transformer designed for scalable text-to-image generation. The model unifies autoregressive (AR) approaches with non-Markovian diffusion, a combination aimed at improving both the efficiency and the scalability of visual generation.
Key Contributions
The authors propose a model that unifies AR and diffusion within a non-Markovian framework, departing from traditional diffusion models' reliance on the Markov property. Because each step of a standard diffusion model conditions only on the immediately preceding state, the information in the rest of the generation trajectory goes unused. DART addresses this by conditioning on the full trajectory in its non-Markovian process.
Methodology
- Non-Markovian Framework: DART makes the full generative trajectory available during both training and inference, in contrast to the Markov property that restricts standard diffusion models to the current state alone.
- Autoregressive Modeling: The model employs token-level autoregressive modeling to capture dependencies between image tokens, providing improved control over image quality.
- Flow Matching: A flow-based refinement model enhances expressiveness and smooths transitions, further contributing to the model’s efficiency and flexibility.
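The difference between the two conditioning schemes can be sketched in a toy NumPy example. This is a hypothetical illustration, not DART's actual architecture: the Markovian step sees only the current noisy state, while the non-Markovian, autoregressive step conditions on the entire trajectory of earlier states (a weighted pooling stands in for a transformer attending over the history).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # flattened "image" dimensionality (toy scale)
T = 4   # number of denoising steps

def markov_denoise_step(x_t, t):
    # Markovian step: the update depends only on the current state x_t.
    return 0.9 * x_t  # placeholder denoiser

def nonmarkov_denoise_step(trajectory, t):
    # Non-Markovian step: condition on ALL previous noisy states.
    # A weighted average over the history stands in for a transformer
    # attending over the trajectory.
    history = np.stack(trajectory)               # (len(trajectory), D)
    weights = np.linspace(0.1, 1.0, len(trajectory))
    weights /= weights.sum()
    context = weights @ history                  # pooled trajectory context
    return 0.9 * trajectory[-1] + 0.1 * context

x_T = rng.standard_normal(D)                     # start from pure noise

# Markovian chain: only the current state flows forward.
x = x_T
for t in range(T):
    x = markov_denoise_step(x, t)
markov_sample = x

# Non-Markovian chain: the whole trajectory is visible at every step.
trajectory = [x_T]
for t in range(T):
    trajectory.append(nonmarkov_denoise_step(trajectory, t))
nonmarkov_sample = trajectory[-1]
```

The key design point is in the second loop: the list of past states keeps growing and is re-read at every step, which is exactly the information a Markovian sampler throws away.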
The paper's approach allows DART to handle complex, high-resolution visual tasks efficiently. Because it avoids image quantization and can train jointly on text and image data, DART achieves notable performance on standard benchmarks.
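The flow-based refinement mentioned above can be illustrated with a generic conditional flow matching objective. This is a minimal sketch under assumptions, not DART's training code: a hypothetical linear velocity field `velocity_model` is regressed onto the constant velocity of a straight-line path between a noise sample and a data sample.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8

def velocity_model(x_t, t, theta):
    # Hypothetical linear velocity field v_theta(x_t, t); a stand-in
    # for a learned refinement network.
    W, b = theta
    return x_t @ W + t * b

def flow_matching_loss(theta, x0, x1, rng):
    # Generic conditional flow matching: interpolate along the straight
    # path x_t = (1 - t) * x0 + t * x1 and regress the model's velocity
    # onto the constant target velocity x1 - x0.
    t = rng.uniform(size=(x0.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    pred = velocity_model(x_t, t, theta)
    return np.mean((pred - target) ** 2)

x0 = rng.standard_normal((32, D))   # noise samples
x1 = rng.standard_normal((32, D))   # toy stand-in for data samples
theta = (0.01 * rng.standard_normal((D, D)), np.zeros(D))
loss = flow_matching_loss(theta, x0, x1, rng)
```

Minimizing this loss over `theta` (by any gradient method) yields a velocity field that can be integrated from noise to data; the straight-path parameterization is what makes the transitions smooth.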
Results and Implications
The paper reports competitive results on class-conditioned and text-to-image generation tasks. DART achieves an FID of 3.98 on ImageNet, surpassing many existing models under constrained computational resources. This efficiency makes DART particularly valuable for generating complex scenes, where the cost-effective approach allows broader accessibility and application.
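For readers unfamiliar with the metric, FID is the Fréchet distance between Gaussian fits of feature statistics for real and generated images. The toy computation below is a simplification: it uses diagonal covariances and raw toy features, whereas real FID uses full covariances of Inception-v3 activations.

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    # Fréchet distance between two Gaussians with DIAGONAL covariances:
    # ||mu1 - mu2||^2 + sum(var1 + var2 - 2 * sqrt(var1 * var2)).
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

rng = np.random.default_rng(2)
real = rng.standard_normal((1000, 4))        # toy "real" features
fake = rng.standard_normal((1000, 4)) + 0.1  # toy "generated" features
fid = fid_diagonal(real.mean(0), real.var(0),
                   fake.mean(0), fake.var(0))
```

Lower is better: identical feature distributions give a distance of zero, and any shift in mean or spread increases it.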
The integration of AR with diffusion provides a scalable framework for high-quality image synthesis and points toward significant future developments. Given evolving architectural advancements in AR models and increased computational resources, one can envision further scaling of DART for more extensive applications such as video generation or detailed scene rendering.
Future Directions
The approach outlined in DART opens several avenues for future work.
- Scalability: Exploring more efficient architectures and enhancing long-context modeling could extend DART’s applicability to more complex tasks like video generation.
- Multi-modal Tasks: DART's ability to incorporate multi-modal generation suggests potential for expansion into comprehensive multi-modal models, benefiting applications such as enhanced neural interfaces or detailed virtual environments.
- Integration with LLMs: Further research could investigate integrating DART into larger-scale LLM pipelines to exploit the full potential of unified generative frameworks.
In conclusion, DART provides a promising hybrid approach by marrying the strengths of autoregressive and diffusion models, paving the way for more efficient, scalable, and high-quality visual generation. This work contributes significant insights into how non-Markovian models can enhance autoregressive efficiency, suggesting a broader potential for future research and application in AI-driven creativity and productivity tools.