DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer (2507.04947v1)

Published 7 Jul 2025 in cs.CV and cs.AI

Abstract: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.

Summary

The paper presents a novel masked autoregressive image generation method using a Deep Compression Hybrid Tokenizer that achieves a 32× spatial compression ratio.
It employs a three-stage adaptation training strategy that leverages both discrete and continuous token paths for robust and high-fidelity image reconstruction.
Experimental results demonstrate 1.5-7.9× higher throughput and 2.0-3.5× lower latency, validating its efficiency and superior performance on benchmark datasets.

Efficient Masked Autoregressive Image Generation with DC-AR

DC-AR is a masked autoregressive (AR) framework for text-to-image generation. Prior masked AR models have lagged behind diffusion models in quality or efficiency, largely because of tokenizer limitations. DC-AR addresses this with a hybrid tokenization strategy aimed at both efficient generation and high-fidelity image reconstruction.

Introduction to DC-AR

DC-AR is built on the Deep Compression Hybrid Tokenizer (DC-HT), which achieves a 32× spatial compression ratio while maintaining high reconstruction fidelity and the ability to generalize across resolutions. DC-HT is trained with a three-stage adaptation strategy, and the resulting tokenizer lets DC-AR address the throughput and latency limitations of earlier masked AR models.
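To make the effect of the compression ratio concrete, the sketch below counts how many latent tokens a tokenizer produces for a given image size. The 512×512 resolution and the 8× and 16× ratios are illustrative assumptions chosen for comparison, and the helper names are hypothetical; only the 32× ratio comes from the paper.

```python
from __future__ import annotations


def token_grid(height: int, width: int, compression: int) -> tuple[int, int]:
    """Return the latent grid (rows, cols) for a spatial compression ratio."""
    if height % compression or width % compression:
        raise ValueError("image size must be divisible by the compression ratio")
    return height // compression, width // compression


def token_count(height: int, width: int, compression: int) -> int:
    """Number of latent positions the generator has to predict."""
    rows, cols = token_grid(height, width, compression)
    return rows * cols


if __name__ == "__main__":
    # Illustrative comparison at an assumed 512x512 resolution.
    for ratio in (8, 16, 32):
        print(ratio, token_grid(512, 512, ratio), token_count(512, 512, ratio))
```

Because the ratio applies to each spatial axis, doubling it from 16× to 32× cuts the token count by a factor of four, which is where much of the sequence-length saving comes from.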

In comparison to leading autoregressive and diffusion models, DC-AR achieves 1.5-7.9× higher throughput and 2.0-3.5× lower latency, setting new benchmarks in computational efficiency. This is important for practical applications where resource constraints are a critical factor.

Figure 1: Qualitative Comparison of Text-to-Image Generation Results Between DC-AR and Other Generative Models.

Methodology

Deep Compression Hybrid Tokenizer

DC-HT is pivotal to DC-AR's performance. It not only reduces the spatial dimensions effectively (32× compression) but also supports a hybrid tokenization framework. This framework accommodates both discrete and continuous token paths, employed during different stages of image reconstruction. The discrete path benefits the generation of broad structural elements, while the continuous path refines these elements, ensuring high fidelity.
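The split between a discrete path and a continuous refinement can be illustrated with a toy quantizer. This is a minimal sketch, not the DC-HT implementation: it assumes nearest-neighbour lookup in a small codebook and defines the residual as the continuous latent minus its quantized version. All function names and the example codebook are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(z: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to z (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))


def hybrid_tokenize(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Split each continuous latent into a discrete index and a residual."""
    indices: List[int] = []
    residuals: List[List[float]] = []
    for z in latents:
        idx = nearest_code(z, codebook)
        indices.append(idx)
        # Residual: what the discrete token fails to capture (assumed definition).
        residuals.append([zi - qi for zi, qi in zip(z, codebook[idx])])
    return indices, residuals


def reconstruct_latents(
    indices: Sequence[int],
    residuals: Sequence[Vector],
    codebook: Sequence[Vector],
) -> List[List[float]]:
    """Add each residual back onto its quantized latent."""
    return [
        [q + r for q, r in zip(codebook[i], res)]
        for i, res in zip(indices, residuals)
    ]
```

In this toy setting the discrete indices alone give a coarse approximation of each latent, and adding the residuals recovers the original continuous values, which mirrors the structure-then-refinement division described above.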

DC-HT is trained with a three-stage adaptation strategy. First, a continuous warm-up stage establishes an initial baseline. Second, a discrete learning stage focuses on learning a stable discrete latent space. Finally, an alternate fine-tuning stage ensures that the decoder can interpret both discrete and continuous inputs.
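One way to picture the three stages is as a schedule that decides which latent path feeds the decoder on each training step. The sketch below is an assumption-laden illustration: the stage names, the `StageConfig` type, and the uniform random choice between paths in the final stage are hypothetical and are not taken from the paper's code.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageConfig:
    name: str
    latent_paths: Tuple[str, ...]  # paths the decoder may see in this stage


# Hypothetical stage table mirroring the three stages described above.
STAGES = (
    StageConfig("continuous_warmup", ("continuous",)),
    StageConfig("discrete_learning", ("discrete",)),
    StageConfig("alternate_finetune", ("continuous", "discrete")),
)


def pick_path(stage: StageConfig, rng: random.Random) -> str:
    """Choose the latent path for one training step.

    Assumption: when a stage allows both paths, they are sampled uniformly.
    """
    return rng.choice(stage.latent_paths)


if __name__ == "__main__":
    rng = random.Random(0)
    for stage in STAGES:
        print(stage.name, [pick_path(stage, rng) for _ in range(6)])
```

Alternating the path in the last stage is what exposes a single decoder to both kinds of input, so that it can decode discrete-only latents as well as refined continuous ones.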

Hybrid Masked Autoregressive Model

DC-AR pairs DC-HT with a hybrid masked autoregressive generation process. Training combines a mask-prediction objective with a diffusion loss: discrete tokens capture the structural layout, and residual tokens add fine detail. At inference, DC-AR uses progressive unmasking and delivers its best results with only 12 steps, a marked efficiency improvement over models based purely on continuous tokens.
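Progressive unmasking can be sketched as a schedule that says how many tokens are revealed at each step. The example below assumes the cosine schedule popularized by MaskGIT and a 16×16 token grid (256 tokens); the summary does not state DC-AR's exact schedule, so treat both as assumptions. The function names are hypothetical.

```python
import math
from typing import List


def masked_after_step(step: int, total_steps: int, num_tokens: int) -> int:
    """Tokens still masked after `step` (1-indexed) under a cosine schedule."""
    if step >= total_steps:
        return 0
    ratio = math.cos(math.pi / 2 * step / total_steps)
    return math.floor(ratio * num_tokens)


def unmask_plan(total_steps: int, num_tokens: int) -> List[int]:
    """Number of tokens revealed at each step; the entries sum to num_tokens."""
    plan: List[int] = []
    masked = num_tokens
    for step in range(1, total_steps + 1):
        remaining = masked_after_step(step, total_steps, num_tokens)
        # Reveal at least one token per step while any are still masked.
        remaining = min(remaining, max(masked - 1, 0))
        plan.append(masked - remaining)
        masked = remaining
    return plan


if __name__ == "__main__":
    print(unmask_plan(total_steps=12, num_tokens=256))
```

Under this schedule only a few tokens are committed in the early steps, when the model has little context, and progressively more are revealed as the image structure becomes established.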

Experimental Evaluation

On MJHQ-30K, DC-AR achieves a state-of-the-art generation FID (gFID) of 5.49, and on GenEval, which measures prompt alignment, it reaches an overall score of 0.69. These results support the effectiveness of the proposed hybrid tokenization and generation framework.
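For readers unfamiliar with the metric, FID is the Fréchet distance between two Gaussians fitted to Inception features of a reference image set and a generated image set; gFID applies it to generated samples, and lower is better. The symbols below are the standard ones and are not taken from the paper: \mu_r, \Sigma_r are the feature mean and covariance of the reference images, and \mu_g, \Sigma_g those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```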

Advantage Over Other Models

DC-AR's combination of discrete and continuous tokens sets it apart from other frameworks. It maintains high image quality at significantly reduced latency, which is crucial for real-world deployment where speed matters as much as quality. This efficiency does not compromise the model's ability to generalize across image resolutions, an attribute that matters for scalable and adaptable applications.

Conclusion

DC-AR advances autoregressive image generation by combining a high-compression tokenizer with a hybrid autoregressive framework that draws on the strengths of both discrete and continuous tokens. The result is a model that pairs computational efficiency with high-quality output and offers practical benefits for efficient text-to-image synthesis. The work also suggests that similar hybrid strategies could be integrated or extended in related domains.

Authors (10)
