Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models (2504.17789v2)

Published 24 Apr 2025 in cs.CV

Abstract: Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than diffusion-based models. A primary limitation is the substantial number of image tokens required by AR models, which constrains both training and inference efficiency as well as achievable image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal LLMs (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the input token count, and token-unshuffle, which untangles the inferred tokens after the Transformer blocks to restore the spatial arrangement for output. Trained jointly with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis via unified next-token prediction while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048×2048 with gratifying generation performance. On the GenAI-Bench benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text alignment, visual flaws, and visual appearance. We hope Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.

Summary

Token-Shuffle: Enhancing High-Resolution Image Generation with Autoregressive Models

The paper "Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models" presents a methodological advancement to address inherent limitations in generating high-resolution images using autoregressive models (ARMs), specifically within the Multimodal LLMs (MLLMs) framework. The research acknowledges the current dominance of diffusion-based models in high-resolution image generation and seeks to bring parity through an innovative approach targeting inefficiencies in AR models.

Overview

Autoregressive models, renowned for their success in natural language processing, generate data by predicting the next token in a sequence. Extending ARMs to image generation runs into significant hurdles, predominantly the large number of tokens required to represent high-resolution images. This challenge is amplified by the quadratic cost of self-attention in sequence length and the correspondingly high inference cost of processing long token sequences.
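To make the scaling concrete, here is a back-of-the-envelope calculation assuming a VQ-style tokenizer with a 16× spatial downsampling factor (a common choice for image tokenizers; the paper's exact tokenizer settings may differ):

```python
# Rough token-count arithmetic for AR image generation, assuming a
# hypothetical tokenizer that downsamples each spatial side by 16x.
DOWNSAMPLE = 16

for res in (512, 1024, 2048):
    side = res // DOWNSAMPLE      # tokens per image side
    n = side * side               # total image tokens in the sequence
    # Self-attention cost grows roughly as n^2 in the token count.
    print(f"{res}x{res} image -> {n:,} tokens (relative attention cost ~{n * n:,})")
```

Under this assumption, a 2048×2048 image already yields 16,384 tokens, which is why reducing the token count attacks the bottleneck directly.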

Token-Shuffle mitigates this computational bottleneck by introducing a token processing method that leverages the dimensional redundancy of visual vocabularies inherent in MLLMs. The method involves two operations: token-shuffle and token-unshuffle. Token-shuffle merges spatially local visual tokens along the channel dimension into fewer fused tokens, reducing the input sequence length for the Transformer. Token-unshuffle then disentangles the fused tokens after the Transformer blocks to restore the spatial arrangement required for output.
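A minimal PyTorch sketch of the two rearrangement operations, assuming visual tokens laid out row-major on an h×w grid and an s×s shuffle window (the actual method presumably pairs these reshapes with learned projections to fuse and expand channels; only the spatial bookkeeping is shown here):

```python
import torch

def token_shuffle(x: torch.Tensor, h: int, w: int, s: int = 2) -> torch.Tensor:
    """Merge each s x s window of tokens along the channel dimension.

    x: (B, h*w, d) visual tokens in row-major grid order.
    Returns (B, (h//s)*(w//s), d*s*s), i.e. s*s fewer, fatter tokens.
    """
    b, n, d = x.shape
    assert n == h * w and h % s == 0 and w % s == 0
    x = x.reshape(b, h // s, s, w // s, s, d)  # carve the grid into s x s windows
    x = x.permute(0, 1, 3, 2, 4, 5)            # (B, h/s, w/s, s, s, d)
    return x.reshape(b, (h // s) * (w // s), d * s * s)

def token_unshuffle(x: torch.Tensor, h: int, w: int, s: int = 2) -> torch.Tensor:
    """Inverse of token_shuffle: restore the original h x w token grid."""
    b, _, ds2 = x.shape
    d = ds2 // (s * s)
    x = x.reshape(b, h // s, w // s, s, s, d)
    x = x.permute(0, 1, 3, 2, 4, 5)            # (B, h/s, s, w/s, s, d)
    return x.reshape(b, h * w, d)

# Round-trip sanity check on a 32 x 32 token grid:
tokens = torch.randn(1, 32 * 32, 8)
fused = token_shuffle(tokens, 32, 32, s=2)     # (1, 256, 32)
assert torch.equal(token_unshuffle(fused, 32, 32, s=2), tokens)
```

With s = 2 the Transformer processes 4× fewer tokens, and the quadratic attention cost drops accordingly.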

Numerical Results and Performance

Implementing Token-Shuffle in a 2.7B-parameter AR model yields significant efficiency gains alongside strong generation quality. The method synthesizes images at a resolution of 2048×2048, a feat previously attainable only by diffusion models. In benchmark testing, the approach achieves an overall score of 0.77 on the hard prompts of GenAI-Bench, surpassing autoregressive models such as LlamaGen by 0.18 and diffusion models such as Latent Diffusion Models (LDM) by 0.15.

Implications and Future Directions

The introduction of Token-Shuffle could serve as a foundational design for efficient high-resolution image generation in MLLMs. The model retains the next-token prediction paradigm yet generates each local group of tokens jointly through the shuffled structure, reducing computation costs significantly while preserving the fidelity of generated images.
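Based on the summary's description, the inference-step accounting would look roughly as follows, where the s×s shuffle window is an assumed parameter: each autoregressive step predicts one fused token, which unshuffles into s² image tokens.

```python
# Illustrative step counts, assuming an s x s shuffle window: one
# next-token prediction step emits one fused token, which unshuffles
# into s*s image tokens, so the number of generation steps drops by s*s.
def ar_steps(h_tokens: int, w_tokens: int, s: int) -> int:
    return (h_tokens // s) * (w_tokens // s)

for s in (1, 2, 4):
    print(f"s={s}: {ar_steps(128, 128, s):>6,} steps for a 128 x 128 token grid")
```

So a 4×4 window would cut a 16,384-step generation to 1,024 steps under these assumptions.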

With the demonstrated efficacy of Token-Shuffle in elevating the performance of autoregressive models for image synthesis, future research could explore scaling this method to larger LLM architectures (e.g., 7B or 30B models) to further enhance image generation capabilities. Other areas worth investigating include supporting flexible image resolutions and diverse aspect ratios, similar to approaches being explored with diffusion models. Addressing visual flaws through improved global information capture in the autoregressive framework remains a critical area for potential advancement.

In summary, Token-Shuffle represents a meaningful contribution to bridging the resolution and efficiency gap between autoregressive models and diffusion models in image generation, potentially setting the stage for future innovations in MLLM applications.
