GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
The paper addresses a critical challenge in the development of large-scale visual tokenizers for autoregressive image generation: enhancing the model capacity of visual tokenizers typically improves image reconstruction fidelity but degrades downstream generation quality. This dilemma highlights the delicate balance between reconstruction accuracy and the complexity of the latent token space, which the existing literature has not effectively resolved.
Challenges and Innovative Approaches
The primary challenge identified by the authors is the increasing complexity of the latent space as visual tokenizers are scaled, which complicates the learning process for subsequent autoregressive (AR) models. To counteract this issue, the paper introduces GigaTok, a 3 billion parameter visual tokenizer. The core innovation lies in semantic regularization, which aligns the tokenizer's features with semantically meaningful features from a pre-trained visual encoder like DINOv2. This regularization strategy acts as a constraint, limiting the complexity of the latent space and thereby facilitating improvements in both reconstruction and downstream generation tasks.
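The alignment idea can be illustrated with a minimal sketch. Assuming the regularizer compares intermediate tokenizer features against frozen DINOv2 features via cosine similarity (the function name, feature shapes, and the implicit assumption that both feature sets have been projected to a common dimension are hypothetical, not taken from the paper):

```python
import numpy as np

def semantic_alignment_loss(tokenizer_feats: np.ndarray,
                            dino_feats: np.ndarray) -> float:
    """Cosine-distance regularizer aligning tokenizer features with
    features from a frozen pre-trained encoder such as DINOv2.

    Both inputs are assumed to be [num_tokens, dim] arrays already
    projected to the same dimensionality (a hypothetical setup).
    """
    # L2-normalize each feature vector along the channel dimension.
    t = tokenizer_feats / np.linalg.norm(tokenizer_feats, axis=-1, keepdims=True)
    d = dino_feats / np.linalg.norm(dino_feats, axis=-1, keepdims=True)
    # 1 minus the mean cosine similarity: 0 when features align perfectly,
    # approaching 2 when they point in opposite directions.
    return float(1.0 - np.mean(np.sum(t * d, axis=-1)))
```

Because the DINOv2 features are held fixed, minimizing this term pulls the tokenizer's latent space toward semantically structured directions rather than letting it drift toward arbitrarily complex encodings.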
Key Strategies for Scaling
To effectively scale visual tokenizers, the authors propose several key strategies:
- 1D Tokenizers: Utilizing 1D tokenizers rather than 2D structures for improved scalability.
- Asymmetric Scaling: Prioritizing the scaling of the decoder over the encoder, which is deemed more effective due to the decoder's pivotal role in generating detailed reconstructions.
- Entropy Loss: Applying entropy loss to stabilize the training of billion-scale tokenizers, thereby achieving high codebook utilization and enabling convergence for very large models.
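The entropy-loss idea can be sketched as follows. This is a generic illustration of entropy regularization for codebook utilization, not the paper's exact formulation; the function name and the soft-assignment input shape are assumptions:

```python
import numpy as np

def codebook_usage_entropy_loss(assign_probs: np.ndarray) -> float:
    """Entropy regularizer that encourages uniform codebook usage.

    assign_probs: [batch, codebook_size] soft assignment probabilities
    over codebook entries (a hypothetical representation). Penalizing
    low entropy of the batch-averaged usage distribution discourages
    codebook collapse, where only a few codes are ever selected.
    """
    # Average usage probability of each code across the batch.
    avg_usage = assign_probs.mean(axis=0)
    avg_usage = np.clip(avg_usage, 1e-12, 1.0)  # guard against log(0)
    # Shannon entropy (in nats) of the average usage distribution.
    entropy = -np.sum(avg_usage * np.log(avg_usage))
    # Entropy is maximized (log K) when all K codes are used uniformly,
    # so the loss reaches 0 exactly at full, even utilization.
    max_entropy = np.log(assign_probs.shape[1])
    return float(max_entropy - entropy)
```

Under this sketch, a collapsed codebook (all mass on one code) incurs the maximum penalty of log K, while perfectly uniform usage incurs zero, which matches the stated goal of keeping codebook utilization high as the tokenizer scales.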
These methodological innovations allow GigaTok to achieve state-of-the-art performance across several metrics, including reconstruction quality, downstream AR generation, and even representation learning as indicated by linear probing accuracy.
Implications and Future Directions
The implications of this research extend both practically and theoretically in the domain of autoregressive image generation. Practically, the ability to achieve high fidelity in image reconstruction while maintaining or improving generation quality opens new avenues for deploying AR models in real-world applications, such as image editing or visual content generation, where high-resolution outputs are crucial.
Theoretically, the findings underscore the importance of balancing latent-space complexity against semantic consistency, particularly as models scale. This may inspire further research into how semantic alignment techniques can be leveraged in other domains involving generative models or multimodal tasks.
Additionally, the scalability of the tokenizer discussed in this paper indicates potential paths forward in unifying multimodal understanding and generation tasks. The improved representation quality observed in downstream AR models trained with GigaTok hints at significant crossover potential for tasks that require nuanced understanding and generative capabilities spanning text, images, and potentially other modalities.
Conclusion
The paper's contributions are significant in that they address a long-standing challenge in scaling visual tokenizers for AR image generation, proposing concrete strategies with practical benefits. As the field continues to grapple with scaling model architectures efficiently, insights from this research offer a tangible path toward balancing performance metrics across diverse application areas. Future work could further elucidate the dynamics between model scalability, semantic regularization, and the generative capacity of large-scale neural networks.