Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens (2501.07730v1)

Published 13 Jan 2025 in cs.CV

Abstract: Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.

Democratizing Text-to-Image Generative Models with Compact Tokenization

The paper addresses a significant challenge in text-to-image (T2I) generation: the training complexity and data constraints of existing models. Current models typically require extensive computational resources and access to high-quality proprietary datasets, which limits who can reproduce or build on them. To tackle this, the authors propose a Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok) and a family of masked generative models built on it, termed MaskGen.

TA-TiTok and MaskGen

The cornerstone of this research is TA-TiTok, an efficient image tokenizer that integrates textual information during the decoding stage (i.e., de-tokenization). Injecting text at this stage accelerates convergence and improves performance over prior 1-dimensional tokenizers. Notably, TA-TiTok employs a simplified one-stage training process, eliminating the complex two-stage distillation used by its predecessors and allowing it to scale seamlessly to large datasets.
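
To make the one-stage recipe concrete, below is a minimal sketch of a VQ-style tokenizer training step. The encoder, decoder, loss weights, and shapes are illustrative assumptions, not the paper's exact recipe; the points it shows are that everything is trained in a single stage and that text reaches the decoder.

```python
# Minimal sketch of a one-stage tokenizer training step (VQ-style; the
# encoder/decoder, shapes, and loss weights are illustrative assumptions,
# not the paper's exact recipe).
import torch
import torch.nn.functional as F

def tokenizer_step(encoder, decoder, codebook, images, text_emb):
    z = encoder(images)                          # (B, K, D) compact 1-D latents
    # Quantize each latent to its nearest codebook entry.
    book = codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)  # (B, V, D)
    ids = torch.cdist(z, book).argmin(-1)        # (B, K) discrete token ids
    zq = codebook(ids)                           # (B, K, D) quantized latents
    codebook_loss = F.mse_loss(zq, z.detach())   # pull codes toward encoder
    commit_loss = F.mse_loss(z, zq.detach())     # keep encoder near codes
    zq = z + (zq - z).detach()                   # straight-through estimator
    recon = decoder(zq, text_emb)                # text enters at de-tokenization
    loss = F.mse_loss(recon, images) + codebook_loss + 0.25 * commit_loss
    return loss, ids
```

In the continuous variant mentioned later, the quantization step would be replaced by a VAE-style Gaussian latent, with the rest of the single-stage loop unchanged.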

Building on this foundation, the MaskGen family comprises models that use TA-TiTok tokens for masked generative modeling. MaskGen distinguishes itself by being trained exclusively on open datasets, effectively democratizing access to capable text-to-image generation. The open-data, open-weight nature of these models may catalyze further research and development within the community, bridging the gap between high-performance models and accessible technology.
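
As a brief illustration of how a masked generative model of this kind is trained, the sketch below follows a MaskGIT-style objective: randomly mask a fraction of the discrete image tokens and train a bidirectional transformer, conditioned on the text embedding, to predict them. All module names, dimensions, and the masking schedule are assumptions for illustration, not MaskGen's implementation.

```python
# MaskGIT-style masked token modeling with text conditioning (illustrative
# sketch; module names, dimensions, and masking schedule are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedTokenModel(nn.Module):
    def __init__(self, vocab_size=4096, num_tokens=128, dim=512, text_dim=512):
        super().__init__()
        self.mask_id = vocab_size                  # extra id reserved for [MASK]
        self.tok_emb = nn.Embedding(vocab_size + 1, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.text_proj = nn.Linear(text_dim, dim)  # project the text embedding
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, text_emb):
        # tokens: (B, N) discrete image-token ids; text_emb: (B, text_dim)
        x = self.tok_emb(tokens) + self.pos_emb
        x = x + self.text_proj(text_emb).unsqueeze(1)  # broadcast text condition
        return self.head(self.backbone(x))             # (B, N, vocab) logits

def training_step(model, tokens, text_emb):
    # Mask a random per-sample fraction of tokens; predict masked positions.
    B, N = tokens.shape
    ratio = torch.rand(B, 1, device=tokens.device).clamp(0.1, 1.0)
    mask = torch.rand(B, N, device=tokens.device) < ratio
    inputs = tokens.masked_fill(mask, model.mask_id)
    logits = model(inputs, text_emb)
    return F.cross_entropy(logits[mask], tokens[mask])
```

At inference time, generation would start from all-masked tokens and iteratively fill in the most confident predictions over a small number of parallel decoding steps.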

Key Innovations and Results

Several innovative aspects highlight the paper’s contributions:

  • One-Stage Training: The model's simplified training pipeline improves efficiency and scalability, particularly advantageous for large-scale datasets.
  • Text-Aware De-tokenization: By factoring in textual input at the de-tokenization stage, the model aligns generated images more closely with textual conditions, capturing both low-level details and high-level semantics (a sketch of this conditioning follows this list).
  • Continuous VAE Extensions: Beyond the discrete tokenization variant, the paper introduces continuous Variational Autoencoder (VAE) representations to improve reconstruction quality while maintaining efficiency.
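
Below is a minimal sketch of what text-aware de-tokenization can look like: learned patch queries cross-attend to both the compact 1-D latent tokens and the caption embeddings while reconstructing the image. Names and shapes are illustrative assumptions, not the released architecture.

```python
# Text-aware de-tokenizer sketch: the decoder sees caption embeddings while
# reconstructing the image from compact 1-D tokens (assumed names/shapes).
import torch
import torch.nn as nn

class TextAwareDetokenizer(nn.Module):
    def __init__(self, dim=512, num_patches=256, patch_dim=768):
        super().__init__()
        # One learned query per output image patch.
        self.patch_queries = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.to_patch = nn.Linear(dim, patch_dim)  # e.g. a 16x16x3 pixel patch

    def forward(self, latent_tokens, text_tokens):
        # latent_tokens: (B, K, dim) compact 1-D image tokens
        # text_tokens:   (B, T, dim) caption embeddings (e.g. from a text encoder)
        B = latent_tokens.size(0)
        # Concatenating text with the latents lets every patch query attend to
        # both: this is where textual information enters de-tokenization.
        memory = torch.cat([latent_tokens, text_tokens], dim=1)
        queries = self.patch_queries.expand(B, -1, -1)
        patches = self.decoder(queries, memory)     # cross-attend to memory
        return self.to_patch(patches)               # (B, num_patches, patch_dim)
```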

Performance evaluation of MaskGen on benchmarks such as MJHQ-30K and GenEval shows quality competitive with models trained on proprietary datasets. Notably, with continuous tokens, MaskGen achieves better (lower) Fréchet Inception Distance (FID) scores, indicative of higher-fidelity image generation, while sustaining a lower computational footprint.
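
For reference, FID compares Gaussian fits of Inception-v3 features extracted from real and generated images; lower is better. A standard computation of the metric (its usual definition, not anything paper-specific) looks like this:

```python
# Standard FID: ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2}),
# computed over Inception feature statistics of the two image sets.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    # real_feats, gen_feats: (N, D) arrays of Inception-v3 features.
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # numerical noise can introduce tiny
        covmean = covmean.real        # imaginary parts; discard them
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)
```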

Implications and Future Directions

By leveraging open data and optimizing token efficiency, the research holds promise for widespread adoption in both academic and industrial settings. The proposed methodologies make state-of-the-art image generation more attainable even under resource constraints and without access to proprietary data.

Future work will likely explore scaling these models to higher resolutions and further improving convergence speed. The adoption of TA-TiTok and MaskGen, coupled with the released training code and weights, lays a foundation for more inclusive generative technology and could influence adjacent areas such as video generation.

Conclusion

The paper presents a meaningful stride toward accessible text-to-image generative models. By reducing dependency on proprietary data and heavy computational resources, it opens avenues for broader research participation and application development. The release of the MaskGen models and the underlying TA-TiTok tokenizer promises to support continued innovation in generative AI, driven by open-source principles.

Authors (7)
  1. Dongwon Kim
  2. Ju He
  3. Qihang Yu
  4. Chenglin Yang
  5. Xiaohui Shen
  6. Suha Kwak
  7. Liang-Chieh Chen