Overview of "Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis"
The paper "Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis" presents a novel approach to text-to-image synthesis using a non-autoregressive masked image modeling (MIM) methodology. The authors introduce Meissonic, a model that aims to bridge the efficiency and performance gap between traditional diffusion models and autoregressive methods by leveraging architectural innovations and optimized data handling strategies.
Key Contributions
The authors address limitations of existing MIM models, such as inefficiency and degraded quality at high resolutions, through several innovations:
- Enhanced Transformer Architecture: The backbone combines multi-modal and single-modal transformer layers, improving the learning efficiency and performance of MIM. The multi-modal layers capture cross-modal interactions between language and vision, while the single-modal layers refine the visual representation (see the backbone sketch after this list).
- Advanced Positional Encoding: Meissonic incorporates Rotary Position Embedding (RoPE) to encode positional information, addressing the context-disassociation issues that afflict transformers operating over large token contexts (a minimal RoPE sketch follows the list).
- Masking Rate as a Sampling Condition: The masking rate is encoded as a discrete conditioning signal, letting the model adapt its predictions across sampling stages and markedly improving the retention of image detail (see the sampling-condition sketch below).
- High-Quality Data and Micro-Conditioning: The model is trained on high-quality data with precise captions and conditioned on micro-conditions such as the original image resolution, crop coordinates, and human preference scores, which enhances stability and fidelity (a micro-conditioning sketch appears below).
- Feature Compression Layers: These layers reduce the number of tokens the transformer must process, enabling the generation of high-resolution images up to 1024 × 1024 pixels without large-scale compute (a compression sketch closes the examples below).
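The backbone sketch referenced in the first bullet, assuming hypothetical `MultiModalBlock`/`SingleModalBlock` modules with placeholder layer counts and widths; Meissonic's real blocks include further components (conditioning pathways, for one) that are omitted here:

```python
import torch
import torch.nn as nn

class SingleModalBlock(nn.Module):
    """Standard transformer block over image tokens only (illustrative)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class MultiModalBlock(SingleModalBlock):
    """Joint attention over concatenated text and image tokens (illustrative)."""
    def forward(self, img, txt):
        n_txt = txt.shape[1]
        x = super().forward(torch.cat([txt, img], dim=1))  # cross-modal interaction
        return x[:, n_txt:], x[:, :n_txt]                  # split the streams back apart

class Backbone(nn.Module):
    """Multi-modal layers first (language-vision fusion), then single-modal refinement."""
    def __init__(self, dim=768, n_multi=4, n_single=8):
        super().__init__()
        self.multi = nn.ModuleList(MultiModalBlock(dim) for _ in range(n_multi))
        self.single = nn.ModuleList(SingleModalBlock(dim) for _ in range(n_single))

    def forward(self, img_tokens, text_tokens):
        for blk in self.multi:
            img_tokens, text_tokens = blk(img_tokens, text_tokens)
        for blk in self.single:
            img_tokens = blk(img_tokens)
        return img_tokens
```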
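For the positional-encoding bullet, a minimal 1-D RoPE implementation; Meissonic applies RoPE to 2-D image-token grids, which this simplification does not attempt to reproduce:

```python
import math
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (batch, seq, dim)."""
    _, seq, dim = x.shape
    half = dim // 2
    # Per-channel rotation frequencies, geometrically spaced as in the RoPE paper.
    freqs = torch.exp(-math.log(base) * torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle, so
    # attention dot products depend on relative rather than absolute positions.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```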
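One way to realize "masking rate as a sampling condition" is to embed the current rate sinusoidally, as is common for diffusion timesteps, and feed the resulting vector to the transformer alongside the other conditions. The encoding and scaling below are assumptions, not the paper's exact scheme:

```python
import math
import torch

def mask_rate_embedding(mask_rate: float, dim: int = 256) -> torch.Tensor:
    """Sinusoidal embedding of the current masking rate (illustrative).

    Injecting this vector tells the model how corrupted the token grid is,
    so heavily masked early steps and nearly complete late steps can be
    treated differently.
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    # Scale the [0, 1] rate so it sweeps a useful part of the frequency range
    # (the same trick used for diffusion timestep embeddings).
    angles = mask_rate * 1000.0 * freqs
    return torch.cat([angles.sin(), angles.cos()])  # shape (dim,)
```

During sampling, the rate comes from the schedule (the cosine schedule in the earlier decoding sketch) and is recomputed at every step.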
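Micro-conditioning can be sketched in the same spirit: each scalar (original size, crop coordinates, a human-preference score) is embedded and the results concatenated into one conditioning vector, much as SDXL handles its micro-conditions. The helper, the scalar set, and the dimensions here are illustrative:

```python
import math
import torch

def _sin_emb(value: float, dim: int) -> torch.Tensor:
    """Generic sinusoidal embedding of a scalar (illustrative helper)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = value * freqs
    return torch.cat([angles.sin(), angles.cos()])

def micro_condition(orig_size, crop_coords, hp_score, dim=64):
    """Pack micro-conditions into one vector the transformer can consume.

    orig_size:   (width, height) of the source image before resizing
    crop_coords: (top, left) of the training crop
    hp_score:    scalar human-preference / aesthetic score
    """
    scalars = [*orig_size, *crop_coords, hp_score]
    return torch.cat([_sin_emb(float(s), dim) for s in scalars])  # (5 * dim,)

# e.g. micro_condition((1024, 1024), (0, 0), 6.5)
```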
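Finally, a pixel-shuffle pair is one standard way to implement feature compression; the paper's exact module may differ, so treat this purely as a sketch of the idea:

```python
import torch
import torch.nn as nn

dim = 768  # transformer width (placeholder)

# Fold each 2x2 neighborhood of feature-map positions into a single token
# (4x fewer tokens for the transformer to attend over), then unfold afterwards.
compress = nn.Sequential(nn.PixelUnshuffle(2), nn.Conv2d(4 * dim, dim, kernel_size=1))
decompress = nn.Sequential(nn.Conv2d(dim, 4 * dim, kernel_size=1), nn.PixelShuffle(2))

feats = torch.randn(1, dim, 64, 64)   # 64x64 latent grid -> 4096 tokens
small = compress(feats)               # (1, dim, 32, 32)  -> 1024 tokens
restored = decompress(small)          # back to (1, dim, 64, 64)
```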
Results and Evaluation
The paper provides extensive quantitative and qualitative evaluations of Meissonic. Key metrics include Human Preference Score v2 (HPS v2.0), the GenEval benchmark, and the Multi-dimensional Preference Score (MPS). On these measures, Meissonic is competitive with, and in places ahead of, leading models such as SDXL in both image quality and text-image alignment.
The model's efficiency is notable: it can run on consumer-grade GPUs with only 8 GB of VRAM. This is a significant achievement given that its performance is on par with, or exceeds, that of models like SDXL, which demand far greater computational resources.
Implications and Future Directions
The introduction of Meissonic carries potential implications for the development of more efficient and accessible text-to-image models. By achieving high-resolution generation with reduced resources, Meissonic paves the way for wider applicability of such technologies in consumer-grade products. Moreover, its architecture can inspire further exploration into integrating multi-modal transformers for cross-modal tasks.
Future developments could involve refining the balance between model size and performance, as well as enhancing adaptability to various downstream tasks. Investigations into further reducing reliance on complex data preprocessing might also be considered.
In conclusion, Meissonic represents a step forward in the synthesis of high-quality, high-resolution images, merging efficiency with performance through carefully orchestrated architectural and methodological advances. The work contributes to ongoing efforts to build unified language-vision models that operate with minimal resource overhead.