Insights into aMUSEd: An Open MUSE Reproduction for Lightweight Text-to-Image Generation
The paper "aMUSEd: An open MUSE reproduction" presents aMUSEd, an open-source, computationally efficient masked image model (MIM) for text-to-image generation. The research examines MIM as a viable alternative to the diffusion models that currently dominate text-to-image generation. Using roughly 10% of MUSE's parameters, the authors highlight the advantages of MIM, particularly inference efficiency and interpretability, and they release comprehensive open-source materials, including checkpoints and training code, to spur further exploration by the research community.
Technical Framework
aMUSEd builds on the MUSE architecture with a substantially reduced parameter count. It pairs a CLIP-L/14 text encoder for text conditioning with a U-ViT backbone that models the image token sequence, and it uses a VQ-GAN without self-attention layers to lower the computational footprint. Generation runs in a small, fixed number of inference steps governed by a cosine masking schedule. Training proceeds in stages, starting at 256x256 resolution and then moving to 512x512, so the model scales across different resolution requirements.
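To make the cosine masking schedule concrete, here is a minimal sketch in the style of MaskGIT-type models. The function names and the training-time masking helper are illustrative assumptions, not the authors' released code; the 256-token default corresponds to a hypothetical 16x16 latent grid.

```python
import math
import torch

def cosine_mask_schedule(step: int, total_steps: int) -> float:
    """Fraction of image tokens still masked after `step` of `total_steps`.
    Starts near 1.0 (everything masked) and decays to 0.0 (fully unmasked)."""
    return math.cos(math.pi / 2 * (step + 1) / total_steps)

def sample_training_mask(num_tokens: int = 256) -> torch.Tensor:
    """Draw a mask ratio from the cosine schedule and randomly mask that many
    token positions, as is typical for masked-image-modeling training."""
    r = torch.rand(1).item()                                  # uniform timestep in [0, 1)
    num_masked = max(1, int(math.cos(math.pi / 2 * r) * num_tokens))
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[torch.randperm(num_tokens)[:num_masked]] = True      # True = replace with [MASK]
    return mask
```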
A key differentiator in aMUSEd's design is that masked image modeling predicts all masked image tokens in parallel, avoiding the long iterative denoising chains characteristic of diffusion models. This allows aMUSEd to generate images in as few as 12 inference steps, significantly reducing computational cost while maintaining image fidelity.
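The loop below sketches this parallel decoding idea under simple assumptions: `model` is any network that returns per-token logits given the current token grid and a text embedding, and `mask_id` is the id reserved for the [MASK] token. It is a hand-written illustration of MaskGIT-style decoding, not the released sampler.

```python
import math
import torch

@torch.no_grad()
def parallel_decode(model, text_emb, mask_id, num_tokens=256, steps=12):
    """Start from an all-[MASK] canvas; at every step predict all masked tokens
    at once, keep the most confident predictions, and re-mask the rest."""
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, text_emb)                     # (1, num_tokens, vocab_size)
        confidence, prediction = logits.softmax(-1).max(-1)  # per-position confidence
        still_masked = tokens == mask_id
        tokens = torch.where(still_masked, prediction, tokens)
        # Cosine schedule: how many tokens should remain masked after this step.
        keep_masked = int(num_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        if keep_masked > 0:
            # Never re-mask tokens that were already fixed in earlier steps.
            confidence = confidence.masked_fill(~still_masked, float("inf"))
            remask = confidence[0].topk(keep_masked, largest=False).indices
            tokens[0, remask] = mask_id
    return tokens                                            # decode with the VQ-GAN
```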
Experimental Evaluation
The authors present an empirical evaluation showing that aMUSEd achieves faster inference than non-distilled diffusion models and remains competitive with few-step distilled diffusion models, particularly at larger batch sizes. For example, the model generates images more than three times faster than a standard diffusion model such as Stable Diffusion 1.5, with substantial reductions in end-to-end generation time at smaller batch sizes as well.
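Readers who want to check latency on their own hardware can time a 12-step generation with the aMUSEd pipeline that ships in the diffusers library, roughly as below. The `amused/amused-256` model id reflects the open-source release and may need adjusting, and absolute timings depend on the GPU.

```python
import time
import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained("amused/amused-256").to("cuda")

prompt = "a photo of an astronaut riding a horse on the moon"

# Warm-up run so one-time CUDA initialization does not skew the measurement.
pipe(prompt, num_inference_steps=12)

torch.cuda.synchronize()
start = time.perf_counter()
image = pipe(prompt, num_inference_steps=12).images[0]
torch.cuda.synchronize()
print(f"12-step generation: {time.perf_counter() - start:.2f}s")
image.save("astronaut.png")
```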
However, while aMUSEd's CLIP scores are competitive, the evaluation shows that it presently lags behind other diffusion models on metrics such as Fréchet Inception Distance (FID) and Inception Score (ISC). A notable finding from the qualitative evaluations is that aMUSEd does well on less detailed images but may require targeted prompting to reach competitive quality on highly detailed scenes.
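As a rough sketch of how such automated metrics are typically computed (using torchmetrics rather than the authors' exact evaluation harness), with random tensors standing in for reference images and model generations:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.multimodal.clip_score import CLIPScore

# Placeholder data: in practice these would be dataset images and generations
# for the same prompts, as uint8 tensors of shape (N, 3, H, W).
real_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
prompts = ["a placeholder prompt"] * 16

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print("FID:", fid.compute().item())

isc = InceptionScore()
isc.update(generated_images)
print("Inception Score:", isc.compute()[0].item())

clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP score:", clip(generated_images, prompts).item())
```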
Task Transfer and Stylization
Beyond text-to-image generation, aMUSEd shows strong zero-shot capabilities on related tasks such as image variation, inpainting, and video generation, extending the model to varied multimedia settings without task-specific modifications or retraining. Furthermore, integration with StyleDrop enables efficient style transfer with minimal training steps and compute, illustrating a practical application of MIM to style adaptation. The inpainting workflow is sketched below.
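As an example, the inpainting capability can be exercised through the diffusers integration along these lines; the model id and the local file paths here are placeholders for illustration.

```python
from diffusers import AmusedInpaintPipeline
from diffusers.utils import load_image

pipe = AmusedInpaintPipeline.from_pretrained("amused/amused-512").to("cuda")

# Placeholder inputs: any RGB image plus a white-on-black mask of the region to repaint.
image = load_image("dog.png").resize((512, 512))
mask = load_image("dog_mask.png").resize((512, 512))

result = pipe(
    prompt="a dog wearing a birthday party hat",
    image=image,
    mask_image=mask,
    num_inference_steps=12,
).images[0]
result.save("inpainted.png")
```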
Future Directions
The work concludes that aMUSEd paves the way for more computationally efficient and accessible text-to-image models. Its open-source release, with reproducible code and model weights, establishes a foundation for follow-up research and potential industrial application. Future work could improve image quality metrics through better training regimes, potentially drawing on the extensive language modeling literature to refine token prediction confidence and uncertainty estimates.
In essence, aMUSEd not only adds to the knowledge base on masked image modeling but also invites the broader research community to explore MIM as a viable alternative to existing generative paradigms. Through this contribution, the authors encourage more streamlined, adaptable, and resource-conscious approaches to image synthesis.