Overview of Lumina-mGPT: A Context-Aware Multimodal Generative Model
The paper "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining" by Liu et al. introduces Lumina-mGPT, a family of multimodal autoregressive models that perform a wide range of vision and language tasks, with a particular strength in generating high-quality photorealistic images from textual descriptions. Unlike previous autoregressive image generation models, Lumina-mGPT leverages a pretrained decoder-only transformer initialized with multimodal Generative PreTraining (mGPT). This approach capitalizes on next-token prediction over large-scale interleaved text-image sequences to acquire broad multimodal competencies.
Key Contributions
- Pretrained Decoder-Only Transformer:
- Lumina-mGPT employs a pretrained decoder-only transformer as its core architecture. This transformer is initialized using mGPT, which has been trained on large-scale interleaved text-image data using a next-token prediction objective. This enables Lumina-mGPT to acquire versatile and generalizable multimodal capabilities, a significant departure from the randomly initialized transformers traditionally used in autoregressive image generation.
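As an illustration of this next-token-prediction setup, the sketch below places text tokens and discrete image codes in a single shared vocabulary and derives shifted prediction targets from an interleaved sequence. The vocabulary sizes and the offset mapping are assumptions for illustration, not the paper's actual configuration.

```python
# Hypothetical vocabulary layout: text tokens first, then VQ image codes.
TEXT_VOCAB = 32_000   # assumed text vocabulary size
IMAGE_CODES = 8_192   # assumed VQ codebook size
VOCAB = TEXT_VOCAB + IMAGE_CODES

def image_code_to_token(code: int) -> int:
    """Map a VQ codebook index into the shared token ID space."""
    return TEXT_VOCAB + code

def make_next_token_targets(sequence):
    """Each position predicts the following token: targets are the
    sequence shifted left by one."""
    return sequence[:-1], sequence[1:]

# An interleaved sample: caption tokens followed by image tokens.
caption = [5, 17, 42]
image = [image_code_to_token(c) for c in [0, 100, 8191]]
seq = caption + image
inputs, targets = make_next_token_targets(seq)
```

Because text and image tokens live in one ID space, a single cross-entropy loss over the unified sequence trains both modalities at once.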
- Flexible Progressive Supervised Finetuning (FP-SFT):
- The authors introduce FP-SFT, a novel finetuning strategy where models are progressively finetuned on high-quality image-text pairs at increasing resolutions. This method enhances image generation by gradually exposing the model to higher resolution data, thereby improving image quality without compromising the model’s general multimodal capabilities.
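One practical consequence of a progressive-resolution schedule is how quickly sequence length grows. The sketch below, assuming a VQ tokenizer with a 16x downsampling factor and an illustrative 512→768→1024 schedule (both are assumptions, not the paper's exact settings), shows the image-token count at each stage:

```python
# Assumed VQ tokenizer downsampling factor (illustrative, not from the paper).
DOWNSAMPLE = 16

def image_token_count(height: int, width: int, downsample: int = DOWNSAMPLE) -> int:
    """Number of discrete image tokens for an image at a given resolution."""
    return (height // downsample) * (width // downsample)

# Hypothetical FP-SFT resolution schedule.
STAGES = [512, 768, 1024]

for res in STAGES:
    n = image_token_count(res, res)
    print(f"stage {res}x{res}: {n} image tokens per image")
```

Since attention cost grows with sequence length, starting at low resolution lets the model learn text-image alignment cheaply before the expensive high-resolution stages.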
- Omnipotent Supervised Finetuning (Omni-SFT):
- Omni-SFT is designed to extend Lumina-mGPT's capabilities beyond text-to-image generation. The finetuning process incorporates diverse tasks such as segmentation, depth estimation, and multi-turn visual question answering, transforming Lumina-mGPT into a foundation model for task unification.
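The unification idea can be sketched as casting every task into one next-token-prediction format, where only the answer span contributes to the loss. The task IDs, separator tokens, and masking convention below are illustrative assumptions, not the paper's actual encoding:

```python
# Hypothetical special tokens and task prefixes (assumed for illustration).
BOS, SEP, EOS = 0, 1, 2
TASK_PREFIX = {"t2i": 10, "segmentation": 11, "depth": 12, "vqa": 13}

def build_example(task: str, prompt: list, target: list):
    """Cast any task as one token sequence; the loss mask is 1 only on
    the answer span (target + EOS), so the prompt is conditioning only."""
    tokens = [BOS, TASK_PREFIX[task]] + prompt + [SEP] + target + [EOS]
    loss_mask = [0] * (len(prompt) + 3) + [1] * (len(target) + 1)
    return tokens, loss_mask
```

Under this framing, segmentation or depth targets are simply image tokens in the answer span, so a single decoder handles all tasks.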
Strong Numerical Results and Claims
The paper presents robust numerical results highlighting Lumina-mGPT's capabilities:
- Photorealistic Image Generation:
- Lumina-mGPT can generate images at arbitrary resolutions, a notable achievement given the challenges faced by traditional autoregressive models in scalable high-resolution image synthesis.
- Visual comparisons demonstrate Lumina-mGPT's superiority over contemporary models like LlamaGen and Parti, showcasing higher aesthetic quality and finer visual details.
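Flexible-resolution decoding requires the token stream itself to signal image shape. The sketch below shows one hedged way such signaling can work, marking each row of VQ codes with an end-of-line token so height and width are recoverable without a fixed grid; the special-token IDs are assumptions for illustration:

```python
# Assumed special-token IDs for image boundaries and row ends (illustrative).
IMG_START, IMG_END, EOL = 100, 101, 102

def encode_image_grid(codes_2d):
    """Flatten a 2D grid of VQ codes, appending EOL after each row so a
    decoder can infer width and height from the stream itself."""
    seq = [IMG_START]
    for row in codes_2d:
        seq.extend(row)
        seq.append(EOL)
    seq.append(IMG_END)
    return seq
```

A 2x2 grid encodes as `[IMG_START, c00, c01, EOL, c10, c11, EOL, IMG_END]`, and a 2x3 grid simply has longer rows, so no resolution is hard-coded into the format.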
- Architectural Simplicity and Versatility:
- The paper makes a compelling case for the efficacy of a simple decoder-only architecture complemented by multimodal generative pretraining. Lumina-mGPT’s unified framework simplifies the generation process compared with the more complex encoder-decoder or cascaded pipelines employed by other models.
Theoretical and Practical Implications
The introduction of Lumina-mGPT has significant implications for both theoretical research and practical applications in AI:
- Theoretical Implications:
- This work challenges the prevailing notion that diffusion models are superior for photorealistic image generation by presenting an autoregressive model that achieves comparable, if not superior, results.
- The paper suggests a promising path forward where multimodal generative pretraining can be an effective initialization strategy for large-scale autoregressive models, potentially leading to more efficient and capable models.
- Practical Applications:
- Lumina-mGPT's ability to generate diverse, high-quality images from textual descriptions can be transformative for industries such as media, entertainment, and e-commerce, where visual content generation is crucial.
- The model's task unification capability opens new avenues for integrating multiple vision-language tasks into a single framework, simplifying the deployment of AI systems that need to perform a range of image-related and language-related tasks.
Future Developments in AI
Considering the advancements brought by Lumina-mGPT, several future research directions and practical enhancements can be envisaged:
- Scaling Data and Computational Resources:
- As noted in the paper, larger datasets and increased computational resources could further enhance Lumina-mGPT's performance, especially in multilingual understanding and complex multi-turn interactions.
- Inference Time Optimization:
- Although effective, autoregressive generation produces tokens sequentially, which limits inference speed. Integrating efficient sampling techniques and other inference-time optimizations (for example, key-value caching or speculative decoding) could significantly reduce generation latency, making the model more practical for real-time applications.
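To illustrate one standard inference-time optimization, the sketch below implements a single-head key-value cache: each decoding step appends its key and value once and attends over the cached prefix, rather than re-encoding the whole sequence. This is a generic technique sketch, not Lumina-mGPT's implementation:

```python
import numpy as np

class KVCache:
    """Minimal single-head KV cache for autoregressive decoding."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # Store this step's key/value so later steps never recompute them.
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q: np.ndarray) -> np.ndarray:
        K = np.stack(self.keys)            # (t, d) cached keys
        V = np.stack(self.values)          # (t, d) cached values
        scores = K @ q / np.sqrt(q.size)   # scaled dot-product scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()           # softmax over the prefix
        return weights @ V
```

Each generated token costs one projection plus attention over the cached prefix, instead of reprocessing all prior tokens through the whole model.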
- Enhancing Base Representations:
- Incorporating larger and more diversified datasets during pretraining could improve the model’s understanding of various languages and complex visual concepts, thus expanding its applicability and robustness.
In conclusion, Lumina-mGPT introduces a robust framework for multimodal text-to-image generation and task unification by leveraging multimodal generative pretraining within a simple decoder-only architecture. This work paves the way for future research to further refine and optimize such models, thereby pushing the boundaries of what is achievable in the fields of image generation and multimodal AI.