Lumina-Image 2.0 represents an advancement in text-to-image (T2I) generation, building upon the Lumina-Next framework (Qin et al., 27 Mar 2025). Its development is guided by two primary principles: unification of architecture and data processing, and efficiency in training and inference. The framework introduces architectural modifications, a specialized captioning system, and optimized training/inference procedures to achieve competitive performance with a relatively small model size.
Unification: Architecture and Data
A core tenet of Lumina-Image 2.0 is unification, manifested in both its model architecture and data preparation strategy.
Unified Next-DiT Architecture
The framework employs a Unified Next-DiT (Diffusion Transformer) architecture. Unlike approaches that might use separate encoders for text and image modalities before fusion, this architecture processes text and image tokens jointly within a single sequence. This design inherently facilitates cross-modal interaction throughout the network's depth. By treating inputs as a unified sequence, the model can potentially learn more complex and nuanced relationships between textual descriptions and visual representations. This unified sequence processing approach also simplifies the architecture and potentially allows for easier extension to other generative tasks beyond T2I, as suggested by the aim for "seamless task expansion." The specific mechanisms for tokenization (how text and image patches are converted into tokens) and the details of the transformer blocks integrating these unified sequences are key implementation aspects detailed in the associated code release.
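To make the unified-sequence idea concrete, below is a minimal PyTorch sketch of a single transformer block attending over a concatenated text-and-image token sequence. The class and shapes (`UnifiedBlock`, 77 text tokens, a 32×32 latent grid) are illustrative assumptions, not the actual Next-DiT block, which additionally carries timestep conditioning and the tokenization details noted above.

```python
import torch
import torch.nn as nn

class UnifiedBlock(nn.Module):
    """One transformer block over a joint text+image token sequence.

    Hypothetical simplification: the real Next-DiT blocks also include
    timestep conditioning (e.g., modulation layers), omitted here for brevity.
    """
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # full self-attention over all tokens
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

# Toy example: 77 text tokens plus a 32x32 grid of image-latent patches
batch, dim = 2, 512
text_tokens = torch.randn(batch, 77, dim)        # e.g., from a text encoder
image_tokens = torch.randn(batch, 32 * 32, dim)  # e.g., patchified noisy latents

joint = torch.cat([text_tokens, image_tokens], dim=1)  # one unified sequence
out = UnifiedBlock(dim)(joint)
print(out.shape)  # torch.Size([2, 1101, 512])
```

Because every block applies full self-attention over the joint sequence, text and image tokens interact at every layer rather than only at a dedicated fusion point.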
Unified Captioner (UniCap)
Recognizing the critical role of high-quality text-image pairs in training T2I models, Lumina-Image 2.0 introduces a dedicated captioning system named UniCap (Unified Captioner). The premise is that standard web-scraped captions often lack the detail or accuracy needed for optimal T2I training. UniCap is specifically designed to generate comprehensive and accurate captions tailored for T2I tasks. The use of UniCap aims to improve the semantic alignment between the text prompts and the generated images during training. This improved alignment is purported to accelerate model convergence and enhance the final model's ability to adhere closely to input prompts. The effectiveness of UniCap relies on its ability to produce richer, more descriptive text than typically available, thereby providing a stronger supervisory signal during the diffusion model's training phase.
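As an illustration of how such a captioner slots into data preparation, the sketch below re-captions a toy dataset. The `describe` method, the `DummyCaptioner` stand-in, and the caption-length heuristic are placeholders invented for this example; they are not the UniCap interface or its actual selection criteria.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    image_path: str
    web_caption: str
    detailed_caption: Optional[str] = None

class DummyCaptioner:
    """Stand-in for a real captioning model; returns a fixed string."""
    def describe(self, image_path: str) -> str:
        return f"A detailed, hypothetical description of {image_path}."

def recaption(dataset: list, caption_model) -> list:
    """Replace terse web captions with richer captions for T2I training.

    Hypothetical heuristic: regenerate only when the original caption is
    too short to be descriptive.
    """
    for sample in dataset:
        if len(sample.web_caption.split()) < 10:
            sample.detailed_caption = caption_model.describe(sample.image_path)
        else:
            sample.detailed_caption = sample.web_caption
    return dataset

data = [
    Sample("img_001.jpg", "a dog"),
    Sample("img_002.jpg", "a golden retriever catching a frisbee in a sunlit park"),
]
for s in recaption(data, DummyCaptioner()):
    print(s.detailed_caption)
```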
Efficiency: Training and Inference
The second guiding principle is efficiency, addressing the substantial computational costs often associated with training and deploying large generative models.
Multi-Stage Progressive Training
To manage the computational demands of training, Lumina-Image 2.0 utilizes a multi-stage progressive training strategy: training begins at lower resolutions (or with simpler model configurations) and progresses to higher resolutions in later stages. This allows the model to learn foundational features efficiently before refining details at higher computational cost. The specific stages, resolutions, and the schedule for transitioning between them are crucial parameters determining the overall training efficiency and final model quality.
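A schematic version of such a schedule is sketched below. The resolutions, step counts, and learning rates are illustrative placeholders, not the values used by Lumina-Image 2.0 (those are defined in the released training configurations).

```python
from dataclasses import dataclass

@dataclass
class Stage:
    resolution: int  # training image resolution (pixels)
    steps: int       # optimization steps at this resolution
    lr: float        # learning rate for this stage

# Hypothetical schedule: cheap low-resolution steps first, expensive high-resolution steps last.
SCHEDULE = [
    Stage(resolution=256,  steps=200_000, lr=1e-4),  # learn global layout
    Stage(resolution=512,  steps=100_000, lr=5e-5),  # refine mid-level structure
    Stage(resolution=1024, steps=50_000,  lr=2e-5),  # fine detail at full cost
]

def run(schedule: list) -> None:
    for i, stage in enumerate(schedule, start=1):
        # In a real pipeline the dataloader and patch grid would be rebuilt
        # at each resolution; here we only print the plan.
        print(f"Stage {i}: {stage.steps:,} steps at {stage.resolution}px, lr={stage.lr}")

run(SCHEDULE)
```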
Inference Acceleration
Beyond training, the framework incorporates inference acceleration techniques. While the abstract does not specify the exact methods, common techniques in diffusion models include optimized sampling schedules (e.g., reducing the number of diffusion steps), model distillation, and faster sampler algorithms such as DPM-Solver++ or DDIM with fewer steps. The goal is to reduce the latency and computational cost of generating an image from a text prompt without significantly degrading output quality. The claim is that these techniques are applied "without compromising image quality," suggesting careful selection and tuning of the acceleration methods.
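The sketch below illustrates the simplest of these ideas, cutting inference cost by visiting only an evenly spaced subset of the training timesteps (the spacing strategy underlying DDIM-style few-step sampling). It is a generic example, not the specific acceleration method used by Lumina-Image 2.0.

```python
import numpy as np

def spaced_timesteps(train_steps: int = 1000, sample_steps: int = 30) -> np.ndarray:
    """Pick a sparse, evenly spaced subset of the training timesteps.

    Visiting only `sample_steps` of the `train_steps` denoising steps cuts
    per-image inference cost roughly by the same factor. A sketch of the
    general technique, not Lumina-Image 2.0's actual sampler.
    """
    return np.linspace(train_steps - 1, 0, num=sample_steps).round().astype(int)

print(spaced_timesteps())  # 30 timesteps instead of 1000 -> ~33x fewer denoising passes
```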
Performance and Evaluation
Lumina-Image 2.0's performance was assessed using standard academic benchmarks for T2I generation and evaluations on public platforms ("text-to-image arenas"). The framework, despite having only 2.6 billion parameters, is reported to achieve strong performance. This relatively modest parameter count (compared to many larger contemporary T2I models) highlights the claimed efficiency of the architectural design (Unified Next-DiT) and the training methodology (UniCap, progressive training). Specific quantitative results on benchmarks such as MS-COCO (FID, IS scores), or qualitative comparisons from user studies, would provide further context for the "strong performance" claim. The authors have made the training details, code, and pre-trained models available at https://github.com/Alpha-VLLM/Lumina-Image-2.0, enabling reproducibility and further investigation by the research community.
Conclusion
In summary, Lumina-Image 2.0 introduces a T2I framework centered around a unified transformer architecture (Unified Next-DiT) processing joint text-image sequences and enhanced training data via a specialized captioner (UniCap) (Qin et al., 27 Mar 2025). It couples these unification strategies with multi-stage progressive training and inference acceleration techniques to achieve computational efficiency. The reported strong performance with a 2.6B parameter model suggests a potentially effective balance between model capability and resource requirements in the domain of text-to-image synthesis.