- The paper introduces LightBagel, a lightweight double-fusion model that combines pre-trained vision-language models (VLMs) and Diffusion Transformers (DiTs) for unified multimodal tasks.
- The methodology employs a mixture-of-experts style design with dual-pathway integration of ViT and VAE tokens to balance high-level semantics and low-level visual detail.
- Experiments demonstrate that LightBagel achieves competitive results on text-to-image generation and image-editing benchmarks while requiring substantially less compute.
LightBagel: A Lightweight, Double Fusion Framework for Unified Multimodal Understanding and Generation
Introduction
The paper presents the development and evaluation of LightBagel, a unified multimodal model designed to combine the strengths of existing models that specialize in understanding and generation. Unlike many contemporary approaches that require extensive computational resources and large-scale training, LightBagel emphasizes efficiency through a novel architecture termed "Double Fusion". By integrating publicly available models with specialized language and vision capabilities, the framework achieves competitive performance across a range of benchmarks.
Model Architecture
The cornerstone of LightBagel's architecture is a mixture-of-experts (MoE) style design that couples a vision-language model (VLM) for understanding with a Diffusion Transformer (DiT) for generation, interleaving zero-initialized multimodal self-attention blocks throughout the combined network. Concretely, the framework builds on publicly available pre-trained models such as Qwen2.5-VL-7B for understanding and Wan2.2-TI2V-5B for generation. Because the fusion blocks are zero-initialized, they allow the two pathways to interact without compromising the inherent strengths of either pre-trained model.
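To make the fusion mechanism concrete, the sketch below shows what a zero-initialized multimodal self-attention block could look like in PyTorch. The class name, dimensions, and block layout are illustrative assumptions rather than the paper's exact implementation; the key idea is that the zero-initialized output projection makes the block an identity map at the start of training, so the pre-trained VLM and DiT pathways are initially undisturbed.

```python
import torch
import torch.nn as nn

class ZeroInitFusionBlock(nn.Module):
    """Illustrative multimodal self-attention block whose output projection
    is zero-initialized, so the pre-trained VLM/DiT pathways are unchanged
    at initialization (assumed design, not the official code)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)
        # Zero-init: the residual branch contributes nothing at first.
        nn.init.zeros_(self.out_proj.weight)
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, vlm_tokens: torch.Tensor, dit_tokens: torch.Tensor):
        # Joint self-attention over the concatenated understanding (VLM)
        # and generation (DiT) token streams.
        fused = torch.cat([vlm_tokens, dit_tokens], dim=1)
        h = self.norm(fused)
        h, _ = self.attn(h, h, h)
        fused = fused + self.out_proj(h)  # residual; zero at init
        # Split back so each expert pathway keeps its own sequence.
        n = vlm_tokens.shape[1]
        return fused[:, :n], fused[:, n:]
```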
The dual-pathway design feeds the generator both tokens from a Vision Transformer (ViT) encoder, which carry high-level semantics, and Variational Autoencoder (VAE) tokens, which carry low-level visual details, enriching cross-modal synthesis. This mechanism maintains a balance between fine-grained visual attributes and abstract semantic content, supporting fidelity in text-to-image generation and image-editing tasks.
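A minimal sketch of how the dual-pathway tokens might be assembled, assuming the ViT tokens come from the VLM's vision encoder and the VAE latents from the generator's image autoencoder; the projection layers and tensor shapes here are hypothetical stand-ins, not the released code.

```python
import torch
import torch.nn as nn

class DualPathwayTokens(nn.Module):
    """Sketch of merging high-level ViT tokens and low-level VAE latent
    tokens into one conditioning sequence (assumed layout)."""

    def __init__(self, vit_dim: int, vae_dim: int, model_dim: int):
        super().__init__()
        self.vit_proj = nn.Linear(vit_dim, model_dim)   # semantic pathway
        self.vae_proj = nn.Linear(vae_dim, model_dim)   # appearance pathway

    def forward(self, vit_tokens: torch.Tensor, vae_latents: torch.Tensor):
        # vit_tokens:  (B, N_vit, vit_dim) from the vision encoder
        # vae_latents: (B, C, H, W) latent grid from the image VAE
        vae_tokens = vae_latents.flatten(2).transpose(1, 2)  # (B, H*W, C)
        sem = self.vit_proj(vit_tokens)
        app = self.vae_proj(vae_tokens)
        # Concatenate so downstream fusion blocks can attend to both
        # semantic and pixel-level cues of the reference image.
        return torch.cat([sem, app], dim=1)
```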
Datasets and Training
To optimize LightBagel's performance, a carefully curated training dataset of real and synthetic image-text pairs was used. The dataset emphasizes diversity and balance, drawing on high-quality sources for both text-to-image generation and image-editing tasks. LightBagel was trained with the AdamW optimizer, using settings chosen to keep computational demand low while preserving model quality. Training was staged to progressively rebalance data of differing quality, improving adaptability across variations in task complexity.
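For illustration only, a hedged sketch of an AdamW setup in PyTorch follows; the learning rate, betas, and weight decay are placeholder values, and the choice of which parameters to train is an assumption, not a detail reported in this summary.

```python
import torch

# Illustrative optimizer setup only: hyperparameter values are placeholders,
# not the configuration reported by the paper.
def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # Optimize only parameters marked as trainable (e.g. the newly added
    # fusion blocks); freezing the pre-trained backbones is an assumption
    # made here to reflect the low-compute goal, not a stated detail.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(
        trainable,
        lr=1e-4,            # placeholder learning rate
        betas=(0.9, 0.95),  # placeholder betas
        weight_decay=0.01,  # placeholder weight decay
    )
```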
Experiments
The efficacy of LightBagel was evaluated through extensive experiments spanning image understanding, generation, and editing. Benchmarks included GenEval for compositional synthesis, DPG-Bench for complex text-to-image prompts, and GEdit-Bench and ImgEdit-Bench for image editing.
For image understanding, LightBagel preserved the strengths of its pre-trained components, achieving scores on par with top-performing models. In text-to-image generation, it showed strong compositional understanding and prompt adherence, matching or outperforming models trained with significantly more compute. In image editing, it balanced semantic coherence with task-specific transformations, surpassing several strong baselines on the editing benchmarks.
Ablation Studies
Several architectural decisions were validated via comprehensive ablation studies, focusing on the choice of visual tokenizers and multimodal integration strategies. Results indicated that combining ViT and VAE tokens yielded the best editing outcomes, while deeper fusion consistently outperformed shallower integration. These findings support the robustness and adaptability of the LightBagel framework across modalities and tasks.
Conclusion
LightBagel demonstrates that, with strategic architectural choices, strong performance in unified multimodal modeling can be achieved without the scale and computational cost typical of contemporary methods. By releasing the model, datasets, and code publicly, the work promotes transparency and encourages further progress toward versatile, efficient, and accessible multimodal systems. The results offer empirical guidance for designing unified multimodal models that balance token efficiency and performance.