- The paper introduces LightBagel, a lightweight double-fusion model that combines pre-trained vision-language models (VLMs) and Diffusion Transformers (DiTs) for unified multimodal tasks.
- The methodology employs a mixture-of-experts style design with dual-pathway integration of ViT and VAE tokens to balance high-level semantics and low-level visual detail.
- Experiments demonstrate that LightBagel achieves competitive results on text-to-image generation and image-editing benchmarks while requiring substantially less compute.
LightBagel: A Lightweight, Double Fusion Framework for Unified Multimodal Understanding and Generation
Introduction
The paper presents the development and evaluation of LightBagel, a unified multimodal model designed to combine the strengths of existing models that specialize in understanding and generation. Unlike many contemporary approaches that require extensive computational resources and large-scale training, LightBagel emphasizes efficiency through a novel architecture termed "Double Fusion". By integrating publicly available models with specialized language and vision capabilities, the framework achieves competitive performance across a range of benchmarks.
Model Architecture
The cornerstone of LightBagel's architecture is a mixture-of-experts (MoE) style design that couples a vision-language model (VLM) for understanding with a Diffusion Transformer (DiT) for generation, interleaving zero-initialized multimodal self-attention blocks throughout the combined network. Concretely, the framework builds on publicly available pre-trained models such as Qwen2.5-VL-7B for understanding and Wan2.2-TI2V-5B for generation. Because the fusion blocks are zero-initialized, they allow the two pathways to interact without compromising the inherent strengths of either pre-trained model.
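To make the fusion mechanism concrete, the sketch below shows what a zero-initialized multimodal self-attention block could look like in PyTorch. The class name, dimensions, and block layout are illustrative assumptions rather than the paper's exact implementation; the key idea is that the zero-initialized output projection makes the block an identity map at the start of training, so the pre-trained VLM and DiT pathways are initially undisturbed.

```python
import torch
import torch.nn as nn

class ZeroInitFusionBlock(nn.Module):
    """Illustrative multimodal self-attention block whose output projection
    is zero-initialized, so the pre-trained VLM/DiT pathways are unchanged
    at initialization (assumed design, not the official code)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)
        # Zero-init: the residual branch contributes nothing at first.
        nn.init.zeros_(self.out_proj.weight)
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, vlm_tokens: torch.Tensor, dit_tokens: torch.Tensor):
        # Joint self-attention over the concatenated understanding (VLM)
        # and generation (DiT) token streams.
        fused = torch.cat([vlm_tokens, dit_tokens], dim=1)
        h = self.norm(fused)
        h, _ = self.attn(h, h, h)
        fused = fused + self.out_proj(h)  # residual; zero at init
        # Split back so each expert pathway keeps its own sequence.
        n = vlm_tokens.shape[1]
        return fused[:, :n], fused[:, n:]
```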
The dual-pathway design feeds the generator both tokens from a Vision Transformer (ViT) encoder, which carry high-level semantics, and Variational Autoencoder (VAE) tokens, which carry low-level visual details, enriching cross-modal synthesis. This mechanism maintains a balance between fine-grained visual attributes and abstract semantic content, supporting fidelity in text-to-image generation and image-editing tasks.
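A minimal sketch of how the dual-pathway tokens might be assembled, assuming the ViT tokens come from the VLM's vision encoder and the VAE latents from the generator's image autoencoder; the projection layers and tensor shapes here are hypothetical stand-ins, not the released code.

```python
import torch
import torch.nn as nn

class DualPathwayTokens(nn.Module):
    """Sketch of merging high-level ViT tokens and low-level VAE latent
    tokens into one conditioning sequence (assumed layout)."""

    def __init__(self, vit_dim: int, vae_dim: int, model_dim: int):
        super().__init__()
        self.vit_proj = nn.Linear(vit_dim, model_dim)   # semantic pathway
        self.vae_proj = nn.Linear(vae_dim, model_dim)   # appearance pathway

    def forward(self, vit_tokens: torch.Tensor, vae_latents: torch.Tensor):
        # vit_tokens:  (B, N_vit, vit_dim) from the vision encoder
        # vae_latents: (B, C, H, W) latent grid from the image VAE
        vae_tokens = vae_latents.flatten(2).transpose(1, 2)  # (B, H*W, C)
        sem = self.vit_proj(vit_tokens)
        app = self.vae_proj(vae_tokens)
        # Concatenate so downstream fusion blocks can attend to both
        # semantic and pixel-level cues of the reference image.
        return torch.cat([sem, app], dim=1)
```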
Datasets and Training
To optimize LightBagel's performance, a carefully curated training dataset of real and synthetic image-text pairs was used. The dataset emphasizes diversity and balance, drawing on high-quality sources for both text-to-image generation and image-editing tasks. LightBagel was trained with the AdamW optimizer, using settings chosen to keep computational demand low while preserving model quality. Training was staged to progressively rebalance data of differing quality, improving adaptability across variations in task complexity.
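For illustration only, a hedged sketch of an AdamW setup in PyTorch follows; the learning rate, betas, and weight decay are placeholder values, and the choice of which parameters to train is an assumption, not a detail reported in this summary.

```python
import torch

# Illustrative optimizer setup only: hyperparameter values are placeholders,
# not the configuration reported by the paper.
def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # Optimize only parameters marked as trainable (e.g. the newly added
    # fusion blocks); freezing the pre-trained backbones is an assumption
    # made here to reflect the low-compute goal, not a stated detail.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(
        trainable,
        lr=1e-4,            # placeholder learning rate
        betas=(0.9, 0.95),  # placeholder betas
        weight_decay=0.01,  # placeholder weight decay
    )
```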
Experiments
The efficacy of LightBagel was evaluated through extensive experiments spanning image understanding, generation, and editing. Benchmarks included GenEval for compositional synthesis, DPG-Bench for complex text-to-image prompts, and GEdit-Bench and ImgEdit-Bench for image editing.
For image understanding, LightBagel preserved the strengths of its pre-trained components, achieving scores on par with top-performing models. In text-to-image generation, it showed strong compositional understanding and prompt adherence, matching or outperforming models trained with significantly more compute. In image editing, it balanced semantic coherence with task-specific transformations, surpassing several strong baselines on the editing benchmarks.
Ablation Studies
Several architectural decisions were validated via comprehensive ablation studies, focusing on the choice of visual tokenizers and multimodal integration strategies. Results indicated that combining ViT and VAE tokens yielded the best editing outcomes, while deeper fusion consistently outperformed shallower integration. These findings support the robustness and adaptability of the LightBagel framework across modalities and tasks.
Conclusion
LightBagel demonstrates that, with strategic architectural choices, strong performance in unified multimodal modeling can be achieved without the scale and computational cost typical of contemporary methods. By releasing the model, datasets, and code publicly, the work promotes transparency and encourages further progress toward versatile, efficient, and accessible multimodal systems. The results offer empirical guidance for designing unified multimodal models that balance token efficiency and performance.