The paper "BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset" (Chen et al., 14 May 2025 ) introduces BLIP3-o, a suite of unified multimodal models designed to excel in both image understanding and image generation tasks within a single framework. Motivated by the potential of combining autoregressive and diffusion models, the authors conduct a systematic paper into the optimal design choices for such unified architectures.
The core concept involves an autoregressive model processing textual prompts to generate intermediate visual features, which then condition a diffusion model to produce image pixels. This pipeline aims to leverage the reasoning and in-context learning capabilities of LLMs for image generation. The paper investigates three key design axes:
- Image Representations: Comparing low-level pixel features (e.g., from VAEs) with high-level semantic features (e.g., from CLIP). The paper finds that CLIP image features offer more compact and informative representations, leading to higher training efficiency and better generative quality. CLIP features allow both understanding and generation tasks to operate within the same semantic space, facilitating unification.
- Training Objectives: Evaluating Mean Squared Error (MSE) versus Flow Matching for aligning the autoregressive model's output (predicted visual features) with the ground-truth image features. While MSE is simpler, it tends to produce deterministic outputs. Flow Matching, by incorporating stochasticity from diffusion, better models the distribution of continuous visual features, resulting in greater sample diversity and enhanced visual quality (a code sketch of both objectives follows this list).
- Training Strategies: Comparing joint multitask training (understanding and generation together) with sequential training (first understanding, then generation with a frozen autoregressive backbone). Sequential training is found to be practically advantageous as it preserves image understanding capabilities while developing strong image generation abilities, allowing dedicated training capacity for generation without negative interference from understanding tasks.
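To make the contrast between the two objectives concrete, the sketch below compares a plain MSE regression onto CLIP features with a rectified-flow-style Flow Matching loss. It is a minimal PyTorch illustration under assumed interfaces: `pred_net` and `velocity_net` stand in for the feature-prediction head, `cond` for conditioning features from the autoregressive backbone, and `clip_feats` for ground-truth CLIP image features; none of these names come from the BLIP3-o codebase.

```python
import torch
import torch.nn.functional as F

def mse_objective(pred_net, cond, clip_feats):
    """Deterministic regression: one predicted feature set per prompt."""
    pred = pred_net(cond)
    return F.mse_loss(pred, clip_feats)

def flow_matching_objective(velocity_net, cond, clip_feats):
    """Rectified-flow-style loss: regress the velocity that transports
    Gaussian noise to the CLIP-feature distribution, conditioned on cond."""
    noise = torch.randn_like(clip_feats)
    t = torch.rand(clip_feats.size(0), device=clip_feats.device)
    t_b = t.view(-1, *([1] * (clip_feats.dim() - 1)))
    x_t = (1.0 - t_b) * noise + t_b * clip_feats   # linear path: noise at t=0, data at t=1
    target_v = clip_feats - noise                   # constant velocity along that path
    pred_v = velocity_net(x_t, t, cond)             # prediction conditioned on MLLM features
    return F.mse_loss(pred_v, target_v)
```

The random noise and timestep sampling in the Flow Matching loss is what lets the model represent a distribution over visual features rather than a single point estimate, which underlies the diversity advantage noted above.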
Based on these findings, BLIP3-o adopts a design using a frozen LLM (specifically, Qwen 2.5 VL) as the autoregressive backbone for understanding, coupled with a diffusion transformer (based on the Lumina-Next architecture) that generates CLIP image features using Flow Matching. This approach leverages the MLLM's reasoning abilities while generating high-quality, semantically aligned images.
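The generation path implied by this design can be summarized in a short, hedged sketch: the frozen MLLM encodes the prompt, the diffusion transformer integrates the learned flow to sample CLIP image features, and a separate decoder renders pixels. Function names, the number of feature tokens, and the feature dimension below are illustrative placeholders, not the released BLIP3-o interfaces.

```python
import torch

@torch.no_grad()
def generate(prompt_tokens, mllm_encode, velocity_net, pixel_decoder,
             num_tokens=64, feat_dim=1024, steps=50):
    """Sample CLIP image features with the learned flow, then decode to pixels."""
    cond = mllm_encode(prompt_tokens)             # frozen autoregressive backbone
    x = torch.randn(1, num_tokens, feat_dim)      # start from Gaussian noise in feature space
    dt = 1.0 / steps
    for i in range(steps):                        # Euler integration of the flow ODE
        t = torch.full((1,), i * dt)
        x = x + dt * velocity_net(x, t, cond)     # move toward the CLIP-feature distribution
    return pixel_decoder(x)                       # separate decoder renders pixels from features
```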
The training of BLIP3-o follows a two-stage sequential strategy (a minimal code sketch of the recipe appears after the two stages below):
- Stage 1: Pretraining for Image Generation: The diffusion transformer is trained using the Flow Matching objective on a large dataset of image-caption pairs. For the 8B model, this includes 25 million open-source images (CC12M, SA-1B, JourneyDB) and 30 million proprietary images, with detailed captions generated by Qwen2.5-VL-7B-Instruct. The 4B model uses only the 25 million open-source images. Both datasets include a mix of detailed and shorter captions to improve generalization.
- Stage 2: Instruction Tuning for Image Generation: To improve the model's ability to generate specific content (e.g., complex human gestures, common objects, landmarks, simple text) and enhance visual aesthetics, a targeted instruction tuning dataset (BLIP3o-60k) is curated. This dataset consists of ~60k high-quality prompt-image pairs, many generated by prompting GPT-4o or sourced from JourneyDB and DALL·E 3. This stage helps the model align better with human preferences and improves visual quality. The authors note that the model adapts quickly to AI-generated data styles.
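Below is a minimal sketch of this sequential recipe, assuming PyTorch-style dataloaders that yield `(prompt_tokens, clip_feats)` pairs and reusing the `flow_matching_objective` sketched earlier; it is illustrative, not the released training code. The same loop serves both stages, only the data changes, and the MLLM backbone stays frozen throughout.

```python
import torch

def train_stage(mllm, dit, loader, optimizer, flow_matching_objective, num_steps):
    """One training stage: the MLLM stays frozen, only the diffusion transformer is updated."""
    mllm.eval()
    for p in mllm.parameters():
        p.requires_grad_(False)
    dit.train()
    for _, (prompt_tokens, clip_feats) in zip(range(num_steps), loader):
        with torch.no_grad():
            cond = mllm(prompt_tokens)                          # conditioning from the frozen backbone
        loss = flow_matching_objective(dit, cond, clip_feats)   # objective sketched earlier
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1: pretrain the diffusion transformer on large-scale image-caption pairs.
#   train_stage(mllm, dit, pretrain_loader, opt, flow_matching_objective, num_steps=...)
# Stage 2: instruction-tune on curated BLIP3o-60k-style prompt-image pairs.
#   train_stage(mllm, dit, tune_loader, opt, flow_matching_objective, num_steps=...)
```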
The paper presents extensive evaluation results. On image understanding benchmarks (VQAv2, MMBench, MMMU, etc.), BLIP3-o 8B achieves state-of-the-art performance across most evaluated datasets, demonstrating the effectiveness of the frozen MLLM backbone and CLIP-based approach. For image generation, BLIP3-o achieves strong performance on metrics like GenEval and WISE. While automated metrics like DPG-Bench show mixed results compared to baselines, a human evaluation on DPG-Bench prompts shows that BLIP3-o significantly outperforms Janus Pro in both visual quality and prompt alignment.
A key practical contribution of the paper is the full open-sourcing of BLIP3-o models, code, training scripts, and both the pretraining and instruction tuning datasets. This aims to facilitate further research and development in unified multimodal AI.
Future work includes extending BLIP3-o to downstream applications such as image editing, visual dialogue, and interleaved generation, starting with enabling image reconstruction within the unified framework.
In summary, BLIP3-o is a practical implementation of a unified multimodal model, demonstrating that combining autoregressive and diffusion models with CLIP features and Flow Matching, trained sequentially, is an effective strategy for achieving strong performance in both image understanding and generation. The released resources provide a valuable foundation for researchers and practitioners working on multimodal AI systems.