The paper "BLIP3-o: A Family of Fully Open Unified Multimodal Models—Architecture, Training and Dataset" (Chen et al., 14 May 2025 ) introduces BLIP3-o, a suite of unified multimodal models designed to excel in both image understanding and image generation tasks within a single framework. Motivated by the potential of combining autoregressive and diffusion models, the authors conduct a systematic paper into the optimal design choices for such unified architectures.
The core concept involves an autoregressive model processing textual prompts to generate intermediate visual features, which then condition a diffusion model to produce image pixels. This pipeline aims to leverage the reasoning and in-context learning capabilities of LLMs for image generation. The paper investigates three key design axes:
- Image Representations: Comparing low-level pixel features (e.g., from VAEs) with high-level semantic features (e.g., from CLIP). The paper finds that CLIP image features offer more compact and informative representations, leading to higher training efficiency and better generative quality. CLIP features allow both understanding and generation tasks to operate within the same semantic space, facilitating unification.
- Training Objectives: Evaluating Mean Squared Error (MSE) versus Flow Matching for aligning the autoregressive model's output (predicted visual features) with the ground-truth image features. While MSE is simpler, it tends to produce deterministic outputs. Flow Matching, by incorporating stochasticity from diffusion, better models the distribution of continuous visual features, resulting in greater sample diversity and enhanced visual quality (a code sketch of both objectives follows this list).
- Training Strategies: Comparing joint multitask training (understanding and generation together) with sequential training (first understanding, then generation with a frozen autoregressive backbone). Sequential training is found to be practically advantageous as it preserves image understanding capabilities while developing strong image generation abilities, allowing dedicated training capacity for generation without negative interference from understanding tasks.
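To make the contrast between the two objectives concrete, the sketch below compares a plain MSE regression onto CLIP features with a rectified-flow-style Flow Matching loss. It is a minimal PyTorch illustration under assumed interfaces: `pred_net` and `velocity_net` stand in for the feature-prediction head, `cond` for conditioning features from the autoregressive backbone, and `clip_feats` for ground-truth CLIP image features; none of these names come from the BLIP3-o codebase.

```python
import torch
import torch.nn.functional as F

def mse_objective(pred_net, cond, clip_feats):
    """Deterministic regression: one predicted feature set per prompt."""
    pred = pred_net(cond)
    return F.mse_loss(pred, clip_feats)

def flow_matching_objective(velocity_net, cond, clip_feats):
    """Rectified-flow-style loss: regress the velocity that transports
    Gaussian noise to the CLIP-feature distribution, conditioned on cond."""
    noise = torch.randn_like(clip_feats)
    t = torch.rand(clip_feats.size(0), device=clip_feats.device)
    t_b = t.view(-1, *([1] * (clip_feats.dim() - 1)))
    x_t = (1.0 - t_b) * noise + t_b * clip_feats   # linear path: noise at t=0, data at t=1
    target_v = clip_feats - noise                   # constant velocity along that path
    pred_v = velocity_net(x_t, t, cond)             # prediction conditioned on MLLM features
    return F.mse_loss(pred_v, target_v)
```

The random noise and timestep sampling in the Flow Matching loss is what lets the model represent a distribution over visual features rather than a single point estimate, which underlies the diversity advantage noted above.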
Based on these findings, BLIP3-o adopts a design using a frozen LLM (specifically, Qwen 2.5 VL) as the autoregressive backbone for understanding, coupled with a diffusion transformer (based on the Lumina-Next architecture) that generates CLIP image features using Flow Matching. This approach leverages the MLLM's reasoning abilities while generating high-quality, semantically aligned images.
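The generation path implied by this design can be summarized in a short, hedged sketch: the frozen MLLM encodes the prompt, the diffusion transformer integrates the learned flow to sample CLIP image features, and a separate decoder renders pixels. Function names, the number of feature tokens, and the feature dimension below are illustrative placeholders, not the released BLIP3-o interfaces.

```python
import torch

@torch.no_grad()
def generate(prompt_tokens, mllm_encode, velocity_net, pixel_decoder,
             num_tokens=64, feat_dim=1024, steps=50):
    """Sample CLIP image features with the learned flow, then decode to pixels."""
    cond = mllm_encode(prompt_tokens)             # frozen autoregressive backbone
    x = torch.randn(1, num_tokens, feat_dim)      # start from Gaussian noise in feature space
    dt = 1.0 / steps
    for i in range(steps):                        # Euler integration of the flow ODE
        t = torch.full((1,), i * dt)
        x = x + dt * velocity_net(x, t, cond)     # move toward the CLIP-feature distribution
    return pixel_decoder(x)                       # separate decoder renders pixels from features
```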
The training of BLIP3-o follows a two-stage sequential strategy (a minimal code sketch of the recipe appears after the two stages below):
- Stage 1: Pretraining for Image Generation: The diffusion transformer is trained using the Flow Matching objective on a large dataset of image-caption pairs. For the 8B model, this includes 25 million open-source images (CC12M, SA-1B, JourneyDB) and 30 million proprietary images, with detailed captions generated by Qwen2.5-VL-7B-Instruct. The 4B model uses only the 25 million open-source images. Both datasets include a mix of detailed and shorter captions to improve generalization.
- Stage 2: Instruction Tuning for Image Generation: To improve the model's ability to generate specific content (e.g., complex human gestures, common objects, landmarks, simple text) and enhance visual aesthetics, a targeted instruction tuning dataset (BLIP3o-60k) is curated. This dataset consists of ~60k high-quality prompt-image pairs, many generated by prompting GPT-4o or sourced from JourneyDB and DALL·E 3. This stage helps the model align better with human preferences and improves visual quality. The authors note that the model adapts quickly to AI-generated data styles.
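Below is a minimal sketch of this sequential recipe, assuming PyTorch-style dataloaders that yield `(prompt_tokens, clip_feats)` pairs and reusing the `flow_matching_objective` sketched earlier; it is illustrative, not the released training code. The same loop serves both stages, only the data changes, and the MLLM backbone stays frozen throughout.

```python
import torch

def train_stage(mllm, dit, loader, optimizer, flow_matching_objective, num_steps):
    """One training stage: the MLLM stays frozen, only the diffusion transformer is updated."""
    mllm.eval()
    for p in mllm.parameters():
        p.requires_grad_(False)
    dit.train()
    for _, (prompt_tokens, clip_feats) in zip(range(num_steps), loader):
        with torch.no_grad():
            cond = mllm(prompt_tokens)                          # conditioning from the frozen backbone
        loss = flow_matching_objective(dit, cond, clip_feats)   # objective sketched earlier
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1: pretrain the diffusion transformer on large-scale image-caption pairs.
#   train_stage(mllm, dit, pretrain_loader, opt, flow_matching_objective, num_steps=...)
# Stage 2: instruction-tune on curated BLIP3o-60k-style prompt-image pairs.
#   train_stage(mllm, dit, tune_loader, opt, flow_matching_objective, num_steps=...)
```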
The paper presents extensive evaluation results. On image understanding benchmarks (VQAv2, MMBench, MMMU, etc.), BLIP3-o 8B achieves state-of-the-art performance across most evaluated datasets, demonstrating the effectiveness of the frozen MLLM backbone and CLIP-based approach. For image generation, BLIP3-o achieves strong performance on metrics like GenEval and WISE. While automated metrics like DPG-Bench show mixed results compared to baselines, a human evaluation on DPG-Bench prompts shows that BLIP3-o significantly outperforms Janus Pro in both visual quality and prompt alignment.
A key practical contribution of the paper is the full open-sourcing of BLIP3-o models, code, training scripts, and both the pretraining and instruction tuning datasets. This aims to facilitate further research and development in unified multimodal AI.
Future work includes extending BLIP3-o to downstream applications such as image editing, visual dialogue, and interleaved generation, starting with enabling image reconstruction within the unified framework.
In summary, BLIP3-o is a practical implementation of a unified multimodal model, demonstrating that combining autoregressive and diffusion models with CLIP features and Flow Matching, trained sequentially, is an effective strategy for achieving strong performance in both image understanding and generation. The released resources provide a valuable foundation for researchers and practitioners working on multimodal AI systems.