Emerging Properties in Unified Multimodal Pretraining (2505.14683v2)

Published 20 May 2025 in cs.CV

Abstract: Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/

Summary

  • The paper introduces BAGEL, a 7B active parameter open-source model with emergent capabilities in multimodal reasoning and semantic editing.
  • It employs a decoder-only mixture-of-transformer-experts architecture integrating distinct modules for multimodal understanding and generation without bottlenecks.
  • The multi-stage training on trillions of tokens and interleaved data leads to notable improvements in image understanding, generation, and intelligent editing.

This paper introduces BAGEL (Scalable Generative Cognitive Model), an open-source 7B active parameter (14B total) multimodal foundation model designed for unified understanding and generation across text, image, and video. The key innovation lies in scaling pretraining with large amounts of carefully curated interleaved multimodal data, leading to emergent capabilities in complex reasoning and manipulation.

Model Architecture:

BAGEL employs a decoder-only Mixture-of-Transformer-Experts (MoT) architecture. This design features two distinct transformer experts: one for multimodal understanding and another for multimodal generation. These experts operate on the same token sequence via shared self-attention in every layer, avoiding bottlenecks often found in other unified models.

  • Initialization: The LLM backbone is initialized from Qwen2.5, utilizing RMSNorm, SwiGLU, RoPE, and GQA. QK-Norm is added for training stability.
  • Visual Encoders:
    • Understanding: A SigLIP2-so400m/14 ViT encoder (fixed 384×384 resolution, interpolated for native aspect ratio inputs up to 980×980 using NaViT) processes images. A two-layer MLP connects ViT tokens to LLM hidden states.
    • Generation: A frozen pre-trained VAE from FLUX converts images to/from latent space (8× downsampling, 16 channels). Latents are patched (2×2) to match LLM hidden dimensions.
  • Token Representation: 2D positional encoding is applied to ViT and VAE tokens. Diffusion timestep embedding is added directly to VAE token hidden states.
  • Prediction: Text tokens are predicted autoregressively. Visual tokens use the Rectified Flow method.
  • Transformer Design Choice: The "Integrated Transformer" approach was chosen over "Quantized Autoregressive" and "External Diffuser" designs to enable bottleneck-free, long-context interaction between understanding and generation. Experiments showed MoT outperforming dense and Mixture-of-Experts (MoE) variants, especially in generation, by dedicating separate capacity to understanding (original LLM parameters for text/ViT tokens) and generation (replicated LLM parameters for VAE tokens).
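
The routing described above can be summarized in a minimal PyTorch sketch. Module names, dimensions, and the plain attention/FFN (in place of RMSNorm, SwiGLU, RoPE, QK-Norm, and 2D positional encoding) are illustrative simplifications, not the released BAGEL implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTLayer(nn.Module):
    """Simplified Mixture-of-Transformer-Experts layer: one parameter set
    ("expert") for understanding tokens (text/ViT), one for generation tokens
    (VAE latents). Self-attention is computed once over the shared sequence."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.ModuleDict({k: nn.Linear(dim, 3 * dim) for k in ("und", "gen")})
        self.proj = nn.ModuleDict({k: nn.Linear(dim, dim) for k in ("und", "gen")})
        self.ffn = nn.ModuleDict({
            k: nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
            for k in ("und", "gen")
        })

    def _route(self, branch: nn.ModuleDict, x, is_gen):
        # For clarity both experts run on all tokens and are mixed per token;
        # a real implementation would gather/scatter tokens per expert instead.
        return torch.where(is_gen[..., None], branch["gen"](x), branch["und"](x))

    def forward(self, x, is_gen, attn_mask=None):
        # x: [B, T, D]; is_gen: [B, T] bool, True for VAE (generation) tokens.
        B, T, D = x.shape
        q, k, v = self._route(self.qkv, x, is_gen).chunk(3, dim=-1)
        split_heads = lambda t: t.view(B, T, self.n_heads, D // self.n_heads).transpose(1, 2)
        # Shared self-attention over the full interleaved sequence (no bottleneck).
        attn = F.scaled_dot_product_attention(
            split_heads(q), split_heads(k), split_heads(v), attn_mask=attn_mask
        )
        h = x + self._route(self.proj, attn.transpose(1, 2).reshape(B, T, D), is_gen)
        return h + self._route(self.ffn, h, is_gen)
```

Understanding tokens (text and ViT) flow through one parameter set and generation tokens (VAE latents) through the replicated one, while the single shared attention call lets the two modalities exchange information directly in every layer.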

Generalized Causal Attention:

For training with interleaved multimodal data, BAGEL uses a generalized causal attention mechanism:

  • Tokens are partitioned into single-modality splits. Tokens in one split attend to all tokens in preceding splits.
  • Causal attention is used for text tokens within a split; bidirectional attention for vision tokens.
  • For generation tasks involving multiple images, three sets of visual tokens are prepared for each image:

    1. Noised VAE tokens: Used for Rectified-Flow training (MSE loss).
    2. Clean VAE tokens: Conditioning for subsequent generation.
    3. ViT tokens: Aid interleaved generation quality.
  • Subsequent tokens attend to clean VAE and ViT tokens of preceding images, not noised ones.

  • Diffusion Forcing: Applied for multi-image generation, with independent noise levels for different images. Consecutive images can be grouped with full attention within the group (shared noise level).
  • Implementation: PyTorch FlexAttention provides a ~2x speedup.
  • Inference: KV pairs of clean VAE and ViT tokens are cached. For classifier-free guidance, text, ViT, and clean VAE tokens are randomly dropped (probs 0.1, 0.5, 0.1 respectively).
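
The attention rules above can be expressed compactly as a FlexAttention mask function. The bookkeeping tensors and names below are illustrative assumptions rather than the authors' code, and the grouping of consecutive images for diffusion forcing is omitted:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

# Illustrative per-token bookkeeping (not the paper's actual data structures):
#   split_id:  [T] int  - index of the single-modality split a token belongs to
#   is_text:   [T] bool - True for text tokens
#   is_noised: [T] bool - True for noised VAE tokens (never used as context)

def generalized_causal_block_mask(split_id, is_text, is_noised, seq_len, device="cuda"):
    def mask_mod(b, h, q_idx, kv_idx):
        earlier_split = split_id[kv_idx] < split_id[q_idx]
        same_split = split_id[kv_idx] == split_id[q_idx]
        # Across splits: attend to all preceding splits, but skip noised VAE tokens
        # (subsequent tokens condition on the clean VAE / ViT copies instead).
        cross = earlier_split & ~is_noised[kv_idx]
        # Within a split: causal for text queries, bidirectional for vision tokens.
        within = same_split & (~is_text[q_idx] | (kv_idx <= q_idx))
        return cross | within

    return create_block_mask(mask_mod, B=None, H=None,
                             Q_LEN=seq_len, KV_LEN=seq_len, device=device)

# Usage sketch: out = flex_attention(q, k, v, block_mask=generalized_causal_block_mask(...))
```

FlexAttention evaluates such masks block-sparsely rather than materializing a dense score mask, which is the source of the speedup noted above.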

Data Curation and Preprocessing:

BAGEL is trained on trillions of tokens from diverse sources:

  • Text-Only Data: High-quality text corpora for maintaining language modeling capabilities.
  • Vision-Text Paired Data:
    • VLM Image-Text Pairs (500M samples, 0.5T tokens): Web alt-text/captions, filtered by CLIP similarity, resolution, text length, and deduplicated. Includes OCR, chart, and grounding data.
    • T2I Image-Text Pairs (1.6B samples, 2.6T tokens): High-quality pairs and synthetic data from models like FLUX, SD3, featuring diverse caption styles.
  • Vision-Text Interleaved Data:
    • Understanding Data (100M samples, 0.5T tokens): From VLM interleaved datasets.
    • Generation Data (Video: 45M samples, 0.7T tokens; Web: 20M samples, 0.4T tokens):
    • Video Sources: Public online videos, Koala36M (instructional), MVImgNet2.0 (multi-view).
    • Filtering: Temporal splitting, spatial cropping, quality filters (length, resolution, clarity, motion), deduplication.
    • Construction: Inter-frame captions (describing changes) generated by a distilled Qwen2.5-VL-7B model (max 30 tokens/caption). ~4 frames sampled per clip.
    • Web Sources: OmniCorpus (from Common Crawl), image editing datasets.
    • Filtering: Two-stage topic selection (LLM-trained fastText classifier, then LLM refinement). Rule-based filters (UI removal, resolution, clarity, text density, relevance, doc trimming, image quantity).
    • Construction: Caption-first strategy (Qwen2.5-VL-7B generates concise caption before each image). Inter-image text >300 tokens summarized by an LLM.
  • Reasoning-Augmented Data (500k examples):
    • Text-to-Image: Qwen2.5-72B generates query-guidance pairs and detailed prompts; FLUX.1-dev produces images.
    • Free-form Image Manipulation: VLM generates reasoning traces using source/target images, query, and an R1 example. Sources include editing datasets and video data.
    • Conceptual Edits: Three-stage VLM pipeline on web interleaved data to identify input/output pairs, generate questions, and assess quality, then produce reasoning traces.
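
To make the pair-level filtering above concrete, here is a minimal sketch of CLIP-similarity, resolution, and caption-length filtering for image-text pairs. The checkpoint and threshold values are illustrative assumptions, not numbers reported in the paper:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint and thresholds are placeholders for illustration only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

MIN_SIDE, MIN_CAPTION_WORDS, MIN_CLIP_SIM = 256, 3, 0.25

@torch.no_grad()
def keep_pair(image: Image.Image, caption: str) -> bool:
    # Resolution filter.
    if min(image.size) < MIN_SIDE:
        return False
    # Caption-length filter.
    if len(caption.split()) < MIN_CAPTION_WORDS:
        return False
    # CLIP image-text cosine similarity filter.
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum()) >= MIN_CLIP_SIM
```

Deduplication and the OCR/chart/grounding additions are separate pipeline stages not covered by this sketch.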

Training Strategy:

A multi-stage training approach is used:

  1. Alignment (5K steps, 4.9B tokens):
    • Aligns SigLIP2 ViT with Qwen2.5 LLM. Only MLP connector trained.
    • Image-text pairs only (captioning), 378×378 resolution.
    • LR: 1×10⁻³ (cosine), Warm-up: 250 steps.
  2. Pre-training (PT) (200K steps, 2.5T tokens):
    • All model parameters trainable except VAE. QK-Norm added.
    • Native resolution for understanding (224-980px) and generation (256-512px).
    • Data: text (5%), T2I pairs (60%), I2T pairs (10%), interleaved understanding (10%), interleaved video gen (10%), interleaved web gen (5%).
    • LR: 1.0×10⁻⁴ (constant), Warm-up: 2500 steps.
    • Loss weights: CE:MSE = 0.25:1. Max context: 16K.
  3. Continued Training (CT) (100K steps, 2.6T tokens):
    • Increased visual resolution (Und: 378-980px, Gen: 512-1024px).
    • Increased interleaved data ratio: text (5%), T2I (40%), I2T (10%), interleaved und (15%), interleaved video (15%), interleaved web (15%).
    • Diffusion timestep shift increased from 1.0 to 4.0.
    • LR: 1.0×10⁻⁴ (constant), Warm-up: 2500 steps. Max context: 40K.
  4. Supervised Fine-tuning (SFT) (15K steps, 72.7B tokens):
    • High-quality subsets from image-text pair and interleaved-generation datasets. Filtered LLaVA-OV and Mammoth-VL instruction data.
    • Data: text (5%), T2I (30%), I2T (5%), interleaved und (20%), interleaved video (20%), interleaved web (20%).
    • LR: 2.5×10⁻⁵ (constant), Warm-up: 500 steps.
    • EMA ratio: 0.995. Max context: 40K.
  • Optimizer: AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1.0×10⁻¹⁵). Gradient norm clip: 1.0.
  • Hyperparameter Tuning:
    • Data Sampling: Experiments showed higher generation data ratio (e.g., 4:1 gen:und) reduces MSE loss without harming CE loss.
    • Learning Rate: Larger LR benefits generation (MSE), smaller LR benefits understanding (CE). Balanced via loss weighting.
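
A minimal sketch of the combined objective implied by the loss weighting above: autoregressive cross-entropy on text tokens plus a rectified-flow velocity MSE on noised VAE latents, with CE scaled by 0.25. The interpolation convention and shapes are common choices rather than the paper's exact formulation, and the timestep shift and per-image noise levels used for diffusion forcing are omitted:

```python
import torch
import torch.nn.functional as F

CE_WEIGHT, MSE_WEIGHT = 0.25, 1.0  # CE:MSE = 0.25:1 as described above

def rectified_flow_sample(latents, t):
    """Noise clean VAE latents at timestep t in [0, 1] (one common convention:
    t=0 is pure noise, t=1 is data; the network regresses a constant velocity)."""
    noise = torch.randn_like(latents)
    t = t.view(-1, *([1] * (latents.dim() - 1)))   # broadcast one t per image
    x_t = (1.0 - t) * noise + t * latents          # linear interpolation path
    velocity = latents - noise                     # regression target
    return x_t, velocity

def unified_loss(text_logits, text_labels, velocity_pred, velocity_target):
    # Next-token cross-entropy over text positions only
    # (non-text positions carry ignore_index labels).
    ce = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(),
                         ignore_index=-100)
    # Rectified-flow MSE over the noised VAE tokens.
    mse = F.mse_loss(velocity_pred, velocity_target)
    return CE_WEIGHT * ce + MSE_WEIGHT * mse
```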

Evaluation and Emerging Properties:

  • Benchmarks:
    • Understanding: MME, MMBench, MM-Vet, MMMU, MathVista, MMVP.
    • Generation: GenEval, WISE.
    • Editing: GEdit-Bench.
  • IntelligentBench: A new benchmark (350 examples) for free-form image manipulation, evaluated by GPT-4o on fulfillment, consistency, and creativity.
  • Emerging Properties: Defined as abilities not present in earlier training but appearing later.
    • Understanding and generation saturate early (reaching 85% of peak performance at ~0.18T and ~0.68T tokens, respectively).
    • Editing tasks converge slower (~2.64T tokens).
    • Intelligent Editing emerges last (~3.61T tokens), showing significant improvement after 3T tokens, especially with higher resolution.
    • Removing ViT tokens minimally impacts GEdit-Bench but causes a 16% drop in Intelligent Edit, highlighting ViT's role in semantic reasoning for complex edits.
    • Qualitative examples show text rendering (e.g., "hello", "BAGEL") emerges between 1.5T-4.5T tokens.

Results:

  • Image Understanding: BAGEL (7B MoT) outperforms existing unified models and is competitive with specialized understanding models like Qwen2.5-VL and InternVL2.5.
  • Image Generation:
    • GenEval: 88% (with LLM rewriter), 82% (without), outperforming FLUX.1-dev, SD3-Medium, Janus-Pro, MetaQuery-XL.
    • WISE: 0.52 (0.70 with Self-CoT), outperforming prior open-source models.
  • Image Editing:
    • GEdit-Bench: Competitive with Step1X-Edit, outperforms Gemini 2.0.
    • IntelligentBench: 44.9 (55.3 with Self-CoT), significantly surpassing Step1X-Edit (14.9).
  • Generation/Editing with Thinking (Chain-of-Thought):
    • Improves WISE score from 0.52 to 0.70.
    • Improves IntelligentBench score from 44.9 to 55.3.
  • World Modeling: Fine-tuned with more video/navigation data, BAGEL demonstrates navigation, rotation, and multi-frame generation, generalizing to diverse domains.

Implementation Details & Considerations:

  • The model leverages existing strong open-source components like Qwen2.5, SigLIP2, and FLUX VAE.
  • The MoT architecture, while increasing total parameters, maintains similar FLOPs to a dense model of the "active parameter" size during training/inference.
  • Careful data curation, filtering, and construction are crucial, especially for interleaved and reasoning-augmented data. Small, specialized VLMs (distilled from larger ones) are used for scalable captioning.
  • The multi-stage training protocol allows for gradual scaling of resolution and data complexity.
  • Hyperparameters like data sampling ratios and learning rates need careful balancing in unified training.
  • The use of PyTorch FlexAttention offers significant speedups for the custom attention mechanism.
  • The "thinking" or CoT process, where the model generates intermediate reasoning steps, significantly boosts performance on complex tasks.

The paper concludes by highlighting BAGEL's SOTA performance and its unique emergent capabilities due to large-scale unified pretraining on interleaved data. The model, code, and data creation protocols are open-sourced.
