BEiT-3: Multimodal Transformer Models
- BEiT-3 models are general-purpose multimodal foundation models that integrate vision and language through a unified Multiway Transformer architecture.
- They employ unified masked modeling over images and text, enabling deep cross-modal fusion and state-of-the-art results in tasks like segmentation and VQA.
- Their modular design with modality-specific FFNs and scalable parameters supports efficient adaptation to diverse downstream vision-language applications.
BEiT-3 models are general-purpose multimodal foundation models based on a unified Multiway Transformer architecture, enabling deep fusion of vision and language for a broad spectrum of vision, language, and vision-language tasks. By applying masked "language" modeling objectives uniformly across images, text, and image-text pairs, BEiT-3 achieves state-of-the-art transfer performance and forms the backbone of several leading models in open-vocabulary segmentation and referring expression understanding (Wang et al., 2022; Chen et al., 2024; Zhang et al., 2024).
1. Multiway Transformer Backbone and Architectural Innovations
BEiT-3 introduces a Multiway Transformer backbone that generalizes and synthesizes the architectural principles of both vision and language Transformer encoders. Each layer comprises a shared multi-head self-attention (MHSA) module and a pool of modality-specific feed-forward network (FFN) experts:
- Input Embeddings: Images are split into non-overlapping patches (e.g., 16×16 or 14×14 pixels), which are linearly projected to obtain patch embeddings. Text is tokenized (e.g., with the XLM-RoBERTa tokenizer) into subword embeddings.
- Token Routing and Fusion: For joint tasks, visual and language tokens are concatenated; MHSA operates on the combined sequence, followed by routing to the modality-specific FFN (V-FFN, L-FFN, VL-FFN in top layers).
- Deep Fusion: Cross-modal alignment and interaction occur at every block via shared attention weights, with earlier layers preserving modality-specific pathways and top layers enabling modality fusion (Wang et al., 2022).
The base version uses 12 layers, a hidden size of 768, and 12 attention heads, while the large model employs 24 layers, a hidden size of 1024, and 16 heads; the full-scale release scales to 40 layers with a hidden size of 1408 and 16 attention heads.
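The shared-attention-plus-routed-FFN structure described above can be sketched in a few lines. The block below is a toy, single-head illustration with made-up dimensions and random weights, not the released parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size (BEiT-3 giant uses 1408)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # single-head self-attention, shared across modalities
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v @ Wo

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2  # ReLU here; GELU in practice

def multiway_block(x, modality_ids, params):
    # shared MHSA over the concatenated vision+language sequence
    h = x + attention(x, *params["attn"])
    # route each token to its modality-specific FFN expert
    out = np.empty_like(h)
    for mod in ("vision", "language"):
        sel = modality_ids == mod
        out[sel] = h[sel] + ffn(h[sel], *params[mod])
    return out

params = {
    "attn": [rng.standard_normal((d, d)) * 0.02 for _ in range(4)],
    "vision": [rng.standard_normal((d, 4 * d)) * 0.02,
               rng.standard_normal((4 * d, d)) * 0.02],
    "language": [rng.standard_normal((d, 4 * d)) * 0.02,
                 rng.standard_normal((4 * d, d)) * 0.02],
}

# 196 image patch tokens followed by 16 text tokens
tokens = rng.standard_normal((212, d))
modality_ids = np.array(["vision"] * 196 + ["language"] * 16)
y = multiway_block(tokens, modality_ids, params)
print(y.shape)  # (212, 64)
```

Note that only the FFN path branches per modality; the attention weights see the full joint sequence, which is what enables fusion at every depth.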
2. Unified Masked "Language" Modeling Pretraining
BEiT-3 conducts self-supervised pretraining by treating both images and texts as token sequences within a shared space, introducing three primary objectives:
- Imglish (Masked Image Modeling, MIM): Images are discretized via a VQ tokenizer; patches are randomly masked, and reconstruction is supervised with a cross-entropy loss over VQ tokens.
- English (Masked Text Modeling, MTM): Standard BERT/RoBERTa-style masking and prediction over text.
- Masked Parallel Image-Text (MM): Jointly masks both vision and language tokens in aligned image–text pairs, applying masked-prediction losses for both modalities.
The aggregate objective is L_total = L_MIM + L_MTM + L_MM (Wang et al., 2022). All objectives share a single backbone, driving unified, cross-modal pretraining.
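The unified objective can be illustrated as a masked cross-entropy applied to both token streams. Vocabulary sizes and masking ratios below are illustrative placeholders, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_ce(logits, targets, mask):
    # cross-entropy computed only at masked positions
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets][mask].mean()

# toy vocabularies: visual (VQ) codes and text subwords
V_img, V_txt = 8192, 30000
img_logits = rng.standard_normal((196, V_img))   # one prediction per patch
txt_logits = rng.standard_normal((16, V_txt))    # one prediction per subword
img_targets = rng.integers(0, V_img, 196)        # codes from the VQ tokenizer
txt_targets = rng.integers(0, V_txt, 16)

img_mask = np.zeros(196, bool); img_mask[:78] = True  # ~40% of patches masked
txt_mask = np.zeros(16, bool); txt_mask[:3] = True    # a few text tokens masked

loss_mim = masked_ce(img_logits, img_targets, img_mask)
loss_mtm = masked_ce(txt_logits, txt_targets, txt_mask)
loss_total = loss_mim + loss_mtm  # on paired data the MM term sums both parts
print(round(loss_total, 2))
```

Because both streams reduce to token prediction over a discrete vocabulary, the same backbone and loss machinery serve all three objectives.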
3. Model Scaling and Parameterization
BEiT-3 adopts a ViT-giant recipe for scaling, maintaining balance between capacity and efficiency:
| Model | Layers | Hidden | MLP | V-FFN | L-FFN | VL-FFN | Shared Attn | Total Params |
|---|---|---|---|---|---|---|---|---|
| BEiT-3 | 40 | 1408 | 6144 | 692 M | 692 M | 52 M | 317 M | 1.9 B |
Further breakdown:
- Patch size: 14×14 (a 16×16 grid, i.e., 256 patch tokens, for 224×224 images)
- Expert FFNs: Substantial capacity in both vision (692 M) and language (692 M) paths
- VL-FFN: Fused in the top three layers for maximal cross-modal information exchange
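The expert breakdown in the table can be checked arithmetically from the layer dimensions (weight matrices only; biases, norms, and embeddings account for the remaining fraction of the 1.9 B total):

```python
# Reproduce the BEiT-3 parameter breakdown from the table above.
layers, hidden, mlp = 40, 1408, 6144

attn_per_layer = 4 * hidden * hidden  # Q, K, V, and output projections
ffn_per_layer = 2 * hidden * mlp      # up- and down-projection matrices

shared_attn = layers * attn_per_layer
v_ffn = layers * ffn_per_layer        # vision expert in all 40 layers
l_ffn = layers * ffn_per_layer        # language expert in all 40 layers
vl_ffn = 3 * ffn_per_layer            # fused expert in the top 3 layers only

for name, n in [("Shared Attn", shared_attn), ("V-FFN", v_ffn),
                ("L-FFN", l_ffn), ("VL-FFN", vl_ffn)]:
    print(f"{name}: {n / 1e6:.0f} M")
# Shared Attn: 317 M, V-FFN: 692 M, L-FFN: 692 M, VL-FFN: 52 M
```

The match with the reported 317 M / 692 M / 692 M / 52 M figures confirms that the vision and language experts dominate the parameter budget, while the VL-FFN is comparatively cheap.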
This scaling approach yields consistent gains, with deeper, wider models outperforming shallower ones across downstream benchmarks (Wang et al., 2022, Zhang et al., 2024).
4. Downstream Applications: Open-Vocabulary and Referring Segmentation
BEiT-3 constitutes the core of several state-of-the-art segmentation and vision-language models.
OMTSeg: Open-Vocabulary Panoptic Segmentation
OMTSeg exploits the BEiT-3-Large backbone (frozen) for panoptic segmentation, with key structural choices:
- Multiway Fusion Encoder: Stacked "fusion" Transformer layers enable vision–language mixing at every block.
- Adapters and Prompt Tuning: Dense prediction is adapted via a spatial "visual adapter" (SFI) and prompt tuning, updating only selected prompt embeddings (e.g., category-token deltas).
- Segmentation Head: Adapts Mask2Former's query-based decoder; queries perform cross-attention to both the visual and linguistic streams, followed by mask prediction, contrastive alignment, and objectness scoring.
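The head's query pathway can be sketched as a dot-product mask decoder in the Mask2Former style. Shapes, single-head attention, and random features below are toy assumptions; the real head uses masked multi-scale attention over learned features:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_queries, h, w = 64, 10, 32, 32

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

vis_feats = rng.standard_normal((h * w, d))    # per-pixel embeddings
txt_feats = rng.standard_normal((8, d))        # category-name token embeddings
queries = rng.standard_normal((n_queries, d))  # learned object queries

# queries cross-attend to both streams (visual + linguistic), as in OMTSeg
memory = np.concatenate([vis_feats, txt_feats])
attn = softmax(queries @ memory.T / np.sqrt(d))
queries = queries + attn @ memory

# each refined query dot-products with pixel embeddings to score one mask
mask_logits = (queries @ vis_feats.T).reshape(n_queries, h, w)
masks = 1 / (1 + np.exp(-mask_logits)) > 0.5
print(masks.shape)  # (10, 32, 32)
```

Attending to the text stream is what lets the same query set cover an open vocabulary: category names enter the decoder as ordinary tokens rather than a fixed classifier head.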
Empirically, OMTSeg achieves state-of-the-art open-vocabulary mIoU and panoptic quality (PQ) on multiple datasets, while being highly parameter-efficient (0.72 B) compared with CLIP-based models:
| Method | Params | ADE20K PQ/AP/mIoU | COCO PQ/AP/mIoU |
|---|---|---|---|
| ODISE | 1.5 B | 22.6 / 14.4 / 29.9 | 55.4 / 46.0 / 65.2 |
| OMTSeg (Ours) | 0.72 B | 27.5 / 17.4 / 34.8 | 54.9 / 45.0 / 64.0 |
Ablative analysis highlights large performance drops when omitting cross-modal attention (PQ 27.5→18.4) or the visual adapter (PQ 27.5→8.8) (Chen et al., 2024).
EVF-SAM: Referring Expression Segmentation
EVF-SAM demonstrates the impact of early-fusion BEiT-3 backbones for text-prompted Segment Anything (SAM):
- Early Fusion: Runs self-attention over concatenated image and text streams at every layer, outperforming late-fusion designs and unimodal text encoders.
- Prompt Generation: A [CLS] token from the fused stream is projected as a prompt token for SAM’s sparse prompt interface.
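The prompt-generation step reduces to projecting the fused [CLS] embedding into SAM's prompt space. The linear head and dimensions below are sketch assumptions (EVF-SAM's exact projection may differ; SAM's prompt dimension of 256 is the standard released value):

```python
import numpy as np

rng = np.random.default_rng(0)
d_fusion, d_sam = 1024, 256  # BEiT-3-Large hidden size; SAM prompt dim

# fused sequence from the early-fusion encoder: [CLS] + patch + text tokens
fused = rng.standard_normal((1 + 196 + 12, d_fusion))
cls_token = fused[0]

# hypothetical linear projection into SAM's sparse prompt interface
W = rng.standard_normal((d_fusion, d_sam)) * 0.02
prompt_token = cls_token @ W  # consumed by SAM like a point/box prompt
print(prompt_token.shape)  # (256,)
```

Because the [CLS] token already aggregates jointly attended image and text evidence, a single projected vector suffices as the segmentation prompt.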
Reported RefCOCO (testA) cIoU metrics:
| Model | cIoU |
|---|---|
| CLIP (text+image, late fuse) | 67.9 |
| BEiT-3 (early fuse, 1–24 layers) | 83.7 |
| ViLT (early fusion) | 75.3 |
EVF-SAM achieves state-of-the-art results with fewer parameters (1.32B vs. 7.7B for LISA), and ablation studies confirm that BEiT-3-generated prompts, even with minimal fine-tuning, are sufficient for high segmentation accuracy (Zhang et al., 2024).
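cIoU, the metric reported above, accumulates intersections and unions over the entire split before dividing once, rather than averaging per-image IoUs:

```python
import numpy as np

def ciou(preds, gts):
    # cumulative IoU: sum intersections and unions over the whole split,
    # then divide once (standard referring-segmentation metric)
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union

a = np.zeros((4, 4), bool); a[:2] = True  # predicted mask (8 pixels)
b = np.zeros((4, 4), bool); b[:3] = True  # ground-truth mask (12 pixels)
print(ciou([a], [b]))  # 8 / 12 ≈ 0.667
```

Cumulative aggregation weights large objects more heavily than mean IoU, which is worth keeping in mind when comparing numbers across papers.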
5. Pretraining Data, Protocols, and Hyperparameters
BEiT-3 pretraining leverages only publicly available data:
- Image–Text pairs: 21M (CC12M, CC3M, SBU, COCO, VG)
- Images: 14M (ImageNet-21K)
- Text: 160 GB diverse corpora
Key pretraining choices include:
- Optimizer: AdamW (β₁ = 0.9, β₂ = 0.98, ε = 1e-6)
- Weight decay: 0.05
- Learning rate schedule: cosine decay, peak 1e-3, 10k warmup steps
- Augmentations: random crop, flip, color jitter
- Batch: 6144 (2048 each for images, text, and image–text pairs)
- Resolution: 224×224 for pretrain; up to 896 or 1280 for various finetuning tasks
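The warmup-plus-cosine schedule above can be written directly; the peak and warmup length follow the listed hyperparameters, while the total step count is an illustrative assumption:

```python
import math

def lr_at(step, peak=1e-3, warmup=10_000, total=1_000_000):
    # linear warmup to the peak, then cosine decay toward zero
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(5_000))   # halfway through warmup: 5e-4
print(lr_at(10_000))  # peak: 1e-3
```

Warmup stabilizes early large-batch training; the cosine tail then anneals the rate smoothly instead of in steps.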
Fine-tuning protocols adjust schedules, drop rates, and data augmentation for task specialization across segmentation, detection, VQA, and retrieval. Mask2Former adapters and prompt tuning are commonly used for dense prediction tasks (Wang et al., 2022, Chen et al., 2024).
6. Quantitative Performance and Comparative Analysis
Across a spectrum of benchmarks, BEiT-3 models reach or surpass prior state of the art in both vision and vision-language domains:
Vision Tasks (No Extra Data):
| Task | Dataset | Metric | Previous SOTA | BEiT-3 |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | mIoU (+MS) | 61.4 (FD-SwinV2) | 62.8 (+1.4) |
| Object Detection | COCO | AP | 63.3 (DINO) | 63.7 (+0.4) |
| Instance Segmentation | COCO | AP<sup>mask</sup> | 54.7 (Mask DINO) | 54.8 (+0.1) |
| Classification | ImageNet | Top-1 Acc. | 89.0 (FD-CLIP) | 89.6 (+0.6) |
Vision–Language Tasks:
- Visual Reasoning (NLVR2): 92.6 test [CoCa: 87.0]
- VQA (VQAv2): 84.0 [CoCa: 82.3]
- Image Captioning (COCO Karpathy): CIDEr 147.6 [OFA: 145.3]
- Retrieval (Flickr30K, COCO): 98.0/94.9 img→txt R@1 and 90.3/81.5 txt→img R@1 (finetuned/zero-shot) (Wang et al., 2022)
Performance gains are attributed to the unique early and deep fusion properties of the architecture, lightweight adapters, and prompt tuning mechanisms, as evidenced in OMTSeg and EVF-SAM deployments.
7. Architectural Advantages, Limitations, and Research Outlook
BEiT-3’s design offers several documented advantages:
- Layer-wise vision-language cross-attention fosters rich regional and contextual feature extraction, exceeding the capability of two-stream CLIP-based systems (Chen et al., 2024).
- Prompt specialization via trainable tokens enables superior open-vocabulary generalization (Chen et al., 2024).
- Early cross-modal attention enhances fine-grained visual-linguistic alignment, especially beneficial for referring segmentation (Zhang et al., 2024).
- Parameter efficiency allows high-quality dense tasks with reduced backbone size (0.7–1.3B vs. 7–13B for LLM alternatives).
Empirical ablations demonstrate that omitting cross-modal attention or the spatial adapter leads to dramatic performance reduction, and in referring segmentation, excluding BEiT-3-generated prompts slashes accuracy by over 60 cIoU points. This suggests cross-modal interaction is essential to BEiT-3's success (Chen et al., 2024; Zhang et al., 2024).
A plausible implication is that future models may benefit from scaling BEiT-3’s early-fusion backbones further and exploring more sophisticated, task-specific adapters within the Multiway Transformer framework. The modular, expert-driven design supports extensibility and domain adaptation across the vision-language spectrum.