BEiT-3: Multimodal Transformer Models
- BEiT-3 models are general-purpose multimodal foundation models that integrate vision and language through a unified Multiway Transformer architecture.
- They employ unified masked modeling over images and text, enabling deep cross-modal fusion and state-of-the-art results in tasks like segmentation and VQA.
- Their modular design with modality-specific FFNs and scalable parameters supports efficient adaptation to diverse downstream vision-language applications.
BEiT-3 models are general-purpose multimodal foundation models based on a unified Multiway Transformer architecture, enabling deep fusion of vision and language for a broad spectrum of vision, language, and vision-language tasks. By applying masked "language" modeling objectives uniformly across images, text, and image-text pairs, BEiT-3 achieves state-of-the-art transfer performance and forms the backbone of several leading models in open-vocabulary segmentation and referring expression understanding (Wang et al., 2022; Chen et al., 2024; Zhang et al., 2024).
1. Multiway Transformer Backbone and Architectural Innovations
BEiT-3 introduces a Multiway Transformer backbone that generalizes and synthesizes the architectural principles of both vision and language Transformer encoders. Each layer comprises a shared multi-head self-attention (MHSA) module and a pool of modality-specific feed-forward network (FFN) experts:
- Input Embeddings: Images are split into non-overlapping patches (e.g., 16×16 or 14×14 pixels), which are linearly projected to obtain patch embeddings. Text is tokenized (e.g., with the XLM-RoBERTa tokenizer) into subword embeddings.
- Token Routing and Fusion: For joint tasks, visual and language tokens are concatenated; MHSA operates on the combined sequence, followed by routing to the modality-specific FFN (V-FFN, L-FFN, VL-FFN in top layers).
- Deep Fusion: Cross-modal alignment and interaction occur at every block via shared attention weights, with earlier layers preserving modality-specific pathways and top layers enabling modality fusion (Wang et al., 2022).
The base version uses 12 layers, a hidden size of 768, and 12 attention heads, while the large model employs 24 layers, a hidden size of 1024, and 16 heads; the full-scale release scales to 40 layers with a hidden size of 1408 and 16 attention heads.
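The shared-attention-plus-routed-FFN structure described above can be sketched in a few lines. The block below is a toy, single-head illustration with made-up dimensions and random weights, not the released parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size (BEiT-3 giant uses 1408)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # single-head self-attention, shared across modalities
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v @ Wo

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2  # ReLU here; GELU in practice

def multiway_block(x, modality_ids, params):
    # shared MHSA over the concatenated vision+language sequence
    h = x + attention(x, *params["attn"])
    # route each token to its modality-specific FFN expert
    out = np.empty_like(h)
    for mod in ("vision", "language"):
        sel = modality_ids == mod
        out[sel] = h[sel] + ffn(h[sel], *params[mod])
    return out

params = {
    "attn": [rng.standard_normal((d, d)) * 0.02 for _ in range(4)],
    "vision": [rng.standard_normal((d, 4 * d)) * 0.02,
               rng.standard_normal((4 * d, d)) * 0.02],
    "language": [rng.standard_normal((d, 4 * d)) * 0.02,
                 rng.standard_normal((4 * d, d)) * 0.02],
}

# 196 image patch tokens followed by 16 text tokens
tokens = rng.standard_normal((212, d))
modality_ids = np.array(["vision"] * 196 + ["language"] * 16)
y = multiway_block(tokens, modality_ids, params)
print(y.shape)  # (212, 64)
```

Note that only the FFN path branches per modality; the attention weights see the full joint sequence, which is what enables fusion at every depth.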
2. Unified Masked "Language" Modeling Pretraining
BEiT-3 conducts self-supervised pretraining by treating both images and texts as token sequences within a shared space, introducing three primary objectives:
- Imglish (Masked Image Modeling, MIM): Images are discretized via a VQ tokenizer; patches are randomly masked, and reconstruction is supervised with a cross-entropy loss over VQ tokens.
- English (Masked Text Modeling, MTM): Standard BERT/RoBERTa-style masking and prediction over text.
- Masked Parallel Image-Text (MM): Jointly masks both vision and language tokens in aligned image–text pairs, applying masked-prediction losses for both modalities.
The aggregate objective is L_total = L_MIM + L_MTM + L_MM (Wang et al., 2022). All objectives share a single backbone, driving unified, cross-modal pretraining.
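The unified objective can be illustrated as a masked cross-entropy applied to both token streams. Vocabulary sizes and masking ratios below are illustrative placeholders, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_ce(logits, targets, mask):
    # cross-entropy computed only at masked positions
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets][mask].mean()

# toy vocabularies: visual (VQ) codes and text subwords
V_img, V_txt = 8192, 30000
img_logits = rng.standard_normal((196, V_img))   # one prediction per patch
txt_logits = rng.standard_normal((16, V_txt))    # one prediction per subword
img_targets = rng.integers(0, V_img, 196)        # codes from the VQ tokenizer
txt_targets = rng.integers(0, V_txt, 16)

img_mask = np.zeros(196, bool); img_mask[:78] = True  # ~40% of patches masked
txt_mask = np.zeros(16, bool); txt_mask[:3] = True    # a few text tokens masked

loss_mim = masked_ce(img_logits, img_targets, img_mask)
loss_mtm = masked_ce(txt_logits, txt_targets, txt_mask)
loss_total = loss_mim + loss_mtm  # on paired data the MM term sums both parts
print(round(loss_total, 2))
```

Because both streams reduce to token prediction over a discrete vocabulary, the same backbone and loss machinery serve all three objectives.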
3. Model Scaling and Parameterization
BEiT-3 adopts a ViT-giant recipe for scaling, maintaining balance between capacity and efficiency:
| Model | Layers | Hidden | MLP | V-FFN | L-FFN | VL-FFN | Shared Attn | Total Params |
|---|---|---|---|---|---|---|---|---|
| BEiT-3 | 40 | 1408 | 6144 | 692 M | 692 M | 52 M | 317 M | 1.9 B |
Further breakdown:
- Patch size: 14×14 (a 16×16 grid, i.e., 256 patch tokens, for 224×224 images)
- Expert FFNs: Substantial capacity in both vision (692 M) and language (692 M) paths
- VL-FFN: Fused in the top three layers for maximal cross-modal information exchange
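The expert breakdown in the table can be checked arithmetically from the layer dimensions (weight matrices only; biases, norms, and embeddings account for the remaining fraction of the 1.9 B total):

```python
# Reproduce the BEiT-3 parameter breakdown from the table above.
layers, hidden, mlp = 40, 1408, 6144

attn_per_layer = 4 * hidden * hidden  # Q, K, V, and output projections
ffn_per_layer = 2 * hidden * mlp      # up- and down-projection matrices

shared_attn = layers * attn_per_layer
v_ffn = layers * ffn_per_layer        # vision expert in all 40 layers
l_ffn = layers * ffn_per_layer        # language expert in all 40 layers
vl_ffn = 3 * ffn_per_layer            # fused expert in the top 3 layers only

for name, n in [("Shared Attn", shared_attn), ("V-FFN", v_ffn),
                ("L-FFN", l_ffn), ("VL-FFN", vl_ffn)]:
    print(f"{name}: {n / 1e6:.0f} M")
# Shared Attn: 317 M, V-FFN: 692 M, L-FFN: 692 M, VL-FFN: 52 M
```

The match with the reported 317 M / 692 M / 692 M / 52 M figures confirms that the vision and language experts dominate the parameter budget, while the VL-FFN is comparatively cheap.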
This scaling approach yields consistent gains, with deeper, wider models outperforming shallower ones across downstream benchmarks (Wang et al., 2022, Zhang et al., 2024).
4. Downstream Applications: Open-Vocabulary and Referring Segmentation
BEiT-3 constitutes the core of several state-of-the-art segmentation and vision-language models.
OMTSeg: Open-Vocabulary Panoptic Segmentation
OMTSeg exploits the BEiT-3-Large backbone (frozen) for panoptic segmentation, with key structural choices:
- Multiway Fusion Encoder: Stacked "fusion" Transformer layers enable vision–language mixing at every block.
- Adapters and Prompt Tuning: Dense prediction is adapted via a spatial "visual adapter" (SFI) and prompt tuning, updating only selected prompt embeddings (e.g., category-token deltas).
- Segmentation Head: Adapts Mask2Former's query-based decoder; queries perform cross-attention to both the visual and linguistic streams, followed by mask prediction, contrastive alignment, and objectness scoring.
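The head's query pathway can be sketched as a dot-product mask decoder in the Mask2Former style. Shapes, single-head attention, and random features below are toy assumptions; the real head uses masked multi-scale attention over learned features:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_queries, h, w = 64, 10, 32, 32

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

vis_feats = rng.standard_normal((h * w, d))    # per-pixel embeddings
txt_feats = rng.standard_normal((8, d))        # category-name token embeddings
queries = rng.standard_normal((n_queries, d))  # learned object queries

# queries cross-attend to both streams (visual + linguistic), as in OMTSeg
memory = np.concatenate([vis_feats, txt_feats])
attn = softmax(queries @ memory.T / np.sqrt(d))
queries = queries + attn @ memory

# each refined query dot-products with pixel embeddings to score one mask
mask_logits = (queries @ vis_feats.T).reshape(n_queries, h, w)
masks = 1 / (1 + np.exp(-mask_logits)) > 0.5
print(masks.shape)  # (10, 32, 32)
```

Attending to the text stream is what lets the same query set cover an open vocabulary: category names enter the decoder as ordinary tokens rather than a fixed classifier head.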
Empirically, OMTSeg achieves state-of-the-art open-vocabulary mIoU and panoptic quality (PQ) on multiple datasets, while being highly parameter-efficient (0.72 B) compared with CLIP-based models:
| Method | Params | ADE20K PQ/AP/mIoU | COCO PQ/AP/mIoU |
|---|---|---|---|
| ODISE | 1.5 B | 22.6 / 14.4 / 29.9 | 55.4 / 46.0 / 65.2 |
| OMTSeg (Ours) | 0.72 B | 27.5 / 17.4 / 34.8 | 54.9 / 45.0 / 64.0 |
Ablative analysis highlights large performance drops when omitting cross-modal attention (PQ 27.5→18.4) or the visual adapter (PQ 27.5→8.8) (Chen et al., 2024).
EVF-SAM: Referring Expression Segmentation
EVF-SAM demonstrates the impact of early-fusion BEiT-3 backbones for text-prompted Segment Anything (SAM):
- Early Fusion: Runs self-attention over concatenated image and text streams at every layer, outperforming late-fusion designs and unimodal text encoders.
- Prompt Generation: A [CLS] token from the fused stream is projected as a prompt token for SAM’s sparse prompt interface.
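The prompt-generation step reduces to projecting the fused [CLS] embedding into SAM's prompt space. The linear head and dimensions below are sketch assumptions (EVF-SAM's exact projection may differ; SAM's prompt dimension of 256 is the standard released value):

```python
import numpy as np

rng = np.random.default_rng(0)
d_fusion, d_sam = 1024, 256  # BEiT-3-Large hidden size; SAM prompt dim

# fused sequence from the early-fusion encoder: [CLS] + patch + text tokens
fused = rng.standard_normal((1 + 196 + 12, d_fusion))
cls_token = fused[0]

# hypothetical linear projection into SAM's sparse prompt interface
W = rng.standard_normal((d_fusion, d_sam)) * 0.02
prompt_token = cls_token @ W  # consumed by SAM like a point/box prompt
print(prompt_token.shape)  # (256,)
```

Because the [CLS] token already aggregates jointly attended image and text evidence, a single projected vector suffices as the segmentation prompt.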
Reported RefCOCO (testA) cIoU metrics:
| Model | cIoU |
|---|---|
| CLIP (text+image, late fuse) | 67.9 |
| BEiT-3 (early fuse, 1–24 layers) | 83.7 |
| ViLT (early fusion) | 75.3 |
EVF-SAM achieves state-of-the-art results with fewer parameters (1.32B vs. 7.7B for LISA), and ablation studies confirm that BEiT-3-generated prompts, even with minimal fine-tuning, are sufficient for high segmentation accuracy (Zhang et al., 2024).
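cIoU, the metric reported above, accumulates intersections and unions over the entire split before dividing once, rather than averaging per-image IoUs:

```python
import numpy as np

def ciou(preds, gts):
    # cumulative IoU: sum intersections and unions over the whole split,
    # then divide once (standard referring-segmentation metric)
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union

a = np.zeros((4, 4), bool); a[:2] = True  # predicted mask (8 pixels)
b = np.zeros((4, 4), bool); b[:3] = True  # ground-truth mask (12 pixels)
print(ciou([a], [b]))  # 8 / 12 ≈ 0.667
```

Cumulative aggregation weights large objects more heavily than mean IoU, which is worth keeping in mind when comparing numbers across papers.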
5. Pretraining Data, Protocols, and Hyperparameters
BEiT-3 pretraining leverages only publicly available data:
- Image–Text pairs: 21M (CC12M, CC3M, SBU, COCO, VG)
- Images: 14M (ImageNet-21K)
- Text: 160 GB diverse corpora
Key pretraining choices include:
- Optimizer: AdamW (β₁ = 0.9, β₂ = 0.98, ε = 1e-6)
- Weight decay: 0.05
- Learning rate schedule: cosine decay, peak 1e-3, 10k warmup steps
- Augmentations: random crop, flip, color jitter
- Batch: 6144 (2048 each for images, text, and image–text pairs)
- Resolution: 224×224 for pretrain; up to 896 or 1280 for various finetuning tasks
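The warmup-plus-cosine schedule above can be written directly; the peak and warmup length follow the listed hyperparameters, while the total step count is an illustrative assumption:

```python
import math

def lr_at(step, peak=1e-3, warmup=10_000, total=1_000_000):
    # linear warmup to the peak, then cosine decay toward zero
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(5_000))   # halfway through warmup: 5e-4
print(lr_at(10_000))  # peak: 1e-3
```

Warmup stabilizes early large-batch training; the cosine tail then anneals the rate smoothly instead of in steps.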
Fine-tuning protocols adjust schedules, drop rates, and data augmentation for task specialization across segmentation, detection, VQA, and retrieval. Mask2Former adapters and prompt tuning are commonly used for dense prediction tasks (Wang et al., 2022, Chen et al., 2024).
6. Quantitative Performance and Comparative Analysis
Across a spectrum of benchmarks, BEiT-3 models reach or surpass prior state of the art in both vision and vision-language domains:
Vision Tasks (No Extra Data):
| Task | Dataset | Metric | Previous SOTA | BEiT-3 |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | mIoU (+MS) | 61.4 (FD-SwinV2) | 62.8 (+1.4) |
| Object Detection | COCO | AP | 63.3 (DINO) | 63.7 (+0.4) |
| Instance Segmentation | COCO | AP<sup>mask</sup> | 54.7 (Mask DINO) | 54.8 (+0.1) |
| Classification | ImageNet | Top-1 Acc. | 89.0 (FD-CLIP) | 89.6 (+0.6) |
Vision–Language Tasks:
- Visual Reasoning (NLVR2): 92.6 test [CoCa: 87.0]
- VQA (VQAv2): 84.0 [CoCa: 82.3]
- Image Captioning (COCO Karpathy): CIDEr 147.6 [OFA: 145.3]
- Retrieval (Flickr30K, COCO): 98.0/94.9 img→txt R@1 and 90.3/81.5 txt→img R@1 (finetuned/zero-shot) (Wang et al., 2022)
Performance gains are attributed to the unique early and deep fusion properties of the architecture, lightweight adapters, and prompt tuning mechanisms, as evidenced in OMTSeg and EVF-SAM deployments.
7. Architectural Advantages, Limitations, and Research Outlook
BEiT-3’s design offers several documented advantages:
- Layer-wise vision-language cross-attention fosters rich regional and contextual feature extraction, exceeding the capability of two-stream CLIP-based systems (Chen et al., 2024).
- Prompt specialization via trainable tokens enables superior open-vocabulary generalization (Chen et al., 2024).
- Early cross-modal attention enhances fine-grained visual-linguistic alignment, especially beneficial for referring segmentation (Zhang et al., 2024).
- Parameter efficiency allows high-quality dense tasks with reduced backbone size (0.7–1.3B vs. 7–13B for LLM alternatives).
Empirical ablations demonstrate that omitting cross-modal attention or the spatial adapter leads to dramatic performance reduction, and in referring segmentation, excluding BEiT-3-generated prompts slashes accuracy by over 60 cIoU points. This suggests cross-modal interaction is essential to BEiT-3's success (Chen et al., 2024; Zhang et al., 2024).
A plausible implication is that future models may benefit from scaling BEiT-3’s early-fusion backbones further and exploring more sophisticated, task-specific adapters within the Multiway Transformer framework. The modular, expert-driven design supports extensibility and domain adaptation across the vision-language spectrum.