Vision-and-Language BERT Models

Updated 15 April 2026

Vision-and-Language BERT models are hybrid neural architectures that jointly encode visual and linguistic inputs into a unified embedding space.
They employ various fusion strategies—single-stream, two-stream, and hybrid—to integrate region features and text tokens via Transformer layers.
These models achieve state-of-the-art performance in tasks like visual question answering and image captioning by leveraging multimodal pretraining and innovative alignment objectives.

Vision-and-Language (V&L) BERT models are neural architectures that extend the standard BERT paradigm to jointly encode and fuse visual and linguistic information. By leveraging large-scale multimodal pretraining with objectives tailored for cross-modal alignment, these models learn deep, generic representations that support a broad spectrum of vision-language tasks including visual question answering, visual commonsense reasoning, image-text retrieval, captioning, and dialog. V&L BERT architectures have established the state of the art in vision-language understanding and constitute a foundational research axis at the intersection of computer vision and natural language processing.

1. Architecture and Fusion Mechanisms

The core design challenge in V&L BERT models is the integration (fusion) of high-dimensional visual and linguistic features into a common embedding space within a Transformer architecture (Long et al., 2022, Gwinnup et al., 2023).

Textual Embedding: Standard WordPiece tokenization is applied; embeddings consist of token, position, and segment (modality) encodings, with special tokens ([CLS], [SEP], [MASK]) to delimit sequences and serve as pooling vectors.
Visual Embedding: Input images are encoded using object detectors (e.g., Faster R-CNN) to yield region-of-interest (RoI) features (Lu et al., 2019, Li et al., 2019), or alternatively via grid/patch tokenization as in vision transformers (Long et al., 2022). RoI features are often augmented with spatial location encodings.
Fusion Mechanisms:
- Single-stream models (VisualBERT, UNITER, VL-BERT, OSCAR): All tokens (text + visual) are concatenated at input and passed through a unified multi-layer Transformer. This architecture enables full self-attention at all depths, allowing direct intra- and inter-modal alignment (Li et al., 2019, Su et al., 2019).
- Two-stream models (ViLBERT, LXMERT): Separate Transformers for vision and language are periodically coupled via cross-modal co-attention blocks, so unimodal contextualization precedes or alternates with cross-modal fusion (Lu et al., 2019, Long et al., 2022).
- Hybrid models: Late-fusion or adapter-based models add small cross-modal modules on top of pretrained unimodal encoders. The backbones for one or both modalities stay frozen, and fusion happens only in lightweight layers late in the network (Gwinnup et al., 2023).
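
As a rough illustration of the single-stream layout described above, the sketch below places text tokens and region slots in one joint sequence and tags each slot with a segment (modality) id. The helper name `build_single_stream_input`, the `[IMG]` placeholder, and the exact ordering are assumptions made for illustration; real models differ in details such as separator placement and how region features are embedded.

```python
def build_single_stream_input(text_tokens, num_regions):
    """Lay out one joint sequence: [CLS] text [SEP], then one slot per region.

    Hypothetical helper for illustration only. A real model would embed each
    slot as a vector (token + position + segment for text, RoI feature plus
    location encoding for regions).
    """
    tokens = ["[CLS]"] + list(text_tokens) + ["[SEP]"]
    segment_ids = [0] * len(tokens)       # 0 marks the text modality
    tokens += ["[IMG]"] * num_regions     # placeholder for RoI features
    segment_ids += [1] * num_regions      # 1 marks the visual modality
    positions = list(range(len(tokens)))
    return tokens, segment_ids, positions


tokens, segment_ids, positions = build_single_stream_input(
    ["a", "dog", "on", "grass"], num_regions=3
)
print(tokens)
print(segment_ids)
```

Because every slot sits in the same sequence, each self-attention layer can relate any text token to any region. A two-stream model would instead keep two such sequences and exchange information only inside its co-attention blocks.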

A summary table of input processing is as follows:

Modality	Tokenization/Embedding	Notes
Text	WordPiece, Positional/Segment	[CLS], [SEP], [MASK], shared vocabulary
Vision	RoI (Faster R-CNN), Patch/Grid	Optional spatial encoding; grid for ViT
Cross-modal	Concatenation (1-stream)	Full or periodic co-attention (2-stream)

2. Pretraining Objectives and Loss Functions

V&L BERTs are pretrained with composite objectives to induce both intra- and intermodal alignment (Long et al., 2022, Gwinnup et al., 2023). Representative objectives include:

Masked Language Modeling (MLM): Predict randomly masked text tokens from context (including both text and visual features).

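$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i\in \mathcal{M}} \log P(w_i \mid \tilde{w}_{\setminus \mathcal{M}},\,V)$

Here $\mathcal{M}$ is the set of masked positions, $\tilde{w}_{\setminus \mathcal{M}}$ denotes the unmasked text tokens on both sides of position $i$, and $V$ denotes the visual features.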
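
As a small worked example of the MLM objective above, the sketch below samples masked positions and sums the negative log-probabilities over those positions only. The function names and the probability values are hypothetical; the 15% masking rate is the one used by the original BERT and is assumed here rather than taken from any specific V&L model.

```python
import math
import random


def choose_masked_positions(num_tokens, mask_prob=0.15, seed=0):
    """Pick positions to mask independently with probability mask_prob."""
    rng = random.Random(seed)
    return [i for i in range(num_tokens) if rng.random() < mask_prob]


def mlm_loss(prob_of_true_token, masked_positions):
    """Sum of negative log-probabilities over the masked positions only."""
    return -sum(math.log(prob_of_true_token[i]) for i in masked_positions)


positions = choose_masked_positions(20)

# Hypothetical model outputs: probability assigned to the correct token
# at two masked positions (values chosen by hand for illustration).
probs = {1: 0.5, 3: 0.25}
loss = mlm_loss(probs, masked_positions=[1, 3])
print(positions, round(loss, 4))
```

In practice special tokens such as [CLS] and [SEP] would be excluded from masking, and the probabilities would come from a softmax over the vocabulary conditioned on both the unmasked text and the visual features.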

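Masked Region Modeling (MRM): Predict the labels of masked visual regions (class, class distribution, or features) from the remaining visual and text context. For the classification variant, summing over masked regions $j \in \mathcal{R}$:

$\mathcal{L}_{\mathrm{MRM}_C} = -\sum_{j\in \mathcal{R}} \log P(c_j \mid \hat V_j)$

Here $c_j$ is the class label of region $j$, and $\hat V_j$ is taken to denote the contextualized output for masked region $j$, computed from the unmasked regions and the text.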

Image-Text Matching (ITM): Binary prediction whether (image, text) pair is aligned, typically via a classification head atop the [CLS] token from the fusion layers.

$\mathcal{L}_{\mathrm{ITM}} = -[y\log P(1 | \mathrm{[CLS]}) + (1-y)\log P(0 | \mathrm{[CLS]})]$
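
The ITM loss above is an ordinary binary cross-entropy. The following sketch evaluates it for a single pair; `itm_loss` and the example probabilities are hypothetical, and the match probability is assumed to come from a sigmoid or two-way softmax head on the [CLS] vector.

```python
import math


def itm_loss(p_match, y):
    """Binary cross-entropy for one (image, text) pair.

    p_match: predicted probability that the pair is aligned, P(1 | [CLS]).
    y: 1 for a true pair, 0 for a sampled mismatch.
    """
    return -(y * math.log(p_match) + (1 - y) * math.log(1.0 - p_match))


aligned = itm_loss(0.9, 1)    # confident and correct: small loss
mismatch = itm_loss(0.9, 0)   # confident but wrong: large loss
print(aligned < mismatch)
```

A production implementation would typically work on logits or clip `p_match` away from 0 and 1, since the logarithm is undefined at those values.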

Contrastive Alignment (Dual Encoder Models): A batch-wise InfoNCE loss pulls the embeddings of true image-caption pairs together and pushes mismatched pairs within the batch apart (Gwinnup et al., 2023).
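
A minimal sketch of a batch-wise InfoNCE loss is given below, assuming an n x n similarity matrix whose diagonal holds the true image-caption pairs. The symmetric averaging of both directions and the temperature value are assumptions for illustration, not details stated in the source.

```python
import math


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over an n x n similarity matrix.

    sim[i][j] is the similarity of image i and caption j; true pairs are
    assumed to lie on the diagonal. The temperature is an assumed value.
    """
    n = len(sim)

    def directional_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # -log softmax at the true index
        return total / n

    columns = [list(col) for col in zip(*sim)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (directional_loss(sim) + directional_loss(columns))


uninformative = info_nce([[0.0] * 3 for _ in range(3)])
well_aligned = info_nce([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(uninformative > well_aligned)
```

When all similarities are equal the loss reduces to log n, the value for a uniform guess over the batch; it approaches zero as the diagonal entries dominate their rows and columns.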

Advanced training regimes introduce span-level masking (Lin et al., 2020), structured masking guided by cross-modal alignment (Zhuge et al., 2021), and loss weighting schedules for multi-task pretraining (Yang et al., 2022).

3. Model Variants and Key Innovations

A spectrum of model architectures explores different fusion, training, and data strategies:

VisualBERT: Canonical single-stream model; region features are appended as tokens after the text, and the model is pretrained with MLM and sentence-image matching (Li et al., 2019).
ViLBERT: Two-stream (co-attention) architecture; separate vision and language encoders with cross-modal transformer blocks, strongly decoupling unimodal from shared embeddings (Lu et al., 2019).
VL-BERT: Single-stream, all tokens passed through a unified Transformer. Jointly pre-trained on massive vision-language and text-only data, with distinct MLM and RoI classification losses (Su et al., 2019).
UNITER: Single-stream plus novel word-region alignment losses and extended pretraining on both paired and unpaired corpora (Long et al., 2022).
KVL-BERT: Extension of VL-BERT with explicit commonsense knowledge tokens from ConceptNet, injected per token with position and attention masking to encode world knowledge for visual reasoning (Song et al., 2020).
DiMBERT: Disentangled multimodal attention: each modality has separate attention matrices and projections but is fused for downstream objectives. Incorporates explicit visual concepts as textual tokens (Liu et al., 2022).
GroundedBERT: Lightweight visual grounding for BERT via partial optimal transport between token and patch embeddings, aligning text features to image regions without retraining the full encoder (Nguyen et al., 2023).

The table below illustrates representative architectural differences:

Model	Fusion	Pretraining Tasks	Unique Feature
VisualBERT	Single-stream	MLM, Sentence–Image Matching	Simple, interpretable
ViLBERT	Two-stream	MLM, MRM, Alignment	Co-attention blocks
VL-BERT	Single-stream	MLM+visual, Masked RoI	Full end-to-end optimization
KVL-BERT	Single-stream	MLM, Masked RoI, ConceptNet	Commonsense knowledge mask
DiMBERT	Single-stream	BLM, S2SLM	Disentangled attention
GroundedBERT	BERT+VG head	Img-text matching, partial OT	Lightweight visual alignment

4. Fine-Tuning and Downstream Task Specialization

Once pretrained, V&L BERTs are modularly adapted to downstream tasks by attaching shallow task-specific heads and performing end-to-end fine-tuning (Long et al., 2022, Su et al., 2019):

Visual Question Answering (VQA): Classification head atop [CLS] embedding; answer selection from fixed vocabulary.
Visual Commonsense Reasoning (VCR): 4-way classification for both answers and rationales; input typically includes question, answer/rationale candidates, and image RoIs.
Image Captioning: Generative head (autoregressive decoder or seq-to-seq mask) to produce captions, using standard cross-entropy or CIDEr-based RL (Liu et al., 2022).
Referring Expression Comprehension: For each RoI, binary classifier or softmax score indicating correspondence to referring phrase.
Dialog and Navigation: Input includes conversational history and spatial/temporal context, with fusion adapted for sequential or recurrent modeling (Wang et al., 2020, Hong et al., 2020).
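
The classification-style heads listed above (for example VQA over a fixed answer vocabulary, or 4-way VCR) share one pattern: a linear layer and a softmax applied to a pooled vector. The sketch below shows that pattern with hand-picked numbers; `classify_from_cls`, the weights, and the three-answer vocabulary are all hypothetical.

```python
import math


def classify_from_cls(cls_vec, weights, bias, labels):
    """Linear layer plus softmax over a fixed label set.

    weights has one row per label. All numeric values used with this
    function below are made up for illustration.
    """
    logits = [
        sum(w * x for w, x in zip(row, cls_vec)) + b
        for row, b in zip(weights, bias)
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    best = max(range(len(labels)), key=lambda k: probs[k])
    return labels[best], probs


answer, probs = classify_from_cls(
    cls_vec=[0.2, -1.0, 0.5],
    weights=[[1.0, 0.0, 0.0], [0.0, -2.0, 0.0], [0.0, 0.0, 1.0]],
    bias=[0.0, 0.0, 0.0],
    labels=["yes", "no", "two"],
)
print(answer)
```

During fine-tuning the head's weights and the pretrained encoder would be updated together; only the label set and the head's output size change from task to task.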

5. Empirical Performance and Comparisons

V&L BERT models establish strong empirical baselines and often outperform task-specific architectures:

ViLBERT reported state-of-the-art results at the time of publication on VQA, VCR, and RefCOCO+ (e.g., 70.55% on VQA v2.0 test-dev, 72.42% Q→A accuracy on VCR) (Lu et al., 2019).
VL-BERT outperforms prior and concurrent work on VCR, VQA, and referring expressions without the need for an explicit image-text matching loss (Su et al., 2019).
KVL-BERT demonstrates that knowledge injection (+0.8–1% on VCR tasks) improves accuracy in visual commonsense reasoning over strong pretrained models (Song et al., 2020).
DiMBERT’s disentangled attention and visual concepts yield absolute gains of up to +2.2% RefCOCO+ val accuracy and +4.1 CIDEr on MSCOCO captioning (Liu et al., 2022).
GroundedBERT achieves consistent improvement (+2–6 F1/acc points) over BERT or Vokenization on language-only tasks, demonstrating benefits of even weak visual grounding (Nguyen et al., 2023).

6. Extensions, Limitations, and Open Challenges

Key challenges and future research avenues include (Long et al., 2022, Gwinnup et al., 2023):

Modality Alignment: More granular and explicit alignment between vision and language at representational and instance level (e.g., optimal transport alignment, masking in latent joint space).
Efficient Pretraining: Multi-task and curriculum pretraining strategies to balance dataset heterogeneity, task difficulty, and transferability.
Evaluation Methodologies: Developing intermediate metrics (e.g., perplexity in joint embedding space) and visual reasoning benchmarks to better track pretraining progress and generalization (e.g., VLU tasks (Alper et al., 2023)).
Knowledge and World Modeling: Incorporating structured knowledge—both visual (scene graphs, object properties) and commonsense (ConceptNet, relational graphs)—to facilitate high-level reasoning, as in KVL-BERT.
Task Generalization: Extending fusion strategies for modular downstream adaptation (e.g., adapters for machine translation or speech).
Data Scarcity & Grounding: Scaling robustly to weakly or sparsely grounded settings common in real-world multimodal problems (e.g., How2 instructional videos), with curriculum-based or modular fusion approaches (Gwinnup et al., 2023).
Computational Efficiency: Addressing the quadratic cost of joint attention for large-scale or long-context fusion (single-stream), and balancing representation power versus capacity for multimodal and unimodal transfer.
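
To make the quadratic cost of joint attention concrete, the sketch below counts the query-key pairs that one full self-attention layer scores over a concatenated sequence. The sequence sizes are illustrative assumptions and are not taken from any particular model.

```python
def attention_pairs(num_text_tokens, num_visual_tokens):
    """Query-key pairs scored by one full self-attention layer (per head)."""
    n = num_text_tokens + num_visual_tokens
    return n * n


# Illustrative sizes only.
roi_cost = attention_pairs(32, 36)     # 32 text tokens + 36 region features
grid_cost = attention_pairs(32, 196)   # 32 text tokens + 14 x 14 patches
doubled = attention_pairs(64, 72)      # both inputs twice as long
print(roi_cost, grid_cost, doubled)
```

Doubling both inputs quadruples the pair count, which is why moving from a few dozen region features to dense patch grids, or to long text contexts, raises the cost of single-stream fusion so sharply.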

7. Significance and Impact

The V&L BERT paradigm has established fundamental benchmarks for generic, transferable, and robust multimodal representation learning. By combining vast corpora of aligned vision-language data, joint self-supervised objectives, and extensible architectures, these models serve as universal backbones for visual reasoning, language-guided perception, dialog, navigation, and beyond. Ongoing research prioritizes explicit modality alignment, principled curriculum pretraining, lightweight fusion for low-resource tasks, and incorporation of structured world knowledge—pointing toward increasingly generalist architectures for multimodal intelligence (Long et al., 2022, Gwinnup et al., 2023).