Vision-Language Models
- Vision-Language Models are AI systems that learn joint representations of images and text using massive web-scale image-text pairs.
- They support diverse tasks like image classification, object detection, segmentation, and retrieval, often in zero-shot or few-shot settings.
- VLMs leverage unified and modular architectures with contrastive, generative, and alignment objectives to achieve robust vision-language integration.
A vision-language model (VLM) is a class of artificial intelligence models designed to learn joint representations of visual data (such as images or videos) and natural language, enabling machines to understand, describe, and reason about visual content with linguistic grounding. VLMs leverage massive amounts of web-scale image-text pairs to uncover semantic correlations between the vision and language modalities, supporting a range of visual recognition tasks (including image classification, object detection, segmentation, and retrieval), often in a zero-shot or few-shot setting where little or no task-specific annotated data is needed.
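To make the zero-shot setting concrete, the snippet below is a minimal sketch of zero-shot image classification with a CLIP-style VLM through the Hugging Face `transformers` API; the checkpoint name, image path, and class list are illustrative assumptions rather than fixed choices.

```python
# Minimal sketch: zero-shot classification with a CLIP-style VLM.
# The checkpoint, image path, and class names below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")       # any RGB image
class_names = ["cat", "dog", "car"]                    # no task-specific training data needed
prompts = [f"a photo of a {c}" for c in class_names]   # simple prompt template

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; softmax turns them into per-class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs.squeeze().tolist())))
```

The class names can be swapped freely at inference time, which is what makes the prediction zero-shot: classification reduces to comparing the image embedding against text embeddings of arbitrary prompts.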
1. Evolution of Visual Recognition Paradigms
Early visual recognition pipelines relied on hand-crafted features and classical machine learning models, which demanded domain expertise and did not scale effectively to complex tasks. The advent of deep neural networks brought about end-to-end learning but necessitated vast quantities of labeled data and time-consuming training for every distinct task. Subsequent stages introduced supervised and unsupervised pre-training, leveraging large annotated and unannotated datasets to reduce per-task annotation demands and foster transferability.
This progression culminates in the VLM paradigm: models are pretrained on massive, diverse, web-scale image-text pair datasets, enabling generalization across an extensive range of downstream tasks. Such VLMs can perform zero-shot predictions—that is, direct inference on tasks or classes never explicitly encountered during training—by exploiting rich cross-modal associations learned from internet-scale data.
2. Network Architectures and Pre-training Objective Functions
Modular and Unified Architectures
- Image Encoders: Feature extractors based on convolutional neural networks (CNNs, e.g., ResNet, EfficientNet) or transformers (e.g., Vision Transformer, Swin Transformer). Preprocessing for transformer encoders typically involves patchifying the image, linearly embedding each patch, and adding positional encodings (a patch-embedding sketch follows after this list).
- Text Encoders: Transformer models akin to those used in NLP (e.g., standard Transformer, BERT, GPT-style), frequently initialized from publicly available or custom pretrained checkpoints.
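The patchify-and-embed preprocessing described above can be sketched as follows; the sizes (224-pixel images, 16-pixel patches, 768-dimensional embeddings) are common ViT-Base defaults, not values mandated by any particular VLM.

```python
# Sketch of transformer image-encoder preprocessing: patchify, linearly embed,
# and add positional encodings. Sizes are illustrative ViT-Base defaults.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements "split into patches + linear embedding" in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768): one token per patch
        return x + self.pos_embed            # add learnable positional encodings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```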
Three architectural paradigms are prevalent:
- Two-tower: Separate encoders for image and text; embeddings are compared in a shared space (e.g., CLIP, ALIGN). A skeleton sketch of this design follows after this list.
- Two-leg: Dual encoders plus explicit fusion layers allowing richer cross-modal interaction (e.g., FLAVA, CoCa).
- One-tower: Both modalities are jointly processed by a unified model (e.g., CLIPPO, OneR).
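The two-tower design can be summarized as a small skeleton in which both encoders project into a shared, L2-normalized embedding space so that image-text similarity reduces to a dot product; the encoder internals and dimensions below are placeholders, not a specific published model.

```python
# Skeleton of a two-tower VLM (CLIP/ALIGN-style). The backbones are assumed to
# return pooled feature vectors; all names and dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerVLM(nn.Module):
    def __init__(self, image_encoder, text_encoder, img_dim, txt_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder                # e.g., a ViT or ResNet backbone
        self.text_encoder = text_encoder                  # e.g., a Transformer text backbone
        self.image_proj = nn.Linear(img_dim, embed_dim)   # project into the shared space
        self.text_proj = nn.Linear(txt_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature, init ~log(1/0.07)

    def forward(self, images, texts):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)
        # Pairwise cosine similarities scaled by the learned temperature.
        return self.logit_scale.exp() * img @ txt.t()
```

Two-leg architectures add cross-modal fusion layers on top of such towers, while one-tower models replace the separate backbones with a single shared network.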
Pre-training Objectives
VLMs are optimized through a combination of contrastive, generative, and alignment-based objectives:
- Contrastive Loss (e.g., InfoNCE): for a batch of $B$ image-text pairs with normalized embeddings $z_i^I$, $z_i^T$ and temperature $\tau$,
  $$\mathcal{L}_{I \rightarrow T} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\!\left(z_i^{I}\cdot z_i^{T}/\tau\right)}{\sum_{j=1}^{B}\exp\!\left(z_i^{I}\cdot z_j^{T}/\tau\right)},$$
  and symmetrically for $\mathcal{L}_{T \rightarrow I}$; this encourages paired image-text representations to be close and unpaired ones to be distant (a minimal implementation sketch is given below).
- Masked Modeling:
  - Masked Image Modeling: mask a subset of image patches and reconstruct them (or predict their features) from the visible patches.
  - Masked Language Modeling: mask text tokens and predict them from the remaining tokens and, in multimodal variants, the paired image.
- Alignment Objectives:
  - Image-Text (IT) Matching: a binary objective predicting whether an image and a text form a true pair; local region-word variants align image regions with individual words for finer-grained correspondence.
These objectives are combined to foster both robust representation learning and fine-grained vision-language alignment essential for dense prediction tasks.
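A minimal implementation sketch of the symmetric contrastive objective above, assuming a batch of already L2-normalized image and text embeddings of shape (B, D):

```python
# Symmetric InfoNCE over a batch: matched image-text pairs sit on the diagonal
# of the similarity matrix and are the positive class for both directions.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    logits = image_embeds @ text_embeds.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random (already normalized) embeddings:
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(contrastive_loss(img, txt))
```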
3. Datasets for Pre-training and Evaluation
Pre-training Datasets:
- SBU (1M image-caption pairs)
- COCO Captions (1.5M)
- YFCC100M (100M images/videos with metadata)
- Visual Genome (5.4M region/relationship pairs)
- Conceptual Captions (CC3M, CC12M)
- WIT (37.6M, multilingual)
- LAION-400M and LAION-5B (400M and 5B web-scale pairs; LAION-5B is multilingual)
- WuKong (Chinese, 100M)
- WebLI (12B pairs in 109 languages)
These datasets are vastly larger and more diverse than traditional vision benchmarks, promoting superior generalization and robustness to distribution shifts.
Evaluation Datasets: For benchmarking, established datasets cover:
- Image classification (ImageNet-1k, CIFAR, Food-101, etc.)
- Object detection (COCO, ODinW, LVIS)
- Semantic segmentation (ADE20k, PASCAL VOC)
- Retrieval and video/action recognition.
Dataset curation and scale are pivotal in the generalization capacity of VLMs.
4. Principal Methods in VLM Research
Pre-training Approaches:
- Contrastive-based models: CLIP, ALIGN, FILIP, PyramidCLIP, etc.
- Generative-based models: CoCa, FLAVA, PaLI, SegCLIP, mixing captioning and masked modeling objectives.
- Alignment-based models: FILIP, GLIP, RegionCLIP, advancing local region-word and global image-text matching.
Transfer Learning Strategies:
- Prompt tuning: Adapt textual prompts (CoOp, CoCoOp, LASP) and/or visual prompts (VP, RePrompt) for new tasks with few labels (a CoOp-style sketch follows after this list).
- Feature adapters: CLIP-Adapter and Tip-Adapter append lightweight adapter modules to frozen VLM features for efficient few-shot adaptation.
- Direct fine-tuning and architectural adaptation: For dense prediction, e.g., MaskCLIP.
- Test-time adaptation and cross-attention mechanisms: VT-CLIP, UPT, enabling adaptation without retraining every component.
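As a sketch of CoOp-style prompt tuning, the snippet below learns a small set of context vectors while the VLM itself stays frozen; it assumes a `text_encoder` that consumes token embeddings directly, which simplifies how real implementations hook into the VLM's embedding layer.

```python
# Hedged sketch of prompt tuning in the spirit of CoOp: learnable context vectors
# replace hand-written prompt words; only the context (nothing in the VLM) is trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, class_name_embeds, n_ctx=16):
        super().__init__()
        # class_name_embeds: (num_classes, name_len, dim) frozen token embeddings of class names
        self.register_buffer("class_name_embeds", class_name_embeds)
        dim = class_name_embeds.size(-1)
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable "prompt words"

    def forward(self):
        n_cls = self.class_name_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)        # share the context across classes
        return torch.cat([ctx, self.class_name_embeds], dim=1)   # (num_classes, n_ctx + name_len, dim)

def prompt_tuning_step(prompt_learner, text_encoder, image_feats, labels, optimizer, scale=100.0):
    """One few-shot training step; image_feats are frozen, L2-normalized VLM image embeddings."""
    text_feats = F.normalize(text_encoder(prompt_learner()), dim=-1)  # (num_classes, dim)
    logits = scale * image_feats @ text_feats.t()
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```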
Knowledge Distillation:
- Object detection: ViLD, F-VLM, OV-DETR, distilling VLM “knowledge” into specialized detector architectures (a ViLD-style distillation sketch follows after this list).
- Semantic segmentation: CLIPSeg, ZegFormer—aligning or passing pixel/region-level semantics.
- Pseudo-labeling and teacher-student learning: Leveraged for open-vocabulary object detection and segmentation.
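The detection branch of this idea can be illustrated with a ViLD-style sketch in which the detector's region embeddings are regressed toward frozen VLM embeddings of the cropped proposals, so the detector inherits the VLM's open-vocabulary embedding space; the function and tensor names below are illustrative placeholders.

```python
# Sketch of VLM-to-detector knowledge distillation (in the spirit of ViLD).
# region_embeds come from the detector head; clip_crop_embeds come from a frozen
# VLM image encoder applied to the cropped region proposals.
import torch
import torch.nn.functional as F

def distillation_loss(region_embeds, clip_crop_embeds):
    region_embeds = F.normalize(region_embeds, dim=-1)
    clip_crop_embeds = F.normalize(clip_crop_embeds, dim=-1)
    # An L1 term pulls detector region embeddings toward the VLM's embeddings.
    return F.l1_loss(region_embeds, clip_crop_embeds)

def classify_regions(region_embeds, text_embeds, temperature=0.01):
    # Open-vocabulary classification: compare region embeddings with text embeddings of
    # category prompts, including categories never annotated in the detection data.
    region_embeds = F.normalize(region_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return (region_embeds @ text_embeds.t()) / temperature
```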
5. Performance Characterization and Analytical Insights
Benchmarks
- Zero-shot image classification: CLIP reaches ~76% top-1 accuracy on ImageNet-1k; larger models such as CoCa exceed 86% with further dataset/model scaling (but with diminishing returns at extreme scales).
- Dense prediction: Recent VLMs (GLIP, RegionCLIP) achieve strong zero-shot and transfer results for detection/segmentation on COCO, LVIS, ADE20k, especially when leveraging local alignment.
- Transfer learning: Prompt-tuned or adapter-based methods consistently surpass zero-shot baselines, whether supervised (few-shot or linear-probe) or unsupervised (UPL, TPT).
- Distillation: Incorporating VLM knowledge lifts performance of standard detection/segmentation models to new state-of-the-art on open-vocabulary tasks.
Strengths:
- Strong generalization across domains, outstanding zero-shot/few-shot performance, and robustness to broad image-text distributions.
Weaknesses:
- High training cost (compute/memory).
- Diminishing returns at Internet-scale.
- Relative immaturity and benchmarking difficulty for dense prediction tasks.
- Difficulty of fair comparison across methods due to discrepancies in training data and setups.
6. Research Challenges and Future Directions
Pre-training:
- Attaining fine-grained (pixel/region-level) vision-language alignment for dense predictions.
- Developing unified (one-tower) architectures for tighter and more efficient modal integration.
- Expansion to multilingual and data-efficient VLMs.
- Automated enrichment of training data/captions using LLMs.
Transfer learning:
- Unsupervised/domain-adaptive and test-time transfer.
- Enhanced prompt and adapter techniques, especially for dense or complex outputs.
- Dynamic prompt engineering with LLMs.
Knowledge distillation:
- Integrating multiple VLMs for stronger knowledge transfer.
- Extending beyond detection/segmentation to instance-level, panoptic, and further real-world recognition tasks.
7. Community Resources
A curated and continuously updated project repository linking to VLM papers, datasets, codebases, and benchmarking results is available at https://github.com/jingyi0000/VLM_survey. It serves as a central resource for reproducing, comparing, and advancing vision-language models across the research community.
| Aspect | Summary |
|---|---|
| Paradigms | Progression from hand-crafted features → deep learning → pre-training → VLMs (zero-/few-shot prediction) |
| Foundations | Modular/unified architectures with contrastive/generative/alignment objectives; tasks: classification, detection, etc. |
| Datasets | Billion-scale, multi-modal, multilingual pre-training corpora; matching large curated evaluation sets for diverse tasks |
| Methods | Contrastive/generative/alignment-based pre-training; prompt tuning, adapters, distillation; unified/transfer/test-time learning |
| Benchmarks | VLMs excel at zero-shot prediction; scaling data/models helps but saturates; transfer and distillation further boost results |
| Challenges | Need for fine-grained and unified models; multilingual and data-efficient training; better test-time adaptation and knowledge distillation |
| Project | Open-source repository for reproduction and research (see link above) |
For more comprehensive resources, model/dataset lists, and full benchmarking tables, see the appendices and project page in the original survey.