Vision-language Models

Updated 1 July 2025
  • Vision-language models (VLMs) are machine learning systems that learn joint representations by training on vast image-text data, bridging visual and linguistic information.
  • These models employ architectures like two-tower or fusion networks and are pre-trained using objectives such as contrastive loss (like InfoNCE) to align image and text features.
  • VLMs enable powerful zero-shot prediction, allowing generalization across diverse vision tasks like classification, detection, and segmentation without requiring task-specific labeled data or fine-tuning.

Vision-language models (VLMs) are machine learning systems that learn joint representations bridging visual and linguistic information, primarily by training on large-scale image-text pairs harvested from the web. These models underpin a new generation of visual recognition benchmarks by enabling zero-shot prediction, drastically reducing the reliance on expensive, task-specific labeled data and facilitating rapid generalization across diverse vision tasks.

1. Evolution and Paradigms in Visual Recognition

Contemporary VLMs arise from a sequence of paradigm shifts in visual recognition:

  • Hand-crafted Features & Shallow Classifiers: Early approaches relied on engineered descriptors and simple classifiers but offered limited scalability and weak generalization to complex images.
  • Supervised Deep Learning: The advent of deep neural networks (notably CNNs) shifted the focus toward end-to-end learned features. However, these approaches required vast, manually labeled datasets (e.g., ImageNet) and lengthy convergence periods.
  • Transfer & Self-supervised Learning: Pre-training on large datasets, followed by task-specific fine-tuning, improved data efficiency but maintained a dependency on labeled data.
  • VLM Pre-training and Zero-Shot Prediction: VLMs leverage vast, uncurated collections of image-text pairs openly available on the Internet. Inspired by advances in natural language processing (e.g., BERT, GPT), these models learn to align visual content with textual descriptions, enabling robust zero-shot transfer—directly solving new tasks by matching image features to language, without fine-tuning for every scenario. This approach yields highly scalable and generalizable vision systems.

2. Foundations: Architectures and Training Objectives

2.1 Network Structures

  • Image encoders: CNNs (VGG, ResNet, EfficientNet) or Transformers (ViT variants).
  • Text encoders: Transformers (BERT, GPT-2), usually trained from scratch during cross-modal pre-training.
  • Model topology: two-tower (separate image/text encoders, as in CLIP and ALIGN), fusion (cross-modal interaction layers added on top), or one-tower (a unified encoder for images and text).
  • Two-tower schemes separately embed images and text and align their latent spaces; a minimal code sketch of this design follows after this list.
  • Fusion and One-tower models introduce multi-modal interactivity, enabling more nuanced cross-modal reasoning.
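
The following PyTorch-style sketch illustrates the two-tower design referenced above: separate encoders project images and text into a shared embedding space, and alignment is expressed purely through temperature-scaled cosine similarity. The toy encoders, dimensions, and tokenization are illustrative assumptions, not the architecture of any specific published model such as CLIP or ALIGN.

```python
# Minimal two-tower (dual-encoder) sketch. All sizes and the toy tokenization
# are illustrative assumptions, not a reproduction of any published VLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTower(nn.Module):
    """Tiny CNN image encoder followed by a linear projection head."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, images):                    # (B, 3, H, W)
        return self.proj(self.backbone(images))   # (B, embed_dim)

class TextTower(nn.Module):
    """Toy text encoder: token embeddings -> Transformer encoder -> mean pooling."""
    def __init__(self, vocab_size=10_000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):                  # (B, L) integer token ids
        x = self.encoder(self.embed(token_ids))    # (B, L, embed_dim)
        return self.proj(x.mean(dim=1))            # (B, embed_dim)

class TwoTowerVLM(nn.Module):
    """The towers never interact directly; alignment lives in the shared space."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.image_tower = ImageTower(embed_dim)
        self.text_tower = TextTower(embed_dim=embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, images, token_ids):
        z_img = F.normalize(self.image_tower(images), dim=-1)
        z_txt = F.normalize(self.text_tower(token_ids), dim=-1)
        # Pairwise cosine similarities scaled by a learned temperature.
        return self.logit_scale.exp() * z_img @ z_txt.t()

if __name__ == "__main__":
    model = TwoTowerVLM()
    sims = model(torch.randn(4, 3, 224, 224), torch.randint(0, 10_000, (4, 16)))
    print(sims.shape)  # torch.Size([4, 4])
```

Fusion and one-tower variants would instead insert cross-modal interaction layers between the towers or feed image and text tokens through a single shared encoder.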

2.2 Pre-training Objectives

  • Contrastive Losses: Encourage similar image-text pairs to have higher similarity, while pushing apart mismatched pairs. The InfoNCE loss is central:

$$\mathcal{L}_{I \rightarrow T} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_i^I \cdot z_i^T / \tau)}{\sum_{j=1}^{B} \exp(z_i^I \cdot z_j^T / \tau)}$$

where z_i^I and z_j^T are normalized visual and textual features, B is the batch size, and τ is a temperature parameter. The objective is typically applied symmetrically in both the image→text and text→image directions (a minimal implementation sketch follows at the end of this subsection).

  • Generative and Alignment Objectives: Masked image or language modeling (e.g., MAE, BERT), image-to-text generation, and explicit alignment via image-text (or region-word) matching.
  • Supervisory Signals: Standard discriminative labels can be incorporated for further refinement and downstream transfer.
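
As a concrete illustration of the symmetric InfoNCE objective above, the sketch below assumes pre-computed, L2-normalized image and text features for a batch of matched pairs; the cross-entropy formulation is equivalent to the displayed loss, applied in both directions.

```python
# Symmetric InfoNCE contrastive loss over a batch of matched image-text pairs.
# Assumes z_img[i] and z_txt[i] come from the same pair and are L2-normalized.
import torch
import torch.nn.functional as F

def info_nce_loss(z_img, z_txt, temperature=0.07):
    logits = z_img @ z_txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_img.size(0), device=z_img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with random stand-in features:
z_img = F.normalize(torch.randn(8, 256), dim=-1)
z_txt = F.normalize(torch.randn(8, 256), dim=-1)
print(info_nce_loss(z_img, z_txt))
```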

2.3 Downstream Tasks and Evaluation

  • Zero-shot Prediction: Central to the VLM paradigm; models are evaluated by composing task-specific text prompts and ranking their alignment with image features, with no task-specific retraining (a sketch of this procedure follows after this list).
  • Linear Probing and Fine-tuning: Evaluate generalizable representations by training shallow classifiers or adapting internal features.
  • Target Tasks: Image classification, object detection, semantic segmentation, image-text retrieval, and action recognition.
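
The following sketch illustrates zero-shot classification by prompting: each class name is inserted into a prompt template, encoded by the text encoder, and the image is assigned to the class whose text embedding it matches best. The encode_image and encode_text callables and the template are hypothetical stand-ins for a pre-trained VLM, not the API of any particular library.

```python
# Zero-shot classification by ranking image-text similarity over prompted class names.
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text,
                       template="a photo of a {}"):
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(encode_text(prompts), dim=-1)   # (C, D), one row per class
    img_emb = F.normalize(encode_image(image), dim=-1)     # (1, D)
    scores = (img_emb @ text_emb.t()).squeeze(0)           # (C,) cosine similarities
    return class_names[scores.argmax().item()], scores

# Usage with random stand-in encoders (a real VLM would supply these functions):
D = 256
encode_text = lambda prompts: torch.randn(len(prompts), D)
encode_image = lambda image: torch.randn(1, D)
label, scores = zero_shot_classify(torch.randn(1, 3, 224, 224),
                                   ["cat", "dog", "aircraft"],
                                   encode_image, encode_text)
print(label, scores)
```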

3. Data for Pre-training and Evaluation

3.1 Pre-training Corpora

The effectiveness of VLMs critically depends on large, diverse, and noisy image-text datasets. Major sources include:

  • SBU Caption (~1M)
  • COCO Caption (1.5M)
  • YFCC100M (100M)
  • LAION400M/5B (400M/5B)
  • WebLI (12B, multilingual)
  • Additional datasets for region-level or fine-grained tasks (e.g., Objects365)

These datasets provide the scale and diversity necessary to learn broad, robust, and semantically meaningful cross-modal representations.

3.2 Benchmarking Datasets

  • Image Classification: ImageNet, CIFAR, Oxford Flowers, Cars, Aircraft, Pets, Food101.
  • Object Detection and Segmentation: COCO, LVIS, ODinW, PASCAL VOC, ADE20k, Cityscapes.
  • Image-Text Retrieval: Flickr30k, COCO Caption.
  • Action Recognition: UCF101, Kinetics700.

These datasets facilitate rigorous benchmarking under zero-shot, few-shot, and dense prediction regimes, spanning both coarse and fine-grained vision tasks.

4. Transfer Learning and Knowledge Distillation

4.1 Transfer Learning Approaches

  • Prompt Tuning
    • Text Prompt Tuning (TPT): e.g., CoOp, CoCoOp
    • Visual Prompt Tuning (VPT): Modify image inputs with learnable perturbations.
    • Joint Prompt Tuning: Simultaneous adaptation of both textual and visual prompts.
  • Adapters: Lightweight modules interposed between the frozen backbone and the classifier (e.g., CLIP-Adapter, Tip-Adapter); see the sketch after this list.
  • Direct Fine-tuning: Full fine-tuning of the backbone; robust variants that ensemble zero-shot and fine-tuned weights (e.g., WiSE-FT) achieve strong performance across transfer settings.
  • LLM-Augmented Prompts: Generating tailored prompts or captions via LLMs (CuPL, VCD).
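
The sketch below is in the spirit of lightweight feature adapters such as CLIP-Adapter: a small bottleneck MLP refines frozen VLM features and is blended back residually, so only the adapter parameters are trained on the target data. The dimensions and blend ratio are illustrative assumptions rather than published hyperparameters.

```python
# Lightweight residual adapter over frozen VLM features (adapter-style transfer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    def __init__(self, dim=512, bottleneck=64, blend=0.2):
        super().__init__()
        self.blend = blend  # how much of the adapted features to mix in
        self.mlp = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, dim), nn.ReLU(),
        )

    def forward(self, frozen_features):
        adapted = self.mlp(frozen_features)
        # Residual blend keeps the output close to the frozen VLM features.
        mixed = self.blend * adapted + (1 - self.blend) * frozen_features
        return F.normalize(mixed, dim=-1)

# Usage: only adapter.parameters() would be optimized on the few-shot target data.
adapter = FeatureAdapter()
frozen = F.normalize(torch.randn(8, 512), dim=-1)  # stand-in frozen image features
print(adapter(frozen).shape)                       # torch.Size([8, 512])
```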

4.2 Knowledge Distillation

  • For Detection: Map VLM representations into object detector backbones (e.g., via ViLD, HierKD, RKD), or use pseudo-labeling (PB-OVD, XPM); a simplified distillation sketch follows at the end of this subsection.
  • For Segmentation: Techniques such as CLIPSeg, ZegFormer, LSeg, and prompt-driven approaches like MaskCLIP+ transfer VLM knowledge to pixel-level prediction.

Key innovations involve:

  • Unified vision-language pre-training,
  • Fine-grained (region/word, pixel/word) alignment for transfer,
  • Data-efficient paradigms harnessing synthetic or LLM-generated supervision.
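
As a simplified illustration of distillation for detection (in the spirit of ViLD's image-embedding branch), the sketch below pulls region embeddings produced by a trainable detector head toward a frozen VLM's embeddings of the corresponding region crops. The L1 loss, feature shapes, and stand-in tensors are assumptions for illustration, not a faithful reproduction of any particular method.

```python
# Distill frozen VLM region knowledge into a detector's region embeddings.
import torch
import torch.nn.functional as F

def region_distillation_loss(region_embeddings, vlm_crop_embeddings):
    """region_embeddings: (R, D) from the trainable detector head.
    vlm_crop_embeddings: (R, D) from the frozen VLM image encoder, one per region crop."""
    student = F.normalize(region_embeddings, dim=-1)
    teacher = F.normalize(vlm_crop_embeddings, dim=-1).detach()  # teacher is frozen
    return F.l1_loss(student, teacher)

# Usage with random stand-ins for 16 proposed regions:
student = torch.randn(16, 512, requires_grad=True)
teacher = torch.randn(16, 512)
loss = region_distillation_loss(student, teacher)
loss.backward()  # gradients flow only into the detector-side embeddings
print(loss.item())
```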

5. Benchmarking: Analysis of Performance and Limitations

  • Zero-Shot Image Classification: Current state-of-the-art (SOTA) models trained on billions of image-text pairs achieve 80–86% accuracy on ImageNet, maintaining strong generalization across multiple benchmarks.
  • Scaling Laws: Increases in data and model size continue to drive improvement, but with diminishing marginal returns and large computational cost.
  • Transferability: High cross-task generalization is achieved without fine-tuning; few-shot and unsupervised methods further narrow the remaining gaps.
  • Dense Tasks: Zero-shot detection and segmentation lag behind classification, with region- and pixel-level vision-language correlation remaining an open research frontier.
  • Limitations:
    • Training requires enormous hardware and energy budgets.
    • Public vs. proprietary data splits complicate fair benchmarking.
    • Dense prediction capabilities (detection/segmentation) are less mature than classification, especially regarding fine-grained alignment.
    • Benchmarks across transfer settings are not fully standardized.

6. Research Challenges and Prospects

Key Open Problems

  • Fine-grained Cross-Modal Alignment: Improving local (region, patch, pixel) vision-language correspondence to enable zero-shot dense prediction.
  • Unified Architectures: Moving towards one-tower models for directly tokenizing and co-processing visual and textual input, improving efficiency and interaction.
  • Multilingual and Culturally Fair Models: Current VLMs disproportionately focus on English; new efforts (e.g., WebLI, PaLI, WuKong) seek to expand linguistic and cultural coverage.
  • Data Efficiency: Designing objectives and supervision strategies to reduce dependence on web-scale corpora, making VLM pre-training more accessible and sustainable.
  • LLM Integration: Merging VLMs with LLMs offers new pathways for knowledge-enriched or instruction-fine-tuned representations.

Future Directions

  • Zero-Shot Dense Prediction: Developing region- and pixel-level alignment for dense, open-vocabulary tasks.
  • Efficient Transfer: Unsupervised and few-shot methods to minimize annotation requirements, especially for localization tasks.
  • Test-Time Adaptation: Methods enabling adaptation to new domains without new training cycles.
  • Standardization and Reproducibility: Broad adoption of open datasets, code bases, and protocols to democratize VLM research and strengthen empirical claims.

7. Summary of Contributions in the Literature

  • Comprehensive Frameworks: Synthesizing models, objectives, architectures, and evaluation benchmarks to systematize VLM research.
  • Empirical Benchmarking: Quantitative comparisons across dozens of tasks clarify strengths, weaknesses, and practical considerations.
  • Prospective Roadmaps: Detailed open questions and research priorities establish a trajectory towards more robust, broad, and accessible vision-language intelligence.

Vision-language models now underpin a rapidly evolving research ecosystem for vision tasks, offering unprecedented generalization and scalability across image classification, detection, segmentation, and retrieval. Their advances arise from large-scale, contrastively or generatively pre-trained architectures, the use of web-scale data, and a growing suite of transfer and distillation techniques. Continued progress hinges on improvements in cross-modal alignment, computational accessibility, multilingualism, and rigorous, open benchmarking.
