Vision-Language Models (VLMs): A Comprehensive Survey

Last updated: June 12, 2025

This overview synthesizes the full scope of the survey "Vision-Language Models for Vision Tasks: A Survey" (Zhang et al., 2023), covering the evolution, foundations, datasets, methods, benchmarking, and open challenges of Vision-Language Models (VLMs).



1. Evolution of Visual Recognition Paradigms

Visual recognition has undergone significant paradigm shifts, fueled by advancements in both computational models and data availability:

  1. Traditional Machine Learning: Early vision systems relied on hand-crafted features (e.g., SIFT) and classic classifiers. These approaches required extensive domain expertise and could not scale to complex, large-scale visual tasks.
  2. Deep Neural Networks (DNNs): The advent of deep learning, notably with architectures like AlexNet and ResNet, enabled end-to-end feature learning but required large task-specific labeled datasets.
  3. Supervised Pre-training & Fine-tuning: Models pre-trained on massive labeled datasets (like ImageNet) demonstrated improved adaptability to new tasks via fine-tuning, but still demanded extensive labeled data for each downstream application.
  4. Unsupervised (Self-supervised) Pre-training: Self-supervised methods (e.g., contrastive learning in MoCo and SimCLR) can harness vast amounts of unlabeled data, but still require supervised fine-tuning for specific tasks.
  5. Vision-Language Pre-training & Zero-Shot Learning: The introduction of VLMs represents a transformative step. By pre-training on web-scale image-text pairs, which are nearly limitless, VLMs learn rich, generalizable cross-modal representations. This enables direct zero-shot prediction on a variety of vision tasks, eliminating the need for exhaustive task-specific data collection and model retraining for each new problem.

Key Takeaway:

VLMs fundamentally shift the visual recognition paradigm, enabling reusable, universal vision models that generalize across diverse tasks and data domains.


2. Foundations of VLMs

Architectures

  • Image Encoders:
    • CNN-based: Classic backbones (ResNet, EfficientNet) underpin early VLMs.
    • Transformer-based: Vision Transformer (ViT) models partition images into patches and process them with self-attention, offering data and model scalability.
  • Text Encoders:
    • Transformer-based text encoders (as in CLIP and ALIGN) map tokenized captions into the shared embedding space.
  • Network Frameworks:
    • Two-tower designs keep separate image and text encoders (e.g., CLIP), two-leg designs add multimodal fusion layers, and one-tower designs process both modalities in a single unified network (a minimal two-tower sketch follows this list).
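
As a concrete illustration of the two-tower framework described above, the following is a minimal sketch of separate image and text encoders projecting into a shared embedding space. The class name, encoder interfaces, embedding dimensions, and PyTorch usage are illustrative assumptions, not the survey's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerVLM(nn.Module):
    """Minimal two-tower VLM: separate encoders, shared embedding space."""
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 image_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g., a ViT or ResNet backbone (assumed)
        self.text_encoder = text_encoder     # e.g., a Transformer text encoder (assumed)
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature, initialized to log(1/0.07) as in CLIP-style training.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, token_ids):
        img_feat = self.image_proj(self.image_encoder(images))
        txt_feat = self.text_proj(self.text_encoder(token_ids))
        # L2-normalize so that dot products are cosine similarities.
        img_feat = F.normalize(img_feat, dim=-1)
        txt_feat = F.normalize(txt_feat, dim=-1)
        logits = self.logit_scale.exp() * img_feat @ txt_feat.t()
        return logits  # [batch, batch] image-to-text similarity matrix
```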

Training Objectives

  • Contrastive Objectives (most common):
    • The InfoNCE loss aligns matching image-text pairs while separating mismatched pairs (a minimal loss sketch follows this list):

    $$\mathcal{L}_{I \rightarrow T} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\left(z_i^{I} \cdot z_i^{T} / \tau\right)}{\sum_{j=1}^{B} \exp\left(z_i^{I} \cdot z_j^{T} / \tau\right)}$$

    where $z_i^I$ and $z_i^T$ are the image and text embeddings of the $i$-th pair, $\tau$ is a temperature, and $B$ is the batch size; a symmetric text-to-image term is typically added.
    • Also employed: category label supervision, enhanced region-word matching, and data augmentations.

  • Generative Objectives:
    • Masked image modeling, masked language modeling, masked cross-modal modeling, and image-to-text generation (captioning) train the model to reconstruct or generate one modality from the other.

  • Alignment and Matching:
    • Global (image-sentence) and local (region-word) alignment ensure models can localize and describe objects or fine regions within images.
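
To make the contrastive objective concrete, here is a minimal sketch of the symmetric InfoNCE loss over a batch of paired image and text embeddings, matching the formula above. The function name, PyTorch usage, and default temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(z_img: torch.Tensor,
                                z_txt: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of B paired image/text embeddings.

    z_img, z_txt: [B, D] embeddings; row i of each tensor is a matching pair.
    """
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(z_img.size(0), device=z_img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```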

Downstream Tasks

  • Zero-shot predictions:
    • Class names are wrapped in natural-language prompts (e.g., "a photo of a {class}") and images are assigned to the most similar prompt embedding, without any task-specific training (see the sketch after this list).
  • Classification, detection, segmentation, and retrieval:
    • VLMs are evaluated via embedding-based comparison, prompt-driven classification, or regional/pixel-level matching with text.
  • Linear probing:
    • A lightweight linear classifier is trained on frozen VLM features to measure representation quality.
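
Below is a minimal sketch of prompt-driven zero-shot classification using embedding comparison, as described above. The encode_image, encode_text, and tokenize callables stand in for a pre-trained VLM's interface and are assumptions, not a specific library's API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, encode_image, encode_text, tokenize,
                       template="a photo of a {}"):
    """Assign `image` to the class whose prompt embedding is most similar.

    encode_image / encode_text / tokenize are assumed to come from a
    pre-trained VLM (e.g., a CLIP-style model); they are not defined here.
    """
    prompts = tokenize([template.format(c) for c in class_names])
    txt = F.normalize(encode_text(prompts), dim=-1)              # [C, D]
    img = F.normalize(encode_image(image.unsqueeze(0)), dim=-1)  # [1, D]
    probs = (100.0 * img @ txt.t()).softmax(dim=-1)              # [1, C]
    return class_names[probs.argmax(dim=-1).item()], probs
```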

3. Datasets

Pre-training Datasets

  • Web-scale sources:
    • LAION-400M/5B, CC3M/12M, SBU Captions, Visual Genome, YFCC100M, WIT, RedCaps, WuKong, WebLI, and more enable training over billions of image-text pairs.
  • Strengths:
    • Enormous scale, wide domain coverage, and multilingual options mitigate bias and enhance generalization.
    • Auxiliary datasets (Objects365, COCO, etc.) assist in local/region-level or fine-grained tasks.

Evaluation Datasets

  • Classification/Recognition:
    • ImageNet, CIFAR, Food-101, Oxford-IIIT Pets, FGVC Aircraft
  • Segmentation/Detection:
    • ADE20K, PASCAL VOC, Cityscapes, COCO, LVIS, ODinW
  • Retrieval:
    • Flickr30k, COCO Caption
  • Video & Action Recognition:
    • UCF101, Kinetics700

4. Methods

Pre-training Algorithms

  • Contrastive Learning (CLIP, ALIGN):
    • Maximizes similarity for correct image-text pairs; enhanced by larger scale, better region-word matching, and hierarchical attention.
  • Generative Modeling (CoCa, PaLI, FLAVA):
    • Pairs alignment with captioning or masked-reconstruction objectives, so the model can generate text from images as well as align the two modalities.
  • Alignment Methods (GLIP, FIBER):
    • Emphasize explicit matching at global or regional levels, important for dense and fine-grained prediction (a region-word scoring sketch follows this list).
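
To illustrate the explicit region-word matching used by alignment methods such as GLIP, here is a simplified sketch that scores every region feature against every word embedding and applies a binary matching loss. The loss form, shapes, and names are simplifying assumptions rather than any method's exact objective.

```python
import torch
import torch.nn.functional as F

def region_word_alignment_loss(region_feats: torch.Tensor,
                               word_embeds: torch.Tensor,
                               match_targets: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Local alignment between R region features and W word embeddings.

    region_feats:  [R, D] visual features of proposed regions
    word_embeds:   [W, D] text-token (word) embeddings from the text encoder
    match_targets: [R, W] binary matrix, 1 where a region is described by a word
    """
    region_feats = F.normalize(region_feats, dim=-1)
    word_embeds = F.normalize(word_embeds, dim=-1)
    scores = region_feats @ word_embeds.t() / temperature   # [R, W] alignment logits
    # Treat each region-word cell as a binary matching decision.
    return F.binary_cross_entropy_with_logits(scores, match_targets.float())
```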

Transfer Learning

  • Prompt Tuning:
    • Learn continuous text prompts (e.g., CoOp-style context vectors) or visual prompts from a few labeled examples while keeping the VLM frozen (a minimal sketch follows this list).
  • Feature Adapters:
    • Insert lightweight adapter layers on top of frozen image and text features for efficient adaptation.
  • Fine-tuning and Other Methods:
    • Update part or all of the VLM, or modify its architecture, when more downstream data is available.
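
The snippet below sketches the prompt-tuning idea: learnable context vectors are prepended to class-name token embeddings and optimized on a few labeled examples while the VLM stays frozen. The interfaces (e.g., a text_encoder that accepts embedded prompts) and hyperparameters are assumptions for illustration, not the method's official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePrompts(nn.Module):
    """CoOp-style prompt tuning: only the context vectors are trained."""
    def __init__(self, class_name_embeds: torch.Tensor, n_ctx: int = 16):
        super().__init__()
        # class_name_embeds: [C, L, D] token embeddings of each class name (frozen)
        embed_dim = class_name_embeds.size(-1)
        self.register_buffer("class_name_embeds", class_name_embeds)
        self.context = nn.Parameter(0.02 * torch.randn(n_ctx, embed_dim))

    def forward(self):
        C = self.class_name_embeds.size(0)
        ctx = self.context.unsqueeze(0).expand(C, -1, -1)        # [C, n_ctx, D]
        return torch.cat([ctx, self.class_name_embeds], dim=1)   # [C, n_ctx+L, D]

def prompt_tuning_step(prompts, text_encoder, img_feats, labels, optimizer):
    """One few-shot step: image and text encoders stay frozen."""
    txt_feats = F.normalize(text_encoder(prompts()), dim=-1)     # [C, D]
    img_feats = F.normalize(img_feats, dim=-1)                   # [B, D] precomputed
    logits = 100.0 * img_feats @ txt_feats.t()
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```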

Knowledge Distillation

  • Feature-Space Distillation:
    • Align a student detector's or segmenter's features with the frozen VLM image encoder's embeddings, transferring open-vocabulary knowledge (a minimal sketch follows this list).
  • Prompt-based and Pseudo-labeling:
    • Generate or refine downstream supervision using VLM knowledge.
  • Local/Regional Distillation:
    • Transfer region- or pixel-level alignment from the VLM to open-vocabulary detection and segmentation models.
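
As a minimal sketch of feature-space distillation, the function below matches a student's region features to a frozen VLM teacher's embeddings with an L1 loss, in the spirit of ViLD-style approaches. The projection layer, shapes, and names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(student_region_feats: torch.Tensor,
                              teacher_vlm_feats: torch.Tensor,
                              proj: nn.Linear) -> torch.Tensor:
    """L1 distillation between student and frozen-teacher region embeddings.

    student_region_feats: [N, D_s] features from the detector/segmenter head
    teacher_vlm_feats:    [N, D_t] embeddings of the same regions, produced by
                          the frozen VLM image encoder
    proj:                 linear layer mapping D_s -> D_t
    """
    student = F.normalize(proj(student_region_feats), dim=-1)
    teacher = F.normalize(teacher_vlm_feats.detach(), dim=-1)  # teacher is frozen
    return F.l1_loss(student, teacher)
```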

5. Benchmarking and Analysis

  • Pre-training:
    • Zero-shot performance improves with larger data and model scale, though gains saturate at the largest scales.
  • Transfer Learning:
    • Few-shot and prompt tuning are efficient for adapting to new tasks, with unsupervised/test-time methods crucial for low-resource or practical settings.
  • Dense Prediction:
    • Region-aware pre-training and alignment help open-vocabulary detection and segmentation, but dense-task performance still trails image-level recognition.
  • Distillation:
    • Explicit local/regional alignment outperforms basic feature consistency, significantly improving performance on rare/novel categories.
  • Overall:
    • VLMs excel in universal zero-shot recognition, efficiency, and open-vocabulary reasoning. Limitations include high data/compute requirements, scaling saturation, weaker dense prediction, and benchmarking difficulties due to varying training setups.

6. Research Challenges and Future Directions

Ongoing and prospective areas include:

  1. Fine-grained & Local Modeling:
    • Achieving robust region/pixel-level correspondence for improved dense vision tasks.
  2. Unified Architectures:
    • Single models that handle image-level, region-level, and pixel-level tasks within one framework.
  3. Multilingual & Multicultural Modeling:
    • Reducing bias and broadening applicability via diverse language and cultural data.
  4. Data- and Compute-Efficiency:
    • Reducing pre-training cost through better data curation and more sample-efficient objectives.
  5. LLM Integration:
    • Utilize LLMs for prompt/caption augmentation and dynamic adaptation.
  6. Unsupervised & Test-time Adaptation:
    • Techniques for domain adaptation in settings lacking annotated data.
  7. Distillation & Compression:
    • Creating compact, high-performing models suited for deployment and less-studied vision tasks.
  8. Benchmark Standardization & Open Science:
    • Consistent evaluation protocols and openly released data, code, and checkpoints to keep results comparable and reproducible.

Conclusion

Vision-Language Models stand at the forefront of visual recognition research, with universal pre-training on web-scale image-text pairs unlocking zero-shot, cross-domain, and open-vocabulary capabilities. The field continues to advance toward richer fine-grained modeling, unified and efficient architectures, robust multilingual coverage, and accessible, reproducible benchmarks. These models are poised for broad deployment across real-world vision tasks, with ongoing work addressing key challenges to make VLMs even more versatile, interpretable, and resource-efficient (Zhang et al., 2023).


Recommended Reference:

For practical frameworks, implementation details, and up-to-date resource links, consult the companion repository: https://github.com/jingyi0000/VLM_survey