Vision-Language Models (VLMs): A Comprehensive Survey
Last updated: June 12, 2025
This overview synthesizes the full scope of the survey "Vision-Language Models for Vision Tasks: A Survey" (Zhang et al., 2023).
1. Evolution of Visual Recognition Paradigms
Visual recognition has undergone significant paradigm shifts, fueled by advancements in both computational models and data availability:
- Traditional Machine Learning: Early vision systems relied on hand-crafted features (e.g., SIFT) and classic classifiers. These approaches required extensive domain expertise and could not scale to complex, large-scale visual tasks.
- Deep Neural Networks (DNNs): The advent of deep learning, notably with architectures like AlexNet and ResNet, enabled end-to-end feature learning but required large task-specific labeled datasets.
- Supervised Pre-training & Fine-tuning: Models pre-trained on massive labeled datasets (like ImageNet) demonstrated improved adaptability to new tasks via fine-tuning, but still demanded extensive labeled data for each downstream application.
- Unsupervised (Self-supervised) Pre-training: Self-supervised methods (e.g., contrastive learning in MoCo, SimCLR) could harness vast amounts of unlabeled data, but still required supervised fine-tuning for specific tasks.
- Vision-Language Pre-training & Zero-Shot Learning: The introduction of VLMs represents a transformative step. By pre-training on web-scale image-text pairs, which are nearly limitless, VLMs learn rich, generalizable cross-modal representations. This enables direct zero-shot prediction on a variety of vision tasks, eliminating the need for exhaustive task-specific data collection and model retraining for each new problem.
Key Takeaway:
VLMs fundamentally shift the visual recognition paradigm, enabling reusable, universal vision models that generalize across diverse tasks and data domains.
2. Foundations of VLMs
Architectures
- Image Encoders:
- CNN-based: Classic backbones (ResNet, EfficientNet) underpin early VLMs.
- Transformer-based: Vision Transformer (ViT) models partition images into patches and process them with self-attention, offering data and model scalability.
- Text Encoders:
- Transformer architectures (BERT, GPT) process the textual modality; these encoders are sometimes reused across image and text streams.
- Network Frameworks:
- Two-tower (e.g., CLIP): Separate encoders project images and text into a shared embedding space (a minimal sketch follows this list).
- Two-leg/One-tower: Architectures may integrate additional cross-modal fusion or unify the encoding of both modalities.
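To make the two-tower design concrete, here is a minimal PyTorch-style sketch of separate image and text encoders projecting into a shared, L2-normalized embedding space. The linear layers stand in for a real vision backbone (ViT/ResNet) and text Transformer; all names and dimensions are illustrative assumptions, not the architecture of any particular VLM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerVLM(nn.Module):
    """Minimal two-tower sketch: separate image/text encoders, shared embedding space."""
    def __init__(self, img_feat_dim=2048, txt_feat_dim=768, embed_dim=512):
        super().__init__()
        # Placeholders standing in for a real vision backbone and text Transformer.
        self.image_encoder = nn.Linear(img_feat_dim, embed_dim)
        self.text_encoder = nn.Linear(txt_feat_dim, embed_dim)

    def forward(self, image_feats, text_feats):
        # Project each modality, then L2-normalize so cosine similarity is a dot product.
        img_emb = F.normalize(self.image_encoder(image_feats), dim=-1)
        txt_emb = F.normalize(self.text_encoder(text_feats), dim=-1)
        return img_emb, txt_emb

# Toy usage: a batch of 4 images and 4 captions, represented by random backbone features.
model = TwoTowerVLM()
img_emb, txt_emb = model(torch.randn(4, 2048), torch.randn(4, 768))
```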
Training Objectives
- Contrastive Objectives (most common):
- InfoNCE-style losses align matching image-text pairs while separating mismatched pairs (see the formulation after this list).
- Also employed: category label supervision, enhanced region-word matching, and data augmentations.
- Generative Objectives:
- Masked Image Modeling (MIM): Predicting masked patches in the image.
- Masked Language Modeling (MLM): Predicting masked text tokens.
- Alignment and Matching:
- Global (image-sentence) and local (region-word) alignment ensure that models can localize and describe objects or fine-grained regions within images.
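For reference, a standard image-to-text InfoNCE term (as used in CLIP-style training) can be written as below, where z_i^I and z_j^T are L2-normalized image and text embeddings, τ is a temperature, and B is the batch size; a symmetric text-to-image term is typically averaged in. This is the generic formulation, not a reproduction of any single paper's notation.

$$
\mathcal{L}_{I \rightarrow T} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\left(z_i^{I} \cdot z_i^{T} / \tau\right)}{\sum_{j=1}^{B} \exp\!\left(z_i^{I} \cdot z_j^{T} / \tau\right)}
$$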
Downstream Tasks
- Zero-shot predictions:
- VLMs can infer classes or concepts not seen in supervised training by matching image embeddings to textual descriptions (see the sketch after this list).
- Classification, detection, segmentation, and retrieval:
- VLMs are evaluated via embedding-based comparison, prompt-driven classification, or regional/pixel-level matching with text.
- Linear probing:
- Training lightweight classifiers atop frozen VLM features for specific tasks.
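As an illustration of prompt-driven zero-shot classification, the sketch below scores an L2-normalized image embedding against text embeddings of class-name prompts (e.g., "a photo of a {class}") and picks the most similar class. The random tensors stand in for outputs of a real pre-trained VLM such as CLIP; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(img_emb, class_text_embs, temperature=0.01):
    """Score one normalized image embedding against per-class prompt embeddings."""
    # Cosine similarities (embeddings are already normalized), scaled by 1/temperature.
    logits = class_text_embs @ img_emb / temperature
    probs = F.softmax(logits, dim=0)
    return probs, int(probs.argmax())

# Toy usage: random stand-ins for embeddings produced by a pre-trained VLM.
D, C = 512, 3
img_emb = F.normalize(torch.randn(D), dim=0)
class_text_embs = F.normalize(torch.randn(C, D), dim=-1)
probs, pred = zero_shot_classify(img_emb, class_text_embs)
print(f"predicted class index: {pred}")
```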
3. Datasets
Pre-training Datasets
- Web-scale sources:
- LAION-400M/5B, CC3M/12M, SBU Captions, Visual Genome, YFCC100M, WIT, RedCaps, WuKong, WebLI, and more enable training over billions of image-text pairs.
- Strengths:
- Enormous scale, wide domain coverage, and multilingual options mitigate bias and enhance generalization.
- Auxiliary datasets (Objects365, COCO, etc.) assist in local/region-level or fine-grained tasks.
Evaluation Datasets
- Classification/Recognition:
- ImageNet, CIFAR, Food-101, Oxford-IIIT Pets, FGVC Aircraft
- Segmentation/Detection:
- ADE20K, PASCAL VOC, Cityscapes, COCO, LVIS, ODinW
- Retrieval:
- Flickr30k, COCO Caption
- Video & Action Recognition:
- UCF101, Kinetics-700
4. Methods
Pre-training Algorithms
- Contrastive Learning (CLIP, ALIGN):
- Maximizes similarity for correct image-text pairs; enhanced by larger scale, better region-word matching, and hierarchical attention (a minimal loss sketch follows this list).
- Generative Modeling (CoCa, PaLI, FLAVA):
- Employs masked modeling and image-to-caption paradigms to capture rich representations.
- Alignment Methods (GLIP, FIBER):
- Emphasize explicit matching at global or regional levels, important for dense and fine-grained prediction.
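To ground the contrastive recipe above, here is a minimal sketch of a CLIP-style symmetric contrastive loss: matching pairs sit on the diagonal of the batch similarity matrix, and cross-entropy in both directions pulls them together while separating mismatched pairs. It is a schematic of the general approach under assumed shapes and temperature, not the exact implementation of any cited model.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized embeddings of shape (B, D)."""
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))           # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: random embeddings standing in for image/text encoder outputs.
B, D = 8, 512
img_emb = F.normalize(torch.randn(B, D), dim=-1)
txt_emb = F.normalize(torch.randn(B, D), dim=-1)
loss = clip_style_contrastive_loss(img_emb, txt_emb)
```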
Transfer Learning
- Prompt Tuning:
- Text, visual, or joint prompt embeddings adapt VLMs to new classes or tasks with minimal data and compute.
- Custom prompt generation and test-time tuning improve adaptability (a prompt-tuning sketch follows this list).
- Feature Adapters:
- Lightweight modules adapt VLM features to specific tasks; some variants are even training-free.
- Full Model Fine-tuning & Architecture Modifications:
- More intensive adaptation for complex or dense-output tasks.
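As a rough sketch of text prompt tuning in the spirit of CoOp: a few learnable context vectors are prepended to frozen class-name token embeddings, and only those vectors are optimized while the rest of the VLM stays frozen. The token dimension, context length, and single-token class names below are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LearnablePrompts(nn.Module):
    """Sketch of CoOp-style prompt tuning: only the context vectors receive gradients."""
    def __init__(self, num_context=4, token_dim=512, num_classes=10):
        super().__init__()
        # Learnable context tokens shared across classes.
        self.context = nn.Parameter(0.02 * torch.randn(num_context, token_dim))
        # Frozen class-name token embeddings (simplified to one token per class).
        self.register_buffer("class_tokens", torch.randn(num_classes, 1, token_dim))

    def forward(self):
        # Build per-class prompt sequences: [ctx_1, ..., ctx_M, class_token].
        ctx = self.context.unsqueeze(0).expand(self.class_tokens.size(0), -1, -1)
        # Shape (num_classes, num_context + 1, token_dim); these sequences would be
        # fed to the frozen text encoder to produce class embeddings.
        return torch.cat([ctx, self.class_tokens], dim=1)

# Only the context vectors go to the optimizer; the VLM itself stays untouched.
prompts = LearnablePrompts()
optimizer = torch.optim.SGD([prompts.context], lr=0.002)
```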
Knowledge Distillation
- Feature-Space Distillation:
- Align downstream model representations with VLM features, especially for object detection/segmentation (see the sketch after this list).
- Prompt-based and Pseudo-labeling:
- Generate or refine downstream supervision using VLM knowledge.
- Local/Regional Distillation:
- Focuses on aligning pixel/region features, enabling open-vocabulary segmentation.
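A minimal sketch of feature-space distillation for open-vocabulary detection/segmentation: region features from a student detector are projected into the VLM embedding space and pulled toward the frozen VLM's embeddings of the same regions via a cosine-alignment loss. The shapes, projection layer, and loss choice are illustrative assumptions rather than a specific published recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(student_region_feats, teacher_vlm_embs, proj):
    """Align student region features with frozen VLM embeddings of the same regions."""
    student = F.normalize(proj(student_region_feats), dim=-1)  # (R, D) projected + normalized
    teacher = F.normalize(teacher_vlm_embs, dim=-1)            # (R, D) from the frozen VLM
    # Minimize 1 - cosine similarity for each matched region pair.
    return (1.0 - (student * teacher).sum(dim=-1)).mean()

# Toy usage: 16 region proposals, student features of dim 256 projected to a 512-d VLM space.
proj = nn.Linear(256, 512)
loss = feature_distillation_loss(torch.randn(16, 256), torch.randn(16, 512), proj)
```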
5. Benchmarking and Analysis
- Pre-training:
- Larger scale (models/data) consistently improves zero-shot image classification and retrieval, with diminishing returns at extreme scales.
- Transfer Learning:
- Few-shot and prompt tuning are efficient for adapting to new tasks, with unsupervised/test-time methods crucial for low-resource or practical settings.
- Dense Prediction:
- Recent VLMs with region- or pixel-level supervision achieve open-vocabulary detection/segmentation, but localization still lags behind classification performance.
- Distillation:
- Explicit local/regional alignment outperforms basic feature consistency, significantly improving performance on rare/novel categories.
- Overall:
- VLMs excel in universal zero-shot recognition, efficiency, and open-vocabulary reasoning. Limitations include high data/compute requirements, scaling saturation, weaker dense prediction, and benchmarking difficulties due to varying training setups.
6. Research Challenges and Future Directions
Ongoing and prospective areas include:
- Fine-grained & Local Modeling:
- Achieving robust region/pixel-level correspondence for improved dense vision tasks.
- Unified Architectures:
- Token-level unification of vision and language streams ("one-tower" models).
- Multilingual & Multicultural Modeling:
- Reducing bias and broadening applicability via diverse language and cultural data.
- Data- and Compute-Efficiency:
- More selective data curation, effective training objectives, and self-/weakly-supervised hybrid strategies.
- LLM Integration:
- Utilize LLMs for prompt/caption augmentation and dynamic adaptation.
- Unsupervised & Test-time Adaptation:
- Techniques for domain adaptation in settings lacking annotated data.
- Distillation & Compression:
- Creating compact, high-performing models suited for deployment and less-studied vision tasks.
- Benchmark Standardization & Open Science:
- Improving comparability, reproducibility, and open-source resource sharing in the community.
Conclusion
Vision-Language Models stand at the forefront of visual recognition research, with universal pre-training on web-scale image-text pairs unlocking zero-shot, cross-domain, and open-vocabulary capabilities. The field continues to advance toward richer fine-grained modeling, unified and efficient architectures, robust multilingualism, and accessible, reproducible benchmarks. These models are poised for broad deployment across real-world vision tasks, with ongoing work addressing key challenges to make VLMs even more versatile, interpretable, and resource-efficient (Zhang et al., 2023).
Recommended Reference:
For practical frameworks, implementation details, and up-to-date resource links, consult the companion repository: https://github.com/jingyi0000/VLM_survey