Vision-Language Models for Visual Recognition: A Survey
The paper "Vision-Language Models for Vision Tasks: A Survey" provides a comprehensive overview of vision-language models (VLMs) and their application to visual recognition tasks such as image classification, object detection, and semantic segmentation, which are essential components of fields like autonomous driving and robotics. The paper addresses the traditional reliance on crowd-labelled data and task-specific neural networks, an approach that is both time-consuming and resource-intensive.
Background and Foundations
The paper begins by outlining the foundational shift in visual recognition paradigms, from hand-crafted features to deep learning. The emergence of VLMs marks a further shift: web-scale image-text pairs are leveraged for robust and generalizable pre-training, giving rise to zero-shot and open-vocabulary recognition capabilities. Because these models learn from loosely paired web data rather than task-specific annotations, a single pre-trained VLM can make predictions across multiple recognition tasks.
Network Architectures
The survey reviews the architectures employed in VLMs, covering both CNN- and Transformer-based image encoders. CNNs such as ResNet are commonly adopted, while Vision Transformers (ViTs) treat an image as a sequence of patches and process it with standard Transformer layers. Text encoders are almost exclusively Transformer-based, mirroring current practice in natural language processing.
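To make the two-tower layout concrete, the following is a minimal sketch of a dual-encoder VLM in PyTorch. The class name, dimensions, and layer choices are illustrative assumptions rather than the architecture of any specific surveyed model; the sketch only shows how a CNN image tower and a Transformer text tower can project into a shared embedding space.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class DualEncoderVLM(nn.Module):
    """Minimal two-tower VLM sketch: a ResNet image encoder and a Transformer
    text encoder project into a shared embedding space (simplified layout)."""

    def __init__(self, vocab_size=49408, embed_dim=512, max_len=77):
        super().__init__()
        # Image tower: ResNet backbone with its classifier head replaced by a projection.
        backbone = tvm.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.image_encoder = backbone
        # Text tower: token + positional embeddings followed by Transformer encoder layers.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(max_len, embed_dim))
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.text_proj = nn.Linear(embed_dim, embed_dim)

    def encode_image(self, images):            # images: (B, 3, H, W)
        feats = self.image_encoder(images)     # (B, D)
        return nn.functional.normalize(feats, dim=-1)

    def encode_text(self, token_ids):          # token_ids: (B, L)
        x = self.token_embed(token_ids) + self.pos_embed[: token_ids.size(1)]
        x = self.text_encoder(x)
        feats = self.text_proj(x[:, -1])       # last-token pooling, a simplification
        return nn.functional.normalize(feats, dim=-1)
```

Both encoders return L2-normalised features, so image-text similarity reduces to a dot product, which is what the pre-training objectives below operate on.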
Pre-training Objectives
VLMs are pre-trained with objectives that the survey groups into contrastive, generative, and alignment objectives. Contrastive objectives learn discriminative representations by pulling paired images and texts together and pushing unpaired ones apart; generative objectives reconstruct or generate data within or across modalities (e.g., masked image or language modelling, captioning); alignment objectives model global image-text and local region-word correspondences.
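As an illustration of the contrastive family, the snippet below sketches a symmetric image-text contrastive (InfoNCE-style) loss. It assumes both towers output L2-normalised features and that the i-th image and i-th text in a batch form the only positive pair; the function name and temperature value are arbitrary choices, not taken from the survey.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text features.
    Assumes image_feats and text_feats are (B, D) and L2-normalised."""
    logits = image_feats @ text_feats.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(image_feats.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                  # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)              # match each text to its image
    return 0.5 * (loss_i2t + loss_t2i)
```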
Evaluation and Datasets
The paper categorizes the datasets used for VLM pre-training and evaluation, and describes the typical evaluation setups, such as zero-shot prediction and linear probing. Popular pre-training corpora consist of large image-text collections crawled from the web and social media, whose broad semantic coverage is crucial for robust VLM performance on unseen tasks.
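The zero-shot prediction setup can be summarised in a few lines: each class name is turned into a textual prompt, the prompts are embedded by the text encoder, and every image is assigned the class whose prompt embedding it is most similar to. The sketch below assumes a model exposing encode_image/encode_text methods like the one above and a hypothetical tokenize helper matching the text encoder's vocabulary.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, images, class_names, tokenize, template="a photo of a {}."):
    """Prompt-based zero-shot classification sketch.
    `tokenize` is a placeholder for whatever tokenizer the text encoder expects."""
    prompts = tokenize([template.format(c) for c in class_names])  # (C, L) token ids
    text_feats = model.encode_text(prompts)                        # (C, D), L2-normalised
    image_feats = model.encode_image(images)                       # (B, D), L2-normalised
    similarity = image_feats @ text_feats.t()                      # cosine similarities (B, C)
    return similarity.argmax(dim=-1)                               # predicted class per image
```

Linear probing, by contrast, freezes the image encoder and trains only a linear classifier on its features, measuring how transferable the learned representations are.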
Key Findings and Performance
VLMs have demonstrated impressive zero-shot performance on image classification, and this performance improves consistently as model size and training data are scaled up. However, their application to dense prediction tasks such as object detection and semantic segmentation remains comparatively under-explored.
Challenges and Future Directions
The survey identifies several potential future directions in VLM research:
- Fine-grained Vision-Language Correlation Modeling: Enhanced modeling of local vision-language correspondences could improve performance on pixel or region-level recognition tasks.
- Unified Vision and Language Learning: Integrating image and text learning in a single framework could reduce computational overhead and improve cross-modal communication efficiency.
- Multilingual VLMs: Training VLMs across multiple languages may address bias and widen usability across different cultural and linguistic contexts.
- Data-Efficient Pre-training: Developing methods to train VLMs efficiently with limited data resources could help mitigate the computational demands of current models.
- Incorporation of LLMs: Leveraging LLMs for generating richer language data during pre-training could enhance the learning of diverse and informative visual concepts.
Conclusion
The paper provides a thorough investigation into the state-of-the-art in VLMs, bringing to light both achievements and challenges. The insights shared highlight the transformative potential of VLMs in visual recognition tasks and pave the way for sustainable and wide-reaching applications in computer vision and beyond.