CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
This paper introduces CLIP-VG, a method that addresses the visual grounding (VG) task through self-paced curriculum adapting of CLIP with pseudo-language labels. Visual grounding is a pivotal problem in vision-and-language research: a model must localize the region of an image that corresponds to a given textual expression. Whereas previous approaches largely depend on fully annotated datasets, the crux of CLIP-VG is minimizing that dependence on labeled data through an unsupervised, pseudo-label-driven approach.
Methodological Overview
CLIP-VG builds on the vision-language pretrained model CLIP. Its key innovation is self-paced curriculum learning: the CLIP model is iteratively adapted to the VG task using pseudo-labels, which act as surrogate supervision and substantially reduce the need for manually labeled data.
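To make the surrogate-supervision idea concrete, the sketch below shows what a single pseudo-labelled adaptation step might look like in PyTorch. The box-regression objective (smooth L1 plus generalized IoU) and the `model(images, expressions)` interface are assumptions for illustration, not the paper's confirmed training recipe.

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def pseudo_label_step(model, images, expressions, pseudo_boxes, optimizer):
    """One adaptation step driven by pseudo-language labels.

    pseudo_boxes: machine-generated target boxes in normalized (x1, y1, x2, y2)
    format; the smooth L1 + GIoU objective is an assumed, commonly used choice."""
    pred_boxes = model(images, expressions)  # (B, 4) boxes predicted for each expression
    loss = F.smooth_l1_loss(pred_boxes, pseudo_boxes) \
         + generalized_box_iou_loss(pred_boxes, pseudo_boxes, reduction="mean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point is that nothing in the step itself distinguishes pseudo-boxes from human annotations; the curriculum described next is what controls which pseudo-labelled samples are trusted.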
The architecture centers on adapting CLIP itself: the pre-trained CLIP components are frozen, and a lightweight six-layer transformer is introduced to bridge the visual and linguistic modalities and realize cross-modal interaction. The model also extracts features from multiple CLIP layers to enrich its visual representations, keeping the design simple and computationally efficient.
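A minimal sketch of this design is given below, assuming a learnable [REG]-style query token and a simple MLP box head; the specific token scheme, hidden size, and head shape are illustrative guesses rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze the pretrained CLIP encoders so only the new fusion layers train."""
    for p in module.parameters():
        p.requires_grad = False
    return module

class CrossModalFusion(nn.Module):
    """Sketch of the adaptation module: token-level features from the frozen CLIP
    image and text encoders are concatenated with a learnable [REG] token, fused
    by a small trainable transformer, and an MLP regresses a normalized box.
    The [REG]-token design, dimensions, and head shape are illustrative assumptions."""

    def __init__(self, d_model: int = 512, num_layers: int = 6, nhead: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),  # 4 normalized box coordinates
        )

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, d), e.g. patch features gathered from several frozen
        # CLIP-ViT layers (the multi-layer extraction); text_tokens: (B, Nt, d).
        reg = self.reg_token.expand(visual_tokens.size(0), -1, -1)
        fused = self.fusion(torch.cat([reg, visual_tokens, text_tokens], dim=1))
        return self.box_head(fused[:, 0])  # box regressed from the fused [REG] token
```

Only the fusion transformer, the [REG] token, and the box head receive gradients; the CLIP encoders stay frozen via `freeze`, which is what keeps the adaptation lightweight.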
Curriculum Adaptation Approach
- Reliability Measurement: The approach begins by training a preliminary model on the pseudo-labels and using it to measure their reliability. A reliability score, reflecting how likely each sample's pseudo-label is to be correctly predicted, is computed per sample, and a reliability histogram is constructed.
- Self-Paced Adapting in single-source (SSA) and multi-source (MSA) settings:
- SSA handles a single pseudo-label source, progressively refining the model by selecting training samples according to their reliability and thereby limiting the risk of model degradation from unreliable data (a minimal sketch of this selection loop follows the list).
- MSA addresses multiple pseudo-label sources by introducing a cross-source reliability measure, allowing the model to leverage diverse sources while keeping its learning trajectory stable and robust.
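A minimal sketch of the single-source loop is shown below. It assumes reliability is scored as the IoU between the current model's predicted box and the pseudo-box, and that each round re-adapts the model on the samples clearing a reliability threshold; the threshold schedule and the `train_fn`/`predict_fn` placeholders are assumptions rather than the paper's exact formulation.

```python
import torch

def box_iou(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Element-wise IoU for two (N, 4) box tensors in (x1, y1, x2, y2) format."""
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    return inter / (area_p + area_t - inter).clamp(min=1e-6)

def self_paced_adapt(train_fn, predict_fn, samples, pseudo_boxes,
                     thresholds=(0.9, 0.7, 0.5)):
    """Single-source self-paced adapting, sketched. `train_fn(subset)` adapts the
    model on a list of pseudo-labelled samples and `predict_fn(model, samples)`
    returns (N, 4) predicted boxes; both are placeholders for the user's own
    routines. The IoU-based reliability score and the threshold schedule are
    assumptions made for this illustration."""
    model = train_fn(samples)                # round 0: preliminary model on all pseudo-labels
    for tau in thresholds:                   # curriculum over reliability thresholds
        scores = box_iou(predict_fn(model, samples), pseudo_boxes)
        keep = (scores >= tau).nonzero(as_tuple=True)[0].tolist()
        model = train_fn([samples[i] for i in keep])  # refine on the selected subset
    return model
```

MSA extends the same loop with the cross-source reliability described above; that extra bookkeeping is omitted here for brevity.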
Empirical Analysis
The paper provides substantial empirical validation of CLIP-VG across five datasets: RefCOCO, RefCOCO+, RefCOCOg, ReferItGame, and Flickr30K Entities. Results demonstrate significant performance improvements over existing state-of-the-art unsupervised methods. Notably, the gains in the multi-source scenario underscore the efficacy of the MSA strategy in integrating diverse pseudo-data to strengthen generalization.
Additionally, the method surpasses several weakly supervised models and is even competitive with fully supervised state-of-the-art models. Its efficiency in computation and processing speed presents a compelling case for adoption.
Implications and Future Prospects
CLIP-VG marks a clear step toward reducing reliance on labeled data for visual grounding. Leveraging pseudo-labels effectively and adapting models through self-paced curriculum learning offers a scalable, adaptable template for how vision-language tasks are approached.
Future research could explore more sophisticated pseudo-label generation techniques, refined curriculum adaptation strategies, and the application of similar principles to broader AI tasks. The balance CLIP-VG strikes between performance and resource efficiency indicates a promising direction for advancing AI systems in both academic and practical settings.