An Expert Review of "Attentive Mask CLIP"
The paper "Attentive Mask CLIP" introduces an innovative approach to enhance the efficiency and performance of CLIP-style vision-LLMs. The research investigates the implications of image token removal, presenting a novel technique termed "attentive token removal" to address the challenges faced by existing methods.
Key Contributions and Methodology
The central premise of the paper is the hypothesis that randomly removing image tokens can damage the semantic alignment between image and text pairs during CLIP training. To mitigate this, the authors propose an attentive token removal strategy that selectively retains the image tokens most semantically correlated with the corresponding text description, with correlation scores computed on the fly by an EMA-updated vision encoder.
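To make this concrete, the following is a minimal sketch of the selection step, assuming a ViT-style EMA vision encoder that exposes [CLS]-to-patch attention scores; the function and tensor names are illustrative rather than taken from the paper's implementation:

```python
import torch

def attentive_token_selection(patch_tokens, cls_attn, keep_ratio=0.5):
    """Keep the patches most attended by the EMA encoder's [CLS] token.

    patch_tokens: (B, N, D) patch embeddings fed to the online vision encoder
    cls_attn:     (B, N)    [CLS]-to-patch attention from the EMA encoder,
                            averaged over heads (and optionally layers)
    """
    B, N, D = patch_tokens.shape
    num_keep = max(1, int(N * keep_ratio))
    # Retain the highest-scoring (most text-relevant) tokens, drop the rest.
    keep_idx = cls_attn.topk(num_keep, dim=1).indices        # (B, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)      # (B, num_keep, D)
    return patch_tokens.gather(1, keep_idx)                  # (B, num_keep, D)
```

Because the scores come from an EMA-updated encoder, the retained set adapts dynamically as training progresses rather than being fixed in advance.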
The "Attentive Mask CLIP" (A-CLIP) framework is designed to outperform the original CLIP model and random token removal variants, achieving significant improvements in both efficiency and accuracy. The attentive mask strategy ensures that only semantically meaningful image tokens are retained, facilitating efficient multi-view contrastive learning. The empirical results demonstrate that A-CLIP achieves superior zero-shot image classification and retrieval accuracies.
Experimental Results
The experimental evaluation of A-CLIP shows remarkable improvements:
- ImageNet-1K zero-shot classification: Achieves 43.9% top-1 accuracy, outperforming SLIP by 1.1%.
- Retrieval Tasks: Reports 62.7/42.1 on Flickr30K and 38.0/23.2 on MS COCO I2T/T2I retrieval tasks, surpassing SLIP by +5.5/+0.9 and +4.4/+1.3, respectively.
- Efficiency Gains: A-CLIP trains 2.30× faster than SLIP while maintaining strong performance, and an efficient variant achieves further speed-ups at comparable accuracy.
Detailed Analysis
The paper provides comprehensive ablation studies that validate the attentive masking strategy, rigorously analyzing alternative selection criteria such as low-score token removal and random masking. The results demonstrate that retaining semantically related image tokens significantly improves training effectiveness and zero-shot evaluation accuracy.
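The contrast between these criteria can be illustrated with a small helper that keeps high-score, low-score, or random tokens; the strategy names are illustrative and do not reproduce the paper's code:

```python
import torch

def select_token_indices(scores, keep_ratio, strategy="high"):
    """Return indices of the tokens to keep under one of the ablated strategies.

    scores: (B, N) per-token relevance scores from the EMA vision encoder.
    """
    num_keep = max(1, int(scores.size(1) * keep_ratio))
    if strategy == "high":     # keep the most text-relevant tokens (proposed)
        return scores.topk(num_keep, dim=1).indices
    if strategy == "low":      # keep the least relevant tokens (ablation)
        return scores.topk(num_keep, dim=1, largest=False).indices
    if strategy == "random":   # random masking baseline
        return torch.rand_like(scores).topk(num_keep, dim=1).indices
    raise ValueError(f"unknown strategy: {strategy!r}")
```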
Moreover, adding extra masked views and auxiliary self-supervised tasks further enhances performance. By integrating SimCLR- and SimSiam-style objectives, the A-CLIP framework benefits from stronger data augmentation and learns more refined visual representations.
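A hedged sketch of how such an auxiliary objective might be folded into the total loss, assuming a SimSiam-style term computed between two masked views and an illustrative loss weight:

```python
import torch.nn.functional as F

def simsiam_loss(p1, z1, p2, z2):
    """Negative cosine similarity with stop-gradient targets (SimSiam-style)."""
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * (neg_cos(p1, z2) + neg_cos(p2, z1))

def total_loss(contrastive_loss, ssl_loss, ssl_weight=1.0):
    """Combine the image-text contrastive loss with the auxiliary SSL term.

    The weighting is illustrative; the paper's exact formulation may differ.
    """
    return contrastive_loss + ssl_weight * ssl_loss
```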
Theoretical and Practical Implications
The paper contributes both theoretical insight and practical advances in efficient vision-language modeling. The attentive-mask technique exemplifies a balanced way of addressing the semantic misalignment introduced by token removal, paving the way for more computationally efficient pre-training without sacrificing accuracy.
Future Directions
The framework presents opportunities for further exploration, particularly in integrating masked image modeling and expanding the range of auxiliary tasks. Future work could apply A-CLIP's methodology to larger and more diverse datasets or extend it to other vision-language tasks.
Conclusion
"Attentive Mask CLIP" offers a significant stride in optimizing CLIP-like models through its targeted token removal strategy. The work successfully demonstrates that careful semantic alignment can yield substantial improvements in both model efficiency and task performance. With its robust experimental validations, A-CLIP stands as a compelling contribution to the field, encouraging further research and development in efficient vision-language integration methods.