Enhancing Vision-Language Model with Unmasked Token Alignment (2405.19009v2)

Published 29 May 2024 in cs.CV

Abstract: Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient because it avoids the extra [MASK] tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks. Code and models are available at https://github.com/jihaonew/UTA.
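
The abstract describes aligning only the unmasked visual tokens of a student ViT with the corresponding token features from a frozen CLIP vision encoder, without ever inserting [MASK] tokens into the student. The sketch below is a minimal illustration of that alignment objective under simplifying assumptions: generic transformer encoders stand in for the student ViT and the frozen CLIP vision tower, and the names (uta_loss, keep_idx, etc.) are hypothetical rather than taken from the official code, which is available at the linked repository.

```python
# Hypothetical sketch of the unmasked-token-alignment objective.
# The student sees only the kept (unmasked) patch tokens; its outputs are
# aligned, via a cosine loss, with the teacher's tokens at the same positions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def uta_loss(student_tokens, teacher_tokens, keep_idx):
    """Cosine-alignment loss between student tokens (unmasked only) and the
    matching teacher tokens gathered at the same spatial positions.

    teacher_tokens: (B, N, D) from the frozen CLIP vision encoder (full image)
    student_tokens: (B, K, D) from the student ViT run on the kept tokens only
    keep_idx:       (B, K) indices of the unmasked positions
    """
    target = torch.gather(
        teacher_tokens, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, teacher_tokens.size(-1)),
    )
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(target, dim=-1)
    return (1.0 - (s * t).sum(-1)).mean()


# Toy stand-ins for the two encoders; in practice the teacher is a frozen,
# pretrained CLIP vision tower and the student is the ViT being trained.
B, N, D, mask_ratio = 2, 196, 512, 0.5
student = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, 8, batch_first=True), 2)
teacher = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, 8, batch_first=True), 2)
for p in teacher.parameters():
    p.requires_grad_(False)

patch_tokens = torch.randn(B, N, D)            # patch embeddings of one batch
K = int(N * (1 - mask_ratio))                  # number of tokens kept (unmasked)
keep_idx = torch.rand(B, N).argsort(-1)[:, :K]

with torch.no_grad():
    teacher_out = teacher(patch_tokens)        # teacher always sees the full image

kept = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
loss = uta_loss(student(kept), teacher_out, keep_idx)
loss.backward()
```

Because the student is trained only on real (unmasked) tokens, there is no train-finetune gap from [MASK] tokens, and the alignment to the CLIP vision encoder implicitly keeps the student compatible with the CLIP text encoder for zero-shot use, as the abstract notes.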

Authors (5)
  1. Jihao Liu (60 papers)
  2. Jinliang Zheng (10 papers)
  3. Boxiao Liu (16 papers)
  4. Yu Liu (784 papers)
  5. Hongsheng Li (340 papers)