Leveraging per Image-Token Consistency for Vision-Language Pre-training (2211.15398v2)

Published 20 Nov 2022 in cs.CV and cs.LG

Abstract: Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable number of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Under-utilization of the unmasked tokens: CMLM primarily focuses on the masked tokens and cannot simultaneously leverage other tokens to learn vision-language associations. To address these limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure), and then the model is required to determine for each token in the sentence whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method is easily combined with pre-training methods. Extensive experiments show that the combination of the EPIC method and state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. The code is released at https://github.com/gyhdog99/epic.
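To make the described pipeline concrete, below is a minimal PyTorch sketch of the two ingredients the abstract names: sampling inconsistent replacements for salient tokens from a language model, and a per-token binary head plus loss for the Image-Token Consistency task. All function names, tensor shapes, and the loss averaging here are illustrative assumptions, not the authors' released EPIC implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sample_inconsistent_tokens(input_ids: torch.Tensor,
                               salient_mask: torch.Tensor,
                               lm_logits: torch.Tensor) -> torch.Tensor:
    """Replace salient positions with tokens drawn from a language model's
    predictive distribution (a sketch of the Inconsistent Token Generation step).

    input_ids:    (batch, seq_len) original token ids
    salient_mask: (batch, seq_len) 1 where a token was masked as image-salient
    lm_logits:    (batch, seq_len, vocab) logits from any masked language model
    """
    probs = F.softmax(lm_logits, dim=-1)
    # Sample one candidate per position, then swap in only at salient slots.
    # Note: a sample may coincide with the original token; handling that
    # case is omitted in this sketch.
    sampled = torch.multinomial(probs.flatten(0, 1), 1).view_as(input_ids)
    return torch.where(salient_mask.bool(), sampled, input_ids)


class ImageTokenConsistencyHead(nn.Module):
    """Per-token binary classifier: is each text token consistent with the image?"""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)  # consistent vs. inconsistent

    def forward(self, fused_states: torch.Tensor) -> torch.Tensor:
        # fused_states: (batch, seq_len, hidden_dim) from a cross-modal encoder
        return self.classifier(fused_states)  # (batch, seq_len, 2) logits


def itc_loss(logits: torch.Tensor,
             replaced_mask: torch.Tensor,
             attention_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over every non-padding token, so unmasked tokens also
    supervise the model, unlike CMLM which scores only masked positions."""
    labels = replaced_mask.long()  # 1 = inconsistent (replaced), 0 = consistent
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    per_token = per_token * attention_mask.float()
    return per_token.sum() / attention_mask.float().sum().clamp(min=1.0)
```

Because every token receives a label, this objective addresses both limitations the abstract raises: salient-token replacement counters modality bias, and the per-token loss exploits the unmasked tokens that CMLM leaves unused.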

Authors (6)
  1. Yunhao Gou
  2. Tom Ko
  3. Hansi Yang
  4. James Kwok
  5. Yu Zhang
  6. Mingxuan Wang
Citations (8)


GitHub: https://github.com/gyhdog99/epic