CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising (2112.07515v1)

Published 14 Dec 2021 in cs.CV, cs.AI, cs.CL, and cs.MM

Abstract: BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on the multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs would inevitably introduce noise for cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as the noisy augmentation of primary unmasked ones, CoCo strengthens video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy objective can be further integrated into any BERT-type encoder-decoder structure for video-language pre-training, named as Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.
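The contrastive idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so everything here is an assumption made for illustration rather than the authors' formulation: the InfoNCE form, the temperature value, the way masked and unmasked embeddings are paired, and the hypothetical names `info_nce` and `coco_style_loss`. Embeddings are plain Python lists standing in for encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss: queries[i] should match keys[i];
    every other key in the batch serves as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(queries)

def coco_style_loss(masked_video, video, masked_text, text, temperature=0.1):
    """Assumed combination of the two terms named in the abstract."""
    # Inter-modal matching (assumed pairing): a masked input in one
    # modality is pulled toward the unmasked input of the other modality.
    inter = (info_nce(masked_video, text, temperature)
             + info_nce(masked_text, video, temperature))
    # Intra-modal denoising (assumed pairing): a masked input is pulled
    # toward its own unmasked version within the same modality.
    intra = (info_nce(masked_video, video, temperature)
             + info_nce(masked_text, text, temperature))
    return inter + intra
```

With embeddings whose masked and unmasked versions stay close, the loss is small; pairing each masked input with the wrong unmasked sample raises it, which is the behavior a contrastive objective of this kind relies on.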


Authors (6)

Jianjie Luo (4 papers)
Yehao Li (35 papers)
Yingwei Pan (77 papers)
Ting Yao (127 papers)
Hongyang Chao (34 papers)
Tao Mei (209 papers)

Citations (40)

