Global and Local Semantic Completion Learning for Vision-Language Pre-training (2306.07096v2)
Abstract: Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across modalities. To this end, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to reconstruct the masked tokens from the visible context, thereby learning local-local alignment. However, most of them pay little attention to the global semantic features produced for the masked data, which limits how well the global representations of one modality align with the local features of the other. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task completes the missing semantics of the masked data and recovers global and local features through cross-modal interactions. Our GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which strongly influence downstream performance, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data. To evaluate the proposed approach's cross-modal alignment ability, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder, enabling our model to perform both image-text and video-text multimodal tasks. Experimental results show that our method achieves state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
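To make the MGSC idea concrete (recovering a masked modality's global feature through cross-modal interaction and aligning it with the global feature of the unmasked input), here is a minimal PyTorch-style sketch. It rests on assumptions not stated in the abstract: cross-attention is used as the fusion step and a cosine-distance completion loss is used as the objective. The class name, layer choices, and loss are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedGlobalSemanticCompletionSketch(nn.Module):
    """Illustrative sketch (not the paper's code): recover the global ([CLS]-position)
    representation of a masked modality from the other modality's tokens, then pull it
    toward the global representation computed from the unmasked input."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # Cross-attention lets masked tokens of modality A attend to modality B.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, masked_tokens, context_tokens, target_global):
        # masked_tokens:  (B, N, D) tokens of modality A with some positions masked
        # context_tokens: (B, M, D) tokens of modality B (the other modality)
        # target_global:  (B, D) global feature of modality A from the unmasked input
        fused, _ = self.cross_attn(masked_tokens, context_tokens, context_tokens)
        recovered_global = self.proj(fused[:, 0])  # take the [CLS]-position token
        # Cosine-distance completion loss (one common choice; the paper's may differ).
        loss = 1.0 - F.cosine_similarity(recovered_global,
                                         target_global.detach(), dim=-1).mean()
        return loss

# Hypothetical usage with random tensors, just to show the shapes involved:
B, N, M, D = 4, 32, 16, 768
mgsc = MaskedGlobalSemanticCompletionSketch(dim=D)
loss = mgsc(torch.randn(B, N, D), torch.randn(B, M, D), torch.randn(B, D))
```

MLTC would follow the same pattern at the token level, comparing each recovered modal-fusion token with its counterpart from the unmasked forward pass rather than only the global feature.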
Authors: Rong-Cheng Tu, Yatai Ji, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, Wei Liu