Global and Local Semantic Completion Learning for Vision-Language Pre-training (2306.07096v2)

Published 12 Jun 2023 in cs.CV

Abstract: Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across different modalities. For this purpose, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations to local features of the other modality. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task complements the missing semantics of masked data and recovers global and local features by cross-modal interactions. Our GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which have a great impact on the performance of downstream tasks, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder, enabling our model to simultaneously perform image-text and video-text multimodal tasks. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
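
The abstract describes the two objectives only at a high level, so the sketch below shows one plausible way an MGSC-style global completion loss and an MLTC-style local completion loss could be wired over a cross-modal fusion encoder. Everything here (the toy fusion module, mean pooling as the global feature, the mask ratio, and the cosine/MSE loss forms) is an illustrative assumption, not the authors' released implementation.

# Minimal sketch (plain PyTorch) of global + local semantic completion losses.
# Module names, mask ratio, pooling choice and loss forms below are my own
# illustrative assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleCrossModalEncoder(nn.Module):
    """Toy fusion encoder: concatenate image and text tokens and run a shared
    Transformer over them (a stand-in for the paper's modal-fusion module)."""

    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, img_tokens, txt_tokens):
        fused = self.encoder(torch.cat([img_tokens, txt_tokens], dim=1))
        n_img = img_tokens.size(1)
        return fused[:, :n_img], fused[:, n_img:]


def random_mask(tokens, mask_token, ratio=0.4):
    """Replace a random subset of token embeddings with a learned mask token."""
    b, n, _ = tokens.shape
    mask = torch.rand(b, n, device=tokens.device) < ratio
    masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
    return masked, mask


def completion_losses(encoder, img_tokens, txt_tokens, mask_token):
    """MGSC-like global loss + MLTC-like local loss over masked image tokens."""
    # Targets come from the intact (unmasked) inputs; no gradient through them.
    with torch.no_grad():
        tgt_img, _ = encoder(img_tokens, txt_tokens)
        global_target = tgt_img.mean(dim=1)   # pooled "global semantics" target
        local_target = tgt_img                # per-token targets

    # Forward pass with part of the image tokens masked out.
    masked_img, mask = random_mask(img_tokens, mask_token)
    rec_img, _ = encoder(masked_img, txt_tokens)

    # Global completion: the representation recovered from the masked input
    # should match the global representation of the intact input.
    global_pred = rec_img.mean(dim=1)
    loss_global = 1 - F.cosine_similarity(global_pred, global_target, dim=-1).mean()

    # Local completion: reconstruct only the masked token features from the
    # surviving cross-modal context.
    loss_local = F.mse_loss(rec_img[mask], local_target[mask])

    return loss_global, loss_local


if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 256
    encoder = SimpleCrossModalEncoder(dim)
    mask_token = nn.Parameter(torch.zeros(1, 1, dim))
    img = torch.randn(2, 49, dim)   # e.g. 7x7 patch features per image
    txt = torch.randn(2, 16, dim)   # e.g. 16 word embeddings per caption
    lg, ll = completion_losses(encoder, img, txt, mask_token)
    print(f"global completion loss: {lg.item():.4f}, local: {ll.item():.4f}")

In an actual pre-training loop, the two losses would be weighted and summed into the overall objective alongside the model's other vision-language pre-training tasks; how the targets and masking are defined in the paper itself may differ from this sketch.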

Authors (9)
  1. Rong-Cheng Tu
  2. Yatai Ji
  3. Jie Jiang
  4. Weijie Kong
  5. Chengfei Cai
  6. Wenzhe Zhao
  7. Hongfa Wang
  8. Yujiu Yang
  9. Wei Liu