Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training (2403.00249v1)
Abstract: In vision-language pre-training (VLP), masked image modeling (MIM) has recently been introduced for fine-grained cross-modal alignment. However, in most existing methods, the reconstruction targets for MIM lack high-level semantics, and text is not sufficiently involved in masked modeling. These two drawbacks limit the effect of MIM in facilitating cross-modal semantic alignment. In this work, we propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning. Specifically, to provide more semantically meaningful supervision for MIM, we propose a local semantics enhancing approach, which harvests high-level semantics from global image features via self-supervised agreement learning and transfers them to local patch encodings by sharing the encoding space. Moreover, to achieve deep involvement of text during the entire MIM process, we propose a text-guided masking strategy and devise an efficient way of injecting textual information into both masked modeling and reconstruction target acquisition. Experimental results validate that our method improves the effectiveness of the MIM task in facilitating cross-modal semantic alignment. Compared to previous VLP models with similar model size and data scale, our SemMIM model achieves state-of-the-art or competitive performance on multiple downstream vision-language tasks.
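To make the text-guided masking idea concrete, the sketch below illustrates one plausible way to preferentially mask the image patches most relevant to the paired caption, so that reconstructing them requires textual information. This is a minimal illustration, not the paper's exact procedure; the tensor shapes, the cosine-similarity scoring, the function name `text_guided_mask`, and the mask ratio are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def text_guided_mask(patch_embeds, text_embeds, mask_ratio=0.4):
    """Select patch indices to mask based on text-patch similarity.

    patch_embeds: (B, N, D) image patch encodings
    text_embeds:  (B, T, D) token encodings of the paired caption
    Returns a boolean mask of shape (B, N), True where a patch is masked.
    """
    # Cosine similarity between every patch and every caption token.
    p = F.normalize(patch_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    sim = torch.einsum("bnd,btd->bnt", p, t)        # (B, N, T)
    relevance = sim.max(dim=-1).values               # score of the most related token per patch

    # Mask the patches that are most strongly grounded in the text.
    num_mask = int(patch_embeds.size(1) * mask_ratio)
    top = relevance.topk(num_mask, dim=1).indices    # (B, num_mask)
    mask = torch.zeros_like(relevance, dtype=torch.bool)
    mask.scatter_(1, top, True)
    return mask
```

In an actual VLP pipeline, the masked patches would then be reconstructed against semantically enhanced targets (e.g., from a momentum or self-distilled encoder) rather than raw pixels; the snippet above only covers the text-guided selection step.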
Authors: Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu