Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking (2401.16575v1)
Abstract: The dominant probing approaches rely on the zero-shot performance of image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and others. This work introduces an alternative probing strategy called guided masking. The proposed approach ablates different modalities using masking and measures how accurately the model predicts the masked word. We focus on multimodal models that take region-of-interest (ROI) features obtained from object detectors as input tokens. We probe verb understanding using guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models can predict the correct verb with high accuracy. This contrasts with earlier conclusions, drawn from image-text matching probing, that these models frequently fail in situations requiring verb understanding. The code for all experiments will be publicly available at https://github.com/ivana-13/guided_masking.
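As a rough illustration of the guided-masking protocol described in the abstract, the sketch below masks only the verb in a caption and asks VisualBERT's masked-language-modeling head to recover it, checking whether the ground-truth verb appears among the top-k predictions. This is a minimal sketch, not the paper's exact pipeline: the checkpoint name, the 36 regions, and the 2048-dimensional features are assumptions following the publicly released uclanlp VisualBERT configuration, and the random tensor stands in for the Faster R-CNN ROI features a real evaluation would supply.

```python
# Minimal sketch of guided masking with VisualBERT (HuggingFace Transformers).
# Assumptions: the uclanlp checkpoint and its 2048-d visual embeddings; random
# features stand in for Faster R-CNN ROI features extracted from the image.
import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
model.eval()

caption, verb = "a man throws a frisbee", "throws"

# Guided masking: replace only the verb with the [MASK] token.
inputs = tokenizer(caption.replace(verb, tokenizer.mask_token), return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()

# Placeholder ROI features; a real evaluation would use detector output here.
num_rois, feat_dim = 36, 2048
visual_embeds = torch.randn(1, num_rois, feat_dim)
visual_attention_mask = torch.ones(1, num_rois, dtype=torch.long)
visual_token_type_ids = torch.ones(1, num_rois, dtype=torch.long)

with torch.no_grad():
    out = model(
        **inputs,
        visual_embeds=visual_embeds,
        visual_attention_mask=visual_attention_mask,
        visual_token_type_ids=visual_token_type_ids,
    )

# Text tokens precede visual tokens in the sequence, so mask_pos indexes the
# prediction logits directly; rank the vocabulary for the masked verb slot.
logits = out.prediction_logits[0, mask_pos]
top5 = tokenizer.convert_ids_to_tokens(logits.topk(5).indices.tolist())
print("Gold verb:", verb, "| Top-5 predictions:", top5)
print("Hit@5:", verb in top5)
```

Running the same setup with the visual inputs ablated (for example, zeroing out `visual_embeds`) would indicate how much of the prediction comes from the image rather than from language priors, which is the kind of modality ablation the abstract refers to.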
- VL-InterpreT: An interactive visualization tool for interpreting vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21406–21415.
- VL-Match: Enhancing vision-language pretraining with token-level and instance-level matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2584–2593.
- Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. Transactions of the Association for Computational Linguistics, 9:978–994.
- Measuring progress in fine-grained vision-and-language understanding. arXiv preprint arXiv:2305.07558.
- Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In European Conference on Computer Vision, pages 565–580. Springer.
- Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 397–406.
- gScoreCAM: What objects is CLIP looking at? In Proceedings of the Asian Conference on Computer Vision, pages 1959–1975.
- UNITER: Learning universal image-text representations.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Vision-and-language or vision-for-language? on cross-modal influence in multimodal transformers. arXiv preprint arXiv:2109.04448.
- Saurabh Gupta and Jitendra Malik. 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474.
- Lisa Anne Hendricks and Aida Nematzadeh. 2021. Probing image-language transformers for verb understanding. arXiv preprint arXiv:2106.09141.
- Incorporating structured representations into pretrained vision & language models using scene graphs. arXiv preprint arXiv:2305.06343.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
- Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705.
- VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
- Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer.
- Visual spatial reasoning. arXiv preprint arXiv:2205.00363.
- ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32.
- Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks. arXiv preprint arXiv:2012.12352.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.
- Are vision-language transformers learning multimodal representations? a probing perspective. In AAAI 2022.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
- FOIL it! Find one mismatch between image and language caption. arXiv preprint arXiv:1705.01359.
- FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650.
- Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
- Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248.
- Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. arXiv preprint arXiv:2101.03289.
- Attention is all you need. Advances in neural information processing systems, 30.
- OmniVL: One foundation model for image-language and video-language tasks. Advances in neural information processing systems, 35:5696–5710.
- Improving visual grounding by encouraging consistent gradient-based explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19165–19174.
- What you see is what you read? improving text-image alignment evaluation. arXiv preprint arXiv:2305.10400.
- When and why vision-language models behave like bags-of-words, and what to do about it? arXiv e-prints, arXiv:2210.
- Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276.
- Ivana Beňová
- Michal Gregor
- Martin Tamajka
- Marcel Veselý
- Marián Šimko
- Jana Košecká