Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment (2212.10549v2)

Published 20 Dec 2022 in cs.CL, cs.CV, and cs.LG

Abstract: Despite recent progress towards scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground. We find that a critical component lacking from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., "mug in grass") with spatial relationships in the image (e.g., the position of the mug relative to the grass). To tackle this problem, we show that relation alignment can be enforced by encouraging the directed language attention from 'mug' to 'grass' (capturing the semantic relation 'in') to match the directed visual attention from the mug to the grass. Tokens and their corresponding objects are softly identified using the cross-modal attention. We prove that this notion of soft relation alignment is equivalent to enforcing congruence between vision and language attention matrices under a 'change of basis' provided by the cross-modal attention matrix. Intuitively, our approach projects visual attention into the language attention space to calculate its divergence from the actual language attention, and vice versa. We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.
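To make the 'change of basis' intuition concrete, below is a minimal PyTorch sketch of a congruence-style regularizer in the spirit of the abstract. It is a sketch under assumptions, not the paper's implementation: the function name, the row renormalization, and the choice of KL divergence are all illustrative, and the actual CACR loss may aggregate differently over attention heads and layers.

```python
import torch
import torch.nn.functional as F

def cacr_loss_sketch(attn_lang, attn_vis, attn_cross, eps=1e-8):
    """Illustrative congruence regularizer (not the authors' code).

    attn_lang:  (T, T) language self-attention, rows sum to 1
    attn_vis:   (V, V) visual self-attention, rows sum to 1
    attn_cross: (T, V) cross-modal attention from text tokens to
                image regions, rows sum to 1
    """
    # Change of basis via the cross-modal attention matrix C:
    # project visual attention into the language space as C A_V C^T,
    # and language attention into the visual space as C^T A_L C.
    proj_vis = attn_cross @ attn_vis @ attn_cross.T    # (T, T)
    proj_lang = attn_cross.T @ attn_lang @ attn_cross  # (V, V)

    # Renormalize rows so each projected matrix is again row-stochastic.
    proj_vis = proj_vis / (proj_vis.sum(-1, keepdim=True) + eps)
    proj_lang = proj_lang / (proj_lang.sum(-1, keepdim=True) + eps)

    # Divergence between each modality's actual attention and the
    # attention projected over from the other modality (KL divergence
    # is an assumption here; another divergence could be substituted).
    d_lang = F.kl_div((proj_vis + eps).log(), attn_lang, reduction="batchmean")
    d_vis = F.kl_div((proj_lang + eps).log(), attn_vis, reduction="batchmean")
    return d_lang + d_vis
```

The two symmetric terms mirror the abstract's "and vice versa": visual attention is compared against the actual language attention in the language space, and language attention against the actual visual attention in the visual space; such a regularizer would be added to the model's pretraining or fine-tuning objective.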

References (35)
  1. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, 2020.
  2. VL-InterpreT: An interactive visualization tool for interpreting vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21406–21415, 2022.
  3. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.
  4. Syntax-BERT: Improving pre-trained transformers with syntax trees. arXiv preprint arXiv:2103.04350, 2021.
  5. Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. Transactions of the Association for Computational Linguistics, 9:978–994, 2021. doi: 10.1162/tacl_a_00408. URL https://aclanthology.org/2021.tacl-1.58.
  6. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120. Springer, 2020.
  7. ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration. In Proceedings of the 29th ACM International Conference on Multimedia, pages 797–806, 2021.
  8. Aligning linguistic words and visual semantic units for image captioning. In Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, page 765–773, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450368896. doi: 10.1145/3343031.3350943. URL https://doi.org/10.1145/3343031.3350943.
  9. VLGrammar: Grounded grammar induction of vision and language. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1665–1674, October 2021.
  10. Attention as grounding: Exploring textual and cross-modal attention on entities and relations in language-and-vision transformer. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4062–4073, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.320. URL https://aclanthology.org/2022.findings-acl.320.
  11. Single-stream multi-level alignment for vision-language pretraining. arXiv preprint arXiv:2203.14395, 2022.
  12. Cross-modal alignment learning of vision-language conceptual systems. arXiv preprint arXiv:2208.01744, 2022.
  13. Improving BERT with syntax-aware local attention. arXiv preprint arXiv:2012.15150, 2020.
  14. Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions. arXiv preprint arXiv:2209.03430, 2022.
  15. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  16. From balustrades to Pierre Vinken: Looking for syntax in transformer self-attentions. arXiv preprint arXiv:1906.01958, 2019.
  17. Finding structural knowledge in Multimodal-BERT. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5658–5671, 2022.
  18. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  19. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  20. Learning relation alignment for calibrated cross-modal retrieval. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 514–524, 2021.
  21. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, 2019.
  22. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.
  23. Unsupervised vision-language grammar induction with shared structure modeling. In International Conference on Learning Representations, 2021.
  24. VQA-GNN: Reasoning with multimodal semantic graph for visual question answering. arXiv preprint arXiv:2205.11501, 2022a.
  25. SGEITL: Scene graph enhanced image-text learning for visual commonsense reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 5914–5922, 2022b.
  26. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  27. Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training. Advances in Neural Information Processing Systems, 34:4514–4528, 2021.
  28. Auto-parsing network for image captioning and visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2197–2207, 2021a.
  29. Causal attention for vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9847–9857, June 2021b.
  30. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  31. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 3208–3216, 2021.
  32. When and why vision-language models behave like bag-of-words models, and what to do about it? arXiv preprint arXiv:2210.01936, 2022.
  33. Bryan Zhang. Improve MT for search with selected translation memory using search signals. In Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track), pages 123–131, Orlando, USA, September 2022. Association for Machine Translation in the Americas. URL https://aclanthology.org/2022.amta-upg.9.
  34. Hierarchical vision-language alignment for video captioning. In International Conference on Multimedia Modeling, pages 42–54. Springer, 2019.
  35. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588, 2021.
Authors (5)
  1. Rohan Pandey (13 papers)
  2. Rulin Shao (20 papers)
  3. Paul Pu Liang (103 papers)
  4. Ruslan Salakhutdinov (248 papers)
  5. Louis-Philippe Morency (123 papers)
Citations (13)
