FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues (2403.20026v1)

Published 29 Mar 2024 in cs.CV and cs.CL

Abstract: Multi-modal reasoning plays a vital role in bridging the gap between textual and visual information, enabling a deeper understanding of the context. This paper presents the Feature Swapping Multi-modal Reasoning (FSMR) model, designed to enhance multi-modal reasoning through feature swapping. FSMR leverages a pre-trained vision-language model as an encoder, accommodating both text and image inputs for effective feature representation from both modalities. It introduces a unique feature swapping module, enabling the exchange of features between identified objects in images and corresponding vocabulary words in text, thereby enhancing the model's comprehension of the interplay between images and text. To further bolster its multi-modal alignment capabilities, FSMR incorporates a multi-modal cross-attention mechanism, facilitating the joint modeling of textual and visual information. During training, we employ image-text matching and cross-entropy losses to ensure semantic consistency between visual and language elements. Extensive experiments on the PMR dataset demonstrate FSMR's superiority over state-of-the-art baseline models across various performance metrics.
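
The mechanisms named in the abstract (swapping features between aligned word tokens and detected objects, cross-modal attention, and a combined image-text matching plus cross-entropy objective) can be illustrated with a small PyTorch sketch. Everything below, including the module names, the alignment-index format, and the equal loss weighting, is an illustrative assumption rather than the authors' released implementation.

```python
# Minimal sketch of the feature-swapping idea described in the abstract.
# Module names, dimensions, and the alignment format are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureSwap(nn.Module):
    """Exchange hidden states between aligned word tokens and object regions."""

    def forward(self, text_feats, obj_feats, alignment):
        # text_feats: (B, T, D) token features from the text side of the encoder
        # obj_feats:  (B, O, D) region features for detected objects
        # alignment:  (B, T) index of the object aligned to each token,
        #             or -1 when a token has no matching object (assumed format)
        swapped_text = text_feats.clone()
        swapped_obj = obj_feats.clone()
        for b in range(text_feats.size(0)):
            for t in range(text_feats.size(1)):
                o = int(alignment[b, t])
                if o >= 0:
                    # Swap the two vectors so each modality carries the
                    # other's representation of the same entity.
                    swapped_text[b, t] = obj_feats[b, o]
                    swapped_obj[b, o] = text_feats[b, t]
        return swapped_text, swapped_obj


class CrossModalBlock(nn.Module):
    """Cross-attention letting text queries attend over object regions."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, obj_feats):
        attended, _ = self.attn(query=text_feats, key=obj_feats, value=obj_feats)
        return self.norm(text_feats + attended)


def joint_loss(match_logits, match_labels, answer_logits, answer_labels):
    # Image-text matching loss plus answer cross-entropy, as the abstract
    # describes; the equal weighting between the two terms is an assumption.
    itm = F.cross_entropy(match_logits, match_labels)
    ce = F.cross_entropy(answer_logits, answer_labels)
    return itm + ce
```

In this reading, the swapped token and region sequences would pass through the cross-attention block before the matching and answer heads, and the two losses would be summed during training.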
