EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning (2404.13847v1)

Published 22 Apr 2024 in cs.CV and cs.CL

Abstract: Visual Commonsense Reasoning (VCR) is a cognitive task that challenges models to answer visual questions requiring human commonsense and to provide rationales explaining why the answers are correct. With the emergence of LLMs, it is natural and imperative to explore their applicability to VCR. However, the VCR task demands more external knowledge to tackle its challenging questions, necessitating special designs to activate LLMs' commonsense reasoning abilities. Also, most existing Multimodal LLMs adopt an abstraction of the entire input image, which makes it difficult to comprehend VCR's unique co-reference tags between image regions and text, posing challenges for fine-grained alignment. To address these issues, we propose EventLens, which leverages Event-Aware Pretraining and Cross-modal Linking and EnhanceS VCR. First, by emulating the cognitive process of human reasoning, an Event-Aware Pretraining auxiliary task is introduced to better activate the LLM's global comprehension of intricate scenarios. Second, during fine-tuning, we further utilize reference tags to bridge RoI features with texts while preserving both modality semantics. Finally, we use instruct-style prompts to narrow the gap between pretraining and fine-tuning, and task-specific adapters to better integrate the LLM's inherent knowledge with new commonsense. Experimental results show the effectiveness of our proposed auxiliary task and fine-grained linking strategy.
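
The abstract's fine-grained linking idea can be pictured with a short sketch. The code below is purely illustrative, not the paper's implementation: the module name CrossModalLinker, the feature dimensions, and the additive fusion are assumptions. It shows one plausible way a co-reference tag such as [person1] in a VCR question could be tied to its RoI feature while the tag's text embedding is preserved.

```python
import torch
import torch.nn as nn

class CrossModalLinker(nn.Module):
    """Illustrative sketch (not the authors' code): inject each region's
    RoI feature into the text embedding of its co-reference tag, so tags
    like [person1] stay aligned with the matching image region while both
    modalities keep their own semantics."""

    def __init__(self, roi_dim: int = 1024, text_dim: int = 768):
        super().__init__()
        # Project RoI features into the text embedding space (assumed sizes).
        self.roi_proj = nn.Linear(roi_dim, text_dim)

    def forward(self,
                text_emb: torch.Tensor,                 # (seq_len, text_dim)
                roi_feats: torch.Tensor,                # (num_regions, roi_dim)
                tag_positions: list) -> torch.Tensor:   # [(token_idx, region_idx), ...]
        """tag_positions maps each co-reference tag token in the text to the
        index of the detected region it refers to."""
        fused = text_emb.clone()
        projected = self.roi_proj(roi_feats)
        for tok_idx, region_idx in tag_positions:
            # Additive fusion keeps the tag's text semantics and adds the
            # corresponding region's visual semantics on top.
            fused[tok_idx] = fused[tok_idx] + projected[region_idx]
        return fused

# Toy usage with random tensors standing in for real encoder outputs.
linker = CrossModalLinker()
text = torch.randn(12, 768)        # embeddings of a tokenized question
regions = torch.randn(3, 1024)     # RoI features for three detected objects
out = linker(text, regions, tag_positions=[(2, 0), (7, 1)])
print(out.shape)                   # torch.Size([12, 768])
```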

Authors (4)
  1. Mingjie Ma (2 papers)
  2. Yichao Ma (3 papers)
  3. Guohui Li (12 papers)
  4. Zhihuan Yu (2 papers)