Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment (2402.13561v2)

Published 21 Feb 2024 in cs.CL and cs.CV

Abstract: Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely used visual-language projection approaches (e.g., Q-former or MLP) focus on the alignment of image-text descriptions yet ignore the visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, helping improve the accuracy of answers to knowledge-based visual questions. In this paper, we mainly explore improving LMMs with visual-language knowledge alignment, especially aimed at challenging knowledge-based visual question answering (VQA). To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA based on the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. The FKA is employed to distill the fine-grained visual knowledge of an image and inject it into LLMs. We conduct extensive experiments on knowledge-based VQA benchmarks, and the results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (an average gain of 5.0%). Ablation studies further verify the effectiveness of both VKA and FKA. The code is available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper
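The abstract describes two components: a Visual Knowledge Aligner (VKA) that maps visual features into knowledge-aligned tokens via interaction between a visual encoder and a small language model, and a Fine-grained Knowledge Adapter (FKA) that distills those tokens before injecting them into the LLM during instruction tuning. The snippet below is a minimal, hypothetical sketch of how such modules could be wired together; the class names, dimensions, and the cross-attention/bottleneck designs are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class VisualKnowledgeAligner(nn.Module):
    """Hypothetical VKA sketch: learned queries cross-attend over visual
    patch features and are projected into the LLM's embedding space.
    All dimensions are illustrative."""
    def __init__(self, vis_dim=1024, lm_dim=768, llm_dim=4096, n_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, lm_dim))
        self.cross_attn = nn.MultiheadAttention(
            lm_dim, num_heads=8, kdim=vis_dim, vdim=vis_dim, batch_first=True
        )
        self.proj = nn.Linear(lm_dim, llm_dim)  # project into LLM embedding space

    def forward(self, vis_feats):                      # vis_feats: (B, N_patches, vis_dim)
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        aligned, _ = self.cross_attn(q, vis_feats, vis_feats)
        return self.proj(aligned)                      # (B, n_queries, llm_dim)

class FineGrainedKnowledgeAdapter(nn.Module):
    """Hypothetical FKA sketch: a lightweight residual bottleneck that
    refines the aligned knowledge tokens before they reach the LLM."""
    def __init__(self, dim=4096, bottleneck=512):
        super().__init__()
        self.down, self.up, self.act = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim), nn.GELU()

    def forward(self, knowledge_tokens):               # (B, n_queries, dim)
        return knowledge_tokens + self.up(self.act(self.down(knowledge_tokens)))

# Usage: prepend adapted knowledge tokens to the LLM's text embeddings.
vka, fka = VisualKnowledgeAligner(), FineGrainedKnowledgeAdapter()
vis_feats = torch.randn(2, 256, 1024)                  # e.g. ViT patch features
text_embeds = torch.randn(2, 16, 4096)                 # token embeddings from the LLM
knowledge = fka(vka(vis_feats))
llm_inputs = torch.cat([knowledge, text_embeds], dim=1)
print(llm_inputs.shape)                                # torch.Size([2, 48, 4096])
```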

Authors (5)
  1. Yunxin Li (29 papers)
  2. Xinyu Chen (65 papers)
  3. Baotian Hu (67 papers)
  4. Haoyuan Shi (13 papers)
  5. Min Zhang (630 papers)
Citations (2)