Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment (2402.13561v2)
Abstract: Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely used visual-language projection approaches (e.g., the Q-Former or an MLP) focus on aligning images with text descriptions yet ignore visual knowledge alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, and it helps improve the accuracy of answers to knowledge-based visual questions. In this paper, we explore improving LMMs with visual-language knowledge alignment, particularly for challenging knowledge-based visual question answering (VQA). To this end, we present the Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction-tuning stage. Specifically, we design the VKA around the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. The FKA is employed to distill fine-grained visual knowledge from an image and inject it into the LLM. We conduct extensive experiments on knowledge-based VQA benchmarks, and the results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (an average gain of 5.0%). Ablation studies further verify the effectiveness of the VKA and FKA. The code is available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper
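The abstract describes the two CVLM modules (VKA and FKA) only at a high level. Below is a minimal PyTorch sketch of how such a pipeline could be wired together, assuming generic feature dimensions; the class names, attention layout, and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the CVLM components described in the abstract.
# All module names and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn


class VisualKnowledgeAligner(nn.Module):
    """Projects visual features into a knowledge space by cross-attending
    to hidden states produced by a small (frozen) language model."""

    def __init__(self, vis_dim=1024, llm_dim=2048, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(llm_dim, llm_dim)

    def forward(self, vis_feats, knowledge_hidden):
        # vis_feats: (B, N_patches, vis_dim); knowledge_hidden: (B, N_tokens, llm_dim)
        q = self.vis_proj(vis_feats)
        aligned, _ = self.cross_attn(q, knowledge_hidden, knowledge_hidden)
        return self.out_proj(aligned)  # visual tokens enriched with knowledge


class FineGrainedKnowledgeAdapter(nn.Module):
    """Distills fine-grained knowledge features into a fixed number of
    query tokens that can be injected into the LMM's input sequence."""

    def __init__(self, llm_dim=2048, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, knowledge_feats):
        # knowledge_feats: (B, N, llm_dim) -> (B, num_queries, llm_dim)
        q = self.queries.unsqueeze(0).expand(knowledge_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, knowledge_feats, knowledge_feats)
        return self.mlp(pooled)


if __name__ == "__main__":
    B = 2
    vis = torch.randn(B, 256, 1024)   # e.g., ViT patch features
    know = torch.randn(B, 64, 2048)   # small-LLM hidden states over knowledge text
    vka = VisualKnowledgeAligner()
    fka = FineGrainedKnowledgeAdapter()
    knowledge_tokens = fka(vka(vis, know))  # extra tokens for the LMM input
    print(knowledge_tokens.shape)           # torch.Size([2, 32, 2048])
```

In this sketch, the VKA is pretrained on image-knowledge pairs while the FKA's distilled tokens would be prepended to the LMM input during instruction tuning, mirroring the two-stage role described in the abstract.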
Authors: Yunxin Li, Xinyu Chen, Baotian Hu, Haoyuan Shi, Min Zhang