Analysis of the Paper: LION - Empowering Multimodal LLMs with Dual-Level Visual Knowledge
The paper "LION: Empowering Multimodal LLMs with Dual-Level Visual Knowledge" presents an approach to enhancing Multimodal Large Language Models (MLLMs) by integrating two levels of visual knowledge. This integration addresses a prevalent limitation of existing MLLMs, namely insufficient extraction of and reasoning over visual information, since they primarily rely on vision encoders pretrained on coarsely aligned image-text pairs.
Key Contributions
The authors introduce the LION framework, which focuses on two main levels of visual knowledge enhancement:
- Progressive Incorporation of Fine-Grained Spatial-Aware Visual Knowledge:
- The authors devise a vision aggregator, trained with region-level vision-language (VL) tasks, to embed fine-grained spatial visual knowledge into the MLLM. To resolve conflicts between image-level and region-level VL tasks, they adopt a stage-wise instruction-tuning strategy built on a mixture-of-adapters, which not only mitigates task incompatibility but also promotes mutual enhancement across the different types of VL tasks (a minimal sketch of such a layer follows this list).
- Soft Prompting of High-Level Semantic Visual Evidence:
- LION leverages diverse image tags to infuse high-level semantic knowledge into the model. To mitigate inaccuracies in the predicted tags, the paper proposes a soft prompting technique that prepends a learnable token to the text instruction, further improving the MLLM's semantic understanding (sketched after this list).
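
The paper itself does not publish the exact layer design here, so the following is a minimal sketch of what a mixture-of-adapters layer could look like, assuming LoRA-style adapters attached to a frozen projection and softly routed between image-level and region-level task families; all module names, ranks, and dimensions are illustrative rather than taken from the LION codebase.

```python
# Hypothetical mixture-of-adapters layer: a frozen base projection plus
# task-routed low-rank adapters. Illustrative only, not LION's actual code.
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    """Low-rank adapter added on top of a frozen linear projection."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero perturbation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class MixtureOfAdaptersLayer(nn.Module):
    """Frozen base projection with a soft router over per-task adapters.

    A router mixes the image-level and region-level adapters, letting the
    two task families share the backbone without overwriting each other.
    """

    def __init__(self, dim: int, num_adapters: int = 2, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad_(False)  # backbone projection stays frozen
        self.adapters = nn.ModuleList(
            [LoRAAdapter(dim, rank) for _ in range(num_adapters)]
        )
        self.router = nn.Linear(dim, num_adapters)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x), dim=-1)              # (..., A)
        delta = torch.stack([a(x) for a in self.adapters], dim=-1)   # (..., D, A)
        return self.base(x) + (delta * weights.unsqueeze(-2)).sum(-1)


layer = MixtureOfAdaptersLayer(dim=768)
tokens = torch.randn(2, 16, 768)   # batch of visual/text token embeddings
print(layer(tokens).shape)         # torch.Size([2, 16, 768])
```

In a stage-wise instruction-tuning setup, one could train the adapters (and router) on image-level and region-level instruction data in separate stages while keeping the backbone frozen throughout.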
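
Similarly, the soft prompting of image tags can be pictured as a single trainable embedding prepended to the embedded tag text; this is a sketch under that assumption, with the embedding table, vocabulary size, and class names invented for illustration.

```python
# Hypothetical soft-prompted tag injection: a learnable token prefixed to the
# embedded image tags so the model can learn how much to trust noisy tags.
import torch
import torch.nn as nn


class SoftTagPrompt(nn.Module):
    """Prepends a trainable soft token to embedded image-tag text."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.soft_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)

    def forward(self, tag_embeds: torch.Tensor) -> torch.Tensor:
        # tag_embeds: (batch, num_tag_tokens, embed_dim)
        prefix = self.soft_token.expand(tag_embeds.size(0), -1, -1)
        return torch.cat([prefix, tag_embeds], dim=1)


embed = nn.Embedding(32000, 768)                  # stand-in for the LLM embedding table
tag_token_ids = torch.randint(0, 32000, (2, 6))   # e.g. tokenized "dog, frisbee, park"
prompt = SoftTagPrompt(embed_dim=768)
print(prompt(embed(tag_token_ids)).shape)         # torch.Size([2, 7, 768])
```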
Experimental Validation
The authors support their claims with extensive empirical validation across multiple multimodal benchmarks. Notably, LION improves accuracy on Visual Spatial Reasoning (VSR) by 5% and CIDEr on TextCaps by 3% compared to InstructBLIP, and gains 5% accuracy on RefCOCOg over Kosmos-2. These results substantiate LION's effectiveness across a range of VL tasks.
Implications and Future Directions
The approach outlined in this paper has significant implications for building more capable MLLMs. By addressing limitations in visual knowledge extraction and integrating multiple levels of visual information, the LION framework sets a precedent for future research aimed at refining the perceptual and cognitive capabilities of AI models.
From a theoretical perspective, the dual-level visual knowledge enhancement could be explored further to potentially integrate additional dimensions of visual understanding, such as temporal dynamics in visual data. Practically, the application of LION's advanced MLLMs could extend across domains requiring intricate visual-textual reasoning, including autonomous navigation, advanced robotics, and more nuanced AI-human interactions.
As AI continues to evolve, the mechanisms introduced in LION for enhancing multimodal learning constitute a promising step towards creating robust models capable of more sophisticated interactions and interpretations, ultimately bridging the gap between human and machine understanding in the visual and linguistic domains.