Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models (2308.13437v2)
Abstract: Recently, Multimodal Large Language Models (MLLMs), which enable LLMs to interpret images through visual instruction tuning, have achieved significant success. However, existing visual instruction tuning methods only utilize image-language instruction data to align the language and image modalities, and thus lack a more fine-grained cross-modal alignment. In this paper, we propose Position-enhanced Visual Instruction Tuning (PVIT), which extends the functionality of MLLMs by integrating an additional region-level vision encoder, enabling a more detailed comprehension of images. In addition, to efficiently achieve fine-grained alignment between the vision modules and the LLM, we design multiple data generation strategies to construct an image-region-language instruction dataset. Finally, we present both quantitative experiments and qualitative analyses that demonstrate the superiority of the proposed model. Code and data will be released at https://github.com/PVIT-official/PVIT.
- Chi Chen (62 papers)
- Ruoyu Qin (3 papers)
- Fuwen Luo (14 papers)
- Xiaoyue Mi (9 papers)
- Peng Li (390 papers)
- Maosong Sun (337 papers)
- Yang Liu (2253 papers)
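
The abstract describes attaching a region-level vision encoder alongside the usual image encoder and feeding both into the LLM. Below is a minimal sketch, assuming a frozen vision backbone whose feature map is pooled with RoIAlign and projected into the LLM embedding space; the module and names (`RegionPromptBuilder`, `patch_proj`, `region_proj`) and the dimensions are illustrative assumptions, not the released PVIT implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class RegionPromptBuilder(nn.Module):
    """Hypothetical sketch: build a soft prompt of global image tokens plus
    region-level tokens that can be concatenated with the text embeddings of
    an instruction before it is passed to the LLM."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Linear projections mapping vision features into the LLM token space.
        self.patch_proj = nn.Linear(vision_dim, llm_dim)   # global image tokens
        self.region_proj = nn.Linear(vision_dim, llm_dim)  # region-level tokens

    def forward(self, feat_map: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feat_map: (1, C, H, W) feature map from the vision backbone
        # boxes:    (N, 4) region boxes in feature-map coordinates (x1, y1, x2, y2)
        b, c, h, w = feat_map.shape

        # Global image tokens: flatten the spatial grid into a patch sequence.
        patch_tokens = feat_map.flatten(2).transpose(1, 2)           # (1, H*W, C)
        patch_tokens = self.patch_proj(patch_tokens)                 # (1, H*W, D)

        # Region tokens: one pooled vector per box via RoIAlign.
        rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index
        region_feats = roi_align(feat_map, rois, output_size=1)      # (N, C, 1, 1)
        region_feats = region_feats.flatten(1)                       # (N, C)
        region_tokens = self.region_proj(region_feats).unsqueeze(0)  # (1, N, D)

        # The combined prefix would be prepended to the instruction's text embeddings.
        return torch.cat([patch_tokens, region_tokens], dim=1)       # (1, H*W + N, D)


if __name__ == "__main__":
    builder = RegionPromptBuilder(vision_dim=1024, llm_dim=4096)
    feat_map = torch.randn(1, 1024, 16, 16)        # e.g. a 16x16 ViT feature grid
    boxes = torch.tensor([[2.0, 3.0, 8.0, 10.0]])  # one region of interest
    prefix = builder(feat_map, boxes)
    print(prefix.shape)                            # torch.Size([1, 257, 4096])
```

The design choice illustrated here is that region positions enter the model as extra soft tokens rather than as text coordinates, which is one way a region-level encoder's output could be aligned with the LLM during the fine-grained instruction tuning the paper describes.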