Enhancing Fine-Grained Image Understanding in Multi-Modal LLMs Through Referential Comprehension
The paper "Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs" presents a novel approach to enhance the fine-grained image understanding abilities of Multi-Modal LLMs (MLLMs). Despite the capabilities of MLLMs in various multi-modal tasks, their performance in tasks requiring detailed image understanding has been suboptimal. The proposed framework seeks to address this challenge by incorporating referential comprehension (RC) tasks into the instruction tuning phase, allowing MLLMs to better interpret fine-grained visual elements.
Methodology Overview
The core of the proposed framework includes a sophisticated dataset construction method and an efficient strategy to adjust the visual encoder during instruction tuning:
- Dataset Construction: The paper introduces a cost-effective way to build a comprehensive instruction tuning dataset by leveraging annotations from existing datasets, producing the high-quality instruction data needed for detailed image perception. Dense object annotations are extended into referring-expression-bounding-box pairs, enriching the dataset with tasks that cover fundamental abilities such as visual relation reasoning, spatial reasoning, object counting, and object detection (a toy conversion sketch follows this list).
- Self-Consistent Bootstrapping: A method termed "self-consistent bootstrapping" is used to ensure the accuracy and quality of the generated data. It leverages the MLLM itself to extend existing dense annotations into high-quality RC data, so that varied instruction-following data can be produced efficiently (see the consistency-check sketch below).
- Visual Encoder Tuning: The research emphasizes the need to adapt the visual encoder during instruction tuning. By inserting lightweight trainable components such as Adapters and LoRA, the method avoids degrading the encoder's pretrained semantics while adding the capacity needed for detailed image understanding, without requiring extensive data or compute (see the LoRA sketch below).
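To make the dataset construction concrete, the sketch below shows one way existing box annotations could be turned into RC-style instruction data for counting, grounding, and spatial reasoning. The annotation layout, task templates, and helper names are illustrative assumptions, not the paper's actual pipeline.

```python
# Toy COCO-style annotations: each object has a category and a box (x, y, w, h).
# These entries and templates are illustrative stand-ins for the existing
# detection / dense-caption annotations the paper builds on.
annotations = [
    {"category": "dog", "bbox": [34, 50, 120, 90]},
    {"category": "dog", "bbox": [300, 60, 110, 95]},
    {"category": "frisbee", "bbox": [180, 20, 40, 40]},
]

def to_corner_format(bbox):
    """Convert (x, y, w, h) to (x1, y1, x2, y2) for textual grounding targets."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

def make_counting_sample(category):
    count = sum(a["category"] == category for a in annotations)
    return {"instruction": f"How many {category}s are in the image?",
            "answer": str(count)}

def make_grounding_sample(ann):
    x1, y1, x2, y2 = to_corner_format(ann["bbox"])
    return {"instruction": f"Provide the bounding box of the {ann['category']}.",
            "answer": f"[{x1},{y1},{x2},{y2}]"}

def make_spatial_sample(a, b):
    # A crude left/right relation from box centers; a real pipeline would be richer.
    center_a = a["bbox"][0] + a["bbox"][2] / 2
    center_b = b["bbox"][0] + b["bbox"][2] / 2
    relation = "to the left of" if center_a < center_b else "to the right of"
    return {"instruction": f"Where is the {a['category']} relative to the {b['category']}?",
            "answer": f"The {a['category']} is {relation} the {b['category']}."}

samples = [make_counting_sample("dog"),
           make_grounding_sample(annotations[2]),
           make_spatial_sample(annotations[0], annotations[2])]
for s in samples:
    print(s)
```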
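The self-consistency idea can be pictured as a describe-then-relocate loop: the model produces a referring expression for an annotated box, is then asked to ground its own expression, and the pair is kept only if the re-predicted box overlaps the original. The following is a minimal sketch of that filtering logic; `describe` and `locate` are hypothetical stand-ins for the MLLM's region captioning and grounding interfaces, and the IoU threshold is an assumed value rather than the paper's setting.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def bootstrap_rc_pairs(image, boxes, describe, locate, iou_threshold=0.5):
    """Keep (expression, box) pairs only when the model can re-locate its own description.

    `describe(image, box)` and `locate(image, expression)` are hypothetical
    wrappers around the MLLM's region captioning and grounding abilities.
    """
    kept = []
    for box in boxes:
        expression = describe(image, box)          # e.g. "the brown dog lying on the grass"
        predicted_box = locate(image, expression)  # ask the model to ground its own text
        if iou(box, predicted_box) >= iou_threshold:
            kept.append({"expression": expression, "bbox": box})
    return kept
```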
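The visual-encoder adaptation can be illustrated with a minimal LoRA-style wrapper around a single linear layer: the pretrained weights stay frozen and only a low-rank update is trained. This is a generic PyTorch sketch assuming a CLIP-like encoder with 1024-dimensional projections; the rank, scaling, and placement are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + scale * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # keep the pretrained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap one projection of a (hypothetical) CLIP-style visual encoder
# so that only the LoRA weights receive gradients.
proj = nn.Linear(1024, 1024)
adapted = LoRALinear(proj, rank=8)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(f"trainable parameters in this layer: {trainable}")
```

Because only such low-rank weights (and similar lightweight adapters) are updated, the overall trainable parameter count stays small, which is consistent with the efficiency figures reported below.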
Experimental Validation
The proposed model, named "Pink," is evaluated extensively and outperforms existing methods, particularly on RC tasks. It achieves a 5.2% accuracy improvement over Qwen-VL on the GQA benchmark and a 24.7% accuracy gain over Kosmos-2 on the RefCOCO_val dataset, and it ranks first on the MMBench leaderboard. These results are obtained using only publicly accessible data and a small set of trainable parameters (6.7M), underlining the efficiency and reproducibility of the framework.
Implications and Future Directions
The paper makes significant contributions to the field of AI and multi-modal learning by paving the way for more nuanced visual understanding through MLLMs. The method's reliance on publicly available datasets and its adaptability for use on consumer-grade GPUs underscore its potential for widespread academic application and replication.
In terms of future work, the paper suggests expanding the diversity of RC tasks further, which could improve the model's ability to generalize across tasks. Exploring more advanced fine-tuning techniques, such as integrating more sophisticated components into the visual encoder, could yield even finer-grained image understanding.
Overall, the paper provides a valuable framework that not only enhances the capacity of current MLLMs but also offers insights into efficiently optimizing vision-LLMs for more detailed comprehension tasks without excessive resource demands.