
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest (2307.03601v3)

Published 7 Jul 2023 in cs.CV

Abstract: Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancement to fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces references to regions of interest (RoIs) in the instruction. Before being sent to the LLM, each reference is replaced by RoI features and interleaved with language embeddings as a sequence. Our model, GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: users can interact with the model through both language and drawn bounding boxes, flexibly adjusting the referring granularity. (2) Versatile multimodal abilities: GPT4RoI can mine a variety of attribute information within each RoI, e.g., color, shape, material, and action, and can reason about multiple RoIs using common sense. On the Visual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves a remarkable accuracy of 81.6%, surpassing all existing models by a significant margin (the second place is 75.6%) and almost reaching human-level performance of 85.0%. The code, dataset, and demo can be found at https://github.com/jshilong/GPT4RoI.

Insights on "GPT4RoI: Instruction Tuning LLM on Region-of-Interest"

The paper "GPT4RoI: Instruction Tuning LLM on Region-of-Interest" introduces an advanced model for fine-grained multimodal understanding by enhancing LLMs with spatial instruction tuning. This research highlights a novel approach in efficiently aligning region-level visual features with language embeddings, stepping beyond the conventional image-text pair paradigm.

Model Architecture and Methodology

GPT4RoI builds on state-of-the-art components, including a CLIP-based vision encoder and the Vicuna model for language understanding. A key contribution is the extraction of region-level features with RoIAlign over fused multi-level feature maps and their injection into the instruction: designated region tokens in the user prompt are replaced by the corresponding RoI features, yielding a sequence that interleaves spatial and language embeddings before it reaches the LLM (see the sketch below).
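As a rough illustration of this interleaving step, the following PyTorch sketch pools each box with RoIAlign and swaps the result into the placeholder positions of the language embedding sequence. Names such as IMAGE_SIZE, REGION_TOKEN_ID, and the projection layer are assumptions for the example, not the released code:

```python
# Minimal sketch of region-feature extraction and token replacement
# (illustrative only; not the authors' exact implementation).
import torch
from torchvision.ops import roi_align

IMAGE_SIZE = 224          # assumed input resolution
REGION_TOKEN_ID = 32001   # hypothetical id of the <region> placeholder token

def embed_with_regions(token_ids, token_embeds, feat_map, boxes, proj):
    """token_ids: (L,) ids; token_embeds: (L, D) language embeddings;
    feat_map: (1, C, H, W) vision-encoder feature map;
    boxes: (N, 4) RoIs in image coordinates; proj: nn.Linear(C * 7 * 7, D)."""
    scale = feat_map.shape[-1] / IMAGE_SIZE
    rois = roi_align(feat_map, [boxes], output_size=7,
                     spatial_scale=scale, aligned=True)   # (N, C, 7, 7)
    region_embeds = proj(rois.flatten(1))                 # (N, D)
    out = token_embeds.clone()
    out[token_ids == REGION_TOKEN_ID] = region_embeds     # interleave with text
    return out
```

In the actual model, features from several encoder levels are fused before pooling; the single feature map here just keeps the sketch short.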

The architecture supports interaction beyond language-only input: users can specify regions of interest directly by drawing bounding boxes, adjusting the referring granularity of the conversation. This capability is evaluated across several datasets that probe the model's ability to understand and reason about specific regions within images; an example of the resulting prompt format is sketched below.
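For concreteness, a spatial instruction might look as follows; the placeholder syntax is assumed here for illustration and may differ from the released prompts:

```python
# Each drawn box is paired with a <region{i}> placeholder in the text; the
# placeholder embedding is later replaced by that box's RoI feature.
instruction = "What is <region1> doing, and how does it relate to <region2>?"
boxes = [
    [48, 120, 210, 365],   # region1: [x1, y1, x2, y2] in pixel coordinates
    [300, 90, 470, 400],   # region2
]
```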

Training and Dataset Utilization

Training proceeds in two stages: an initial alignment stage on simple region-text pairs, followed by fine-tuning on complex concepts and reasoning tasks. Datasets such as COCO and Visual Genome are converted into a spatial instruction format so that interactions can reference individual regions, and incorporating LLaVA150K further strengthens multi-round dialogue (a conversion sketch follows).
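A rough sketch of such a conversion, under an assumed record schema (the field names and question templates are illustrative, not the paper's exact ones):

```python
# Convert a region-text pair (e.g. a Visual Genome region description)
# into a single-turn spatial instruction sample.
import random

QUESTION_TEMPLATES = [
    "What can you see in <region{idx}>?",
    "Briefly describe <region{idx}>.",
]

def to_spatial_instruction(sample):
    """sample: {'bbox': [x1, y1, x2, y2], 'caption': str} -- assumed schema."""
    question = random.choice(QUESTION_TEMPLATES).format(idx=1)
    return {
        "boxes": [sample["bbox"]],
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": sample["caption"]},
        ],
    }
```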

Training uses the standard next-token prediction loss, through which the region features are aligned with the surrounding linguistic context. This pushes the model beyond mere category recognition toward understanding fine-grained details of each region (see the loss sketch below).
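A minimal sketch of that objective, assuming the common instruction-tuning convention of masking prompt and placeholder positions with -100 so that only response tokens contribute to the loss:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, labels):
    """logits: (B, L, V) LLM outputs; labels: (B, L) with -100 on masked positions."""
    shift_logits = logits[:, :-1, :].contiguous()   # predict token t+1 from token t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1), ignore_index=-100)
```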

Results and Findings

Empirical results show that GPT4RoI excels at tasks requiring fine-grained understanding. On the Visual Commonsense Reasoning (VCR) benchmark it reaches 81.6% accuracy, well above the previous best of 75.6% and close to the human level of 85.0%. The model also performs strongly on region captioning and region-level reasoning, underscoring its comprehension and reasoning capabilities.

Implications and Future Work

The research has substantial ramifications for developing more interactive and precise AI systems in multimodal understanding. The ability to reference and reason about specific regions opens up possibilities for applications that necessitate detailed visual understanding, paving the way for more intuitive AI interactions.

Future developments could focus on expanding region-level datasets and refining model architectures to further enhance performance. Exploring semi-supervised techniques for generating region-level data and developing diverse interaction modes could enable a more comprehensive understanding of visual content.

Overall, GPT4RoI marks a significant advancement in multimodal AI, demonstrating a seamless integration of spatial and linguistic processing. This work sets the stage for further exploration and refinement of vision-language models.

Authors (9)
  1. Shilong Zhang
  2. Peize Sun
  3. Shoufa Chen
  4. Min Xiao
  5. Wenqi Shao
  6. Wenwei Zhang
  7. Yu Liu
  8. Kai Chen
  9. Ping Luo