Enhancing Multimodal LLMs with Set-of-Mark Prompting
Introduction to Set-of-Mark Prompting
Recent developments in Multimodal LLMs (MLLMs) such as GPT-4V have demonstrated strong capabilities in multimodal reasoning and interaction, notably through the Set-of-Mark (SoM) prompting technique. SoM strengthens the link between visual content and text by overlaying alphanumeric tags on visual objects in an image, so that language can refer to each object by its tag. This paper examines the challenges MLLMs face in understanding SoM, proposes a training paradigm to equip MLLMs with SoM capabilities, and assesses the impact of this training on their performance.
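As a rough illustration of the mechanics, SoM prompting can be approximated by overlaying numbered marks on image regions and then referring to those numbers in the text prompt. The sketch below is a minimal Python approximation using Pillow; the image path, the bounding boxes, and the prompt wording are hypothetical, and real SoM pipelines typically place marks on segmentation masks rather than simple boxes.

```python
from PIL import Image, ImageDraw

def overlay_set_of_mark(image_path, regions):
    """Overlay numeric tags on an image, one per region.

    `regions` is a list of (x, y, w, h) boxes. Real SoM pipelines usually
    derive regions from segmentation masks (e.g., SAM); boxes keep this
    illustration simple.
    """
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for tag_id, (x, y, w, h) in enumerate(regions, start=1):
        draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
        # Place the alphanumeric tag near the region's top-left corner.
        draw.rectangle([x, y, x + 24, y + 18], fill="red")
        draw.text((x + 4, y + 2), str(tag_id), fill="white")
    return img

# Hypothetical usage: the file name and boxes are placeholders.
tagged = overlay_set_of_mark("kitchen.jpg", [(40, 60, 120, 90), (200, 80, 80, 150)])
tagged.save("kitchen_som.jpg")

# The tagged image is then sent to an MLLM together with a prompt that
# refers to the marks, for example:
prompt = "In the image, what is the object labeled 2, and what is it next to?"
```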
Methodology
Issue with Existing MLLMs
The paper identifies that while GPT-4V can use SoM effectively, other MLLMs such as LLaVA-1.5 and several commercial systems struggle with it. SoM prompting requires a model to recognize the overlaid alphanumeric tags, identify the objects in the image, and associate each tag with the object it marks; the core challenge lies in this third capability, the tag-to-object association. The paper hypothesizes that the lack of explicit training data containing SoM-like annotations limits the effectiveness of SoM in these models.
Proposed Training Paradigm
To address this, the authors introduce a "list items one by one" learning paradigm. The model is trained to enumerate and describe the tagged visual objects in an image, following the alphanumeric order of their tags. To support this, a dataset of 10k to 30k images was created by overlaying alphanumeric tags on the images and prompting GPT-4V to list and describe the tagged objects one by one.
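To make the paradigm concrete, the sketch below shows what one training example might look like, assuming a LLaVA-style conversation format; the field names, instruction wording, and tag labels are illustrative assumptions rather than the authors' exact data schema.

```python
import json

def build_listing_sample(image_file, tag_to_label):
    """Build one hypothetical 'list items one by one' training example.

    `tag_to_label` maps each alphanumeric tag overlaid on the image to the
    object it marks, e.g. {1: "a red coffee mug", 2: "an open laptop"}.
    The conversation asks the model to enumerate the tags in order and
    describe the object under each one.
    """
    answer = " ".join(
        f"Tag {tag}: {label}." for tag, label in sorted(tag_to_label.items())
    )
    return {
        "image": image_file,
        "conversations": [
            {"from": "human",
             "value": "<image>\nPlease list the items marked in the image one by one, "
                      "following the order of the alphanumeric tags."},
            {"from": "gpt", "value": answer},
        ],
    }

# Hypothetical example; in the paper the answers come from GPT-4V run on tagged images.
sample = build_listing_sample("kitchen_som.jpg",
                              {1: "a red coffee mug", 2: "an open laptop"})
print(json.dumps(sample, indent=2))
```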
Evaluation
Dataset and Model Finetuning
When MLLMs of various configurations and sizes were finetuned on the proposed dataset, they showed significant improvement in understanding SoM prompts and generating descriptions grounded in the tags. Key evaluations involved tasks requiring the model to enumerate the objects tagged in an image.
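As a sketch of how such an enumeration task could be scored (the paper's exact parsing rules and metrics are not reproduced here), one option is to extract each predicted tag description from the model's output and check it against the ground-truth object names:

```python
import re

def tag_listing_accuracy(model_output, ground_truth):
    """Fraction of tags whose predicted description mentions the true object.

    `model_output` is the model's free-form enumeration, e.g.
    "Tag 1: a red coffee mug. Tag 2: an open laptop."
    `ground_truth` maps tag ids to short object names, e.g. {1: "mug", 2: "laptop"}.
    This string-matching check is a rough proxy, not the paper's metric.
    """
    # Capture "Tag <id>: <description>" segments from the output.
    predicted = {
        int(tag): desc.strip().lower()
        for tag, desc in re.findall(r"Tag\s+(\d+):\s*([^.]*)", model_output)
    }
    correct = sum(
        1 for tag, name in ground_truth.items()
        if name.lower() in predicted.get(tag, "")
    )
    return correct / len(ground_truth) if ground_truth else 0.0

output = "Tag 1: a red coffee mug. Tag 2: an open laptop."
print(tag_listing_accuracy(output, {1: "mug", 2: "laptop"}))  # 1.0
```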
MLLM Benchmarks
Furthermore, the models were evaluated on five MLLM benchmarks, spanning tasks from visual reasoning to hallucination evaluation. The finetuned models not only excelled on tasks involving SoM prompts but also showed stronger general multimodal understanding and reasoning, even when the tags were removed at inference time. This underscores the potential of the proposed training paradigm to strengthen object-text recognition and alignment.
Future Implications and Conclusions
The findings suggest that the "list items one by one" paradigm may be an effective method for training more robust MLLMs. The ability of models trained on the new dataset to perform well even in non-tagged scenarios indicates a broader internalization of visual-textual alignment. Looking ahead, this technique could aid the development of more intuitive and effective human-AI interfaces across applications ranging from assistive technologies to interactive learning environments.
The implications of successfully extending SoM capabilities from GPT-4V to other open-source and commercial MLLMs could be substantial, broadening the utility of multimodal models and helping democratize advanced AI technologies. The dataset and code released by the authors also encourage further exploration and adaptation of SoM prompting in multimodal AI research.