Enhancing Multimodal LLMs with Set-of-Mark Prompting
Introduction to Set-of-Mark Prompting
Recent developments in Multimodal LLMs (MLLMs) such as GPT-4V have demonstrated strong capabilities in multimodal reasoning and interaction, notably through the Set-of-Mark (SoM) prompting technique. SoM strengthens the link between visual content and text by overlaying alphanumeric tags on visual objects in an image, so that language can refer to each object by its tag. This paper examines the challenges MLLMs face in understanding SoM, proposes a training paradigm to equip MLLMs with SoM capabilities, and assesses the impact of this training on their performance.
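As a rough illustration of the mechanics, SoM prompting can be approximated by overlaying numbered marks on image regions and then referring to those numbers in the text prompt. The sketch below is a minimal Python approximation using Pillow; the image path, the bounding boxes, and the prompt wording are hypothetical, and real SoM pipelines typically place marks on segmentation masks rather than simple boxes.

```python
from PIL import Image, ImageDraw

def overlay_set_of_mark(image_path, regions):
    """Overlay numeric tags on an image, one per region.

    `regions` is a list of (x, y, w, h) boxes. Real SoM pipelines usually
    derive regions from segmentation masks (e.g., SAM); boxes keep this
    illustration simple.
    """
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for tag_id, (x, y, w, h) in enumerate(regions, start=1):
        draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
        # Place the alphanumeric tag near the region's top-left corner.
        draw.rectangle([x, y, x + 24, y + 18], fill="red")
        draw.text((x + 4, y + 2), str(tag_id), fill="white")
    return img

# Hypothetical usage: the file name and boxes are placeholders.
tagged = overlay_set_of_mark("kitchen.jpg", [(40, 60, 120, 90), (200, 80, 80, 150)])
tagged.save("kitchen_som.jpg")

# The tagged image is then sent to an MLLM together with a prompt that
# refers to the marks, for example:
prompt = "In the image, what is the object labeled 2, and what is it next to?"
```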
Methodology
Issue with Existing MLLMs
The paper identifies that while GPT-4V can use SoM effectively, other MLLMs such as LLaVA-1.5 and several commercial systems struggle with it. SoM prompting requires a model to recognize the overlaid alphanumeric tags, identify the objects in the image, and associate each tag with the object it marks; the core challenge lies in this third capability, the tag-to-object association. The paper hypothesizes that the lack of explicit training data containing SoM-like annotations limits the effectiveness of SoM in these models.
Proposed Training Paradigm
To address this, the authors introduce a "list items one by one" learning paradigm. The model is trained to enumerate and describe the tagged visual objects in an image, following the alphanumeric order of their tags. To support this, a dataset of 10k to 30k images was created by overlaying alphanumeric tags on the images and prompting GPT-4V to list and describe the tagged objects one by one.
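To make the paradigm concrete, the sketch below shows what one training example might look like, assuming a LLaVA-style conversation format; the field names, instruction wording, and tag labels are illustrative assumptions rather than the authors' exact data schema.

```python
import json

def build_listing_sample(image_file, tag_to_label):
    """Build one hypothetical 'list items one by one' training example.

    `tag_to_label` maps each alphanumeric tag overlaid on the image to the
    object it marks, e.g. {1: "a red coffee mug", 2: "an open laptop"}.
    The conversation asks the model to enumerate the tags in order and
    describe the object under each one.
    """
    answer = " ".join(
        f"Tag {tag}: {label}." for tag, label in sorted(tag_to_label.items())
    )
    return {
        "image": image_file,
        "conversations": [
            {"from": "human",
             "value": "<image>\nPlease list the items marked in the image one by one, "
                      "following the order of the alphanumeric tags."},
            {"from": "gpt", "value": answer},
        ],
    }

# Hypothetical example; in the paper the answers come from GPT-4V run on tagged images.
sample = build_listing_sample("kitchen_som.jpg",
                              {1: "a red coffee mug", 2: "an open laptop"})
print(json.dumps(sample, indent=2))
```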
Evaluation
Dataset and Model Finetuning
When MLLMs of various configurations and sizes were finetuned on the proposed dataset, they showed significant improvement in understanding SoM prompts and generating descriptions grounded in the tags. Key evaluations involved tasks requiring the model to enumerate the objects tagged in an image.
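As a sketch of how such an enumeration task could be scored (the paper's exact parsing rules and metrics are not reproduced here), one option is to extract each predicted tag description from the model's output and check it against the ground-truth object names:

```python
import re

def tag_listing_accuracy(model_output, ground_truth):
    """Fraction of tags whose predicted description mentions the true object.

    `model_output` is the model's free-form enumeration, e.g.
    "Tag 1: a red coffee mug. Tag 2: an open laptop."
    `ground_truth` maps tag ids to short object names, e.g. {1: "mug", 2: "laptop"}.
    This string-matching check is a rough proxy, not the paper's metric.
    """
    # Capture "Tag <id>: <description>" segments from the output.
    predicted = {
        int(tag): desc.strip().lower()
        for tag, desc in re.findall(r"Tag\s+(\d+):\s*([^.]*)", model_output)
    }
    correct = sum(
        1 for tag, name in ground_truth.items()
        if name.lower() in predicted.get(tag, "")
    )
    return correct / len(ground_truth) if ground_truth else 0.0

output = "Tag 1: a red coffee mug. Tag 2: an open laptop."
print(tag_listing_accuracy(output, {1: "mug", 2: "laptop"}))  # 1.0
```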
MLLM Benchmarks
Furthermore, the models were evaluated on five MLLM benchmarks, spanning tasks from visual reasoning to hallucination evaluation. The finetuned models not only excelled on tasks involving SoM prompts but also showed stronger general multimodal understanding and reasoning, even when the tags were removed at inference time. This underscores the potential of the proposed training paradigm to strengthen object-text recognition and alignment.
Future Implications and Conclusions
The findings suggest that the "list items one by one" paradigm may be an effective method for training more robust MLLMs. The ability of models trained on the new dataset to perform well even in non-tagged scenarios indicates a broader internalization of visual-textual alignment. Looking ahead, this technique could aid the development of more intuitive and effective human-AI interfaces across applications ranging from assistive technologies to interactive learning environments.
The implications of successfully extending SoM capabilities from GPT-4V to other open-source and commercial MLLMs could be substantial, broadening the utility of multimodal models and helping democratize advanced AI technologies. The dataset and code released by the authors also encourage further exploration and adaptation of SoM prompting in multimodal AI research.