The paper "Attention Prompting on Image for Large Vision-LLMs" presents an innovative technique designed to enhance the performance of Large Vision-LLMs (LVLMs). These models are capable of taking both text and image inputs, enabling them to address various vision-language tasks with impressive results. The research draws inspiration from the concept of text prompting in LLMs and extends it to the visual domain, introducing a novel method they call "Attention Prompting on Image".
The Problem Addressed
Traditional visual prompting methods for LVLMs have focused on the visual input alone, without taking the associated text query into account. This limitation hampers the models' ability to follow text-based instructions, limiting their effectiveness in tasks that require a coordinated understanding of both visual and textual information.
The Proposed Solution
The authors propose an approach wherein a text-query-guided attention heatmap is overlaid onto the original input image. The method leverages an auxiliary model such as CLIP to generate the attention heatmap for the given text query. The heatmap is then used to modify the input image by multiplying its pixel values with the heatmap values. The resulting image, which emphasizes the regions most relevant to the text query, is fed into the LVLM.
Key Steps in the Methodology
- Attention Heatmap Generation: Using a model like CLIP, the attention heatmap is generated for the input image in the context of the provided text query.
- Image Adjustment: This heatmap is then applied to the original image by multiplying corresponding pixel values, effectively reweighting the image to highlight areas of interest identified by the text query.
- Enhanced Input: The reweighted image is fed into the LVLM, giving the model an input more closely attuned to the text instructions (a code sketch of the pipeline follows this list).
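The sketch below is one possible reading of these steps, not the authors' released implementation: it scores CLIP patch embeddings against the text query to obtain a coarse heatmap, upsamples the heatmap to the image resolution, and multiplies it into the pixel values. The checkpoint name (`openai/clip-vit-base-patch32`), the patch-similarity scoring, and the min-max normalization are illustrative assumptions.

```python
# Query-guided attention prompting, minimal sketch.
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attention_prompt(image: Image.Image, query: str) -> Image.Image:
    image = image.convert("RGB")
    inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        # Text embedding in CLIP's joint image-text space.
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )  # (1, d)

        # Patch-level image tokens, projected into the same joint space.
        vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
        patch_tokens = vision_out.last_hidden_state[:, 1:, :]  # drop the CLS token
        patch_emb = model.visual_projection(
            model.vision_model.post_layernorm(patch_tokens)
        )  # (1, n_patches, d)

    # Cosine similarity between each patch and the query gives a coarse heatmap.
    sims = F.cosine_similarity(patch_emb, text_emb[:, None, :], dim=-1)  # (1, n_patches)
    grid = int(sims.shape[1] ** 0.5)
    heatmap = sims.reshape(1, 1, grid, grid)

    # Upsample to the original resolution and rescale to [0, 1].
    heatmap = F.interpolate(heatmap, size=image.size[::-1], mode="bilinear",
                            align_corners=False)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-6)

    # Reweight pixels by the heatmap and return a valid RGB image.
    pixels = torch.tensor(np.array(image), dtype=torch.float32)  # (H, W, 3)
    weighted = pixels * heatmap[0, 0].unsqueeze(-1)
    return Image.fromarray(weighted.clamp(0, 255).byte().numpy())
```

The returned image would simply replace the original in the LVLM call. In practice one might blend the heatmap with the original image rather than multiplying it in directly, so that regions the auxiliary model scores poorly are dimmed rather than erased; which variant works best is an empirical choice.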
Experimental Validation
The authors validated their method through extensive experiments on multiple vision-language benchmark tasks. Specifically, their technique led to significant performance enhancements:
- On the MM-Vet benchmark, the Attention Prompting on Image technique improved the performance of the LLaVA-1.5 model by 3.8%.
- On the LLaVA-Wild benchmark, a 2.9% improvement was recorded.
Conclusion
This work fills a gap in LVLM development by coupling visual inputs with text queries through an attention prompting mechanism. The proposed technique offers a simple and effective way to enhance LVLMs, making them more responsive and accurate in tasks that require integrating visual and textual information. These results suggest that incorporating query-conditioned attention into the visual input can significantly boost the utility and performance of vision-language models.