CARVE: Contrastive Attention for Visual Enhancement
- The paper introduces CARVE, a training-free approach that refines attention maps by contrasting task-specific and general queries to effectively isolate semantic signals.
- Its methodology employs pixel-level masking and attention entropy analysis, achieving up to a 71.83% relative improvement on visual reasoning benchmarks.
- CARVE enhances noise suppression and semantic extraction in cluttered scenes, offering practical benefits for vision-language models without retraining or external segmentation.
Contrastive Attention Refinement for Visual Enhancement (CARVE) is a methodological framework designed to enhance the visual reasoning capabilities of Vision-Language Models (VLMs) in complex visual environments. CARVE operates as a training-free, pixel-level approach that leverages the intrinsic attention maps of VLMs—contrasting task-specific and general query attentions—to isolate and amplify semantic signals while suppressing visual noise. Experimental evidence demonstrates substantial performance improvements on visual reasoning benchmarks, providing both theoretical insight and practical utility for demanding real-world applications (Ge et al., 8 Sep 2025).
1. Theoretical Foundation and Attention Decomposition
CARVE is grounded in the empirical observation that VLMs' attention maps encode a mixture of semantic signal and visual noise. Specifically, given an input image $I$ and a query $q$, the attention map at layer $l$ and generation timestep $t$ is modeled as

$$A^{(l,t)}(I, q) = N(I) \cdot S(I, q),$$

where $N(I)$ denotes question-independent visual noise (driven by texture and color complexity) and $S(I, q)$ represents the task-specific semantic signal.
For a generic instruction (e.g., "Describe this image"), the semantic term approaches uniformity and the attention map almost exclusively reflects the noise term $N(I)$. By contrasting the maps of the task-specific query ($q_s$) and the general query ($q_g$), CARVE isolates $S$:

$$\hat{S}_i = \frac{A_i(I, q_s)}{A_i(I, q_g) + \epsilon}$$

for each visual token $i$, with a small regularization constant $\epsilon$.
Attention entropy formalizes the relationship between attention dispersion and model reasoning:

$$H(A) = -\sum_i A_i \log A_i.$$

Higher attention entropy (more dispersed focus) correlates negatively with visual reasoning accuracy in complex scenes.
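The contrastive refinement and the entropy diagnostic above can be sketched in a few lines. This is a minimal illustration over flat per-token attention lists; the function names and the default $\epsilon$ are assumptions, not the paper's exact implementation:

```python
import math

def refine_attention(a_task, a_general, eps=1e-6):
    """Contrast task-specific and general attention maps.
    Under the multiplicative decomposition A(q) ~ N(I) * S(I, q),
    the general-query map approximates the noise term N(I), so the
    per-token ratio recovers the semantic signal S (illustrative sketch)."""
    refined = [t / (g + eps) for t, g in zip(a_task, a_general)]
    total = sum(refined)
    return [r / total for r in refined]  # renormalize to a distribution

def attention_entropy(attn):
    """Shannon entropy of an attention distribution; higher values
    indicate more dispersed focus."""
    return -sum(p * math.log(p) for p in attn if p > 0)
```

A query-focused map should come out of refinement with lower entropy than the near-uniform general map, matching the negative correlation between entropy and reasoning accuracy noted above.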
2. Methodological Workflow
CARVE is executed in three core stages:
- Acquisition of Attention Maps: The VLM processes the image twice: once for the task-specific query (extracting $A(I, q_s)$) and once for a general instruction (extracting $A(I, q_g)$).
- Contrastive Refinement: For each pixel or visual token $i$, the refined attention is calculated via $\hat{S}_i = A_i(I, q_s) / (A_i(I, q_g) + \epsilon)$. These refined maps are fused across a chosen layer range and across generation timesteps; later tokens are weighted more heavily.
- Signal Extraction and Masking: The fused attention is spatially reshaped to the pixel grid, thresholded at a top-$p$ percentile to mask the most salient regions, and connected-component analysis selects the top-$k$ regions that maximize cumulative attention. The enhanced image is then cropped, resized, and re-evaluated by the VLM for final output.
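The fusion step can be sketched as follows. The geometric decay schedule, the layer bounds, and the function name are illustrative assumptions; the paper's exact weighting is not specified here beyond "later tokens are weighted more heavily":

```python
def fuse_attention(maps, layer_range=(12, 25), decay=0.8):
    """Fuse refined per-(layer, timestep) attention maps.

    `maps` maps (layer, timestep) -> flat attention list.  Later
    timesteps receive geometrically larger weights (decay < 1
    discounts earlier generation steps).  Layer bounds and the decay
    schedule are assumptions for illustration."""
    lo, hi = layer_range
    selected = [(t, m) for (l, t), m in maps.items() if lo <= l <= hi]
    t_max = max(t for t, _ in selected)
    fused = [0.0] * len(selected[0][1])
    total_w = 0.0
    for t, m in selected:
        w = decay ** (t_max - t)  # later timesteps weighted more heavily
        total_w += w
        fused = [f + w * v for f, v in zip(fused, m)]
    return [f / total_w for f in fused]  # weighted average over maps
```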
No fine-tuning or external segmentation tools are required; pixel-level semantic extraction relies solely on the VLM's own attention mechanisms.
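The signal-extraction stage (percentile thresholding followed by connected-component selection) can be sketched as below. The function name, 4-connectivity choice, and default parameters are illustrative assumptions:

```python
def select_salient_regions(attn_grid, top_percent=40.0, max_regions=3):
    """Threshold a 2-D attention grid at the top-p percentile, then keep
    the connected components with the largest cumulative attention.
    Illustrative sketch of CARVE's masking stage, not the exact code."""
    flat = sorted((v for row in attn_grid for v in row), reverse=True)
    k = max(1, int(len(flat) * top_percent / 100.0))
    threshold = flat[k - 1]
    h, w = len(attn_grid), len(attn_grid[0])
    mask = [[attn_grid[r][c] >= threshold for c in range(w)] for r in range(h)]

    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                # Depth-first search over 4-connected neighbours
                stack, cells, score = [(r, c)], [], 0.0
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    cells.append((y, x))
                    score += attn_grid[y][x]
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                regions.append((score, cells))
    # Rank regions by cumulative attention, keep the top-k
    regions.sort(key=lambda rg: rg[0], reverse=True)
    return [cells for _, cells in regions[:max_regions]]
```

The selected regions would then be mapped back to pixel coordinates for cropping and re-evaluation, as described in the workflow above.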
3. Empirical Evaluation and Performance
CARVE has been validated across diverse visual reasoning benchmarks, including A-OKVQA, POPE, V*, and TextVQA. Key observed improvements:
| Model | Dataset | Accuracy Gain | Relative Improvement |
|---|---|---|---|
| Qwen2.5-VL-3B | V* | +13–17 points | up to 71.83% |
| LLaVA-1.5-7B | POPE/V* | +13–17 points | up to 71.83% |
Ablation studies indicate optimal attention focusing is achieved in deeper layers (up to layer 25) and with balanced thresholds (a top percentile between 20–60% and up to 3 selected regions). Computational overhead remains modest (about 1.34 s/sample on modern GPUs), outperforming segmentation-based methods in efficiency.
4. Impact on Visual Reasoning and Robustness
CARVE directly improves visual question answering, especially in cluttered, noisy, or color-complex scenes where standard VLMs are distracted by background content. By separating semantic and noise components, CARVE delivers enhanced robustness and higher accuracy on fine-grained reasoning tasks.
Additional benefits include adaptability across models and tasks without retraining or architecture modification. This method is equally applicable to document analysis, OCR, and any context requiring discrimination between semantic and irrelevant visual content.
5. Comparison with Prior Enhancement Methods
CARVE distinguishes itself from existing approaches by:
- No Training Required: CARVE is training-free, relying only on inference passes and the model's intrinsic attention.
- Pixel-Level Masking: Unlike external segmentation tools (e.g., SAM, YOLO), CARVE performs native pixel-level signal extraction.
- Contextual Adaptability: By contrasting two attention states (one general, one task-specific), it adapts masking granularity to the specific query.
- Efficiency: Overhead per sample is competitive, with superior accuracy gains compared to training-heavy or segmentation-based methods.
Conventional enhancement techniques typically operate at a coarse level and/or require additional segmentation data or retraining. CARVE leverages theoretical attention map decomposition to achieve semantic region extraction intrinsically.
6. Future Directions and Implications
Research implications outlined include:
- Further investigation of dynamic fusion strategies, adaptive thresholding, and the connection between attention entropy and visual complexity.
- Exploration of the integration between contrastive attention extraction and self-supervised learning paradigms or multi-modal fusion networks.
- Extension of CARVE-style approaches to new domains requiring fine localization of semantic content, such as multi-document parsing, scientific visualizations, or robust autonomous perception under noise conditions.
The demonstrated interplay between complexity, attention entropy, and performance underscores the value of contrastive attention methods in maximizing VLMs' reasoning capability.
In summary, Contrastive Attention Refinement for Visual Enhancement (CARVE) formalizes and exploits the decomposition of VLM attention maps into semantic signal and noise factors. Its training-free, pixel-level workflow reliably isolates task-relevant regions and mitigates the detrimental effects of visual complexity, achieving substantial improvements in visual reasoning benchmarks while maintaining computational efficiency and model generality (Ge et al., 8 Sep 2025).