- The paper presents an annotation-free framework that aligns CNN attention with VLM-generated maps to mitigate shortcut learning.
- It employs a joint training objective using KL divergence to align saliency maps, achieving 64.88% accuracy on ColorMNIST.
- The study demonstrates that cognitive attention alignment shifts model focus from spurious cues to task-relevant features, improving generalization.
Cognitive Attention Alignment in Vision-Language Models: A Scalable Framework
Motivation and Background
Convolutional Neural Networks (CNNs) are susceptible to shortcut learning, often exploiting spurious correlations in data rather than acquiring robust, generalizable representations. This phenomenon undermines reliability, especially in settings where superficial cues (e.g., color, background artifacts) are confounded with true class semantics. Cognitive science emphasizes the role of attention in human perception, guiding focus toward task-relevant features and supporting robust generalization. Prior approaches to attention alignment in neural networks—such as concept-based supervision and explanation regularization—require labor-intensive, expert-provided annotations, limiting scalability and introducing annotation bias.
This paper introduces a scalable, annotation-free framework for attention alignment in CNNs, leveraging vision-language models (VLMs) to generate semantic attention maps via natural language prompts. The framework employs an auxiliary loss to align CNN attention with these language-guided maps, promoting cognitively plausible decision-making and reducing shortcut reliance.
Methodology
Automatic Generation of Semantic Attention Maps
The framework utilizes WeCLIP+, a state-of-the-art VLM, to generate class-specific attention maps for each image using natural language prompts. For a given input $x_i$ with class label $y_i$, a prompt $t_{y_i}$ (e.g., "a photo of a digit") is constructed. WeCLIP+ produces an affinity map $M_{\mathrm{VL}}(x_i, y_i)$ highlighting regions associated with the semantic concept. Optionally, background or distractor prompts are included to help the model distinguish foreground from context.
Attention maps may be post-processed using morphological dilation or edge detection to refine inductive biases, but unmodified WeCLIP+ maps are often sufficient.
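The exact WeCLIP+ interface is not specified here, so the following is a minimal sketch assuming a hypothetical `vlm.affinity_map(image, prompt)` call that returns a per-pixel foreground score; the background-prompt subtraction and optional dilation mirror the steps described above.

```python
import numpy as np
from scipy import ndimage

def semantic_attention_map(vlm, image, class_name,
                           background_prompts=(), dilate_iters=0):
    """Build a probability-normalized attention map for one image/class pair."""
    prompt = f"a photo of a {class_name}"          # t_{y_i}
    fg = vlm.affinity_map(image, prompt)           # M_VL(x_i, y_i), HxW floats

    # Optional distractor prompts: suppress regions that match the background.
    for bg_prompt in background_prompts:
        fg = np.clip(fg - vlm.affinity_map(image, bg_prompt), 0.0, None)

    # Optional morphological dilation to soften the spatial prior.
    for _ in range(dilate_iters):
        fg = ndimage.grey_dilation(fg, size=(3, 3))

    total = fg.sum()
    # Fall back to a uniform map if the prompt matched nothing.
    return fg / total if total > 0 else np.full_like(fg, 1.0 / fg.size)
```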
Attention-Aligned CNN Training
The CNN $f_\theta$ is trained to minimize a joint objective:
$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \mathcal{L}_{\mathrm{attn}}$
where $\mathcal{L}_{\mathrm{CE}}$ is the standard cross-entropy loss and $\mathcal{L}_{\mathrm{attn}}$ is the KL divergence between the normalized CAM saliency map $S_\theta(x_i, y_i)$ and the WeCLIP+ attention map $M_{\mathrm{VL}}(x_i, y_i)$. The training schedule consists of two phases: initial epochs focus solely on attention alignment, followed by joint optimization with a ramped $\lambda$ to prioritize attention supervision.
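As one concrete reading of this objective, here is a minimal PyTorch sketch. The KL direction (teacher map as target) and the linear ramp are assumptions where the summary is not explicit; `compute_cam` would be whatever differentiable saliency extractor produces $S_\theta$ (one standard candidate is sketched under Qualitative Analysis below).

```python
import torch.nn.functional as F

def attention_loss(cam, vlm_map, eps=1e-8):
    """KL(M_VL || S_theta) over spatial positions, averaged across the batch."""
    p = vlm_map.flatten(1) + eps                   # teacher map, (B, H*W)
    q = cam.flatten(1) + eps                       # CNN saliency, (B, H*W)
    p = p / p.sum(dim=1, keepdim=True)
    q = q / q.sum(dim=1, keepdim=True)
    return (p * (p.log() - q.log())).sum(dim=1).mean()

def joint_loss(logits, labels, cam, vlm_map, epoch, e_attn, lam):
    l_attn = attention_loss(cam, vlm_map)
    if epoch < e_attn:                             # phase 1: attention only
        return lam * l_attn
    ramp = min(1.0, (epoch - e_attn + 1) / e_attn) # phase 2: ramped weighting
    return F.cross_entropy(logits, labels) + ramp * lam * l_attn
```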
Hyperparameters $(\lambda, E_{\mathrm{attn}})$ are selected via grid search using a composite metric, Optim Value:
$\mathrm{Optim\ Value} = \mathrm{ValAcc} \times (1 - \mathcal{L}_{\mathrm{attn}})$
which favors configurations that jointly maximize validation accuracy and minimize attention divergence.
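A short sketch of that selection loop follows; the grid values and the `train_and_eval` helper (returning validation accuracy and final attention loss for one configuration) are illustrative, not the paper's.

```python
import itertools

def select_hyperparameters(train_and_eval,
                           lambdas=(1.0, 10.0, 100.0),  # illustrative grid
                           e_attns=(1, 3, 5)):
    best_cfg, best_score = None, float("-inf")
    for lam, e_attn in itertools.product(lambdas, e_attns):
        val_acc, l_attn = train_and_eval(lam, e_attn)
        score = val_acc * (1.0 - l_attn)           # Optim Value
        if score > best_score:
            best_cfg, best_score = (lam, e_attn), score
    return best_cfg, best_score
```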

Figure 1: ColorMNIST: Optim Value across $(\lambda, E_{\mathrm{attn}})$, illustrating the impact of attention alignment hyperparameters on joint accuracy and attention loss.
Experimental Evaluation
Datasets and Baselines
Experiments are conducted on ColorMNIST and DecoyMNIST, benchmarks designed to test model reliance on spurious correlations. In ColorMNIST, digit classes are assigned unique colors during training, with color mappings reversed at test time. DecoyMNIST augments digits with class-indicative gray patches, creating spurious associations.
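To make the spurious correlation concrete, here is a rough sketch of the ColorMNIST-style construction described above; the specific palette and the exact reversal rule are assumptions for illustration.

```python
import torch

# One RGB color per digit class (illustrative palette, not the benchmark's).
PALETTE = torch.tensor([
    [255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0], [255, 0, 255],
    [0, 255, 255], [128, 0, 0], [0, 128, 0], [0, 0, 128], [128, 128, 0],
], dtype=torch.float32) / 255.0

def colorize(gray, label, reverse=False):
    """gray: (1, 28, 28) tensor in [0, 1]; returns a (3, 28, 28) colored digit."""
    idx = (9 - label) if reverse else label        # reversed mapping at test time
    return gray * PALETTE[idx].view(3, 1, 1)
```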
Baselines include a vanilla CNN (Base), concept distillation (CDBS), and explanation regularization methods (RRR, CDEP).
Quantitative Results
On ColorMNIST, the baseline CNN achieves only 0.1% accuracy, indicating complete shortcut reliance. RRR performs similarly, while CDEP and CDBS reach 31.0% and 50.93%, respectively. The proposed method achieves 64.88 ± 2.85%, outperforming the annotation-heavy baselines and demonstrating effective shortcut mitigation.
On DecoyMNIST, manually supervised methods achieve near-perfect accuracy (97.2–99.0%). The proposed method attains 96.19 ± 0.35%, remaining competitive despite relying solely on automatically generated pseudo-maps.
Qualitative Analysis
Saliency map comparisons reveal that attention alignment shifts model focus from spurious cues (color, corner patches) to digit shape, aligning with human intuition.
Figure 2: Qualitative comparison of saliency maps on ColorMNIST and DecoyMNIST, showing original inputs, base-model saliency, and attention-aligned saliency. Brighter regions indicate higher saliency.
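The summary does not pin down the exact attribution method behind these maps; a Grad-CAM-style extractor is one standard differentiable choice for $S_\theta$, sketched below (`create_graph=True` keeps the map usable inside the attention loss during training).

```python
import torch
import torch.nn.functional as F

def grad_cam(features, logits, labels):
    """features: (B, C, H, W) from the last conv block, kept in the graph."""
    score = logits.gather(1, labels.view(-1, 1)).sum()
    grads = torch.autograd.grad(score, features, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)  # global-average gradients
    cam = F.relu((weights * features).sum(dim=1))   # (B, H, W)
    # Normalize each map to sum to one so it can be compared via KL.
    return cam / cam.flatten(1).sum(dim=1).clamp_min(1e-8).view(-1, 1, 1)
```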
Implementation Considerations
- Computational Requirements: Precomputing and storing attention maps for large datasets can be memory-intensive (see the caching sketch after this list). On-the-fly generation is a promising direction for future work.
- Backbone Agnosticism: The framework is compatible with various CNN architectures and differentiable saliency techniques. Extension to Vision Transformers is feasible by adapting the attribution mechanism.
- Hyperparameter Sensitivity: Performance is sensitive to the choice of $\lambda$ and $E_{\mathrm{attn}}$, necessitating careful grid search.
- Bias in Teacher Signals: Reliance on VLM-generated attention maps may introduce biases inherent to the teacher model. Mitigation strategies include debiasing or ensemble teacher approaches.
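One practical mitigation for the memory cost noted above is to precompute the maps once into a memory-mapped array on disk rather than holding them in RAM. A minimal sketch, reusing the hypothetical `semantic_attention_map` helper from the Methodology sketch and assuming `dataset` yields `(image, label)` pairs:

```python
import numpy as np

def precompute_maps(vlm, dataset, class_names, path, h=28, w=28):
    """Write one attention map per sample to a .npy file, memory-mapped."""
    maps = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.float32, shape=(len(dataset), h, w))
    for i, (image, label) in enumerate(dataset):
        maps[i] = semantic_attention_map(vlm, image, class_names[label])
    maps.flush()
    return np.load(path, mmap_mode="r")            # read-only view for training
```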
Limitations and Future Directions
The evaluation is restricted to simple benchmarks; extension to complex, high-dimensional datasets is necessary to assess generalizability. Memory overhead from precomputed maps and potential teacher bias are open challenges. Future work should explore dynamic attention map generation, broader dataset coverage, and robust debiasing techniques.
Conclusion
This work presents a scalable, annotation-free framework for cognitive attention alignment in vision-language models, leveraging language-driven attention maps to guide neural networks toward task-relevant features. The approach achieves state-of-the-art accuracy on ColorMNIST and remains competitive on DecoyMNIST, requiring no human-provided saliency or concept labels. The framework is flexible, backbone-agnostic, and amenable to extension. Future research should address scalability, teacher bias, and applicability to more complex domains, advancing the integration of cognitive inductive biases in deep learning systems.