ACVC: Attention Consistency on Visual Corruptions

  • The method enforces alignment of class activation maps between clean and corrupted images, substantially reducing error rates (e.g., mCE improvements of 12–13%).
  • It leverages a two-stage training process with CAM-guided refinement and standard fine-tuning to maintain focus on discriminative regions, even under diverse corruptions.
  • Mathematical formulations using MSE loss, Jensen–Shannon divergence, and iterative protocols underpin ACVC, yielding state-of-the-art results in both corruption robustness and domain generalization.

Attention Consistency on Visual Corruptions (ACVC) refers to a paradigm in robust deep learning that explicitly enforces the alignment of network-generated attention maps—most commonly class activation maps (CAMs)—between clean and corrupted versions of the same image. The primary goal is to ensure that a model “looks” at the same discriminative regions regardless of input corruption, thereby enhancing reliability under distribution shift (e.g., noise, blur, weather, digital distortions) and improving generalization to unseen domains. ACVC has been realized in both corruption robustness and single-source domain generalization settings, most notably through the AR2 framework (Zhang et al., 8 Jul 2025) and dedicated domain generalization protocols (Cugu et al., 2022).

1. Rationale for Enforcing Attention Consistency

Deep neural networks, especially CNNs, tend to shift or fragment their attention under input corruptions, as revealed by their CAMs. Empirical analyses have demonstrated a strong correlation between these attention shifts and performance degradation, because the models may cease to focus on object parts that are truly discriminative for classification (Zhang et al., 8 Jul 2025). The underpinning hypothesis of ACVC is that by regularizing attention maps to remain invariant across corruptions that preserve semantic content (e.g., label-preserving visual transformations), robustness to both corruptions and domain shift can be substantially improved. This contrasts with traditional prediction-consistency losses, which provide a weaker spatial regularization signal.

2. Mathematical Formulation

Attention consistency is operationalized by defining loss functions that directly penalize discrepancies between the CAMs of clean and corrupted images. For a classifier $f(\cdot;\theta)$, with image $x$ and its corrupted version $\tilde{x}$, the standard CAM for class $c$ is constructed as

$$M_c(x;\theta) = \sum_{k=1}^{K} w_k^c \, f_k(x;\theta)$$

where $w^c$ are the FC weights and $f_k$ is the $k$th channel of the final convolutional feature map. The spatial attention map $A_c(x;\theta)$ is obtained by $L^1$ normalization:

$$A_c(x;\theta) = \frac{M_c(x;\theta)}{\|M_c(x;\theta)\|_1}$$

The ACVC objective, as in AR2, aligns both clean and corrupted attention with a fixed reference on the clean image:

$$\mathcal{L}_{\mathrm{att}} = \sum_{c\in\mathcal{C}(x,\tilde x)} \bigl\| A_c(x;\theta) - A_c(x;\theta_{\mathrm{ref}}) \bigr\|_p + \alpha \sum_{c\in\mathcal{C}(x,\tilde x)} \bigl\| A_c(\tilde x;\theta) - A_c(x;\theta_{\mathrm{ref}}) \bigr\|_p$$

with $p=2$ (MSE) typically preferred and $\alpha$ controlling the weight on corrupted alignment (Zhang et al., 8 Jul 2025). In domain generalization (Cugu et al., 2022), attention consistency is measured via the Jensen–Shannon divergence between softmaxed CAMs of the ground-truth class, augmented with negative-CAM KL penalties to suppress attention on irrelevant classes:

$$\mathcal{L}_{\mathrm{CAM}}(M,\hat M,y) = D_{JS}(M_y \,\Vert\, \hat M_y)$$

$$\mathcal{L}_{\mathrm{NEG}}(M,\hat M,C_k) = \sum_{c\in C_k}\left(D_{KL}(U \Vert M_c) + D_{KL}(U \Vert \hat M_c)\right)$$

The total training loss combines standard cross-entropy on both clean and corrupted samples with the attention-consistency regularizer:

$$\mathcal{L}(\theta) = \sum_{(X,y)\in \mathcal{D}}\left[\mathcal{L}_{\mathrm{CE}}(X,\hat X,y) + \lambda\, \mathcal{L}_{\mathrm{CON}}(X,\hat X,y)\right]$$

where $\mathcal{L}_{\mathrm{CON}}$ collects the CAM and negative-CAM terms (Cugu et al., 2022).
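
The following is a minimal PyTorch sketch of how the normalized-CAM alignment loss could be computed, assuming direct access to the final convolutional feature maps and the classifier's FC weights; the function names and interface are illustrative, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weights, target_class):
    """M_c(x) = sum_k w_k^c f_k(x): weighted sum of final conv feature maps.

    features:     (B, K, H, W) final convolutional feature maps f_k(x)
    fc_weights:   (num_classes, K) fully connected classifier weights w^c
    target_class: (B,) class indices c
    """
    w = fc_weights[target_class]                      # (B, K)
    return torch.einsum('bk,bkhw->bhw', w, features)  # (B, H, W)

def l1_normalize(cam, eps=1e-8):
    """A_c = M_c / ||M_c||_1, normalized over the spatial dimensions."""
    flat = cam.flatten(1)
    return (flat / (flat.abs().sum(dim=1, keepdim=True) + eps)).view_as(cam)

def attention_consistency_loss(cam_clean, cam_corrupt, cam_ref, alpha=1.0):
    """AR2-style L_att with p = 2: align clean and corrupted attention maps
    with a fixed reference CAM computed on the clean image."""
    a_clean   = l1_normalize(cam_clean)
    a_corrupt = l1_normalize(cam_corrupt)
    a_ref     = l1_normalize(cam_ref).detach()        # reference model is frozen
    return F.mse_loss(a_clean, a_ref) + alpha * F.mse_loss(a_corrupt, a_ref)
```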

3. Training and Repair Protocols

Corruption Robustness (AR2 Framework)

AR2 introduces a two-stage iterative repair process for pretrained networks:

  • CAM-guided refinement: Model parameters are updated to minimize the attention-consistency loss $\mathcal{L}_{\mathrm{att}}$ on paired clean/corrupted inputs, using a frozen reference model to define the clean-CAM target.
  • Standard fine-tuning: The model is then fine-tuned with supervised cross-entropy on both clean and corrupted images, restoring classification accuracy.

This cycle is repeated for $T$ outer iterations, and hyperparameters (numbers of steps $N$ and $M$; $\alpha$; batch size; learning rates) are fixed based on dataset scale (Zhang et al., 8 Jul 2025).
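
In code, the alternation might look like the schematic loop below, reusing the CAM helpers sketched in Section 2; `model.forward_features`, `model.fc`, the `corrupt` augmentation routine, and the default step counts are assumptions made for illustration, not the published training recipe.

```python
import copy
import torch
import torch.nn.functional as F

def ar2_repair(model, loader, corrupt, optimizer, T=30, N=1, M=1, alpha=1.0):
    """Schematic two-stage AR2 cycle: CAM-guided refinement alternating with
    standard fine-tuning, repeated for T outer iterations (interfaces assumed)."""
    reference = copy.deepcopy(model).eval()          # frozen clean-CAM reference
    for _ in range(T):
        # Stage 1: CAM-guided refinement on paired clean/corrupted inputs.
        for _ in range(N):
            for x, y in loader:
                x_tilde = corrupt(x)
                cam = class_activation_map(model.forward_features(x), model.fc.weight, y)
                cam_c = class_activation_map(model.forward_features(x_tilde), model.fc.weight, y)
                with torch.no_grad():
                    cam_ref = class_activation_map(
                        reference.forward_features(x), reference.fc.weight, y)
                loss = attention_consistency_loss(cam, cam_c, cam_ref, alpha=alpha)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # Stage 2: supervised cross-entropy on both clean and corrupted images.
        for _ in range(M):
            for x, y in loader:
                loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(corrupt(x)), y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return model
```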

Domain Generalization

Training synthesizes corrupted versions of each sample using a large pool of label-preserving transforms: 19 ImageNet-C corruptions (weather, blur, noise, digital) plus 3 Fourier-based transforms (phase scaling, constant amplitude, high-pass). The loss is evaluated between the CAMs of the original and corrupted image, with augmentations sampled per iteration (Cugu et al., 2022).
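
A possible realization of this loss is sketched below, assuming CAM tensors of shape (batch, classes, H, W), a spatial softmax with an illustrative temperature, and a caller-supplied set of negative classes; it is a literal reading of the formulas in Section 2, not the original codebase.

```python
import torch
import torch.nn.functional as F

def spatial_softmax(cam, temperature=0.2):
    """Flatten a CAM (B, H, W) into a spatial probability distribution."""
    return F.softmax(cam.flatten(1) / temperature, dim=1)

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two batches of spatial distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps).log() - (b + eps).log())).sum(dim=1)
    return 0.5 * (kl(p, m) + kl(q, m)).mean()

def acvc_consistency_loss(cams_clean, cams_aug, y, negative_classes):
    """L_CAM + L_NEG: JS alignment of the ground-truth-class CAMs across the
    clean and augmented views, plus KL(U || M_c) pushing negative-class CAMs
    toward a uniform (uninformative) spatial distribution."""
    idx = torch.arange(y.size(0))
    l_cam = js_divergence(spatial_softmax(cams_clean[idx, y]),
                          spatial_softmax(cams_aug[idx, y]))

    l_neg = cams_clean.new_zeros(())
    for c in negative_classes:                          # classes other than the label
        for cams in (cams_clean, cams_aug):
            p_c = spatial_softmax(cams[:, c])
            u = torch.full_like(p_c, 1.0 / p_c.size(1))  # uniform target U
            l_neg = l_neg + F.kl_div(p_c.log(), u, reduction='batchmean')
    return l_cam + l_neg
```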

4. Implementation Specifics

ACVC is architecture-agnostic and can be applied to any network yielding spatial attention maps. In practice:

  • Backbones: ResNet-34 (CIFAR), ResNet-50 (ImageNet), ResNet-18 (domain generalization).
  • CAM extraction uses the final spatial feature maps and class weights; upsampling and $L^1$ normalization standardize map shape and scale (see the sketch after this list).
  • Corruptions are generated on-the-fly for CIFAR; a fixed set is precomputed for ImageNet to optimize I/O.
  • Typical hyperparameters are batch sizes of 64–128, learning rates of $10^{-3}$ (corruption robustness) or $4\times10^{-3}$ (domain generalization), and temperature scaling for the spatial softmax (e.g., $T=0.2$).
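
As one concrete possibility, a torchvision ResNet-50 can be wrapped so that its final spatial feature maps are exposed for CAM extraction, with bilinear upsampling to the input resolution and $L^1$ normalization as described above; the wrapper below is a convenience sketch, not code released with either paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

class CAMExtractor(torch.nn.Module):
    """Expose the final spatial feature maps of a torchvision ResNet so a CAM
    can be computed, upsampled to the input size, and L1-normalized."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, x, target_class):
        b = self.backbone
        feats = b.layer4(b.layer3(b.layer2(b.layer1(
            b.maxpool(b.relu(b.bn1(b.conv1(x))))))))       # (B, K, H, W)
        logits = b.fc(torch.flatten(b.avgpool(feats), 1))  # standard prediction head
        w = b.fc.weight[target_class]                       # (B, K) class weights
        cam = torch.einsum('bk,bkhw->bhw', w, feats)        # M_c(x), shape (B, H, W)
        cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                            mode='bilinear', align_corners=False).squeeze(1)
        cam = cam / (cam.flatten(1).abs().sum(1).view(-1, 1, 1) + 1e-8)
        return logits, cam                                   # L1-normalized A_c

# Illustrative usage: CAMs for the ground-truth class of a random batch.
model = CAMExtractor(resnet50(weights='IMAGENET1K_V1')).eval()
x, y = torch.randn(4, 3, 224, 224), torch.randint(0, 1000, (4,))
with torch.no_grad():
    logits, cams = model(x, target_class=y)   # cams: (4, 224, 224)
```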

5. Quantitative Results

| Method | CIFAR-10-C mCE | CIFAR-10 Clean Err% | CIFAR-100-C mCE | CIFAR-100 Clean Err% | ImageNet-C mCE | ImageNet Clean Err% |
|---|---|---|---|---|---|---|
| Vanilla | 94.2 | 8.6 | 73.2 | 27.4 | 76.4 | 23.9 |
| AugMix | 42.5 | 7.8 | 61.4 | 27.6 | 64.1 | 22.4 |
| DeepRepair | 40.5 | 6.2 | 61.3 | 29.5 | – | – |
| AR2 (ACVC) | 30.4 | 7.5 | 48.7 | 27.4 | 54.0 | 24.5 |

ACVC via AR2 reduces mCE by 12–13 percentage points relative to AugMix on CIFAR-10-C and CIFAR-100-C, and by roughly 10 points on ImageNet-C, with negligible impact on clean error.

| DomainNet (Real → others) | Baseline | RandAug | AugMix | VC | ACVC |
|---|---|---|---|---|---|
| Avg. Accuracy (%) | 23.78 | 26.34 | 26.48 | 26.68 | 26.89 |

On PACS, COCO→DomainNet, and DomainNet, ACVC sets new state-of-the-art for single-source generalization without target adaptation.

6. Analysis and Ablations

Ablative studies (Zhang et al., 8 Jul 2025) demonstrate:

  • Essentiality of attention consistency: Omitting the CAM-guided steps raises CIFAR-10-C mCE from 30.4% to 82.0%, confirming that direct attention alignment is crucial.
  • Norm choice for the loss: MSE ($p=2$) outperforms $L^1$ by emphasizing large deviations in attention.
  • Iteration count: Gains from outer-loop iterations plateau after $T=30$ (CIFAR); fewer passes reduce robustness.

In domain generalization, negative-CAM penalties enhance focus on true class regions and suppress diffuse attention, contributing to improved accuracy (Cugu et al., 2022).

7. Extensions and Generalization

The ACVC principle is extensible:

  • Architectures: It is compatible with any model yielding spatial attention (e.g., vision transformers with self-attention maps; Grad-CAM or Score-CAM for CNNs).
  • Tasks beyond classification: Detection or segmentation heads can use bounding-box or mask-level attention in an analogous loss.
  • Corruption and domain diversity: Simultaneously training on multiple corruption types or domains increases cross-corruption synergies, suggesting scalability.
  • Saliency and negative-class regularization: Other sources of attention, including gradient-based saliency or explicit regularization on negative classes, can be harmonized within the ACVC framework.

Collectively, empirical and ablation results indicate that enforcing attention consistency under visual corruptions directs models to utilize the same regions regardless of input perturbation, yielding improved robustness and generalization across a range of visual recognition settings (Zhang et al., 8 Jul 2025, Cugu et al., 2022).
