
Prompt-CAM: Class-Specific Visual Saliency

Updated 9 December 2025
  • Prompt-CAM is a method that uses class-specific prompt tokens to produce fine-grained activation maps, enhancing model interpretability in deep visual networks.
  • It improves on traditional CAM techniques by integrating transformer prompt tokens and synonym-based prompt selection to better localize subtle, class-specific traits.
  • By modifying pre-trained Vision Transformers and CLIP frameworks, Prompt-CAM delivers sharper, discriminative saliency maps, critical for fine-grained visual discrimination and weakly supervised tasks.

Prompt-CAM refers to a family of methods that leverage class-specific or prompt-based mechanisms to generate interpretable class activation or attention maps for deep visual models. These methods are unified by their focus on fine-grained trait localization, making them highly pertinent in tasks where distinguishing subtle category-specific features is critical. Below, two paradigms from the literature are expounded: (1) Prompt-CAM for pre-trained Vision Transformers (ViTs) and (2) Prompt-CAM in the context of prompt class learning for weakly supervised semantic segmentation. Each approach exploits the unique properties of prompts to induce sharper and more class-discriminative saliency or activation regions compared to classical CAM variants.

1. Rationale and Key Innovations

Pre-trained ViTs, particularly self-supervised models such as DINO and DINOv2, establish highly localized patch feature representations. Conventional saliency methods like Grad-CAM, when applied to such models, yield heatmaps that are spatially coarse and object-centric, failing to isolate the fine, distinctive visual cues that discriminate closely related or visually similar categories. Additionally, standard ViT self-attention—especially from [CLS] tokens—does not encode class-specific information; these attention maps are invariant to class and reflect only general model focus. Prompt-CAM addresses these deficiencies by introducing class-specific prompt tokens whose multi-head attention provides fine-grained, class-distinctive interpretability (Chowdhury et al., 16 Jan 2025).

In a contrasting paradigm, prompt-class learning in the context of CLIP-based weakly supervised semantic segmentation (WSSS) exploits the composition of textual prompts—most critically, the class token—to optimize alignment between image and class semantics for superior activation map quality. This approach demonstrates that even minimal modifications (e.g., synonym selection for the class word) have more impact on CAM quality than tuning the broader textual context (Murugesan et al., 2023).

2. Prompt-CAM for Vision Transformers

Prompt-CAM is instantiated by adding $C$ learnable, class-specific prompt tokens $p^1, \ldots, p^C$ to a frozen, pre-trained ViT. These prompts, injected at every transformer layer (in the “Deep” variant), query the patch tokens via multi-head self-attention such that the output for prompt $c$ most strongly aggregates information from patches unique to class $c$. The key architectural steps are:

  • Embedding and Injection: The input image $I$ is tokenized into $M$ patches, embedded as $E_0 \in \mathbb{R}^{D \times M}$, with prompts $P_{i-1} \in \mathbb{R}^{D \times C}$ appended to the input at every layer $i$.
  • Forward Computation: Each transformer block $L_i$ updates both prompts and patch tokens: $[Z_i, E_i, c_i] = L_i([P_{i-1}, E_{i-1}, c_{i-1}])$, where $Z_i$ collates the prompt outputs.
  • Classification and Optimization: At the final layer, only $Z_N$ (the prompt outputs) and a shared scoring vector $w$ parameterize the class scores: $s[c] = w^{\top} z_N^c$. The ViT parameters remain frozen; only the prompts and $w$ are updated, typically via SGD with cosine annealing and warmup.
  • Attention Map Extraction: For each prompt $p_{N-1}^c$ and attention head $r$, the per-patch attention is computed as $\alpha_{N-1}^{c,r} = \mathrm{softmax}\!\left(\frac{(q_{N-1}^{c,r})^{\top} K_{N-1}^{r}}{\sqrt{D'}}\right)$. The final Prompt-CAM heatmap is $H^c = \frac{1}{R}\sum_{r=1}^{R} \alpha_{N-1}^{c,r}$, a per-class, per-patch saliency vector that is upsampled to image resolution.

This configuration forces the true-class prompt to attend to spatially distinct, class-specific discriminative traits rather than holistic or shared object features, because the scoring vector $w$ is shared across classes (Chowdhury et al., 16 Jan 2025). A minimal sketch of this prompt-and-scoring head is given below.
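The following PyTorch sketch illustrates the core of this head under simplifying assumptions: a single attention layer in which $C$ learnable prompts query frozen patch tokens, a shared scoring vector $w$, and head-averaged attention returned as per-class heatmaps. The module name, projection layers, and initialization are illustrative and not the authors’ released implementation.

```python
# Minimal sketch of Prompt-CAM-style class prompts on top of a frozen ViT-like
# encoder. All module names, shapes, and projections here are illustrative
# assumptions; in practice the prompts ride through the frozen ViT blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptCAMHead(nn.Module):
    def __init__(self, num_classes: int, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # C learnable class-specific prompt tokens.
        self.prompts = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        # Shared scoring vector w, applied identically to every class prompt output.
        self.w = nn.Parameter(torch.zeros(dim))
        # Stand-in projections for the final attention layer.
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, patch_tokens: torch.Tensor):
        """patch_tokens: (B, M, D) patch features from the frozen backbone."""
        B, M, D = patch_tokens.shape
        C = self.prompts.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)            # (B, C, D)

        # Multi-head attention: prompts query the patch tokens.
        q = self.q_proj(prompts).view(B, C, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(patch_tokens).view(B, M, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(patch_tokens).view(B, M, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)  # (B, R, C, M)
        z = (attn @ v).transpose(1, 2).reshape(B, C, D)                   # prompt outputs Z_N

        scores = z @ self.w                                               # (B, C): class scores s[c]
        heatmaps = attn.mean(dim=1)                                       # (B, C, M): average over R heads
        return scores, heatmaps
```

At inference, `heatmaps[b, c]` is reshaped to the patch grid and upsampled to image resolution to obtain $H^c$.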

3. Prompt-CAM in Prompt-Class Learning for WSSS

In the POLE (PrOmpt cLass lEarning) framework, Prompt-CAM designates CAMs produced by varying the class token in CLIP-style prompts for weakly supervised segmentation. The principal mechanism involves:

  • Backbone and Prompt Construction: A CNN feature extractor yields $Z = f_t(I) \in \mathbb{R}^{C \times h' \times w'}$. Per-class CAMs $P_k(h, w)$ are computed as $P_k(h, w) = \sigma(W_k^{\top} Z(h, w))$, where $W \in \mathbb{R}^{C \times K}$.
  • Prompt Selection: Prompts take the form "A photo of [CLS_k]." Instead of defaulting to ground-truth class names, POLE constructs a synonym set $S_k$ for [CLS_k]. The candidate yielding the highest cosine similarity between the image’s masked CLIP embedding and the text embedding is selected per image (see the sketch after this list).
  • Loss Functions: The training objective combines a multi-label classification loss $L_{\mathrm{cls}}$ and a contrastive loss $L_{\mathrm{cont}}$ that aligns foreground masks with the selected text embeddings and penalizes background agreement.
  • CAM Calculation: For pixel $(x, y)$, the per-class response is $\mathrm{CAM}_k(x, y) = E_I(I)_{:, x, y}^{\top} E_T([\mathrm{CLS}]_k)$, or its normalized version.
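A hedged sketch of the per-image synonym selection step, using the public openai/CLIP package, is shown below. The masking with the current foreground estimate is simplified to a pre-masked, preprocessed input tensor, and the helper name and example synonym list are assumptions for illustration only.

```python
# Sketch of POLE-style synonym selection with openai/CLIP
# (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def select_synonym(masked_image: torch.Tensor, synonyms: list[str]) -> str:
    """masked_image: preprocessed (3, H, W) tensor whose background has been
    suppressed by the current CAM estimate; synonyms: candidate tokens S_k."""
    prompts = [f"A photo of {s}." for s in synonyms]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(masked_image.unsqueeze(0).to(device))
        txt_emb = model.encode_text(tokens)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        sims = (img_emb @ txt_emb.T).squeeze(0)     # cosine similarity per candidate
    return synonyms[sims.argmax().item()]

# Example (hypothetical synonym set):
# best = select_synonym(masked_img, ["aeroplane", "airplane", "plane", "jet"])
```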

Empirical analysis demonstrates that the choice of class token alone yields a gain of more than 2% mIoU in initial CAM quality, outstripping gains from context tuning or continuous prompt parameterization (Murugesan et al., 2023).

4. Empirical Assessment and Benchmarks

Prompt-CAM’s efficacy is validated on diverse fine-grained classification and segmentation benchmarks:

| Method | Faithfulness (Insertion / Deletion) | Human Trait Recognition (%) | Accuracy (CUB-200-2011, DINO) |
| --- | --- | --- | --- |
| Grad-CAM | 0.52 / 0.16 | – | – |
| Layer-CAM | 0.54 / 0.13 | – | – |
| Eigen-CAM | 0.42 / 0.33 | – | – |
| Prompt-CAM (ViT) | 0.61 / 0.09 | 60.5 | 71.9 |
| ProtoPNet / TesNet | – | 39.1 | – |
| ProtoConcepts | – | 30.4 | – |

Prompt-CAM (ViT) generates narrower saliency regions that better localize distinguishing class traits. Human studies on the CUB-200-2011 dataset reported that Prompt-CAM heatmaps made 60.5% of expert-defined traits discoverable by naive participants, compared to under 40% for prototype-based networks (Chowdhury et al., 16 Jan 2025).

For POLE, the best-found synonym ([CLS]*) resulted in a 2.1% improvement over fixed label prompts, with final segmentation mIoU on PASCAL-VOC12 reaching 71.5%/71.4% (val/test), an approximately 1% absolute gain over context/continuous prompt optimization (Murugesan et al., 2023).

5. Trait Localization and Interpretability Mechanisms

Prompt-CAM leverages the requirement that, with a shared scoring vector $w$, the class $c$ prompt can only outscore other classes by attending to spatial regions containing discriminative traits. In practice, this manifests as attention-map peaks aligning with expert-defined traits (e.g., the Baltimore Oriole’s orange belly, the Red-winged Blackbird’s red spot). Further, a greedy ranking across heads can isolate the minimal set of attention heads whose focus is critical for class separation, by iteratively blurring heads and observing the resulting drop in the class logit (Chowdhury et al., 16 Jan 2025); a sketch of this greedy procedure is given below.
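The greedy ranking can be sketched as follows. The `score_fn` callable, which should return the true-class logit with a given set of heads suppressed (e.g., by blurring the regions they attend to), is a stand-in assumption for the model-specific forward pass.

```python
# Illustrative greedy head ranking: repeatedly suppress one more attention head
# and keep the head whose removal causes the largest drop in the true-class logit.
def greedy_head_ranking(score_fn, num_heads: int, budget: int) -> list:
    """score_fn(disabled: set) -> float, the true-class logit with the given
    heads suppressed; returns the `budget` most critical head indices."""
    disabled = set()
    ranking = []
    base = score_fn(disabled)
    for _ in range(budget):
        drops = {}
        for r in range(num_heads):
            if r in disabled:
                continue
            drops[r] = base - score_fn(disabled | {r})   # logit drop if head r is also removed
        best = max(drops, key=drops.get)
        ranking.append(best)
        disabled.add(best)
        base = score_fn(disabled)
    return ranking
```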

In the POLE regime, ablations reveal that in fewer than half of images, the canonical class name produces the strongest visual–textual alignment; synonyms or alternate tokens often achieve higher semantic correspondence, yielding better activation maps. This suggests the importance of vocabulary harmonization between visual experience and class label selection (Murugesan et al., 2023).

6. Implementation and Overhead Considerations

Prompt-CAM for ViTs requires minimal architectural change, substituting a $C$-prompt head for the standard [CLS] head in visual prompt tuning. This constitutes fewer than ten lines of code in typical PyTorch ViT implementations. Computational and memory overhead is small: $O(CD)$ additional parameters, negligible extra attention-map computation at inference, and training conducted with a frozen backbone (ViT) (Chowdhury et al., 16 Jan 2025). In POLE, synonym search for [CLS] tokens introduces slight additional computation but is efficiently parallelizable and requires no architectural or training-regime changes (Murugesan et al., 2023). A rough sketch of the ViT-side training setup is given below.
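The snippet below illustrates the scale of this overhead: the backbone is frozen and only the $C \times D$ prompt tokens plus the shared vector $w$ are passed to the optimizer. The stand-in modules, dimensions, and hyperparameters are illustrative assumptions, not the paper’s exact configuration.

```python
# Sketch of the training setup implied above: frozen backbone, trainable prompts + w.
import torch
import torch.nn as nn

# Dummy stand-ins; in practice `backbone` is a pretrained ViT (e.g. DINO) and
# `head` is a Prompt-CAM head as sketched in Section 2.
backbone = nn.Linear(768, 768)
head = nn.ParameterDict({
    "prompts": nn.Parameter(torch.randn(200, 768) * 0.02),   # C x D class prompts
    "w": nn.Parameter(torch.zeros(768)),                      # shared scoring vector
})

for p in backbone.parameters():
    p.requires_grad_(False)                                    # frozen backbone

optimizer = torch.optim.SGD([head["prompts"], head["w"]], lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
```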

7. Comparative Perspective and Implications

Prompt-CAM offers a streamlined path to high-fidelity, class-specific interpretability in both transformer-based and CNN–text joint embedding frameworks. In ViTs, prompt-based interpretability yields sharper, more faithful heatmaps than classical gradient-based or prototype-based post-hoc techniques, at the cost of a modest trade-off in top-1 accuracy. In WSSS, prompt-class selection dominates context and embedding modifications for CAM optimization, suggesting strong coupling between vocabulary choice and model interpretability in CLIP-like models. The paradigm shift from holistic to trait-centric saliency has broad implications for explainable AI, particularly in domains demanding fine-grained visual discrimination.

References:

  • Prompt-CAM for ViTs: "Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis" (Chowdhury et al., 16 Jan 2025)
  • Prompt-class learning for WSSS: "Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation" (Murugesan et al., 2023)