
SAGE: Spuriousness-Aware Guided Exploration

Updated 24 November 2025
  • SAGE is a zero-shot strategy that selects optimal prompt templates based on semantic separation to mitigate multimodal spurious bias.
  • It computes, for each image, the difference between the highest and lowest class scores, favoring prompts that align with core semantic features rather than spurious ones.
  • Empirical results show significant improvements in worst-group accuracy across benchmarks like Waterbirds and CelebA without requiring model fine-tuning.

Spuriousness-Aware Guided Exploration (SAGE) is an inference-time prompt selection strategy designed to mitigate multimodal spurious bias in large vision-language models (VLMs), particularly in the context of zero-shot classification with models such as CLIP. SAGE operates without any model fine-tuning, external annotations, or additional training. Instead, it systematically selects, for each test instance, the prompt template that induces maximal semantic separation (i.e., the greatest difference between the highest and lowest class scores), thereby improving robustness, especially on worst-group accuracy benchmarks affected by spurious correlations (Ye et al., 17 Nov 2025).

1. Theoretical Foundation: Multimodal Spurious Bias

Zero-shot classification in CLIP-style models utilizes pre-trained vision ($\phi(\cdot)$) and text ($\psi(\cdot)$) encoders, aligning image and text representations in a shared space. A prompt template $T$ parametrizes class descriptions, with predictions made by computing cosine similarities between image ($v = \phi(x)$) and text ($u = \psi(t)$) embeddings. In practice, CLIP often learns to rely on features ($u_s$) that are spuriously correlated with classes (e.g., backgrounds), leading to multimodal spurious bias. Formally, if $p(v \mid u) \approx p(v \mid u, u_s)$ and $p(u \mid v) \approx p(u \mid v, u_s)$ for some $u_s$, the model's predictions will reflect these confounders rather than core class features.
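For concreteness, a minimal zero-shot classification sketch using the open-source `clip` package is shown below; the image path and the two prompt strings are illustrative placeholders, not inputs from the paper.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

# Load a pre-trained CLIP model together with its image preprocessor.
model, preprocess = clip.load("ViT-B/32")

# Hypothetical inputs: any local image and one description per class.
image = preprocess(Image.open("bird.jpg")).unsqueeze(0)
text = clip.tokenize(["a photo of a landbird", "a photo of a waterbird"])

with torch.no_grad():
    v = model.encode_image(image)  # image embedding v = phi(x)
    u = model.encode_text(text)    # text embeddings u = psi(t)

# Cosine similarity is the dot product of L2-normalized embeddings.
v = v / v.norm(dim=-1, keepdim=True)
u = u / u.norm(dim=-1, keepdim=True)
scores = (v @ u.T).squeeze(0)      # one similarity score per class
print("predicted class index:", scores.argmax().item())
```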

Theoretical analysis demonstrates that, under such bias, the model's score for a spurious class can dominate (e.g., $p(u_1 \mid v) / p(u_2 \mid v) > 1$ when $x$ contains a spurious feature $u_s$ linked to class $u_1$). When spurious features are absent or weakened at test time, the model's confidence collapses, yielding low margins. The key insight is that prompts producing a high margin (a large difference between the top and bottom class scores) are less likely to be susceptible to spurious features, since the alignment must then come from the core class concept rather than from $u_s$.
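As a purely hypothetical illustration (the numbers are ours, not the paper's): suppose on a two-class problem template $T_a$ yields cosine scores $(s_1, s_2) = (0.31, 0.29)$ while template $T_b$ yields $(0.34, 0.21)$. The margins are $\sigma_a = 0.02$ and $\sigma_b = 0.13$, so $T_b$ is preferred: its wide separation suggests a decision driven by the core class concept, whereas the near-tie under $T_a$ is the signature of a prediction propped up by a shared spurious feature.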

2. Methodology: Design and Operation of SAGE

SAGE addresses spurious bias by guided selection from a library of $M$ prompt templates $\mathcal{T}$, each a string with a class placeholder (e.g., "a bright photo of a [CLASS]"). For each test image $x_n$, the method computes a separation score $\sigma_j^n$ for each prompt $T_j$:

$$\sigma_j^n = \max_{i \in [1..C]} \frac{v_n^\top u_j^i}{\|v_n\| \, \|u_j^i\|} - \min_{i \in [1..C]} \frac{v_n^\top u_j^i}{\|v_n\| \, \|u_j^i\|}$$

where $u_j^i$ is the text embedding for class $c_i$ under template $T_j$. Templates are ranked by $\sigma_j^n$, and the top-$K$ (typically $K = 1$) are used for the final class assignment by averaging their class similarities and selecting $\hat{y}_n = \arg\max_i \frac{1}{K} \sum_{k=1}^{K} s_{j_k}^i$, where $j_1, \dots, j_K$ index the selected templates. This approach remains purely zero-shot and does not require updating $\phi$ or $\psi$.
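A minimal NumPy sketch of this selection rule (our own rendering, with assumed array shapes, not the authors' released code):

```python
import numpy as np

def sage_predict(v, U, k=1):
    """Classify one image via SAGE prompt selection.

    v : (D,)      L2-normalized image embedding v_n.
    U : (M, C, D) L2-normalized text embeddings u_j^i, one per
                  (template j, class i) pair, pre-encoded once.
    k : number of top-separation templates to average (K = 1 in the paper).
    """
    scores = U @ v                                  # (M, C) cosine similarities
    sep = scores.max(axis=1) - scores.min(axis=1)   # separation sigma_j per template
    top = np.argsort(sep)[-k:]                      # k templates with largest margin
    return scores[top].mean(axis=0).argmax()        # average scores, pick class

# Synthetic embeddings standing in for CLIP encoder outputs.
rng = np.random.default_rng(0)
v = rng.normal(size=512)
v /= np.linalg.norm(v)
U = rng.normal(size=(80, 2, 512))
U /= np.linalg.norm(U, axis=-1, keepdims=True)
print(sage_predict(v, U))  # predicted class index
```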

3. Implementation and Computational Considerations

For $N$ test images, $M$ prompt templates, $C$ classes, and embeddings of dimensionality $D$, SAGE requires $O(NMCD)$ cosine-similarity computations. Efficiency heuristics keep $M$ and $C$ small (e.g., $M = 80$, with $C$ typically $2$–$7$), and the prompt text embeddings $u_j^i$ are pre-encoded once, minimizing redundant computation. Only the top-$K$ templates are averaged at inference (with $K = 1$ optimal or near-optimal), so ensembling overhead remains minor in practice.
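The same computation vectorizes cleanly over the whole test set once the text embeddings are pre-encoded; a sketch (again ours, under the same shape assumptions as above):

```python
import numpy as np

def sage_predict_batch(V, U, k=1):
    """V: (N, D) image embeddings; U: (M, C, D) text embeddings, encoded once."""
    S = np.einsum('nd,mcd->nmc', V, U)       # all similarities: O(N*M*C*D)
    sep = S.max(axis=2) - S.min(axis=2)      # (N, M) separation per image/template
    top = np.argsort(sep, axis=1)[:, -k:]    # (N, k) best templates per image
    rows = np.arange(len(V))[:, None]        # broadcast index for fancy indexing
    return S[rows, top].mean(axis=1).argmax(axis=1)  # (N,) predicted classes
```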

4. Benchmark Evaluation and Empirical Results

SAGE was evaluated on four standard multimodal spurious correlation benchmarks:

  • Waterbirds: 2 classes, 4 groups (background–species confounding)
  • CelebA–BlondHair: 2 classes (hair color–gender confounding)
  • PACS: 7 object classes across 4 domains
  • VLCS: 5 object classes across 4 domains

Experiments were conducted with five pre-trained contrastive models (CLIP-RN50, CLIP-ViT-B/32, CLIP-ViT-L/14, ALIGN, AltCLIP) without fine-tuning. Metrics are overall accuracy (AVG), worst-group accuracy (WGA), and the harmonic mean (HM) of the two. SAGE consistently improved WGA and HM over the standard zero-shot baseline (ZS), with large and statistically significant gains, particularly under spurious bias (all values in percent):

Dataset      ZS AVG   ZS WGA   ZS HM   SAGE AVG   SAGE WGA   SAGE HM   Δ AVG   Δ WGA   Δ HM
Waterbirds     84.1     36.7    51.1       88.9       44.9      59.7    +4.8    +8.2   +8.6
CelebA         81.1     75.3    78.1       83.4       80.6      82.0    +2.3    +5.3   +3.9
PACS           96.2     75.5    84.6       96.6       81.9      88.7    +0.4    +6.4   +4.1
VLCS           76.1     23.0    35.3       75.8       33.8      46.7    -0.3   +10.8  +11.4
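As a quick consistency check, the HM columns follow from AVG and WGA via the harmonic-mean formula $\mathrm{HM} = 2 \cdot \mathrm{AVG} \cdot \mathrm{WGA} / (\mathrm{AVG} + \mathrm{WGA})$:

```python
def hm(avg, wga):
    """Harmonic mean of average and worst-group accuracy."""
    return 2 * avg * wga / (avg + wga)

print(round(hm(84.1, 36.7), 1))  # 51.1 -- Waterbirds, zero-shot row
print(round(hm(88.9, 44.9), 1))  # 59.7 -- Waterbirds, SAGE row
```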

Statistical significance was established by paired t-tests ($p < 0.01$) for WGA and HM. Ablation studies comparing SAGE (top-1 prompt by separation), random prompt selection, and ensembling (all prompts weighted equally) confirmed the superiority of guided selection, with $K = 1$ nearly always optimal.
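For reference, a paired t-test of this kind can be run with SciPy; the arrays below are invented placeholders, not the paper's measurements:

```python
from scipy import stats

# Hypothetical matched worst-group accuracies (baseline vs. SAGE)
# across five runs; values are illustrative only.
zs_wga   = [36.7, 75.3, 75.5, 23.0, 60.2]
sage_wga = [44.9, 80.6, 81.9, 33.8, 66.5]

result = stats.ttest_rel(sage_wga, zs_wga)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```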

5. Ablation, Comparative Approaches, and Design Analysis

Ablation analyses contrasted three strategies:

  • Ensemble: equal weighting of all $M = 80$ prompts.
  • Random: a single prompt picked at random per image.
  • SAGE: the single prompt with the highest separation $\sigma$ selected per image.

SAGE yielded the highest worst-group accuracy when aggregated across backbones, although per-backbone variances were observed. Increasing $K$ beyond 1 rarely improved results, validating that maximal per-image margin is closely tied to robustness.
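The three strategies amount to different reductions over the same per-image score matrix; a schematic comparison (our own illustrative code):

```python
import numpy as np

def classify(scores, strategy, rng=np.random.default_rng(0)):
    """scores: (M, C) prompt-by-class similarity matrix for one image."""
    if strategy == "ensemble":            # equal weight over all M prompts
        return scores.mean(axis=0).argmax()
    if strategy == "random":              # one prompt drawn uniformly at random
        return scores[rng.integers(len(scores))].argmax()
    if strategy == "sage":                # the prompt with the largest separation
        sep = scores.max(axis=1) - scores.min(axis=1)
        return scores[sep.argmax()].argmax()
    raise ValueError(f"unknown strategy: {strategy}")
```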

Unlike methods such as ROBOSHOT (which queries an LLM for spurious attributes) or TIE* (using pseudo-labels), SAGE operates without any auxiliary data, annotations, or model parameter updates. Its efficacy is contingent on having a sufficiently diverse prompt library; in failure cases where no prompt induces large separation, performance may revert to baseline.

6. Limitations and Prospective Directions

SAGE is strictly an inference-time strategy. It does not fine-tune the VLM or adapt the library of prompt templates during deployment. Thus, its success relies on the existence of at least one prompt per instance that produces substantial class margin. If prompt diversity is insufficient, or no template yields useful semantic separation, SAGE’s benefits are diminished.

A plausible implication is that further performance improvements could be unlocked by learning or expanding the prompt library, or hybridizing SAGE’s selection mechanism with small, labeled datasets. SAGE’s inferences are independent per image, suggesting compatibility with downstream semi-supervised or ensemble strategies.

7. Relation to Prior Work and Generalization

SAGE distinguishes itself from earlier debiasing and robustness approaches by being purely zero-shot, training-free, and annotation-free. Its margin-based selection is orthogonal to approaches requiring fine-tuning or external supervision, thus maintaining strict out-of-the-box usability for pre-trained VLMs. While validated on standard spurious correlation datasets, the separation-based selection criterion is general and, in principle, applicable to any zero-shot multimodal classification setup where prompt-induced margin correlates with robustness.

In sum, SAGE offers an efficient, training-free, and annotation-free mechanism for mitigating multimodal spurious bias in zero-shot VLMs by exploiting per-image, per-template margin as an indicator of prompt robustness (Ye et al., 17 Nov 2025).
