
SAGE: Spuriousness-Aware Guided Exploration

Updated 24 November 2025
  • SAGE is a zero-shot strategy that selects optimal prompt templates based on semantic separation to mitigate multimodal spurious bias.
  • It computes, for each image, the difference between the highest and lowest class scores, favoring prompts that align with core semantic features rather than spurious ones.
  • Empirical results show significant improvements in worst-group accuracy across benchmarks like Waterbirds and CelebA without requiring model fine-tuning.

Spuriousness-Aware Guided Exploration (SAGE) is an inference-time prompt selection strategy designed to mitigate multimodal spurious bias in large vision-language models (VLMs), particularly in the context of zero-shot classification with models such as CLIP. SAGE operates without any model fine-tuning, external annotations, or additional training. Instead, it systematically selects, for each test instance, the prompt template that induces maximal semantic separation (i.e., the greatest difference between the highest and lowest class scores), thereby improving robustness, especially on worst-group accuracy benchmarks affected by spurious correlations (Ye et al., 17 Nov 2025).

1. Theoretical Foundation: Multimodal Spurious Bias

Zero-shot classification in CLIP-style models utilizes pre-trained vision ($\phi(\cdot)$) and text ($\psi(\cdot)$) encoders, aligning image and text representations in a shared space. A prompt template $T$ parametrizes class descriptions, with predictions made by computing cosine similarities between image ($v = \phi(x)$) and text ($u = \psi(t)$) embeddings. In practice, CLIP often learns to rely on features ($u_s$) that are spuriously correlated with classes (e.g., backgrounds), leading to multimodal spurious bias. Formally, if $p(v \mid u) \approx p(v \mid u, u_s)$ and $p(u \mid v) \approx p(u \mid v, u_s)$ for some $u_s$, the model's predictions will reflect these confounders rather than core class features.
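For concreteness, a minimal zero-shot classification sketch using the open-source `clip` package is shown below; the image path and the two prompt strings are illustrative placeholders, not inputs from the paper.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

# Load a pre-trained CLIP model together with its image preprocessor.
model, preprocess = clip.load("ViT-B/32")

# Hypothetical inputs: any local image and one description per class.
image = preprocess(Image.open("bird.jpg")).unsqueeze(0)
text = clip.tokenize(["a photo of a landbird", "a photo of a waterbird"])

with torch.no_grad():
    v = model.encode_image(image)  # image embedding v = phi(x)
    u = model.encode_text(text)    # text embeddings u = psi(t)

# Cosine similarity is the dot product of L2-normalized embeddings.
v = v / v.norm(dim=-1, keepdim=True)
u = u / u.norm(dim=-1, keepdim=True)
scores = (v @ u.T).squeeze(0)      # one similarity score per class
print("predicted class index:", scores.argmax().item())
```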

Theoretical analysis demonstrates that, under such bias, the model's score for a spurious class can dominate (e.g., $p(u_1 \mid v) / p(u_2 \mid v) > 1$ when $x$ contains a spurious feature $u_s$ linked to class $u_1$). When spurious features are absent or weakened at test time, the model's confidence collapses, yielding low margins. The key insight is that prompts producing a high margin (a large difference between the top and bottom class scores) are less likely to be susceptible to spurious features, since the alignment must then come from the core class concept rather than from $u_s$.
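As a purely hypothetical illustration (the numbers are ours, not the paper's): suppose on a two-class problem template $T_a$ yields cosine scores $(s_1, s_2) = (0.31, 0.29)$ while template $T_b$ yields $(0.34, 0.21)$. The margins are $\sigma_a = 0.02$ and $\sigma_b = 0.13$, so $T_b$ is preferred: its wide separation suggests a decision driven by the core class concept, whereas the near-tie under $T_a$ is the signature of a prediction propped up by a shared spurious feature.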

2. Methodology: Design and Operation of SAGE

SAGE addresses spurious bias by guided selection from a library of $M$ prompt templates $\mathcal{T}$, each a string with a class placeholder (e.g., "a bright photo of a [CLASS]"). For each test image $x_n$, the method computes a separation score $\sigma_j^n$ for each prompt $T_j$:

$$\sigma_j^n = \max_{i \in [1..C]} \frac{v_n^\top u_j^i}{\|v_n\| \, \|u_j^i\|} - \min_{i \in [1..C]} \frac{v_n^\top u_j^i}{\|v_n\| \, \|u_j^i\|}$$

where $u_j^i$ is the text embedding for class $c_i$ under template $T_j$. Templates are ranked by $\sigma_j^n$, and the top-$K$ (typically $K = 1$) are used for the final class assignment by averaging their class similarities and selecting $\hat{y}_n = \arg\max_i \frac{1}{K} \sum_{k=1}^{K} s_{j_k}^i$, where $j_1, \dots, j_K$ index the selected templates. This approach remains purely zero-shot and does not require updating $\phi$ or $\psi$.
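A minimal NumPy sketch of this selection rule (our own rendering, with assumed array shapes, not the authors' released code):

```python
import numpy as np

def sage_predict(v, U, k=1):
    """Classify one image via SAGE prompt selection.

    v : (D,)      L2-normalized image embedding v_n.
    U : (M, C, D) L2-normalized text embeddings u_j^i, one per
                  (template j, class i) pair, pre-encoded once.
    k : number of top-separation templates to average (K = 1 in the paper).
    """
    scores = U @ v                                  # (M, C) cosine similarities
    sep = scores.max(axis=1) - scores.min(axis=1)   # separation sigma_j per template
    top = np.argsort(sep)[-k:]                      # k templates with largest margin
    return scores[top].mean(axis=0).argmax()        # average scores, pick class

# Synthetic embeddings standing in for CLIP encoder outputs.
rng = np.random.default_rng(0)
v = rng.normal(size=512)
v /= np.linalg.norm(v)
U = rng.normal(size=(80, 2, 512))
U /= np.linalg.norm(U, axis=-1, keepdims=True)
print(sage_predict(v, U))  # predicted class index
```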

3. Implementation and Computational Considerations

For $N$ test images, $M$ prompt templates, $C$ classes, and embeddings of dimensionality $D$, SAGE requires $O(NMCD)$ cosine-similarity computations. Efficiency heuristics keep $M$ and $C$ small (e.g., $M = 80$, with $C$ typically $2$–$7$), and the prompt text embeddings $u_j^i$ are pre-encoded once, minimizing redundant computation. Only the top-$K$ templates are averaged at inference (with $K = 1$ optimal or near-optimal), so ensembling overhead remains minor in practice.
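The same computation vectorizes cleanly over the whole test set once the text embeddings are pre-encoded; a sketch (again ours, under the same shape assumptions as above):

```python
import numpy as np

def sage_predict_batch(V, U, k=1):
    """V: (N, D) image embeddings; U: (M, C, D) text embeddings, encoded once."""
    S = np.einsum('nd,mcd->nmc', V, U)       # all similarities: O(N*M*C*D)
    sep = S.max(axis=2) - S.min(axis=2)      # (N, M) separation per image/template
    top = np.argsort(sep, axis=1)[:, -k:]    # (N, k) best templates per image
    rows = np.arange(len(V))[:, None]        # broadcast index for fancy indexing
    return S[rows, top].mean(axis=1).argmax(axis=1)  # (N,) predicted classes
```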

4. Benchmark Evaluation and Empirical Results

SAGE was evaluated on four standard multimodal spurious correlation benchmarks:

  • Waterbirds: 2 classes, 4 groups (background–species confounding)
  • CelebA–BlondHair: 2 classes (hair color–gender confounding)
  • PACS: 7 object classes across 4 domains
  • VLCS: 5 object classes across 4 domains

Experiments were conducted with five pre-trained contrastive models (CLIP-RN50, CLIP-ViT-B/32, CLIP-ViT-L/14, ALIGN, AltCLIP) without fine-tuning. Metrics are overall accuracy (AVG), worst-group accuracy (WGA), and the harmonic mean (HM) of the two. SAGE consistently improved WGA and HM over the standard zero-shot baseline (ZS), with large and statistically significant gains, particularly under spurious bias (all values in percent):

Dataset      ZS AVG   ZS WGA   ZS HM   SAGE AVG   SAGE WGA   SAGE HM   Δ AVG   Δ WGA   Δ HM
Waterbirds     84.1     36.7    51.1       88.9       44.9      59.7    +4.8    +8.2   +8.6
CelebA         81.1     75.3    78.1       83.4       80.6      82.0    +2.3    +5.3   +3.9
PACS           96.2     75.5    84.6       96.6       81.9      88.7    +0.4    +6.4   +4.1
VLCS           76.1     23.0    35.3       75.8       33.8      46.7    -0.3   +10.8  +11.4
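As a quick consistency check, the HM columns follow from AVG and WGA via the harmonic-mean formula $\mathrm{HM} = 2 \cdot \mathrm{AVG} \cdot \mathrm{WGA} / (\mathrm{AVG} + \mathrm{WGA})$:

```python
def hm(avg, wga):
    """Harmonic mean of average and worst-group accuracy."""
    return 2 * avg * wga / (avg + wga)

print(round(hm(84.1, 36.7), 1))  # 51.1 -- Waterbirds, zero-shot row
print(round(hm(88.9, 44.9), 1))  # 59.7 -- Waterbirds, SAGE row
```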

Statistical significance was established by paired t-tests ($p < 0.01$) for WGA and HM. Ablation studies comparing SAGE (top-1 prompt by separation), random prompt selection, and ensembling (all prompts weighted equally) confirmed the superiority of guided selection, with $K = 1$ nearly always optimal.
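For reference, a paired t-test of this kind can be run with SciPy; the arrays below are invented placeholders, not the paper's measurements:

```python
from scipy import stats

# Hypothetical matched worst-group accuracies (baseline vs. SAGE)
# across five runs; values are illustrative only.
zs_wga   = [36.7, 75.3, 75.5, 23.0, 60.2]
sage_wga = [44.9, 80.6, 81.9, 33.8, 66.5]

result = stats.ttest_rel(sage_wga, zs_wga)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```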

5. Ablation, Comparative Approaches, and Design Analysis

Ablation analyses contrasted three strategies:

  • Ensemble: equal weighting of all $M = 80$ prompts.
  • Random: a single prompt picked at random per image.
  • SAGE: the single prompt with the highest separation $\sigma$ selected per image.

SAGE yielded the highest worst-group accuracy when aggregated across backbones, although per-backbone variances were observed. Increasing $K$ beyond 1 rarely improved results, validating that maximal per-image margin is closely tied to robustness.
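The three strategies amount to different reductions over the same per-image score matrix; a schematic comparison (our own illustrative code):

```python
import numpy as np

def classify(scores, strategy, rng=np.random.default_rng(0)):
    """scores: (M, C) prompt-by-class similarity matrix for one image."""
    if strategy == "ensemble":            # equal weight over all M prompts
        return scores.mean(axis=0).argmax()
    if strategy == "random":              # one prompt drawn uniformly at random
        return scores[rng.integers(len(scores))].argmax()
    if strategy == "sage":                # the prompt with the largest separation
        sep = scores.max(axis=1) - scores.min(axis=1)
        return scores[sep.argmax()].argmax()
    raise ValueError(f"unknown strategy: {strategy}")
```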

Unlike methods such as ROBOSHOT (which queries an LLM for spurious attributes) or TIE* (using pseudo-labels), SAGE operates without any auxiliary data, annotations, or model parameter updates. Its efficacy is contingent on having a sufficiently diverse prompt library; in failure cases where no prompt induces large separation, performance may revert to baseline.

6. Limitations and Prospective Directions

SAGE is strictly an inference-time strategy. It does not fine-tune the VLM or adapt the library of prompt templates during deployment. Thus, its success relies on the existence of at least one prompt per instance that produces substantial class margin. If prompt diversity is insufficient, or no template yields useful semantic separation, SAGE’s benefits are diminished.

A plausible implication is that further performance improvements could be unlocked by learning or expanding the prompt library, or hybridizing SAGE’s selection mechanism with small, labeled datasets. SAGE’s inferences are independent per image, suggesting compatibility with downstream semi-supervised or ensemble strategies.

7. Relation to Prior Work and Generalization

SAGE distinguishes itself from earlier debiasing and robustness approaches by being purely zero-shot, training-free, and annotation-free. Its margin-based selection is orthogonal to approaches requiring fine-tuning or external supervision, thus maintaining strict out-of-the-box usability for pre-trained VLMs. While validated on standard spurious correlation datasets, the separation-based selection criterion is general and, in principle, applicable to any zero-shot multimodal classification setup where prompt-induced margin correlates with robustness.

In sum, SAGE offers an efficient, training-free, and annotation-free mechanism for mitigating multimodal spurious bias in zero-shot VLMs by exploiting per-image, per-template margin as an indicator of prompt robustness (Ye et al., 17 Nov 2025).
