SAGE: Spuriousness-Aware Guided Exploration
- SAGE is a zero-shot strategy that selects optimal prompt templates based on semantic separation to mitigate multimodal spurious bias.
- It computes, per image, the difference between the highest and lowest class scores for each prompt template, favoring prompts whose high scores stem from core semantic features rather than spurious ones.
- Empirical results show significant improvements in worst-group accuracy across benchmarks like Waterbirds and CelebA without requiring model fine-tuning.
Spuriousness-Aware Guided Exploration (SAGE) is an inference-time prompt selection strategy designed to mitigate multimodal spurious bias in large vision-language models (VLMs), particularly in the context of zero-shot classification with models such as CLIP. SAGE operates without any model fine-tuning, external annotations, or additional training. Instead, it systematically selects, for each test instance, the prompt template that induces maximal semantic separation—i.e., the greatest difference between the highest and lowest class scores—thereby improving robustness, especially on worst-group accuracy benchmarks affected by spurious correlations (Ye et al., 17 Nov 2025).
1. Theoretical Foundation: Multimodal Spurious Bias
Zero-shot classification in CLIP-style models utilizes pre-trained vision ($f_v$) and text ($f_t$) encoders, aligning image and text representations in a shared space. A prompt template $t$ parametrizes class descriptions, with predictions made by computing cosine similarities between image ($f_v(x)$) and text ($f_t(t(y))$) embeddings. In practice, CLIP often learns to rely on features ($a$) that are spuriously correlated with classes (e.g., backgrounds), leading to multimodal spurious bias. Formally, if $f_v(x)$ aligns with a spurious feature $a$, and $a$ co-occurs with some class $y'$ in the training distribution, the model's predictions will reflect these confounders rather than core class features.
Theoretical analysis demonstrates that, under such bias, the model's score for a spurious class can dominate (e.g., when $x$ contains a spurious feature $a$ linked to class $y'$). Absent or weakened spurious features at test time, the model's confidence collapses, yielding low margins. The key insight is that prompts resulting in a high margin (difference between top and bottom class scores) are less likely to be susceptible to spurious features, as high alignment must then come from the core class concept rather than from $a$.
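One way to make this margin argument concrete is a simplified additive-embedding model; the decomposition below is an illustrative assumption of ours, not the paper's exact derivation:

```latex
% Illustrative additive model (our simplification, not the paper's exact setup).
% The image embedding mixes a core class direction u_y with a spurious direction a;
% each class-y' text embedding carries a template-dependent spurious weight gamma.
\[
  f_v(x) \approx \alpha\, u_y + \beta\, a, \qquad
  f_t(t(y')) \approx u_{y'} + \gamma_{y'}(t)\, a,
\]
% so the (unnormalized) class score splits into a core term and a spurious term:
\[
  \langle f_v(x),\, f_t(t(y')) \rangle
  \approx \alpha\, u_y^{\top} u_{y'} + \beta\, \gamma_{y'}(t)\, \lVert a \rVert^2 .
\]
% When beta * gamma_{y'}(t) is large for a wrong class y', top and bottom scores
% bunch together and the margin shrinks; a template t with gamma_{y'}(t) near 0
% for all y' restores a large top-minus-bottom gap driven by the core term alone.
```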
2. Methodology: Design and Operation of SAGE
SAGE addresses spurious bias by guided selection from a library of prompt templates $\mathcal{T} = \{t_1, \dots, t_M\}$, each a string with a class placeholder (e.g., “a bright photo of a [CLASS]”). For each test image $x$, the method computes a separation score for each prompt $t_m$:

$$s(x, t_m) = \max_{y}\,\cos\big(f_v(x),\, f_t(t_m(y))\big) - \min_{y}\,\cos\big(f_v(x),\, f_t(t_m(y))\big),$$

where $f_t(t_m(y))$ are the text embeddings for class $y$ with template $t_m$. Templates are ranked by $s(x, t_m)$, and the top-$k$ (typically $k=1$) are used for final class assignment by averaging their class similarities and selecting $\hat{y} = \arg\max_y \frac{1}{k}\sum_{m \in \mathcal{M}_k} \cos\big(f_v(x), f_t(t_m(y))\big)$, where $\mathcal{M}_k$ denotes the selected templates. This approach remains purely zero-shot and does not require updating $f_v$ or $f_t$.
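A minimal sketch of this selection rule in Python, assuming unit-normalized image and text embeddings are already available as NumPy arrays (variable and function names are ours, not the authors'):

```python
import numpy as np

def sage_predict(img_emb, txt_emb, k=1):
    """Pick the class via SAGE's separation-guided prompt selection.

    img_emb: (d,) unit-normalized image embedding.
    txt_emb: (M, C, d) unit-normalized text embeddings for M templates x C classes.
    k: number of top-separation templates to ensemble.
    """
    # Cosine similarity of the image to every (template, class) text embedding: (M, C).
    sims = txt_emb @ img_emb
    # Separation score per template: top class score minus bottom class score.
    separation = sims.max(axis=1) - sims.min(axis=1)
    # Keep the k templates with the largest separation.
    top_templates = np.argsort(-separation)[:k]
    # Average class similarities over the selected templates, then take the argmax.
    return int(sims[top_templates].mean(axis=0).argmax())
```

With $k=1$ this reduces to trusting the single most separating template per image, which matches the configuration reported as best-performing.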
3. Implementation and Computational Considerations
For $N$ test images, $M$ prompt templates, $C$ classes, and embeddings of dimensionality $d$, SAGE requires $O(NMC)$ cosine similarity computations, each costing $O(d)$. Efficiency heuristics keep $M$ and $C$ small ($C$ is typically $2$–$7$ on the evaluated benchmarks), and prompt text embeddings are pre-encoded, minimizing redundant computation. Only the top-$k$ templates are averaged at inference (with $k=1$ optimal or near-optimal), so ensembling overhead remains minor in practice.
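As a hedged illustration of the pre-encoding step, the sketch below uses OpenAI's open-source `clip` package; the template strings and class names are placeholders of ours, not the paper's actual prompt library:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a bright photo of a {}.", "a painting of a {}."]
classes = ["landbird", "waterbird"]

# Pre-encode all M x C prompts once; inference then only needs image encodings
# plus O(M*C) dot products per image.
with torch.no_grad():
    tokens = clip.tokenize([t.format(c) for t in templates for c in classes]).to(device)
    txt = model.encode_text(tokens).float()
    txt = txt / txt.norm(dim=-1, keepdim=True)               # unit-normalize
    txt_emb = txt.reshape(len(templates), len(classes), -1)  # (M, C, d)
```

The resulting tensor can be converted via `txt_emb.cpu().numpy()` and fed directly into the `sage_predict` sketch above.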
4. Benchmark Evaluation and Empirical Results
SAGE was evaluated on four standard multimodal spurious correlation benchmarks:
- Waterbirds: 2 classes, 4 groups (background–species confounding)
- CelebA–BlondHair: 2 classes (hair color–gender confounding)
- PACS: 7 object classes across 4 domains
- VLCS: 5 object classes across 4 domains
Experiments were conducted with five pre-trained contrastive models (CLIP-RN50, CLIP-ViT-B/32, CLIP-ViT-L/14, ALIGN, AltCLIP) without fine-tuning. Metrics include overall accuracy (AVG), worst-group accuracy (WGA), and harmonic mean (HM) of these two. SAGE consistently improved WGA and HM over the standard zero-shot baseline (ZS), with large and significant performance gains, particularly under spurious bias:
| Dataset | ZS AVG | ZS WGA | ZS HM | SAGE AVG | SAGE WGA | SAGE HM | Δ AVG | Δ WGA | Δ HM |
|---|---|---|---|---|---|---|---|---|---|
| Waterbirds | 84.1 | 36.7 | 51.1 | 88.9 | 44.9 | 59.7 | +4.8 | +8.2 | +8.6 |
| CelebA | 81.1 | 75.3 | 78.1 | 83.4 | 80.6 | 82.0 | +2.3 | +5.3 | +3.9 |
| PACS | 96.2 | 75.5 | 84.6 | 96.6 | 81.9 | 88.7 | +0.4 | +6.4 | +4.1 |
| VLCS | 76.1 | 23.0 | 35.3 | 75.8 | 33.8 | 46.7 | –0.3 | +10.8 | +11.4 |
Statistical significance was established by paired t-tests for WGA and HM. Ablation studies comparing SAGE (top-1 prompt by separation), random prompt selection, and an ensemble (all prompts weighted equally) confirmed the superiority of guided selection, with $k=1$ nearly always optimal.
5. Ablation, Comparative Approaches, and Design Analysis
Ablation analyses contrasted three strategies:
- Ensemble: Equal weighting of all prompts.
- Random: Single prompt picked at random per image.
- SAGE: Single prompt with the highest separation score $s(x, t_m)$ selected per image.
SAGE yielded the highest worst-group accuracy when aggregated across backbones, although per-backbone variance was observed. Increasing $k$ beyond 1 rarely improved results, validating that maximal per-image margin is closely tied to robustness.
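The three strategies differ only in how a template is chosen per image; a compact sketch contrasting them, reusing the per-image similarity matrix from the earlier sketch (illustrative code, not the authors' implementation):

```python
import numpy as np

def predict(sims, strategy="sage", rng=None):
    """sims: (M, C) image-to-text cosine similarities for one image."""
    if strategy == "ensemble":            # equal weight to all M templates
        return int(sims.mean(axis=0).argmax())
    if strategy == "random":              # one template drawn uniformly at random
        rng = rng or np.random.default_rng()
        return int(sims[rng.integers(len(sims))].argmax())
    # SAGE: the single template with the largest top-minus-bottom class margin
    separation = sims.max(axis=1) - sims.min(axis=1)
    return int(sims[separation.argmax()].argmax())
```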
Unlike methods such as ROBOSHOT (which queries an LLM for spurious attributes) or TIE* (using pseudo-labels), SAGE operates without any auxiliary data, annotations, or model parameter updates. Its efficacy is contingent on having a sufficiently diverse prompt library; in failure cases where no prompt induces large separation, performance may revert to baseline.
6. Limitations and Prospective Directions
SAGE is strictly an inference-time strategy. It does not fine-tune the VLM or adapt the library of prompt templates during deployment. Thus, its success relies on the existence of at least one prompt per instance that produces substantial class margin. If prompt diversity is insufficient, or no template yields useful semantic separation, SAGE’s benefits are diminished.
A plausible implication is that further performance improvements could be unlocked by learning or expanding the prompt library, or hybridizing SAGE’s selection mechanism with small, labeled datasets. SAGE’s inferences are independent per image, suggesting compatibility with downstream semi-supervised or ensemble strategies.
7. Relation to Prior Work and Generalization
SAGE distinguishes itself from earlier debiasing and robustness approaches by being purely zero-shot, training-free, and annotation-free. Its margin-based selection is orthogonal to approaches requiring fine-tuning or external supervision, thus maintaining strict out-of-the-box usability for pre-trained VLMs. While validated on standard spurious correlation datasets, the separation-based selection criterion is general and, in principle, applicable to any zero-shot multimodal classification setup where prompt-induced margin correlates with robustness.
In sum, SAGE offers an efficient, training-free, and annotation-free mechanism for mitigating multimodal spurious bias in zero-shot VLMs by exploiting per-image, per-template margin as an indicator of prompt robustness (Ye et al., 17 Nov 2025).