PRSM: Paraphrase Ranking Stability Metric
- The paper introduces PRSM as a quantitative measure to assess the robustness of CLIP models against semantic variations in text-to-image retrieval.
- It computes global stability via pairwise Spearman rank-correlation and local stability via top-$K$ image overlap across paraphrased queries.
- PRSM highlights fairness concerns by revealing measurable group-level discrepancies in retrieval stability, which may amplify biases in vision-language applications.
The Paraphrase Ranking Stability Metric (PRSM) is a quantitative measure introduced to assess the robustness of Contrastive Language-Image Pre-training (CLIP) models against linguistic paraphrasing in text-to-image retrieval settings. PRSM captures the degree to which semantically equivalent text queries (paraphrases) yield stable and consistent image rankings in retrieval outputs. The metric addresses the real-world requirement that retrieval systems remain invariant under natural linguistic variation, which is particularly pertinent to the trustworthiness and fairness of deployed vision-language models, especially in socially sensitive contexts such as demographic or gendered queries (Schlegel et al., 14 Nov 2025).
1. Motivation and Conceptual Overview
PRSM was developed in response to the observation that CLIP, despite strong performance in zero-shot and few-shot scenarios, exhibits limited robustness to changes in textual phrasing that do not alter semantic content. In deployment scenarios, inconsistent retrieval rankings for paraphrased but semantically equivalent queries can erode user trust and potentially amplify harmful biases if certain phrasings systematically benefit or disadvantage particular demographics. PRSM operationalizes the notion of paraphrase robustness by measuring the stability of image rankings—both overall and in the highest-ranked results—across diverse paraphrases of the same query. The core principle is that a robust model should yield nearly identical image retrievals regardless of superficial linguistic variation.
2. Formal Definition
PRSM is defined for a fixed set of paraphrases and a corresponding fixed gallery of images. Let $Q = \{q_1, \ldots, q_m\}$ denote a set of $m$ paraphrases (including the original query), and let $r_i$ represent the full ranking of the image gallery by CLIP for query $q_i$. The top-$K$ ranked images are denoted $\mathrm{Top}_K(q_i)$. Global stability is evaluated using the pairwise Spearman rank-correlation $\rho$ between rankings; local stability considers the fraction of overlap in the top-$K$ images. The respective formulae are:

$$\mathrm{PRSM}_{\mathrm{global}} = \frac{2}{m(m-1)} \sum_{i<j} \rho(r_i, r_j)$$

$$\mathrm{PRSM}_{\mathrm{local}} = \frac{2}{m(m-1)} \sum_{i<j} \frac{|\mathrm{Top}_K(q_i) \cap \mathrm{Top}_K(q_j)|}{K}$$

Interpretation: a $\mathrm{PRSM}_{\mathrm{global}}$ value close to 1 indicates nearly identical full rankings for all paraphrase pairs; a $\mathrm{PRSM}_{\mathrm{local}}$ value close to 1 indicates nearly complete agreement in the top-$K$ retrievals.
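For a single pair of rankings, the two pairwise ingredients of these formulae can be computed as follows. This is a minimal illustrative sketch, not the authors' reference implementation; it uses SciPy's `spearmanr` on per-image rank positions.

```python
import numpy as np
from scipy.stats import spearmanr

def pair_stability(ranking_i, ranking_j, k):
    """Spearman rho over full rankings and top-k overlap for one pair.

    ranking_i, ranking_j: 1-D integer arrays of image indices, best first.
    """
    n = len(ranking_i)
    # Convert orderings into per-image rank positions, then correlate.
    pos_i = np.empty(n, dtype=int)
    pos_j = np.empty(n, dtype=int)
    pos_i[ranking_i] = np.arange(n)
    pos_j[ranking_j] = np.arange(n)
    rho, _ = spearmanr(pos_i, pos_j)
    overlap = len(set(ranking_i[:k]) & set(ranking_j[:k])) / k
    return rho, overlap

# For the rankings used in the Section 4 example (images A..E as 0..4):
# pair_stability(np.array([0, 1, 2, 3, 4]), np.array([1, 0, 3, 2, 4]), k=2)
# -> (0.8, 1.0)
```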
3. Computation and Algorithmic Workflow
PRSM is computed using the following procedure:
- Embedding and Similarity Calculation: For each query $q_i$, compute the CLIP text embedding $t_i$ and, for a fixed gallery of $N$ images with embeddings $v_1, \ldots, v_N$, determine similarity scores $s_{ij} = t_i \cdot v_j$.
- Ranking and Extraction: Sort these similarity scores in descending order to obtain $r_i$, and extract $\mathrm{Top}_K(q_i)$ for the desired $K$.
- Pairwise Stability Assessment:
  - For all unordered pairs $(i, j)$ with $i < j$, compute $\rho(r_i, r_j)$ for global stability and $|\mathrm{Top}_K(q_i) \cap \mathrm{Top}_K(q_j)| / K$ for local stability.
- Averaging: Aggregate across all pairs according to the PRSM formulae.
This process is described algorithmically in the following pseudocode from (Schlegel et al., 14 Nov 2025):
```
for i in 1..m:
    t[i] = CLIP_encode_text(q[i])
    for img_idx in 1..N:
        scores[i][img_idx] = dot(t[i], V[img_idx])
    ranking[i] = argsort_descending(scores[i])
    topK_set[i] = set(ranking[i][1:K])

sum_rho = 0
sum_overlap = 0
num_pairs = 0
for i in 1..m:
    for j in i+1..m:
        rho_ij = SpearmanCorrelation(ranking[i], ranking[j])
        overlap_ij = len(topK_set[i] & topK_set[j]) / K
        sum_rho += rho_ij
        sum_overlap += overlap_ij
        num_pairs += 1

PRSM_global = (2 / (m*(m-1))) * sum_rho
PRSM_local = (2 / (m*(m-1))) * sum_overlap
return PRSM_global, PRSM_local
```
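A runnable counterpart to this pseudocode is sketched below, under the assumption that the text embeddings `T` (shape m×d) and gallery image embeddings `V` (shape N×d) have already been computed with CLIP; this is an illustration, not the authors' code.

```python
import itertools
import numpy as np
from scipy.stats import spearmanr

def prsm(T, V, k=100):
    """PRSM_global and PRSM_local from precomputed embedding matrices.

    T: (m, d) text embeddings, one row per paraphrase of a single query.
    V: (N, d) image embeddings for the fixed gallery.
    """
    scores = T @ V.T                          # (m, N) similarity scores
    rankings = np.argsort(-scores, axis=1)    # image indices, best first
    positions = np.argsort(rankings, axis=1)  # rank position of each image
    top_k = [set(r[:k]) for r in rankings]

    rhos, overlaps = [], []
    for i, j in itertools.combinations(range(len(rankings)), 2):
        rho, _ = spearmanr(positions[i], positions[j])
        rhos.append(rho)
        overlaps.append(len(top_k[i] & top_k[j]) / k)
    # The mean over all unordered pairs equals the 2/(m(m-1)) * sum form.
    return float(np.mean(rhos)), float(np.mean(overlaps))
```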
4. Illustrative Examples
Consider a synthetic example with five images $\{A, B, C, D, E\}$ and two paraphrases $q_1$, $q_2$:
- $r_1 = (A, B, C, D, E)$ (Top-2 = $\{A, B\}$)
- $r_2 = (B, A, D, C, E)$ (Top-2 = $\{A, B\}$)
For this pair:

$$\rho(r_1, r_2) = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \cdot 4}{5 \cdot 24} = 0.8, \qquad \frac{|\mathrm{Top}_2(q_1) \cap \mathrm{Top}_2(q_2)|}{2} = 1.0$$

Thus $\mathrm{PRSM}_{\mathrm{global}} = 0.8$ and $\mathrm{PRSM}_{\mathrm{local}} = 1.0$, indicating perfect agreement among the highest-ranked images despite some reordering further down the ranking.
5. Empirical Results Using Social Counterfactuals
PRSM was evaluated using the Social Counterfactuals dataset, which provides 170,832 image-caption pairs with controlled male- and female-associated paraphrase groups. Three major paraphrasing strategies were assessed (a toy sketch of variant generation follows the list):
- P1 (LLM-generated paraphrases): Two Llama-3 paraphrases per caption.
- P2 (Prefix swaps): Interchangeable prefixes (e.g., “a photo of”, “a picture of”, none).
- P3 (Attribute swaps): Swapping synonyms for demographic attributes (e.g., “young” vs. “youthful”).
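For concreteness, the sketch below shows how P2- and P3-style variants might be generated. The prefixes are those listed above; the synonym map is purely illustrative and not taken from the paper.

```python
PREFIXES = ["a photo of ", "a picture of ", ""]  # P2: interchangeable prefixes
SYNONYMS = {"young": "youthful"}                 # P3: illustrative attribute pair

def prefix_variants(caption: str) -> list[str]:
    """P2: strip any known prefix, then re-attach each variant."""
    bare = caption
    for p in PREFIXES:
        if p and bare.startswith(p):
            bare = bare[len(p):]
    return [p + bare for p in PREFIXES]

def attribute_variants(caption: str) -> list[str]:
    """P3: swap demographic-attribute synonyms (in either direction)."""
    variants = [caption]
    for a, b in SYNONYMS.items():
        if a in caption:
            variants.append(caption.replace(a, b))
        elif b in caption:
            variants.append(caption.replace(b, a))
    return variants

# prefix_variants("a photo of a young doctor")
# -> ["a photo of a young doctor", "a picture of a young doctor", "a young doctor"]
```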
Main findings:
| Paraphrase Type | Global Spearman ($\mathrm{PRSM}_{\mathrm{global}}$) | Local Top-100 Overlap ($\mathrm{PRSM}_{\mathrm{local}}$) |
|---|---|---|
| P1 (LLM paraphrases) | 0.04 | 0.64 |
| P2 (prefix swaps) | 0.04 | 0.89 |
| P3 (attribute swaps) | 0.04 | 0.63–0.74 |
Interpretation: Global ranking stability is extremely low across all paraphrasing types ($\rho \approx 0.04$). Top-100 image overlap indicates greater local consistency for prefix swaps, which alter surface form without touching meaning-bearing content, than for LLM-generated or attribute paraphrases. Attribute-based paraphrasing demonstrated notable instability, especially for gender-related synonyms.
6. Demographic Factors and Fairness Considerations
When stratified by query association (female or male), PRSM reveals systematic, albeit small, differences in retrieval stability:
- Female-associated queries exhibit slightly higher local overlap for prefix and attribute paraphrases.
- Male-associated queries sometimes have marginally higher Spearman global stability.
Implication: Even slight consistency differences may accumulate across large-scale retrieval deployments, contributing to disparate user experiences or reinforcing stereotypes, particularly when rankings influence downstream tasks or user perceptions (Schlegel et al., 14 Nov 2025). The metric provides actionable diagnostics for identifying and quantifying such fairness-related discrepancies.
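A minimal sketch of how such stratified comparisons could be computed is shown below, assuming the `prsm` function sketched in Section 3 and a hypothetical list pairing each query's paraphrase-embedding matrix with its gender association.

```python
import numpy as np

def stratified_prsm(queries, V, k=100):
    """Average PRSM_global / PRSM_local separately per demographic group.

    queries: list of (T, label) pairs, where T is the (m, d) paraphrase
             embedding matrix for one base caption and label is e.g.
             "female" or "male". Relies on prsm() as sketched earlier.
    """
    by_group = {}
    for T, label in queries:
        g, l = prsm(T, V, k=k)
        by_group.setdefault(label, []).append((g, l))
    return {label: (float(np.mean([g for g, _ in vals])),
                    float(np.mean([l for _, l in vals])))
            for label, vals in by_group.items()}
```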
7. Limitations and Prospects for Extension
PRSM, as currently formulated, is evaluated over a single dataset with binary gender distinctions and is applied to encoder-only CLIP models. Notable limitations:
- The Social Counterfactuals dataset addresses only gender; there is no coverage of race, religion, or disability.
- The metric is not directly applicable to autoregressive or decoder-based VLMs.
- All images are treated with equal importance; there is no weighting for social sensitivity or application-specific utility.
Potential enhancements outlined in (Schlegel et al., 14 Nov 2025) include substituting normalized Discounted Cumulative Gain (nDCG) for Spearman rank-correlation, extending the metric to multilingual and multi-sentence queries, and developing paraphrase-invariant training objectives that directly optimize for PRSM. This suggests the metric may serve as both an evaluation tool and a training objective for future fair and robust vision-language models.
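As one illustration of the first proposal (this instantiation is an assumption, not the authors' specification), an nDCG-style agreement score could treat membership in one paraphrase's top-$K$ as binary relevance when scoring another paraphrase's ranking:

```python
import numpy as np

def ndcg_agreement(ranking_a, ranking_b, k=100):
    """One possible nDCG-style stability score between two rankings.

    Treats membership in ranking_a's top-k as binary relevance and scores
    ranking_b against it; call with arguments swapped and average to
    symmetrize.
    """
    rel = set(ranking_a[:k])
    gains = np.array([1.0 if img in rel else 0.0 for img in ranking_b[:k]])
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1/log2(rank + 1)
    dcg = float(np.sum(gains * discounts))
    idcg = float(np.sum(discounts))  # ideal: all top-k positions relevant
    return dcg / idcg
```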