PRSM: Paraphrase Ranking Stability Metric

Updated 21 November 2025

The paper introduces PRSM as a quantitative measure to assess the robustness of CLIP models against semantic variations in text-to-image retrieval.
It computes global stability via Spearman rank-correlation and local stability through top-k image overlaps for different paraphrased queries.
PRSM highlights fairness concerns by revealing numerical discrepancies in retrieval stability, which may amplify biases in vision-language applications.

The Paraphrase Ranking Stability Metric (PRSM) is a quantitative measure introduced to assess the robustness of Contrastive Language-Image Pre-training (CLIP) models against linguistic paraphrasing in text-to-image retrieval settings. PRSM specifically captures the degree to which semantically equivalent text queries (paraphrases) yield stable and consistent image rankings in retrieval outputs. This metric addresses the real-world requirement for retrieval systems to remain invariant under natural linguistic variation, which is particularly pertinent to the trustworthiness and fairness of deployed vision-language models, especially in socially sensitive contexts such as demographic or gendered queries (Schlegel et al., 14 Nov 2025).

1. Motivation and Conceptual Overview

PRSM was developed in response to the observation that CLIP, despite strong performance in zero-shot and few-shot scenarios, exhibits limited robustness to changes in textual phrasing that do not alter semantic content. In deployment scenarios, inconsistent retrieval rankings for paraphrased but semantically equivalent queries can erode user trust and potentially amplify harmful biases if certain phrasings systematically benefit or disadvantage particular demographics. PRSM operationalizes the notion of paraphrase robustness by measuring the stability of image rankings—both overall and in the highest-ranked results—across diverse paraphrases of the same query. The core principle is that a robust model should yield nearly identical image retrievals regardless of superficial linguistic variation.

2. Formal Definition

PRSM is defined for a fixed set of paraphrases and a corresponding fixed gallery of images. Let $Q = \{q_1, \ldots, q_m\}$ denote a set of $m$ paraphrases (including the original query), and let $R(q)$ represent the full ranking of the image gallery by CLIP for query $q$. The top-$k$ ranked images are denoted $R_k(q)$. Global stability is evaluated using the pairwise Spearman rank-correlation $\rho$ between rankings; local stability considers the fraction of overlap in the top-$k$ images. The respective formulae are:

$\text{PRSM}_{\text{global}} = \frac{2}{m(m-1)} \sum_{1 \leq i < j \leq m} \rho\bigl(R(q_i), R(q_j)\bigr)$

$\text{PRSM}_{\text{local}}(k) = \frac{2}{m(m-1)} \sum_{1 \leq i < j \leq m} \frac{|R_k(q_i) \cap R_k(q_j)|}{k}$

Interpretation: A value of $\text{PRSM}_{\text{global}}$ close to 1 indicates nearly identical full rankings for all paraphrase pairs; $\text{PRSM}_{\text{local}}(k)$ close to 1 indicates nearly complete agreement in the top-$k$ retrievals.

3. Computation and Algorithmic Workflow

PRSM is computed using the following procedure:

Embedding and Similarity Calculation: For each query $q \in Q$, compute the CLIP text embedding $t_q$ and, for a fixed gallery of $N$ images with embeddings $v_1, \ldots, v_N$, determine similarity scores $s_{q, i} = \langle t_q, v_i \rangle$.
Ranking and Extraction: Sort these similarity scores in descending order to obtain $R(q)$, and extract $R_k(q)$ for the desired $k$.
Pairwise Stability Assessment: For all unordered pairs $(q_i, q_j)$ with $i < j$, compute $\rho(R(q_i), R(q_j))$ for the global score and $|R_k(q_i) \cap R_k(q_j)|/k$ for the local score.
Averaging: Aggregate across all $\binom{m}{2}$ pairs according to the PRSM formulae.

This process can be written as the following pseudocode, adapted from (Schlegel et al., 14 Nov 2025):

for i in 1..m:
    t[i] = CLIP_encode_text(q[i])
    for img_idx in 1..N:
        scores[i][img_idx] = dot(t[i], V[img_idx])
    ranking[i] = argsort_descending(scores[i])   # image indices, best first
    for pos in 1..N:
        rank_of[i][ranking[i][pos]] = pos        # rank position of each image
    topK_set[i] = set(ranking[i][1..K])          # first K entries, inclusive

sum_rho = 0
sum_overlap = 0
num_pairs = 0
for i in 1..m:
    for j in i+1..m:
        # Spearman compares per-image rank positions, not the index lists
        rho_ij = SpearmanCorrelation(rank_of[i], rank_of[j])
        overlap_ij = len(topK_set[i] & topK_set[j]) / K
        sum_rho += rho_ij
        sum_overlap += overlap_ij
        num_pairs += 1

# num_pairs equals m*(m-1)/2, so this matches the 2/(m(m-1)) prefactor
PRSM_global = sum_rho / num_pairs
PRSM_local = sum_overlap / num_pairs
return PRSM_global, PRSM_local
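
The workflow above can also be sketched as runnable Python. This is a minimal, dependency-free illustration rather than the authors' implementation: it assumes the text and image embeddings are already computed and passed in as plain lists of floats, that there are no tied scores, and the helper names (`rank_positions`, `spearman`, `prsm`) and the toy 2-D vectors are hypothetical.

```python
from itertools import combinations


def rank_positions(scores):
    """Map each image index to its 1-based rank (1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, img in enumerate(order, start=1):
        ranks[img] = position
    return ranks


def spearman(ranks_a, ranks_b):
    """Spearman rho for two tie-free rank vectors over the same images."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def prsm(text_embs, image_embs, k):
    """Return (PRSM_global, PRSM_local(k)) from precomputed embeddings."""
    ranks, top_k = [], []
    for t in text_embs:
        # Dot-product similarity between one paraphrase and every image.
        scores = [sum(x * y for x, y in zip(t, v)) for v in image_embs]
        r = rank_positions(scores)
        ranks.append(r)
        top_k.append({img for img, pos in enumerate(r) if pos <= k})
    pairs = list(combinations(range(len(text_embs)), 2))
    if not pairs:
        raise ValueError("need at least two paraphrases")
    global_score = sum(spearman(ranks[i], ranks[j]) for i, j in pairs) / len(pairs)
    local_score = sum(len(top_k[i] & top_k[j]) / k for i, j in pairs) / len(pairs)
    return global_score, local_score


# Toy 2-D embeddings (hypothetical values, not CLIP outputs).
images = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
queries = [[1.0, 0.1], [0.6, 0.8]]
prsm_global, prsm_local = prsm(queries, images, k=2)
print(prsm_global, prsm_local)
```

For a real gallery, the nested Python loops would normally be replaced by a single matrix product and a library Spearman routine that handles ties; the structure of the computation stays the same.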

4. Illustrative Examples

Consider a synthetic example with five images $\{A, B, C, D, E\}$ and two paraphrases $q_1$, $q_2$:

$R(q_1) = [A, B, C, D, E]$ (Top-2 = $\{A, B\}$)
$R(q_2) = [B, A, D, C, E]$ (Top-2 = $\{B, A\}$)

For this pair, the per-image rank differences are $d = (-1, 1, -1, 1, 0)$, so $\sum d^2 = 4$ and:

$\rho([1,2,3,4,5], [2,1,4,3,5]) = 1 - \frac{6 \cdot 4}{5(5^2 - 1)} = 0.8$
$|\{A, B\} \cap \{B, A\}| / 2 = 1.0$

Thus, $\text{PRSM}_{\text{global}} = 0.8$ and $\text{PRSM}_{\text{local}}(2) = 1.0$, indicating perfect agreement among the highest-ranked images despite some reordering within and below the top two.
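
The two-paraphrase example can be checked directly from the label orderings. The sketch below is an illustration only: the function names are hypothetical, and the Spearman shortcut $1 - 6\sum d^2 / (n(n^2-1))$ it uses is valid only when there are no ties.

```python
def spearman_from_orderings(order_a, order_b):
    """Spearman rho between two tie-free orderings of the same items."""
    pos_a = {item: p for p, item in enumerate(order_a, start=1)}
    pos_b = {item: p for p, item in enumerate(order_b, start=1)}
    n = len(order_a)
    # Squared difference in rank position, summed per item (not per slot).
    d2 = sum((pos_a[item] - pos_b[item]) ** 2 for item in order_a)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def top_k_overlap(order_a, order_b, k):
    """Fraction of the top-k items shared by the two orderings."""
    return len(set(order_a[:k]) & set(order_b[:k])) / k


r1 = ["A", "B", "C", "D", "E"]
r2 = ["B", "A", "D", "C", "E"]
rho = spearman_from_orderings(r1, r2)
overlap = top_k_overlap(r1, r2, k=2)
print(rho, overlap)
```

With only two paraphrases there is a single pair, so these two values are the global and local PRSM scores for the example.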

5. Empirical Evaluation

PRSM was evaluated using the Social Counterfactuals dataset, which provides 170,832 image-caption pairs with controlled male- and female-associated paraphrase groups. Three major paraphrasing strategies were assessed:

P1 (LLM-generated paraphrases): Two Llama-3 paraphrases per caption.
P2 (Prefix swaps): Interchangeable prefixes (e.g., “a photo of”, “a picture of”, none).
P3 (Attribute swaps): Swapping synonyms for demographic attributes (e.g., “young” vs. “youthful”).
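
As an illustration of the simplest strategy, prefix swaps (P2) can be generated mechanically. The sketch below is an assumption about how such variants might be built, not the paper's pipeline: the `prefix_variants` helper and the example caption are hypothetical, and only the three prefixes listed above are used.

```python
# The three prefix options named for P2; the empty string means "no prefix".
PREFIXES = ["a photo of", "a picture of", ""]


def prefix_variants(caption, prefixes=PREFIXES):
    """Return one paraphrase per prefix, replacing any prefix already present."""
    body = caption.strip()
    # Strip a known prefix first (longest match wins) so prefixes do not stack.
    for p in sorted((p for p in prefixes if p), key=len, reverse=True):
        if body.lower().startswith(p + " "):
            body = body[len(p) + 1:]
            break
    return [f"{p} {body}".strip() for p in prefixes]


# Hypothetical caption, used only to show the output shape.
variants = prefix_variants("a photo of a young doctor")
print(variants)
```

Each resulting list forms one paraphrase set $Q$ whose rankings are then compared pairwise.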

Main findings:

Paraphrase Type	Global Spearman (upper bound)	Local Top-100 Overlap
P1	$\le 0.04$	0.64
P2	$\le 0.04$	0.89
P3	$\le 0.04$	0.63 – 0.74

Interpretation: Global ranking stability is extremely low for all three paraphrasing types, with Spearman correlation of at most 0.04. Local top-100 overlap is substantially higher, and highest for prefix swaps (0.89), which leave the content words of the query unchanged. LLM-generated paraphrases (0.64) and attribute swaps (0.63 – 0.74) are less stable locally, and attribute-based paraphrasing was notably unstable for gender-related synonyms.

6. Demographic Factors and Fairness Considerations

When stratified by query association (female or male), PRSM reveals systematic, albeit small, differences in retrieval stability:

Female-associated queries exhibit slightly higher local overlap for prefix and attribute paraphrases.
Male-associated queries sometimes have marginally higher Spearman global stability.

Implication: Even slight consistency differences may accumulate across large-scale retrieval deployments, contributing to disparate user experiences or reinforcing stereotypes, particularly when rankings influence downstream tasks or user perceptions (Schlegel et al., 14 Nov 2025). The metric provides actionable diagnostics for identifying and quantifying such fairness-related discrepancies.
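
One way such a stratified diagnostic could be computed is to group per-query PRSM scores by association and compare the group means. The sketch below is an assumption about the bookkeeping, not the paper's procedure: the `stratified_gap` helper is hypothetical, and the numbers are placeholders that are not results from the paper.

```python
from statistics import mean


def stratified_gap(records):
    """Take (group, prsm_value) pairs; return per-group means and the largest gap."""
    by_group = {}
    for group, value in records:
        by_group.setdefault(group, []).append(value)
    means = {g: mean(vs) for g, vs in by_group.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap


# Placeholder values for illustration only; NOT results from the paper.
records = [("female", 0.75), ("female", 0.25), ("male", 0.5), ("male", 0.25)]
means, gap = stratified_gap(records)
print(means, gap)
```

Whether a given gap is meaningful would still need a significance test or confidence interval, which this sketch leaves out.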

7. Limitations and Prospects for Extension

PRSM, as currently formulated, is evaluated over a single dataset with binary gender distinctions and is applied to encoder-only CLIP models. Notable limitations:

The Social Counterfactuals dataset addresses only gender; there is no coverage of race, religion, or disability.
The metric is not directly applicable to autoregressive or decoder-based VLMs.
All images are treated with equal importance; there is no weighting for social sensitivity or application-specific utility.

Potential enhancements outlined in (Schlegel et al., 14 Nov 2025) include substituting normalized Discounted Cumulative Gain (nDCG) for Spearman rank-correlation, extending the metric to multilingual and multi-sentence queries, and developing paraphrase-invariant training objectives to directly optimize for PRSM. This suggests the metric may serve as both an evaluation tool and a training objective for future fair and robust vision-language models.

References (1)

PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases (2025)
