
LGIP: Invariance & Sensitivity in VLMs

Updated 24 November 2025
  • LGIP is a diagnostic benchmark that quantifies invariance to meaning-preserving paraphrases and sensitivity to contradictory captions in VLMs.
  • It introduces detailed metrics like invariance error, semantic sensitivity gap, and positive-rate to reveal nuanced model behaviors beyond standard retrieval accuracy.
  • Experimental evaluations demonstrate trade-offs among VLM families, highlighting critical semantic grounding issues in models such as SigLIP.

Language-Guided Invariance Probing (LGIP) is a behavioral diagnostic benchmark for vision–language models (VLMs) that quantifies two essential properties: invariance to meaning-preserving paraphrases of image captions, and sensitivity to meaning-changing (contradictory) “flipped” captions. LGIP introduces a suite of metrics that disentangle linguistic robustness from traditional aggregate retrieval accuracy, offering fine-grained insight into how VLMs such as CLIP, OpenCLIP, EVA02-CLIP, and SigLIP compose visual and textual information at the level of semantic and syntactic variation (Lee, 17 Nov 2025).

1. Motivation and Diagnostic Gap

State-of-the-art VLMs leverage large dual-tower encoders to align image and text representations and are typically evaluated by zero-shot retrieval or classification accuracy. However, such aggregate scores do not separate two distinct desiderata:

  • Linguistic invariance: The model should preserve similarity across meaning-equivalent (paraphrased) captions paired with the same image.
  • Semantic sensitivity: The model should sharply penalize captions that semantically contradict the image (e.g., altering object, color, or count).

Traditional benchmarks conflate these factors, obscuring failure modes: a model may be overly sensitive to trivial rephrasings or insufficiently responsive to core semantic changes. LGIP addresses this by providing independent metrics, illuminating the invariance–sensitivity frontier and exposing robustness breakdowns masked by conventional metrics (Lee, 17 Nov 2025).

2. Dataset Selection and Perturbation Protocol

LGIP operates on the MS COCO dataset, comprising $M = 40{,}000$ images, each with $N = 5$ human-written captions. For every $(I, c)$ pair, two sets of text perturbations are generated:

  • Paraphrase set $\mathcal{P}(I, c)$:
    • Paraphrases are formed by wrapping the original caption $c$ with lightweight templates (e.g., “a photo of $c$”, “in this picture $c$”, “this image shows $c$”).
    • Content words remain unchanged; only stylistic or framing aspects vary.
    • Paraphrases are deduplicated and capped at $K_{\text{same}} = 6$ variants per caption.
  • Semantic flip set $\mathcal{F}(I, c)$:
    • Lexical fields are defined for COCO-object nouns, color adjectives, and number words.
    • Captions are mutated by substituting a matching token with an alternative from the same field: object ("dog" → "person"), color ("brown" → "red"), number ("two" → "five").
    • Each flip consists of a single substitution, with trivial or duplicate edits discarded, and at most $K_{\text{diff}} = 6$ flips per base caption (yielding approximately 80,632 valid flipped captions).
    • The flip type $t \in \{\text{obj}, \text{col}, \text{num}\}$ is recorded for granular evaluation.

In all cases, the image $I$ remains constant; only the captions are perturbed.
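
The perturbation protocol is simple enough to express directly in code. The sketch below illustrates the two generators under the constraints described above (template wrapping, single-token substitutions, deduplication, caps of six); the template list and lexical fields are illustrative stand-ins, not the paper's exact resources.

```python
# Illustrative paraphrase templates and lexical fields; the paper's exact
# lists may differ.
PARAPHRASE_TEMPLATES = [
    "a photo of {c}",
    "in this picture {c}",
    "this image shows {c}",
]
LEXICAL_FIELDS = {
    "obj": ["dog", "cat", "person", "car"],      # COCO-object nouns
    "col": ["brown", "red", "blue", "green"],    # color adjectives
    "num": ["one", "two", "three", "five"],      # number words
}
K_SAME, K_DIFF = 6, 6

def paraphrases(caption: str) -> list[str]:
    """Meaning-preserving rewrites: wrap the caption in lightweight templates."""
    out = [t.format(c=caption) for t in PARAPHRASE_TEMPLATES]
    return list(dict.fromkeys(out))[:K_SAME]     # deduplicate, cap at K_same

def semantic_flips(caption: str) -> list[tuple[str, str]]:
    """Meaning-changing rewrites: substitute one token with another word from
    the same lexical field; returns (flip_type, flipped_caption) pairs."""
    tokens = caption.split()
    flips, seen = [], set()
    for i, tok in enumerate(tokens):
        for field, words in LEXICAL_FIELDS.items():
            if tok not in words:
                continue
            for alt in words:
                if alt == tok:                   # discard trivial edits
                    continue
                flipped = " ".join(tokens[:i] + [alt] + tokens[i + 1:])
                if flipped not in seen:          # discard duplicate edits
                    seen.add(flipped)
                    flips.append((field, flipped))
    return flips[:K_DIFF]                        # cap at K_diff per base caption
```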

3. Formal Metric Definitions

Let $f_{\text{img}}(I)$ and $f_{\text{text}}(c)$ denote the $\ell_2$-normalized image and text encoder outputs. The primary metrics are:

  • Cosine Similarity:

$$s(I, c) = \mathrm{sim}\big(f_{\text{img}}(I), f_{\text{text}}(c)\big)$$

  • Invariance Error $E_{\text{inv}}$:

$$E_{\text{inv}} = \mathbb{E}_{(I, c)}\,\mathbb{E}_{c' \in \mathcal{P}(I, c)}\,\big|\,s(I, c) - s(I, c')\,\big|$$

Quantifies the average fluctuation in similarity score across paraphrased captions.

  • Semantic Sensitivity Gap $G_{\text{sens}}$ (also denoted $E_{\text{sens}}$):

$$g(I, c, c^{\dagger}) = s(I, c) - s(I, c^{\dagger})$$

$$G_{\text{sens}} = \mathbb{E}_{(I, c)}\,\mathbb{E}_{c^{\dagger} \in \mathcal{F}(I, c)}\,\big[\,g(I, c, c^{\dagger})\,\big]$$

Measures the margin in similarity between correct and contradicted captions.

  • Positive-Rate Statistic $R^+$:

$$\mathbf{1}_{\text{pos}}(I, c, c^{\dagger}) = \begin{cases} 1 & \text{if } s(I, c) > s(I, c^{\dagger}) \\ 0 & \text{otherwise} \end{cases}$$

$$R^+ = \mathbb{E}_{(I, c)}\,\mathbb{E}_{c^{\dagger} \in \mathcal{F}(I, c)}\,\big[\,\mathbf{1}_{\text{pos}}(I, c, c^{\dagger})\,\big]$$

Indicates the empirical fraction of comparisons in which the original caption is scored higher than its flip; $R^+ \approx 1$ denotes near-perfect detection of contradictions, $R^+ \approx 0.5$ is chance level, and $R^+ < 0.5$ reflects a preference for the flipped (incorrect) captions.

Each metric can be calculated overall or stratified by flip type $t$ (object, color, count) for fine-grained analysis.
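
Given precomputed similarity scores, all three metrics reduce to simple averages over the comparison pairs. The sketch below is one possible aggregation routine; the function name and the flat per-comparison input layout are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def lgip_metrics(s_orig_para, s_para, s_orig_flip, s_flip, flip_types=None):
    """Aggregate LGIP metrics from precomputed cosine similarities.

    s_orig_para, s_para : per paraphrase comparison, s(I, c) and s(I, c')
    s_orig_flip, s_flip : per flip comparison, s(I, c) and s(I, c-dagger)
    flip_types          : optional labels in {"obj", "col", "num"} per flip
    """
    s_orig_para, s_para = np.asarray(s_orig_para), np.asarray(s_para)
    gaps = np.asarray(s_orig_flip) - np.asarray(s_flip)

    metrics = {
        "E_inv": np.abs(s_orig_para - s_para).mean(),  # invariance error
        "G_sens": gaps.mean(),                         # semantic sensitivity gap
        "R_plus": (gaps > 0).mean(),                   # positive rate
    }
    if flip_types is not None:                         # stratify by flip type
        flip_types = np.asarray(flip_types)
        for t in np.unique(flip_types):
            mask = flip_types == t
            metrics[f"G_sens/{t}"] = gaps[mask].mean()
            metrics[f"R_plus/{t}"] = (gaps[mask] > 0).mean()
    return metrics
```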

4. Benchmark Execution and Aggregation

In the practical LGIP pipeline, each of the 40,000 images contributes five captions, and each caption yields up to six paraphrases and six semantic flips. The following steps are executed for every $(I, c)$ pair:

  1. Generate $\mathcal{P}(I, c)$ and $\mathcal{F}(I, c)$.
  2. Compute $s(I, c)$, $s(I, c')$ for all $c' \in \mathcal{P}(I, c)$, and $s(I, c^{\dagger})$ for all $c^{\dagger} \in \mathcal{F}(I, c)$.
  3. Aggregate metrics across all paraphrase and flip comparisons—approximately 1.2 million paraphrase and 80,000 flip comparisons per evaluated model.

Evaluations are performed in zero-shot image–text matching mode with frozen encoders and consistent preprocessing: the OpenCLIP pipeline for CLIP/EVA02/OpenCLIP models and the HuggingFace processors for SigLIP/SigLIP 2.
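
For the dual-tower models, step 2 amounts to encoding an image and a batch of captions with frozen encoders and taking cosine similarities of the $\ell_2$-normalized features. A minimal sketch using the open_clip package follows; the model tag is one example, not necessarily a checkpoint evaluated in the paper.

```python
import torch
import open_clip
from PIL import Image

# Example model tag; the benchmark covers several CLIP/OpenCLIP/EVA02 variants.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()  # frozen encoders, zero-shot matching mode

@torch.no_grad()
def similarity(image_path: str, captions: list[str]) -> torch.Tensor:
    """Return cosine similarities s(I, c) for one image and a list of captions."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer(captions)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)  # l2-normalize
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(0)  # shape: (len(captions),)
```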

5. Experimental Protocol and Model Comparison

Nine prominent VLM variants are assessed:

| Model | $E_{\text{inv}}$ ↓ | $G_{\text{sens}}$ ↑ | $R^+$ ↑ |
|---|---|---|---|
| CLIP ViT-B/16 | 0.008 | 0.024 | 0.866 |
| CLIP ViT-L/14 | 0.009 | 0.027 | 0.873 |
| OpenCLIP ViT-L/14 | 0.008 | 0.046 | 0.898 |
| OpenCLIP ViT-H/14 | 0.010 | 0.050 | 0.908 |
| EVA02-CLIP L/14 | 0.005 | 0.030 | 0.896 |
| SigLIP base-224 | 0.055 | −0.017 | 0.474 |
| SigLIP base-384 | 0.058 | −0.021 | 0.464 |
| SigLIP large-384 | 0.013 | 0.002 | 0.538 |
| SigLIP 2 base-224 | 0.041 | 0.008 | 0.649 |

Key findings include:

  • CLIP and OpenCLIP (especially the larger variants) improve semantic sensitivity $G_{\text{sens}}$ with only modest increases in invariance error $E_{\text{inv}}$, forming an empirical Pareto frontier.
  • EVA02-CLIP L/14 achieves the lowest $E_{\text{inv}}$ while maintaining $G_{\text{sens}} \approx 0.030$ and $R^+ \approx 0.90$.
  • SigLIP-family models exhibit significantly higher $E_{\text{inv}}$ (up to 0.058) and often negative $G_{\text{sens}}$, indicating a tendency to prefer flipped (incorrect) captions; their $R^+$ values cluster around chance.

When metrics are stratified by flip type, CLIP and EVA02-CLIP achieve $R^+ \approx 0.95$ for object flips, $0.83$–$0.87$ for color flips, and $0.66$–$0.77$ for count flips. In contrast, SigLIP base models perform at or below chance on all flip types, demonstrating severe deficiencies in semantic grounding that are not apparent in standard retrieval evaluations.

6. Model Behavior and Failure Modes

Qualitative analysis reveals that CLIP, OpenCLIP, and EVA02-CLIP reliably down-weight captions with incorrect objects, colors, or counts. By contrast, SigLIP frequently assigns higher similarity to captions that contradict the image, including cases where an object is swapped entirely (e.g., “cat” → “person”) or a color adjective is exchanged. Such behavior is invisible to aggregate retrieval metrics but critical for applications requiring prompt-level or fine-grained semantic alignment.

LGIP further reveals that robust invariance under paraphrasing does not trivially guarantee sensitivity to semantic flips. There exists an inherent trade-off, highlighted quantitatively by the $E_{\text{inv}}$–$G_{\text{sens}}$ frontier.

7. Recommendations and Prospects

The LGIP benchmark is model-agnostic, leveraging frozen encoders and simple rule-based caption perturbations. Its empirical findings lead to the following recommendations:

  • LGIP should be employed as a standard behavioral diagnostic, complementing accuracy-driven benchmarks for comprehensive evaluation of linguistic robustness.
  • Failures exposed by LGIP (especially in SigLIP-family models) indicate the value of using invariance and sensitivity as explicit regularizers or validation signals during training, penalizing paraphrase variance and encouraging larger margins against semantically contradictory captions (a sketch of such an auxiliary loss follows this list).
  • Adoption of LGIP-like evaluations is recommended for the design and selection of future vision–language models, ensuring improved semantic grounding and linguistic robustness beyond coarse-grained accuracy (Lee, 17 Nov 2025).
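
As a concrete illustration of the second recommendation, the following hypothetical auxiliary loss penalizes paraphrase variance and enforces a margin against flipped captions. The function, margin value, and weights are assumptions made for illustration; they are not part of the LGIP paper or any released training recipe.

```python
import torch
import torch.nn.functional as F

def lgip_regularizer(s_orig: torch.Tensor,
                     s_para: torch.Tensor,
                     s_flip: torch.Tensor,
                     margin: float = 0.1,
                     w_inv: float = 1.0,
                     w_sens: float = 1.0) -> torch.Tensor:
    """Hypothetical auxiliary loss combining invariance and sensitivity terms.

    s_orig : similarities s(I, c) for original captions, shape (B,)
    s_para : similarities s(I, c') for paraphrases of those captions, shape (B,)
    s_flip : similarities s(I, c-dagger) for semantic flips, shape (B,)
    """
    # Penalize score fluctuation under meaning-preserving paraphrases.
    loss_inv = (s_orig - s_para).abs().mean()
    # Encourage a margin over semantically contradictory captions.
    loss_sens = F.relu(margin - (s_orig - s_flip)).mean()
    return w_inv * loss_inv + w_sens * loss_sens
```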

Language-Guided Invariance Probing thus provides a scalable, interpretable, and actionable framework to quantify and disentangle how VLMs respond to meaning-preserving versus meaning-altering textual variation, crucial for advancing the reliability of multimodal AI systems.
