
SugarCrepe++ Compositionality Benchmark

Updated 8 February 2026
  • SugarCrepe++ is a next-generation benchmark that rigorously tests compositional reasoning in vision-language models through tightly balanced, adversarial triplets.
  • It pairs each image with a structured caption triplet—a semantically correct caption, a hard negative caption, and a paraphrase of the positive—to ensure models rely on true cross-modal understanding.
  • Empirical evaluations show that while state-of-the-art models perform well on simple replacements, they struggle with complex swaps and relational changes, highlighting the need for advanced training strategies.


SugarCrepe++ is a rigorous next-generation benchmark designed to diagnose and quantify compositional reasoning in vision-language models (VLMs). Evolved from previous efforts like SugarCrepe, it targets models' ability to distinguish semantic content under tightly controlled lexical and plausibility constraints, exposing key limitations in current VLMs and unimodal LLMs (ULMs) with respect to both lexical and semantic alterations. By explicitly eliminating shortcuts and spurious cues prevalent in prior benchmarks, SugarCrepe++ offers a more reliable testbed for evaluating compositional generalization and cross-modal understanding.

1. Motivations and Design Principles

Earlier benchmarks in vision-language compositionality—such as ARO, CREPE, and the original SugarCrepe—were shown to be "hackable": strong performance could be achieved by unimodal heuristics that exploited spurious correlations or distributional artifacts, including systematic token length differences or lower plausibility in negative captions. Blind models using LLM log-likelihoods or token-length rules achieved accuracy competitive with state-of-the-art CLIP models, revealing deep constructional biases (Udandarao et al., 9 Jun 2025, Hsieh et al., 2023).

SugarCrepe++ directly addresses these pitfalls via:

  • Distributional equivalence: Positives and negatives are sampled so as to be matched in length and LLM-assessed plausibility.
  • Adversarial stress testing: The benchmark construction includes automated stress tests ensuring that blind text-only or length-based classifiers cannot achieve above-chance accuracy.
  • Semantic focus: Each instance is framed to require genuine visual-grounded compositional reasoning, rather than word-match or fluency cues.

2. Dataset Construction and Triplet Schema

SugarCrepe++ extends the methodology of SugarCrepe through (a) stricter lexical balancing, (b) paraphrase-based positives, and (c) fine-grained semantic perturbations. Each example pairs an image with a triplet of captions:

  • Image I
  • Positive caption P1 (semantically correct)
  • Hard negative caption N (lexically similar, semantically incorrect)
  • Positive paraphrase caption P2 (strictly paraphrastic: equivalent in meaning to P1, lexically distinct)
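
As a concrete sketch, one such example can be represented as a small record. The field names and sample values below are purely illustrative, not an official data format:

```python
from dataclasses import dataclass

@dataclass
class SugarCrepePPExample:
    """One SugarCrepe++ test instance (illustrative field names)."""
    image_id: str    # reference to the MS-COCO image I
    positive: str    # P1: semantically correct caption
    paraphrase: str  # P2: meaning-equivalent, lexically distinct caption
    negative: str    # N: lexically similar but semantically incorrect caption
    category: str    # one of the five hard-negative types

ex = SugarCrepePPExample(
    image_id="COCO_000000123456",
    positive="a man rides a brown horse",
    paraphrase="someone on horseback, the animal chestnut-colored",
    negative="a man rides a brown cow",
    category="replace-object",
)
```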

Generation Pipeline

  1. Starting point: From MS-COCO, select examples (I, P1, N), where N is a structure-preserving but semantically corrupted variant with high lexical overlap.
  2. Second positive (P2) generation: Use an instruction-fine-tuned LLM with an elaborate meta-prompt to produce P2. This prompt enforces:
    • Fluency and grammaticality
    • Semantic equivalence to P1
    • Minimal lexical overlap (paraphrastic constraint)
    • No hallucinations or missing attributes/relations
  3. Automated and human filtering: Any P2 that is duplicate or superfluous is rerun through the generator; semantic equivalence is double-checked by an independent LLM, and final triplets are human-validated.
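
A minimal sketch of the kind of lexical-overlap check the paraphrastic constraint implies. The tokenization, stopword list, and overlap threshold here are illustrative assumptions, not the paper's exact filtering recipe:

```python
# Illustrative paraphrastic-constraint check: accept a candidate P2 only if it
# shares few content words with P1. Threshold and stopwords are assumptions.

def content_tokens(caption: str) -> set:
    """Lowercased word tokens, minus a tiny stopword list (illustrative)."""
    stop = {"a", "an", "the", "is", "are", "of", "on", "in", "with"}
    return {w for w in caption.lower().split() if w not in stop}

def lexical_overlap(p1: str, p2: str) -> float:
    """Jaccard overlap between the content tokens of two captions."""
    t1, t2 = content_tokens(p1), content_tokens(p2)
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def passes_paraphrase_filter(p1: str, p2: str, max_overlap: float = 0.5) -> bool:
    """Accept P2 only if it shares few content words with P1."""
    return lexical_overlap(p1, p2) <= max_overlap

p1 = "a brown dog chases a red ball across the lawn"
p2 = "on the grass, a canine runs after its crimson toy"
print(passes_paraphrase_filter(p1, p2))  # low overlap, so True
```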

Semantic Categories

The benchmark partitions test cases into five controlled "hard negative" types:

Category #Instances
Replace–Object 1,652
Replace–Attribute 788
Replace–Relation 1,406
Swap–Object 245
Swap–Attribute 666

(From (Dumpala et al., 2024))

3. Task Definition and Evaluation Metrics

SugarCrepe++ employs a three-way semantic (in)equivalence protocol:

Image-to-Text (ITT)

For an image I and captions P1, P2, N, the model must assign higher similarity to both P1 and P2 than to N:

\text{ITT}_{\mathrm{hit}}(I, P_1, P_2, N) = \begin{cases} 1 & \text{if } s(I, P_1) > s(I, N) \text{ and } s(I, P_2) > s(I, N) \\ 0 & \text{otherwise} \end{cases}

Text-to-Text (TOT)

Evaluate the compositionality of the text encoder in isolation:

\text{TOT}_{\mathrm{hit}}(P_1, P_2, N) = \begin{cases} 1 & \text{if } s(P_1, P_2) > s(P_1, N) \text{ and } s(P_2, P_1) > s(P_2, N) \\ 0 & \text{otherwise} \end{cases}

  • The similarity function s(·, ·) is typically cosine similarity between frozen CLIP text/image embeddings.
  • Accuracy metric:

\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\mathrm{hit}_i = 1]

(Dumpala et al., 2024)
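
The protocol above can be sketched with cosine similarity on toy embedding vectors. The vectors below are fabricated for illustration; in practice they would come from a frozen VLM's image and text encoders:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def itt_hit(img, p1, p2, neg):
    """Image-to-text: both positives must beat the negative."""
    return int(cos(img, p1) > cos(img, neg) and cos(img, p2) > cos(img, neg))

def tot_hit(p1, p2, neg):
    """Text-to-text: each positive must be closer to the other positive than to N."""
    return int(cos(p1, p2) > cos(p1, neg) and cos(p2, p1) > cos(p2, neg))

# Toy embeddings: the positives cluster near the image, the negative points away.
img = np.array([1.0, 0.1, 0.0])
p1  = np.array([0.9, 0.2, 0.1])
p2  = np.array([0.8, 0.1, 0.2])
neg = np.array([0.1, 1.0, 0.3])

hits = [itt_hit(img, p1, p2, neg)]
accuracy = sum(hits) / len(hits)  # fraction of instances where both positives win
print(itt_hit(img, p1, p2, neg), tot_hit(p1, p2, neg), accuracy)  # 1 1 1.0
```

Note that a hit requires both positives to beat the negative, which is what makes the metric stricter than the single-positive SugarCrepe protocol.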

4. Comparative Positioning, Categories, and Stress-Testing

Relation to Past Benchmarks

SugarCrepe++ addresses and supersedes earlier benchmarks by:

  • Using paraphrastic positive pairs (P1, P2), enforcing semantic rather than merely lexical alignment for success.
  • Drawing both positives and negatives from distributions balanced with respect to token length and LLM-assessed plausibility (log-likelihood).
  • Including adversarial stress-tests ensuring that any method exploiting low-level lexical or plausibility cues cannot outperform random guessing (chance level determined by number of choices per sample).

Stress-Tests (as specified in (Udandarao et al., 9 Jun 2025)):

  • Token-Length Test: A classifier using only token length must score at or near chance.
  • LLM-Likelihood Test: A classifier using only LLM-assessed plausibility must score at or near chance.
  • Text-Only Classifier Test: A unimodal text classifier cannot exceed the chance baseline.
  • Shuffled-Pairs Test: Shuffling negatives across image pairs drops performance to chance.
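
A toy version of the token-length stress test illustrates the idea. The captions and scoring rule below are fabricated; the point is that on length-balanced positive/negative pairs, a "blind" length-based classifier collapses to chance:

```python
# Toy token-length stress test: guess the positive caption by length alone.
# On length-balanced pairs (as in SugarCrepe++) this carries no signal.
pairs = [
    ("a dog chases a red ball", "a dog chases a red kite"),
    ("two birds sit on a wire", "two birds fly off a wire"),
    ("a man rides a brown horse", "a man rides a brown bull"),
    ("a cat sleeps on the couch", "a cat jumps on the couch"),
]

def length_guess(pos: str, neg: str) -> float:
    """Score 1 if length alone identifies the positive, 0.5 for a blind tie-break."""
    lp, ln = len(pos.split()), len(neg.split())
    if lp == ln:
        return 0.5  # tie: a blind guess is right half the time in expectation
    return 1.0 if lp > ln else 0.0

chance_score = sum(length_guess(p, n) for p, n in pairs) / len(pairs)
print(chance_score)  # 0.5: token length is uninformative on balanced pairs
```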

Semantic Axes

Distinct compositional phenomena are explicitly tested:

Category Discriminative Focus
Replace–Object Semantic identity of object altered
Replace–Attribute Attribute changes (e.g., color, size)
Replace–Relation Agent/recipient or spatial preposition modified
Swap–Object Order of objects changed
Swap–Attribute Attribute assignments permuted

(Brusini et al., 1 Feb 2026, Dumpala et al., 2024)

5. Empirical Results and Model Behavior

Baseline Performance

Reported model results on SugarCrepe++ exhibit the following trends (Dumpala et al., 2024):

  • Humans perform nearly perfectly (~97–100%) on all tasks.
  • State-of-the-art VLMs (CLIP, FLAVA, BLIP, etc.) show high accuracy on Replace–Object (~85–95%), but much lower performance (down to 30–50%) on Swap–Object/Attribute and Replace–Relation, indicating severe difficulty with fine-grained compositionality.
  • All models, including large ULMs (e.g., AngleBERT, 7B parameter models), have especially low scores in the TOT setting, confirming that the text encoder is a primary failure bottleneck.

Model-Size, Data, and Objective Correlations

  • Larger pretrain datasets and model scale yield incremental gains, but the main determinant of SC++ performance is the training objective: multi-objective models (contrastive + ITM, MLM, scene graph) perform better than contrastive-only models.
  • Fine-tuning with retrieval supervision (COCO, etc.) provides limited gains, still falling short of human-level accuracy (Dumpala et al., 2024).

Response to Compositional Training

Research on new training strategies—such as PolyGen's multi-generator synthetic data (Brusini et al., 1 Feb 2026), CLIC fine-tuning (Peleg et al., 30 May 2025), and advanced contrastive curricula—shows non-trivial improvements:

Model            Mean SugarCrepe++ Acc. (%)  Key Innovations
SynthCLIP        36.7                        Standard single-generator synthetic data
PolyGen          41.6 (+9.1% rel.)           Ensemble of 3 generators + hard-negative curriculum
CLIC-RedCaps     76.0 (Replace ITT)          Multi-positive/negative concatenation, multi-term loss
CLIP (baseline)  69.5 (Replace ITT)          Standard contrastive pretraining

(Brusini et al., 1 Feb 2026, Peleg et al., 30 May 2025)

Failure modes and open directions

  • Object/attribute swaps and, especially, complex relational changes (e.g., subject-object reversals) remain the hardest categories for all current models.
  • Empirically, restricted or bag-of-words models consistently fail to capture the required compositional rebindings, confirming the necessity of scene-level grounding and structural alignment.

6. Advances Driven by SugarCrepe++

SugarCrepe++ has rapidly become the compositionality benchmark of record for the vision-language community. Recent modeling advances explicitly target its challenge points:

  • PHyCLIP (Yoshikawa et al., 10 Oct 2025): Decomposes semantics into an ℓ1-product of hyperbolic factors, yielding interpretable Boolean-like conjunctions across concept families and strong compositional generalization, including robust transfer to SugarCrepe++ scenarios.
  • CLIC (Peleg et al., 30 May 2025): Achieves state-of-the-art SugarCrepe++ performance via concatenation-based multi-positive alignment, breaking the bag-of-words bottleneck by forcing true scene-level reasoning.
  • PolyGen (Brusini et al., 1 Feb 2026): Exploits synthetic ensemble diversity and hard negative programming to achieve robust, data-efficient scaling and +9.1% relative gains versus single-generator baselines.
  • SPARO (Vani et al., 2024): Partitions transformer representations into disjoint attention slots, further improving compositional benchmarks such as SugarCrepe by up to +9% in selective slot settings.
  • Iterated Learning (Zheng et al., 2024): Aligns learned representations with easier-to-learn, more systematic languages; proposed as a paradigm likely to yield gains as SC++ increases compositional depth.

7. Outlook and Recommendations

SugarCrepe++ established several new standards for vision-language evaluation:

  • In-distribution balancing: By exactly matching length, plausibility, and other possible cues between positives and negatives, it prevents non-compositional shortcutting.
  • Three-way (or higher-order) evaluation protocols: Complex paraphrastic and semantic redundancy forces models to rely on true grounding, not pattern matching.
  • Group and bidirectional retrieval: As advocated in BiVLC (Miranda et al., 2024), fully bidirectional tests and group metrics prevent cherry-picking success in a single modality.
  • Automated robustification: Formulaic stress-tests are included in benchmark construction, certifying that only models with true visio-linguistic compositional reasoning can excel (Udandarao et al., 9 Jun 2025).
  • Open frontier: Substantial headroom remains—no model achieves consistent human-level performance; highest metrics are observed only with innovative architecture or fine-tuning strategies.

A plausible implication is that future compositional benchmarks should further integrate group and bidirectional retrieval, scene graph alignment, multimodal negative mining, and systematic coverage of complex attributes and relations. SugarCrepe++ serves as the current foundation for progress in this direction.
