
SugarCrepe++ Compositionality Benchmark

Updated 8 February 2026
  • SugarCrepe++ is a next-generation benchmark that rigorously tests compositional reasoning in vision-language models through tightly balanced, adversarial triplets.
  • It pairs each image with a structured caption triplet—a semantically correct caption, a hard negative caption, and a paraphrase of the positive—to ensure models rely on true cross-modal understanding.
  • Empirical evaluations show that while state-of-the-art models perform well on simple replacements, they struggle with complex swaps and relational changes, highlighting the need for advanced training strategies.


SugarCrepe++ is a rigorous next-generation benchmark designed to diagnose and quantify compositional reasoning in vision-language models (VLMs). Evolved from previous efforts like SugarCrepe, it targets models' ability to distinguish semantic content under tightly controlled lexical and plausibility constraints, exposing key limitations in current VLMs and unimodal LLMs (ULMs) with respect to both lexical and semantic alterations. By explicitly eliminating shortcuts and spurious cues prevalent in prior benchmarks, SugarCrepe++ offers a more reliable testbed for evaluating compositional generalization and cross-modal understanding.

1. Motivations and Design Principles

Earlier benchmarks in vision-language compositionality—such as ARO, CREPE, and the original SugarCrepe—were shown to be "hackable": strong performance could be achieved by unimodal heuristics that exploited spurious correlations or distributional artifacts, including systematic token length differences or lower plausibility in negative captions. Blind models using LLM log-likelihoods or token-length rules achieved accuracy competitive with state-of-the-art CLIP models, revealing deep constructional biases (Udandarao et al., 9 Jun 2025, Hsieh et al., 2023).

SugarCrepe++ directly addresses these pitfalls via:

  • Distributional equivalence: Positives and negatives are sampled so as to be matched in length and LLM-assessed plausibility.
  • Adversarial stress testing: The benchmark construction includes automated stress tests ensuring that blind text-only or length-based classifiers cannot achieve above-chance accuracy.
  • Semantic focus: Each instance is framed to require genuine visual-grounded compositional reasoning, rather than word-match or fluency cues.

2. Dataset Construction and Triplet Schema

SugarCrepe++ extends the methodology of SugarCrepe through (a) stricter lexical balancing, (b) paraphrase-based positives, and (c) fine-grained semantic perturbations. Each example pairs an image with a triplet of captions:

  • Image I
  • Positive caption P1 (semantically correct)
  • Hard negative caption N (lexically similar, semantically incorrect)
  • Positive paraphrase caption P2 (strictly paraphrastic: equivalent in meaning to P1, lexically distinct)
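
As a concrete sketch, one such example can be represented as a small record. The field names and sample values below are purely illustrative, not an official data format:

```python
from dataclasses import dataclass

@dataclass
class SugarCrepePPExample:
    """One SugarCrepe++ test instance (illustrative field names)."""
    image_id: str    # reference to the MS-COCO image I
    positive: str    # P1: semantically correct caption
    paraphrase: str  # P2: meaning-equivalent, lexically distinct caption
    negative: str    # N: lexically similar but semantically incorrect caption
    category: str    # one of the five hard-negative types

ex = SugarCrepePPExample(
    image_id="COCO_000000123456",
    positive="a man rides a brown horse",
    paraphrase="someone on horseback, the animal chestnut-colored",
    negative="a man rides a brown cow",
    category="replace-object",
)
```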

Generation Pipeline

  1. Starting point: From MS-COCO, select examples (I, P1, N), where N is a structure-preserving but semantically corrupted variant with high lexical overlap.
  2. Second positive (P2) generation: Use an instruction-fine-tuned LLM with an elaborate meta-prompt to produce P2. This prompt enforces:
    • Fluency and grammaticality
    • Semantic equivalence to P1
    • Minimal lexical overlap (paraphrastic constraint)
    • No hallucinations or missing attributes/relations
  3. Automated and human filtering: Any P2 that is duplicate or superfluous is rerun through the generator; semantic equivalence is double-checked by an independent LLM, and final triplets are human-validated.
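
A minimal sketch of the kind of lexical-overlap check the paraphrastic constraint implies. The tokenization, stopword list, and overlap threshold here are illustrative assumptions, not the paper's exact filtering recipe:

```python
# Illustrative paraphrastic-constraint check: accept a candidate P2 only if it
# shares few content words with P1. Threshold and stopwords are assumptions.

def content_tokens(caption: str) -> set:
    """Lowercased word tokens, minus a tiny stopword list (illustrative)."""
    stop = {"a", "an", "the", "is", "are", "of", "on", "in", "with"}
    return {w for w in caption.lower().split() if w not in stop}

def lexical_overlap(p1: str, p2: str) -> float:
    """Jaccard overlap between the content tokens of two captions."""
    t1, t2 = content_tokens(p1), content_tokens(p2)
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def passes_paraphrase_filter(p1: str, p2: str, max_overlap: float = 0.5) -> bool:
    """Accept P2 only if it shares few content words with P1."""
    return lexical_overlap(p1, p2) <= max_overlap

p1 = "a brown dog chases a red ball across the lawn"
p2 = "on the grass, a canine runs after its crimson toy"
print(passes_paraphrase_filter(p1, p2))  # low overlap, so True
```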

Semantic Categories

The benchmark partitions test cases into five controlled "hard negative" types:

Category #Instances
Replace–Object 1,652
Replace–Attribute 788
Replace–Relation 1,406
Swap–Object 245
Swap–Attribute 666

(From (Dumpala et al., 2024))

3. Task Definition and Evaluation Metrics

SugarCrepe++ employs a three-way semantic (in)equivalence protocol:

Image-to-Text (ITT)

For an image I and captions P1, P2, N, the model must assign higher similarity to both P1 and P2 than to N:

\text{ITT}_{\mathrm{hit}}(I, P_1, P_2, N) = \begin{cases} 1 & \text{if } s(I, P_1) > s(I, N) \text{ and } s(I, P_2) > s(I, N) \\ 0 & \text{otherwise} \end{cases}

Text-to-Text (TOT)

Evaluate the compositionality of the text encoder in isolation:

\text{TOT}_{\mathrm{hit}}(P_1, P_2, N) = \begin{cases} 1 & \text{if } s(P_1, P_2) > s(P_1, N) \text{ and } s(P_2, P_1) > s(P_2, N) \\ 0 & \text{otherwise} \end{cases}

  • The similarity function s(·, ·) is typically cosine similarity between frozen CLIP text/image embeddings.
  • Accuracy metric:

\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\mathrm{hit}_i = 1]

(Dumpala et al., 2024)
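
The protocol above can be sketched with cosine similarity on toy embedding vectors. The vectors below are fabricated for illustration; in practice they would come from a frozen VLM's image and text encoders:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def itt_hit(img, p1, p2, neg):
    """Image-to-text: both positives must beat the negative."""
    return int(cos(img, p1) > cos(img, neg) and cos(img, p2) > cos(img, neg))

def tot_hit(p1, p2, neg):
    """Text-to-text: each positive must be closer to the other positive than to N."""
    return int(cos(p1, p2) > cos(p1, neg) and cos(p2, p1) > cos(p2, neg))

# Toy embeddings: the positives cluster near the image, the negative points away.
img = np.array([1.0, 0.1, 0.0])
p1  = np.array([0.9, 0.2, 0.1])
p2  = np.array([0.8, 0.1, 0.2])
neg = np.array([0.1, 1.0, 0.3])

hits = [itt_hit(img, p1, p2, neg)]
accuracy = sum(hits) / len(hits)  # fraction of instances where both positives win
print(itt_hit(img, p1, p2, neg), tot_hit(p1, p2, neg), accuracy)  # 1 1 1.0
```

Note that a hit requires both positives to beat the negative, which is what makes the metric stricter than the single-positive SugarCrepe protocol.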

4. Comparative Positioning, Categories, and Stress-Testing

Relation to Past Benchmarks

SugarCrepe++ addresses and supersedes earlier benchmarks by:

  • Using paraphrastic positive pairs (P1, P2), enforcing semantic rather than merely lexical alignment for success.
  • Drawing both positives and negatives from distributions balanced with respect to token length and LLM-assessed plausibility (log-likelihood).
  • Including adversarial stress-tests ensuring that any method exploiting low-level lexical or plausibility cues cannot outperform random guessing (chance level determined by number of choices per sample).

Stress-Tests (as specified in (Udandarao et al., 9 Jun 2025)):

  • Token-Length Test: A classifier using only token length must score at or near chance.
  • LLM-Likelihood Test: A classifier using only LLM-assessed plausibility must score at or near chance.
  • Text-Only Classifier Test: A unimodal text classifier cannot exceed the chance baseline.
  • Shuffled-Pairs Test: Shuffling negatives across image pairs drops performance to chance.
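
A toy version of the token-length stress test illustrates the idea. The captions and scoring rule below are fabricated; the point is that on length-balanced positive/negative pairs, a "blind" length-based classifier collapses to chance:

```python
# Toy token-length stress test: guess the positive caption by length alone.
# On length-balanced pairs (as in SugarCrepe++) this carries no signal.
pairs = [
    ("a dog chases a red ball", "a dog chases a red kite"),
    ("two birds sit on a wire", "two birds fly off a wire"),
    ("a man rides a brown horse", "a man rides a brown bull"),
    ("a cat sleeps on the couch", "a cat jumps on the couch"),
]

def length_guess(pos: str, neg: str) -> float:
    """Score 1 if length alone identifies the positive, 0.5 for a blind tie-break."""
    lp, ln = len(pos.split()), len(neg.split())
    if lp == ln:
        return 0.5  # tie: a blind guess is right half the time in expectation
    return 1.0 if lp > ln else 0.0

chance_score = sum(length_guess(p, n) for p, n in pairs) / len(pairs)
print(chance_score)  # 0.5: token length is uninformative on balanced pairs
```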

Semantic Axes

Distinct compositional phenomena are explicitly tested:

Category Discriminative Focus
Replace–Object Semantic identity of object altered
Replace–Attribute Attribute changes (e.g., color, size)
Replace–Relation Agent/recipient or spatial preposition modified
Swap–Object Order of objects changed
Swap–Attribute Attribute assignments permuted

(Brusini et al., 1 Feb 2026, Dumpala et al., 2024)

5. Empirical Results and Model Behavior

Baseline Performance

Reported model results on SugarCrepe++ exhibit the following trends (Dumpala et al., 2024):

  • Humans perform nearly perfectly (~97–100%) on all tasks.
  • State-of-the-art VLMs (CLIP, FLAVA, BLIP, etc.) show high accuracy on Replace–Object (~85–95%), but much lower performance (down to 30–50%) on Swap–Object/Attribute and Replace–Relation, indicating severe difficulty with fine-grained compositionality.
  • All models, including large ULMs (e.g., AngleBERT, 7B parameter models), have especially low scores in the TOT setting, confirming that the text encoder is a primary failure bottleneck.

Model-Size, Data, and Objective Correlations

  • Larger pretrain datasets and model scale yield incremental gains, but the main determinant of SC++ performance is the training objective: multi-objective models (contrastive + ITM, MLM, scene graph) perform better than contrastive-only models.
  • Fine-tuning with retrieval supervision (COCO, etc.) provides limited gains, still falling short of human-level accuracy (Dumpala et al., 2024).

Response to Compositional Training

Research on new training strategies—such as PolyGen's multi-generator synthetic data (Brusini et al., 1 Feb 2026), CLIC fine-tuning (Peleg et al., 30 May 2025), and advanced contrastive curricula—shows non-trivial improvements:

Model            Mean SugarCrepe++ Acc. (%)  Key Innovations
SynthCLIP        36.7                        Standard single-generator synthetic data
PolyGen          41.6 (+9.1% rel.)           Ensemble of 3 generators + hard-negative curriculum
CLIC-RedCaps     76.0 (Replace ITT)          Multi-positive/negative concatenation, multi-term loss
CLIP (baseline)  69.5 (Replace ITT)          Standard contrastive pretraining

(Brusini et al., 1 Feb 2026, Peleg et al., 30 May 2025)

Failure modes and open directions

  • Object/attribute swaps and, especially, complex relational changes (e.g., subject-object reversals) remain the hardest categories for all current models.
  • Empirically, restricted or bag-of-words models consistently fail to capture the required compositional rebindings, confirming the necessity of scene-level grounding and structural alignment.

6. Advances Driven by SugarCrepe++

SugarCrepe++ has rapidly become the compositionality benchmark of record for the vision-language community. Recent modeling advances explicitly target its challenge points:

  • PHyCLIP (Yoshikawa et al., 10 Oct 2025): Decomposes semantics into an ℓ1-product of hyperbolic factors, yielding interpretable Boolean-like conjunctions across concept families and strong compositional generalization, including robust transfer to SugarCrepe++ scenarios.
  • CLIC (Peleg et al., 30 May 2025): Achieves state-of-the-art SugarCrepe++ performance via concatenation-based multi-positive alignment, breaking the bag-of-words bottleneck by forcing true scene-level reasoning.
  • PolyGen (Brusini et al., 1 Feb 2026): Exploits synthetic ensemble diversity and hard negative programming to achieve robust, data-efficient scaling and +9.1% relative gains versus single-generator baselines.
  • SPARO (Vani et al., 2024): Partitions transformer representations into disjoint attention slots, further improving compositional benchmarks such as SugarCrepe by up to +9% in selective slot settings.
  • Iterated Learning (Zheng et al., 2024): Aligns learned representations with easier-to-learn, more systematic languages; proposed as a paradigm likely to yield gains as SC++ increases compositional depth.

7. Outlook and Recommendations

SugarCrepe++ established several new standards for vision-language evaluation:

  • In-distribution balancing: By exactly matching length, plausibility, and other possible cues between positives and negatives, it prevents non-compositional shortcutting.
  • Three-way (or higher-order) evaluation protocols: Complex paraphrastic and semantic redundancy forces models to rely on true grounding, not pattern matching.
  • Group and bidirectional retrieval: As advocated in BiVLC (Miranda et al., 2024), fully bidirectional tests and group metrics prevent cherry-picking success in a single modality.
  • Automated robustification: Formulaic stress-tests are included in benchmark construction, certifying that only models with true visio-linguistic compositional reasoning can excel (Udandarao et al., 9 Jun 2025).
  • Open frontier: Substantial headroom remains—no model achieves consistent human-level performance; highest metrics are observed only with innovative architecture or fine-tuning strategies.

A plausible implication is that future compositional benchmarks should further integrate group and bidirectional retrieval, scene graph alignment, multimodal negative mining, and systematic coverage of complex attributes and relations. SugarCrepe++ serves as the current foundation for progress in this direction.
