Winoground: Vision-Language Compositionality Benchmark

Updated 18 March 2026

Winoground is a visio-linguistic compositionality benchmark that uses minimal pairs of images and captions to test order-sensitive reasoning.
The dataset comprises 400 quadruples annotated by experts to capture subtle linguistic swaps and visual discrepancies.
Evaluation metrics emphasize fine-grained alignment between visual scenes and linguistic structure, highlighting significant gaps compared to human performance.

Winoground is a visio-linguistic compositionality benchmark expressly designed to expose and quantify the inability of modern vision–language models to perform fine-grained, order-sensitive reasoning about objects and relations in multimodal data. It consists of minimal pairs of images and captions in which the two captions contain exactly the same words (both content and function words) and differ only in word order, so that each caption matches exactly one of the two images. Successful performance requires genuine compositional alignment between linguistic structure and visual scene content, not merely object or phrase recognition. Winoground has become a cornerstone diagnostic dataset for evaluating and advancing compositional reasoning in vision–language systems (Thrush et al., 2022).

1. Dataset Construction and Challenge Structure

Winoground comprises 400 instances (quadruples), each with two distinct images $(I_0, I_1)$ and two minimally different captions $(C_0, C_1)$. The distinguishing characteristic is that both captions contain exactly the same lexical items, but in different orders, leading to meaning-changing permutations (e.g., "a mug in grass" vs. "grass in a mug"). The images were sourced from licensed Getty Images assets and each annotation was curated by four expert annotators (computational linguists and vision–language specialists). For every quadruple $(C_0, I_0, C_1, I_1)$, annotators confirmed that $(C_0, I_0)$ and $(C_1, I_1)$ are preferred over the swapped alternatives.

The items are annotated with a fine-grained tag schema (Thrush et al., 2022, Diwan et al., 2022):

Linguistic swap-dependent tags: object swaps (reordered nouns), relation swaps (verbs, prepositions, adjectives), both swaps (simultaneous object and relation change).
Linguistic swap-independent tags: one or two main predicates.
Visual reasoning tags: symbolic (diagram/graphic), series (strong visual distractors), pragmatics (idiomatic/language-specific cues).
A further annotation campaign introduced tags to capture NonCompositional, AmbiguouslyCorrect, VisuallyDifficult, UnusualImage, UnusualText, and ComplexReasoning items, isolating a core "NoTag" subset of 171 minimal, compositional-only examples (Diwan et al., 2022).

The construction explicitly echoes the Winograd Schema Challenge for language, but in the visio-linguistic domain, enforcing that only compositional structure—not mere object or word presence—can solve the task.

2. Task Definition and Evaluation Metrics

Each model is expected to match images to captions under strict requirements that penalize "bag-of-words" methods and reward true compositional grounding.

Let $s(C, I)$ be the model's similarity score between caption $C$ and image $I$. For each instance:

Text Score $f$: The model must select the correct caption for each image:

$f = \begin{cases} 1 & \text{if } s(C_0, I_0) > s(C_1, I_0) \land s(C_1, I_1) > s(C_0, I_1) \\ 0 & \text{otherwise} \end{cases}$

Image Score $g$: The model must select the correct image for each caption:

$g = \begin{cases} 1 & \text{if } s(C_0, I_0) > s(C_0, I_1) \land s(C_1, I_1) > s(C_1, I_0) \\ 0 & \text{otherwise} \end{cases}$

Group Score $h$: Simultaneous correctness on both axes:

$h = \begin{cases} 1 & \text{if } f = 1 \land g = 1 \\ 0 & \text{otherwise} \end{cases}$

Scores are aggregated as mean accuracy across all 400 examples. Human performance (as reference) is approximately 89.5% (text), 88.5% (image), and 85.5% (group).
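
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the benchmark's authors: the `Scores` dictionary layout, the function names, and the toy similarity values are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Four similarity scores for one example, keyed by (caption index, image index),
# so s[(0, 1)] stands for s(C_0, I_1). This layout is an assumption.
Scores = Dict[Tuple[int, int], float]


def text_score(s: Scores) -> int:
    """1 if each image scores higher with its own caption, else 0."""
    return int(s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)])


def image_score(s: Scores) -> int:
    """1 if each caption scores higher with its own image, else 0."""
    return int(s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)])


def group_score(s: Scores) -> int:
    """1 only if both the text and image conditions hold."""
    return int(text_score(s) == 1 and image_score(s) == 1)


def mean_scores(examples: List[Scores]) -> Dict[str, float]:
    """Mean accuracy over a list of examples, as used for aggregation."""
    n = len(examples)
    return {
        "text": sum(text_score(e) for e in examples) / n,
        "image": sum(image_score(e) for e in examples) / n,
        "group": sum(group_score(e) for e in examples) / n,
    }


# Toy similarity values (invented for illustration only).
toy_examples = [
    {(0, 0): 0.9, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.8},  # all conditions hold
    {(0, 0): 0.5, (1, 0): 0.2, (0, 1): 0.6, (1, 1): 0.9},  # text holds, image fails
]
results = mean_scores(toy_examples)
print(results)  # {'text': 1.0, 'image': 0.5, 'group': 0.5}
```

The second toy example shows why the group score is strict: a single failed comparison, here $s(C_0, I_0) < s(C_0, I_1)$, zeroes the image score and therefore the group score.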

A parallel metric, the "group matching score" (Zhu et al., 9 Oct 2025), compares the sum of similarity scores for the two possible bijections (identity and swap), relaxing individual pairwise maxima. This metric better exposes hidden potential in models otherwise obscured by the strict group score.
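
As a rough sketch of the difference, the snippet below compares the strict group score with a group matching score computed from summed similarities. The dictionary layout, function names, and toy values are assumptions for illustration; the cited work may treat ties and normalization differently.

```python
from typing import Dict, Tuple

# s[(c, i)] stands for s(C_c, I_i); this layout is an assumption.
Scores = Dict[Tuple[int, int], float]


def strict_group_score(s: Scores) -> int:
    """Strict group score: all four pairwise comparisons must be correct."""
    text_ok = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
    image_ok = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
    return int(text_ok and image_ok)


def group_matching_score(s: Scores) -> int:
    """1 if the identity bijection has a larger total score than the swap."""
    identity_total = s[(0, 0)] + s[(1, 1)]
    swap_total = s[(0, 1)] + s[(1, 0)]
    return int(identity_total > swap_total)


# Toy values: one pairwise comparison fails, but the correct matching still wins.
example = {(0, 0): 0.5, (1, 0): 0.2, (0, 1): 0.6, (1, 1): 0.9}
strict = strict_group_score(example)
matched = group_matching_score(example)
print(strict, matched)  # 0 1
```

In this toy case the strict score is 0 because $s(C_0, I_1) > s(C_0, I_0)$, while the matching score is 1 because the identity assignment still has the larger total (1.4 against 0.8).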

3. Experimental Results and Model Failures

Early evaluations (Thrush et al., 2022, Diwan et al., 2022, Pandey et al., 2022, Ossowski et al., 2024) revealed that both single-stream transformers (UNITER, VILLA, VinVL, VisualBERT, ViLT) and dual-stream models (ViLBERT, LXMERT, CLIP, FLAVA, UniT) perform close to chance: the best group scores, 14.5% (VinVL) and 14.3% (FLAVA), fall below the 16.7% random baseline, and no model exceeds 40% on text or image score. CLIP, despite large-scale pretraining, is particularly weak on order sensitivity (Text 30.8%, Image 10.5%, Group 8.0%) (Ossowski et al., 2024). Hard negative pretraining, masked language modeling, and cross-modal alignment regularization (e.g., CACR (Pandey et al., 2022)) raise the group score by roughly 5–7 points but do not approach human baselines.

Fine-grained breakdowns indicate that:

Object and relation swaps (classic compositional minimal pairs) yield the best text and group scores, at roughly 39% and 14% respectively.
Symbolic or series items: models fail almost entirely (group ≈0%), whereas humans reach 91%.
Items requiring world knowledge or high visual acuity: models perform at or below random.
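
A breakdown of this kind can be produced by grouping per-example outcomes by tag. The sketch below assumes a hypothetical record format (a set of tag strings plus a 0/1 group score per example); the tag names and the outcomes are toy values, not the published results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def per_tag_accuracy(records: Iterable[Tuple[Set[str], int]]) -> Dict[str, float]:
    """Mean 0/1 score per tag; an example counts once for every tag it carries."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for tags, correct in records:
        for tag in tags:
            totals[tag] += 1
            hits[tag] += correct
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Hypothetical records: (tags, group score). Values are invented for illustration.
toy_records = [
    ({"object"}, 1),
    ({"object"}, 0),
    ({"relation", "series"}, 0),
    ({"relation"}, 1),
    ({"symbolic"}, 0),
]
breakdown = per_tag_accuracy(toy_records)
print(breakdown)
```

Because an example may carry several tags, the per-tag counts need not sum to the dataset size.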

All baseline models exhibit pronounced bag-of-words bias, frequently assigning nearly identical scores under swaps of adjectives, prepositions, or verb arguments (Thrush et al., 2022). Attention heatmaps reveal near-indistinguishable cross-modal alignment whether word order is swapped or not (e.g., "brown dog" vs. "white dog").

Table: Representative Model Scores on Winoground (Group Score)

| Model | Group Score (%) |
|-------|-----------------|
| Random | 16.7 |
| CLIP | 8.0 |
| UNITER_base | 8.5 |
| VinVL | 14.5 |
| FLAVA (ITM) | 14.3 |
| CACR_base (regularized) | 14.25 |
| Human (MTurk) | 85.5 |
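
The 16.7% random baseline can be checked by enumeration. The sketch below assumes that a random scorer assigns the four similarities $s(C_0, I_0)$, $s(C_0, I_1)$, $s(C_1, I_0)$, $s(C_1, I_1)$ independently and without ties, so that all 24 rank orderings are equally likely. That modeling choice, and the function name, are assumptions of this example.

```python
from fractions import Fraction
from itertools import permutations

# (caption index, image index) for the four scores of one example.
PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def chance_levels():
    """Exact text, image, and group chance levels over all 24 rank orderings."""
    text = image = group = total = 0
    for ranks in permutations(range(4)):
        s = dict(zip(PAIRS, ranks))
        text_ok = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
        image_ok = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
        text += text_ok
        image += image_ok
        group += text_ok and image_ok
        total += 1
    return Fraction(text, total), Fraction(image, total), Fraction(group, total)


print(chance_levels())  # (Fraction(1, 4), Fraction(1, 4), Fraction(1, 6))
```

Under these assumptions the group score requires both matched-pair scores to outrank both mismatched-pair scores, which holds in 4 of the 24 orderings, i.e. 1/6 ≈ 16.7%.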

4. Methodological Advances: Regularization, Prompting, and Test-Time Adaptation

Several approaches have attempted to overcome these failures by introducing explicit mechanisms for order/relational sensitivity, compositional reasoning, or attention-level alignment.

Cross-modal Attention Congruence Regularization (CACR) (Pandey et al., 2022): CACR introduces a regularizer that enforces congruence between directed intra-language and intra-visual attention matrices, mapped via cross-modal attention matrices. For instance, attention from "mug" to "grass" in the caption must correspond to attention from the mug region to the grass region in the image. The regularizer minimizes the symmetric matrix-KL divergence between projected attention distributions. CACR increases UNITER group score from 8.5% to 14.25% and improves other order-sensitive swaps.
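
The idea of attention congruence can be illustrated with a small pure-Python sketch. It is one plausible reading of the description above, not the authors' implementation: the projection order (`lang_to_vis @ vis_attn @ vis_to_lang`), the row renormalization, the smoothing constant, and all function names are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_rows(m: Matrix, eps: float = 1e-9) -> Matrix:
    """Smooth and renormalize each row so it is a strictly positive distribution."""
    return [[(x + eps) / (sum(row) + eps * len(row)) for x in row] for row in m]


def mean_row_kl(p: Matrix, q: Matrix) -> float:
    """Average over rows of KL(p_row || q_row)."""
    total = sum(pi * math.log(pi / qi) for pr, qr in zip(p, q) for pi, qi in zip(pr, qr))
    return total / len(p)


def congruence_loss(lang_attn: Matrix, vis_attn: Matrix,
                    lang_to_vis: Matrix, vis_to_lang: Matrix) -> float:
    """Symmetric KL between language attention and visual attention
    projected into the language token space (assumed projection)."""
    projected = normalize_rows(matmul(matmul(lang_to_vis, vis_attn), vis_to_lang))
    lang = normalize_rows(lang_attn)
    return mean_row_kl(lang, projected) + mean_row_kl(projected, lang)


# Toy case: two tokens ("mug", "grass") aligned one-to-one with two regions.
identity = [[1.0, 0.0], [0.0, 1.0]]
lang_attn = [[0.1, 0.9], [0.5, 0.5]]      # "mug" attends mostly to "grass"
congruent_vis = [[0.1, 0.9], [0.5, 0.5]]  # mug region attends mostly to grass region
flipped_vis = [[0.9, 0.1], [0.5, 0.5]]    # mug region attends mostly to itself

loss_congruent = congruence_loss(lang_attn, congruent_vis, identity, identity)
loss_flipped = congruence_loss(lang_attn, flipped_vis, identity, identity)
print(loss_congruent, loss_flipped)
```

With identity cross-modal alignment, the loss vanishes when the visual attention mirrors the language attention and becomes positive when the directed attention pattern differs.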

Prompt-based Generative Reasoning (Ossowski et al., 2024): The KeyComp pipeline decomposes the Winoground task into (1) identifying key terms, (2) generating keyword-guided scene descriptions for images using a vision-language model (VLM), and (3) chain-of-thought prompting of an LLM to reason over the structured description and candidate captions. This method, though bottlenecked by VLM description quality, outperforms CLIP and CACR on group score, peaking at 18.2% when optimal descriptions are selected. The approach sidesteps the single-vector bottleneck and injects multi-step inference, compensating for the loss of relational information in classical embedding models.
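
The three-stage structure can be sketched as a function that takes each stage as a callable. Everything here is hypothetical scaffolding: the function names, the prompt wording, and the "Answer: A/B" reply convention are assumptions for illustration and are not taken from the KeyComp paper or any real model API.

```python
from typing import Any, Callable, List, Sequence


def parse_choice(reply: str) -> int:
    """Read the text after the last 'Answer:'; anything not starting with 'A' maps to 1."""
    answer = reply.rsplit("Answer:", 1)[-1].strip().upper()
    return 0 if answer.startswith("A") else 1


def choose_caption(
    image: Any,
    captions: Sequence[str],
    extract_keywords: Callable[[Sequence[str]], List[str]],  # stage 1 (hypothetical)
    describe_image: Callable[[Any, List[str]], str],         # stage 2: VLM call (hypothetical)
    ask_llm: Callable[[str], str],                           # stage 3: LLM call (hypothetical)
) -> int:
    """Return the index (0 or 1) of the caption chosen for the image."""
    keywords = extract_keywords(captions)
    description = describe_image(image, keywords)
    prompt = (
        f"Image description: {description}\n"
        f"Caption A: {captions[0]}\n"
        f"Caption B: {captions[1]}\n"
        "Think step by step about which caption matches the description, "
        "then finish with 'Answer: A' or 'Answer: B'."
    )
    return parse_choice(ask_llm(prompt))


# Stub stages standing in for real models, so the sketch runs end to end.
def stub_keywords(captions):
    return ["mug", "grass"]


def stub_describe(image, keywords):
    return "A ceramic mug is sitting on a lawn, surrounded by grass."


def stub_llm(prompt):
    return "The mug is located within the grass. Answer: A"


choice = choose_caption("image-0", ["a mug in grass", "grass in a mug"],
                        stub_keywords, stub_describe, stub_llm)
print(choice)  # 0
```

Passing the stages in as callables keeps the sketch independent of any particular VLM or LLM and makes the description-quality bottleneck explicit: the final choice can only be as good as the text returned by `describe_image`.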

Test-Time Matching (TTM) and Group Matching Metrics (Zhu et al., 9 Oct 2025): Exploiting the group structure through a group matching score (selecting the bijection that maximizes total score) reveals compositional ability that strict metrics obscure. TTM further self-trains the model on its confident group matches at test time, boosting SigLIP-B16 from a raw 10% to 67% (simple match), and to 72.5% (TTM) under the strict group score. This systematic pseudo-labeling exposes latent order/relational sensitivity and brings MLLMs like GPT-4.1 to 91.4%, above estimated human performance, for the first time on this dataset.
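
The pseudo-label selection step can be sketched as follows. This is a simplified illustration: the margin definition (absolute gap between the two bijection totals), the fixed threshold, and the function name are assumptions, and the fine-tuning step that would consume the selected labels is omitted. The cited method may use a different confidence criterion or threshold schedule.

```python
from typing import Dict, List, Tuple

# s[(c, i)] stands for s(C_c, I_i); this layout is an assumption.
Scores = Dict[Tuple[int, int], float]


def select_pseudo_labels(groups: List[Scores], threshold: float) -> List[Tuple[int, str]]:
    """Keep groups whose best bijection beats the other by at least `threshold`.

    Returns (group index, chosen bijection) pairs usable as pseudo-labels.
    """
    selected = []
    for idx, s in enumerate(groups):
        identity_total = s[(0, 0)] + s[(1, 1)]
        swap_total = s[(0, 1)] + s[(1, 0)]
        margin = abs(identity_total - swap_total)
        if margin >= threshold:
            label = "identity" if identity_total > swap_total else "swap"
            selected.append((idx, label))
    return selected


# Toy similarity values (invented for illustration only).
toy_groups = [
    {(0, 0): 0.5, (1, 1): 0.9, (0, 1): 0.6, (1, 0): 0.2},   # confident: identity
    {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.45, (1, 0): 0.5},  # low margin: skipped
    {(0, 0): 0.1, (1, 1): 0.2, (0, 1): 0.7, (1, 0): 0.6},   # confident: swap
]
pseudo_labels = select_pseudo_labels(toy_groups, threshold=0.3)
print(pseudo_labels)  # [(0, 'identity'), (2, 'swap')]
```

In a full test-time loop, the model would be updated on the selected matches and the selection repeated. Note that a confident match can still be wrong, as in the third toy group, so the quality of the pseudo-labels depends on the model's initial matching accuracy.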

Causality-aware Adaptive Intervention (LINA) (Yu et al., 15 Dec 2025): In compositional image generation, LINA improves Winoground performance for text-to-image diffusion models (DMs) by (a) amplifying relational tokens in the prompt embedding, (b) injecting contrastive guidance at the latent level relative to a "neutralized" prompt, and (c) re-allocating denoising steps to emphasize early structure formation. On the 171-item "NoTag" core, LINA reaches a 79.5% success rate (SD-3.5 backbone), surpassing all open-source diffusion baselines.

5. Error Analysis and Dataset-specific Bottlenecks

Analyses find that Winoground's difficulty arises from both linguistic and visual axes (Diwan et al., 2022, Thrush et al., 2022):

Bag-of-words/Order Neglect: Models typically do not distinguish between "mug in grass" and "grass in mug," reflecting an inability to encode directed, order-sensitive dependencies.
Spatial and Relational Reasoning: Prepositions and argument swaps confound both language and vision branches, with model attention barely shifting under critical swaps.
Visual Attribute-Role Alignment: Failure to bind fine-grained visual details (e.g., small, out-of-focus objects) to textual roles is evident in both qualitative outputs and separability analyses.
Tag-wise Failures: Models approach chance on items involving pragmatic or world knowledge, or where image quality or visual detail falls outside pretraining distribution.
Layerwise Probing: In the text branch, embeddings of the two caption variants become separable in the middle layers, but this separability is lost at the cross-attention bottleneck, indicating a fusion failure.
Quantitative Correlations: Longer captions degrade accuracy, and larger-scale image pretraining only marginally closes the gap except in outlier cases.

This suggests that classical embedding techniques lack the mechanisms necessary for true multimodal symbolic composition and cross-modal alignment at relational/structural levels.

6. Implications, Benchmark Use, and Future Directions

Winoground functions as a high-precision diagnostic for compositional generalization in visio-linguistic models. Empirical and probe studies across multiple architectures converge on several implications:

Benchmark Partitioning: Subtasks orthogonal to compositionality (e.g., visual difficulty, pragmatic inference) must be separately reported to avoid confounding progress on pure compositionality.
Evaluation Metrics: Strict aggregation (group score) masks partial capacities; matching-based and best-bijection metrics reveal hidden ability and are recommended for future evaluations (Zhu et al., 9 Oct 2025).
Model Improvements: Order-sensitive fusion, explicit attention-level alignment, chain-of-thought multimodal reasoning, and test-time self-training substantially improve scores on compositional benchmarks, but fully closing the gap remains elusive.
Dataset Development: Larger, multilingual, and controlled-compositional datasets, as well as pretraining objectives penalizing bag-of-words solutions and promoting explicit relational encoding, are recommended (Thrush et al., 2022).
Compositional Generation: For diffusion models, adaptive prompt/latent intervention and causality emphasis substantially close the compositionality gap with minimal augmentation (Yu et al., 15 Dec 2025).

Winoground thus continues to structure research at the intersection of compositional linguistics, visual scene understanding, and multimodal representation learning, serving as a transparent stress-test for the next generation of vision–language architectures.