SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
Abstract: In the last year alone, a surge of new benchmarks to measure compositional understanding of vision-language models has permeated the machine learning ecosystem. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. Surprisingly, we find significant biases in all these benchmarks rendering them hackable. This hackability is so dire that blind models with no access to the image outperform state-of-the-art vision-language models. To remedy this rampant vulnerability, we introduce SugarCrepe, a new benchmark for vision-language compositionality evaluation. We employ LLMs, instead of the rule-based templates used in previous benchmarks, to generate fluent and sensical hard negatives, and utilize an adversarial refinement mechanism to maximally reduce biases. We re-evaluate state-of-the-art models and recently proposed compositionality-inducing strategies, and find that their improvements were hugely overestimated, suggesting that more innovation is needed in this important direction. We release SugarCrepe and the code for evaluation at: https://github.com/RAIVNLab/sugar-crepe.
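The evaluation protocol described in the abstract, picking an image's true caption over a compositional hard negative, can be sketched as below. This is a toy illustration, not the paper's implementation: `score` here is a bag-of-words overlap stand-in for a real VLM similarity function (e.g., CLIP's image-text cosine similarity), and the image is represented by a hypothetical tag set. Note how the first pair, a word-order swap, produces a tie, which is exactly why bag-of-words-style scorers struggle on such benchmarks.

```python
# Minimal sketch of 2-choice image-to-text evaluation (assumed setup).
# `score` is a toy word-overlap proxy for a VLM similarity function;
# `image_tags` is a hypothetical stand-in for image content.

def score(image_tags, caption):
    # Toy proxy: count caption words present in the image's tag set.
    return sum(w in image_tags for w in caption.lower().split())

def accuracy(examples):
    # Each example: (image_tags, positive_caption, hard_negative_caption).
    # A model is correct only when the positive strictly outscores the negative.
    correct = sum(
        score(tags, pos) > score(tags, neg) for tags, pos, neg in examples
    )
    return correct / len(examples)

examples = [
    # Swap-style negative: same words, different order -> tie for this scorer.
    ({"dog", "chasing", "cat"}, "a dog chasing a cat", "a cat chasing a dog"),
    # Replace-style negative: one attribute changed -> overlap scorer wins.
    ({"red", "car", "street"}, "a red car on the street", "a blue car on the street"),
]
print(accuracy(examples))
```

A real evaluation would swap in an actual VLM's similarity score; the accuracy metric (fraction of pairs where the positive strictly outscores the negative) is the same shape.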
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Easily accessible text-to-image generation amplifies demographic stereotypes at large scale. arXiv preprint arXiv:2211.03759, 2022.
- Léon Bottou. From machine learning to machine reasoning. Machine learning, 94(2):133–149, 2014.
- Going beyond nouns with vision & language models using synthetic data, 2023.
- PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers. arXiv preprint arXiv:2202.04053, 2022.
- Some controversial questions in phonological theory. Journal of linguistics, 1(2):97–138, 1965.
- MJ Cresswell. Logics and languages. 1973.
- ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- Why is winoground hard? investigating failures in visuolinguistic compositionality. arXiv preprint arXiv:2211.00768, 2022.
- Teaching structured vision & language concepts to vision & language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2657–2668, 2023.
- Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
- AGQA: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, 2018.
- Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795, 2020.
- OpenCLIP, July 2021.
- Compositionality. In Handbook of logic and language, pages 417–473. Elsevier, 1997.
- Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10236–10247, 2020.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73, 2017.
- Adversarial filters of dataset biases. In International Conference on Machine Learning, pages 1078–1088. PMLR, 2020.
- BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
- BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 12888–12900. PMLR, 2022.
- Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Vera: A general-purpose plausibility estimation model for commonsense statements. arXiv preprint arXiv:2305.03695, 2023.
- Visual relationship detection with language priors. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 852–869. Springer, 2016.
- CREPE: Can vision-language foundation models reason compositionally? arXiv preprint arXiv:2212.07796, 2022.
- TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119–126, 2020.
- OpenAI. ChatGPT, 2022.
- Data cards: Purposeful and transparent dataset documentation for responsible ai. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1776–1826, 2022.
- Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- Cola: How to adapt vision-language models to compose objects localized with attributes?, 2023.
- Fighting bias with bias: Promoting model robustness by amplifying dataset biases. arXiv preprint arXiv:2305.18917, 2023.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
- Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650, 2022.
- Coarse-to-fine contrastive learning in image-text-graph space for improved vision-language compositionality. 2023.
- Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.
- Image captioners are scalable vision learners too. arXiv preprint arXiv:2306.07915, 2023.
- OmniVL: One foundation model for image-language and video-language tasks. arXiv preprint arXiv:2209.07526, 2022.
- Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.
- When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations, 2023.
- Swag: A large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326, 2018.
- Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022.
- VL-CheckList: Evaluating pre-trained vision-language models with objects, attributes and relations. arXiv preprint arXiv:2207.00221, 2022.
Knowledge Gaps, Limitations, and Open Questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper:
- Dataset scope: SugarCrepe is built exclusively on COCO images/captions, limiting coverage to everyday objects and common scenes; it is unclear how results transfer to other domains (e.g., medical, satellite, indoor/egocentric video frames) and datasets (e.g., OpenImages, Flickr30k, Visual Genome).
- Language coverage: All negatives are generated in English; there is no evaluation of multilingual compositionality or cross-lingual robustness.
- Composition types not covered: The benchmark omits several important compositional phenomena—e.g., Swap-Rel, Add-Rel, negation without keyword artifacts, quantifiers, counting, coreference, comparatives/superlatives, temporal relations, logical operators—leaving untested large swaths of compositional reasoning.
- Reliance on LLM-generated negatives: The hard negatives are dependent on ChatGPT prompts/demonstrations; sensitivity to prompt design, LLM version drift, decoding settings, and reproducibility across API versions is not quantified.
- Residual style artifacts: While adversarial refinement balances two proxy scores, the paper does not analyze residual text-only cues (e.g., sentence length, n-gram frequency, punctuation patterns, POS distributions, LM perplexity, stylistic markers) that could still differentiate COCO (human) positives from LLM-generated negatives.
- Limited bias proxies: Debiasing and vulnerability checks rely on two text-only models (Vera for plausibility and a grammar checker); potential exploitability by stronger blind attackers (e.g., NLI entailment, large LMs, perplexity/fluency ensembles, style classifiers, learned linear probes over bag-of-words) is not tested.
- Potential circularity in debiasing: The adversarial refinement explicitly symmetrizes against Vera/grammar scores; this may overfit to these proxies while leaving other latent artifacts intact—no evaluation against a diverse suite of unseen blind baselines is provided.
- Selection bias from adversarial refinement: Subsampling to enforce symmetric score gaps may skew the distribution of caption types, lengths, or concepts; the paper does not report how refinement changes content distributions or difficulty.
- Human validation details: The false-negative filtering (Stage 2) lacks transparent protocol details—annotator numbers, instructions, inter-annotator agreement, resolution policy for disagreements, and quality control—hindering reproducibility and reliability assessment.
- Positive-caption artifacts: Positives come from COCO; the paper does not examine whether COCO captions contain their own stylistic or semantic artifacts that could interact with LLM-negatives even after refinement.
- Train–test contamination risks: NegCLIP is trained/finetuned on COCO while SugarCrepe is built from COCO; the paper does not detail train/test splits, image overlap controls, or de-duplication procedures to preclude leakage and overfitting.
- Evaluation protocol narrowness: The benchmark is framed as 2-choice image-to-text retrieval; it does not assess ranking over large candidate pools, text-to-image retrieval, caption generation, or localized grounding (region-level compositionality).
- Generalization to modern VLMs: Evaluations focus on CLIP variants; compositionality of contemporary multimodal LLMs (e.g., BLIP-2, Flamingo, PaLI, LLaVA-style models) remains unassessed on SugarCrepe.
- Construct validity: Model performance on SugarCrepe correlates with ImageNet zero-shot accuracy; the paper does not disentangle whether SugarCrepe primarily measures general recognition ability vs true compositional reasoning.
- Difficulty calibration: Items are not labeled for difficulty (e.g., number/type of atoms changed, lexical overlap, ambiguity); there is no item-response analysis to ensure a well-calibrated difficulty spectrum.
- Coverage of relations and attributes: Replace-Rel performance is reported, but there is no fine-grained breakdown of relation types (spatial vs action vs prepositional vs verb-argument) or attribute categories (color, material, size, state), limiting diagnostic insight.
- Robustness to paraphrase: The benchmark fixes COCO positives without systematic paraphrastic variants; it is unclear whether performance persists under paraphrased positives or stylistically harmonized positives/negatives.
- Scalability and maintenance: LLM generation plus human validation is costly; the paper does not propose a scalable protocol for continual expansion, periodic refresh against new blind attacks, or automated quality control for future releases.
- Fairness and demographic bias: No analysis of demographic content (e.g., gender, age, race mentions), representational harms, or subgroup performance is provided, despite people appearing in COCO scenes.
- Residual hackability audits: Beyond Vera/grammar, the paper does not conduct red-teaming with composite blind heuristics, adversarial training of text-only classifiers, or diagnostic probes (e.g., length/lexical ablations) to empirically bound residual hackability.
- Negative-type completeness: The rationale for excluding Shuffle and Negate is sound for artifact concerns, but the paper does not propose alternative “clean” constructions for negation or word-order sensitivity, leaving these capabilities unevaluated.
- External validity to downstream tasks: The paper does not test whether SugarCrepe scores predict compositional performance on downstream tasks (e.g., VQA compositional splits, referring expressions, structured scene reasoning), limiting claims about real-world utility.
- Reproducibility artifacts and licensing: Dependency on proprietary LLMs (ChatGPT) and potential API changes may hinder reproducibility; details on deterministic generation, seeds, and licensing constraints are not provided.
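Several of the gaps above (residual style artifacts, limited bias proxies, residual hackability audits) concern the same test: can a blind, text-only heuristic pick the human-written positive over the LLM-generated negative at better than chance? A minimal sketch of such an audit follows; the length-based feature is an illustrative assumption, not a claim about SugarCrepe's actual artifacts, and a real audit would sweep many features (perplexity, n-gram frequency, POS distributions, learned probes).

```python
# Hedged sketch of a blind (text-only) hackability audit. The feature
# (caption length in words) is one hypothetical stylistic cue; a blind
# accuracy far from 0.5 on real data would flag a residual artifact.

def blind_pick(pos, neg, feature=lambda c: len(c.split())):
    # Heuristic: prefer the caption with the smaller feature value
    # (e.g., shorter captions, if LLM negatives tend to be verbose).
    # Ties count as chance (0.5).
    fp, fn = feature(pos), feature(neg)
    if fp < fn:
        return 1.0
    if fp == fn:
        return 0.5
    return 0.0

def blind_accuracy(pairs):
    # pairs: (positive_caption, negative_caption). Chance level is 0.5;
    # large deviations in either direction indicate an exploitable cue.
    return sum(blind_pick(p, n) for p, n in pairs) / len(pairs)

# Toy pairs constructed so that length alone separates them.
pairs = [
    ("a dog on a couch", "a fluffy dog resting on a couch"),
    ("two kids playing soccer", "two kids playing a game of soccer"),
]
print(blind_accuracy(pairs))
```

On this contrived toy set the blind heuristic scores 1.0, i.e., it perfectly identifies the positive without seeing any image; running the same loop with a battery of features over a benchmark's actual caption pairs is one way to empirically bound the residual hackability the list above calls for.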