ARO Benchmark: Compositionality in VLMs
- The ARO Benchmark is a large-scale evaluation suite with 50,000+ test cases that assesses vision-language models’ compositional understanding across attribution, relation, and order.
- It employs multi-choice tasks derived from Visual Genome, COCO, and Flickr30k to isolate challenges like attribute binding, spatial relations, and word order sensitivity.
- Findings show that models primarily exploit bag-of-words shortcuts, while composition-aware hard negative mining significantly boosts performance metrics.
The Attribution, Relation, and Order (ARO) benchmark is a systematic evaluation suite introduced to probe the compositional understanding of vision-language models (VLMs). It tests whether these models correctly interpret how objects, attributes, and word order contribute to meaning, or whether they instead rely on “bag-of-words” matching that disregards compositional structure. ARO is orders of magnitude larger than previous compositionality benchmarks, encompassing more than 50,000 test cases. The benchmark isolates and tests three key aspects of semantic representation: attribution (binding properties to objects), relation (reasoning over object-object relations), and order (sensitivity to word sequence in captions). The analysis built on ARO reveals significant deficiencies in standard VLMs, and the authors propose a principled remedy for these gaps (Yuksekgonul et al., 2022).
1. Structure and Components of the ARO Benchmark
ARO comprises three multi-choice sub-benchmarks, each targeting a distinct facet of compositionality:
- Visual Genome Relation: Requires models to distinguish between “the X is relation Y” (correct) and “the Y is relation X” (foil) given an image crop with two objects. It evaluates spatial and verb relations (e.g., "to the left of," "watching"), where argument order flips semantic meaning.
- Visual Genome Attribution: Presents two objects, each with distinct attributes, and asks models to differentiate “the attribute₁ X and the attribute₂ Y” from the attribute-swapped “the attribute₂ X and the attribute₁ Y,” measuring attribute-object binding.
- COCO & Flickr30k Order: For each image, models are given five caption choices—one human-written and four word-order-perturbed versions (e.g., shuffled or permuted in various schemes). Models must select the semantically correct caption, probing robustness to word order perturbations.
Each sub-benchmark is designed to decouple compositional understanding from simple lexical matching, ensuring that high performance requires sensitivity to relations, attributions, and syntactic order.
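The caption pairs used by the Relation and Attribution sub-benchmarks can be illustrated with a short sketch. The function names and exact templates below are assumptions made for illustration, not the benchmark's released code; the point is only that the correct caption and its foil contain exactly the same words.

```python
from typing import Tuple


def relation_pair(subj: str, relation: str, obj: str) -> Tuple[str, str]:
    """Return (correct, foil) captions that differ only in argument order."""
    correct = f"the {subj} is {relation} the {obj}"
    foil = f"the {obj} is {relation} the {subj}"
    return correct, foil


def attribution_pair(attr1: str, obj1: str,
                     attr2: str, obj2: str) -> Tuple[str, str]:
    """Return (correct, foil) captions that differ only in attribute binding."""
    correct = f"the {attr1} {obj1} and the {attr2} {obj2}"
    foil = f"the {attr2} {obj1} and the {attr1} {obj2}"
    return correct, foil


def same_bag_of_words(a: str, b: str) -> bool:
    """True if two captions contain exactly the same multiset of words."""
    return sorted(a.split()) == sorted(b.split())


if __name__ == "__main__":
    correct, foil = relation_pair("horse", "eating", "grass")
    print(correct, "|", foil)
    # A pure word-matching model cannot separate the two captions.
    print(same_bag_of_words(correct, foil))
```

Because both captions share one bag of words, any scorer that ignores word order and binding assigns them identical scores and lands at the 50% chance level.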
2. Dataset Construction and Scale
ARO leverages large-scale real-world datasets filtered and curated for compositional testing:
- Visual Genome Relation: Uses the GQA-filtered subset of Visual Genome scene graphs and covers 48 relation types across 23,937 cases (~500 per relation). Test images are cropped to display both relevant objects and are paired with contrasting captions that differ only in argument order.
- Visual Genome Attribution: Also built from GQA-filtered Visual Genome, spans 117 attribute pairs (“small vs. brown”, “open vs. white”, etc.) and totals 28,748 test cases where candidate captions differ only by attribute assignment.
- COCO & Flickr30k Order: Drawn from the Karpathy split (5,000 COCO and 1,000 Flickr30k test images), each with one original and four systematically shuffled captions, forming 5-way multiple-choice tasks (chance = 20%).
The scale and breadth of ARO—over 50,000 cases across three axes of compositionality—distinguish it from prior benchmarks, supporting rigorous statistical analysis and covering diverse semantic phenomena.
| Sub-benchmark | # Cases | # Classes / Types |
|---|---|---|
| VG Relation | 23,937 | 48 relations |
| VG Attribution | 28,748 | 117 attribute pairs |
| COCO Order | 5,000 | 5 choices/image |
| Flickr30k Order | 1,000 | 5 choices/image |
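The 5-way Order task described above (one original caption plus four perturbed versions) can be sketched as follows. The four perturbation schemes here are illustrative stand-ins chosen to avoid a part-of-speech tagger; they are assumptions and should not be read as the exact schemes used to build the benchmark.

```python
import random
from typing import List


def shuffle_all_words(words: List[str], rng: random.Random) -> List[str]:
    out = list(words)
    rng.shuffle(out)
    return out


def shuffle_trigrams(words: List[str], rng: random.Random) -> List[str]:
    # Keep each 3-word chunk intact but reorder the chunks.
    chunks = [words[i:i + 3] for i in range(0, len(words), 3)]
    rng.shuffle(chunks)
    return [w for chunk in chunks for w in chunk]


def shuffle_within_trigrams(words: List[str], rng: random.Random) -> List[str]:
    # Keep chunk order but reorder the words inside each 3-word chunk.
    out: List[str] = []
    for i in range(0, len(words), 3):
        chunk = words[i:i + 3]
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


def reverse_words(words: List[str], rng: random.Random) -> List[str]:
    return words[::-1]


def order_choices(caption: str, seed: int = 0) -> List[str]:
    """Return [original, 4 perturbed captions]; index 0 is the correct choice."""
    rng = random.Random(seed)
    words = caption.split()
    schemes = [shuffle_all_words, shuffle_trigrams,
               shuffle_within_trigrams, reverse_words]
    # Note: a random shuffle can reproduce the original by chance; a real
    # pipeline would need to resample in that case.
    return [caption] + [" ".join(s(words, rng)) for s in schemes]


if __name__ == "__main__":
    for choice in order_choices("a man rides a brown horse on the beach"):
        print(choice)
```

Every choice contains the same words, so only a model that is sensitive to word order can consistently pick index 0 above the 20% chance level.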
3. Evaluation Protocol and Metrics
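All ARO sub-benchmarks are cast as contrastive multiple-choice tasks. For Visual Genome Relation and Attribution, each case provides two options (chance level = 50%); for COCO/Flickr30k Order, five options (chance = 20%). Accuracy is measured as:
$$\text{Accuracy} = \frac{\text{number of cases in which the correct caption scores highest}}{\text{total number of cases}}$$
Additionally, standard retrieval settings on COCO/Flickr30k use Recall@k:
$$\text{Recall@}k = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[r_i \le k\right]$$
where $r_i$ is the position of the correct item in the retrieval list for query $i$ and $N$ is the number of queries. The chosen format isolates compositional errors, minimizing confounds from out-of-distribution or long-tail effects.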
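Both metrics are simple to compute from model scores. The sketch below is a minimal pure-Python version; the function names, the convention that the correct caption sits at index 0, and the convention that the correct candidate for query i is candidate i are all assumptions made for illustration.

```python
from typing import Sequence


def multiple_choice_accuracy(scores: Sequence[Sequence[float]],
                             correct_idx: int = 0) -> float:
    """scores[n][c] is the image-caption score of choice c in case n.

    A case counts as correct when the correct choice has the highest score.
    """
    hits = 0
    for row in scores:
        predicted = max(range(len(row)), key=lambda c: row[c])
        hits += int(predicted == correct_idx)
    return hits / len(scores)


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """similarity[i][j] scores query i against candidate j.

    The rank of the correct item is 1 plus the number of candidates that
    score strictly higher than it.
    """
    hits = 0
    for i, row in enumerate(similarity):
        rank = 1 + sum(1 for s in row if s > row[i])
        hits += int(rank <= k)
    return hits / len(similarity)


if __name__ == "__main__":
    two_way = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
    print(multiple_choice_accuracy(two_way))
    sim = [[0.9, 0.1, 0.2], [0.8, 0.5, 0.1], [0.3, 0.4, 0.2]]
    print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```

A model that scores both captions of a two-way case identically is resolved here by `max` picking the first index, so tie handling should be chosen deliberately in a real evaluation.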
4. Model Performance and Failure Analysis
Four off-the-shelf VLMs were evaluated on ARO: CLIP (ViT-B/32), BLIP (base, finetuned), FLAVA (flava-full), and X-VLM (base, finetuned). Results highlight pervasive compositional deficiencies:
| Sub-benchmark | CLIP | BLIP | X-VLM | FLAVA |
|---|---|---|---|---|
| VG Relation (chance 50%) | 59% | 59% | 73% | 24% |
| VG Attribution (chance 50%) | 62% | 88% | 87% | 73% |
| COCO Order (chance 20%) | 46% | 32% | 36% | 4% |
| Flickr30k Order (chance 20%) | 60% | 37% | 47% | 13% |
Key failure modes include near-chance relational reasoning for many prepositions and verbs, failures to bind attributes to the correct object, and an almost complete insensitivity to word order (models often fail to prefer unshuffled captions). These patterns are consistent across architectures and pretraining regimens.
The authors further demonstrate that retrieval performance on standard datasets remains high even when captions or image patches are randomly shuffled. For instance, BLIP’s Recall@1 on COCO drops only marginally under completely shuffled captions, directly implicating “bag-of-words” shortcuts rather than true compositionality.
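Why shuffling leaves retrieval largely intact is easy to see with a toy scorer. The sketch below is not any of the evaluated models: it is a hypothetical bag-of-words matcher that represents an image as a list of tags and scores a caption by cosine similarity of word counts. Any scorer of this kind is exactly invariant to caption word order.

```python
import math
import random
from collections import Counter
from typing import Sequence


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bow_score(image_tags: Sequence[str], caption: str) -> float:
    """Toy stand-in for an image-text score that ignores word order."""
    return cosine(Counter(image_tags), Counter(caption.lower().split()))


if __name__ == "__main__":
    tags = ["horse", "grass", "eating", "field"]
    caption = "the horse is eating the grass"
    words = caption.split()
    random.Random(0).shuffle(words)
    shuffled = " ".join(words)
    # The two scores are identical because the word counts are identical.
    print(bow_score(tags, caption), bow_score(tags, shuffled))
    # An unrelated caption still scores lower, so retrieval keeps working.
    print(bow_score(tags, "a plane on the runway"))
```

Such a scorer can still rank the right caption above unrelated ones, which is consistent with retrieval metrics staying high while the ARO tasks expose the missing sensitivity to order.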
5. Underlying Causes: Shortcut Learning and Contrastive Objectives
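A central finding is that current contrastive pretraining objectives and datasets allow models to succeed via shortcut learning, maximizing lexical overlap without representing compositional structure. The standard contrastive loss, written here in its common CLIP-style symmetric form,
$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}\right]$$
where $s_{ij}$ is the similarity between image $i$ and caption $j$ in a batch of $N$ pairs and $\tau$ is a temperature, can be minimized without attending to argument order or attribute binding. The other captions in a batch rarely share the same words in a different arrangement, so nothing in the loss forces the model to separate them. As a result, large-scale retrieval tasks fail to penalize or expose compositional deficiencies, which explains why these deficiencies persist across state-of-the-art models. This reveals a core limitation of both pretraining and prevailing evaluation methodologies (Yuksekgonul et al., 2022).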
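A minimal sketch of a CLIP-style symmetric contrastive loss makes the argument concrete. It assumes the loss depends on the batch only through an N x N image-caption similarity matrix with matched pairs on the diagonal; the function names and the temperature default are illustrative assumptions.

```python
import math
from typing import Sequence


def neg_log_softmax(row: Sequence[float], idx: int) -> float:
    """Cross-entropy of a row of logits against the target index idx."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return log_z - row[idx]


def symmetric_contrastive_loss(sim: Sequence[Sequence[float]],
                               temperature: float = 1.0) -> float:
    """sim[i][j]: similarity of image i and caption j; pairs lie on the diagonal."""
    n = len(sim)
    logits = [[v / temperature for v in row] for row in sim]
    # Image-to-text term: each image must pick its own caption.
    i2t = sum(neg_log_softmax(logits[i], i) for i in range(n)) / n
    # Text-to-image term: each caption must pick its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = sum(neg_log_softmax(columns[j], j) for j in range(n)) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    uninformative = [[0.0, 0.0], [0.0, 0.0]]
    well_separated = [[10.0, 0.0], [0.0, 10.0]]
    print(symmetric_contrastive_loss(uninformative))
    print(symmetric_contrastive_loss(well_separated))
```

The loss sees only the similarity matrix. If a text encoder maps a caption and its reordered version to the same embedding, the matrix and therefore the loss are unchanged, so the objective gives no signal about word order unless reordered captions appear as negatives.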
6. Addressing Compositional Deficits: Composition-Aware Hard Negative Mining
To directly counter shortcut learning, composition-aware hard negative mining is introduced. In this regime, training batches include two additional types of negatives:
- Caption Negatives: Synthetic captions created by swapping nouns, adjectives, or verb phrases in existing captions, directly manipulating compositional elements.
- Visual Negatives: Image samples drawn from the top-K nearest neighbors in the encoder feature space (e.g., CLIP-space), introducing visually plausible distractors.
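The two kinds of negatives listed above can be sketched in a few lines. This is an illustration under stated assumptions: the caption swap takes a caller-supplied set of swappable words instead of running a part-of-speech tagger, and image features are plain lists of floats instead of real encoder outputs. The function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Set


def swap_caption_negative(caption: str, swappable: Set[str],
                          rng: random.Random) -> Optional[str]:
    """Swap two different swappable words; return None if that is impossible."""
    words = caption.split()
    positions = [i for i, w in enumerate(words) if w in swappable]
    # Two distinct words are required, otherwise the swap changes nothing.
    if len({words[i] for i in positions}) < 2:
        return None
    i, j = rng.sample(positions, 2)
    while words[i] == words[j]:
        i, j = rng.sample(positions, 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_visual_negatives(features: Sequence[Sequence[float]],
                           query_idx: int, k: int) -> List[int]:
    """Indices of the k images most similar to the query, excluding itself."""
    scored = [(cosine(features[query_idx], f), j)
              for j, f in enumerate(features) if j != query_idx]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [j for _, j in scored[:k]]


if __name__ == "__main__":
    rng = random.Random(0)
    print(swap_caption_negative("the horse is eating the grass",
                                {"horse", "grass"}, rng))
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(top_k_visual_negatives(feats, query_idx=0, k=1))
```

A swapped caption keeps the original vocabulary, so it is a negative that word matching alone cannot reject; a nearest-neighbor image plays the same role on the visual side.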
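The modified contrastive objective adds the hard negative captions to the image-to-text softmax, while visual negatives enter the batch as additional images. Assuming a CLIP-style symmetric loss, one form consistent with this description is:
$$\mathcal{L}_{\text{neg}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)+\sum_{j=1}^{N}\exp(\tilde{s}_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}\right]$$
where $s_{ij}$ is the similarity between image $i$ and caption $j$, $\tilde{s}_{ij}$ is the similarity between image $i$ and the hard negative caption generated from caption $j$, $N$ is the batch size, and $\tau$ is a temperature.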
Finetuning CLIP (ViT-B/32) on COCO using these hard negatives (“NegCLIP”) yields substantial gains on all ARO metrics with negligible loss (<2%) on downstream retrieval and zero-shot classification. Improvements include:
- VG Relation: 59% → 81%
- VG Attribution: 62% → 71%
- COCO Order: 46% → 86%
- Flickr30k Order: 60% → 91%
These results underscore that exposing models to composition-confounding negatives during contrastive learning is both necessary and effective for fostering genuine compositional representations.
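How caption negatives change the training signal can be sketched with a small loss function. This is a sketch under an explicit assumption, not the authors' released implementation: hard negative captions are appended to the image-to-text softmax only, and the text-to-image term is left as in a standard symmetric contrastive loss. Names and the temperature default are illustrative.

```python
import math
from typing import Sequence


def neg_log_softmax(row: Sequence[float], idx: int) -> float:
    """Cross-entropy of a row of logits against the target index idx."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return log_z - row[idx]


def hard_negative_loss(sim: Sequence[Sequence[float]],
                       sim_neg: Sequence[Sequence[float]],
                       temperature: float = 1.0) -> float:
    """sim[i][j]: image i vs. caption j; sim_neg[i][j]: image i vs. the
    hard negative caption built from caption j. Both are N x N."""
    n = len(sim)
    i2t = 0.0
    for i in range(n):
        # Each image chooses among N true captions and N hard negatives.
        row = [v / temperature for v in sim[i]]
        row += [v / temperature for v in sim_neg[i]]
        i2t += neg_log_softmax(row, i)
    t2i = 0.0
    for j in range(n):
        # Each true caption chooses among the N images in the batch.
        column = [sim[i][j] / temperature for i in range(n)]
        t2i += neg_log_softmax(column, j)
    return 0.5 * (i2t + t2i) / n


if __name__ == "__main__":
    sim = [[5.0, 0.0], [0.0, 5.0]]
    fooled = [[5.0, 0.0], [0.0, 5.0]]        # negatives score like positives
    separated = [[-5.0, -5.0], [-5.0, -5.0]]  # negatives are pushed away
    print(hard_negative_loss(sim, fooled))
    print(hard_negative_loss(sim, separated))
```

When a model scores a word-swapped caption as highly as the true one, the first loss stays large even though the ordinary in-batch pairs are well separated, which is the pressure that plain contrastive training lacks.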
7. Implications, Recommendations, and Best Practices
ARO’s design and associated empirical analysis have several implications for VLM training and evaluation:
- Pretraining objectives should systematically include composition-aware negatives, preventing reliance on word co-occurrence and lexical shortcuts.
- Evaluation suites must adopt compositional tests like ARO to avoid overestimating grounding and cross-modal alignment.
- Future datasets should be constructed with multiple captions or images sharing lexical content but differing in argument order, attributes, or relations, thereby disabling bag-of-words strategies.
A plausible implication is that, without such interventions, vision-language models will continue to improve on retrieval metrics without gaining genuine semantic or compositional understanding, which is a fundamental gap for applications involving nuanced reasoning. Incorporating composition-aware hard negative mining presents a principled and scalable pathway for addressing this challenge and aligning model behavior more closely with human-like understanding (Yuksekgonul et al., 2022).