Compositional Reasoning Benchmarks

Updated 25 October 2025
  • Compositional reasoning benchmarks are curated datasets and evaluation protocols designed to assess AI's ability to integrate multi-step inference across varied modalities.
  • They employ systematic generalization tests and diagnostic metrics, such as instance-level scores and group matching, to pinpoint model limitations.
  • Innovative methodologies like programmatic generation, bias mitigation, and task decomposition drive advancements in robust, interpretable AI performance.

Compositional reasoning benchmarks are curated datasets and evaluation protocols specifically designed to assess the ability of artificial intelligence systems—particularly LLMs, vision-LLMs (VLMs), multimodal LLMs (MLLMs), and formal symbolic reasoning agents—to solve tasks requiring the structured integration of multiple concepts, relations, or reasoning steps. These benchmarks are characterized by their emphasis on systematic generalization, multi-step inference, and explicit error analyses aimed at diagnosing the limitations and inductive biases of current architectures. They encompass a wide array of modalities (text, vision, audio, code, math, robotics) and span both synthetic and real-world domains.

1. Principles and Taxonomy of Compositional Reasoning Benchmarks

Compositional reasoning—the ability to understand and generate complex concepts as systematic compositions of simpler elements—is foundational to robust generalization. Benchmarks in this field are constructed to test distinct axes of compositionality:

  • Systematicity: Probing generalization to novel compositions not seen during training (e.g., new conjugations of known verbs, or unseen subject–predicate–object triples); a minimal split-construction sketch follows this list.
  • Multi-step Inference: Tasks require explicit chaining, e.g., multi-hop reasoning over knowledge graphs, repeated symbolic manipulation, or following a sequence of visual, linguistic, or logical relations.
  • Cross-modality Integration: Many modern benchmarks require compositional integration across vision, language, and action (e.g., visual scenes and textual queries, action sequences with linguistic or symbolic goals).
  • Diagnosis of Failure Modes: Advanced benchmarks systematically label and categorize errors at the sub-step, or reasoning primitive, level.
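
As a concrete illustration of the systematicity axis, the following minimal sketch constructs a compositional train/test split over labeled examples: specific attribute–object combinations are held out entirely for evaluation, while every individual attribute and object still appears during training. The field names and splitting heuristic are illustrative assumptions, not taken from any particular benchmark.

```python
import random
from itertools import product

def compositional_split(examples, holdout_frac=0.2, seed=0):
    """Split examples so that some (attribute, object) combinations are held
    out entirely for testing, while every individual attribute and object
    still appears in at least one training combination."""
    rng = random.Random(seed)
    combos = sorted({(ex["attribute"], ex["object"]) for ex in examples})
    rng.shuffle(combos)
    held_out, seen_attrs, seen_objs = set(), set(), set()
    target = int(holdout_frac * len(combos))
    for attr, obj in combos:
        # Hold a combination out only if both of its primitives are already
        # covered by some training combination; otherwise keep it for training.
        if len(held_out) < target and attr in seen_attrs and obj in seen_objs:
            held_out.add((attr, obj))
        else:
            seen_attrs.add(attr)
            seen_objs.add(obj)
    train = [ex for ex in examples if (ex["attribute"], ex["object"]) not in held_out]
    test = [ex for ex in examples if (ex["attribute"], ex["object"]) in held_out]
    return train, test

# Toy usage: one unseen pairing (e.g. "red cube") ends up test-only, even though
# "red" and "cube" each appear in training via other pairings.
data = [{"attribute": a, "object": o, "image_id": i}
        for i, (a, o) in enumerate(product(["red", "blue", "green"],
                                           ["cube", "sphere", "cylinder"]))]
train_set, test_set = compositional_split(data)
```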

A broad taxonomy, synthesized from recent surveys (Ke et al., 24 Aug 2025), includes:

Reasoning Type        | Example Domains                         | Benchmark Instances
Visual compositional  | Attribute binding, scene graph queries  | CLEVR, GQA, SugarCrepe, MathSticks
Relational reasoning  | Family, spatial, comparative, logic     | GAR, MCR, Ineq-Comp, AgentCoMa
Formal/mathematical   | Proof composition, code verification    | DafnyCOMP, Ineq-Comp
Audio compositional   | Event ordering, attribute binding       | CompA
Agentic/robotic       | Skill chaining, multi-modal planning    | ClevrSkills

2. Evolution of Benchmarks and Methodologies

The field has evolved from early toy datasets with synthetic structure (e.g., CLEVR's quantifiers, relations, and logic operators), through large-scale, real-world video (AGQA, ANetQA) and text (GAR, CryptoX), to agentic, mixed-modality, and functionally compositional environments (ClevrSkills, AgentCoMa, MathSticks).

Major design innovations include:

  • Programmatic Generation: Questions and tasks are constructed from underlying functional programs or scene graphs, allowing explicit control over the number and type of reasoning steps (Grunde-Mclaughlin et al., 2021, Yu et al., 2023); see the sketch after this list.
  • Bias Mitigation: Balancing procedures ensure that answer distributions (e.g., yes/no in binary questions) do not permit shortcut solutions based on linguistic or statistical priors rather than true compositional understanding (Grunde-Mclaughlin et al., 2022).
  • Decomposition Frameworks: Decomposing questions or tasks into directed acyclic graphs (DAGs) of sub-questions, enabling the isolation and evaluation of compositional subroutine performance (Gandhi et al., 2022).
  • Hybrid Synthetic–Real Data: Fine-grained, attribute-rich annotations and large-scale curation are used to push the limits of real-world applicability and complexity (Yu et al., 2023, Ji et al., 1 Oct 2025, Haresh et al., 13 Nov 2024).
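
To make the programmatic-generation idea concrete, the following is a minimal, hypothetical sketch in the spirit of CLEVR/AGQA-style pipelines: a question is paired with a small functional program executed against a scene graph, so the number and kind of reasoning steps are controlled by construction. The scene-graph schema and operator names are assumptions for illustration, not the actual formats used by those benchmarks.

```python
# Illustrative scene graph: objects with attributes, plus binary spatial relations.
SCENE = {
    "objects": [
        {"id": 0, "shape": "cube", "color": "red"},
        {"id": 1, "shape": "sphere", "color": "red"},
        {"id": 2, "shape": "cube", "color": "blue"},
    ],
    "relations": {("left_of", 0, 2), ("left_of", 1, 2)},  # (relation, source, target)
}

def filter_color(objects, color):
    return [o for o in objects if o["color"] == color]

def relate(scene, objects, relation):
    """Return the targets of relation triples whose source is in `objects`."""
    source_ids = {o["id"] for o in objects}
    target_ids = {t for (r, s, t) in scene["relations"] if r == relation and s in source_ids}
    return [o for o in scene["objects"] if o["id"] in target_ids]

def count(objects):
    return len(objects)

def run_program(scene, program):
    """Execute a list of (operator, argument) steps; each step consumes the
    previous step's output, which is what makes the question compositional."""
    state = scene["objects"]
    for op, arg in program:
        if op == "filter_color":
            state = filter_color(state, arg)
        elif op == "relate":
            state = relate(scene, state, arg)
        elif op == "count":
            state = count(state)
    return state

# "How many things are to the right of a red object?" rendered as a 3-step program.
program = [("filter_color", "red"), ("relate", "left_of"), ("count", None)]
answer = run_program(SCENE, program)  # -> 1 (only the blue cube is right of a red object)
```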

3. Performance Evaluation and Diagnostic Metrics

Compositional reasoning benchmarks commonly report not only aggregate accuracy but also diagnostic and process-oriented metrics. These include:

  • Instance-Level and Breakdown Scores: Performance stratified by question type, reasoning chain length, answer modality, or semantic category (e.g., attribute, relationship, action) (Grunde-Mclaughlin et al., 2021, Yu et al., 2023).
  • Group Matching and Assignment Metrics: In group-structured multimodal benchmarks, group matching scores (e.g., comparing global assignment similarity in Winoground or MMVP-VLM) reveal latent capabilities otherwise hidden by strict pairwise metrics (Zhu et al., 9 Oct 2025).
  • Consistency and Faithfulness Metrics: Compositional Accuracy (CA) measures the ability to answer parent questions when all of their sub-questions are answered correctly; Right-for-the-Wrong-Reasons (RWR) captures success despite intermediate errors; Internal Consistency (IC) measures logical coherence between hierarchical sub-answers (Gandhi et al., 2022); a minimal CA/RWR computation sketch follows this list.
  • Sample Efficiency and Generalization: Area Under the Curve (AUC) or Sample Efficiency Score (SES) summarize performance as a function of available training data, emphasizing few-shot or systematic generalization regimes (Zerroug et al., 2022).
  • Human–AI Performance Gaps: Direct head-to-head comparisons highlight substantial gaps between human and model sample efficiency and compositional robustness, even (and especially) on tasks deemed trivial by humans (Zerroug et al., 2022, Ji et al., 1 Oct 2025, Haresh et al., 13 Nov 2024).
  • Interpretable Circuit Analysis: Recent works employ attribution patching and neuron/attention head analyses to relate task performance to specific compositional “circuits” and wiring in large models (Ni et al., 17 Dec 2024, Shi et al., 8 Feb 2025).
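
For concreteness, the sketch below computes CA and RWR over a decomposition of parent questions into sub-questions, following the informal definitions above; IC is omitted because it requires benchmark-specific entailment rules between parent and sub-answers. The data structure and aggregation are assumptions of this sketch, not the reference implementation of Gandhi et al.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One question in a decomposition DAG, with its gold and predicted answer."""
    gold: str
    pred: str
    children: list = field(default_factory=list)  # sub-question Nodes

def compositional_metrics(parents):
    """Compute CA and RWR over parent questions, following the informal
    definitions above (data layout and aggregation are sketch assumptions)."""
    ca_hits = ca_total = rwr_hits = rwr_total = 0
    for parent in parents:
        parent_correct = parent.pred == parent.gold
        subs_correct = all(c.pred == c.gold for c in parent.children)
        if subs_correct:
            ca_total += 1          # eligible for Compositional Accuracy
            ca_hits += parent_correct
        else:
            rwr_total += 1         # eligible for Right-for-the-Wrong-Reasons
            rwr_hits += parent_correct
    return {
        "CA": ca_hits / ca_total if ca_total else float("nan"),
        "RWR": rwr_hits / rwr_total if rwr_total else float("nan"),
    }

# Toy usage: q1's sub-questions are all correct, q2's are not.
q1 = Node(gold="yes", pred="yes", children=[Node("red", "red"), Node("cube", "cube")])
q2 = Node(gold="no", pred="no", children=[Node("left", "right")])
print(compositional_metrics([q1, q2]))  # {'CA': 1.0, 'RWR': 1.0}
```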

4. Notable Benchmarks Across Modalities and Domains

Video and Vision-Language Benchmarks

  • AGQA/AGQA 2.0: Automatic question generation over spatio-temporal graphs, with splits for novel composition, indirect reference, and multi-step complexity; models typically achieve less than 50% of human performance, with bias controls limiting language-only heuristics (Grunde-Mclaughlin et al., 2021, Grunde-Mclaughlin et al., 2022).
  • ANetQA: Incorporates a fine-grained taxonomy, untrimmed real-world video, hierarchical attributes, and over an order of magnitude more QA pairs than AGQA. Models score ~44.5%, whereas humans approach 84.5% (Yu et al., 2023).
  • CVR: Evaluates odd-one-out visual rule compositions; convolutional nets outperform transformers, but all models are far less data-efficient than humans (Zerroug et al., 2022).
  • MathSticks: Matchstick puzzles for visual–symbolic compositional reasoning. The task requires perception, symbolic edit planning, and arithmetic verification; humans exceed 90%, while the best models reach only ~60% (Ji et al., 1 Oct 2025).
  • ConMe: Hard-negative CR benchmarking for VLMs using a model-to-model "conversation" pipeline; induces performance drops of up to 33% for SoTA models, restoring the CR challenge even at the frontier (Huang et al., 12 Jun 2024).
  • SCRAMBLe: Enhances MLLMs by synthetic, hard-negative, preference-labeled scene compositions; models fine-tuned with SCRAMBLe show 5–10% gains on group-structure benchmarks (Mishra et al., 7 Apr 2025).
  • Test-Time Matching (TTM): Iterative pseudo-label self-training that leverages group structure and matching margins; yields large gains (often >40% absolute), indicating that strict pairwise metrics have historically underestimated model capacity (Zhu et al., 9 Oct 2025). A minimal sketch of the underlying group-structured scoring follows this list.
  • READ-CLIP: Augments contrastive learning with auxiliary token reconstruction and paraphrase-alignment objectives; improves compositionality across standard benchmarks by up to 4.1% over conventional fine-tuning (Kwon et al., 18 Oct 2025).
  • CounterCurate: Uses counterfactual data (physical, via bounding-box and image flips; semantic, via GPT-4V and DALL-E 3) to improve spatial, counting, and attribute-based CR, yielding >30% gains on positional and semantic reasoning datasets (Zhang et al., 20 Feb 2024).
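
The group-structured evaluations that SCRAMBLe and TTM build on are typically scored with Winoground-style text, image, and group scores over 2x2 caption–image groups. The sketch below computes these scores given any caption–image similarity function; it is an illustration of the scoring scheme only, not TTM's full self-training procedure, and the function and field names are assumptions.

```python
def group_scores(sim, groups):
    """Winoground-style scoring over (caption0, caption1, image0, image1) groups.

    text score:  each image is matched with its own caption;
    image score: each caption is matched with its own image;
    group score: both hold simultaneously.
    `sim(caption, image)` is any model-provided similarity (a sketch assumption).
    """
    text = image = group = 0
    for c0, c1, i0, i1 in groups:
        t = sim(c0, i0) > sim(c1, i0) and sim(c1, i1) > sim(c0, i1)
        im = sim(c0, i0) > sim(c0, i1) and sim(c1, i1) > sim(c1, i0)
        text += t
        image += im
        group += t and im
    n = len(groups)
    return {"text": text / n, "image": image / n, "group": group / n}

# Toy usage: a hand-crafted similarity table standing in for a real VLM.
scores = {("cap0", "img0"): 0.9, ("cap1", "img0"): 0.2,
          ("cap0", "img1"): 0.3, ("cap1", "img1"): 0.8}
print(group_scores(lambda c, i: scores[(c, i)],
                   [("cap0", "cap1", "img0", "img1")]))  # all scores 1.0
```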

Audio and Multimodal

  • CompA: Benchmarks compositionality in audio–LLMs (ALMs) via paired audio–caption tasks probing event order and attribute binding, with composition-aware hard negatives; baseline ALMs perform near chance, but compositional-aware training yields marked improvements (Ghosh et al., 2023).
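
A minimal sketch of the kind of composition-aware hard negative such evaluations rely on: swapping the two events around a temporal connective so that the negative caption reuses the same atomic events in a different order. The connective list and string manipulation here are simplifying assumptions, not CompA's actual generation pipeline.

```python
def order_swapped_negative(caption):
    """Build a composition-aware hard negative by swapping the two events
    around a temporal connective, so the negative reuses the same events
    in the opposite order. The connective list is a simplifying assumption."""
    for connective in (" followed by ", " before ", " after ", " and then "):
        if connective in caption:
            left, right = caption.split(connective, 1)
            return right.strip().capitalize() + connective + left.strip().lower()
    return None  # no temporal structure to perturb

positive = "A dog barks followed by a car horn honking"
negative = order_swapped_negative(positive)
# -> "A car horn honking followed by a dog barks"
```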

Language and Formal Reasoning

  • GAR: Synthetic compositional relational reasoning in LLMs, with loops over key–value–query–answer chains; performance drops significantly in the presence of non-same or negated semantic relations, even for SoTA models (Ni et al., 17 Dec 2024).
  • CryptoX: Framework leveraging cryptographic encoding/transformation of instructions to force subproblem decomposition and integration; closed-source models outperform open-source LLMs in compositional AUC, highlighting generalization gaps (Shi et al., 8 Feb 2025).
  • AgentCoMa: Mixed-type composition (commonsense + math) in real-world agentic scenarios; models show a ~30% "compositionality gap," succeeding on individual sub-tasks but failing when required to combine reasoning types, a failure mode not observed in single-type benchmarks (Alazraki et al., 27 Aug 2025). A sketch of this gap metric follows the list.
  • DafnyCOMP: Compositional formal verification for code; model verification rates drop from >90% in single-function benchmarks to under 5% for multi-function compositions due to specification fragility, proof–implementation misalignment, and reasoning instability (Xu et al., 27 Sep 2025).
  • Ineq-Comp: Human-intuitive algebraic inequality transformations (variable duplication, algebraic rewriting) expose major generalization gaps in Lean-based proof assistants, even with in-context demonstrations; models exhibit severe brittleness compared to humans (Zhao et al., 19 May 2025).
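
As referenced in the AgentCoMa entry above, the compositionality gap can be summarized as the difference between how often a model solves all sub-steps in isolation and how often it solves the same steps posed as one composed task. The record format and aggregation below are illustrative assumptions, not AgentCoMa's exact scoring script.

```python
from statistics import mean

def compositionality_gap(records):
    """Each record describes one composite task: 'sub_correct' holds per-step
    correctness when the steps are posed in isolation, and 'composite_correct'
    is correctness on the same steps posed as one composed task."""
    solves_all_subs = [all(r["sub_correct"]) for r in records]
    composite = [r["composite_correct"] for r in records]
    sub_rate, comp_rate = mean(solves_all_subs), mean(composite)
    return {"sub_task_rate": sub_rate,
            "composite_rate": comp_rate,
            "gap": sub_rate - comp_rate}

# Toy usage: all sub-steps are solved in isolation for 9/10 items, but only
# 6/10 composed items are solved, giving a gap of about 0.3.
records = [{"sub_correct": [True, True], "composite_correct": i < 6} for i in range(10)]
records[9]["sub_correct"] = [True, False]
print(compositionality_gap(records))  # gap = 0.9 - 0.6 ≈ 0.3
```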

5. Open Challenges and Directions for Research

Current state-of-the-art models consistently demonstrate substantial, often abrupt, drops in performance on compositional reasoning tasks as complexity, modality integration, or step count increases. Persistent issues include:

  • Fragile compositional generalization, even after explicit stepwise demonstrations or fine-tuning, particularly when reasoning chains mix distinct types (e.g., math and commonsense in AgentCoMa (Alazraki et al., 27 Aug 2025)).
  • Overreliance on language priors, dataset bias, or surface-level cues, which balancing and debiasing schemes in benchmarks (e.g., AGQA 2.0, CounterCurate) attempt to mitigate.
  • Limited sample efficiency when compared to humans, with current architectures requiring orders of magnitude more data (e.g., in visual rule induction (Zerroug et al., 2022)).
  • A disconnect between informal, chain-of-thought style "explanations" and formal, verifiable composition in code or formal proof generation (DafnyCOMP, Ineq-Comp).
  • Insufficiently rich or realistic negative samples in earlier benchmarks, leading to overestimated model capabilities (an issue addressed by ConMe, CounterCurate, and MathSticks).

Emerging directions, synthesized from multiple surveys and benchmark proposals, include:

  • Architectures incorporating explicit program induction, symbolic interfaces, or neural modules specialized for subproblem decomposition and solution composition (as seen in attempts at neuro-symbolic integration for AGQA (Grunde-Mclaughlin et al., 2021), modular CR circuits in GAR (Ni et al., 17 Dec 2024)).
  • Training on large, curated pools of hard counterfactual tasks to drive robust reasoning (leveraging generative modeling pipelines as in CounterCurate and ConMe).
  • Integration of analysis tools for error taxonomy and attribution-informed circuit refinement, enabling interpretable diagnosis and guided improvements (Ni et al., 17 Dec 2024, Shi et al., 8 Feb 2025).
  • Compositional preference learning and test-time adaptation (SCRAMBLe, TTM) to further leverage synthetic structure and group-matching effects in data.

6. Benchmarks, Evaluation, and Research Infrastructure

The expanding landscape of compositional reasoning benchmarks now covers over 60 datasets across modalities and domains (see (Ke et al., 24 Aug 2025)). These not only provide evaluation standards but also serve as challenging training and ablation environments. Notably:

  • Data, code, and model checkpoints for MathSticks, CounterCurate, GAR, CryptoX, SCRAMBLe, and many other benchmarks are publicly available, facilitating systematic comparison and further extension.
  • Multiple benchmarks (e.g., AgentCoMa, DafnyCOMP, Ineq-Comp) include detailed interpretability or error analysis modules, now considered essential for advancing the field beyond raw performance statistics.

7. Impact and Outlook

Compositional reasoning benchmarks have exposed persistent, cross-cutting deficiencies in model generalization, stepwise reasoning, and multi-modal integration—despite rapid scaling of training data and model size. Progress in these domains is increasingly measured not by aggregate accuracy alone but by fine-grained, diagnostic evidence of systematic, interpretable reasoning beyond shallow heuristics or pattern matching.

The field is now converging on the principle that robust, generalizable AI requires compositional benchmarks of sufficient scale, diversity, and diagnostic depth, with associated tools for evaluation and interpretability. Future progress is anticipated to arise from both improved benchmark design—tied to richer, human-aligned task decomposition—and from architectures and training routines explicitly engineered to support these compositional, systematic abstractions.
