Visual Reasoning Types: Taxonomies & Benchmarks
- Visual reasoning types classify the ways in which systems extract, manipulate, and infer complex relationships from visual inputs along object-level, symbolic, temporal, and causal dimensions.
- They are defined through taxonomies, including relational, symbolic, perceptual, temporal, causal, and arithmetic reasoning, and are benchmarked with datasets like CLEVR and Visual Genome.
- Empirical findings highlight challenges such as perceptual grounding, systematic generalization, and the need for hybrid models integrating multiple reasoning paradigms.
Visual reasoning encompasses a spectrum of computational and cognitive processes by which systems extract, manipulate, and infer complex relationships, structures, and abstract rules from visual inputs. Unlike surface-level perception or object recognition, visual reasoning entails leveraging multiple levels of abstraction—including object-level, relational, symbolic, temporal, and causal dimensions—to answer queries, solve puzzles, ground answers in image evidence, or generate novel visual predictions. Recent research has formalized, benchmarked, and dissected visual reasoning types across multiple modalities, output targets, and application domains.
1. Core Taxonomies of Visual Reasoning Types
Research identifies several orthogonal and overlapping axes that stratify visual reasoning. Representative frameworks include:
- Five-Dimensional Taxonomy (abstract visual reasoning (AVR) perspective; Małkiński et al., 2022):
  - Input Shapes: e.g., geometric primitives vs. abstract forms
  - Hidden Rules: explicit symbolic (e.g., AND, OR, progression) vs. abstract/intuitive constraints
  - Target Task: classification, generation, natural-language rule description
  - Cognitive Function: completion (fill in missing piece) vs. discrimination (pattern break/assignment)
  - Specific Challenge: domain transfer, extrapolation, arithmetic integration
- Benchmark-Oriented Categories:
  - Relational Reasoning: captures object–object, part–whole, or scene graph-level relations (Webb et al., 2023, Sarkar et al., 14 Aug 2025)
  - Symbolic/Programmatic Reasoning: involves extraction or manipulation of structured symbols or logic (Sarkar et al., 14 Aug 2025)
  - Perceptual/Pattern Completion: demands filling in missing information or matching under perturbations (Törtei et al., 24 Dec 2025)
  - Temporal Reasoning: tracks sequences, event orderings, or changes over time (Sarkar et al., 14 Aug 2025, Zhao et al., 3 Apr 2025, Shen et al., 17 May 2025)
  - Causal Reasoning: infers interventions, counterfactuals, or mechanistic relations (Sarkar et al., 14 Aug 2025, Zhao et al., 3 Apr 2025)
  - Commonsense/Intent: models latent goals, affordances, or background world knowledge (Sarkar et al., 14 Aug 2025, Zhong et al., 2024)
The table below synthesizes frequently cited axes:
| Axis | Examples/Subtypes | Benchmarks/Papers |
|---|---|---|
| Relational | Object–object, spatial, scene graph, part–whole | CLEVR, Visual Genome |
| Symbolic | Logic, rule extraction, program tracing | CLEVR, NMN, NS-CL |
| Temporal | Action/event tracking, anticipation, video QA | TVQA, VSTaR, RISEBench |
| Causal | Interventions, counterfactual, SCM-based prediction | CLEVRER, CausalVQA |
| Commonsense | Intent, affordance, latent goal inference, world knowledge | VCR, VisualCOMET |
| Perceptual | Pattern completion, matching under noise, low-level correspondence | VisRes, BLINK-Twice |
| Arithmetic | Counting, measurement, ratio, number sense | VisualPuzzles, VisuLogic |
Each axis subsumes a range of formally distinct tasks, output types, and methodological demands.
2. Fine-Grained Reasoning Types in Contemporary Benchmarks
Current benchmarks operationalize visual reasoning through well-defined, often orthogonal, task templates:
- SCB (Seeing Culture Benchmark) (Satar et al., 20 Sep 2025): Defines three distractor-sampling visual reasoning types in cultural VQA:
  - Type 1 (Same-Country): Visual choices all from same country/category; demands fine-grained, within-culture concept discrimination.
  - Type 2 (Different-Country): Same category/different countries; can be solved by exploiting high-level country-specific cues.
  - Type 3 (Mixed-Group): Balanced distractor set; probes both broad and fine-grained cues.
  - All are embedded in a two-stage protocol: (1) multiple-choice VQA, (2) segmentation grounding.
- BLINK-Twice (Ye et al., 10 Oct 2025): Characterizes seven visual reasoning types geared to vision-centric perceptual inference:
  - Visual Misleading: Detecting appearance–reality dissociation.
  - Visual Dislocation: Foreground–background alignment illusions.
  - Art Illusion: Painted vs. real geometry distinction.
  - Visual Occlusion: Inferring identity/count under partial occlusion.
  - Forced Perspective: Resolving scale/distance ambiguities from camera pose.
  - Physical Illusion: Disambiguating phenomena like refraction/lighting.
  - Motion Illusion: Differentiating dynamic context from static arrangement.
- RISEBench (Zhao et al., 3 Apr 2025): Structurally delineates:
  - Temporal Reasoning: Scene/object evolution over time.
  - Causal Reasoning: Direct action-induced state change.
  - Spatial Reasoning: Object arrangement and geometric manipulation.
  - Logical Reasoning: Rule-based, symbolic inference applied to visual input (e.g., puzzle solving).
- VisualPuzzles (Song et al., 14 Apr 2025): Encodes five reasoning categories:
  - Algorithmic: Multi-step transformation chains.
  - Analogical: Relational analogy completion.
  - Deductive: Propositional logic with visual facts.
  - Inductive: Pattern induction/generalization.
  - Spatial: Spatial transformation and visualization.
- VisuLogic (Xu et al., 21 Apr 2025): Six orthogonal reasoning types:
  - Quantitative: Counting, arithmetic, and set operations.
  - Spatial: 3D structure inference.
  - Positional: In-plane transformations.
  - Attribute: Symmetry, curvature, property detection.
  - Stylistic: Overlays, Boolean manipulation.
  - Other: Alphanumeric and cultural symbol transformations.
- VisRes (Törtei et al., 24 Dec 2025): Three-level regime—(1) perceptual completion, (2) rule-based single-attribute, (3) compositional multi-attribute abstraction.
Each benchmark defines category-specific input distributions, selection strategies, and formal evaluation protocols to isolate the corresponding reasoning dimension.
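The three SCB distractor-sampling types described above can be illustrated with a minimal sketch. This is not the benchmark's actual sampling code: the `Item` record, the `sample_distractors` function, the mode names, the placeholder countries "A", "B", "C", and the reading of "balanced" as a roughly even split between same-country and different-country distractors are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical record for one candidate image/concept.
    concept: str
    country: str
    category: str


def sample_distractors(target, pool, mode, k=3, seed=0):
    """Pick k distractors for `target` under one of three sampling modes."""
    rng = random.Random(seed)
    # All modes draw from the target's category, excluding the target itself.
    same_cat = [it for it in pool if it != target and it.category == target.category]
    same_country = [it for it in same_cat if it.country == target.country]
    diff_country = [it for it in same_cat if it.country != target.country]

    if mode == "same_country":          # Type 1: within-culture discrimination
        groups = [(same_country, k)]
    elif mode == "different_country":   # Type 2: country-level cues suffice
        groups = [(diff_country, k)]
    elif mode == "mixed":               # Type 3: assumed even split of both kinds
        groups = [(same_country, k // 2), (diff_country, k - k // 2)]
    else:
        raise ValueError(f"unknown mode: {mode}")

    chosen = []
    for candidates, n in groups:
        if len(candidates) < n:
            raise ValueError("not enough candidates for the requested mode")
        chosen.extend(rng.sample(candidates, n))
    return chosen


# Placeholder pool; countries and concepts are synthetic labels.
pool = [
    Item("a0", "A", "clothing"), Item("a1", "A", "clothing"),
    Item("a2", "A", "clothing"), Item("a3", "A", "clothing"),
    Item("b1", "B", "clothing"), Item("b2", "B", "clothing"),
    Item("c1", "C", "clothing"), Item("a_food", "A", "food"),
]
target = pool[0]
type1 = sample_distractors(target, pool, "same_country")
type2 = sample_distractors(target, pool, "different_country")
type3 = sample_distractors(target, pool, "mixed")
```

Under these assumptions, `type1` forces discrimination among items from the target's own country, while `type2` can be answered from country-level cues alone, which is the contrast the benchmark is designed to expose.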
3. Formal Models, Architectures, and Abstractions
Visual reasoning types entail distinct computational architectures and formal representations:
- Object-Centric and Relational Models: Architectures such as OCRA (Webb et al., 2023) extract object slots, encode pairwise relations via learned bottlenecks, and apply transformers for rule abstraction. Tasks include same/different, match-to-sample, higher-order patterns (e.g., ABA rules), and systematic generalization under held-out shape identities.
- Graph-Based and Symbolic Networks: Scene graphs, the Neuro-Symbolic Concept Learner (NS-CL), and neural module networks use discrete symbols, logic-based pipelines, and continuous relaxations to execute programs over parsed scene elements (Sarkar et al., 14 Aug 2025).
- Temporal and Causal Modules: Spatio-temporal transformers, memory-augmented nets, and structural causal models (SCMs) encode time-indexed variables or causal graphs to reason about action chains, interventions, and counterfactuals.
- Visual Table Representation: Hierarchical scene representation structured as a "visual table" (scene-level and object-level descriptions with knowledge fields), bridging grounding and symbolic knowledge for MLLMs (Zhong et al., 2024).
- Attention and Perception Modules: Spatial and feature-based attention significantly modulate performance in spatial-relation and same-different tasks, respectively, confirming the need for both object-centric and spatial-contextual information (Vaishnav et al., 2021).
Systems leveraging explicit relational bottlenecks, object abstraction, or structured symbolic intermediaries generally demonstrate higher systematic generalization than monolithic architectures.
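To make "program execution over parsed scene elements" concrete, the sketch below runs a tiny symbolic program over a hand-written scene graph. It is a toy illustration in the spirit of CLEVR-style functional programs, not the implementation of NS-CL or of any neural module network; the `SceneGraph` class, the operation names (`filter`, `relate`, `count`, `exist`), and the example scene are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SceneGraph:
    objects: dict    # object id -> attribute dict, e.g. {"color": "red"}
    relations: set   # triples (subject_id, predicate, object_id)


def filter_objects(graph, ids, **attrs):
    """Keep the ids whose attributes match every key/value in attrs."""
    return {i for i in ids
            if all(graph.objects[i].get(k) == v for k, v in attrs.items())}


def relate(graph, ids, predicate):
    """Return subjects s such that (s, predicate, o) holds for some o in ids."""
    return {s for (s, p, o) in graph.relations if p == predicate and o in ids}


def execute(graph, program):
    """Run a list of (op, arg) steps, threading a set of object ids through."""
    ids = set(graph.objects)
    for op, arg in program:
        if op == "filter":
            ids = filter_objects(graph, ids, **arg)
        elif op == "relate":
            ids = relate(graph, ids, arg)
        elif op == "count":
            return len(ids)
        elif op == "exist":
            return bool(ids)
        else:
            raise ValueError(f"unknown op: {op}")
    return ids


scene = SceneGraph(
    objects={
        0: {"color": "red", "shape": "cube"},
        1: {"color": "blue", "shape": "sphere"},
        2: {"color": "green", "shape": "cylinder"},
        3: {"color": "blue", "shape": "sphere"},
    },
    relations={(1, "left_of", 0), (2, "left_of", 0), (0, "left_of", 3)},
)

# "How many spheres are left of the red cube?"
program = [
    ("filter", {"color": "red", "shape": "cube"}),
    ("relate", "left_of"),
    ("filter", {"shape": "sphere"}),
    ("count", None),
]
answer = execute(scene, program)  # 1: only object 1 is a sphere left of the cube
```

Because every intermediate result is an explicit set of object ids, each step can be inspected, which is one reason structured symbolic intermediaries are associated with better systematic generalization and interpretability.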
4. Evaluation Protocols and Metrics
Each reasoning category is evaluated with quantitative and qualitative metrics suited to its output type:
- Multiple-Choice Accuracy: Fraction of correct answers in VQA-style and puzzle-based formats (e.g., accuracy in SCB, BLINK-Twice, VisualPuzzles).
- Segmentation and Grounding: Intersection-over-Union (IoU), mean IoU, or mask-prediction performance following successful high-level inference (SCB Types 1–3).
- Chain-of-Thought Score (CoT-Score): Normalized stepwise reasoning-chain evaluation (BLINK-Twice), combining identification of visual clues and correct inference.
- Multi-Dimensional Scoring: RISEBench employs a weighted sum of instruction reasoning, appearance consistency, and plausibility; "solved" status is gated by perfect scores across all dimensions.
- Structural and Causal Validity: Accuracy of intermediate representations (e.g., GraphSim for scene graphs (Sarkar et al., 14 Aug 2025)), counterfactual consistency, average causal effect (ACE) for causal benchmarks.
- Error Decomposition: Attribution of model failures to perception-only, reasoning-only, or combined failures (VRIQ (Khezresmaeilzadeh et al., 5 Feb 2026)), plus fine-grained field-probe accuracy (e.g., shape/count/position/depth in VRIQ; attribute/operation chain in VisualPuzzles).
Benchmark design emphasizes separating reasoning skill from domain knowledge, perceptual extraction, or statistical priors, highlighting the precise axis under evaluation.
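Several of the metrics listed above reduce to short computations. The sketch below shows multiple-choice accuracy, IoU and mean IoU over masks represented as sets of pixel coordinates, and a gated "solved" check alongside a weighted sum. The function names, the set-based mask representation, the dimension keys, the weights, and the 1 to 5 score scale are assumptions for illustration only; they are not taken from the benchmarks' released evaluation code.

```python
def multiple_choice_accuracy(predictions, answers):
    """Fraction of items where the predicted option equals the answer key."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and aligned")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def iou(pred_mask, gt_mask):
    """Intersection-over-Union of two masks given as sets of (row, col) pixels."""
    union = pred_mask | gt_mask
    if not union:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return len(pred_mask & gt_mask) / len(union)


def mean_iou(mask_pairs):
    """Average IoU over a non-empty list of (pred_mask, gt_mask) pairs."""
    return sum(iou(p, g) for p, g in mask_pairs) / len(mask_pairs)


def weighted_score(scores, weights):
    """Weighted sum over scoring dimensions (weights are hypothetical)."""
    return sum(weights[name] * value for name, value in scores.items())


def is_solved(scores, max_score):
    """Gate: an item counts as solved only if every dimension is at maximum."""
    return all(value == max_score for value in scores.values())


accuracy = multiple_choice_accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])  # 0.75

pred = {(0, 0), (0, 1)}
gt = {(0, 1), (1, 1)}
single_iou = iou(pred, gt)                   # 1 shared pixel out of 3 in the union
miou = mean_iou([(pred, gt), (gt, gt)])

# Hypothetical per-dimension scores on an assumed 1-5 scale.
scores = {"reasoning": 5, "consistency": 5, "plausibility": 4}
weights = {"reasoning": 0.4, "consistency": 0.3, "plausibility": 0.3}
overall = weighted_score(scores, weights)
solved = is_solved(scores, max_score=5)      # False: plausibility is below 5
```

The gated check illustrates why a high weighted score and a "solved" status can diverge: a single imperfect dimension is enough to fail the gate.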
5. Empirical Findings and Model Limitations
Empirical results from recent benchmarks converge on several key observations:
- Perceptual Grounding as Limiting Factor: In VRIQ, over half of model failures stem from perceptual deficits (e.g., counting, 3D/depth reasoning), with only 1% due solely to logical error (Khezresmaeilzadeh et al., 5 Feb 2026).
- Category-Specific Weaknesses: Logical, analogical, and inductive reasoning are the categories in which models have closed the least of the gap to human performance (Song et al., 14 Apr 2025), while fine-grained discrimination among visually similar distractors is especially challenging (SCB Type 1 (Satar et al., 20 Sep 2025)).
- Reasoning-to-Grounding Gap: Even high VQA accuracy (Type 2 in SCB) does not translate to reliable evidence localization (segmentation mIoU ≈ 31–33%) (Satar et al., 20 Sep 2025).
- Pattern Matching vs. Abstract Reasoning: VisRes reveals performance near chance on the lowest-level, perceptual-completion tasks, with moderate gains in rule-based abstraction and a collapse in compositional, multi-attribute scenarios, implicating missing attribute-binding and generalization mechanisms (Törtei et al., 24 Dec 2025).
- Interaction of Reasoning Types: Temporal and causal reasoning are distinct (passive change vs. active intervention), while spatial and logical axes probe geometry, rule-based manipulation, or compositional inference (Zhao et al., 3 Apr 2025, Sarkar et al., 14 Aug 2025).
These findings underline the limitations of current VLMs in moving beyond pattern recognition toward robust, attribute-rich, and structured visual reasoning.
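The perception-versus-reasoning attribution behind the first finding can be sketched as a simple bookkeeping step. The sketch assumes that each failed item comes with two diagnostic probe outcomes, one for perception and one for reasoning; that probe design, the record fields, the function names, and the toy records are hypothetical, and the resulting fractions are not VRIQ results.

```python
from collections import Counter


def attribute_failure(perception_ok, reasoning_ok):
    """Label one failed item by which diagnostic probe(s) it also failed."""
    if perception_ok and reasoning_ok:
        return "unattributed"   # failed overall although both probes passed
    if not perception_ok and not reasoning_ok:
        return "combined"
    return "reasoning_only" if perception_ok else "perception_only"


def decompose_failures(records):
    """Return the share of each failure label among items answered incorrectly."""
    failures = [r for r in records if not r["final_correct"]]
    if not failures:
        return {}
    counts = Counter(
        attribute_failure(r["perception_ok"], r["reasoning_ok"]) for r in failures
    )
    return {label: n / len(failures) for label, n in counts.items()}


# Toy records (synthetic, for illustration only).
records = [
    {"final_correct": True,  "perception_ok": True,  "reasoning_ok": True},
    {"final_correct": False, "perception_ok": False, "reasoning_ok": True},
    {"final_correct": False, "perception_ok": False, "reasoning_ok": True},
    {"final_correct": False, "perception_ok": True,  "reasoning_ok": False},
    {"final_correct": False, "perception_ok": False, "reasoning_ok": False},
]
breakdown = decompose_failures(records)
# breakdown == {"perception_only": 0.5, "reasoning_only": 0.25, "combined": 0.25}
```

Reporting shares over failed items only, rather than over all items, keeps the breakdown comparable across models with different overall accuracy.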
6. Synthesis: Integrated Taxonomy Across Modalities and Outputs
Unifying analysis across sources underscores that visual reasoning types are best represented on a multidimensional spectrum:
- Input complexity: Ranges from simple patches and geometric forms (object-centric, symbolic), through relational scenes, temporal chains, to compositional video queries.
- Hidden rules: Span explicit transformations, arithmetic or arithmetic-like operations, relational mapping, temporal/causal chains, and commonsense/rule abstraction.
- Task format: Multiple-choice, generation, segmentation/grounding, rule description, natural language output.
- Output requirements: Binary/multiclass answers, mask or region predictions, candidate selection, structured scene graphs, free-form captions.
- Difficulty and generalization: Systematic generalization often demands compositionality, variable binding, multi-step induction, or counterfactual prediction—rarely achieved with current models.
The table below summarizes principal reasoning types, their operational definitions, and some archetypal benchmarks:
| Type | Formal/Core Definition | Representative Benchmarks |
|---|---|---|
| Object-Centric | Slot-based, binding of features and positions | OCRA, CLEVR-ART (Webb et al., 2023) |
| Relational | Scene graph, pairwise or triple-wise abstract reasoning | CLEVR, Visual Genome (Sarkar et al., 14 Aug 2025) |
| Symbolic | Program tracing, logic, rule extraction/execution | CLEVR, NS-CL, NMN |
| Temporal | Sequential prediction, state evolution, event ordering | RISEBench, TVQA (Zhao et al., 3 Apr 2025) |
| Causal | Structural causal modeling, counterfactual/photo-realistic edit | CausalVQA, CLEVRER |
| Commonsense/Intent | Goal, intention, affordance, world knowledge inference | VCR, VisualCOMET (Sarkar et al., 14 Aug 2025) |
| Perceptual/Pattern | Completion, amodal filling, local/global matching | VisRes, BLINK-Twice (Törtei et al., 24 Dec 2025, Ye et al., 10 Oct 2025) |
| Quantitative/Arithmetic | Counting, number-sense, set arithmetic | VisualPuzzles, VisuLogic |
| Logical/Rule-based | Deduction, induction, analogical completion | VisualPuzzles |
| Compositional/Attribute | Multi-attribute integration, combinatorial rule chaining | VisRes, OCRA |
Benchmarks increasingly combine these types in hybrid queries and tasks to more rigorously probe the generalization and systematic reasoning abilities of advanced vision-language architectures.
7. Open Challenges and Research Directions
Persistent challenges and future opportunities identified across the literature include:
- Perceptual–Reasoning Integration: Bottlenecks in perceptual skill (e.g., counting, depth reconstruction) severely limit downstream abstract reasoning (Khezresmaeilzadeh et al., 5 Feb 2026); research is shifting toward joint training and explicit representation fusion.
- Unified, Multi-Paradigm Systems: Hybrid architectures combining graph-based, symbolic, temporal, and causal modules are advocated for robust zero-shot and few-shot reasoning (Sarkar et al., 14 Aug 2025).
- Comprehensive Benchmarks and Adaptive Evaluation: There remains a need for datasets that integrate functional correctness, structural consistency, and causal validity in a cohesive testbed, avoiding shortcut-prone or single-mode evaluations (Sarkar et al., 14 Aug 2025).
- Weak and Self-Supervision: Most reasoning models depend on densely annotated datasets or programmatic supervision; advances in weakly supervised representation learning are essential for scalable, real-world generalization.
- Explainability and Trust: Embedding interpretable reasoning traces, diagnostic error decomposition, and uncertainty quantification is needed, particularly for safety-critical and high-stakes applications (Sarkar et al., 14 Aug 2025).
This synthesis of visual reasoning types reveals an actively evolving field, with ongoing integrative efforts required to move from specialized, isolated capabilities to general, transparent, and human-aligned visual intelligence.