VisRes Bench: Visual Reasoning Benchmark

Updated 27 December 2025
  • VisRes Bench is a diagnostic benchmark that isolates perceptual, relational, and compositional visual reasoning abilities using exclusively image-based tasks.
  • It systematically decouples semantic abstraction from linguistic cues with graded challenges like local completion, Raven-style grids, and multi-attribute composition tasks.
  • Evaluations reveal significant model performance gaps, highlighting the need for improved integration of robust visual feature extraction with symbolic reasoning.

VisRes Bench is a diagnostic benchmark for evaluating the genuine visual reasoning capabilities of vision-language models (VLMs) in the absence of contextual language supervision. It systematically isolates perceptual, relational, and compositional reasoning abilities using exclusively image-based tasks, thereby aiming to decouple semantic abstraction from linguistic priors and textual scaffolding. The benchmark is structured along a graded complexity axis to reveal the loci of failure in state-of-the-art VLMs and set rigorous evaluation standards for future multimodal research (Törtei et al., 24 Dec 2025).

1. Conceptual Motivation and Objectives

The motivation for VisRes Bench stems from the observation that leading VLMs often succeed at tasks such as visual question answering (VQA) and image captioning by leveraging experience with textual patterns, rather than engaging in abstract visual reasoning per se. In contrast to humans—who can complete visual patterns, recover occluded content, and infer relational rules without any explicit cues—models often collapse to random performance when deprived of text-based prompts. VisRes Bench specifically addresses this issue by providing a controlled, image-only suite that:

  • Probes VLMs for true perceptual grounding and abstraction,
  • Suppresses linguistic shortcuts,
  • Localizes weaknesses across a hierarchy: basic perception, single-attribute inference, and multi-attribute compositionality.

2. Task Taxonomy and Formal Structure

VisRes Bench encompasses three canonical levels of visual reasoning, each contributing a distinct set of cognitive demands:

Level 1: Perceptual Completion and Global Matching

  • Local patch completion: Given a 512×512 image with an 80×80 px masked region and four candidate patches (A–D), select the patch that completes the blank. Distractors are sampled either randomly or via DINOv2 similarity.
  • Global occlusion: 50% or 80% random square-tile masking; the correct answer is a scene continuation from nearby frames.
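
A Level 1 local-completion item can be illustrated with a short sketch. This is not the benchmark's release code: the function name, the random placement strategy, and the array layout are assumptions; distractor sampling (random or DINOv2-similarity) is only indicated in a comment.

```python
# Minimal sketch (not the benchmark's code) of a Level 1 local patch
# completion item: blank an 80x80 region of a 512x512 image and keep the
# removed patch as the correct answer. Distractor patches would be drawn
# either at random from other images or ranked by DINOv2 feature similarity.
import numpy as np

IMG_SIZE, PATCH = 512, 80

def make_local_completion_item(image: np.ndarray, rng: np.random.Generator):
    """Return (query_with_blank, answer_patch, top_left_coords)."""
    assert image.shape[:2] == (IMG_SIZE, IMG_SIZE)
    y = int(rng.integers(0, IMG_SIZE - PATCH))
    x = int(rng.integers(0, IMG_SIZE - PATCH))
    answer = image[y:y + PATCH, x:x + PATCH].copy()  # ground-truth choice
    query = image.copy()
    query[y:y + PATCH, x:x + PATCH] = 0              # masked region shown to the model
    return query, answer, (y, x)
```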

Level 2: Single-Attribute Raven-Style Grids

  • 3×3 grids with missing center cells per row. The missing value is determined solely by a single attribute—color, count, or orientation—according to one of several categorical, progression, or arithmetic rule families.
  • All non-target attributes are randomized to isolate the abstraction under test.

Level 3: Multi-Attribute Compositional Reasoning

  • 3×3 grids where two or more visual attributes (color, count, object type, orientation) vary under logical constraints:
    • Coupled rules (deterministic mappings, e.g. color ↔ orientation),
    • Independent rules (parallel distributions/arithmetic),
    • Spatial-compositional patterns (e.g. spiral traversals with coupled rules).
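
The Level 2 and Level 3 rule families can be made concrete with a small rule-checking sketch. The attribute names and the specific rules below (a count progression and a color↔orientation coupling) are illustrative assumptions chosen to mirror the families described above, not the benchmark's actual rule set.

```python
# Illustrative rule checks for Raven-style grid rows (assumed rules, not the
# benchmark's definitions): a single-attribute arithmetic progression on count,
# and a multi-attribute coupled rule mapping orientation to color.
from dataclasses import dataclass

@dataclass
class Cell:
    color: str        # e.g. "red", "blue"
    count: int        # number of objects in the cell
    orientation: int  # rotation in degrees

def count_progression_holds(row: list[Cell], step: int = 1) -> bool:
    """Single-attribute arithmetic rule: count increases by a fixed step along a row."""
    return (row[1].count - row[0].count == step and
            row[2].count - row[1].count == step)

# Coupled rule: color is a deterministic function of orientation.
COLOR_OF_ORIENTATION = {0: "red", 90: "blue", 180: "green", 270: "yellow"}

def coupled_rule_holds(row: list[Cell]) -> bool:
    return all(COLOR_OF_ORIENTATION.get(c.orientation) == c.color for c in row)

def candidate_is_valid(row_prefix: list[Cell], candidate: Cell) -> bool:
    """A candidate completes the row only if every active rule still holds."""
    row = row_prefix + [candidate]
    return count_progression_holds(row) and coupled_rule_holds(row)
```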

Formally, for each task:

  • The model receives a query image (or grid) $I$ and four candidates $\{C_k\}_{k=1}^{4}$.
  • The decision is $\hat{y} = \arg\max_{k \in \{1,\dots,4\}} R(I, C_k)$, with $R$ the internal model score.
  • Level-specific equations define attribute rule-checking, e.g., $a_{r,3} = f(a_{r,1}, a_{r,2})$ for single-attribute grids, and $a_{r,3}^{(l)} = f^{(l)}(a_{r,1}^{(l)}, a_{r,2}^{(l)})$ for each attribute $l$ in multi-attribute grids.
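
The decision rule reduces to a few lines of selection code. Here `score_candidate` is a hypothetical stand-in for the model-internal score $R$; in practice the evaluated VLM simply emits a letter A–D.

```python
# Sketch of the four-way decision rule y_hat = argmax_k R(I, C_k).
# `score_candidate` is a hypothetical stand-in for the internal score R.
from typing import Callable, Sequence

LETTERS = "ABCD"

def predict(query, candidates: Sequence, score_candidate: Callable) -> str:
    """Return the letter (A-D) of the highest-scoring candidate."""
    scores = [score_candidate(query, c) for c in candidates]
    best_k = max(range(len(candidates)), key=scores.__getitem__)
    return LETTERS[best_k]
```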

3. Dataset Construction and Statistical Properties

VisRes Bench comprises approximately 19,000 total samples distributed as follows:

Task Level | Samples | Subtasks
Level 1    | ~10,500 | 8 local, 2 global
Level 2    | 5,956   | 12 attribute subtasks
Level 3    | 2,522   | 6 composition subtasks
  • Images are sourced from Google Street View (global occlusion) and diverse web crawls. All images are filtered and cropped for content safety.
  • For Levels 2–3, each image is annotated for ground-truth attributes via a semi-automated procedure combining metadata, model predictions, and human checks.
  • Disturbances include blur, brightness modulation, rotation, Canny edge maps, and occlusion to systematically degrade and probe visual robustness.
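
The disturbance families listed above can be sketched with standard OpenCV calls. Kernel sizes, gains, angles, thresholds, and tile sizes below are assumed values for illustration, not the benchmark's actual settings.

```python
# Illustrative implementations of the disturbance families (assumed parameters).
import cv2
import numpy as np

def blur(img, k=9):
    return cv2.GaussianBlur(img, (k, k), 0)

def brighten(img, gain=1.4):
    return cv2.convertScaleAbs(img, alpha=gain, beta=0)

def rotate(img, angle=90):
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (w, h))

def canny_edges(img, lo=100, hi=200):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img
    return cv2.Canny(gray, lo, hi)

def occlude(img, ratio=0.5, tile=64, seed=0):
    """Black out a random subset of square tiles (e.g. 50% or 80% of the image)."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    for y in range(0, img.shape[0], tile):
        for x in range(0, img.shape[1], tile):
            if rng.random() < ratio:
                out[y:y + tile, x:x + tile] = 0
    return out
```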

4. Evaluation Protocol and Metrics

Benchmarked models are evaluated using the following procedure:

  • Each task presents a strict four-choice multiple-choice question.
  • Prompt types include a "generic" format (absence of any attribute cues) and a "guided" format (explicit mention of attributes or rule types). This enables isolation of purely visual performance versus language-anchored reasoning.
  • Core metric is accuracy: $\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}$, where $N$ is the sample count.
  • Supplementary ablations probe step-by-step ("thinking mode") versus direct output, and context length (up to 32K tokens) to mitigate truncation.

If a model fails to provide a valid letter answer (e.g., looping, timeout), the sample is marked as incorrect—this can lower accuracy below the random-choice baseline (25%).
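
A minimal scoring sketch (an assumed helper, not the paper's evaluation harness) makes this invalid-answer rule concrete:

```python
# Any response that is not a valid letter A-D (e.g. a looping or timed-out
# generation, here represented as None) counts as incorrect, which is why
# reported accuracy can drop below the 25% random-choice baseline.
from typing import Optional

VALID = {"A", "B", "C", "D"}

def accuracy(predictions: list[Optional[str]], labels: list[str]) -> float:
    correct = 0
    for pred, label in zip(predictions, labels):
        letter = pred.strip().upper() if isinstance(pred, str) else None
        if letter in VALID and letter == label:
            correct += 1
    return correct / len(labels)

print(accuracy(["A", None, "b", "D"], ["A", "C", "B", "A"]))  # 0.5
```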

5. Key Empirical Findings and Interpretation

Level 1:

  • Under DINOv2-similar distractors (DS, hardest) with generic prompts, state-of-the-art models such as GPT-5 reach only ~35.4% accuracy on rotation (others 24–31%), not far above random (25%).
  • With 50% occlusion, Gemini-2.5 achieves 57.1% and GPT-4o 20.9%, both far below the human baseline (96%).
  • Increasing image resolution boosts performance (up to 56.5% for GPT-5 at $2048^2$ px), but does not close the gap to humans.
  • Few-shot fine-tuning increases open-source model accuracy (Qwen2.5-VL-3B: 24.5% → 43.7%), but performance remains well below human (90.4%).

Level 2:

  • Top models are near-perfect on color-uniform grids (96–97%), while open-source models range 66–91%.
  • More complex patterns with count/arithmetic yield 77–90% for top models, open-source 37–49%.
  • Orientation tasks are a major failure point (19–30%), exposing geometric grounding deficits.
  • Guided/language-anchored prompts and few-shots improve accuracy by 10–40 points, highlighting reliance on textual cues.

Level 3:

  • Overall average accuracy is low (~34%), with the best cases (spiral color–count–object) at ~56% for GPT-5 and 54% for Gemini.
  • Multi-attribute composition sharply degrades performance even when single-attribute recognition is robust.

Cross-Modal Gap:

  • When identical reasoning tasks are "verbalized" (textual descriptors replace the images), GPT-5 achieves 85% (Level 2) and 66% (Level 3), a 30–50 point improvement over the visual-only setting. This demonstrates that the logical reasoning components exist in these models but do not transfer from linguistic to visual abstraction.

6. Implications, Limitations, and Path Forward

VisRes Bench exposes three principal bottlenecks in current VLMs:

  1. Incomplete perceptual grounding, particularly for occlusion and orientation variants.
  2. Visual tokenization and resolution constraints that cause truncation of essential stimuli.
  3. Ineffective integration of extracted low-level attributes into higher-order relational rules.

Performance improvements from model scaling, resolution increase, or dataset augmentation remain bounded. Only architectural advances—such as robust visual feature extractors tightly coupled with symbolic abstraction layers—are likely to close the gap to human-level visual reasoning.

By defining a perceptual-to-compositional continuum with fully image-based, naturalistic tasks, VisRes Bench serves as a rigorous stress test for future multimodal models, protecting the research community from misleading progress driven by language priors rather than genuine visual intelligence (Törtei et al., 24 Dec 2025).
