
VisChainBench: LVLM Visual Reasoning Benchmark

Updated 14 December 2025
  • VisChainBench is a large-scale benchmark designed to evaluate LVLMs’ ability to perform multi-turn, interdependent visual reasoning with minimal language support.
  • It employs a structured task format that uses sequential images and distractor options to enforce genuine, visual-to-visual inference in real-world scenarios.
  • The evaluation protocol utilizes metrics such as accuracy, task completion rate, and chain consistency to measure model performance in context-dependent reasoning.

VisChainBench is a large-scale benchmark specifically designed to rigorously evaluate the multi-turn, multi-image visual reasoning capabilities of Large Vision-Language Models (LVLMs) in real-world, context-dependent decision-making scenarios. Unlike previous benchmarks that rely predominantly on static, language-driven comparisons or single-step visual tasks, VisChainBench explicitly minimizes language scaffolding and enforces visual-to-visual inference across interdependent procedural chains, thereby targeting reasoning beyond linguistic shortcuts (Lyu et al., 7 Dec 2025).

1. Motivation and Rationale

VisChainBench addresses critical deficiencies in existing LVLM evaluation regimes:

  • Real-world agent requirements: Many downstream applications, such as technical troubleshooting and assistive robotics, demand that agents reason over sequences of evolving visual inputs rather than isolated images and text prompts. Existing benchmarks typically evaluate only shallow multi-image comparisons or static tasks, and rely heavily on textual hints.
  • Beyond language priors: Prior datasets insufficiently probe models’ capacity for progressive, context-sensitive reasoning, as they allow models to exploit language priors. VisChainBench is intentionally structured to force models to infer objectives and intermediate states from visuals alone, requiring the propagation of visual context and prevention of language-overfitting.

This paradigm compels models to demonstrate (a) inference of goals from sequences of images, (b) maintenance of procedural context across multiple steps, and (c) robustness against superficial exploitation of residual textual cues (Lyu et al., 7 Dec 2025).

2. Dataset Composition and Taxonomy

VisChainBench comprises 1,457 tasks totalling 20,431 images, with an average of 14 images per task. Task structure and domains are purposefully varied to elicit robust generalization:

| Format | #Tasks | #Images | Avg. Images/Task | Domains | Turns | Text |
|---|---|---|---|---|---|---|
| Image-Text Multi-Turn Reasoning (ITMR) | 646 | 9,826 | 15 | Daily, Engineering, Science, IT | 3–6 | Minimal |
| In-Context Image-Only Reasoning (ICIR) | 437 | 3,192 | 7 | Daily, Engineering, IT | 1 | None |
| Image-Only Multi-Turn Reasoning (IOMR) | 437 | 7,413 | 20 | Daily, Engineering, IT | 2–6 | Minimal |
  • Visual diversity: Over 20,000 unique Creative Commons photos (<5% are synthetic). Procedural chains are constructed to range from 2 to 6 turns, with up to 27 images in complex tasks.
  • Plausibility controls: Each turn includes at least one distractor image to deter feature-matching heuristics and enforce genuine reasoning.

3. Task Format and Interdependency

At every turn t of a VisChainBench task, the model is provided with:

  • A “condition” image or a sequence of images reflecting previous choices (visual context propagation).
  • 3–4 option images for selection.
  • (ITMR only) A single, minimal text question.
  • A requirement to respond in the format “ANSWER: k”, where k indexes the chosen image.

Task correctness at turn t governs the starting visual context at turn t+1, enforcing a strict dependency and simulating real-world procedural workflows. For example, in the “Making Tea” scenario, progressing from an empty kettle to boiling water, steeping the tea bag, and finishing the brew captures the recursive, context-aware reasoning required (Lyu et al., 7 Dec 2025).
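A minimal Python sketch of this turn-by-turn protocol is shown below. The `model.answer` interface and the task fields (`initial_image`, `turns`, `options`, `question`, `answer_index`) are hypothetical stand-ins for an LVLM call and the benchmark's task records, not an official API.

```python
import re

def run_chain(model, task):
    """Sketch of the multi-turn protocol: propagate visual context,
    parse the "ANSWER: k" response, and stop the chain on a miss."""
    context = [task.initial_image]           # visual context carried across turns
    correct_turns = 0
    for turn in task.turns:
        reply = model.answer(
            condition_images=context,
            option_images=turn.options,       # 3-4 candidates, incl. >=1 distractor
            question=turn.question,           # None in image-only formats
        )
        match = re.search(r"ANSWER:\s*(\d+)", reply)
        if match is None:                     # instruction-following failure
            break
        k = int(match.group(1))
        if k != turn.answer_index:            # a wrong choice breaks the chain
            break
        correct_turns += 1
        context.append(turn.options[k])       # chosen image conditions the next turn
    return correct_turns, len(task.turns)
```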

4. Data Construction and Annotation Pipeline

The VisChainBench data pipeline is a multi-agent framework prioritizing both scalability and quality assurance:

  1. Automated Task Generation: Structured procedural chains are created in JSON format by Llama3.3-70B, encompassing initial scenes, stepwise questions with distractors, and embedded answer indices. Prompts enforce multi-step dependency and plausibility.
  2. Image Retrieval: Qwen2-VL-72B extracts key terms, conducts web image searches, and verifies contextual match. For rare missing cases (<5%), T2I synthesis via doubao-t2i-drawing is used.
  3. Automated Verification: Qwen2-VL-72B “solves” each task zero-shot; discrepancies with ground-truth trigger severity-ranked flags.
  4. Human Quality Control: Six annotators (MS/PhD-level) perform initial filtering, flag errors, and run quiz-style validation. If annotation accuracy falls below threshold, tasks are iteratively re-reviewed, ensuring both high visual integrity and minimal language influence.

This pipeline ensures dataset validity, visual diversity, controlled text usage, and reproducibility.
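For illustration, a procedural chain produced in step 1 might be represented roughly as follows. This is a hypothetical sketch of the JSON-style structure (field names are illustrative, not the dataset's actual schema), echoing the “Making Tea” scenario from Section 3.

```python
# Hypothetical example of a generated task chain; keys are illustrative only.
example_task = {
    "format": "IOMR",
    "domain": "Daily",
    "initial_scene": "empty_kettle.jpg",
    "turns": [
        {
            "question": None,                        # image-only: no text question
            "options": [
                "boiling_kettle.jpg",                # correct continuation
                "toaster.jpg",                       # distractor
                "empty_cup.jpg",                     # distractor
            ],
            "answer_index": 0,                       # embedded ground-truth index
        },
        {
            "question": None,
            "options": ["steeping_teabag.jpg", "coffee_press.jpg", "ice_tray.jpg"],
            "answer_index": 0,
        },
    ],
}
```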

5. Evaluation Protocols and Metrics

A disciplined evaluation procedure is employed:

  • Prompting: All models receive the same zero-shot prompt structures, enforcing the standard “ANSWER: k” response.
  • Metrics (computed as in the sketch following this list):
    • Accuracy: $\mathrm{Accuracy} = \frac{\#\,\text{correctly answered questions}}{\#\,\text{total questions}}$
    • Task Completion Rate (TC): $\mathrm{TC} = \frac{\#\,\text{tasks where every turn was answered correctly}}{\#\,\text{total tasks}}$
    • F₁ Score: Precision and recall computed in multi-label/free-form settings.
    • Chain Consistency Score (proposed): Assesses whether the correct image is selected at turn t+1 given correctness at turn t, i.e., the conditional probability of correct chain continuation.
  • Chain-of-Thought (CoT) Ablation: CoT prompting yields +6.95% CA on ITMR but only marginal gains in image-only formats, indicating limited transfer of language-driven inductive biases into purely visual reasoning.
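The following Python sketch shows how Accuracy, TC, and a chain-consistency estimate can be computed from per-turn correctness records; the `results` structure is an assumed representation, not the benchmark's official scoring code.

```python
def score(results):
    """results: list of tasks, each a list of per-turn correctness flags,
    e.g. [[True, True, False], [True, True, True, True], ...]."""
    # Accuracy: fraction of all turns answered correctly.
    turns = [flag for task in results for flag in task]
    accuracy = sum(turns) / len(turns)

    # Task Completion rate: fraction of tasks with every turn correct.
    tc = sum(all(task) for task in results) / len(results)

    # Chain consistency: P(turn t+1 correct | turn t correct).
    pairs = [(a, b) for task in results for a, b in zip(task, task[1:])]
    cond = [b for a, b in pairs if a]
    chain_consistency = sum(cond) / len(cond) if cond else float("nan")

    return accuracy, tc, chain_consistency
```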

6. Baseline Model Performance and Observed Failure Modes

Empirical results highlight various system behaviors:

| Model | ITMR CA / TC | ICIR CA | IOMR CA | Overall CA |
|---|---|---|---|---|
| Gemini-2.0-Flash | 82.0 / 46.1 | 70.7 | 75.8 | ~68.0 |
| GPT-4o | 77.7 / 31.6 | 71.7 | 75.8 | ~73.9 |
| Qwen2.5-VL-32B | 71.4 / 25.9 | 57.9 | 52.0 | 52.0 |
| InternVL3-14B | 65.7 / 23.0 | 57.7 | 52.2 | 52.2 |
| Small models (3–11B) | < 36 | < 36 | < 36 | < 36 |
  • Key errors: instruction-following failures (output-format mistakes), visual hallucinations (object-identity errors), propagation of early-step mistakes, and pronounced performance drops when text prompts are removed or minimized.

A plausible implication is that current LVLM architectures are still limited in compositional and truly context-dependent visual reasoning when deprived of language scaffolding, as evidenced by accumulation of turn-wise errors and reliance on text cues.

7. Resource Access and Impact

VisChainBench, along with its generation pipeline code, is hosted at https://huggingface.co/datasets/eyehole/VisChainBench under an MIT-style license, with all images CC-licensed for research use. Its open-source approach encourages iterative improvement of LVLMs capable of multi-step planning and adaptation in dynamic visual contexts, moving LVLM research toward genuine visual-to-visual reasoning capabilities rather than "see-and-answer" paradigms (Lyu et al., 7 Dec 2025).
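Assuming the dataset follows a standard Hugging Face `datasets` layout, it can be inspected with something like the following; the available configs, splits, and field names may differ in practice.

```python
from datasets import load_dataset

# Load the benchmark from the Hub (repo id taken from the URL above).
ds = load_dataset("eyehole/VisChainBench")
print(ds)                                  # inspect available splits and features

# Peek at one record of the first split; field names are whatever the
# dataset actually defines (e.g. task format, images, options, answers).
first_split = list(ds.keys())[0]
sample = next(iter(ds[first_split]))
print(sample.keys())
```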

By structurally precluding trivial language-based shortcuts, VisChainBench constitutes a necessary stress-test for LVLMs and establishes a new empirical basis for benchmarking procedural visual reasoning. Comparative efforts, such as ViC-Bench, extend this line of work by probing visual-interleaved chain-of-thought with free-style intermediate visual state interventions (Wu et al., 20 May 2025); however, VisChainBench's focus on context propagation with minimal text remains distinctive.
