VisChainBench: LVLM Visual Reasoning Benchmark
- VisChainBench is a large-scale benchmark designed to evaluate LVLMs’ ability to perform multi-turn, interdependent visual reasoning with minimal language support.
- It employs a structured task format that uses sequential images and distractor options to enforce genuine visual-to-visual inference in real-world scenarios.
- The evaluation protocol utilizes metrics such as accuracy, task completion rate, and chain consistency to measure model performance in context-dependent reasoning.
VisChainBench is a large-scale benchmark designed to rigorously evaluate the multi-turn, multi-image visual reasoning capabilities of Large Vision-Language Models (LVLMs) in real-world, context-dependent decision-making scenarios. Unlike previous benchmarks that rely predominantly on static, language-driven comparisons or single-step visual tasks, VisChainBench explicitly minimizes language scaffolding and enforces visual-to-visual inference across interdependent procedural chains, thereby targeting reasoning beyond linguistic shortcuts (Lyu et al., 7 Dec 2025).
1. Motivation and Rationale
VisChainBench addresses critical deficiencies in existing LVLM evaluation regimes:
- Real-world agent requirements: Many downstream applications, such as technical troubleshooting and assistive robotics, demand that agents reason over sequences of evolving visual inputs rather than isolated images and text prompts. Existing benchmarks typically evaluate only shallow multi-image comparisons or static tasks, and rely heavily on textual hints.
- Beyond language priors: Prior datasets insufficiently probe models’ capacity for progressive, context-sensitive reasoning because they allow models to exploit language priors. VisChainBench is intentionally structured to force models to infer objectives and intermediate states from visuals alone, requiring the propagation of visual context across turns and preventing over-reliance on language cues.
This paradigm compels models to demonstrate (a) inference of goals from sequences of images, (b) maintenance of procedural context across multiple steps, and (c) robustness against superficial exploitation of residual textual cues (Lyu et al., 7 Dec 2025).
2. Dataset Composition and Taxonomy
VisChainBench comprises 1,457 tasks totalling 20,431 images, with an average of 14 images per task. Task structure and domains are purposefully varied to elicit robust generalization:
| Format | #Tasks | #Images | Avg. Images/Task | Domains | Turns | Text |
|---|---|---|---|---|---|---|
| Image-Text Multi-Turn Reasoning (ITMR) | 646 | 9,826 | 15 | Daily, Engineering, Science, IT | 3–6 | Minimal |
| In-Context Image-Only Reasoning (ICIR) | 437 | 3,192 | 7 | Daily, Engineering, IT | 1 | None |
| Image-Only Multi-Turn Reasoning (IOMR) | 437 | 7,413 | 20 | Daily, Engineering, IT | 2–6 | Minimal |
- Visual diversity: Over 20,000 unique Creative Commons photos (<5% are synthetic). Procedural chains are constructed to range from 2 to 6 turns, with up to 27 images in complex tasks.
- Plausibility controls: Each turn includes at least one distractor image to deter feature-matching heuristics and enforce genuine reasoning.
3. Task Format and Interdependency
At every turn in a VisChainBench task, the model is provided:
- A “condition” image or a sequence of images reflecting previous choices (visual context propagation).
- 3–4 option images for selection.
- (ITMR only) A single, minimal text question.
- The requirement to output an answer in the format “ANSWER: k”, where k indexes the chosen option image.
Task correctness at turn t governs the starting visual context at turn t+1, enforcing a strict dependency and simulating real-world procedural workflows. For example, in the “Making Tea” scenario, progressing from an empty kettle to boiling water, steeping the tea bag, and finishing the brew captures the recursive, context-aware reasoning required (Lyu et al., 7 Dec 2025).
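This turn structure lends itself to a simple driver loop. The sketch below is illustrative only: the `Turn` fields, the `model` callable, and the context-propagation logic are assumptions about how such a harness could be organized, not the benchmark's released code; only the “ANSWER: k” reply format comes from the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Turn:
    condition_images: List[str]     # image paths giving the visual context for this turn
    options: List[str]              # 3-4 candidate option images, including distractors
    answer_index: int               # ground-truth index of the correct option
    question: Optional[str] = None  # minimal text question (ITMR format only)

def parse_answer(reply: str) -> Optional[int]:
    """Extract k from a reply of the required form 'ANSWER: k'; None if malformed."""
    match = re.search(r"ANSWER:\s*(\d+)", reply)
    return int(match.group(1)) if match else None

def run_chain(turns: List[Turn], model: Callable[..., str]) -> List[bool]:
    """Query a model turn by turn, propagating the selected image as context."""
    context: List[str] = []
    per_turn_correct: List[bool] = []
    for turn in turns:
        reply = model(context + turn.condition_images, turn.options, turn.question)
        choice = parse_answer(reply)
        per_turn_correct.append(choice == turn.answer_index)
        # The chosen option image becomes part of the visual context for the next turn.
        if choice is not None and 0 <= choice < len(turn.options):
            context.append(turn.options[choice])
    return per_turn_correct
```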
4. Data Construction and Annotation Pipeline
The VisChainBench data pipeline is a multi-agent framework prioritizing both scalability and quality assurance:
- Automated Task Generation: Structured procedural chains are created in JSON format by Llama3.3-70B, encompassing initial scenes, stepwise questions with distractors, and embedded answer indices. Prompts enforce multi-step dependency and plausibility.
- Image Retrieval: Qwen2-VL-72B extracts key terms, conducts web image searches, and verifies contextual match. For rare missing cases (<5%), text-to-image (T2I) synthesis via doubao-t2i-drawing is used.
- Automated Verification: Qwen2-VL-72B “solves” each task zero-shot; discrepancies with ground-truth trigger severity-ranked flags.
- Human Quality Control: Six annotators (MS/PhD-level) perform initial filtering, flag errors, and carry out quiz-style validation. If annotation accuracy falls below threshold, tasks are iteratively re-reviewed, ensuring both high visual integrity and minimal language influence.
This pipeline ensures dataset validity, visual diversity, controlled text usage, and reproducibility.
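As a rough orientation, the pipeline can be pictured as a generate/retrieve/verify loop. The sketch below is purely schematic: the agent objects and method names (`generate_task_json`, `retrieve_or_synthesize`, `solve_and_compare`) are hypothetical placeholders standing in for the LLM-driven stages described above, not the authors' released pipeline code.

```python
def build_task(scenario: str, task_generator, image_agent, verifier) -> dict:
    """Illustrative generate -> retrieve -> verify flow; agent objects are placeholders."""
    # 1. Structured procedural chain in JSON: initial scene, stepwise questions with
    #    distractors, and embedded answer indices (role played by Llama3.3-70B).
    task = task_generator.generate_task_json(scenario)

    # 2. Key-term extraction, web image search, and contextual-match verification
    #    (role played by Qwen2-VL-72B); fall back to T2I synthesis for rare gaps.
    for step in task["steps"]:
        step["images"] = image_agent.retrieve_or_synthesize(step["key_terms"])

    # 3. Zero-shot automated solving; any mismatch with the ground truth raises a
    #    severity-ranked flag that routes the task to human quality control.
    flags = verifier.solve_and_compare(task)
    if flags:
        task["review_flags"] = sorted(flags, key=lambda f: f["severity"], reverse=True)
    return task
```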
5. Evaluation Protocols and Metrics
A disciplined evaluation procedure is employed:
- Prompting: All models receive the same zero-shot prompt structures, enforcing the standard “ANSWER: k” response.
- Metrics (a computation sketch follows this list):
  - Accuracy (CA): the proportion of turns in which the model selects the correct option image.
  - Task Completion Rate (TC): the proportion of tasks in which every turn is answered correctly.
  - F₁ Score: Precision and Recall computed in multi-label/free-form settings.
  - Chain Consistency Score (proposed): assesses whether the correct image is selected at turn t+1 given correctness at turn t, i.e., the conditional probability of correct chain continuation.
- Chain-of-Thought (CoT) Ablation: CoT prompting yields +6.95% CA on ITMR but only marginal gains in image-only formats, indicating limited transfer of language-driven inductive biases into purely visual reasoning.
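The headline metrics can be computed from per-turn correctness records alone. The sketch below assumes each task is scored as a list of booleans (one entry per turn); it mirrors the metric definitions as stated above rather than reproducing the authors' evaluation scripts.

```python
from typing import Dict, List

def benchmark_metrics(per_task_results: List[List[bool]]) -> Dict[str, float]:
    """Compute accuracy, task completion rate, and chain consistency
    from per-turn correctness lists (one inner list per task)."""
    turns = [c for task in per_task_results for c in task]
    accuracy = sum(turns) / len(turns)

    # Task Completion: every turn in the task must be answered correctly.
    completion = sum(all(task) for task in per_task_results) / len(per_task_results)

    # Chain Consistency: P(correct at turn t+1 | correct at turn t).
    continued, anchored = 0, 0
    for task in per_task_results:
        for prev, nxt in zip(task, task[1:]):
            if prev:
                anchored += 1
                continued += nxt
    chain_consistency = continued / anchored if anchored else float("nan")

    return {"accuracy": accuracy,
            "task_completion": completion,
            "chain_consistency": chain_consistency}

# Example: two 3-turn tasks, one fully correct, one failing at turn 2.
print(benchmark_metrics([[True, True, True], [True, False, True]]))
```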
6. Baseline Model Performance and Observed Failure Modes
Empirical results reveal substantial performance gaps across models and task formats:
| Model | ITMR CA / TC | ICIR CA | IOMR CA | Overall CA |
|---|---|---|---|---|
| Gemini-2.0-Flash | 82.0 / 46.1 | 70.7 | 75.8 | ~68.0 |
| GPT-4o | 77.7 / 31.6 | 71.7 | 75.8 | ~73.9 |
| Qwen2.5-VL-32B | 71.4 / 25.9 | 57.9 | 52.0 | 52.0 |
| InternVL3-14B | 65.7 / 23.0 | 57.7 | 52.2 | 52.2 |
| Small (3–11B) | < 36 | < 36 | < 36 | < 36 |
- Key error modes: instruction-following failures (output format mistakes), visual hallucinations (object identity errors), propagation of early-step mistakes, and pronounced performance drops when text prompts are removed or minimized.
A plausible implication is that current LVLM architectures are still limited in compositional and truly context-dependent visual reasoning when deprived of language scaffolding, as evidenced by accumulation of turn-wise errors and reliance on text cues.
7. Resource Access and Impact
VisChainBench, along with its generation pipeline code, is hosted at https://huggingface.co/datasets/eyehole/VisChainBench under an MIT-style license, with all images CC-licensed for research use. Its open-source approach encourages iterative improvement of LVLMs capable of multi-step planning and adaptation in dynamic visual contexts, moving LVLM research toward genuine visual-to-visual reasoning capabilities rather than "see-and-answer" paradigms (Lyu et al., 7 Dec 2025).
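For quick inspection, the benchmark can presumably be pulled with the standard Hugging Face `datasets` library; the snippet below only assumes the repository ID given above, while split names and record fields depend on the actual dataset card.

```python
from datasets import load_dataset

# Load VisChainBench from the Hugging Face Hub (repository ID from the paper).
dataset = load_dataset("eyehole/VisChainBench")

# Print the available splits and the schema of each split's records;
# exact split and field names depend on how the dataset card is organized.
print(dataset)
for split_name, split in dataset.items():
    print(split_name, "->", split.features)
```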
By structurally precluding trivial language-based shortcuts, VisChainBench constitutes a necessary stress-test for LVLMs and establishes a new empirical basis for benchmarking procedural visual reasoning. Comparative efforts, such as ViC-Bench, extend this line of work by probing visual-interleaved chain-of-thought with free-style intermediate visual state interventions (Wu et al., 20 May 2025); however, VisChainBench's focus on context propagation with minimal text remains distinctive.