- The paper introduces ViGoR-Bench to measure visual generative models’ logical reasoning using dual-track process and result metrics.
- It employs a diverse dataset across physical, knowledge, and symbolic tasks to expose cognitive and logical gaps.
- Reward-driven reinforcement learning with complex data improves generalization compared to standard supervised fine-tuning.
ViGoR-Bench: Systematic Evaluation of Visual Generative Models for Zero-Shot Reasoning
Motivation and Problem Statement
Visual generative models have advanced dramatically in photorealistic synthesis, yet their logical and causal reasoning capabilities remain largely unprobed. Existing benchmarks predominantly focus on visual fidelity and semantic alignment, overlooking the cognitive dimension required for genuine visual intelligence. Notably, evaluation protocols prioritize final outcome metrics (e.g., CLIPScore, FID), ignoring the generative process and failing to distinguish between models that understand structural constraints and those that merely replicate statistical data distributions. This deficiency manifests as a "logical desert," where even high-fidelity outputs contain physically or logically incoherent artifacts.
Figure 1: Overview of ViGoR-Bench; domain distribution, reasoning process examples, and performance comparison for state-of-the-art models.
Benchmark Architecture and Data Construction
ViGoR-Bench is designed as a unified evaluation suite spanning three primary reasoning domains: Physical, Knowledge, and Symbolic. The benchmark encompasses 20 distinct task categories, ranging from embodied spatial reasoning to algorithmic symbolic manipulation. Data generation follows a tripartite strategy:
- Generative Synthesis: Leveraging multimodal LLMs and high-fidelity generative image models to construct synthetic physical scenarios.
- Real-World Acquisition: Curating and manually photographing authoritative real-world data for knowledge grounding.
- Algorithmic Generation: Using procedural engines for tasks demanding mathematical rigor and unique solutions.
Each sample undergoes stringent human review for semantic consistency and, where applicable, symbolic solver validation. Ground-truth references are provided either as validated images or human-verified captions to facilitate evidence-grounded evaluation.
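To make the sample format concrete, the sketch below models what a ViGoR-Bench-style record could look like, combining the three provenance routes with the ground-truth and validation fields described above. All class and field names are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Domain(Enum):
    PHYSICAL = "physical"
    KNOWLEDGE = "knowledge"
    SYMBOLIC = "symbolic"

class Provenance(Enum):
    GENERATIVE_SYNTHESIS = "generative_synthesis"  # MLLM + generative image model
    REAL_WORLD = "real_world"                      # curated / manually photographed
    ALGORITHMIC = "algorithmic"                    # procedural engine with unique solutions

@dataclass
class BenchSample:
    sample_id: str
    domain: Domain
    task_category: str                  # one of the 20 task categories
    instruction: str                    # prompt given to the generative model
    input_images: List[str] = field(default_factory=list)
    provenance: Provenance = Provenance.GENERATIVE_SYNTHESIS
    gt_image: Optional[str] = None      # validated reference image, if available
    gt_caption: Optional[str] = None    # human-verified caption, if available
    solver_validated: bool = False      # passed symbolic solver check, where applicable
    human_reviewed: bool = False        # passed manual semantic-consistency review
```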
Figure 2: Dataset construction and dual-track evaluation pipelines, with MLLM used for process and result assessment.
Figure 3: Task suite overview; hierarchical organization of representative tasks across Physical, Knowledge, and Symbolic Reasoning.
Dual-Track Evaluation Protocol
To disentangle procedural reasoning quality from final-output validity, ViGoR-Bench establishes two orthogonal metric tracks (a minimal scoring sketch follows this list):
- Process Metrics: Quantitative assessment of intermediate states, focusing on background consistency, instruction compliance, visual quality, and reasoning progression (beneficial action). Scores are real-valued (0–100), suitable for temporal and multi-step outputs.
- Result Metrics: Binary evaluation (pass/fail) of the final output over four dimensions: integrity, instruction following, realism, and task completion.
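A minimal sketch of how the two tracks could be aggregated is shown below, assuming a judge that emits per-dimension scores. The dimension names come from the descriptions above, while the helper functions and the equal-weight averaging are illustrative assumptions rather than the paper's exact formulas.

```python
from statistics import mean
from typing import Dict, List

PROCESS_DIMS = ["background_consistency", "instruction_compliance",
                "visual_quality", "beneficial_action"]
RESULT_DIMS = ["integrity", "instruction_following", "realism", "task_completion"]

def process_score(step_scores: List[Dict[str, float]]) -> float:
    """Average the per-dimension 0-100 scores over all intermediate steps."""
    per_step = [mean(s[d] for d in PROCESS_DIMS) for s in step_scores]
    return mean(per_step)

def result_score(verdicts: Dict[str, bool]) -> float:
    """Fraction of the four binary result dimensions that pass."""
    return sum(verdicts[d] for d in RESULT_DIMS) / len(RESULT_DIMS)

# Example: a two-step edit judged on both tracks (made-up numbers).
steps = [
    {"background_consistency": 92, "instruction_compliance": 80,
     "visual_quality": 88, "beneficial_action": 70},
    {"background_consistency": 90, "instruction_compliance": 85,
     "visual_quality": 86, "beneficial_action": 75},
]
final = {"integrity": True, "instruction_following": True,
         "realism": True, "task_completion": False}

print(round(process_score(steps), 1), result_score(final))  # 83.2 0.75
```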
An automated judge based on Gemini-2.5-Pro achieves low mean absolute error (MAE) and high Pearson correlation against human expert ratings, especially when ground-truth references are provided. The reliability analysis shows that including ground truth is necessary to stabilize LLM-based evaluation.
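As a concrete illustration of these agreement statistics, the sketch below computes MAE and Pearson correlation between hypothetical judge and human scores; the function and the example ratings are illustrative and not values reported in the paper.

```python
import numpy as np

def judge_agreement(judge_scores, human_scores):
    """Mean absolute error and Pearson correlation between judge and human ratings."""
    j = np.asarray(judge_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    mae = np.mean(np.abs(j - h))
    pearson = np.corrcoef(j, h)[0, 1]
    return mae, pearson

# Illustrative ratings on a 0-100 process-metric scale (not from the paper).
judge = [78, 55, 90, 40, 66]
human = [75, 60, 88, 35, 70]
mae, r = judge_agreement(judge, human)
print(f"MAE={mae:.1f}, Pearson r={r:.3f}")  # lower MAE and higher r = better alignment
```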
Comprehensive experiments are performed on proprietary and open-source models, categorized as image-editing, unified (with and without CoT), and video-generation models. Notable findings are detailed in the sections below.
Diagnostic Granularity and Failure Profiling
ViGoR-Bench extends beyond leaderboards by granularly profiling reasoning failures across cognitive axes. In symbolic domains (Sudoku, Maze, Jigsaw), performance generally degrades as problem complexity increases, but strictly monotonic, human-like error scaling appears only in specific tasks; in others, notably Sudoku, inverted-U patterns indicate data-distribution bias or overfitting to canonical instance sizes.
Figure 5: Impact of problem complexity on Reasoning Success for Sudoku, Jigsaw Puzzle, and Maze Navigation.
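Complexity-scaling curves like those in Figure 5 can be summarized by binning per-sample outcomes by a complexity key (e.g., maze side length or Sudoku grid size). The sketch below shows one such tabulation with made-up records, purely for illustration.

```python
from collections import defaultdict
from statistics import mean

def success_by_complexity(records):
    """Group per-sample task-completion outcomes by a complexity key
    and report the pass rate for each bucket."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["complexity"]].append(1.0 if r["task_completion"] else 0.0)
    return {k: mean(v) for k, v in sorted(buckets.items())}

# Illustrative records (not actual benchmark results).
records = [
    {"task": "maze", "complexity": 4, "task_completion": True},
    {"task": "maze", "complexity": 4, "task_completion": True},
    {"task": "maze", "complexity": 6, "task_completion": True},
    {"task": "maze", "complexity": 6, "task_completion": False},
    {"task": "maze", "complexity": 8, "task_completion": False},
]
print(success_by_complexity(records))  # {4: 1.0, 6: 0.5, 8: 0.0}
```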
Process-level and result-level metrics are plotted for symbolic, physical, and knowledge tasks; diagnostic radar charts (Figures 6, 7, and 8) reveal significant gaps in rule compliance and reasoning reliability, especially for multi-step combinatorial and embodied reasoning.
Figure 6: Symbolic reasoning profiling; pronounced gaps in reasoning accuracy and rule-compliance metrics for puzzle-oriented tasks.
Figure 7: Physical reasoning profiling; strong visual quality but weak rule adherence and reasoning accuracy, especially in object assembly and verification.
Figure 8: Knowledge reasoning profiling; factual grounding and causal reasoning remain limiting bottlenecks.
Post-Training Interventions and Generalization
ViGoR-Bench is also leveraged for model improvement via reward-driven RL. Supervised fine-tuning (SFT) induces saturation in validation metrics, whereas RL (GRPO) unlocks further gains, especially on high-complexity OOD tasks. Training on harder instances (e.g., 8×8 mazes) fosters robust logic transfer to easier in-distribution cases, suggesting that complex reasoning data is essential for generalization.
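A minimal sketch of how a GRPO-style reward could be derived from the dual-track metrics is given below; the reward blending, the 0.7/0.3 weights, and the group-normalization details are illustrative assumptions rather than the paper's training recipe.

```python
import numpy as np

def blended_reward(result_pass_rate: float, process_score: float,
                   w_result: float = 0.7, w_process: float = 0.3) -> float:
    """Scalar reward mixing binary result metrics (0-1) and process metrics (0-100).
    The 0.7/0.3 weighting is an illustrative assumption, not from the paper."""
    return w_result * result_pass_rate + w_process * (process_score / 100.0)

def grpo_advantages(rewards):
    """Group-relative advantages: normalize rewards within one prompt's rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four rollouts for the same maze-navigation prompt (made-up scores).
group = [blended_reward(1.00, 85), blended_reward(0.50, 70),
         blended_reward(0.25, 60), blended_reward(0.75, 80)]
print(grpo_advantages(group))  # higher-reward rollouts receive positive advantages
```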

Figure 9: RL fine-tuning performance; RL elevates reasoning metrics above SFT, demonstrating regime unlocking.
Figure 10: Qualitative results on ViGoR-Bench after SFT and RL; RL models achieve higher task completion and rule compliance.
Practical and Theoretical Implications
The findings unequivocally demonstrate that visual generative models, despite architectural and scaling advances, are still deficient in physical reasoning, world-knowledge grounding, and multi-step symbolic manipulation. Perceptual quality, compositionality, and instruction following can be decoupled from genuine reasoning capacity. Dual-track evaluation reveals systematic logical gaps not captured by conventional fidelity metrics.
Practical implications include the need for richer, process-aware benchmarks and evidence-grounded automated judges to steer model development. RL-based post-training—guided by ViGoR-style stress tests—proves critical for overcoming overfitting and saturation, enabling performance gains in reasoning-centric domains.
Theoretically, the results reinforce the paradigm shift from mere visual realism toward cognitive alignment and world modeling. Future research must focus on unified architectures capable of robust symbolic and causal reasoning, scalable data curation for complex tasks, and interpretability in generation chains.
Conclusion
ViGoR-Bench provides a holistic, cross-modal benchmark to diagnose and guide visual generative model development for zero-shot reasoning. The dual-track metrics and granular analysis expose persistent deficits in reasoning capabilities, undetectable by fidelity-centric evaluation. Reward-driven learning and complex data exposure are essential for unlocking generalization. The benchmark establishes an actionable framework for elevating visual intelligence beyond photorealistic synthesis toward robust logical understanding and execution (2603.25823).