Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology (2507.07999v1)

Published 10 Jul 2025 in cs.CV, cs.AI, and cs.CL

Abstract: Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs; even the most advanced models struggle with this benchmark, with none reaching 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.

Summary

  • The paper introduces TreeBench, a diagnostic benchmark that rigorously evaluates visual grounded reasoning using traceable evidence and precise bounding box annotations.
  • The paper proposes TreeVGR, a two-stage training paradigm employing reinforcement learning with dual IoU rewards to improve object localization and reasoning chains.
  • Empirical results show significant performance gains across benchmarks, underscoring the value of integrating visual perception with explainable, traceable reasoning.

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

This paper addresses a critical gap in the evaluation and training of large multimodal models (LMMs) for visual grounded reasoning (VGR), introducing both a new benchmark, TreeBench, and a novel training paradigm, TreeVGR. The work is motivated by the observation that recent LMMs, despite advances in text-based reasoning, exhibit significant limitations in perception-heavy tasks that require precise visual grounding and explainable, traceable reasoning chains. The authors systematically analyze these limitations and propose solutions that advance both the evaluation and methodology of VGR.

TreeBench: A Diagnostic Benchmark for Visual Grounded Reasoning

TreeBench is designed to holistically evaluate the "thinking with images" capability of LMMs. The benchmark is constructed around three core principles:

  1. Focused Visual Perception: Tasks require models to identify subtle targets in complex, cluttered scenes, emphasizing hierarchical scene understanding and discrimination among visually similar distractors.
  2. Traceable Evidence: Each question is annotated with bounding boxes for target instances, enabling quantifiable evaluation of both final answers and intermediate reasoning steps.
  3. Second-Order Reasoning: Beyond object localization, tasks probe object interactions, spatial hierarchies, and perspective transformations, requiring reasoning about relationships and context.

The dataset comprises 405 challenging visual question-answering (VQA) pairs, curated through a rigorous multi-stage pipeline involving LMM-assisted question generation and expert human validation. Images are sampled from SA-1B, prioritizing high object density and real-world complexity. The annotation process ensures high difficulty and correctness, with questions filtered to exclude those easily solved by state-of-the-art models.
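
The summary notes that candidate questions are filtered to exclude those easily solved by state-of-the-art models, but does not spell out the criterion. The following is a minimal sketch of one plausible difficulty filter; the CandidateQuestion structure, the model wrappers, and the correctness threshold are illustrative assumptions, not the authors' actual pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CandidateQuestion:
    """Hypothetical container for one LMM-annotated VQA candidate."""
    image_path: str
    question: str
    options: List[str]
    answer: str  # ground-truth option letter, e.g. "B"
    gt_boxes: List[List[float]] = field(default_factory=list)  # [x1, y1, x2, y2]


def is_too_easy(
    candidate: CandidateQuestion,
    models: List[Callable[[CandidateQuestion], str]],
    max_correct: int = 1,
) -> bool:
    """Flag a candidate as too easy if more than `max_correct` reference
    models already answer it correctly (the threshold is an assumption)."""
    n_correct = sum(model(candidate) == candidate.answer for model in models)
    return n_correct > max_correct


def filter_benchmark(
    candidates: List[CandidateQuestion],
    models: List[Callable[[CandidateQuestion], str]],
) -> List[CandidateQuestion]:
    """Keep only questions that remain challenging for the reference models."""
    return [c for c in candidates if not is_too_easy(c, models)]
```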

Key properties of TreeBench:

  • Small Target Objects: The average target occupies only 3.05% of the image area, increasing the challenge for visual localization.
  • Traceable Evaluation: Bounding box annotations allow for detailed error analysis, distinguishing failures of localization from failures of reasoning (see the sketch after this list).
  • Task Difficulty: No evaluated model, including OpenAI-o3, exceeds 60% accuracy, with Qwen2.5-VL-72B achieving only 42.2%, indicating substantial headroom for future research.
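
Because every question ships with ground-truth target boxes, an evaluator can score a model's predicted evidence boxes alongside its answer and attribute errors to localization or to reasoning. The sketch below illustrates one such attribution; the IoU threshold and the category names are assumptions rather than TreeBench's official protocol.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0


def attribute_error(pred_answer, gt_answer, pred_boxes, gt_boxes, iou_thr=0.5):
    """Classify one prediction from answer correctness and evidence quality.

    Evidence counts as "grounded" if every ground-truth box is matched by some
    predicted box with IoU above `iou_thr` (the threshold is an assumption).
    """
    grounded = all(
        max((iou(p, g) for p in pred_boxes), default=0.0) >= iou_thr
        for g in gt_boxes
    )
    correct = pred_answer == gt_answer
    if correct and grounded:
        return "correct_and_grounded"
    if correct and not grounded:
        return "correct_but_ungrounded"   # possibly right for the wrong reason
    if not correct and grounded:
        return "reasoning_failure"        # found the evidence, misused it
    return "localization_failure"         # never found the relevant regions
```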

The benchmark is structured into two protocols: "Perception" (attributes, material, physical state, object retrieval, OCR) and "Reasoning" (perspective transform, ordering, contact/occlusion, spatial containment, comparison), with a deliberate emphasis on higher-order reasoning.

TreeVGR: Reinforcement Learning with Traceable Evidence

TreeVGR is a two-stage training paradigm for VGR, explicitly supervising both localization and reasoning. The methodology is as follows:

  1. Cold-Start Initialization: The model is first fine-tuned on a curated dataset with multimodal samples, including images, questions, reasoning trajectories, bounding boxes, and answers. This stage establishes the model's ability to output bounding boxes and structure reasoning chains.
  2. Reinforcement Learning with Dual IoU Reward: The model is further trained using RL, where the reward function combines:
    • Accuracy Reward: For correct final answers.
    • Formatting Reward: For proper output structure.
    • Dual IoU Reward: A novel metric that averages recall and precision of predicted bounding boxes against ground truth, ensuring both complete and precise localization.

This dual IoU reward is critical: recall encourages coverage of all ground-truth boxes, while precision penalizes spurious or redundant predictions. The approach also avoids the image-cropping replay used in prior work, instead grounding evidence directly in text space, which reduces computational overhead.
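
Concretely, the dual IoU reward can be read as a symmetric matching score: a recall term averages, over ground-truth boxes, the best IoU achieved by any predicted box, while a precision term averages, over predicted boxes, the best IoU against any ground-truth box. The sketch below follows that description; the weighting against the accuracy and formatting rewards is an illustrative assumption, not the paper's reported coefficients.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0


def dual_iou_reward(pred_boxes, gt_boxes):
    """Average a recall term (cover every ground-truth box) and a precision
    term (every predicted box should match some ground-truth box)."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    recall = sum(max(iou(p, g) for p in pred_boxes) for g in gt_boxes) / len(gt_boxes)
    precision = sum(max(iou(p, g) for g in gt_boxes) for p in pred_boxes) / len(pred_boxes)
    return 0.5 * (recall + precision)


def total_reward(answer_correct, format_ok, pred_boxes, gt_boxes,
                 w_acc=1.0, w_fmt=0.5, w_iou=1.0):
    """Combine accuracy, formatting, and dual IoU terms; the weights here are
    assumptions for illustration only."""
    return (w_acc * float(answer_correct)
            + w_fmt * float(format_ok)
            + w_iou * dual_iou_reward(pred_boxes, gt_boxes))
```

Under this formulation, enumerating many spurious boxes drives the precision term toward zero, while omitting a required box caps the recall term, which is consistent with the ablation finding that removing either term degrades localization.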

Empirical results demonstrate:

  • TreeVGR-7B, initialized from Qwen2.5-VL-7B, achieves substantial improvements over baselines: +16.8 on V* Bench, +12.6 on MME-RealWorld, and +13.4 on TreeBench.
  • The model achieves higher mIoU and overall accuracy than prior VGR models (e.g., DeepEyes, Pixel-Reasoner), with performance comparable to much larger models (e.g., InternVL3-78B).
  • Ablation studies confirm the necessity of both recall and precision terms in the reward function; omitting either degrades localization or leads to degenerate behaviors (e.g., excessive box enumeration).

Numerical Results and Claims

  • TreeBench: OpenAI-o3 achieves 54.87% accuracy; Qwen2.5-VL-72B achieves 42.2%; TreeVGR-7B achieves 50.4% and the highest mIoU among open-source models.
  • V* Bench: TreeVGR-7B achieves 91.1% accuracy, outperforming other open-source models and approaching the performance of proprietary models.
  • MME-RealWorld-Lite: TreeVGR-7B achieves 54.9%, a +12.6 improvement over its base model.
  • Generalization: TreeVGR-7B shows consistent improvements on vision-centric and general VQA benchmarks, with the largest gains on tasks requiring precise visual grounding.

Implications and Future Directions

Practical Implications:

  • Evaluation: TreeBench provides a rigorous, traceable benchmark for diagnosing VGR capabilities, enabling fine-grained error analysis and progress tracking.
  • Training Methodology: TreeVGR demonstrates that explicit supervision of reasoning chains and localization, via RL with dual IoU rewards, yields more interpretable and robust VGR models.
  • Efficiency: The text-space grounding approach in TreeVGR reduces computational requirements compared to image-cropping-based methods, facilitating broader adoption.

Theoretical Implications:

  • The positive correlation between localization precision (mIoU) and VQA performance, especially on perception tasks, highlights the centrality of grounding in multimodal reasoning.
  • The decoupling of performance between TreeBench and other multimodal benchmarks suggests that "thinking with images" is a distinct capability not captured by existing evaluations.

Limitations and Future Work:

  • TreeVGR is currently demonstrated on a 7B parameter model; scaling to larger architectures may further improve performance.
  • TreeBench, while high-quality, is limited to 405 questions; expanding its scope and diversity will be important for comprehensive evaluation.
  • Further research is needed to close the gap between current model performance and human-level reasoning on traceable, complex visual tasks.

Conclusion

This work establishes new standards for both the evaluation and training of visual grounded reasoning in LMMs. TreeBench enables rigorous, traceable assessment of "thinking with images," while TreeVGR provides an effective, efficient methodology for training models to reason with explicit visual evidence. The results underscore the importance of traceability and explainability in advancing multimodal AI, and the proposed approaches offer a blueprint for future research in this domain.
