
TreeBench: Traceable Evidence Evaluation Benchmark

Updated 14 July 2025
  • TreeBench is a benchmark designed to evaluate multimodal AI by testing precise localization of subtle objects amid complex visual environments.
  • It requires models to provide traceable evidence through explicit bounding boxes, enhancing transparency and diagnostic capabilities.
  • The benchmark incorporates advanced, second-order spatial reasoning tasks that diagnose both perceptual and reasoning challenges in AI systems.

TreeBench (Traceable Evidence Evaluation Benchmark) is a diagnostic, high-difficulty benchmark explicitly designed to evaluate visual grounded reasoning in multimodal artificial intelligence models. Its methodology focuses on testing systems’ ability to precisely localize subtle objects, provide traceable intermediate evidence, and perform complex second-order spatial reasoning in densely populated, real-world visual environments (Wang et al., 10 Jul 2025).

1. Foundational Principles and Motivation

TreeBench is founded on three central principles that shape both its construction and evaluation goals:

  1. Focused Visual Perception: Models are challenged to identify and differentiate extremely small and subtle target objects embedded amid visually complex scenes. The task demands high spatial acuity, sensitivity to minor attributes, and the ability to handle clutter and overlap.
  2. Traceable Evidence via Localization: Each model response must include explicit bounding boxes indicating the locations of all predicted target objects. This requirement transforms localization into an inspectable, quantifiable form of intermediate evidence, supporting the analysis of reasoning failures and increasing the transparency of the model’s process.
  3. Vision-Centric Second-Order Reasoning: In addition to detection, TreeBench incorporates tasks that assess reasoning about object relations, ordering, containment, perspective transformations, and spatial hierarchies. This dimension tests the model’s ability to perform higher-order operations on the spatial and semantic properties of objects rather than mere recognition.

The integration of these three pillars enables TreeBench to evaluate both the perceptual and reasoning capacities of large multimodal models (LMMs) in a robust, evidence-driven manner (Wang et al., 10 Jul 2025).

2. Dataset Construction and Annotation Workflow

TreeBench’s dataset is generated through a multi-phased, expert-supervised process to ensure the challenge and reliability of each sample:

  • Source Image Selection: 1,000 high-quality images are sampled from SA-1B, each intentionally chosen for high object density and visual complexity.
  • Question and Answer Formulation: For each image, state-of-the-art LMMs, specifically OpenAI-o3 and Gemini-2.5-Pro, are prompted to create three unique question–option–answer trios per image.
  • Expert Annotation and Quality Control:

    1. Eight domain experts review and refine the generated questions for relevance, clarity, and task alignment.
    2. Any question that is trivial (readily answered correctly by multiple top models) is discarded, focusing the dataset on difficult cases; a minimal sketch of this filtering step follows the list below.
    3. A final cross-validation pass reviews all entries to maximize correctness and eliminate ambiguity.

  • Final Dataset Statistics: The resulting TreeBench set contains 405 carefully vetted visual question–answer pairs. The design emphasizes difficult localization: on average, target objects occupy only 3.05% of the image area.
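The trivial-question filter in step 2 above can be viewed as a consensus check across strong reference models: a candidate question survives only if it is not answered correctly by several of them. The sketch below is a minimal illustration under assumed interfaces; the `ask_model` helper, the model list, and the consensus threshold are hypothetical placeholders, not an API described in the source.

```python
# Hypothetical sketch of the difficulty filter used during annotation.
# A candidate question is kept only if it is NOT answered correctly by
# several strong reference models (i.e., it is non-trivial).
# `ask_model` and the consensus threshold are assumed placeholders, not a real API.
from typing import Callable, Dict, List

def is_trivial(question: Dict, reference_models: List[str],
               ask_model: Callable[[str, Dict], str], consensus: int = 2) -> bool:
    """Return True if at least `consensus` reference models answer correctly."""
    correct = 0
    for model in reference_models:
        predicted = ask_model(model, question)   # model's chosen option
        if predicted == question["answer"]:      # compare against the ground-truth option
            correct += 1
            if correct >= consensus:
                return True
    return False

def filter_candidates(candidates: List[Dict], reference_models: List[str],
                      ask_model: Callable[[str, Dict], str]) -> List[Dict]:
    """Keep only non-trivial questions for subsequent expert cross-validation."""
    return [q for q in candidates if not is_trivial(q, reference_models, ask_model)]
```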

This annotation protocol enhances both the validity and traceability of the resulting benchmark tasks (Wang et al., 10 Jul 2025).

3. Task Taxonomy and Evaluation Methods

TreeBench organizes its questions and evaluation around a fine-grained typology of visual reasoning subdomains:

  • Question Categories:
    • Pure perception (attributes, material, physical state identification)
    • Object retrieval and OCR-integrated queries
    • Vision-centric reasoning (ordering, spatial containment, perspective transformation, relational/comparison reasoning, temporal or causal interactions)
  • Traceable Evidence Metrics: The primary evidence output is bounding box localization, scored using mean Intersection-over-Union (mIoU):

$$R_{\text{IoU}} = \frac{1}{2}\left(R_{\text{IoU}}^{R} + R_{\text{IoU}}^{P}\right),$$

with

$$R_{\text{IoU}}^{R} = \frac{1}{M} \sum_{k=1}^{M} \max_i \, \text{IoU}(\hat{b}_i, b_k), \qquad R_{\text{IoU}}^{P} = \frac{1}{N} \sum_{i=1}^{N} \max_k \, \text{IoU}(b_k, \hat{b}_i),$$

where $N$ is the number of predicted boxes $\{\hat{b}_i\}$ and $M$ is the number of ground-truth boxes $\{b_k\}$. This design penalizes missed targets through the recall term and rewards precise predicted boxes through the precision term; a minimal computation sketch of this score follows the list below.

  • Overall Accuracy: A response is considered correct only if both the answer and the corresponding bounding box(es) meet stringent alignment criteria.
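The dual IoU score defined above can be computed directly from predicted and ground-truth boxes. The following is a minimal sketch assuming axis-aligned boxes in `(x1, y1, x2, y2)` format; it illustrates the formula and is not the benchmark's official evaluation script.

```python
# Minimal sketch of the dual IoU metric defined above.
# Boxes are assumed to be (x1, y1, x2, y2) tuples; illustrative only.
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def iou(a: Box, b: Box) -> float:
    """Standard Intersection-over-Union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def dual_iou(pred: List[Box], gt: List[Box]) -> float:
    """R_IoU = 0.5 * (recall term over ground truth + precision term over predictions)."""
    if not pred or not gt:
        return 0.0
    r_recall = sum(max(iou(p, g) for p in pred) for g in gt) / len(gt)        # R_IoU^R
    r_precision = sum(max(iou(p, g) for g in gt) for p in pred) / len(pred)   # R_IoU^P
    return 0.5 * (r_recall + r_precision)
```

For instance, a single prediction that exactly matches the single ground-truth box scores 1.0, while adding a spurious second prediction lowers the precision term and hence the overall score.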

Evaluation with these metrics enables targeted diagnosis of model failures by separating localization errors from reasoning errors (Wang et al., 10 Jul 2025).

4. Benchmark Results and Model Analysis

TreeBench evaluation results reveal the difficulty of its tasks:

  • Top Model Performance: Leading LMMs, including OpenAI-o3, attain an accuracy of only 54.87%, with no model exceeding 60%. By comparison, Qwen2.5-VL-72B attains 42.2%.
  • Localization Quality: Histograms of mIoU illustrate that correct answers correlate strongly with high overlap in localization, especially for tasks focused on object retrieval or basic attributes, whereas higher-order reasoning tasks display a more complex error profile.
  • Subtask Breakdown: 63% of questions emphasize high-level reasoning, with the remainder focusing on direct perception. Failure cases frequently stem from confusion in spatial or relational reasoning rather than baseline detection failures.

This performance gap highlights the diagnostic power of TreeBench in revealing quantitative and qualitative weaknesses in current visual reasoning systems (Wang et al., 10 Jul 2025).

5. TreeVGR: Traceable Evidence Enhanced Reasoning Methodology

Building on the demands revealed by TreeBench, the accompanying work introduces TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a novel training method designed to improve both answer accuracy and evidence traceability:

  • Training Stages:

    1. Supervised Fine-Tuning (SFT) on curated datasets with explicit reasoning trajectories and bounding box annotations.
    2. Reinforcement Learning (RL) that rewards correct answers, output format compliance, and the dual IoU metric detailed above; a minimal sketch of such a composite reward follows this list.

  • Optimization: The RL phase employs Group Relative Policy Optimization (GRPO) to stabilize and accelerate convergence while maximizing reward on both reasoning and localization.
  • Empirical Outcomes: TreeVGR-7B outperforms predecessor models by +13.4 accuracy on TreeBench, +16.8 on V* Bench, and +12.6 on MME-RealWorld-Lite, confirming the critical role of explicit traceable evidence in model improvement.
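The RL reward described above combines answer correctness, format compliance, and the dual IoU localization score into a single scalar per rollout. The sketch below is an illustrative composition reusing the `dual_iou` helper from Section 3; the weighting coefficients are assumptions for illustration, not values reported in the source.

```python
# Illustrative composite reward for the RL stage of TreeVGR-style training.
# Combines answer accuracy, output-format compliance, and the dual IoU
# localization score; the weights are assumptions for illustration only.
def composite_reward(pred_answer: str, gt_answer: str, format_ok: bool,
                     pred_boxes, gt_boxes,
                     w_acc: float = 1.0, w_fmt: float = 0.5, w_loc: float = 1.0) -> float:
    """Scalar reward for one rollout, e.g., inside GRPO-style group-relative scoring."""
    r_acc = 1.0 if pred_answer == gt_answer else 0.0   # answer correctness
    r_fmt = 1.0 if format_ok else 0.0                  # boxes emitted in the required output format
    r_loc = dual_iou(pred_boxes, gt_boxes)             # traceable-evidence localization term
    return w_acc * r_acc + w_fmt * r_fmt + w_loc * r_loc
```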

This approach demonstrates the practical value of using benchmarks like TreeBench not only for evaluation but also as a structural guide for developing explainable, evidence-driven multimodal models (Wang et al., 10 Jul 2025).

6. Implications for Vision-Grounded Reasoning and Future Work

TreeBench, together with TreeVGR, provides a framework for aligning model outputs with interpretable reasoning chains:

  • Transparency and Accountability: Requiring explicit localization as evidence allows direct tracing of outputs, simplifying the identification of the reasoning stage at which errors occur (e.g., misperception vs. misunderstanding spatial hierarchy).
  • Benchmarking Standards: By formally integrating evidence outputs (bounding boxes) and reasoning analysis, TreeBench sets a new standard for evaluating multimodal model explainability.
  • Future Development: The benchmark’s methodology suggests future extensions incorporating larger datasets, more diverse visual domains, or richer evidence structures (e.g., segmentation masks or scene graphs) to further refine diagnosis and evaluation in visual reasoning.

TreeBench thus serves as both a diagnostic tool and a methodological reference for advancing traceable, explainable, and high-performance visual grounded reasoning systems in artificial intelligence research (Wang et al., 10 Jul 2025).
