TIR-Bench: Agentic Visual Reasoning Benchmark

Updated 5 November 2025
  • TIR-Bench is a benchmark suite that rigorously measures multimodal AI models' ability to perform dynamic, code-driven image manipulations in agentic reasoning tasks.
  • It includes 13 distinct task categories such as color VQA and maze solving, requiring multi-step tool use and external code invocation for image analysis.
  • Empirical results reveal significant performance gaps between agentic and non-agentic models, underscoring the need for tool-driven architectures.

TIR-Bench is a comprehensive benchmark suite introduced to rigorously evaluate the capability of multimodal LLMs (MLLMs) and related AI systems in “agentic thinking-with-images” reasoning: the ability to programmatically manipulate, transform, and analyze images while reasoning in a multi-step, chain-of-thought (CoT) style using external tools or code (Li et al., 3 Nov 2025). Positioned as a successor to, and substantial expansion of, prior visual reasoning and visual search benchmarks, TIR-Bench incorporates a broader range of algorithmic, perceptual, and compositional visual reasoning tasks, constructed to require explicit tool use, external code invocation, and dynamic image manipulation rather than passive inspection.

1. Motivation and Conceptual Foundations

TIR-Bench was motivated by the emergence of next-generation MLLMs, such as OpenAI’s o3, which exemplify agentic thinking-with-images—the ability to not only inspect but actively operate on images (e.g., cropping, rotating, drawing, segmenting) through code or agentic tool-use as an intrinsic part of reasoning. Existing benchmarks (e.g., Visual Search, V* Bench, HR-Bench) are limited in scope, focusing on region localization or high-resolution search but lacking tests of composite tool-driven reasoning chains or multi-step image manipulation processes. This limitation impedes clear assessment of the advances enabled by agentic architectures and system augmentations. TIR-Bench was thus designed to fill this gap, targeting the robust measurement of these emerging, more powerful capabilities.

2. Task Suite Composition and Agentic Structure

TIR-Bench encompasses 13 task categories, totaling 1,215 examples, each targeting a distinct modality of agentic visual reasoning. The tasks require non-trivial tool use, multi-step programmatic actions, and deterministic outputs for rigorous, objective assessment.

| Task Category | Example Requirement | Agentic Operation Example |
|---|---|---|
| Color VQA | Quantify image color ratios | Python code to count/color pixels |
| Proportion VQA | Segment & compute area proportions | Semantic segmentation + area calc |
| Rotated OCR | Read text in rotated image | Rotate via code, then OCR |
| Symbolic Reasoning | Graph/polygon property counting | Structural feature extraction |
| Maze | Solve path in maze image | Pathfinding & overlay via drawing |
| Math | Problem involving geometric construction | Drawing auxiliary lines |
| Word Search | Find anomaly in grid | Iterative pixel/character compare |
| Low-Light VQA | Interpret dim image | Enhance image brightness/contrast |
| Instrument Reading | Read measurement from instrument image | Sequential crop/zoom operations |
| Spot the Difference | Detect differences between two images | Pixel-wise image diffing |
| Jigsaw Puzzle | Reassemble shuffled pieces | Iterative matching and reordering |
| Visual Search | High-res localization (multi-turn) | Crop/zoom sequence |
| Rotation Game | Reorient image to upright | Iterative rotation, validation |

All tasks require agentic tool-induced manipulation of images; most cannot be solved through static visual inspection alone.
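
To make the table concrete, the following is a minimal sketch of the kind of tool call an agentic model might emit for a Color VQA item; the file name, color definition, and threshold are illustrative assumptions rather than part of the benchmark.

```python
# Illustrative sketch: quantify the fraction of predominantly red pixels in an
# image, the sort of Python tool call an agentic model might generate for a
# Color VQA item. File name, color definition, and threshold are assumptions.
import numpy as np
from PIL import Image

def red_pixel_ratio(path: str, margin: int = 60) -> float:
    """Return the fraction of pixels whose red channel clearly dominates."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.int16)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    red_mask = (r - np.maximum(g, b)) > margin
    return float(red_mask.mean())

# The model would execute this, read the resulting ratio, and continue reasoning:
# print(f"red pixel ratio: {red_pixel_ratio('question_image.png'):.3f}")
```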

3. Comparison with Prior Benchmarks

The most common predecessor task—visual search—primarily evaluates a model’s ability to localize and crop regions of interest in high-resolution images, using iterative zoom or search routines. While challenging, these settings only probe a narrow band of agentic skills.

TIR-Bench extends well beyond this scope by:

  • Incorporating algorithmic tasks (e.g., maze solving, jigsaw assembly), programmatic image transformations (e.g., denoise, rotate, enhance), and compositional operations (e.g., segment-then-measure-then-analyze).
  • Forcing multi-stage processing chains and feedback loops (e.g., manipulate → re-inject manipulated image → further analysis), which prior benchmarks do not test.
  • Ensuring that most problems are unsolvable by conventional LLMs or MLLMs without active tool use—even massive parameter scaling fails to close the gap, as quantified in evaluation (see below).

This suggests that TIR-Bench addresses a much broader cognitive and operational space relevant to advanced AI systems than prior visual reasoning benchmarks.

4. Evaluation Methodology, Metrics, and Implementation

TIR-Bench supports automated, deterministic evaluation using reproducible metrics:

  • Accuracy for single-choice, multiple-choice, and direct numerical answers:

\text{Accuracy} = \frac{\text{\# correct predictions}}{\text{Total questions}}

  • Intersection over Union (IoU) for list/grounding outputs (e.g., spot the difference, jigsaw permutation correctness):

\text{IoU} = \frac{|A_{\mathrm{pred}} \cap A_{\mathrm{gt}}|}{|A_{\mathrm{pred}} \cup A_{\mathrm{gt}}|}

where A_{\mathrm{pred}} and A_{\mathrm{gt}} are the predicted and ground-truth sets.
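
Both metrics are deterministic and simple to compute; the sketch below shows one possible implementation, where the answer normalization and I/O conventions are assumptions rather than the benchmark's official harness.

```python
# Illustrative scoring helpers for the two metrics above; the exact answer
# normalization and I/O format of the official harness may differ.
from typing import Hashable, Iterable, Sequence

def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Accuracy = (# correct predictions) / (total questions)."""
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)

def iou(pred: Iterable[Hashable], gt: Iterable[Hashable]) -> float:
    """IoU = |A_pred ∩ A_gt| / |A_pred ∪ A_gt| for set-valued outputs."""
    pred_set, gt_set = set(pred), set(gt)
    union = pred_set | gt_set
    return len(pred_set & gt_set) / len(union) if union else 1.0

# Example: a spot-the-difference prediction scored against the ground truth.
# iou({"r1", "r3"}, {"r1", "r2", "r3"}) == 2/3
```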

22 state-of-the-art models were benchmarked: 11 open-source MLLMs (LLaVA-family, Qwen2.5-VL, InternVL3), 7 proprietary MLLMs (GPT-4.1, GPT-4o, Gemini-2.5 series, Grok-4), and 4 explicitly tool-using/agentic models (DeepEyes, PyVision, o3-TU, o4-mini-TU). Agentic variants were given access to a code interpreter (e.g., Python tool use), enabling dynamic, multi-turn interaction with visual data.

A zero-shot evaluation protocol was followed: models received only the TIR-Bench task inputs (images and questions), with no task-specific fine-tuning.

5. Empirical Model Performance and Diagnostic Patterns

TIR-Bench proved universally challenging, decisively distinguishing agentic from non-agentic systems:

| Model Class | Average Accuracy (%) | Notable Agentic-score Gaps |
|---|---|---|
| o3-TU (agentic) | 46 | +19% over non-agentic o3 |
| Gemini-2.5-pro | 28.9 | +15% over random baseline |
| Best open-source | 17–21 | Little gain over baseline |
| Random baseline | 13.5 | |

Key findings:

  • Only tool-using/agentic models (e.g., o3-TU, PyVision) achieve substantial performance. o3-TU outperforms non-tool o3 by 19% absolute accuracy.
  • Non-agentic MLLMs (including those with 70B+ parameters) saturate at or slightly above random guess, irrespective of model scale or training corpus size.
  • Performance splits are task-dependent: o3-TU achieves 64% on word search (vs. 4% for o3) and 77.3% vs. 33.3% on the rotation game; standard MLLMs are essentially unable to attempt such tasks.

A plausible implication is that TIR-Bench provides a well-calibrated diagnostic for truly agentic thinking-with-images competence, exposing categorical limitations of non-agentic architectures.

6. Agentic Thinking-with-Images: Measurement, Significance, and Learning Dynamics

TIR-Bench is predicated on the definition that "agentic thinking-with-images" is the systematic use of tool-executed image manipulations as integral steps in a reasoning chain. TIR-Bench tasks enforce this by:

  • Requiring models to generate code (typically Python) for image processing (e.g., segment, rotate, enhance, draw), chain these transformations, and feed manipulated images recursively into further steps (see the sketch after this list).
  • Making purely linguistic or static vision models fail by design—success is only feasible via iterative, compositional image operations and adaptive feedback.
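
The control flow this implies is a loop of manipulate, re-inject, and judge. The sketch below illustrates it for the rotation-game task under stated assumptions: the `judge` callable stands in for the MLLM's own inspection of each re-injected image and is a placeholder, not part of TIR-Bench.

```python
# Hedged sketch of the manipulate -> re-inject -> analyze loop, using the
# rotation-game task as an example. `judge` stands in for the MLLM's own
# inspection of each re-injected image; it is a placeholder, not a TIR-Bench API.
from typing import Callable
from PIL import Image

def solve_rotation_game(path: str, judge: Callable[[Image.Image], bool]) -> int:
    """Try candidate rotations until `judge` accepts the transformed image."""
    original = Image.open(path)
    for angle in (0, 90, 180, 270):             # candidate tool actions
        candidate = original.rotate(angle, expand=True)
        candidate.save(f"rotated_{angle}.png")  # image re-injected into the chain
        if judge(candidate):                    # model inspects the new image
            return angle
    return 0                                    # fall back to no rotation
```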

In a targeted supervised fine-tuning (SFT) analysis (Rotated Image OCR task), agentic/tool-use SFT was shown to dominate direct-output SFT: agentic SFT accuracy rises monotonically with more data, whereas direct SFT saturates and exhibits unstable learning curves. This suggests that agentic SFT enables models to learn compositional routines with generalizable structure, in contrast to the “flat” mapping learned by direct answer prediction.
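
To make the contrast concrete, a hypothetical pair of training targets for the Rotated Image OCR task might look as follows; the field names, trajectory structure, and example answer are illustrative assumptions, not the paper's actual data schema.

```python
# Hypothetical contrast between the two SFT target formats discussed above.
# Field names, trajectory structure, and the example answer are illustrative
# assumptions, not the paper's actual data schema.

# Direct-output SFT: map the rotated image straight to the transcription.
direct_example = {
    "image": "rotated_sign.png",
    "target": "EXIT 12",
}

# Agentic/tool-use SFT: the target is a short trajectory in which the model
# first emits code to correct the rotation, then answers from the new image.
agentic_example = {
    "image": "rotated_sign.png",
    "target": [
        {"tool_code": "from PIL import Image; "
                      "Image.open('rotated_sign.png')"
                      ".rotate(-90, expand=True).save('upright.png')"},
        {"observation": "upright.png"},
        {"answer": "EXIT 12"},
    ],
}
```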

7. Implications, Significance, and Future Benchmarking Directions

TIR-Bench constitutes a pivotal advance in the measurement of agentic, visually-empowered AI. The gap in performance between agentic and non-agentic models is not incremental: tool use unlocks entire categories of tasks otherwise inaccessible, supporting the claim that agentic reasoning is a qualitative, not quantitative, paradigm shift.

TIR-Bench enables:

  • Granular, task-wise diagnosis of multimodal model capabilities and deficits.
  • Objective charting of progress in agentic AI via deterministic, reproducible scoring.
  • Benchmark-driven guidance for architectural and fine-tuning strategy design (agentic tool-use SFT is singled out as robust and superior to direct answer learning for multi-step visual tasks).

Limitations include the dependence of TIR-Bench's scoring on verifiable visual ground truth and deterministic, code-executable manipulations; the current tasks, while diverse, leave room for still wider agentic reasoning challenges in future benchmark iterations.


In summary, TIR-Bench is a comprehensive, high-precision benchmark that decisively tests and quantifies agentic thinking-with-images capabilities in AI, systematically demonstrating the inadequacies of non-agentic models and documenting the decisive performance gains made possible by explicit tool-use and multi-step visual reasoning (Li et al., 3 Nov 2025).
