TIR-Bench: Agentic Visual Reasoning Benchmark

Updated 5 November 2025
  • TIR-Bench is a benchmark suite that rigorously measures multimodal AI models' ability to perform dynamic, code-driven image manipulations in agentic reasoning tasks.
  • It includes 13 distinct task categories such as color VQA and maze solving, requiring multi-step tool use and external code invocation for image analysis.
  • Empirical results reveal significant performance gaps between agentic and non-agentic models, underscoring the need for tool-driven architectures.

TIR-Bench is a comprehensive benchmark suite introduced to rigorously evaluate the capability of multimodal LLMs (MLLMs) and related AI systems in “agentic thinking-with-images” reasoning: the ability to programmatically manipulate, transform, and analyze images while reasoning in a multi-step, chain-of-thought (CoT) style using external tools or code (Li et al., 3 Nov 2025). Positioned as a successor to, and substantial expansion of, prior visual reasoning and visual search benchmarks, TIR-Bench incorporates a broader range of algorithmic, perceptual, and compositional visual reasoning tasks, constructed to require explicit tool use, external code invocation, and dynamic image manipulation rather than passive inspection.

1. Motivation and Conceptual Foundations

TIR-Bench was motivated by the emergence of next-generation MLLMs, such as OpenAI’s o3, which exemplify agentic thinking-with-images—the ability to not only inspect but actively operate on images (e.g., cropping, rotating, drawing, segmenting) through code or agentic tool-use as an intrinsic part of reasoning. Existing benchmarks (e.g., Visual Search, V* Bench, HR-Bench) are limited in scope, focusing on region localization or high-resolution search but lacking tests of composite tool-driven reasoning chains or multi-step image manipulation processes. This limitation impedes clear assessment of the advances enabled by agentic architectures and system augmentations. TIR-Bench was thus designed to fill this gap, targeting the robust measurement of these emerging, more powerful capabilities.

2. Task Suite Composition and Agentic Structure

TIR-Bench encompasses 13 task categories, totaling 1,215 examples, each targeting a distinct modality of agentic visual reasoning. The tasks require non-trivial tool use, multi-step programmatic actions, and deterministic outputs for rigorous, objective assessment.

| Task Category | Example Requirement | Agentic Operation Example |
|---|---|---|
| Color VQA | Quantify image color ratios | Python code to count/color pixels |
| Proportion VQA | Segment & compute area proportions | Semantic segmentation + area calc |
| Rotated OCR | Read text in rotated image | Rotate via code, then OCR |
| Symbolic Reasoning | Graph/polygon property counting | Structural feature extraction |
| Maze | Solve path in maze image | Pathfinding & overlay via drawing |
| Math | Problem involving geometric construction | Drawing auxiliary lines |
| Word Search | Find anomaly in grid | Iterative pixel/character compare |
| Low-Light VQA | Interpret dim image | Enhance image brightness/contrast |
| Instrument Reading | Read measurement from instrument image | Sequential crop/zoom operations |
| Spot the Difference | Detect differences between two images | Pixel-wise image diffing |
| Jigsaw Puzzle | Reassemble shuffled pieces | Iterative matching and reordering |
| Visual Search | High-res localization (multi-turn) | Crop/zoom sequence |
| Rotation Game | Reorient image to upright | Iterative rotation, validation |

All tasks require agentic tool-induced manipulation of images; most cannot be solved through static visual inspection alone.
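
To make the table concrete, the following is a minimal sketch of the kind of tool call an agentic model might emit for a Color VQA item; the file name, color definition, and threshold are illustrative assumptions rather than part of the benchmark.

```python
# Illustrative sketch: quantify the fraction of predominantly red pixels in an
# image, the sort of Python tool call an agentic model might generate for a
# Color VQA item. File name, color definition, and threshold are assumptions.
import numpy as np
from PIL import Image

def red_pixel_ratio(path: str, margin: int = 60) -> float:
    """Return the fraction of pixels whose red channel clearly dominates."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.int16)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    red_mask = (r - np.maximum(g, b)) > margin
    return float(red_mask.mean())

# The model would execute this, read the resulting ratio, and continue reasoning:
# print(f"red pixel ratio: {red_pixel_ratio('question_image.png'):.3f}")
```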

3. Comparison with Prior Benchmarks

The most common predecessor task—visual search—primarily evaluates a model’s ability to localize and crop regions of interest in high-resolution images, using iterative zoom or search routines. While challenging, these settings only probe a narrow band of agentic skills.

TIR-Bench extends well beyond this scope by:

  • Incorporating algorithmic tasks (e.g., maze solving, jigsaw assembly), programmatic image transformations (e.g., denoise, rotate, enhance), and compositional operations (e.g., segment-then-measure-then-analyze).
  • Forcing multi-stage processing chains and feedback loops (e.g., manipulate → re-inject manipulated image → further analysis), which prior benchmarks do not test.
  • Ensuring that most problems are unsolvable by conventional LLMs or MLLMs without active tool use—even massive parameter scaling fails to close the gap, as quantified in evaluation (see below).

This suggests that TIR-Bench addresses a much broader cognitive and operational space relevant to advanced AI systems than prior visual reasoning benchmarks.

4. Evaluation Methodology, Metrics, and Implementation

TIR-Bench supports automated, deterministic evaluation using reproducible metrics:

  • Accuracy for single-choice, multiple-choice, and direct numerical answers:

\text{Accuracy} = \frac{\text{\# correct predictions}}{\text{Total questions}}

  • Intersection over Union (IoU) for list/grounding outputs (e.g., spot the difference, jigsaw permutation correctness):

\text{IoU} = \frac{|A_{\mathrm{pred}} \cap A_{\mathrm{gt}}|}{|A_{\mathrm{pred}} \cup A_{\mathrm{gt}}|}

where A_{\mathrm{pred}} and A_{\mathrm{gt}} are the predicted and ground-truth sets.
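
Both metrics are deterministic and simple to compute; the sketch below shows one possible implementation, where the answer normalization and I/O conventions are assumptions rather than the benchmark's official harness.

```python
# Illustrative scoring helpers for the two metrics above; the exact answer
# normalization and I/O format of the official harness may differ.
from typing import Hashable, Iterable, Sequence

def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Accuracy = (# correct predictions) / (total questions)."""
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)

def iou(pred: Iterable[Hashable], gt: Iterable[Hashable]) -> float:
    """IoU = |A_pred ∩ A_gt| / |A_pred ∪ A_gt| for set-valued outputs."""
    pred_set, gt_set = set(pred), set(gt)
    union = pred_set | gt_set
    return len(pred_set & gt_set) / len(union) if union else 1.0

# Example: a spot-the-difference prediction scored against the ground truth.
# iou({"r1", "r3"}, {"r1", "r2", "r3"}) == 2/3
```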

22 state-of-the-art models were benchmarked: 11 open-source MLLMs (LLaVA-family, Qwen2.5-VL, InternVL3), 7 proprietary MLLMs (GPT-4.1, GPT-4o, Gemini-2.5 series, Grok-4), and 4 explicitly tool-using/agentic models (DeepEyes, PyVision, o3-TU, o4-mini-TU). Agentic variants were given access to a code interpreter (e.g., Python tool use), enabling dynamic, multi-turn interaction with visual data.

A zero-shot evaluation protocol was followed: models received only the TIR-Bench task inputs (images and questions), with no task-specific fine-tuning.

5. Empirical Model Performance and Diagnostic Patterns

TIR-Bench proved universally challenging, decisively distinguishing agentic from non-agentic systems:

| Model Class | Average Accuracy (%) | Notable Agentic-score Gaps |
|---|---|---|
| o3-TU (agentic) | 46 | +19% over non-agentic o3 |
| Gemini-2.5-pro | 28.9 | +15% over random baseline |
| Best open-source | 17–21 | Little gain over baseline |
| Random baseline | 13.5 | |

Key findings:

  • Only tool-using/agentic models (e.g., o3-TU, PyVision) achieve substantial performance. o3-TU outperforms non-tool o3 by 19% absolute accuracy.
  • Non-agentic MLLMs (including those with 70B+ parameters) saturate at or slightly above random guess, irrespective of model scale or training corpus size.
  • Performance splits are task-dependent: o3-TU achieves 64% on word search (vs. 4% for o3) and 77.3% vs. 33.3% on the rotation game; standard MLLMs are essentially unable to attempt such tasks.

A plausible implication is that TIR-Bench provides a well-calibrated diagnostic for truly agentic thinking-with-images competence, exposing categorical limitations of non-agentic architectures.

6. Agentic Thinking-with-Images: Measurement, Significance, and Learning Dynamics

TIR-Bench is predicated on the definition that "agentic thinking-with-images" is the systematic use of tool-executed image manipulations as integral steps in a reasoning chain. TIR-Bench tasks enforce this by:

  • Requiring models to generate code (typically Python) for image processing (e.g., segment, rotate, enhance, draw), chain these transformations, and feed manipulated images recursively into further steps (see the sketch after this list).
  • Making purely linguistic or static vision models fail by design—success is only feasible via iterative, compositional image operations and adaptive feedback.
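
The control flow this implies is a loop of manipulate, re-inject, and judge. The sketch below illustrates it for the rotation-game task under stated assumptions: the `judge` callable stands in for the MLLM's own inspection of each re-injected image and is a placeholder, not part of TIR-Bench.

```python
# Hedged sketch of the manipulate -> re-inject -> analyze loop, using the
# rotation-game task as an example. `judge` stands in for the MLLM's own
# inspection of each re-injected image; it is a placeholder, not a TIR-Bench API.
from typing import Callable
from PIL import Image

def solve_rotation_game(path: str, judge: Callable[[Image.Image], bool]) -> int:
    """Try candidate rotations until `judge` accepts the transformed image."""
    original = Image.open(path)
    for angle in (0, 90, 180, 270):             # candidate tool actions
        candidate = original.rotate(angle, expand=True)
        candidate.save(f"rotated_{angle}.png")  # image re-injected into the chain
        if judge(candidate):                    # model inspects the new image
            return angle
    return 0                                    # fall back to no rotation
```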

In a targeted supervised fine-tuning (SFT) analysis (Rotated Image OCR task), agentic/tool-use SFT was shown to dominate direct-output SFT: agentic SFT accuracy rises monotonically with more data, whereas direct SFT saturates and exhibits unstable learning curves. This suggests that agentic SFT enables models to learn compositional routines with generalizable structure, in contrast to the “flat” mapping learned by direct answer prediction.
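
To make the contrast concrete, a hypothetical pair of training targets for the Rotated Image OCR task might look as follows; the field names, trajectory structure, and example answer are illustrative assumptions, not the paper's actual data schema.

```python
# Hypothetical contrast between the two SFT target formats discussed above.
# Field names, trajectory structure, and the example answer are illustrative
# assumptions, not the paper's actual data schema.

# Direct-output SFT: map the rotated image straight to the transcription.
direct_example = {
    "image": "rotated_sign.png",
    "target": "EXIT 12",
}

# Agentic/tool-use SFT: the target is a short trajectory in which the model
# first emits code to correct the rotation, then answers from the new image.
agentic_example = {
    "image": "rotated_sign.png",
    "target": [
        {"tool_code": "from PIL import Image; "
                      "Image.open('rotated_sign.png')"
                      ".rotate(-90, expand=True).save('upright.png')"},
        {"observation": "upright.png"},
        {"answer": "EXIT 12"},
    ],
}
```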

7. Implications, Significance, and Future Benchmarking Directions

TIR-Bench constitutes a pivotal advance in the measurement of agentic, visually-empowered AI. The gap in performance between agentic and non-agentic models is not incremental: tool use unlocks entire categories of tasks otherwise inaccessible, supporting the claim that agentic reasoning is a qualitative, not quantitative, paradigm shift.

TIR-Bench enables:

  • Granular, task-wise diagnosis of multimodal model capabilities and deficits.
  • Objective charting of progress in agentic AI via deterministic, reproducible scoring.
  • Benchmark-driven guidance for architectural and fine-tuning strategy design (agentic tool-use SFT is singled out as robust and superior to direct answer learning for multi-step visual tasks).

Limitations include the dependence of TIR-Bench's scoring on verifiable visual ground truth and deterministic, code-executable manipulations; the current tasks, while diverse, leave room for still wider agentic reasoning challenges in future benchmark iterations.


In summary, TIR-Bench is a comprehensive, high-precision benchmark that decisively tests and quantifies agentic thinking-with-images capabilities in AI, systematically demonstrating the inadequacies of non-agentic models and documenting the decisive performance gains made possible by explicit tool-use and multi-step visual reasoning (Li et al., 3 Nov 2025).
