- The paper introduces TIR-Bench, a novel benchmark that evaluates agentic thinking-with-images reasoning through 13 diverse, dynamic visual tasks.
- Tasks such as maze solving and jigsaw puzzles require models to employ active image manipulation and integrated tool use.
- Results show that models with agentic tool use far outperform conventional MLLMs, underscoring the value of dynamic, multimodal strategies.
TIR-Bench: A Benchmark for Agentic Visual Reasoning
Introduction
The emergence of Multimodal LLMs (MLLMs) has significantly advanced visual reasoning capabilities. However, existing benchmarks focus mainly on simple, static image operations, such as localization and cropping, which limits their ability to evaluate dynamic, complex reasoning skills. TIR-Bench, a new benchmark, addresses this gap by introducing tasks that assess agentic "thinking-with-images" reasoning.
Benchmark Overview
TIR-Bench evaluates the agentic image-manipulation abilities of MLLMs across 13 diverse tasks. These tasks are designed to require novel image-processing tools, moving beyond conventional static image evaluation. Key tasks include mathematical visual question answering, symbolic reasoning, maze solving, jigsaw puzzles, and visual search. Each task necessitates active tool use, such as drawing auxiliary lines or correcting image orientation, to foster integrated reasoning.
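To make this concrete, below is a minimal sketch of what a single "thinking-with-images" step could look like: the model requests an auxiliary line, a harness draws it with Pillow, and the edited image is fed back for a second reasoning pass. This is an illustration, not TIR-Bench's actual harness; the caller-supplied `query_mllm` function and the `LINE`/`ANSWER` protocol are assumptions.

```python
# Minimal sketch of one "thinking-with-images" step. Assumes the caller
# supplies query_mllm(image_path, prompt) -> str for some MLLM backend;
# the tool protocol below is an illustrative assumption.
from typing import Callable
from PIL import Image, ImageDraw


def draw_auxiliary_line(image_path: str, x1: float, y1: float, x2: float, y2: float,
                        out_path: str = "augmented.png") -> str:
    """Execute a draw-line tool call requested by the model; return the new image path."""
    img = Image.open(image_path).convert("RGB")
    ImageDraw.Draw(img).line([(x1, y1), (x2, y2)], fill="red", width=3)
    img.save(out_path)
    return out_path


def one_agentic_step(query_mllm: Callable[[str, str], str],
                     image_path: str, question: str) -> str:
    # First pass: let the model request an auxiliary construction if it helps.
    plan = query_mllm(image_path,
                      f"{question}\nReply 'LINE x1 y1 x2 y2' to draw an "
                      "auxiliary line first, or 'ANSWER: <text>' to answer directly.")
    if plan.startswith("LINE"):
        coords = [float(v) for v in plan.split()[1:5]]
        augmented = draw_auxiliary_line(image_path, *coords)
        return query_mllm(augmented, question)  # second pass on the edited image
    return plan.removeprefix("ANSWER:").strip()
```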
Task Design
The tasks are meticulously designed to challenge models with real-world application scenarios requiring active, tool-based reasoning:
- Mathematical Visual Question Answering (VQA): Involves mathematical problem solving aided by auxiliary constructions, such as auxiliary lines, drawn on the image.
- Symbolic Reasoning: Tests the model's ability to interpret and manipulate symbols within images.
- Maze Solving: Requires spatial planning and pathfinding within visual contexts (a minimal path-finding sketch follows this list).
- Jigsaw Puzzles: Assesses spatial reasoning through image reassembly.
- Visual Search and Rotational Games: Evaluate orientation handling and precise localization through iterative exploration and correction.
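As referenced above, the maze task rewards agents that can write and run their own tool code. The following is a minimal sketch of such code, a breadth-first search over a grid the agent has already extracted from the maze image; the image-to-grid parsing step and the cell encoding (0 = open, 1 = wall) are assumptions for illustration.

```python
# Minimal BFS path-finder of the kind an agent might write for the maze task.
# Assumes the maze image has already been parsed into a grid of 0 (open) and
# 1 (wall); that parsing step is omitted here.
from collections import deque


def solve_maze(grid, start, goal):
    """Return a list of (row, col) cells from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    parent = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent pointers back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # no path exists


# Example: a 3x3 maze with a wall across the middle row except the right column.
print(solve_maze([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 0)))
```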
Implementation and Results
A comprehensive evaluation of 22 MLLMs, spanning open-source and proprietary models as well as tool-using agents, yields several key insights:
- Challenge Level: TIR-Bench proves universally challenging, with no model exceeding a 46% success rate. This highlights the difficulty and relevance of dynamic tool-based image reasoning.
- Agentic Tool Use: Models equipped for agentic tool use, such as OpenAI's o3 with code interpreter, perform significantly better, achieving 46% success versus 28.9% for the highest-performing conventional model.
- Role of Tool Use: The results emphasize that sophisticated tool use is essential for high performance on complex visual reasoning tasks. Traditional non-agentic models underperform, highlighting the necessity for integrated multimodal reasoning.
Function Call and Fine-Tuning Experiments
Two key experiments were conducted to further understand performance factors:
- Function Calling: Evaluations on the Rotation task highlight the importance of effective prompting strategies and of training models for iterative function calling. Recent models, such as o3, are markedly better at generating and executing code (a minimal sketch of such a loop follows this list).
- Fine-Tuning Approaches: An experiment comparing direct supervised fine-tuning (SFT) with agentic SFT shows that agentic SFT significantly improves performance on tasks involving complex problem-solving trajectories, underscoring the benefits of tool-use strategies for enhancing reasoning in MLLMs (an illustrative comparison of the two data formats also follows below).
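As a rough illustration of the iterative function-calling setup in the Rotation experiment, here is a minimal rotate-and-check loop. The tool name, step budget, and prompt wording are assumptions, not the paper's exact protocol; `ask` stands in for whatever MLLM backend is being evaluated.

```python
# Sketch of an iterative function-calling loop for a rotation-style task.
# Assumes a caller-supplied ask(image_path, prompt) -> str; tool name,
# step budget, and prompt wording are illustrative assumptions.
from typing import Callable
from PIL import Image


def rotation_loop(ask: Callable[[str, str], str],
                  image_path: str, question: str, max_steps: int = 4) -> str:
    current = image_path
    for _ in range(max_steps):
        reply = ask(current,
                    f"{question}\nCall 'rotate(<degrees>)' if the image is "
                    "still misoriented, otherwise reply 'ANSWER: <text>'.")
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        # Parse and execute the requested tool call, then loop on the result.
        degrees = float(reply.split("(")[1].rstrip(")"))
        rotated = Image.open(current).rotate(degrees, expand=True)
        current = "step_rotated.png"
        rotated.save(current)
    return ask(current, question)  # force a final answer if the budget runs out
```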
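And to illustrate how direct SFT and agentic SFT samples might differ in shape, the sketch below contrasts a question-to-answer pair with a full tool-use trajectory. The field names and trajectory format are assumptions for illustration, not the paper's actual data schema.

```python
# Illustrative (assumed) shapes of the two fine-tuning data formats compared:
# direct SFT supervises only the final answer, while agentic SFT supervises
# the whole tool-use trajectory. Field names are assumptions.
direct_sft_sample = {
    "image": "maze_001.png",
    "prompt": "Find a path from S to G.",
    "target": "Right, Right, Down, Down, Left, Left",
}

agentic_sft_sample = {
    "image": "maze_001.png",
    "prompt": "Find a path from S to G.",
    "trajectory": [
        {"role": "assistant", "tool_call": "parse_maze_to_grid()"},
        {"role": "tool", "result": "[[0,0,0],[1,1,0],[0,0,0]]"},
        {"role": "assistant", "tool_call": "bfs(start=(0,0), goal=(2,0))"},
        {"role": "tool", "result": "[(0,0),(0,1),(0,2),(1,2),(2,2),(2,1),(2,0)]"},
        {"role": "assistant", "answer": "Right, Right, Down, Down, Left, Left"},
    ],
}
```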
Conclusion
TIR-Bench establishes a new standard for evaluating agentic thinking-with-images reasoning, a capability crucial for developing sophisticated multimodal AI systems. The findings suggest that incorporating dynamic image-processing tools into AI models leads to more robust and sophisticated reasoning, paving the way for future advances in AI applications and research.