Robust multi-turn multi-tool composition and generalization to unseen tools in MLLMs

Establish robust multi-turn composition of multiple image-manipulation tools and strong generalization to previously unseen tools for multimodal large language models that perform tool-augmented visual reasoning (“thinking with images”).

Background

The paper surveys recent progress in multimodal LLMs that operate on images using tools, highlighting a shift from merely reasoning about images to interactively manipulating them to gather evidence. Despite this shift, the authors note that two problems remain unsolved: reliably composing different tools across multiple turns, and generalizing to tools not encountered during training.

This open challenge motivates the authors' CodeVision framework, which treats code as a universal tool interface and trains models via supervised fine-tuning followed by reinforcement learning with dense process rewards. The authors report empirical improvements, but they explicitly note that robust multi-turn multi-tool composition and generalization to unseen tools remain unresolved at the field level.
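To make the idea of code as a universal tool interface more concrete, the following is a minimal sketch, not the CodeVision implementation. It assumes a toy image represented as a nested list of pixel values, two hypothetical helper tools (`crop`, `rotate90`), and a hypothetical convention that each model-emitted snippet assigns its output to `result`. The third turn applies a horizontal flip that is not a predefined tool, which illustrates how a code interface could in principle express operations never seen as named tools.

```python
from typing import List

Image = List[List[int]]  # toy grayscale image: rows of pixel values (assumption)


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Hypothetical tool: return the sub-image starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def rotate90(img: Image) -> Image:
    """Hypothetical tool: rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def run_turn(code: str, image: Image) -> Image:
    """Execute one model-emitted snippet against the current image.

    The snippet sees `image` plus the helper tools and must assign `result`.
    A real system would run this in a sandbox; plain exec() is unsafe.
    """
    namespace = {"image": image, "crop": crop, "rotate90": rotate90}
    exec(code, namespace)
    return namespace["result"]


def run_episode(snippets: List[str], image: Image) -> Image:
    """Chain turns: the output image of each turn feeds the next turn."""
    for code in snippets:
        image = run_turn(code, image)
    return image


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

snippets = [
    "result = rotate90(image)",               # turn 1: predefined tool
    "result = crop(image, 0, 0, 2, 2)",       # turn 2: a different predefined tool
    "result = [row[::-1] for row in image]",  # turn 3: ad hoc flip, no named tool
]

final = run_episode(snippets, image)
print(final)
```

Because every turn is ordinary code over the same `image` variable, composing tools across turns reduces to chaining snippets; whether a trained model does this reliably is exactly the open question described above.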

References

Recent efforts are transitioning from thinking about images to thinking with images~\citep{o3,su2025openthinkimg,zhang2025thyme}, but robust, multi-turn, multi-tool composition and strong generalization to unseen tools remain open challenges.

Thinking with Programming Vision: Towards a Unified View for Thinking with Images (2512.03746 - Guo et al., 3 Dec 2025) in Related Work, MLLM Reasoning (Section 2)