Robust multi-turn multi-tool composition and generalization to unseen tools in MLLMs
Establish robust multi-turn composition of multiple image-manipulation tools and strong generalization to previously unseen tools for multimodal large language models that perform tool-augmented visual reasoning (“thinking with images”).
References
Recent efforts are transitioning from thinking about images to thinking with images~\citep{o3,su2025openthinkimg,zhang2025thyme}, but robust, multi-turn, multi-tool composition and strong generalization to unseen tools remain open challenges.
— Thinking with Programming Vision: Towards a Unified View for Thinking with Images
(2512.03746 - Guo et al., 3 Dec 2025) in Related Work, MLLM Reasoning (Section 2)