Realizing the "thinking with image" pattern that integrates visual operations and web search
Determine concrete methods for realizing the "thinking with image" reasoning pattern in multimodal large language models, in which the model interleaves visual operations (e.g., cropping or measuring via executable code) with web search within a single reasoning process to solve tasks reliably and autonomously.
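To make the shape of the pattern concrete, here is a minimal sketch of one interleaved episode. It only illustrates the control flow (operate on the image, observe, search, answer) and does not address the open question of how a model learns to do this. Everything in it is an assumption made for illustration: the names `run_episode`, `scripted_policy`, `stub_search`, and `measure_bright_fraction` are hypothetical, the "model" is a hard-coded script, the image is a small grid of grayscale values, and the search tool is a lookup table rather than a real web search.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel intensities


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Visual operation: return the sub-image starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def measure_bright_fraction(image: Image, threshold: int = 128) -> float:
    """Visual operation: fraction of pixels at or above the threshold."""
    pixels = [p for row in image for p in row]
    return sum(p >= threshold for p in pixels) / len(pixels) if pixels else 0.0


def stub_search(query: str) -> str:
    """Hypothetical stand-in for web search; a real system would call a search API."""
    corpus = {"bright square logo": "stub result: placeholder text for a bright logo"}
    return corpus.get(query, "no results")


@dataclass
class Trace:
    """Ordered (action, observation) pairs produced during one episode."""
    steps: List[Tuple[str, object]] = field(default_factory=list)


Policy = Callable[[Trace], Tuple[str, Dict[str, object]]]


def run_episode(image: Image, policy: Policy, max_steps: int = 8) -> Tuple[str, Trace]:
    """Interleave visual operations and search until the policy answers."""
    trace = Trace()
    current = image
    for _ in range(max_steps):
        action, args = policy(trace)
        if action == "crop":
            current = crop(current, **args)
            observation: object = current
        elif action == "measure":
            observation = measure_bright_fraction(current, **args)
        elif action == "search":
            observation = stub_search(**args)
        elif action == "answer":
            return str(args["text"]), trace
        else:
            raise ValueError(f"unknown action: {action}")
        trace.steps.append((action, observation))
    return "no answer", trace


def scripted_policy(trace: Trace) -> Tuple[str, Dict[str, object]]:
    """Hard-coded stand-in for the model: crop, measure, search, then answer."""
    n = len(trace.steps)
    if n == 0:
        return "crop", {"top": 1, "left": 1, "height": 2, "width": 2}
    if n == 1:
        return "measure", {"threshold": 128}
    if n == 2:
        # The search query depends on what the measurement showed.
        fraction = trace.steps[1][1]
        query = "bright square logo" if fraction > 0.5 else "dark square logo"
        return "search", {"query": query}
    return "answer", {"text": trace.steps[2][1]}


IMAGE: Image = [
    [0, 0, 0, 0],
    [0, 200, 210, 0],
    [0, 220, 230, 0],
    [0, 0, 0, 0],
]

answer, trace = run_episode(IMAGE, scripted_policy)
print([action for action, _ in trace.steps], answer)
```

The point of the sketch is the dependency between steps: the search query is chosen from the result of a measurement, which in turn depends on the crop. In a real system the scripted policy would be replaced by the model's own decisions, which is the part that remains open.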
References
While o3 has explored "thinking with image" reasoning pattern that combines operations and search, how to realize such capabilities remains unclear.
— DeepEyesV2: Toward Agentic Multimodal Model
(2511.05271 - Hong et al., 7 Nov 2025) in Section 1 (Introduction)