
Realizing the “thinking with image” pattern that integrates image operations and web search

Determine concrete methodologies for realizing the "thinking with image" reasoning pattern in multimodal large language models, in which the model interleaves visual operations (e.g., cropping or measuring an image via executable code) and web search within a single reasoning process to solve tasks reliably and autonomously.


Background

The paper argues that most multimodal LLMs remain passive and lack the ability to actively invoke external tools for computation and knowledge retrieval. It highlights a gap between current approaches and truly agentic multimodal models that can combine perception, reasoning, and search.

OpenAI’s o3 is cited as having explored a “thinking with image” reasoning pattern that combines image operations and search. However, the authors note that the community lacks clear methods to realize such capabilities. DeepEyesV2 is proposed to advance this direction by integrating code execution and web retrieval within a unified reasoning loop, but the broader methodological question remains unresolved.
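To make the pattern concrete, the sketch below shows one possible shape of such a unified reasoning loop: a controller lets the model alternate between executing image-manipulation code and issuing web-search queries within a single trace, feeding each tool output back into the context. This is a minimal illustration, not the DeepEyesV2 implementation; all function names (model_step, run_python, web_search) are hypothetical placeholders.

```python
# Minimal sketch of a "thinking with image" agentic loop.
# Assumptions: the model emits either a final answer or a tool request,
# and two tools are available: Python execution for image operations
# (e.g., cropping/measuring with PIL) and a web-search backend.

import io
import contextlib


def model_step(context: str) -> dict:
    """Placeholder for the multimodal LLM. A real model would return either
    {'answer': '...'} or a tool request such as
    {'tool': 'python', 'code': '...'} or {'tool': 'search', 'query': '...'}."""
    raise NotImplementedError("plug in an actual model here")


def run_python(code: str, image_paths: dict) -> str:
    """Execute model-written code in a restricted namespace and capture
    its printed output. A production system would sandbox this step."""
    buffer = io.StringIO()
    namespace = {"image_paths": image_paths}
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return buffer.getvalue()


def web_search(query: str) -> str:
    """Placeholder for a retrieval backend returning text snippets."""
    raise NotImplementedError("plug in a search API here")


def reasoning_loop(question: str, image_paths: dict, max_turns: int = 8) -> str:
    """Single reasoning process: the model alternates between thinking,
    invoking tools, and reading tool outputs until it emits an answer."""
    context = question
    for _ in range(max_turns):
        step = model_step(context)
        if step.get("answer") is not None:
            return step["answer"]
        if step["tool"] == "python":
            observation = run_python(step["code"], image_paths)
        elif step["tool"] == "search":
            observation = web_search(step["query"])
        else:
            observation = f"unknown tool: {step['tool']}"
        # Append the tool output so the next step can reason over it.
        context += f"\n[tool output]\n{observation}"
    return "no answer within the turn budget"
```

The open question concerns how to train a model so that it reliably drives a loop of this kind on its own, rather than how to wire up the loop itself.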

References

“While o3 has explored ‘thinking with image’ reasoning pattern that combines operations and search, how to realize such capabilities remains unclear.”

DeepEyesV2: Toward Agentic Multimodal Model (Hong et al., 7 Nov 2025, arXiv:2511.05271), Section 1 (Introduction).