Overall effect of image-aided reasoning on LLM performance

Determine the overall effect of image-aided reasoning on performance relative to language-only reasoning in the mental imagery task adapted from Finke et al. (1989). Here, image-aided reasoning means prompting large language models to generate and iteratively modify intermediate images while following compositional object reconstruction instructions. Also identify the conditions and model configurations under which image generation helps or hinders accuracy.

Background

The paper investigated an image-aided paradigm in which models generated and modified images at each step of the object reconstruction task. Across o3, GPT-4.1, and Gemini 2.0 Flash, introducing images generally reduced or did not improve performance compared to language-only reasoning.

Despite these findings, the authors note that the overall effect remains unclear: the models still found some success with images, though diminished, which suggests possible benefits under specific conditions and motivates further systematic evaluation of when and how image generation affects reasoning.

References

It is unclear what the overall effect of image-aided reasoning is, as the models still found some success (though diminished), and more exploration of its effects is needed \citep{yang2025, wu2024}.

— Artificial Phantasia: Evidence for Propositional Reasoning-Based Mental Imagery in Large Language Models (2509.23108 - McCarty et al., 27 Sep 2025) in Subsection "Image-aided Reasoning" (Results)
