Visual-only inference of action consequences by VLMs
Determine whether contemporary vision–language models (VLMs) can infer the validity and consequences of actions from visual state transitions alone, without any textual environment feedback, in visually interactive decision-making tasks such as Maze 3D, Maze 2D, Sliding Block, and Matchstick Equation.
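The evaluation protocol this question implies can be sketched as follows: the model receives only rendered frames after each action, with all textual feedback (e.g. "invalid move" strings) stripped from the prompt. This is a minimal illustrative sketch, not the paper's actual harness; the `Transition` type, message schema, and function name are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One step of an episode: the action taken and the resulting frame."""
    action: str  # e.g. "move_left" in Maze 2D / Sliding Block (hypothetical name)
    frame: str   # placeholder for the rendered image (path or token)

def build_visual_only_prompt(task: str, history: list[Transition]) -> list[dict]:
    """Build a multimodal message list containing frames only -- no textual
    environment feedback (no validity strings, no reward text). The model
    must infer action validity and consequences from pixel changes alone."""
    messages = [{"type": "text",
                 "text": f"Task: {task}. Judge each action from the images alone."}]
    for t in history:
        messages.append({"type": "text", "text": f"Action taken: {t.action}"})
        messages.append({"type": "image", "image": t.frame})  # state after action
    messages.append({"type": "text",
                     "text": "Was the last action successful, and what changed?"})
    return messages

history = [Transition("move_left", "frame_0.png"),
           Transition("move_up", "frame_1.png")]
msgs = build_visual_only_prompt("Maze 2D", history)
```

The contrast condition (text feedback retained) would simply append the environment's feedback string after each frame; removing it, as in Section 4.3, isolates whether the model can read consequences off the visual transition itself.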
References
Humans can infer action consequences directly from visual changes \citep{michotte1963perception}, but it remains unclear whether VLMs can do the same.
— VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
(2601.16973 - Wang et al., 23 Jan 2026) in Section 4.3 (Removal of Text-based Feedback)