Visual-only inference of action consequences by VLMs

Determine whether contemporary vision–language models (VLMs) can infer the validity and consequences of actions solely from visual state transitions, without any textual environment feedback, in visually interactive decision-making tasks such as Maze 3D, Maze 2D, Sliding Block, and Matchstick Equation.

Background

The paper studies whether models can rely purely on visual changes to understand what their actions accomplished, mirroring how humans perceive causality in physical scenes. In VisGym, environments typically provide textual feedback describing action execution or constraint violations. The authors remove this feedback channel in several tasks to test whether models can still infer action consequences from visual transitions alone.
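
To make the manipulation concrete, the sketch below illustrates how the two evaluation conditions could be posed to a VLM: the visual-only condition supplies just the before/after frames and the attempted action, while the control condition additionally appends the environment's textual feedback. This is a minimal sketch under assumed interfaces; the `Transition` container, the message format, and the prompt-building helpers are hypothetical illustrations, not VisGym's actual API.

```python
# Hedged sketch of the two feedback conditions; all names are illustrative.
from dataclasses import dataclass

@dataclass
class Transition:
    before: bytes   # rendered frame before the action (e.g., PNG bytes)
    after: bytes    # rendered frame after the action
    action: str     # action the agent attempted, e.g., "move left"

def build_visual_only_prompt(t: Transition) -> list:
    """Visual-only condition: only the two frames and the attempted action.

    No textual environment feedback (e.g., "invalid move") is included, so the
    model must judge validity and consequences from the pixel-level change alone.
    """
    return [
        {"type": "text", "text": f"The agent attempted the action: {t.action}."},
        {"type": "image", "data": t.before},
        {"type": "image", "data": t.after},
        {"type": "text", "text": (
            "Comparing the two frames, was the action executed successfully? "
            "If not, explain why it likely failed; if yes, describe what changed."
        )},
    ]

def build_text_feedback_prompt(t: Transition, env_message: str) -> list:
    """Control condition: the environment's textual feedback is also shown."""
    return build_visual_only_prompt(t) + [
        {"type": "text", "text": f"Environment feedback: {env_message}"},
    ]
```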

Empirically, all evaluated models show performance drops when textual feedback is removed, suggesting a dependence on text-based signals. This motivates a focused investigation into whether, and under what conditions, VLMs can recover action validity and outcome inference purely from visual state changes.

References

Humans can infer action consequences directly from visual changes \citep{michotte1963perception}, but it remains unclear whether VLMs can do the same.

Wang et al., "VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents" (2601.16973, 23 Jan 2026), Section 4.3 (Removal of Text-based Feedback).