Evaluation of VLMs on complete, complex, real-time video games

Evaluate the performance of vision-language models (VLMs) on complete, complex, real-time video games, establishing assessment protocols that account for real-time interaction constraints and full-game completion rather than simplified environments or short tasks.

Background

The paper notes that although VLMs have advanced on various tasks, most existing evaluations emphasize simplified environments, short tasks, or settings without real-time interaction constraints. Complete, complex, real-time video games pose a more demanding test of perception, spatial reasoning, memory, and rapid action—all critical for generalization.

To address this gap, the authors introduce VideoGameBench, which requires agents to complete entire games using only raw visual inputs and high-level control descriptions, with no overlays, tool-assisted hints, or access to internal game state. The authors' explicit acknowledgment that evaluating VLMs on complete, complex, real-time video games remains an open challenge underscores the need for rigorous, comprehensive assessment frameworks.
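
Concretely, this interaction protocol reduces to a real-time perception-action loop: the agent repeatedly receives a raw frame, chooses a high-level action from the described controls, and issues it while the game keeps running. The sketch below illustrates such a loop in Python; the interfaces and names (GameInterface, VLMAgent, grab_frame, send_action, step_hz) are hypothetical placeholders for illustration, not the paper's actual harness.

    import time
    from dataclasses import dataclass
    from typing import Protocol


    @dataclass
    class Frame:
        # A single raw screenshot from the running game (hypothetical container).
        pixels: bytes
        timestamp: float


    class GameInterface(Protocol):
        # Minimal game wrapper: exposes only raw frames, accepts high-level actions,
        # and reports completion; no internal game state is ever exposed.
        def grab_frame(self) -> Frame: ...
        def send_action(self, action: str) -> None: ...
        def is_complete(self) -> bool: ...


    class VLMAgent(Protocol):
        # Maps a frame plus a textual description of the controls to one action string.
        def act(self, frame: Frame, controls: str) -> str: ...


    def run_episode(game: GameInterface, agent: VLMAgent, controls: str,
                    max_seconds: float = 3600.0, step_hz: float = 2.0) -> bool:
        # Real-time loop: the game keeps running while the model deliberates, so
        # slow decisions cost game time rather than pausing the environment.
        deadline = time.monotonic() + max_seconds
        period = 1.0 / step_hz
        while time.monotonic() < deadline:
            start = time.monotonic()
            frame = game.grab_frame()             # raw pixels only, no overlays or hints
            action = agent.act(frame, controls)   # may be slow; the game does not wait
            game.send_action(action)
            if game.is_complete():
                return True                       # full-game completion is the success criterion
            # keep a fixed polling cadence; thinking time beyond `period` is simply lost
            time.sleep(max(0.0, period - (time.monotonic() - start)))
        return False

The fixed cadence in this sketch highlights the key evaluation constraint: unlike turn-based or paused settings, any latency in the model's decision directly reduces how much of the game it can influence.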

References

"Despite this progress, evaluating VLMs on complete, complex, real-time video games remains an open challenge."

VideoGameBench: Can Vision-Language Models complete popular video games? (Zhang et al., 23 May 2025, arXiv:2505.18134), Section 2, Related Works.