- The paper introduces VideoGameBench, a benchmark that assesses VLMs on human-like skills such as perception, spatial navigation, and memory management by having them play popular 1990s video games.
- It outlines a suite of 23 curated games and a VG-Agent using a ReAct framework, enabling rigorous testing of generalization and raw visual input processing.
- Experiments reveal that frontier VLMs struggle with real-time reasoning and action execution; the top performer completes less than 1% of the benchmark, highlighting fundamental limitations.
VideoGameBench (2505.18134) introduces a new benchmark to evaluate the ability of vision-language models (VLMs) to perform tasks that are intuitive for humans, such as perception, spatial navigation, and memory management, by playing popular video games from the 1990s. The paper argues that while VLMs excel at tasks like coding and math, their performance on these more 'human-like' skills in dynamic, real-time environments is less understood. Real video games, designed for human learnability and engagement, serve as an ideal testbed.
The benchmark consists of a suite of 23 curated games, split into a 13-game development set and a 10-game test set (7 public, 3 secret). These games come from Game Boy, Game Boy Color, and MS-DOS platforms, offering diverse challenges across 2D/3D environments, genres (FPS, Platformer, Action-Adventure/RPG, Racing, Puzzle), and control schemes (controller vs. mouse/keyboard).
Key features of VideoGameBench:
- Complex and Realistic Environments: Unlike grid-world or text-only games, these 90s games present more challenging visuals and require sophisticated planning, puzzle-solving, and real-time reaction.
- Generalization Testing: The benchmark includes three secret games in the test set to encourage models to develop capabilities that generalize to unseen environments, rather than relying on game-specific training.
- Raw Visual Input: Agents only receive raw game frames and a high-level description of objectives and controls. No visual overlays, parsed game state information, or human hints beyond the initial setup are permitted, challenging the VLM's core visual processing abilities. This contrasts with previous efforts that might use game-specific scaffolding or tools.
The paper introduces a standard interface and agent scaffolding, VG-Agent, which follows a ReAct-style (reason, act, observe) loop with a textual scratchpad for memory and interacts with the game emulators (PyBoy for Game Boy, DOSBox/JS-DOS/Playwright for MS-DOS). Actions are issued as natural-language or structured commands (e.g., pressing keys, moving or clicking the mouse).
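A minimal sketch of one such reason-act step is shown below; the `Emulator` wrapper, the `vlm.generate` call, and the reply fields are illustrative assumptions, not the paper's actual VG-Agent code.

```python
# Hypothetical sketch of a ReAct-style agent step; all interface names here
# are assumptions for illustration rather than the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    """Textual memory carried between steps."""
    notes: list[str] = field(default_factory=list)

    def render(self) -> str:
        return "\n".join(self.notes[-20:])  # keep only recent notes in the prompt

def agent_step(emulator, vlm, objective: str, pad: Scratchpad) -> None:
    frame = emulator.screenshot()  # raw game frame is the only game-state input
    prompt = (
        f"Objective: {objective}\n"
        f"Memory:\n{pad.render()}\n"
        "Reason about the current frame, then output one action, e.g. "
        "'press A', 'hold RIGHT 10 frames', or 'click 120,80'."
    )
    reply = vlm.generate(images=[frame], text=prompt)  # thought + chosen action
    pad.notes.append(reply.thought)                    # persist the reasoning
    emulator.execute(reply.action)                     # translate to key/mouse input
```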
A significant challenge observed is VLM inference latency in real-time games. To address this, the paper proposes VideoGameBench Lite, a variant where the emulator pauses while the agent processes input and decides on an action. This effectively turns real-time games into turn-based interactions, allowing for evaluation of reasoning capabilities independent of reaction speed.
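Under the same assumed wrappers, the Lite setting amounts to freezing the emulator clock during model inference, roughly as sketched here (the `pause()`/`resume()` methods are assumed wrapper calls, not a documented PyBoy or DOSBox API):

```python
# Sketch of the Lite interaction pattern: the game only advances while the
# agent is not thinking, turning a real-time game into a turn-based one.
def lite_step(emulator, vlm, objective: str, pad) -> None:
    emulator.pause()                       # freeze the world during inference
    frame = emulator.screenshot()
    prompt = f"Objective: {objective}\nMemory:\n{pad.render()}\nChoose one action."
    reply = vlm.generate(images=[frame], text=prompt)
    emulator.resume()                      # un-freeze before acting
    emulator.execute(reply.action)         # action plays out in resumed real time
```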
To provide granular progress tracking beyond simple pass/fail, the paper introduces an automated progress tracking mechanism. Checkpoint images, along with their timestamps, are scraped from YouTube video walkthroughs. Perceptual image hashing is then used to compare the current game screen to these checkpoint images, detecting progress based on the Hamming distance between hashes. The percentage of the game completed is determined by the timestamp of the furthest reached checkpoint relative to the walkthrough duration.
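A rough sketch of this matching step using the `imagehash` library follows; the specific hash function, distance threshold, and data layout are assumptions, not the paper's reported settings.

```python
# Sketch of checkpoint-based progress tracking via perceptual hashing.
# Checkpoint frames and timestamps are assumed to have been extracted from a
# walkthrough video beforehand; the distance threshold of 8 bits is illustrative.
import imagehash
from PIL import Image

def completion_fraction(current_frame: Image.Image,
                        checkpoints: list[tuple[Image.Image, float]],
                        walkthrough_duration: float,
                        max_distance: int = 8) -> float:
    """Fraction of the game completed, from the furthest checkpoint matching the frame."""
    frame_hash = imagehash.phash(current_frame)
    furthest = 0.0
    for checkpoint_image, timestamp in checkpoints:
        # Hamming distance between the 64-bit perceptual hashes
        if frame_hash - imagehash.phash(checkpoint_image) <= max_distance:
            furthest = max(furthest, timestamp)
    return furthest / walkthrough_duration
```

Over a full run, the reported completion score would then be the maximum of this value across all observed frames.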
Experiments evaluating frontier VLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Pro, Llama 4 Maverick, and Gemini 2.0 Flash) with the VG-Agent scaffolding show that models struggle significantly on VideoGameBench. The best-performing model, Gemini 2.5 Pro, completes only 0.48% of the full benchmark on average.
Results on VideoGameBench Lite show a slight quantitative improvement and notable qualitative improvement, suggesting that while latency is a factor, models still struggle even with ample thinking time. The overall scores remain low (best overall 1.6%), indicating fundamental limitations in reasoning effectively over these environments.
To further probe model weaknesses, the paper presents results on simple practice games designed to test basic skills like location clicking, mouse dragging, and 2D grid navigation. Models, including the best-performing ones on VideoGameBench, show poor performance on dragging and navigation tasks, suggesting challenges even with foundational control and spatial reasoning in simple visual environments. Human testing with the same interface confirms that the benchmark setup itself is not the limiting factor.
Qualitative analysis of game trajectories reveals key failure modes:
- Knowing-Doing Gap: Models often verbalize correct steps (what needs to be done) but fail to execute them correctly with the available actions, indicating a disconnect between high-level reasoning and low-level control.
- Visual Input Processing Issues: Models misinterpret raw screen information, leading to erroneous actions like attacking already defeated enemies or believing an interaction occurred when it did not.
- Lack of Planning and Memory: Agents struggle to track game state, objectives, or map layouts effectively, leading to repeated loops, getting stuck, or forgetting previously identified goals. The ReAct agent's textual memory proved insufficient for robust long-term planning and navigation.
The paper contrasts the poor performance of these unassisted VLMs with the success of heavily engineered, tool-assisted agents (like Gemini Plays Pokemon), highlighting the significant gap between raw VLM capabilities and systems augmented with game-specific scaffolding, hints, and external information access.
VideoGameBench is presented as a challenging benchmark that formalizes human skills like real-time perception, spatial navigation, and memory management in a complex, dynamic environment. The authors hope the benchmark will motivate research into developing more capable and general-purpose autonomous agents that can interpret raw visual inputs and generalize to new tasks, with potential implications for real-world applications like robotics. Limitations include the focus on 90s games and potential data leakage from pre-training corpora containing game information, though efforts were made to mitigate this with dev/test splits and secret games.