lmgame-Bench: How Good are LLMs at Playing Games? (2505.15146v2)
Abstract: Playing video games requires perception, memory, and planning, exactly the faculties modern LLM agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
Summary
- The paper introduces lmgame-Bench, a benchmark that assesses LLM abilities in complex gaming environments by integrating standardized perception, memory, and reasoning modules.
- The methodology mitigates challenges like brittle vision, prompt sensitivity, and data contamination through a modular gaming harness and iterative prompt optimization.
- Experimental results demonstrate significant performance gains and reliable differentiation among LLMs, linking game performance to core reasoning and planning skills.
Playing complex video games requires a blend of perception, memory, planning, and sequential decision-making, faculties that are increasingly expected of modern LLM agents. The paper "lmgame-Bench: How Good are LLMs at Playing Games?" (2505.15146) introduces a benchmark designed to evaluate these capabilities by leveraging a suite of popular video games. The authors identify key challenges in directly using games for LLM evaluation: brittle vision perception, prompt sensitivity, and potential data contamination. To address these issues and enable reliable evaluation, they propose lmgame-Bench, which features a standardized interface, lightweight perception and memory scaffolds, and techniques to mitigate prompt variance and data contamination.
lmgame-Bench is built upon six well-known games chosen for their diverse skill requirements:
- Super Mario Bros: A platformer requiring visual perception, 2D spatial reasoning, and goal-directed planning in partially observable environments.
- Tetris: A tile-matching puzzle stressing visual pattern recognition, spatial reasoning (rotation and placement), and long-horizon planning under partial observability.
- Sokoban: A grid-based puzzle emphasizing visual perception, spatial reasoning for character and box movement, and long-horizon planning to avoid deadlocks, known for low fault tolerance.
- Candy Crush: A match-three puzzle testing visual perception, spatial reasoning (anticipating chain reactions), and planning under limited moves.
- 2048: A sliding-tile puzzle evaluating visual perception, spatial reasoning (managing merges), and goal-directed planning to maximize merge potential, where errors compound quickly.
- Ace Attorney: A narrative courtroom drama requiring long-context language understanding, causal deductive reasoning under partial observability, and low-fault-tolerance, long-horizon decision making.
The benchmark environments are formalized as partially or fully observable Markov Decision Processes (MDPs) with defined observation and action spaces. Game states can be represented symbolically or graphically. Evaluation metrics are categorized into progression rewards (for continuous or linear games like Super Mario Bros., Tetris, 2048) and long-horizon rewards (for games with discrete objectives like Sokoban, Ace Attorney). Games are categorized by difficulty based on fault tolerance and state-action space complexity.
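A minimal sketch of what such a Gym-style interface can look like is below, with a symbolic text observation and a progression-style reward. The toy "reach the flag" game and all class/field names are illustrative assumptions, not the benchmark's actual API.

```python
# Minimal sketch of a Gym-style interface: reset()/step() over a symbolic state with a
# progression-style reward. Everything here is an illustrative assumption, not lmgame-Bench code.
from dataclasses import dataclass, field


@dataclass
class StepResult:
    observation: str   # symbolic text rendering handed to the LLM
    reward: float      # progression reward: distance advanced toward the goal
    terminated: bool
    info: dict = field(default_factory=dict)


class ReachTheFlagEnv:
    """1-D toy level: the agent 'P' must reach the flag 'F' by moving left/right."""

    def __init__(self, length: int = 8):
        self.length = length

    def reset(self) -> str:
        self.pos = 0
        return self._render()

    def step(self, action: str) -> StepResult:
        prev = self.pos
        if action == "right":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        done = self.pos == self.length - 1
        return StepResult(self._render(), float(self.pos - prev), done)

    def _render(self) -> str:
        row = ["."] * self.length
        row[-1] = "F"
        row[self.pos] = "P"
        return "".join(row)


# Example episode step
env = ReachTheFlagEnv()
obs = env.reset()
result = env.step("right")
print(obs, "->", result.observation, "reward:", result.reward)
```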
To enhance benchmark effectiveness and overcome the identified challenges, lmgame-Bench introduces a gaming harness with modular components:
- Perception Modules: These convert raw game UI inputs into symbolic representations (e.g., text-based tables for grid games that list object coordinates and properties, such as "Box at (2,3)") or textual descriptions (for text-based or complex graphical games), mitigating brittle vision and giving models structured input; a minimal sketch follows this list.
- Memory Modules: For games with large decision spaces or long horizons (e.g., Sokoban, 2048), these modules can record the past N game states and actions (transient memory) and encode explicit lessons learned to avoid failures (reflection module), helping models with long-horizon planning and distinguishing stronger models. The paper uses o3 to generate reflections for the memory module.
- Reasoning Modules: The benchmark framework is designed to support models that generate detailed reasoning traces (like Chain-of-Thought), allowing evaluation of models with or without explicit long CoT prompting.
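Below is a minimal sketch of the perception and transient-memory ideas for a character-grid game. The symbol table, function names, and buffer size are illustrative assumptions, not the harness's actual implementation.

```python
# Illustrative sketch: a perception module (grid -> text table) and a transient memory
# buffer holding the last N (state, action) pairs. Names and structure are assumptions.
from collections import deque

SYMBOLS = {"#": "Wall", "B": "Box", "T": "Target", "P": "Player"}


def perceive(grid: list[str]) -> str:
    """Convert a character grid into a symbolic text table, e.g. 'Box at (1,3)'."""
    lines = []
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch in SYMBOLS:
                lines.append(f"{SYMBOLS[ch]} at ({r},{c})")
    return "\n".join(lines)


class TransientMemory:
    """Keep the last N (state, action) pairs to prepend to the next prompt."""

    def __init__(self, n: int = 5):
        self.buffer = deque(maxlen=n)

    def add(self, state: str, action: str) -> None:
        self.buffer.append((state, action))

    def render(self) -> str:
        return "\n---\n".join(f"state:\n{s}\naction: {a}" for s, a in self.buffer)


# Example usage on a tiny Sokoban-like grid
grid = ["#####", "#P B#", "#  T#", "#####"]
print(perceive(grid))
```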
Data contamination is addressed by checking for overlap with publicly available game assets or transcripts. For vision-level contamination (e.g., Super Mario Bros.), models were prompted to reorder shuffled frames; results showed low alignment and no significant correlation with performance, suggesting limited reliance on memorized visual sequences. For text-level contamination (e.g., Ace Attorney), initial analysis showed a strong correlation between similarity to public fan transcripts and model performance. Mitigation strategies, including entity masking, paraphrasing, and enforced causal reasoning through structured prompts, successfully eliminated this correlation, shifting performance alignment to independently judged reasoning quality (verified by an LLM-as-Judge, o3).
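A rough sketch of the frame-reordering probe might look like the following. The `query_model` callable, the trial count, and the use of Kendall's tau as the alignment score are assumptions for illustration, not necessarily the paper's exact protocol.

```python
# Sketch of a vision-contamination probe: shuffle consecutive frames, ask the model to
# recover chronological order, and score alignment with the true order.
# `query_model` is a hypothetical stand-in for the actual LLM call.
import random
from scipy.stats import kendalltau


def reorder_probe(frames, query_model, trials: int = 20) -> float:
    """Average rank correlation between the model's recovered order and the true order."""
    scores = []
    for _ in range(trials):
        true_order = list(range(len(frames)))
        shuffled = true_order[:]
        random.shuffle(shuffled)
        # The model sees the frames in shuffled order and returns the positions (within the
        # presented sequence) that it believes restore chronological order.
        predicted = query_model([frames[i] for i in shuffled])
        recovered = [shuffled[i] for i in predicted]   # map back to original frame indices
        tau, _ = kendalltau(recovered, true_order)
        scores.append(tau)
    return sum(scores) / len(scores)


# Dummy "model" that keeps the presented order unchanged scores near 0 on average;
# a model that has memorized the footage would score close to 1.
frames = [f"frame_{i}" for i in range(6)]
print(reorder_probe(frames, lambda presented: list(range(len(presented)))))
```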
Prompt sensitivity is tackled with a two-stage optimization technique. An initial empirical prompt is designed from agent-workflow best practices, following the layout [J[max(0, i−N) : i−1], R_{i−1}, s_i] for choosing action a_i, where J is the game trajectory of past states and actions, R_{i−1} is the reflection produced by the memory module, and s_i is the current state. The second stage uses DSPy [khattab2024dspy] with its SIMBA optimizer (Algorithm 1) to refine this prompt iteratively, using game rewards as the evaluation metric. This process produces a standardized, optimized system prompt that reduces performance variance across different empirically derived initializations (Table 9).
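Before optimization, the per-step prompt assembly can be sketched as below; the field labels and template wording are illustrative, and DSPy/SIMBA would then rewrite and tune the surrounding system prompt rather than this simple layout.

```python
# Sketch of assembling the per-step prompt in the [J[max(0, i-N):i-1], R_{i-1}, s_i] layout.
# The templates are illustrative, not the benchmark's optimized system prompt.
def build_prompt(trajectory, reflection, state, n: int = 5) -> str:
    """trajectory: list of (state, action) pairs J; reflection: R_{i-1}; state: s_i."""
    i = len(trajectory)
    recent = trajectory[max(0, i - n): i]   # the last N completed steps
    history = "\n".join(
        f"[t-{len(recent) - k}] state: {s} | action: {a}"
        for k, (s, a) in enumerate(recent)
    )
    return (
        f"Recent trajectory:\n{history}\n\n"
        f"Reflection from the previous step:\n{reflection}\n\n"
        f"Current state:\n{state}\n\n"
        "Choose the next action."
    )


# Example usage with toy Sokoban-style states.
print(build_prompt(
    [("P.B.T", "right"), (".PB.T", "right")],
    "Avoid pushing the box past the target.",
    "..PBT",
))
```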
Experimental evaluation of 13 leading models (including various versions of Claude, Gemini, Grok, Llama, GPT, o1, o3, o4-mini) on lmgame-Bench demonstrates that models initially perform poorly without the harness, often near random baselines (Table 2). Enabling the gaming harness leads to consistent and substantial performance gains, significantly differentiating models (Table 3). Quantitative analysis using Glass's δ and paired-sample t-tests (Table 8, Figure 7) shows that the harness pulls model scores significantly farther from random play and that performance improvements are statistically significant for most games. The Coefficient of Variation analysis (Table 10) indicates that the harness also yields more stable and reliable performance assessments. Perception modules are particularly useful in spatial tasks like Sokoban, while memory modules benefit temporal planning games like 2048 (Table 3, Appendix D).
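For concreteness, the reported statistics reduce to standard formulas. The sketch below assumes per-model score arrays for the harness, no-harness, and random-play conditions and uses placeholder numbers; it mirrors the definitions rather than the paper's analysis code.

```python
# Sketch of Glass's delta, a paired-sample t-test, and the coefficient of variation,
# computed over illustrative placeholder scores (not the paper's data).
import numpy as np
from scipy import stats


def glass_delta(treatment: np.ndarray, control: np.ndarray) -> float:
    """Glass's delta: mean difference scaled by the control group's standard deviation."""
    return (treatment.mean() - control.mean()) / control.std(ddof=1)


def coefficient_of_variation(scores: np.ndarray) -> float:
    """CV = std / mean; lower values indicate more stable scores across repeated runs."""
    return scores.std(ddof=1) / scores.mean()


with_harness = np.array([12.0, 14.5, 13.2, 15.1, 13.8])   # same models, harness enabled
without_harness = np.array([4.2, 6.1, 5.0, 7.3, 5.5])     # same models, harness disabled
random_play = np.array([2.1, 1.8, 2.4, 2.0, 1.9])         # random-action baseline runs

print("Glass's delta vs. random:", glass_delta(with_harness, random_play))
print("Paired t-test (with vs. without harness):", stats.ttest_rel(with_harness, without_harness))
print("CV with harness:", coefficient_of_variation(with_harness))
```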
To understand what core LLM capabilities are probed by lmgame-Bench, correlation analysis (Figure 2a, Appendix C) was performed between game performance and performance on 20 established benchmarks spanning various domains (factual knowledge, physics, math, coding, visual reasoning, language understanding, puzzle solving). Sokoban performance correlates strongly with math and coding, Tetris/2048 with pattern recognition, Candy Crush with coding, and Ace Attorney with long-context language understanding. Latent ability decomposition using low-rank matrix factorization identifies underlying features (language/knowledge, coding, symbolic/puzzle, physical reasoning) and shows that each game loads on unique combinations of these features (Figure 2b, Appendix C), suggesting games test compositional capabilities. Linear modeling further supports these findings, linking game rankings to performance in benchmark categories (Table 4).
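The latent-ability decomposition can be sketched as a low-rank factorization of a models-by-benchmarks score matrix. Non-negative matrix factorization with rank 4 is one plausible choice here (matching the four interpretable features named above), not necessarily the paper's exact method, and the scores below are random placeholders.

```python
# Sketch of latent-ability decomposition: factor a (models x benchmarks) score matrix into
# per-model loadings and latent "ability" profiles. NMF with rank 4 is an assumed choice.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, size=(13, 20))    # placeholder: 13 models x 20 benchmarks in [0, 1]

nmf = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
model_loadings = nmf.fit_transform(scores)   # (13, 4): each model's strength on each latent ability
ability_profiles = nmf.components_           # (4, 20): how each ability loads on each benchmark

# Game scores could then be correlated against model_loadings to see which latent
# abilities each game draws on.
print(model_loadings.shape, ability_profiles.shape)
```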
Finally, the paper investigates the training generalizability of game-based learning. RL training on simplified versions of Sokoban and Tetris (using the RAGEN framework [wang2025ragen] and Qwen2.5-7B-Instruct [qwen2.5]) yields performance improvements that transfer not only to variations of the same game but also to other spatial reasoning tasks (Blocksworld) and real-world agentic tasks (WebShop), with gains of 6% to over 10% (Table 5, Appendix A). However, training on these games did not transfer to math or coding benchmarks such as GSM8K or BIRD, suggesting domain specificity in generalization. Analysis of explicit "thinking tokens" during training and inference shows mixed results depending on the task, though training models to "think" generally helps on planning tasks (Table 6, Appendix A). Mixed training on games and math shows moderate cross-domain generalization but may sacrifice peak performance on domain-specific tasks (Figure 4, Appendix A).
In summary, lmgame-Bench provides a robust and versatile benchmark for evaluating LLM agents on complex interactive game environments. By addressing common challenges like brittle perception, contamination, and prompt sensitivity with practical scaffolds and methodologies, it enables meaningful differentiation of model capabilities. The analysis demonstrates that game performance correlates with and decomposes into fundamental reasoning, planning, and understanding skills also measured by traditional benchmarks. Crucially, the paper provides empirical evidence that training LLMs in these game environments can improve performance on out-of-domain planning and agentic tasks, highlighting the value of games as a testbed and training ground for developing general-purpose AI agents.
Related Papers
- SmartPlay: A Benchmark for LLMs as Intelligent Agents (2023)
- Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games (2024)
- Measuring General Intelligence with Generated Games (2025)
- VideoGameBench: Can Vision-Language Models complete popular video games? (2025)
- Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games (2025)