lmgame-Bench: LLM Evaluation in Games
- lmgame-Bench is a comprehensive evaluation suite that converts real video games into contamination-robust benchmarks using a unified Gym-style API.
- The framework integrates perception, memory, and reasoning scaffolds to decouple visual challenges from high-level planning and long-term strategic reasoning.
- RL fine-tuning on lmgame-Bench not only boosts in-domain performance but also enhances transfer to external planning and decision-making tasks.
lmgame-Bench is a comprehensive evaluation suite developed to rigorously assess LLMs in video game environments that demand perception, memory, planning, and long-term strategic reasoning. Rather than relying on hand-crafted tasks or simulated puzzles, lmgame-Bench transforms a suite of real video games into controlled and contamination-robust benchmarks, pairing each with lightweight scaffolds for perception, memory tracking, and reasoning, and delivering all games through a unified Gym-style API. This structure enables consistent, transferable evaluation across LLMs while explicitly mitigating challenges inherent to using games for model assessment, such as brittle visual perception, prompt sensitivity, and data contamination (2505.15146).
1. Design Principles and Benchmark Structure
lmgame-Bench was constructed to “turn games into reliable evaluations” by anchoring on well-known titles (e.g., Super Mario Bros., Sokoban, Tetris, 2048, Candy Crush, Ace Attorney). Its design centers on three core scaffolds:
- Perception Module: Converts visual or grid-based inputs into symbolic or textual descriptions interpretable by LLMs, circumventing the visual brittleness that undermines direct vision-language evaluation.
- Memory Scaffold: Maintains persistent context for the model, tracking previous game states or actions, thereby enabling long-horizon reasoning and planning.
- Reasoning Scaffold: Supports explicit chain-of-thought (CoT) output—encouraging models to verbalize their intermediate reasoning before action selection.
These modules decouple raw perception difficulties from higher-level cognition, allowing the benchmark to probe an LLM’s underlying planning and reasoning abilities rather than its capacity for low-level pixel interpretation. The framework unifies all supported games under a Gym-style API with standardized observation and action spaces, enabling iterative, multi-turn interaction with LLM-driven agents.
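The sketch below illustrates this interaction pattern in Python; the class and method names (e.g., `MemoryScaffold`, `perception.to_text`, `llm.act`) are hypothetical stand-ins for exposition, not the actual lmgame-Bench API.

```python
# Illustrative Gym-style interaction loop with the three scaffolds.
# Names are hypothetical; env.reset()/env.step() follow the Gymnasium convention.
from dataclasses import dataclass, field

@dataclass
class MemoryScaffold:
    """Keeps a rolling transcript of past observations and actions."""
    history: list = field(default_factory=list)
    max_turns: int = 20

    def update(self, observation: str, action: str) -> None:
        self.history.append((observation, action))
        self.history = self.history[-self.max_turns:]

    def render(self) -> str:
        return "\n".join(f"obs: {o} -> action: {a}" for o, a in self.history)


def run_episode(env, llm, perception, memory: MemoryScaffold, max_steps: int = 100) -> float:
    """Multi-turn loop: symbolic observation -> chain-of-thought prompt -> action."""
    obs, _ = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        symbolic_obs = perception.to_text(obs)            # perception scaffold
        prompt = (                                        # reasoning scaffold (CoT prompt)
            f"Game history:\n{memory.render()}\n\n"
            f"Current state:\n{symbolic_obs}\n\n"
            "Think step by step, then output a single legal action."
        )
        action = llm.act(prompt)
        obs, reward, terminated, truncated, _ = env.step(action)
        memory.update(symbolic_obs, action)               # memory scaffold
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```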
2. Addressing Key Challenges in Game-Based LLM Evaluation
lmgame-Bench systematically addresses three principal obstacles when deploying LLMs in game settings:
- Brittle Vision Perception: Directly feeding RGB frames to LLMs, even those augmented with vision, yields poor performance due to inadequate object recognition and scene parsing; symbolic state conversion remedies this.
- Prompt Sensitivity: LLM performance exhibits high variance with minor prompt changes. To mitigate this, prompt standardization and a two-stage optimization process based on DSPy and SIMBA routines are implemented, as outlined in the experimental framework.
- Potential Data Contamination: Since many game assets or level layouts are accessible and potentially seen during pretraining, lmgame-Bench incorporates dedicated contamination checks at the visual and script/text level. Prompt interventions—such as entity masking, paraphrasing, and metadata removal—seek to stabilize results and ensure outcome validity.
This tripartite set of interventions ensures relative immunity to prior exposure and emphasizes genuine problem-solving skill over memorization.
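A minimal sketch of the entity-masking intervention mentioned above, assuming a hand-specified entity list; the patterns and placeholders here are illustrative, not lmgame-Bench's own.

```python
# Replace well-known game entities with neutral placeholders so that
# pretraining familiarity with named assets is less useful to the model.
import re

ENTITY_MASKS = {
    r"\bMario\b": "the player character",
    r"\bGoomba\b": "enemy A",
    r"\bSokoban\b": "the box-pushing puzzle",
}

def mask_entities(prompt: str) -> str:
    """Apply each masking pattern to the prompt text."""
    for pattern, placeholder in ENTITY_MASKS.items():
        prompt = re.sub(pattern, placeholder, prompt)
    return prompt

print(mask_entities("Mario must avoid the Goomba while pushing boxes as in Sokoban."))
```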
3. Game Suite and Unifying API
The benchmark encompasses a diverse selection of platformer, puzzle, and narrative games:
| Genre | Example Titles | Cognitive Skills Probed |
|---|---|---|
| Platformer | Super Mario Bros. | Spatial navigation, timing, dynamic planning |
| Puzzle | Sokoban, Tetris, 2048, Candy Crush | Symbolic reasoning, spatial manipulation, fault tolerance |
| Narrative/Logic | Ace Attorney | Deductive reasoning over dialogue and evidence, causal inference |
All games are delivered via a standardized Gym API. Game states are encoded as structured observations (e.g., position grids, inventories, or dialogue context), and action spaces are normalized (e.g., discrete moves, block manipulation, dialogue selection), allowing for model-agnostic, iterative interaction.
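For concreteness, here is a hedged sketch of how a Sokoban-style state might be rendered as a symbolic text observation with a normalized discrete action space; the symbols, legend, and helper names are assumptions, not the benchmark's exact encoding.

```python
# Encode a grid of cell types as an ASCII board plus legend and legal actions,
# the kind of structured observation an LLM agent can consume directly.
SYMBOLS = {"wall": "#", "floor": ".", "box": "$", "goal": "*", "player": "@"}
ACTIONS = ["up", "down", "left", "right"]   # normalized discrete action space

def encode_grid(grid: list[list[str]]) -> str:
    """Render the board, a symbol legend, and the legal action list as text."""
    board = "\n".join("".join(SYMBOLS[cell] for cell in row) for row in grid)
    legend = ", ".join(f"{sym} = {name}" for name, sym in SYMBOLS.items())
    return f"Board:\n{board}\nLegend: {legend}\nLegal actions: {', '.join(ACTIONS)}"

grid = [
    ["wall", "wall",   "wall", "wall"],
    ["wall", "player", "box",  "wall"],
    ["wall", "floor",  "goal", "wall"],
    ["wall", "wall",   "wall", "wall"],
]
print(encode_grid(grid))
```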
4. Model Evaluation, Diagnostics, and Correlation Analysis
Thirteen leading LLMs were evaluated on all games, both with and without the harness scaffolds. When evaluated without harness support (i.e., raw outputs on visually ambiguous tasks), almost all models clustered near random performance. Applying the harness substantially improved both overall scores and the discrimination between models: the best models (e.g., o3, o1, Gemini-2.5-pro-preview, Claude-3.7) reliably outperformed random baselines and ranked meaningfully apart from one another.
A comprehensive Spearman correlation and latent factor analysis showed that every game in the suite measures a unique, non-redundant mix of abilities. Sokoban, for example, aligned closely with traditional math and code benchmarks; Tetris and 2048 correlated more strongly with pattern recognition; and Ace Attorney mapped uniquely to long-horizon language understanding. This suggests that lmgame-Bench captures a set of competencies complementary to those measured by standard NLP or code/math tests.
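The rank-correlation computation itself is straightforward; the sketch below uses SciPy with placeholder scores (not reported results) to show the shape of such an analysis.

```python
# Spearman rank correlation between per-model game scores and an external benchmark.
# All numeric values are placeholders for illustration only.
from scipy.stats import spearmanr

models         = ["model_a", "model_b", "model_c", "model_d", "model_e"]
sokoban_scores = [0.82, 0.61, 0.55, 0.40, 0.12]   # placeholder game scores
math_bench     = [0.88, 0.59, 0.62, 0.35, 0.10]   # placeholder external benchmark scores

rho, p_value = spearmanr(sokoban_scores, math_bench)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```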
5. Reinforcement Learning and Transferability
The benchmark framework demonstrates significant transfer effects from reinforcement learning within video game environments to external planning and problem-solving tasks. For instance, a Qwen-2.5-7B-Instruct model RL-trained on simplified Sokoban or Tetris with explicit reasoning prompts (using a RAGEN-based multi-turn RL system) exhibits not only improved game-specific proficiency but also clear performance gains on larger-scale Sokoban boards, classic planning benchmarks (Blocksworld), and even unrelated external tasks (WebShop).
This cross-task transfer implies that skills learned through game-driven RL fine-tuning in the lmgame-Bench environment generalize to planning and sequential decision-making domains outside the strictly defined input-output spaces of the games themselves.
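Below is a minimal sketch of the multi-turn trajectory collection such RL fine-tuning relies on; it is conceptual rather than the RAGEN implementation, and `policy_llm.generate` returning a (reasoning, action) pair is an assumption.

```python
# Roll out one game episode, keeping (prompt, reasoning, action, reward) tuples
# so that the episode return can later serve as the RL training signal.
def collect_trajectory(env, policy_llm, max_turns: int = 30):
    obs, _ = env.reset()
    trajectory, episode_return = [], 0.0
    for _ in range(max_turns):
        prompt = f"State:\n{obs}\nReason step by step, then choose an action."
        reasoning, action = policy_llm.generate(prompt)        # explicit CoT + action
        obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append({"prompt": prompt, "reasoning": reasoning,
                           "action": action, "reward": reward})
        episode_return += reward
        if terminated or truncated:
            break
    return trajectory, episode_return
```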
6. Technical Implementation and Availability
Each game is formalized as a (P)OMDP, with explicit definitions of:
- State space
- Action space
- Transition/reward function
For example, the per-step reward in 2048 can follow the game's native scoring (the value of newly merged tiles); the exact reward definitions for each game are specified in the released environments.
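As an illustrative sketch only (assuming the native 2048 scoring rule; the symbols $\mathcal{M}_t$ and $v(\cdot)$ are introduced here for exposition), the per-step and episode rewards can be written as

$$
r_t = \sum_{m \in \mathcal{M}_t} v(m), \qquad R = \sum_{t=0}^{T} r_t,
$$

where $\mathcal{M}_t$ is the set of tile merges performed at step $t$ and $v(m)$ is the value of the tile created by merge $m$.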
Prompt optimization routines employ a two-stage DSPy+SIMBA approach, standardizing instructions and cues fed to models. The entire benchmark framework, including the gymnasium interface, contamination remediation tools, modular harnesses, and reproducible evaluation scripts, is available as open-source code at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
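A hedged sketch of what a two-stage DSPy pipeline of this kind could look like follows; `dspy.SIMBA`, its arguments, and the signature string and metric below are assumptions that should be checked against the installed DSPy version and the released scripts.

```python
# Hedged sketch: a standardized chain-of-thought template (stage 1) followed by
# automatic prompt search against a game-score metric (stage 2).
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model identifier

# Stage 1: one standardized instruction template shared across games.
game_agent = dspy.ChainOfThought("game_state -> next_action")

# Stage 2: optimize instructions/demonstrations with a task-specific metric.
def game_score_metric(example, prediction, trace=None):
    return float(prediction.next_action.strip() == example.next_action)

train_examples = [
    dspy.Example(game_state="boxes at B2; player at A1; goal at C2", next_action="right")
    .with_inputs("game_state"),
]
optimizer = dspy.SIMBA(metric=game_score_metric)
optimized_agent = optimizer.compile(game_agent, trainset=train_examples)
```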
7. Implications and Future Directions
lmgame-Bench demonstrates the viability of robust, game-based evaluation for LLMs. By solving critical issues around perception and prompt sensitivity and enforcing contamination safeguards, the benchmark delivers reliable, granular assessments of core agentic abilities—planning, memory retention, and adaptive reasoning.
A notable finding is that RL fine-tuning on a single game often enhances both in-domain generalization (unseen levels or larger board sizes) and out-of-domain transfer (other planning benchmarks), highlighting the potential of game-based training as a substrate for developing more general-purpose LLM agents.
Going forward, lmgame-Bench’s unified, extensible framework can be expanded to include additional genres and more complex, real-time environments. Its demonstrated discrimination among leading models, together with transferability evidence, positions it as a primary testbed for tracking general cognitive progress in the LLM paradigm.