lmgame-Bench: LLM Evaluation in Games
- lmgame-Bench is a comprehensive evaluation suite that converts real video games into contamination-robust benchmarks using a unified Gym-style API.
- The framework integrates perception, memory, and reasoning scaffolds to decouple visual challenges from high-level planning and long-term strategic reasoning.
- RL fine-tuning on lmgame-Bench not only boosts in-domain performance but also enhances transfer to external planning and decision-making tasks.
lmgame-Bench is a comprehensive evaluation suite developed to rigorously assess LLMs in video game environments that demand perception, memory, planning, and long-term strategic reasoning. Rather than relying on hand-crafted tasks or simulated puzzles, lmgame-Bench transforms a suite of real video games into controlled and contamination-robust benchmarks, pairing each with lightweight scaffolds for perception, memory tracking, and reasoning, and delivering all games through a unified Gym-style API. This structure enables consistent, transferable evaluation across LLMs while explicitly mitigating challenges inherent to using games for model assessment, such as brittle visual perception, prompt sensitivity, and data contamination (2505.15146).
1. Design Principles and Benchmark Structure
lmgame-Bench was constructed to “turn games into reliable evaluations” by anchoring on well-known titles (e.g., Super Mario Bros., Sokoban, Tetris, 2048, Candy Crush, Ace Attorney). Its design centers on three core scaffolds:
- Perception Module: Converts visual or grid-based inputs into symbolic or textual descriptions interpretable by LLMs, circumventing the visual brittleness that undermines direct vision-language evaluation.
- Memory Scaffold: Maintains persistent context for the model, tracking previous game states or actions, thereby enabling long-horizon reasoning and planning.
- Reasoning Scaffold: Supports explicit chain-of-thought (CoT) output—encouraging models to verbalize their intermediate reasoning before action selection.
These modules decouple raw perception difficulties from higher-level cognition, allowing the benchmark to probe an LLM’s underlying planning and reasoning abilities rather than its capacity for low-level pixel interpretation. The framework unifies all supported games under a Gym-style API with standardized observation and action spaces, enabling iterative, multi-turn interaction with LLM-driven agents.
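The sketch below illustrates this interaction pattern in Python; the class and method names (e.g., `MemoryScaffold`, `perception.to_text`, `llm.act`) are hypothetical stand-ins for exposition, not the actual lmgame-Bench API.

```python
# Illustrative Gym-style interaction loop with the three scaffolds.
# Names are hypothetical; env.reset()/env.step() follow the Gymnasium convention.
from dataclasses import dataclass, field

@dataclass
class MemoryScaffold:
    """Keeps a rolling transcript of past observations and actions."""
    history: list = field(default_factory=list)
    max_turns: int = 20

    def update(self, observation: str, action: str) -> None:
        self.history.append((observation, action))
        self.history = self.history[-self.max_turns:]

    def render(self) -> str:
        return "\n".join(f"obs: {o} -> action: {a}" for o, a in self.history)


def run_episode(env, llm, perception, memory: MemoryScaffold, max_steps: int = 100) -> float:
    """Multi-turn loop: symbolic observation -> chain-of-thought prompt -> action."""
    obs, _ = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        symbolic_obs = perception.to_text(obs)            # perception scaffold
        prompt = (                                        # reasoning scaffold (CoT prompt)
            f"Game history:\n{memory.render()}\n\n"
            f"Current state:\n{symbolic_obs}\n\n"
            "Think step by step, then output a single legal action."
        )
        action = llm.act(prompt)
        obs, reward, terminated, truncated, _ = env.step(action)
        memory.update(symbolic_obs, action)               # memory scaffold
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```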
2. Addressing Key Challenges in Game-Based LLM Evaluation
lmgame-Bench systematically addresses three principal obstacles when deploying LLMs in game settings:
- Brittle Vision Perception: Directly feeding RGB frames to LLMs, even those augmented with vision, yields poor performance due to inadequate object recognition and scene parsing; symbolic state conversion remedies this.
- Prompt Sensitivity: LLM performance exhibits high variance with minor prompt changes. To mitigate this, prompt standardization and a two-stage optimization process based on DSPy and SIMBA routines are implemented, as outlined in the experimental framework.
- Potential Data Contamination: Since many game assets or level layouts are accessible and potentially seen during pretraining, lmgame-Bench incorporates dedicated contamination checks at the visual and script/text level. Prompt interventions—such as entity masking, paraphrasing, and metadata removal—seek to stabilize results and ensure outcome validity.
This tripartite set of interventions ensures relative immunity to prior exposure and emphasizes genuine problem-solving skill over memorization.
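A minimal sketch of the entity-masking intervention mentioned above, assuming a hand-specified entity list; the patterns and placeholders here are illustrative, not lmgame-Bench's own.

```python
# Replace well-known game entities with neutral placeholders so that
# pretraining familiarity with named assets is less useful to the model.
import re

ENTITY_MASKS = {
    r"\bMario\b": "the player character",
    r"\bGoomba\b": "enemy A",
    r"\bSokoban\b": "the box-pushing puzzle",
}

def mask_entities(prompt: str) -> str:
    """Apply each masking pattern to the prompt text."""
    for pattern, placeholder in ENTITY_MASKS.items():
        prompt = re.sub(pattern, placeholder, prompt)
    return prompt

print(mask_entities("Mario must avoid the Goomba while pushing boxes as in Sokoban."))
```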
3. Game Suite and Unifying API
The benchmark encompasses a diverse selection of platformer, puzzle, and narrative games:
| Genre | Example Titles | Cognitive Skills Probed |
|---|---|---|
| Platformer | Super Mario Bros. | Spatial navigation, timing, dynamic planning |
| Puzzle | Sokoban, Tetris, 2048, Candy Crush | Symbolic reasoning, spatial manipulation, fault tolerance |
| Narrative/Logic | Ace Attorney | Deductive reasoning over dialogue and evidence, causal inference |
All games are delivered via a standardized Gym API. Game states are encoded as structured observations (e.g., position grids, inventories, or dialogue context), and action spaces are normalized (e.g., discrete moves, block manipulation, dialogue selection), allowing for model-agnostic, iterative interaction.
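For concreteness, here is a hedged sketch of how a Sokoban-style state might be rendered as a symbolic text observation with a normalized discrete action space; the symbols, legend, and helper names are assumptions, not the benchmark's exact encoding.

```python
# Encode a grid of cell types as an ASCII board plus legend and legal actions,
# the kind of structured observation an LLM agent can consume directly.
SYMBOLS = {"wall": "#", "floor": ".", "box": "$", "goal": "*", "player": "@"}
ACTIONS = ["up", "down", "left", "right"]   # normalized discrete action space

def encode_grid(grid: list[list[str]]) -> str:
    """Render the board, a symbol legend, and the legal action list as text."""
    board = "\n".join("".join(SYMBOLS[cell] for cell in row) for row in grid)
    legend = ", ".join(f"{sym} = {name}" for name, sym in SYMBOLS.items())
    return f"Board:\n{board}\nLegend: {legend}\nLegal actions: {', '.join(ACTIONS)}"

grid = [
    ["wall", "wall",   "wall", "wall"],
    ["wall", "player", "box",  "wall"],
    ["wall", "floor",  "goal", "wall"],
    ["wall", "wall",   "wall", "wall"],
]
print(encode_grid(grid))
```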
4. Model Evaluation, Diagnostics, and Correlation Analysis
Thirteen leading LLMs were evaluated on all games, both with and without the harness scaffolds. When evaluated without harness support (i.e., raw outputs on visually ambiguous tasks), almost all models clustered near random performance. Applying the harness substantially improved both overall scores and the discrimination between models: the best models (e.g., o3, o1, Gemini-2.5-pro-preview, Claude-3.7) reliably outperformed random baselines and ranked meaningfully apart from one another.
A comprehensive Spearman correlation and latent factor analysis showed that every game in the suite measures a unique, non-redundant mix of abilities. Sokoban, for example, aligned closely with traditional math and code benchmarks; Tetris and 2048 correlated more strongly with pattern recognition; and Ace Attorney mapped uniquely to long-horizon language understanding. This suggests that lmgame-Bench captures a set of competencies complementary to those measured by standard NLP or code/math tests.
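The rank-correlation computation itself is straightforward; the sketch below uses SciPy with placeholder scores (not reported results) to show the shape of such an analysis.

```python
# Spearman rank correlation between per-model game scores and an external benchmark.
# All numeric values are placeholders for illustration only.
from scipy.stats import spearmanr

models         = ["model_a", "model_b", "model_c", "model_d", "model_e"]
sokoban_scores = [0.82, 0.61, 0.55, 0.40, 0.12]   # placeholder game scores
math_bench     = [0.88, 0.59, 0.62, 0.35, 0.10]   # placeholder external benchmark scores

rho, p_value = spearmanr(sokoban_scores, math_bench)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```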
5. Reinforcement Learning and Transferability
The benchmark framework demonstrates significant transfer effects from reinforcement learning within video game environments to external planning and problem-solving tasks. For instance, a Qwen-2.5-7B-Instruct model RL-trained on simplified Sokoban or Tetris with explicit reasoning prompts (using a RAGEN-based multi-turn RL system) exhibits not only improved game-specific proficiency but also clear performance gains on larger-scale Sokoban boards, classic planning benchmarks (Blocksworld), and even unrelated external tasks (WebShop).
This cross-task transfer implies that skills learned through game-driven RL fine-tuning in the lmgame-Bench environment generalize to planning and sequential decision-making domains outside the strictly defined input-output spaces of the games themselves.
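Below is a minimal sketch of the multi-turn trajectory collection such RL fine-tuning relies on; it is conceptual rather than the RAGEN implementation, and `policy_llm.generate` returning a (reasoning, action) pair is an assumption.

```python
# Roll out one game episode, keeping (prompt, reasoning, action, reward) tuples
# so that the episode return can later serve as the RL training signal.
def collect_trajectory(env, policy_llm, max_turns: int = 30):
    obs, _ = env.reset()
    trajectory, episode_return = [], 0.0
    for _ in range(max_turns):
        prompt = f"State:\n{obs}\nReason step by step, then choose an action."
        reasoning, action = policy_llm.generate(prompt)        # explicit CoT + action
        obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append({"prompt": prompt, "reasoning": reasoning,
                           "action": action, "reward": reward})
        episode_return += reward
        if terminated or truncated:
            break
    return trajectory, episode_return
```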
6. Technical Implementation and Availability
Each game is formalized as a (P)OMDP, with explicit definitions of:
- State space
- Action space
- Transition/reward function
For example, the per-step reward in 2048 can follow the game's native scoring (the value of newly merged tiles); the exact reward definitions for each game are specified in the released environments.
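As an illustrative sketch only (assuming the native 2048 scoring rule; the symbols $\mathcal{M}_t$ and $v(\cdot)$ are introduced here for exposition), the per-step and episode rewards can be written as

$$
r_t = \sum_{m \in \mathcal{M}_t} v(m), \qquad R = \sum_{t=0}^{T} r_t,
$$

where $\mathcal{M}_t$ is the set of tile merges performed at step $t$ and $v(m)$ is the value of the tile created by merge $m$.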
Prompt optimization routines employ a two-stage DSPy+SIMBA approach, standardizing instructions and cues fed to models. The entire benchmark framework, including the gymnasium interface, contamination remediation tools, modular harnesses, and reproducible evaluation scripts, is available as open-source code at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
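A hedged sketch of what a two-stage DSPy pipeline of this kind could look like follows; `dspy.SIMBA`, its arguments, and the signature string and metric below are assumptions that should be checked against the installed DSPy version and the released scripts.

```python
# Hedged sketch: a standardized chain-of-thought template (stage 1) followed by
# automatic prompt search against a game-score metric (stage 2).
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model identifier

# Stage 1: one standardized instruction template shared across games.
game_agent = dspy.ChainOfThought("game_state -> next_action")

# Stage 2: optimize instructions/demonstrations with a task-specific metric.
def game_score_metric(example, prediction, trace=None):
    return float(prediction.next_action.strip() == example.next_action)

train_examples = [
    dspy.Example(game_state="boxes at B2; player at A1; goal at C2", next_action="right")
    .with_inputs("game_state"),
]
optimizer = dspy.SIMBA(metric=game_score_metric)
optimized_agent = optimizer.compile(game_agent, trainset=train_examples)
```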
7. Implications and Future Directions
lmgame-Bench demonstrates the viability of robust, game-based evaluation for LLMs. By solving critical issues around perception and prompt sensitivity and enforcing contamination safeguards, the benchmark delivers reliable, granular assessments of core agentic abilities—planning, memory retention, and adaptive reasoning.
A notable finding is that RL fine-tuning on a single game often enhances both in-domain generalization (unseen levels or larger board sizes) and out-of-domain transfer (other planning benchmarks), highlighting the potential of game-based training as a substrate for developing more general-purpose LLM agents.
Going forward, lmgame-Bench’s unified, extensible framework can be expanded to include additional genres and more complex, real-time environments. Its demonstrated discrimination among leading models, together with transferability evidence, positions it as a primary testbed for tracking general cognitive progress in the LLM paradigm.