Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games (2506.03610v1)
Abstract: LLM agents are reshaping the game industry, particularly with more intelligent and human-preferable game characters. However, existing game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets for aligning pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a foundational benchmark designed to train and evaluate LLM agents across diverse real-world video games. Unlike existing benchmarks, Orak includes 12 popular video games spanning all major genres, enabling comprehensive studies of LLM capabilities and agentic modules essential for intricate game scenarios. To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on Model Context Protocol (MCP) that enables LLMs to seamlessly connect with games and manipulate agentic modules. Additionally, we propose a fine-tuning dataset, consisting of LLM gameplay trajectories across diverse game genres. Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects, establishing a foundation towards building generic gaming agents. Code is available at https://github.com/krafton-ai/Orak.
Summary
- The paper introduces Orak, a benchmark that evaluates LLM agents on 12 games spanning 6 genres to address gaps in diversity and agentic module assessment.
- It employs a plug-and-play interface based on the Model Context Protocol (MCP) to integrate LLMs seamlessly with game environments, testing capabilities such as rule following and logical and spatial reasoning.
- The benchmark provides a fine-tuning dataset of expert gameplay trajectories, enabling improved performance and generalization on both in-distribution and novel game scenarios.
The paper "Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games" (2506.03610) introduces a new benchmark designed to address shortcomings in existing methods for evaluating LLM agents in gaming. Current benchmarks often lack diversity in game genres, provide insufficient assessment of crucial agentic modules (like self-reflection, memory, and tool use), and do not offer fine-tuning datasets to adapt pre-trained LLMs into effective gaming agents.
Orak aims to fill these gaps by providing:
- Diverse Game Environments: It includes 12 popular, real-world video games spanning six major genres: action (Street Fighter III, Super Mario), adventure (Ace Attorney, Her Story), role-playing (Pokémon Red, Darkest Dungeon), simulation (Minecraft, Stardew Valley), strategy (StarCraft II, Slay the Spire), and puzzle (Baba Is You, 2048). This diversity allows for comprehensive testing of various LLM capabilities.
- Plug-and-Play Interface: Orak utilizes the Model Context Protocol (MCP) to connect LLMs seamlessly with game environments and agentic modules. Game environments and agentic modules operate as independent MCP servers that expose callable functions for game mechanics (e.g., get state, execute step) and agentic strategies (e.g., reflection, planning), enabling consistent, streamlined evaluation across games and LLMs (a toy server sketch follows this list).
- Fine-tuning Dataset: A dataset of LLM gameplay trajectories across all Orak games is provided. These trajectories, generated by expert LLMs (e.g., GPT-4o) using various agentic strategies, are intended to help fine-tune pre-trained LLMs into more effective and resource-efficient gaming agents. The data encapsulates meta-knowledge on using agentic strategies for different game genres.
- Comprehensive Evaluation Framework: Orak offers general game score leaderboards, LLM battle arenas for competitive games, and in-depth analyses of visual input states, agentic strategies, and fine-tuning effects.
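To make the MCP-server idea concrete, below is a minimal sketch of a game-side server exposing get-state and execute-step tools. It assumes the official MCP Python SDK's FastMCP helper; the server name, tool signatures, and stub game backend are illustrative assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch of a game-side MCP server (illustrative, not Orak's actual code).
# Assumes the official MCP Python SDK's FastMCP helper; the game backend is a stub.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orak-style-game-server")

@mcp.tool()
def get_state() -> str:
    """Return the current game state serialized as text for the LLM."""
    # A real server would query the running game process here.
    return '{"player_hp": 100, "enemy_hp": 80, "distance": 3}'

@mcp.tool()
def execute_step(action: str) -> str:
    """Apply the LLM's chosen action and advance the game by one step."""
    # A real server would forward the action to the game and return the new state.
    return f"executed: {action}"

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio by default
```

An agentic-module server (e.g., reflection or planning) would follow the same pattern, exposing its strategies as callable tools.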
LLM Capabilities Assessed by Orak:
The paper outlines seven key LLM capabilities required for gameplay, measured on a 1-3 scale:
- Rule Following (RF): Adherence to game-specific rules.
- Logical Reasoning (LR): Number of reasoning hops for an action.
- Spatial Reasoning (SR): Level of spatial understanding.
- Long-text Understanding (LTU): Comprehension of long contexts.
- Long-term Planning (LP): Extent of strategic planning needed.
- Error Handling (EH): Necessity of error correction and re-planning.
- Odds Handling (OH): Understanding and managing randomness.

Different games in Orak demand varying levels of these capabilities. For instance, action games rely heavily on SR and RF, while adventure games emphasize LTU and LR.
Orak Game Environments (Summary):
Game | Genre | State Input (to LLM) | Action Space (for LLM) | Evaluation Task & Metric |
---|---|---|---|---|
Street Fighter III | Action | Player/opponent stats, gauges, distance between characters. | 15-20 discrete actions (e.g., 'move closer', 'low punch'). | Beat game bot; number of stages cleared. |
Super Mario | Action | Positions (x,y) and sizes of obstacles/enemies. | Mario moves right; LLM decides jump level (6 bins). | Reach flagpole; horizontal distance traveled. |
Ace Attorney | Adventure | Dialogue history, evidence, court records. | Courtroom actions (advance dialogue, press witness, present evidence). | Response correctness and total steps. |
Her Story | Adventure | History of queries/results, metadata for first 5 clips (visual desc., date, transcript if played). | Search clips with keywords, select video to play. | Uncover truth; number of distinct clips viewed to complete. |
Pokémon Red | Role-Playing | Player location, party Pokémon stats, inventory, battle state, screen text. | High-level tools or low-level joypad actions. | Defeat Brock (1st gym leader); number of 12 predefined storyline flags triggered. |
Darkest Dungeon | Role-Playing | Party status (stats, health, stress, effects), skills, enemy encounters. | Combat actions (e.g., 'attack', 'heal', 'swap'). | Complete first expedition; sum of successful combats, survived heroes, remaining stress capacities. |
Minecraft | Simulation | Player position, inventory, health, nearby blocks/biome. | Executable JavaScript code (Mineflayer). | Craft target item; item collected in inventory. |
Stardew Valley | Simulation | Player location, energy, inventory, crop/soil status, date, weather. | Farming actions (till, water, harvest, sell), house actions, sleep. | Earn most money by harvesting crops in first 13 in-game days; total profit. |
StarCraft II | Strategy | Resources, unit/building counts, production queues, research, observed enemy info. | 72 discrete actions (unit training, building, research, operations). | Beat built-in AI bots; win rate. |
Slay the Spire | Strategy | Player class, deck, hand, health, relics, energy, enemy intents/statuses, current floor. | Play card, end turn, select card reward. | Defeat final boss; number of floors reached. |
Baba Is You | Puzzle | Coordinates of text/object tiles, active rules. | Single movement (up, down, left, right) or sequence of moves. | Solve first stage; partial credit for sub-goals if not cleared. |
2048 | Puzzle | Current 4x4 grid configuration. | Four discrete actions (up, down, left, right). | Create 2048 tile; normalized progress toward 2048 tile. |
Orak Evaluation Pipeline:
The evaluation pipeline (Figure 2 in the paper) shows how eval.py orchestrates the process. Users can configure the game, LLM backbone, and agentic strategy. The LLM interacts with game and agentic-module MCP servers to retrieve game states, perform action inference using agentic strategies, and execute game steps. Submissions can involve new LLMs (customizing LLM.py) or new agentic strategies (customizing agent.py).
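Adding a new LLM backbone mainly amounts to wrapping it behind a text-in/text-out interface. The sketch below is a hypothetical shape for such a wrapper; the class and method names are assumptions for illustration, not the repository's actual API.

```python
# Hypothetical LLM-backbone wrapper (illustrative; the real interface is defined
# in the repository's LLM.py and may differ).
from dataclasses import dataclass

@dataclass
class MyLLM:
    model_name: str = "my-local-model"  # assumed identifier, not an Orak config value

    def generate(self, prompt: str) -> str:
        """Map a prompt (game state plus agentic-module instructions) to a text response."""
        # Replace this stub with a call to a local model or a provider API.
        return "PLACEHOLDER_ACTION"
```

This is the same generate-style interface assumed by the conceptual MCP snippet later in this summary.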
Fine-tuning Dataset Implementation:
The dataset is collected from expert LLMs (e.g., GPT-4o) playing the 12 Orak games using various agentic modules (e.g., reflection, planning, action).
- Data Format: Gameplay trajectories $\mathcal{T} = \{\tau_1, \ldots, \tau_T\}$, where $\tau_t$ is the sequence of LLM inferences at game step $t$. Each inference $\tau_t = \{(X_{a_i}, S, Y_{a_i})\}_{i=1}^{n}$ includes the prompt for agentic module $a_i$ ($X_{a_i}$), the game state ($S$), and the LLM response ($Y_{a_i}$). An example for Super Mario shows separate prompts and responses for a 'reflection' module and an 'action' module (Table 3); an illustrative sample record is sketched after this list.
- Data Selection: High-score trajectories are selected until the number of LLM inferences exceeds 900 per game (aiming for roughly 300 samples per agentic module when using a reflection-planning-action sequence). Across the 12 games this yields roughly 900 × 12 ≈ 10k samples.
- Data Augmentation: Each prompt $X_{a_i}$ is paraphrased 10 times by GPT-4o to increase linguistic diversity.
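As referenced above, a single fine-tuning sample might be serialized roughly as follows; the field names are assumptions based on the description above, not the dataset's exact schema.

```python
# Illustrative structure of one fine-tuning sample; field names are assumptions
# based on the description above, not the dataset's exact schema.
sample = {
    "game": "super_mario",
    "step": 12,               # game step t within the trajectory
    "module": "reflection",   # agentic module a_i that produced this inference
    "prompt": "...",          # X_{a_i}: module-specific instructions
    "state": "...",           # S: serialized game state at step t
    "response": "...",        # Y_{a_i}: expert LLM output (e.g., from GPT-4o)
}

# A trajectory at step t is then a short list of such inferences,
# e.g. reflection -> planning -> action entries sharing the same step.
```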
Experimental Findings and Practical Implications:
- LLM Gameplay Performance: Proprietary LLMs (Gemini-2.5-pro, o3-mini) generally outperformed open-source LLMs across games. Open-source LLMs struggled significantly in complex games like Pokémon Red, Minecraft, and StarCraft II. This suggests that for complex, long-horizon decision-making, larger, more capable models are currently necessary.
- LLM Arena (Competitive Play):
- Street Fighter III: Minitron-8B (open-source) surprisingly outperformed other LLMs, suggesting adversarial dynamics can change performance hierarchies.
- StarCraft II: Claude-3.7-Sonnet performed best.
- The arena results indicate that head-to-head performance can differ from performance against scripted bots, highlighting the need for varied evaluation modes.
- Ablation Study for Agentic Modules:
- GPT-4o benefited from extended agentic workflows (e.g., reflection-planning).
- LLaMA-3.2-3B showed limited or even negative impact from more complex agentic modules, suggesting simpler prompts might be better for smaller LLMs.
- This implies that the choice of agentic strategy should be tailored to the LLM's capability; adding complexity isn't always beneficial for smaller models.
- Effect of Visual Input:
- Image-only input led to substantial performance drops compared to text-only.
- Combining text and image inputs (Both) had mixed results. For some games/models, it helped (e.g., Claude in Street Fighter III), while for others, it was detrimental (e.g., GPT-4o in Ace Attorney).
- This highlights the current limitations of VLMs in extracting and reasoning over complex game visuals compared to well-structured text. For games where key information is not easily representable in text or requires nuanced visual understanding, multimodal inputs might be beneficial if the VLM can effectively integrate them. However, for games with comprehensive textual state descriptions, visual input can sometimes be a distraction.
- Effect of Fine-tuning:
- Intra-game generalization: Fine-tuning smaller LLMs (Llama-3.2-1B/3B) on expert trajectories improved their performance on unseen scenarios within the same game (e.g., new stages, characters). They learned to generate valid actions more reliably. However, spatial reasoning remained a challenge.
- Out-of-distribution (OOD) game generalization: Fine-tuning on trajectories from a set of games improved performance on held-out, unseen games (Super Mario, 2048). This suggests models can learn transferable decision-making routines.
- This is highly practical: smaller LLMs, once fine-tuned on expert trajectories, can potentially achieve much better performance than their off-the-shelf counterparts, making them more accessible for deployment. The fine-tuning dataset in Orak provides a direct path for this; a generic supervised fine-tuning recipe is sketched below.
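As a rough illustration of that recipe, the sketch below fine-tunes a small causal LM on (prompt, response) pairs. The model name, hyperparameters, and data handling are assumptions, not the paper's actual training configuration.

```python
# Rough supervised fine-tuning sketch on (prompt, response) pairs.
# Model name, hyperparameters, and data handling are illustrative assumptions,
# not the paper's actual training configuration.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # one of the backbones evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each sample pairs an agentic-module prompt (with game state) and the expert response.
pairs = [("<prompt with game state>", "<expert LLM response>")]

def collate(batch):
    texts = [p + r + tokenizer.eos_token for p, r in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=2048)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels                     # (prompt tokens could also be masked)
    return enc

loader = DataLoader(pairs, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss  # standard causal-LM cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```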
Implementation Considerations from the Paper:
- MCP Interface: This modular design is key for practical implementation. Developers can swap LLMs, games, or agentic modules with relative ease by adhering to the MCP. This simplifies experimentation and benchmarking.
```python
# Conceptual MCP interaction

# In eval.py
game_server = MCPClient("game_service_url")
agent_module_server = MCPClient("agent_service_url")
llm = MyLLM()

game_state = game_server.get_state()

# Agentic strategy: e.g., reflection then action
reflection_prompt = construct_reflection_prompt(game_state)
reflection = llm.generate(reflection_prompt)  # or agent_module_server.call_reflection(prompt)

action_prompt = construct_action_prompt(game_state, reflection)
action = llm.generate(action_prompt)          # or agent_module_server.call_action(prompt)

game_server.execute_step(action)
```
- Computational Requirements: Running 12 diverse games, some of which are commercial, implies varying setup complexities. LLM inference, especially for proprietary models via API calls, can be costly and time-consuming. The paper mentions pausing games during LLM inference for most games to manage real-time constraints, which is a practical workaround but differs from actual human play.
- Agentic Module Design: The finding that simpler strategies may be better for smaller LLMs is an important consideration. Overly complex agentic pipelines can increase prompt length and inference cost, potentially without proportional performance gains.
- State Representation: The paper details how game states (visual or internal) are converted to text/image inputs for LLMs. This is a critical step. For games like Super Mario, they use visual pattern matching. For others like Pokémon Red, they access internal game memory. This choice impacts the information available to the LLM and the engineering effort required.
- Action Space Design: Abstracting low-level game controls into higher-level actions (e.g., in Street Fighter III, Stardew Valley) simplifies the LLM's task but requires careful design of these abstractions; a toy sketch of state serialization and action abstraction follows this list.
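To ground the last two points, here is a toy sketch of serializing a structured game state into text and expanding high-level actions into low-level controller inputs. Entity fields, action names, and frame counts are illustrative assumptions, not the benchmark's actual encodings.

```python
# Toy sketch of state-to-text serialization and high-level action abstraction.
# Entity fields, action names, and frame counts are illustrative assumptions,
# not the benchmark's actual encodings.

def mario_state_to_text(entities: list[dict]) -> str:
    """Serialize detected obstacles/enemies (e.g., from visual pattern matching) into text."""
    lines = [f"- {e['type']} at x={e['x']}, y={e['y']}, size={e['w']}x{e['h']}"
             for e in entities]
    return "Objects ahead of Mario:\n" + "\n".join(lines)

# High-level actions exposed to the LLM, each expanding to low-level controller inputs.
ACTION_MAP: dict[str, list[str]] = {
    "move_right": ["RIGHT"] * 10,  # hold right for 10 frames
    "small_jump": ["A"] * 5,
    "large_jump": ["A"] * 20,
}

state_text = mario_state_to_text(
    [{"type": "goomba", "x": 120, "y": 32, "w": 16, "h": 16}]
)
frames = ACTION_MAP["small_jump"]  # translate the LLM's chosen action into inputs
```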
Conclusion:
Orak provides a standardized and comprehensive benchmark for developing and evaluating LLM agents in diverse video games. Its plug-and-play MCP interface, rich set of games, and fine-tuning dataset aim to accelerate research towards more generic and capable gaming LLM agents. The experimental results offer practical insights into model selection, agentic strategy design, and the use of multimodal inputs for game-playing agents.
Related Papers
- A Survey on Large Language Model-Based Game Agents (2024)
- Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games (2024)
- BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games (2024)
- lmgame-Bench: How Good are LLMs at Playing Games? (2025)
- VideoGameBench: Can Vision-Language Models complete popular video games? (2025)
GitHub
- GitHub - krafton-ai/Orak