- The paper introduces the TiG framework, bridging declarative and procedural knowledge by recasting reinforcement learning as a language modeling task for game strategy.
- It employs Group Relative Policy Optimization (GRPO) to stabilize training, achieving up to 90.91% accuracy in macro-level action prediction with efficient resource use.
- Evaluation shows that TiG-trained models maintain general reasoning abilities while providing transparent, step-by-step natural language guidance in dynamic game environments.
Think in Games: Learning to Reason in Games via Reinforcement Learning with LLMs
Introduction and Motivation
The paper addresses a fundamental gap in AI: the distinction between declarative knowledge (knowing about something) and procedural knowledge (knowing how to do something). While LLMs excel at complex reasoning tasks, they struggle with interactive, procedural tasks that require dynamic decision-making, such as those found in digital games. Conversely, reinforcement learning (RL) agents can acquire procedural knowledge through interaction, but they lack transparency and require extensive training data. The proposed Think-In-Games (TiG) framework aims to bridge this gap by enabling LLMs to develop procedural understanding through direct interaction with game environments, focusing specifically on macro-level strategic reasoning in MOBA games.
TiG reformulates RL-based decision-making as a language modeling task. The game state is represented as a structured JSON object that captures all relevant information from the player's perspective (Figure 1).
Figure 1: Demonstration of JSON object for each game state.
The macro-level action space is discretized into 40 interpretable actions, each corresponding to a strategic objective (e.g., "Push Top Lane", "Secure Dragon"). The policy model is an LLM trained to map game states to macro-level actions while providing step-by-step natural language reasoning. The task is formalized as learning a mapping f: (s_t, i_t) ↦ (a_t, c_t), where s_t is the game state, i_t is the context, a_t is the macro-level action, and c_t is the reasoning chain.
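To make the setup concrete, here is a minimal sketch of this state-to-prompt mapping. The JSON fields, the action subset beyond the two named above, and the prompt wording are hypothetical placeholders, not the paper's actual schema; the <think>/<result> tags mirror the output format shown in the case-study figures.

```python
import json

# Hypothetical game-state snapshot; field names are illustrative only and do
# not reproduce the paper's JSON schema.
game_state = {
    "game_time": "08:32",
    "gold_diff": -1200,
    "towers": {"ally": 7, "enemy": 9},
    "heroes": [
        {"name": "hero_a", "lane": "top", "hp_pct": 0.45},
        {"name": "hero_b", "lane": "mid", "hp_pct": 0.80},
    ],
    "objectives": {"dragon_spawn_in_s": 20},
}

# A small subset of the 40 discretized macro actions (the two named in the
# text plus invented placeholders).
MACRO_ACTIONS = ["Push Top Lane", "Secure Dragon", "Defend Base", "Group Mid"]

def build_prompt(state: dict, context: str) -> str:
    """Serialize (s_t, i_t) into the text prompt consumed by the LLM policy,
    which must answer with a reasoning chain c_t and one macro action a_t."""
    return (
        "You are the macro-level strategist for the main player.\n"
        f"Context: {context}\n"
        f"Game state (JSON):\n{json.dumps(state, indent=2)}\n"
        "Reason step by step inside <think></think>, then output exactly one "
        f"macro action from {MACRO_ACTIONS} inside <result></result>."
    )

print(build_prompt(game_state, "The main player is the team's jungler."))
```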
Reinforcement Learning with GRPO
TiG employs Group Relative Policy Optimization (GRPO) for RL, which optimizes the policy model using feedback from game state-action pairs. GRPO normalizes rewards within a batch of completions, stabilizing training and encouraging competitive, context-aware responses (Figure 2).
Figure 2: Demonstration of GRPO training with Game State.
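For reference, the "normalization within a batch of completions" described above is usually expressed as a group-relative advantage. The formula below follows the standard GRPO definition and may differ cosmetically from the paper's notation:

```latex
% Group-relative advantage for the i-th completion in a sampled group of
% size G, where r_i is the scalar reward assigned to that completion.
\hat{A}_i = \frac{r_i - \operatorname{mean}\!\left(\{r_1, \dots, r_G\}\right)}
                 {\operatorname{std}\!\left(\{r_1, \dots, r_G\}\right)}
```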
The reward function is rule-based and binary, assigning a reward of 1 if the predicted action matches the ground truth and 0 otherwise. This design avoids the complexity of neural reward models and ensures alignment with strategic objectives. The GRPO objective incorporates token-level KL regularization to constrain policy drift, enabling efficient and stable learning.
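A minimal sketch of how such a rule-based reward and the group-relative advantages could be computed is shown below. The <result> tag parsing, the action names, and the group size are assumptions for illustration rather than the paper's exact implementation, and the token-level KL penalty enters in the policy-gradient objective, so it is omitted here.

```python
import re
from statistics import mean, pstdev

def action_reward(completion: str, gold_action: str) -> float:
    """Binary rule-based reward: 1 if the predicted macro action matches the
    ground-truth action, 0 otherwise (tag format is illustrative)."""
    match = re.search(r"<result>(.*?)</result>", completion, re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_action else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Example: four completions sampled for the same game state.
gold = "Secure Dragon"
completions = [
    "<think>Dragon spawns soon and mid is pushed...</think><result>Secure Dragon</result>",
    "<think>Top tower is low...</think><result>Push Top Lane</result>",
    "<think>We hold vision near the pit...</think><result>Secure Dragon</result>",
    "<think>The enemy is grouping...</think><result>Defend Base</result>",
]
rewards = [action_reward(c, gold) for c in completions]
print(rewards)                    # [1.0, 0.0, 1.0, 0.0]
print(group_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```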
Experimental Results
TiG demonstrates strong performance in the action prediction task, with multi-stage training (SFT + GRPO) yielding substantial improvements across model sizes. Notably, Qwen-3-14B with SFT and extended GRPO training achieves 90.91% accuracy, surpassing DeepSeek-R1 (86.67%) despite the latter's far larger parameter count (Figure 3).
Figure 3: Action Prediction Task.
This result highlights the efficiency and scalability of TiG, enabling smaller models to rival or outperform much larger ones. The introduction of GRPO consistently yields significant accuracy gains, confirming its effectiveness for boosting reasoning capabilities in LLMs.
Generalization and Benchmark Evaluation
TiG-trained models maintain or slightly improve general language and reasoning abilities on standard benchmarks (Ape210K, MMLU, CEval, BBH, IfEval, CharacterEval), indicating that domain-specific improvements do not compromise general capabilities.
Training Dynamics
Analysis of the RL training process shows that response length and reward trends vary by model architecture, with larger models benefiting from longer, more detailed reasoning chains (Figure 4).


Figure 4: Qwen2.5-14B training dynamics.
Error Analysis
Error cases are categorized into basic game knowledge errors, game state misinterpretation, critical event oversight, situational misjudgment, and spatio-temporal miscoordination. TiG reduces the frequency of these errors compared to baseline models, achieving results comparable to much larger models.
Case Studies
Qualitative case studies illustrate TiG's capacity for deep, context-aware reasoning and actionable guidance. The model analyzes complex game states, prioritizes objectives, formulates strategies, and integrates hero-specific playstyles, providing clear, executable instructions (Figures 5–14).
Figure 5: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Figure 6: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Figure 7: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Figure 8: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Figure 9: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Figure 10: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Figure 11: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Figure 12: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Figure 13: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Figure 14: One of the cases of TiG. <think> </think> refers to the thinking process of model output, and <result> </result> refers to the model guidance to the main player in natural language.
Implications and Future Directions
TiG demonstrates that LLMs can acquire procedural knowledge and strategic reasoning through RL-based interaction with game environments, while retaining transparency and interpretability. The framework achieves competitive performance with reduced data and computational requirements, and produces step-by-step natural language explanations for its decisions.
Implications:
- TiG enables the development of AI agents that can both act effectively and explain their reasoning in dynamic environments.
- The approach is scalable and efficient, making it suitable for deployment in real-time applications.
- The language-guided policy paradigm enhances interpretability and trust in AI decision-making.
Future Directions:
- Extending TiG to other interactive domains (e.g., robotics, real-world tasks) to assess generalizability.
- Incorporating multimodal feedback (visual, auditory) for richer procedural learning.
- Investigating long-term reasoning and memory mechanisms for tasks requiring extended temporal abstraction.
Conclusion
The Think-In-Games framework provides a principled approach for bridging the gap between declarative and procedural knowledge in LLMs. By reformulating RL as a language modeling task and leveraging GRPO, TiG enables LLMs to develop strategic reasoning capabilities through direct interaction with game environments. The framework achieves strong empirical results, preserves general language abilities, and enhances interpretability, representing a significant step toward more capable and transparent AI agents in complex, interactive domains.