Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models (2508.21365v1)

Published 29 Aug 2025 in cs.AI

Abstract: LLMs excel at complex reasoning tasks such as mathematics and coding, yet they frequently struggle with simple interactive tasks that young children perform effortlessly. This discrepancy highlights a critical gap between declarative knowledge (knowing about something) and procedural knowledge (knowing how to do something). Although traditional reinforcement learning (RL) agents can acquire procedural knowledge through environmental interaction, they often operate as black boxes and require substantial training data. In contrast, LLMs possess extensive world knowledge and reasoning capabilities, but are unable to effectively convert this static knowledge into dynamic decision-making in interactive settings. To address this challenge, we propose Think in Games (TiG), a novel framework that empowers LLMs to develop procedural understanding through direct interaction with game environments, while retaining their inherent reasoning and explanatory abilities. Specifically, TiG reformulates RL-based decision-making as a language modeling task: LLMs generate language-guided policies, which are refined iteratively through online reinforcement learning based on environmental feedback. Our experimental results show that TiG successfully bridges the gap between declarative and procedural knowledge, achieving competitive performance with dramatically lower data and computational demands compared to conventional RL methods. Moreover, TiG provides step-by-step natural language explanations for its decisions, greatly improving transparency and interpretability in complex interactive tasks.


Summary

  • The paper introduces the TiG framework, bridging declarative and procedural knowledge by recasting reinforcement learning as a language modeling task for game strategy.
  • It employs Group Relative Policy Optimization (GRPO) to stabilize training, achieving up to 90.91% accuracy in macro-level action prediction with efficient resource use.
  • Evaluation shows that TiG-trained models maintain general reasoning abilities while providing transparent, step-by-step natural language guidance in dynamic game environments.

Think in Games: Learning to Reason in Games via Reinforcement Learning with LLMs

Introduction and Motivation

The paper addresses a fundamental gap in AI: the distinction between declarative knowledge (knowing about something) and procedural knowledge (knowing how to do something). While LLMs excel at complex reasoning tasks, they struggle with interactive, procedural tasks that require dynamic decision-making, such as those found in digital games. Conversely, RL agents can acquire procedural knowledge through interaction but lack transparency and require extensive data. The proposed Think-In-Games (TiG) framework aims to bridge this gap by enabling LLMs to develop procedural understanding through direct interaction with game environments, specifically focusing on macro-level strategic reasoning in MOBA games.

Formalization of the TiG Framework

TiG reformulates RL-based decision-making as a language modeling task. The game state is represented as a structured JSON object, capturing all relevant information from the player's perspective (Figure 1).

Figure 1: Demonstration of JSON object for each game state.
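For concreteness, here is a minimal sketch of what such a state object might look like; the field names and values are illustrative assumptions, not the paper's exact schema:

```python
import json

# Hypothetical game-state snapshot from the main player's perspective.
# All field names and values are illustrative; the paper's actual JSON
# schema is not reproduced in this summary and may differ.
game_state = {
    "game_time_s": 495,
    "main_player": {"hero": "Marksman", "level": 9, "hp_pct": 0.72, "lane": "mid"},
    "allies": [{"hero": "Tank", "level": 8, "hp_pct": 0.55, "lane": "top"}],
    "visible_enemies": [{"hero": "Assassin", "level": 9, "location": "river"}],
    "turrets": {"ally_mid_tier1": "standing", "enemy_mid_tier1": "destroyed"},
    "objectives": {"dragon": "spawned"},
}
print(json.dumps(game_state, indent=2))
```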

The macro-level action space is discretized into 40 interpretable actions, each corresponding to a strategic objective (e.g., "Push Top Lane", "Secure Dragon"). The policy model is an LLM trained to map game states to macro-level actions and provide step-by-step natural language reasoning. The task is formalized as learning a mapping $f: (s_t, i_t) \mapsto (a_t, c_t)$, where $s_t$ is the game state, $i_t$ is the context, $a_t$ is the action, and $c_t$ is the reasoning chain.
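A minimal sketch of this mapping, assuming the `<think>`/`<result>` output format shown in the case-study figures; `generate` is a hypothetical stand-in for a call to the policy LLM, and the action list is abbreviated:

```python
import re

# A few example macro actions; the paper defines 40 such interpretable actions.
ACTIONS = ["Push Top Lane", "Push Mid Lane", "Secure Dragon", "Defend Base"]

def generate(prompt: str) -> str:
    """Hypothetical stand-in for the policy LLM; returns a canned completion here."""
    return ("<think>Dragon has spawned and the enemy mid turret is down.</think>"
            "<result>Secure Dragon</result>")

def policy_step(state_json: str, context: str) -> tuple[str, str]:
    """Map (s_t, i_t) to (a_t, c_t): a macro action plus its reasoning chain."""
    prompt = (
        "Reason step by step inside <think></think>, then output one macro action "
        f"from {ACTIONS} inside <result></result>.\n"
        f"Context: {context}\nGame state: {state_json}"
    )
    output = generate(prompt)
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    result = re.search(r"<result>(.*?)</result>", output, re.S)
    action = result.group(1).strip() if result else ""
    reasoning = think.group(1).strip() if think else ""
    return action, reasoning

action, reasoning = policy_step('{"objectives": {"dragon": "spawned"}}', "mid game")
print(action)  # "Secure Dragon"
```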

Reinforcement Learning with GRPO

TiG employs Group Relative Policy Optimization (GRPO) for RL, which optimizes the policy model using feedback from game state-action pairs. GRPO normalizes rewards within a batch of completions, stabilizing training and encouraging competitive, context-aware responses (Figure 2).

Figure 2: Demonstration of GRPO training with Game State.
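This summary does not reproduce the paper's exact objective, but the standard GRPO formulation, on which TiG builds, optimizes, for a group of $G$ completions $o_1, \dots, o_G$ sampled per state $s$:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\hat{A}_i,\ \mathrm{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]\Big)\right]$$

where $\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid s, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid s, o_{i,<t})}$ is the token-level importance ratio and $\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})}$ is the group-normalized advantage shared by all tokens of completion $i$.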

The reward function is rule-based and binary, assigning a reward of 1 if the predicted action matches the ground truth and 0 otherwise. This design avoids the complexity of neural reward models and ensures alignment with strategic objectives. The GRPO objective incorporates token-level KL regularization to constrain policy drift, enabling efficient and stable learning.
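As a concrete illustration, here is a minimal sketch of the binary reward and group-relative normalization described above (a sketch of the stated design, not the paper's implementation):

```python
import torch

def binary_reward(predicted_action: str, ground_truth_action: str) -> float:
    # Rule-based reward: 1 if the predicted macro action matches the
    # ground-truth action for this game state, 0 otherwise.
    return 1.0 if predicted_action.strip() == ground_truth_action.strip() else 0.0

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # GRPO normalizes rewards within the group of completions sampled
    # for the same state, so no separate value network is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions sampled for one state; 3 predicted the correct action.
preds = ["Secure Dragon", "Push Top Lane", "Defend Base", "Secure Dragon",
         "Push Mid Lane", "Secure Dragon", "Defend Base", "Push Top Lane"]
rewards = torch.tensor([binary_reward(a, "Secure Dragon") for a in preds])
print(group_relative_advantages(rewards))  # correct completions get positive advantage
```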

Experimental Results

Action Prediction and Reasoning Performance

TiG demonstrates strong performance on the action prediction task, with multi-stage training (SFT + GRPO) yielding substantial improvements across model sizes. Notably, Qwen-3-14B with SFT and extended GRPO training achieves 90.91% accuracy, surpassing DeepSeek-R1 (86.67%), which is significantly larger in parameter count (Figure 3).

Figure 3: Action Prediction Task.

This result highlights the efficiency and scalability of TiG, enabling smaller models to rival or outperform much larger ones. The introduction of GRPO consistently yields significant accuracy gains, confirming its effectiveness for boosting reasoning capabilities in LLMs.

Generalization and Benchmark Evaluation

TiG-trained models maintain or slightly improve general language and reasoning abilities on standard benchmarks (Ape210K, MMLU, C-Eval, BBH, IFEval, CharacterEval), indicating that domain-specific improvements do not compromise general capabilities.

Training Dynamics

Analysis of the RL training process shows that response length and reward trends vary by model architecture, with larger models benefiting from longer, more detailed reasoning chains (Figure 4).

Figure 4: Qwen2.5-14B training dynamics.

Error Analysis

Error cases are categorized into basic game knowledge errors, game state misinterpretation, critical event oversight, situational misjudgment, and spatio-temporal miscoordination. TiG reduces the frequency of these errors compared to baseline models, achieving results comparable to much larger models.

Case Studies

Qualitative case studies illustrate TiG's capacity for deep, context-aware reasoning and actionable guidance. The model analyzes complex game states, prioritizes objectives, formulates strategies, and integrates hero-specific playstyles, providing clear and executable instructions (Figures 5–14).

Figures 5–14: Cases of TiG. <think> </think> refers to the model's thinking process, and <result> </result> refers to the model's guidance to the main player in natural language.

Implications and Future Directions

TiG demonstrates that LLMs can acquire procedural knowledge and strategic reasoning through RL-based interaction with game environments, while retaining transparency and interpretability. The framework achieves competitive performance with reduced data and computational requirements, and produces step-by-step natural language explanations for its decisions.

Implications:

  • TiG enables the development of AI agents that can both act effectively and explain their reasoning in dynamic environments.
  • The approach is scalable and efficient, making it suitable for deployment in real-time applications.
  • The language-guided policy paradigm enhances interpretability and trust in AI decision-making.

Future Directions:

  • Extending TiG to other interactive domains (e.g., robotics, real-world tasks) to assess generalizability.
  • Incorporating multimodal feedback (visual, auditory) for richer procedural learning.
  • Investigating long-term reasoning and memory mechanisms for tasks requiring extended temporal abstraction.

Conclusion

The Think-In-Games framework provides a principled approach for bridging the gap between declarative and procedural knowledge in LLMs. By reformulating RL as a language modeling task and leveraging GRPO, TiG enables LLMs to develop strategic reasoning capabilities through direct interaction with game environments. The framework achieves strong empirical results, preserves general language abilities, and enhances interpretability, representing a significant step toward more capable and transparent AI agents in complex, interactive domains.
