Think in Games (TiG) Framework

Updated 2 September 2025
  • Think in Games (TiG) is a research framework that integrates reinforcement learning, LLMs, and interactive game environments to enable dynamic procedural reasoning.
  • It employs language-guided policy generation with chain-of-thought reasoning and group-relative policy optimization for transparent decision-making.
  • The framework demonstrates high data efficiency and interpretability, achieving competitive performance in complex, multi-agent game domains.

“Think in Games” (TiG) is a research framework and set of methodologies that integrate game environments with learning and reasoning processes to bridge the gap between static declarative knowledge and dynamic procedural competence. The approach combines reinforcement learning (RL), LLMs, and interactive game settings to enable agents—particularly modern LLMs—to reason, act, and learn in a way that is both interpretable and data-efficient. TiG reframes RL-based decision-making as a language modeling task, where models generate actions and detailed natural language explanations in response to feedback from the game environment. The framework has demonstrated strong results in enhancing procedural understanding, data efficiency, and transparency compared to conventional RL systems. TiG’s scope includes macro-level strategic planning, step-by-step reasoning, and empirical results in multi-agent, online, and competitive game domains (Liao et al., 29 Aug 2025).

1. Framework Structure and Design

TiG formalizes RL-based decision-making as a language modeling task. The core process involves:

  • Language-guided policy generation: The LLM receives structured state observations (e.g., JSON representing game status, agents’ positions, goals, and resources).
  • Macro-action prediction: The model outputs a macro-level game action (such as “Push Top Lane” or “Secure Dragon” in MOBAs).
  • Chain-of-thought reasoning: For each action, the model generates an explicit, structured reasoning trace (using special tokens such as <think> ... </think> for internal logic and <answer> ... </answer> for decision outputs).
  • Online learning via RL: Environmental feedback (binary or more granular rewards, typically generated by a rules-based system) is used to update the policy through an RL loop (Liao et al., 29 Aug 2025); a minimal sketch of this interaction loop follows the list.
  • Group-Relative Policy Optimization (GRPO): RL updates are performed using an algorithm that computes token-level policy advantages relative to batches, regularizing with respect to divergence from a reference policy.
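
The following minimal sketch illustrates one pass through this observe, reason, act, and reward loop. It is an assumption-laden illustration, not the paper's interface: the JSON observation schema, prompt wording, and the `query_llm` and `rules_based_reward` callables are placeholders.

```python
import json
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def tig_step(game_state: dict, query_llm, rules_based_reward):
    """One interaction step: observe, reason, act, and collect a scalar reward.

    game_state: structured observation (positions, goals, resources, ...).
    query_llm: callable mapping a prompt string to the model's raw text output.
    rules_based_reward: callable scoring the chosen macro-action against the
        environment (e.g. 1.0 if it matches the annotated correct action).
    """
    prompt = (
        "You are playing a MOBA match. Given the game state below, think step "
        "by step inside <think></think>, then output one macro-action inside "
        "<answer></answer>.\n" + json.dumps(game_state)
    )
    raw = query_llm(prompt)                      # full trajectory text (reasoning + answer)
    match = ANSWER_RE.search(raw)
    action = match.group(1).strip() if match else None
    reward = rules_based_reward(game_state, action)
    return raw, action, reward                   # inputs to the GRPO update below
```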

The overall objective function for GRPO is

$$
\mathcal{L}_{\mathrm{GRPO}}(\theta) = -\frac{1}{\sum_k |o_k|} \sum_{k} \sum_{t} \left\{ \min \left[ \frac{\pi_\theta(o_{k,t} \mid q, o_{k,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{k,t} \mid q, o_{k,<t})} \, \hat{A}_{k,t},\; \mathrm{clip}\!\left(\cdot,\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_{k,t} \right] - \beta\, D_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \right\}
$$

where $\hat{A}_{k,t}$ is the group-relative advantage at token $t$ in trajectory $k$, and $\mathrm{clip}(\cdot, 1-\epsilon, 1+\epsilon)$ clips the same importance ratio to the interval $[1-\epsilon, 1+\epsilon]$.
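
A minimal NumPy sketch of this objective is given below, under simplifying assumptions: outcome rewards are standardized within each sampled group to form the token-shared advantage, per-token log-probabilities are assumed precomputed, and the KL term uses a standard unbiased per-token estimator rather than an exact divergence. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Sketch of a Group-Relative Policy Optimization loss.

    logp_new, logp_old, logp_ref: lists of 1-D arrays, one per sampled
        trajectory o_k, holding per-token log-probabilities under the
        current, old (sampling), and reference policies.
    rewards: one scalar outcome reward per trajectory in the group.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-relative advantage: standardize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    total_tokens = sum(len(lp) for lp in logp_new)
    loss = 0.0
    for k, (lp_new, lp_old, lp_ref) in enumerate(zip(logp_new, logp_old, logp_ref)):
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))   # pi_theta / pi_theta_old per token
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        surrogate = np.minimum(ratio * adv[k], clipped * adv[k])
        # Unbiased per-token estimator of KL(pi_theta || pi_ref).
        log_r = np.asarray(lp_ref) - np.asarray(lp_new)
        kl = np.exp(log_r) - log_r - 1.0
        loss += np.sum(surrogate - beta * kl)
    return -loss / total_tokens

# Toy usage: one group of two 3-token trajectories, identical policies.
lp = [np.log([0.5, 0.4, 0.6]), np.log([0.3, 0.5, 0.2])]
print(grpo_loss(lp, lp, lp, rewards=[1.0, 0.0]))
```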

2. Declarative and Procedural Knowledge Integration

Traditional LLMs primarily encode declarative knowledge, which is static and text-based. This enables them to answer factual queries and perform chain-of-thought reasoning in domains such as mathematics and coding. However, when operating in dynamic, interactive settings—such as games—these models are not equipped for rapid, context-dependent procedural reasoning ("knowing how"). TiG closes this gap through:

  • Direct environmental interaction: The agent operates within a game, adjusting its policy based on feedback rather than relying solely on text pre-training.
  • Chain-of-thought relabeling: The framework uses a relabeling technique to densify action annotations over game state sequences, enabling efficient learning even when labeled data is sparse or irregular (an illustrative densification sketch follows this list).
  • Supervised and RL hybrid training: The learning pipeline integrates both supervised fine-tuning (SFT) for reasoning trace generation and RL for policy exploration and optimization (Liao et al., 29 Aug 2025).
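
As one illustration of the densification idea (not the paper's exact procedure), a minimal sketch that fills unlabeled game states by propagating the most recent macro-action annotation forward might look like this:

```python
from typing import Optional, Sequence

def densify_labels(labels: Sequence[Optional[str]]) -> list:
    """Fill unlabeled frames by carrying the most recent macro-action forward.

    labels: one entry per game-state frame; None where no annotation exists.
    Frames before the first annotation remain None.
    """
    densified, current = [], None
    for label in labels:
        if label is not None:
            current = label          # a new macro-action annotation starts here
        densified.append(current)    # propagate the active macro-action
    return densified

# Example: sparse annotations over six frames.
print(densify_labels([None, "Push Top Lane", None, None, "Secure Dragon", None]))
# -> [None, 'Push Top Lane', 'Push Top Lane', 'Push Top Lane', 'Secure Dragon', 'Secure Dragon']
```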

3. Policy Representation and Transparency

A central feature of TiG is the explicit representation of strategy and decision-making:

  • Structured output format: Macro-action (decision) and chain-of-thought (rationale) are tightly coupled, with outputs conforming to a standardized template (a parsing sketch of such a template follows this list).
  • Transparency and interpretability: Each action is supported by a line-by-line reasoning trace, facilitating post-hoc analysis, error attribution, and user trust.
  • Token-level adaptation: The RL update mechanism optimizes not just for action selection, but also for the coherence and informativeness of the reasoning trace, with regularization to prevent drift from reference policies.
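
As a simple illustration of how such a template can be enforced downstream, the sketch below parses and validates an output. The tag names follow the <think>/<answer> convention mentioned earlier; the validation logic and the macro-action set are assumptions for the example, not taken from the paper.

```python
import re

TEMPLATE = re.compile(
    r"<think>(?P<rationale>.*?)</think>\s*<answer>(?P<action>.*?)</answer>",
    re.DOTALL,
)

VALID_MACRO_ACTIONS = {"Push Top Lane", "Secure Dragon", "Defend Base"}  # illustrative set

def parse_output(text: str):
    """Extract the reasoning trace and macro-action from a structured model output."""
    match = TEMPLATE.search(text)
    if match is None:
        return None  # output broke the template; e.g. assign zero format reward
    action = match.group("action").strip()
    if action not in VALID_MACRO_ACTIONS:
        return None  # decision falls outside the allowed macro-action space
    return {"rationale": match.group("rationale").strip(), "action": action}

example = ("<think>Mid lane is pushed; the enemy jungler was last seen bottom.</think>"
           "<answer>Secure Dragon</answer>")
print(parse_output(example))
```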

This contrasts with traditional RL agents, which typically operate as black boxes, offering little insight into their internal logic or knowledge transfer mechanisms.

4. Empirical Results and Data Efficiency

The TiG framework demonstrates competitive performance in complex, multi-agent games with a pronounced reduction in training time and computational demands:

  • Accuracy: For macro-action prediction in Honor of Kings (HoK), TiG-trained LLMs with GRPO reach up to 90.91% accuracy, outperforming larger, non-TiG models and baseline RL approaches (Liao et al., 29 Aug 2025).
  • Efficiency: TiG models achieve high performance with significantly fewer training steps and lower computational resource requirements than deep RL agents.
  • Model scalability: Small and mid-sized models (e.g., Qwen-3-14B) with SFT and GRPO post-training match or surpass much larger counterparts, demonstrating effective procedural competence transfer.

5. Explanatory Reasoning and Error Analysis

By generating natural language explanations for every decision, TiG enables:

  • Interpretability in interactive tasks: Human players and analysts can follow the agent’s reasoning, identify strengths and deficiencies, and iterate on game strategies based on transparent model output.
  • Robustness in high-stakes settings: Stepwise rationales facilitate error tracing, enabling rapid correction of faulty strategies or misinterpretations of game states.
  • Enhanced oversight for human–AI collaboration: The clear reasoning trace supports mixed-initiative systems, where human actors can audit, veto, or amend AI-powered strategies.

6. Implications and Application Domains

TiG’s framework is applicable to a variety of domains requiring procedural reasoning grounded in environmental feedback:

  • Game AI and player advisory systems: Real-time macro-action recommendations, transparent strategic planning, and accessible reasoning traces.
  • Simulation-based training: Adaptive agents for robotics, real-time strategy, and multirole simulations.
  • Educational environments: Interactive settings where agents teach and demonstrate strategy based on explainable decisions.
  • Research into knowledge representation: Bridging declarative and procedural knowledge in interactive agents, with implications for agent communication, explainable AI, and trust calibration.

A plausible implication is that language-based RL—when coupled with chain-of-thought regularization and explicit policy templates—can serve as a new paradigm, yielding highly efficient, interpretable, and generalizable strategic agents across both digital and physical domains.

7. Comparative Performance and Limitations

Compared to conventional RL agents:

System                 Data/Compute Requirement   Interpretability             Performance
Traditional RL Agent   High                       Black-box                    Competitive
TiG (LLM + RL)         Low                        Stepwise, natural language   Competitive/Superior

Limitations include the current scope of macro-action templates and the reliance on structured feedback; future research may extend TiG to finer-grained control, richer feedback functions, and more complex strategic domains.

References

TiG: Learning to Reason in Games via Reinforcement Learning with LLMs (Liao et al., 29 Aug 2025)
