Dyna-Think Dyna Training (DDT)
Dyna-Think Dyna Training (DDT) is a Dyna-style, two-stage training paradigm that integrates world model simulation, reasoning, and acting within AI agents, exemplified here by LLM-based agents engaging in long-horizon, open-environment tasks such as computer use and web navigation. Developed within the Simple-Dyna framework, DDT systematically enhances both the agent's internal world-modeling capacity and its decision-making policy, yielding improved performance and efficiency across a variety of complex agentic tasks (Yu et al., 31 May 2025).
1. Theoretical Foundations: Dyna, Imitation, and World Model Integration
Classic Dyna-style reinforcement learning methods (e.g., Dyna-Q) interleave real experience with environment simulation: the agent collects data, learns a model of the environment, and uses this learned model to generate synthetic trajectories that improve policy learning. However, in high-dimensional or combinatorially large state-action spaces, learning an explicit model and planning with a separate module become impractical. DDT extends the Dyna paradigm by leveraging the unified capacity of LLMs to represent, through language, not only policies but also the environment's transition dynamics and internal model simulations.
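For orientation, the sketch below shows classic tabular Dyna-Q, the algorithm family that DDT generalizes; the `env.reset()`/`env.step()`/`env.n_actions` interface is an assumption for illustration, not part of the paper.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=200, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q: interleave real experience with model-simulated updates."""
    Q = defaultdict(float)      # Q[(state, action)] -> estimated return
    model = {}                  # learned world model: (state, action) -> (reward, next_state)
    actions = list(range(env.n_actions))

    def best(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else best(s)
            s2, r, done = env.step(a)                      # real experience
            target = r + (0.0 if done else gamma * Q[(s2, best(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])      # direct RL update
            model[(s, a)] = (r, s2)                        # learn/refresh the world model
            for _ in range(n_planning):                    # planning on simulated transitions
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (pr + gamma * Q[(ps2, best(ps2))] - Q[(ps, pa)])
            s = s2
    return Q
```

In DDT, the tabular model and the separate planner are replaced by the LLM itself, which simulates transitions in language.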
To address the complexity of agentic reasoning, the Dyna-Think framework combines:
- World model simulation: The agent generates predictions about future environment states, whether explicit (e.g., next-state prediction), difference-based (state-delta), or metacognitive (critique generation).
- Reasoning and acting: The agent decomposes goals, verifies planned actions, and chooses actions based on a comprehensive interpretation of context and imagined consequences.
- Imitation learning initialization (Dyna-Think Imitation Learning, DIT): Supervised construction of agent trajectories, ensuring the world model simulation is strictly relevant to the upcoming action; a minimal data-construction sketch follows this list.
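To make the DIT-style data construction concrete, here is a minimal sketch; the field names, the `<think>` formatting, and the `compress_simulation` helper are illustrative assumptions rather than the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    observation: str   # e.g., an accessibility-tree or screenshot description
    thinking: str      # long-form reasoning from a strong teacher model
    action: str        # executable action, e.g., a pyautogui command

def build_dit_example(step: AgentStep, compress_simulation) -> dict:
    """Build one imitation target whose world-model simulation is strictly
    relevant to the action that follows it."""
    # compress_simulation is a hypothetical helper (e.g., an LLM rewrite call)
    # that keeps only the simulated consequences of step.action.
    focused_thinking = compress_simulation(step.thinking, step.action)
    return {
        "prompt": step.observation,
        # Only the target text contributes to the imitation loss.
        "target": f"<think>{focused_thinking}</think>\n{step.action}",
    }
```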
2. Dyna-Think Dyna Training: Two-Stage Training Procedure
DDT executes in two coordinated stages designed to alternately enhance world modeling and policy capabilities within a single LLM agent.
Stage 1: World Model Training
The primary objective is to improve the agent’s ability to simulate the environment’s response to potential actions. Several training functions are studied:
- Next-state prediction: Learning the mapping $(o_t, a_t) \mapsto o_{t+1}$, i.e., predicting the next observation given the current observation and action.
- State-difference prediction: Learning $(o_t, a_t) \mapsto \Delta_{t+1} = \operatorname{diff}(o_t, o_{t+1})$, i.e., focusing on the salient differences resulting from the action.
- Critique generation: Producing an introspective evaluation or "critique" (e.g., via LLM assessment) comparing the agent's simulated next state $\hat{o}_{t+1}$ with the actual observed outcome $o_{t+1}$. This is formalized as

$$c_t = \mathcal{J}\left(\hat{o}_{t+1}, o_{t+1}\right), \qquad \mathcal{L}_{\text{critique}}(\theta) = -\log p_\theta\left(c_t \mid o_t, a_t, \hat{o}_{t+1}, o_{t+1}\right),$$

where the agent is trained on the critique tokens $c_t$ generated by an external judge $\mathcal{J}$ (such as GPT-4o).
Only the chosen world model components (e.g., critique tokens) are unmasked during training.
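A minimal sketch of this token-level masking, assuming a standard causal-LM trainer with PyTorch-style cross-entropy and an ignore index; how the world-model (e.g., critique) spans are delimited in the sequence is an assumption here.

```python
import torch

IGNORE_INDEX = -100  # conventional ignore index for language-model cross-entropy

def mask_labels(input_ids: torch.Tensor, wm_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Return labels in which only world-model tokens (e.g., critique tokens) are unmasked.

    input_ids: 1-D tensor of token ids for the full sequence
               (observation, action, simulated state, real state, critique, ...).
    wm_spans:  (start, end) index pairs covering the tokens to train on.
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in wm_spans:
        labels[start:end] = input_ids[start:end]   # unmask only these positions
    return labels

# Usage: cross-entropy with ignore_index=IGNORE_INDEX then drops every position
# except the world-model tokens, so gradients flow only through the chosen objective.
```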
Stage 2: Policy Training
Once world model capabilities are strengthened, DDT proceeds to improve the agent's decision-making policy. Training data consists of successful behavior trajectories:

$$\mathcal{D}_{\pi} = \left\{\, \tau = (o_0, a_0, o_1, a_1, \ldots, o_T) \;\middle|\; \tau \text{ completes the task successfully} \,\right\},$$

with optimization performed via supervised learning (rejection sampling on successful episodes) or reinforcement learning objectives. This stage utilizes the enhanced internal model to facilitate richer and better-informed decision-making by the agent.
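A minimal sketch of the rejection-sampling variant of this stage; the `rollout` and `is_success` callables stand in for the environment interaction and task-completion check and are assumptions for illustration.

```python
def collect_policy_data(tasks, rollout, is_success, n_samples=8):
    """Rejection sampling: keep only successful trajectories as supervised policy data.

    rollout(task)    -> list of (observation, thinking, action) steps   # assumed interface
    is_success(traj) -> bool, e.g., the benchmark's task-completion checker
    """
    dataset = []
    for task in tasks:
        for _ in range(n_samples):
            traj = rollout(task)
            if is_success(traj):
                # Each step becomes one example: predict thinking + action from the
                # observation (prior context omitted here for brevity).
                dataset.extend(
                    {"prompt": obs, "target": f"{think}\n{act}"}
                    for obs, think, act in traj
                )
    return dataset
```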
Iterative Enhancement
After these stages, DDT cycles further batches of real-environment experience through the same process, facilitating iterative improvement and continual bootstrapping of both agent simulation and policy.
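Putting the two stages together, the outer loop can be sketched as below; the callables are placeholders for the stage-specific procedures described above, not functions defined by the paper.

```python
def ddt_loop(agent, tasks, rollout, build_wm_data, train_wm, train_policy, n_iter=3):
    """Outer DDT loop: alternate world-model and policy training on fresh experience.

    rollout(agent, task)   -> trajectory exposing a .success flag        # assumed interface
    build_wm_data(trajs)   -> critique / next-state supervision examples
    train_wm, train_policy -> return an updated agent
    """
    for _ in range(n_iter):
        trajs = [rollout(agent, task) for task in tasks]   # real-environment experience
        agent = train_wm(agent, build_wm_data(trajs))      # Stage 1: world model training
        successes = [t for t in trajs if t.success]        # rejection sampling
        agent = train_policy(agent, successes)             # Stage 2: policy training
    return agent
```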
3. Unified Internalization of World Modeling and Policy
Unique to DDT, both the world model and the policy reside within the same parameter space: the agent itself performs simulation, reflection, and action selection via language (a minimal inference sketch follows the list below). This unification allows:
- Direct interplay between model-based prediction and policy—simulation results immediately inform action choices.
- Data efficiency—synthetic, critique-augmented world model objectives increase learning yield from limited real experience.
- Simplified inference—no coordination overhead between separate planning and acting modules.
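The sketch below illustrates this unified inference pattern: one generation carries both the simulated consequences and the chosen action, so no separate planner is invoked. The prompt wording, `<think>` delimiters, and `agent_generate` callable are assumptions for illustration.

```python
import re

def think_and_act(agent_generate, observation):
    """Single forward pass: the same model simulates, reflects, and acts in one response."""
    response = agent_generate(
        f"Observation:\n{observation}\n"
        "Reason step by step, simulate the effect of your proposed action, then output the action."
    )
    # Illustrative format: reasoning (with embedded world-model simulation) inside
    # <think> tags, followed by the executable action.
    match = re.search(r"<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
    thinking, action = (match.group(1), match.group(2)) if match else ("", response)
    return thinking, action
```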
Mathematically, the joint loss can be expressed as

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{WM}}(\theta) + \mathcal{L}_{\pi}(\theta),$$

where the loss terms correspond to world model objectives (cross-entropy or negative log-likelihood over next-state/critique tokens) and to policy action tokens, respectively.
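As a rough illustration only (the equal weighting and the use of two label masks over one batch are assumptions; in DDT the two terms are optimized in separate stages), the combined loss can be computed as masked cross-entropies:

```python
import torch.nn.functional as F

def joint_loss(logits, wm_labels, policy_labels, lam=1.0):
    """Sum of world-model and policy losses; each label tensor masks out the other
    component's tokens with ignore_index=-100. Labels are assumed already shifted."""
    vocab = logits.size(-1)
    l_wm = F.cross_entropy(logits.view(-1, vocab), wm_labels.view(-1), ignore_index=-100)
    l_pi = F.cross_entropy(logits.view(-1, vocab), policy_labels.view(-1), ignore_index=-100)
    return l_wm + lam * l_pi
```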
4. Empirical Results: World Model Accuracy and Policy Efficacy
DDT has been evaluated on OSWorld, a suite of computer-use tasks requiring long-horizon reasoning and acting. Key empirical findings include:
- Critique-based world model training results in the highest policy success rates among all world model objectives tested, surpassing both next-state and difference prediction as well as RL-only baselines.
- Direct correlation between world model accuracy and policy success: Higher LLM-judged accuracy on world modeling tasks predicts greater agent task performance (Pearson correlation up to 0.45).
- Efficiency: Agents trained by DDT require approximately half as many generated tokens as baseline agents (e.g., R1) to achieve comparable best-of-n solution rates.
- Scaling: Increasing world model data continues to improve the policy’s robustness and generalization, demonstrating scalability of the DDT approach.
- Ablation studies: Removal of either world model simulation within the agent or critique-style objective from the training pipeline leads to significant performance drops, underscoring the necessity of both simulation and introspective evaluation.
| Method | World Model Training | Policy Data | BoN Success (All) | Avg. Tokens |
|---|---|---|---|---|
| RFT (policy) | None | 35 traj | 38.5 | 1.0–2.7x |
| Vanilla Dyna | Next-state (separate WM) | 94 (π), 116 (WM) | 35.6 | 1.1–2.5x |
| DDT (Critique) | Critique (unified) | 35 (π), 116 (WM) | 44.3 | 1.2–2.7x |
5. Impact on Reasoning, Planning, and Actuation in LLM Agents
The integrative approach of DDT leads to several key agent-level advances:
- Enhanced reasoning: The simulation component of the thinking process augments the agent’s capacity for complex goal decomposition, verification, and anticipation.
- Improved planning: Explicit world model simulation (and subsequent critique) provides actionable lookahead and error correction, increasing trajectory success rates in both in-domain and out-of-domain tasks.
- Efficient action: Fusion of reasoning, simulation, and policy enables the agent to act with fewer redundant computation steps, reducing resource consumption per decision.
By internalizing a world model, DDT-trained agents move beyond pure imitation and direct feedback learning, instead developing the capacity to “think ahead,” reflect on simulated consequences, and make more robust decisions under uncertainty and novel scenarios.
6. Significance and Future Directions
DDT demonstrates that language-model-based agents can benefit substantially from explicit, internally trained world model simulation, especially when such simulation is structured as critique generation. This not only yields data-efficient policy improvement but also makes the agent's reasoning process more interpretable and modular.
The findings suggest several natural avenues for further investigation:
- Scaling critique-style world model pretraining: To further enhance generalization and out-of-distribution robustness.
- Extending to more open-ended agentic tasks: Where environment granularity, compositionality, and non-stationarity challenge classic Dyna approaches.
- Leveraging language-based simulation for richer planning: Such as integrating Monte Carlo search or multi-agent interaction within agent “thinking.”
7. Summary Table: DDT within Simple-Dyna Framework
| Component | Function | Effect |
|---|---|---|
| DIT | Imitation-based policy and world model | Initialization (policy, WM) |
| DDT | Two-stage (WM then policy training) | Enhanced WM, better acting |
| Critique WM obj | LLM-based simulation critique generation | Highest policy performance |
DDT thus constitutes an empirically validated, theoretically grounded methodology for synergizing reasoning, acting, and world model simulation in LLM-based agents, advancing unified Dyna-style learning in language-centric, high-dimensional domains.