Dyna-Think Dyna Training (DDT)
Dyna-Think Dyna Training (DDT) is a Dyna-style, two-stage training paradigm that integrates world model simulation, reasoning, and acting within AI agents, exemplified here by LLM-based agents engaging in long-horizon, open-environment tasks such as computer use and web navigation. Developed within the Simple-Dyna framework, DDT systematically enhances both the agent's internal world-modeling capacity and its decision-making policy, yielding improved performance and efficiency across a variety of complex agentic tasks (Yu et al., 31 May 2025).
1. Theoretical Foundations: Dyna, Imitation, and World Model Integration
Classic Dyna-style reinforcement learning methods (e.g., Dyna-Q) interleave real experience with environment simulation: the agent collects data, learns a model of the environment, and uses this learned model to generate synthetic trajectories that improve policy learning. However, in high-dimensional or combinatorially large state-action spaces, learning an explicit model and planning with a separate module become impractical. DDT extends the Dyna paradigm by leveraging the unified capacity of LLMs to represent, through language, not only policies but also the environment's transition dynamics and internal model simulations.
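For orientation, the sketch below shows classic tabular Dyna-Q, the algorithm family that DDT generalizes; the `env.reset()`/`env.step()`/`env.n_actions` interface is an assumption for illustration, not part of the paper.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=200, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q: interleave real experience with model-simulated updates."""
    Q = defaultdict(float)      # Q[(state, action)] -> estimated return
    model = {}                  # learned world model: (state, action) -> (reward, next_state)
    actions = list(range(env.n_actions))

    def best(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else best(s)
            s2, r, done = env.step(a)                      # real experience
            target = r + (0.0 if done else gamma * Q[(s2, best(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])      # direct RL update
            model[(s, a)] = (r, s2)                        # learn/refresh the world model
            for _ in range(n_planning):                    # planning on simulated transitions
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (pr + gamma * Q[(ps2, best(ps2))] - Q[(ps, pa)])
            s = s2
    return Q
```

In DDT, the tabular model and the separate planner are replaced by the LLM itself, which simulates transitions in language.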
To address the complexity of agentic reasoning, the Dyna-Think framework combines:
- World model simulation: The agent generates predictions about future environment states, whether explicit (e.g., next-state prediction), difference-based (state-delta), or metacognitive (critique generation).
- Reasoning and acting: The agent decomposes goals, verifies planned actions, and chooses actions based on a comprehensive interpretation of context and imagined consequences.
- Imitation learning initialization (Dyna-Think Imitation Learning, DIT): Supervised construction of agent trajectories, ensuring the world model simulation is strictly relevant to the upcoming action; a minimal data-construction sketch follows this list.
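To make the DIT-style data construction concrete, here is a minimal sketch; the field names, the `<think>` formatting, and the `compress_simulation` helper are illustrative assumptions rather than the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    observation: str   # e.g., an accessibility-tree or screenshot description
    thinking: str      # long-form reasoning from a strong teacher model
    action: str        # executable action, e.g., a pyautogui command

def build_dit_example(step: AgentStep, compress_simulation) -> dict:
    """Build one imitation target whose world-model simulation is strictly
    relevant to the action that follows it."""
    # compress_simulation is a hypothetical helper (e.g., an LLM rewrite call)
    # that keeps only the simulated consequences of step.action.
    focused_thinking = compress_simulation(step.thinking, step.action)
    return {
        "prompt": step.observation,
        # Only the target text contributes to the imitation loss.
        "target": f"<think>{focused_thinking}</think>\n{step.action}",
    }
```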
2. Dyna-Think Dyna Training: Two-Stage Training Procedure
DDT executes in two coordinated stages designed to alternately enhance world modeling and policy capabilities within a single LLM agent.
Stage 1: World Model Training
The primary objective is to improve the agent’s ability to simulate the environment’s response to potential actions. Several training functions are studied:
- Next-state prediction: Learning the mapping $(o_t, a_t) \mapsto o_{t+1}$, i.e., predicting the next observation given the current observation and action.
- State-difference prediction: Learning $(o_t, a_t) \mapsto \Delta_{t+1} = \operatorname{diff}(o_t, o_{t+1})$, i.e., focusing on the salient differences resulting from the action.
- Critique generation: Producing an introspective evaluation or "critique" (e.g., via LLM assessment) comparing the agent's simulated next state $\hat{o}_{t+1}$ with the actual observed outcome $o_{t+1}$. This is formalized as

$$c_t = \mathcal{J}\left(\hat{o}_{t+1}, o_{t+1}\right), \qquad \mathcal{L}_{\text{critique}}(\theta) = -\log p_\theta\left(c_t \mid o_t, a_t, \hat{o}_{t+1}, o_{t+1}\right),$$

where the agent is trained on the critique tokens $c_t$ generated by an external judge $\mathcal{J}$ (such as GPT-4o).
Only the chosen world model components (e.g., critique tokens) are unmasked during training.
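A minimal sketch of this token-level masking, assuming a standard causal-LM trainer with PyTorch-style cross-entropy and an ignore index; how the world-model (e.g., critique) spans are delimited in the sequence is an assumption here.

```python
import torch

IGNORE_INDEX = -100  # conventional ignore index for language-model cross-entropy

def mask_labels(input_ids: torch.Tensor, wm_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Return labels in which only world-model tokens (e.g., critique tokens) are unmasked.

    input_ids: 1-D tensor of token ids for the full sequence
               (observation, action, simulated state, real state, critique, ...).
    wm_spans:  (start, end) index pairs covering the tokens to train on.
    """
    labels = torch.full_like(input_ids, IGNORE_INDEX)
    for start, end in wm_spans:
        labels[start:end] = input_ids[start:end]   # unmask only these positions
    return labels

# Usage: cross-entropy with ignore_index=IGNORE_INDEX then drops every position
# except the world-model tokens, so gradients flow only through the chosen objective.
```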
Stage 2: Policy Training
Once world model capabilities are strengthened, DDT proceeds to improve the agent's decision-making policy. Training data consists of successful behavior trajectories:

$$\mathcal{D}_{\pi} = \left\{\, \tau = (o_0, a_0, o_1, a_1, \ldots, o_T) \;\middle|\; \tau \text{ completes the task successfully} \,\right\},$$

with optimization performed via supervised learning (rejection sampling on successful episodes) or reinforcement learning objectives. This stage utilizes the enhanced internal model to facilitate richer and better-informed decision-making by the agent.
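A minimal sketch of the rejection-sampling variant of this stage; the `rollout` and `is_success` callables stand in for the environment interaction and task-completion check and are assumptions for illustration.

```python
def collect_policy_data(tasks, rollout, is_success, n_samples=8):
    """Rejection sampling: keep only successful trajectories as supervised policy data.

    rollout(task)    -> list of (observation, thinking, action) steps   # assumed interface
    is_success(traj) -> bool, e.g., the benchmark's task-completion checker
    """
    dataset = []
    for task in tasks:
        for _ in range(n_samples):
            traj = rollout(task)
            if is_success(traj):
                # Each step becomes one example: predict thinking + action from the
                # observation (prior context omitted here for brevity).
                dataset.extend(
                    {"prompt": obs, "target": f"{think}\n{act}"}
                    for obs, think, act in traj
                )
    return dataset
```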
Iterative Enhancement
After these stages, DDT cycles further batches of real-environment experience through the same process, facilitating iterative improvement and continual bootstrapping of both agent simulation and policy.
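Putting the two stages together, the outer loop can be sketched as below; the callables are placeholders for the stage-specific procedures described above, not functions defined by the paper.

```python
def ddt_loop(agent, tasks, rollout, build_wm_data, train_wm, train_policy, n_iter=3):
    """Outer DDT loop: alternate world-model and policy training on fresh experience.

    rollout(agent, task)   -> trajectory exposing a .success flag        # assumed interface
    build_wm_data(trajs)   -> critique / next-state supervision examples
    train_wm, train_policy -> return an updated agent
    """
    for _ in range(n_iter):
        trajs = [rollout(agent, task) for task in tasks]   # real-environment experience
        agent = train_wm(agent, build_wm_data(trajs))      # Stage 1: world model training
        successes = [t for t in trajs if t.success]        # rejection sampling
        agent = train_policy(agent, successes)             # Stage 2: policy training
    return agent
```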
3. Unified Internalization of World Modeling and Policy
Unique to DDT, both the world model and the policy reside within the same parameter space: the agent itself performs simulation, reflection, and action selection via language (a minimal inference sketch follows the list below). This unification allows:
- Direct interplay between model-based prediction and policy—simulation results immediately inform action choices.
- Data efficiency—synthetic, critique-augmented world model objectives increase learning yield from limited real experience.
- Simplified inference—no coordination overhead between separate planning and acting modules.
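The sketch below illustrates this unified inference pattern: one generation carries both the simulated consequences and the chosen action, so no separate planner is invoked. The prompt wording, `<think>` delimiters, and `agent_generate` callable are assumptions for illustration.

```python
import re

def think_and_act(agent_generate, observation):
    """Single forward pass: the same model simulates, reflects, and acts in one response."""
    response = agent_generate(
        f"Observation:\n{observation}\n"
        "Reason step by step, simulate the effect of your proposed action, then output the action."
    )
    # Illustrative format: reasoning (with embedded world-model simulation) inside
    # <think> tags, followed by the executable action.
    match = re.search(r"<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
    thinking, action = (match.group(1), match.group(2)) if match else ("", response)
    return thinking, action
```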
Mathematically, the joint loss can be expressed as

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{WM}}(\theta) + \mathcal{L}_{\pi}(\theta),$$

where the loss terms correspond to world model objectives (cross-entropy or negative log-likelihood over next-state/critique tokens) and to policy action tokens, respectively.
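As a rough illustration only (the equal weighting and the use of two label masks over one batch are assumptions; in DDT the two terms are optimized in separate stages), the combined loss can be computed as masked cross-entropies:

```python
import torch.nn.functional as F

def joint_loss(logits, wm_labels, policy_labels, lam=1.0):
    """Sum of world-model and policy losses; each label tensor masks out the other
    component's tokens with ignore_index=-100. Labels are assumed already shifted."""
    vocab = logits.size(-1)
    l_wm = F.cross_entropy(logits.view(-1, vocab), wm_labels.view(-1), ignore_index=-100)
    l_pi = F.cross_entropy(logits.view(-1, vocab), policy_labels.view(-1), ignore_index=-100)
    return l_wm + lam * l_pi
```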
4. Empirical Results: World Model Accuracy and Policy Efficacy
DDT has been evaluated on OSWorld, a suite of computer-use tasks requiring long-horizon reasoning and acting. Key empirical findings include:
- Critique-based world model training results in the highest policy success rates among all world model objectives tested, surpassing both next-state and difference prediction as well as RL-only baselines.
- Direct correlation between world model accuracy and policy success: Higher LLM-judged accuracy on world modeling tasks predicts greater agent task performance (Pearson correlation up to 0.45).
- Efficiency: Agents trained by DDT require approximately half as many generated tokens as baseline agents (e.g., R1) to achieve comparable best-of-n solution rates.
- Scaling: Increasing world model data continues to improve the policy’s robustness and generalization, demonstrating scalability of the DDT approach.
- Ablation studies: Removal of either world model simulation within the agent or critique-style objective from the training pipeline leads to significant performance drops, underscoring the necessity of both simulation and introspective evaluation.
| Method | World Model Training | Policy Data | BoN Success (All) | Avg. Tokens |
|---|---|---|---|---|
| RFT (policy) | None | 35 traj | 38.5 | 1.0–2.7x |
| Vanilla Dyna | Next-state (separate WM) | 94 (π), 116 (WM) | 35.6 | 1.1–2.5x |
| DDT (Critique) | Critique (unified) | 35 (π), 116 (WM) | 44.3 | 1.2–2.7x |
5. Impact on Reasoning, Planning, and Actuation in LLM Agents
The integrative approach of DDT leads to several key agent-level advances:
- Enhanced reasoning: The simulation component of the thinking process augments the agent’s capacity for complex goal decomposition, verification, and anticipation.
- Improved planning: Explicit world model simulation (and subsequent critique) provides actionable lookahead and error correction, increasing trajectory success rates in both in-domain and out-of-domain tasks.
- Efficient action: Fusion of reasoning, simulation, and policy enables the agent to act with fewer redundant computation steps, reducing resource consumption per decision.
By internalizing a world model, DDT-trained agents move beyond pure imitation and direct feedback learning, instead developing the capacity to “think ahead,” reflect on simulated consequences, and make more robust decisions under uncertainty and novel scenarios.
6. Significance and Future Directions
DDT demonstrates that language-model-based agents can benefit substantially from explicit, internally trained world model simulation, especially when such simulation is structured as critique generation. This not only yields data-efficient policy improvement but also makes the agent's reasoning process more interpretable and modular.
The findings suggest several natural avenues for further investigation:
- Scaling critique-style world model pretraining: To further enhance generalization and out-of-distribution robustness.
- Extending to more open-ended agentic tasks: Where environment granularity, compositionality, and non-stationarity challenge classic Dyna approaches.
- Leveraging language-based simulation for richer planning: Such as integrating Monte Carlo search or multi-agent interaction within agent “thinking.”
7. Summary Table: DDT within Simple-Dyna Framework
| Component | Function | Effect |
|---|---|---|
| DIT | Imitation-based policy and world model | Initialization (policy, WM) |
| DDT | Two-stage (WM then policy training) | Enhanced WM, better acting |
| Critique WM obj | LLM-based simulation critique generation | Highest policy performance |
DDT thus constitutes an empirically validated, theoretically grounded methodology for synergizing reasoning, acting, and world model simulation in LLM-based agents, advancing unified Dyna-style learning in language-centric, high-dimensional domains.