MC-DML: Memory-Guided Monte Carlo Planning

Updated 29 October 2025
  • The paper introduces MC-DML, which fuses Monte Carlo Tree Search with LLMs and dynamic memory to enhance exploration and efficiency in large decision spaces.
  • MC-DML employs in-trial and cross-trial memory mechanisms to inform action selection, mitigating the computational costs of extensive Monte Carlo sampling.
  • Empirical results on text-based games reveal that MC-DML outperforms standard MCTS and RL approaches in terms of cumulative rewards and task success rates.

Monte Carlo Planning with Dynamic Memory-guided LLMs (MC-DML) is an approach that fuses the robust search capabilities of Monte Carlo Tree Search (MCTS) with the language understanding and reasoning strengths of LLMs, further augmented by dynamic memory mechanisms. By integrating in-trial and cross-trial memory into the planning process, MC-DML is designed to efficiently explore large and complex decision spaces while mitigating the computational overhead typically associated with large Monte Carlo sample sizes.

1. Background and Motivation

Conventional planning methods, whether based on reinforcement learning (RL) or search algorithms, often struggle in scenarios with extensive combinatorial complexity and high uncertainty. In language-based decision tasks such as text-based games, the stochasticity of LLM outputs and the breadth of admissible actions require exploration over many possible action sequences, and the large number of Monte Carlo samples needed for reliable value estimates carries a substantial computational cost. MC-DML addresses these challenges by integrating Monte Carlo planning with dynamic memory mechanisms that steer exploration, yielding more accurate action-value estimates and more efficient exploration in interactive, language-based environments.

2. Integration of MCTS and LLMs

MC-DML modifies the standard MCTS framework by incorporating an LLM as the policy prior and embedding additional memory inputs into the decision process. In MCTS with a learned policy prior, action selection is driven by the PUCT (Predictor + Upper Confidence bounds applied to Trees) rule:

a* = argmax_{a ∈ A(s)} [ Q(s, a) + c_puct · π(a | s) · √N(s) / (1 + N(s, a)) ]

In MC-DML, the LLM replaces the typical neural policy to provide a language-grounded probability distribution over actions. The action selection then becomes

a* = argmax_{a ∈ A(s)} [ Q(s, a) + c_puct · LLM(a | 𝓜ᵢ, 𝓜𝒸, p) · √N(s) / (1 + N(s, a)) ]

where 𝓜ᵢ and 𝓜𝒸 denote in-trial and cross-trial memories respectively, and p represents the prompting context. This integration allows the agent to combine immediate game context and past experience with the LLM’s strong language understanding.
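A minimal sketch of this selection rule is shown below, assuming the LLM prior is exposed as a function that returns a probability distribution over the admissible actions; the function name llm_action_prior and the node fields are illustrative assumptions, not taken from the paper.

```python
import math

def llm_guided_puct(node, llm_action_prior, c_puct=1.0):
    """Select the action maximizing Q(s,a) + c_puct * prior * sqrt(N(s)) / (1 + N(s,a))."""
    # Hypothetical LLM interface: returns {action: probability} conditioned on
    # in-trial memory, cross-trial memory, and the prompting context p.
    priors = llm_action_prior(node.in_trial_memory, node.cross_trial_memory, node.prompt)

    best_action, best_score = None, float("-inf")
    for action in node.valid_actions:
        q = node.q_values.get(action, 0.0)        # Q(s, a)
        n_sa = node.visit_counts.get(action, 0)   # N(s, a)
        prior = priors.get(action, 1e-6)          # LLM(a | M_i, M_c, p)
        score = q + c_puct * prior * math.sqrt(node.total_visits) / (1 + n_sa)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```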

3. Dynamic Memory Mechanisms

A central innovation of MC-DML is its use of dynamic memory to guide planning and learning:

  1. In-Trial Memory (𝓜ᵢ): This short-term memory maintains the recent trajectory of observations and actions. It functions as a working memory that grounds the LLM’s current decision by providing context from the ongoing simulation.
  2. Cross-Trial Memory (𝓜𝒸): This episodic memory aggregates “reflections” from previous failed simulations. When a simulation reaches a terminal failure state, the LLM generates a brief critique or suggestion, which is stored and later incorporated into the LLM’s decision-making at similar states. This mechanism resembles experience replay and lets the agent learn from past mistakes, discouraging actions that have repeatedly led to failure.

By embedding both types of memory into the MCTS procedure, MC-DML achieves a form of adaptive planning where the evaluation of each action is continuously informed by both immediate context and accumulated experience.
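As a concrete illustration, the two memories could be represented as small containers along the following lines; the class and method names (InTrialMemory, CrossTrialMemory, add_reflection, retrieve) are hypothetical and chosen for exposition, not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class InTrialMemory:
    """Short-term working memory: the recent (observation, action) trajectory."""
    max_steps: int = 10
    trajectory: deque = field(default_factory=deque)

    def add(self, observation: str, action: str) -> None:
        self.trajectory.append((observation, action))
        while len(self.trajectory) > self.max_steps:
            self.trajectory.popleft()   # keep only the most recent steps

@dataclass
class CrossTrialMemory:
    """Episodic memory: LLM-generated reflections keyed by the state they apply to."""
    reflections: dict = field(default_factory=dict)

    def add_reflection(self, state_key: str, reflection: str) -> None:
        self.reflections.setdefault(state_key, []).append(reflection)

    def retrieve(self, state_key: str) -> list:
        # Retrieved reflections are appended to the LLM prompt at similar states.
        return self.reflections.get(state_key, [])
```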

4. Mathematical Formulation and Algorithmic Framework

MC-DML builds on the conventional Monte Carlo Tree Search framework with notable modifications:

  • Value Estimation:

For a node representing state s and action a, the estimated return Q(s, a) is updated by the usual incremental-mean backpropagation scheme, where R denotes the return of the completed simulation:

Q(s, a) ← Q(s, a) + (R - Q(s, a)) / (N(s, a) + 1)

  • Modified UCT with LLM Guidance:

The selection procedure incorporates the LLM-predicted probability conditioned on dynamic memory:

a* = argmax_{a ∈ A(s)} [ Q(s, a) + c_puct · LLM(a | 𝓜ᵢ, 𝓜𝒸, p) · √N(s) / (1 + N(s, a)) ]

  • Memory Update:

When simulations end in failure, an LLM-guided reflection is generated and stored in cross-trial memory, updating 𝓜𝒸 for later visits to similar states.

The overall MC-DML algorithm therefore comprises four phases—selection, expansion, simulation (rollout), and backpropagation—with dynamic memory augmenting the standard PUCT (Predictor + UCT) sampling rule. The pseudocode provided in the original work outlines a recursive simulation procedure that, upon terminal failure, invokes memory update via an LLM prompt and then propagates the revised rewards upward.
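A simplified Python sketch of how these four phases and the failure-triggered reflection could fit together is given below. The environment and LLM interfaces (env, llm_action_prior, llm_reflect, node.expand, and the node bookkeeping fields) are placeholder assumptions for illustration, not the authors' implementation.

```python
import math

def mcdml_simulation(node, env, llm_action_prior, llm_reflect,
                     cross_trial_memory, c_puct=1.0, depth=0, max_depth=20):
    """One MC-DML simulation: LLM-guided selection/expansion, recursion, reflection on failure, backup."""
    if env.is_terminal() or depth >= max_depth:
        if env.is_failure():
            # Memory update: store an LLM-generated critique of the failed trajectory,
            # keyed by the state so it can be retrieved on later visits to similar states.
            cross_trial_memory.add_reflection(env.state_key(),
                                              llm_reflect(node.in_trial_memory))
        return env.score()

    # Selection: LLM-guided PUCT over valid actions (unseen actions are expanded lazily).
    priors = llm_action_prior(node.in_trial_memory, cross_trial_memory, node.prompt)

    def puct_score(a):
        q, n_sa = node.q_values.get(a, 0.0), node.visit_counts.get(a, 0)
        return q + c_puct * priors.get(a, 1e-6) * math.sqrt(node.total_visits + 1) / (1 + n_sa)

    action = max(env.valid_actions(), key=puct_score)
    step_reward = env.step(action)   # placeholder interface: returns the immediate reward
    child = node.expand(action)      # expansion: create or fetch the child node

    # Simulation: recurse to obtain the return from this point onward.
    ret = step_reward + mcdml_simulation(child, env, llm_action_prior, llm_reflect,
                                         cross_trial_memory, c_puct, depth + 1, max_depth)

    # Backpropagation: incremental-mean update matching Q(s,a) <- Q(s,a) + (R - Q(s,a)) / (N(s,a) + 1).
    n, q = node.visit_counts.get(action, 0), node.q_values.get(action, 0.0)
    node.q_values[action] = q + (ret - q) / (n + 1)
    node.visit_counts[action] = n + 1
    node.total_visits += 1
    return ret
```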

5. Empirical Results and Comparative Performance

Experiments conducted on text-based games from the Jericho benchmark demonstrate that MC-DML attains superior performance over both pure RL approaches and standard MCTS variants. Key experimental observations include:

  • Immediate Performance Gains:

MC-DML achieves high scores in a single planning phase, outperforming methods that require multiple planning and retraining iterations.

  • Memory Ablation Studies:

Removing either in-trial or cross-trial memory substantially degrades performance, which confirms that dynamic memory is essential for steering clear of suboptimal choices and repetitive failures.

  • Comparative Metrics:

On benchmark games such as Zork1 and Deephome, MC-DML consistently registers higher cumulative rewards and better task success rates compared to agents using standard MCTS, RL-based approaches, or LLM-only policies.

The results indicate that the integration of dynamic memory with MCTS significantly improves the accuracy of action value estimation and enhances sample efficiency in complex, partially observable environments.

6. Implications and Future Directions

MC-DML represents a meaningful step toward more autonomous, human-like planning agents that can handle the uncertainty inherent in both language generation and environmental dynamics. Its fusion of LLM-based reasoning with memory-augmented search has several implications:

  • Scalability and Efficiency:

By using the memory-guided LLM prior to focus Monte Carlo simulations on promising actions, MC-DML can scale to large search spaces without prohibitive computational overhead.

  • Generalization:

The dynamic memory mechanism not only improves immediate planning performance but also contributes to better generalization across diverse tasks and environments.

  • Future Applications:

The design principles underlying MC-DML can be extended to other domains requiring reasoning under uncertainty, including diffusion model planning, combinatorial problem solving, and long-horizon task planning for service robots.

Further research may explore enhancements in long-term memory storage, more sophisticated retrieval mechanisms, and integration with real-time learning pipelines that update dynamic memory continuously during deployment.

MC-DML thus lays the groundwork for next-generation planning frameworks that leverage the strengths of LLMs while effectively managing the challenges of memory efficiency and uncertainty in complex decision-making tasks.
