PlanU: Robust LLM Decision-Making

Updated 28 October 2025
  • PlanU is a decision-making framework that models both LLM intrinsic uncertainty and environmental stochasticity, ensuring robust multi-step planning.
  • It integrates quantile distribution modeling with Monte Carlo Tree Search using quantile regression to capture diverse outcome distributions.
  • The framework employs Upper Confidence Bounds with Curiosity to balance exploration and exploitation, demonstrating superior performance on stochastic benchmark tasks.

PlanU is a decision-making methodology for LLMs that enables robust planning in environments subject to substantial uncertainty. It addresses both uncertainty intrinsic to LLMs and stochastic environmental transitions, providing a framework that uses quantile-based return distributions and enhanced exploration strategies through upper confidence bounds and curiosity-driven rewards (Deng et al., 21 Oct 2025).

1. Foundational Motivation

PlanU is motivated by the limitations of prior LLM-based Decision-Making (LDM) approaches, which typically model only LLM uncertainty (arising from stochastic sampling and variable output quality) or assume environments with deterministic state transitions. In practical sequential decision-making tasks, however, agents must contend simultaneously with:

  • LLM uncertainty: Variability in model outputs due to stochasticity in sampling and reasoning processes.
  • Environmental uncertainty: Non-deterministic transitions, where actions yield probabilistic rather than fixed outcomes.

PlanU explicitly models both uncertainties, aiming to facilitate multi-step planning and reliable interaction with stochastic environments in domains such as robotics, navigation, and complex digital tasks.

2. Quantile Distribution Modeling in MCTS

The core innovation in PlanU is its fusion of Monte Carlo Tree Search (MCTS) with quantile-based return evaluation at each tree node. Unlike classical MCTS, which stores expected (mean) values for state–action pairs, PlanU models the return $Z(s, a)$ as a quantile distribution:

$$\theta_Z(s, a, \tau) = \inf \{\, z \in \mathbb{R} : \tau \leq F_Z(z) \,\}$$

where $F_Z(z)$ is the cumulative distribution function of the return for state $s$ and action $a$, and $\tau \in [0, 1]$ indexes quantiles. PlanU employs a fixed number of quantiles (e.g., $n_q = 50$), updating each via quantile regression:

$$\mathcal{L}_{QR} = \mathbb{E}_{Z(s,a)} \left[ \rho_\tau \big( y(s_+, a_+) - Z(s, a) \big) \right]$$

Here, $y(s_+, a_+)$ is the target value incorporating the immediate reward and discounted future returns, and $\rho_\tau(\cdot)$ is a quantile Huber loss, chosen because it smooths the non-differentiable kink of the standard quantile loss and is robust to outlying targets.
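The following Python sketch illustrates one way such a quantile Huber update could be computed; the pairwise averaging over quantile fractions and the threshold `kappa` follow the standard distributional-RL formulation and are assumptions rather than details taken from the paper.

```python
import numpy as np

def quantile_huber_loss(targets, quantiles, taus, kappa=1.0):
    """Quantile Huber loss rho_tau averaged over (quantile, target) pairs.

    Illustrative only: `kappa` and the pairwise averaging follow the usual
    distributional-RL recipe and may differ from PlanU's exact update.
    """
    # Pairwise errors u_ij = y_j - theta_i between targets and quantile estimates
    u = targets[None, :] - quantiles[:, None]
    # Huber component: quadratic near zero, linear beyond kappa
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric quantile weighting |tau_i - 1{u_ij < 0}|
    weight = np.abs(taus[:, None] - (u < 0.0).astype(float))
    return float((weight * huber / kappa).mean())

# Example: n_q = 50 quantiles at midpoint fractions, two sampled target returns
taus = (2 * np.arange(50) + 1) / 100.0
loss = quantile_huber_loss(np.array([1.2, 0.7]), np.zeros(50), taus)
```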

This quantile-based approach provides fine-grained uncertainty information at each node, preserving the multimodal characteristics of possible outcome distributions and better informing downstream action selection.
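As a concrete illustration, a minimal sketch of what a PlanU-style tree node might store is given below; the class name, fields, and the midpoint quantile fractions are hypothetical choices for exposition, not details drawn from the paper.

```python
import numpy as np

N_QUANTILES = 50  # n_q, the fixed quantile count mentioned above

class QuantileNode:
    """MCTS node holding a quantile approximation of Z(s, a) per action.

    Hypothetical sketch; names and fields are illustrative, not PlanU's API.
    """

    def __init__(self, actions):
        # Midpoint quantile fractions tau_i = (2i + 1) / (2 * n_q)
        self.taus = (2 * np.arange(N_QUANTILES) + 1) / (2.0 * N_QUANTILES)
        # One vector of estimates theta_Z(s, a, tau_i) per candidate action
        self.quantiles = {a: np.zeros(N_QUANTILES) for a in actions}
        self.visit_counts = {a: 0 for a in actions}

    def expected_return(self, action):
        # Collapse the quantile distribution to a scalar (here, its mean)
        return float(self.quantiles[action].mean())
```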

3. Upper Confidence Bounds with Curiosity (UCC) Score

To navigate the exploration–exploitation tradeoff inherent in MCTS, PlanU introduces the Upper Confidence Bounds with Curiosity (UCC) scoring mechanism. For each candidate action $a_t$ available from state $s_t$, the UCC score is:

$$\mathrm{UCC}(s_t, a_t) = \psi\!\left[Z(s_t, a_t)\right] + c_1 \cdot \frac{r_i(s_t)}{N(s_t, a_t)}$$

  • $\psi[Z(s_t, a_t)]$ maps the estimated quantile distribution to a scalar summarizing the expected return.
  • $r_i(s_t)$ is an intrinsic novelty (curiosity) reward capturing how unfamiliar the state $s_t$ is. It is computed as the divergence between textual features predicted by a trained network and those produced by a fixed, randomly initialized network (see the sketch after this list).
  • $N(s_t, a_t)$ is the visit count for $(s_t, a_t)$, regularizing exploration.
  • $c_1$ is a tunable hyperparameter controlling the strength of the curiosity bonus.
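The curiosity term $r_i(s_t)$ described above resembles random network distillation: a predictor network is trained to match a fixed, randomly initialized target network, and the residual error serves as the novelty signal. A minimal PyTorch sketch follows, with the feature dimension, layer sizes, and MSE divergence all chosen for illustration rather than taken from PlanU.

```python
import torch
import torch.nn as nn

class CuriosityReward(nn.Module):
    """RND-style novelty bonus over text-state features (illustrative sketch).

    Feature dimension, layer sizes, and MSE as the divergence are assumptions;
    PlanU's actual curiosity module may differ.
    """

    def __init__(self, feat_dim=768, hidden=256):
        super().__init__()
        # Fixed, randomly initialized target network (never trained)
        self.target = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor network trained to match the target's outputs on visited states
        self.predictor = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))

    def forward(self, state_features):
        # r_i(s): divergence between predictor and fixed-target features; it is
        # large for unfamiliar states and shrinks as the predictor fits visited ones.
        with torch.no_grad():
            target_feat = self.target(state_features)
        pred_feat = self.predictor(state_features)
        return ((pred_feat - target_feat) ** 2).mean(dim=-1)
```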

By integrating curiosity-driven rewards, UCC encourages the agent to explore novel or less-visited states while retaining a bias toward actions with high expected returns, thus facilitating robust decision-making even under severe uncertainty.
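Putting the pieces together, a hedged sketch of UCC-based action selection during the MCTS selection phase is shown below; it reuses the hypothetical `QuantileNode` from the earlier sketch, takes $\psi$ to be the quantile mean, and gives unvisited actions an infinite score so each is expanded at least once (all choices assumed for illustration).

```python
import numpy as np

def ucc_score(quantiles, curiosity, visits, c1=1.0):
    # UCC(s_t, a_t) = psi[Z(s_t, a_t)] + c1 * r_i(s_t) / N(s_t, a_t)
    if visits == 0:
        return float("inf")                       # try every action at least once
    expected_return = float(np.mean(quantiles))   # psi collapses Z(s, a) to a scalar
    return expected_return + c1 * curiosity / visits

def select_action(node, curiosity):
    """Pick the child action maximizing the UCC score at this node."""
    return max(node.quantiles,
               key=lambda a: ucc_score(node.quantiles[a], curiosity,
                                       node.visit_counts[a]))
```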

4. Experimental Evaluation and Comparative Results

PlanU was tested on benchmark tasks with deliberate stochasticity, including Blocksworld (20% action failure rates), Overcooked, VirtualHome, TravelPlanner, and WebShop. These environments present:

  • Multi-step, stochastic action outcomes: actions may not succeed deterministically; outcomes are sampled from probabilistic transitions (a toy illustration follows this list).
  • Variable LLM output quality: Induced by manipulating prompt structures or sampling temperatures.
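A toy Python illustration of a transition with a 20% action-failure rate is given below; the return convention (next state, reward, done flag) and the choice to leave the state unchanged on failure are assumptions, not the benchmark's exact semantics.

```python
import random

def stochastic_step(state, action, nominal_step, failure_rate=0.2):
    """Toy stochastic transition in the spirit of the perturbed Blocksworld setup.

    With probability `failure_rate` the action fails and the state is unchanged;
    otherwise the nominal (deterministic) effect applies. Illustrative only.
    """
    if random.random() < failure_rate:
        return state, 0.0, False        # failed action: no state change, no reward
    return nominal_step(state, action)  # (next_state, reward, done)
```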

PlanU consistently yielded higher success rates and stronger constraint satisfaction compared to baselines such as Chain-of-Thought prompting (CoT), Tree-of-Thoughts (ToT), and RAP (Reasoning via Planning):

| Benchmark | PlanU Performance | Baseline (Best) | Notable Features |
| --- | --- | --- | --- |
| Blocksworld | Highest success rate | ToT, RAP | Robust to 20% action failures |
| Overcooked | Superior task completion | Linear reward models | Stable under LLM output variability |
| VirtualHome | Outperforms RAP variants | RAP, CoT, ToT | Handles sequential stochastic planning |
| TravelPlanner | Improved constraint satisfaction | ToT, CoT | Excels on real-world data with uncertainty |
| WebShop | Higher completion/accuracy | Baseline LDMs | Reliable in web-based stochastic environments |

A key result is PlanU’s robustness: performance degrades minimally as LLM sampling temperature increases or prompt order is manipulated, which would otherwise impact conventional LDM strategies.

5. Methodological Implications and Broader Applications

PlanU’s framework—combining quantile distributions with curiosity-augmented exploration—potentially generalizes across a spectrum of interactions requiring planning under uncertainty:

  • Robotics: Manipulation tasks under uncertain sensor readings or actuator errors.
  • Autonomous navigation: Route planning with probabilistic obstacles or traffic.
  • Complex web automation: Shopping, travel, and scheduling with unpredictable third-party service responses.

Further implications include a blueprint for integrating external tools or oracles via hierarchical MCTS, progressive widening techniques for continuous or high-dimensional action spaces, and possible fusion with reinforcement learning to refine policy estimates under risk. This suggests PlanU’s architecture may serve as a foundational template for future robust LLM planners.
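As one example of the techniques mentioned above, progressive widening bounds a node's branching factor by its visit count; the rule below is the generic formulation, with `k` and `alpha` as illustrative hyperparameters not taken from PlanU.

```python
def should_widen(num_children, node_visits, k=1.0, alpha=0.5):
    # Progressive widening: add a new child action only while
    # |children(s)| <= k * N(s)^alpha, so branching grows sublinearly with visits.
    return num_children <= k * (node_visits ** alpha)
```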

6. Limitations and Forward-Looking Challenges

While PlanU demonstrates empirical improvements over existing LDM approaches, several avenues for enhancement remain:

  • Scalability: Quantile-based distributions entail increased memory and computation. Progressive and hierarchical search techniques may alleviate overhead for real-time deployment.
  • Curiosity reward calibration: The effectiveness of $r_i(s)$ depends on accurate state representations and network divergence metrics; adaptation for domains with poor feature extraction remains an open challenge.
  • Integration with external sources: PlanU is designed for closed, well-defined environments; extending to scenarios requiring real-time external data (web search, APIs) will necessitate further methodological development.

A plausible implication is that future models may couple the PlanU framework with continuous learning agents or hybrid neuro-symbolic tools, achieving resilient planning across open-world uncertain environments.

7. Contextual Connections within Decision-Making Research

PlanU operates at the confluence of LLM reasoning, sequential decision-making, and probabilistic planning. It extends tree-based planning (MCTS) by incorporating distributional perspectives, similar in spirit to distributional reinforcement learning but applied within LLM-driven agents. PlanU’s explicit treatment of both LLM and environmental uncertainty distinguishes it from approaches that consider only one source, marking an advance in safety and reliability standards for real-world autonomous decision agents.

Researchers may situate PlanU within broader efforts to blend generative modeling, robust planning, and exploration incentives—an area likely to see continued growth as LLMs are deployed for increasingly complex tasks under uncertainty.
