This paper (Arumugam et al., 29 Apr 2025) addresses the challenge of data-efficient exploration in sequential decision-making problems tackled by LLM agents. While recent work has focused on designing novel LLM agent architectures that implicitly induce reinforcement learning (RL) algorithms through techniques like in-context learning (ICL) or self-reflection, the authors propose an alternative approach: using LLMs to explicitly implement existing, well-studied RL algorithms known for their data efficiency.
The core idea is to leverage the strong exploration properties of Posterior Sampling for Reinforcement Learning (PSRL) [strens2000bayesian, osband2013more] by decomposing the algorithm into core functions and assigning each function to a distinct LLM. Unlike classical PSRL implementations that often require complex statistical machinery (e.g., maintaining Dirichlet distributions for tabular MDPs), this approach uses natural language to represent the agent's beliefs and uncertainty about the environment, making it potentially applicable to natural language-based environments.
The LLM-based PSRL implementation involves three key components (illustrated in Figure 1; a minimal code sketch of their interfaces follows the list):
- Approximate Posterior Updater LLM: This LLM is responsible for maintaining the agent's beliefs and uncertainty about the unknown MDP (transitions and rewards). Given the current textual posterior description and a trajectory of observed experiences (state, action, reward, next state), this LLM updates the textual description to reflect the new information. The authors note that using natural language for the prior allows flexible representation of knowledge, potentially beyond standard statistical distributions.
- Posterior Sampler LLM: Given the current textual posterior, this LLM generates a plausible hypothesis of the true MDP by sampling from the agent's beliefs. This hypothesis is also represented textually and should be consistent with the current state of knowledge and uncertainty described in the posterior. For some domains, like Wordle, this might involve sampling an environment proxy (e.g., the target word) rather than explicitly sampling the full MDP dynamics.
- Optimal Sample Policy LLM: Provided with the current state and the sampled MDP hypothesis (generated by the Posterior Sampler LLM), this LLM selects the action that it believes would be optimal in the sampled environment. This requires the LLM to perform planning based on the provided hypothesis. While achieving a truly optimal policy can be challenging, the paper notes that even approximately-optimal planning can still yield theoretical guarantees for PSRL.
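To make the decomposition concrete, here is a minimal sketch of how the three roles might be exposed as plain functions around a generic text-completion call. The `llm(prompt) -> str` helper and all prompt wording are illustrative assumptions, not the paper's actual prompts:

```python
# Illustrative interfaces for the three PSRL roles. `llm` is an assumed
# text-completion helper (e.g., a thin wrapper around a chat API), and the
# prompt wording is hypothetical rather than taken from the paper.

def update_posterior(llm, posterior: str, trajectory: list[tuple]) -> str:
    """Approximate Posterior Updater: fold an episode's experience into the
    textual belief state."""
    transitions = "\n".join(
        f"state={s}, action={a}, reward={r}, next_state={s2}"
        for (s, a, r, s2) in trajectory
    )
    prompt = (
        "Current beliefs about the environment:\n"
        f"{posterior}\n\n"
        "Newly observed transitions:\n"
        f"{transitions}\n\n"
        "Rewrite the beliefs to incorporate this evidence, keeping any "
        "remaining uncertainty explicit."
    )
    return llm(prompt)


def sample_hypothesis(llm, posterior: str) -> str:
    """Posterior Sampler: draw one plausible, fully specified environment
    consistent with the textual posterior."""
    prompt = (
        f"Beliefs and uncertainty:\n{posterior}\n\n"
        "Commit to one concrete hypothesis of the environment that is "
        "consistent with these beliefs. Describe it completely."
    )
    return llm(prompt)


def act(llm, state: str, hypothesis: str) -> str:
    """Optimal Sample Policy: choose the action that would be (approximately)
    optimal if the sampled hypothesis were the true environment."""
    prompt = (
        f"Assume the environment is exactly:\n{hypothesis}\n\n"
        f"Current state: {state}\n"
        "Which single action maximizes expected return under this assumption? "
        "Answer with the action only."
    )
    return llm(prompt)
```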
These three LLMs are orchestrated to follow the PSRL algorithm (Algorithm 1; see the loop sketch after this list):
- At the beginning of each episode, the Posterior Sampler LLM draws a new hypothesis about the MDP based on the current posterior.
- For each step within the episode, the Optimal Sample Policy LLM selects an action based on the current state and the sampled MDP hypothesis.
- After the episode completes, the Posterior Updater LLM updates the agent's posterior belief based on the full trajectory observed during the episode.
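Putting the pieces together, the control flow is the standard PSRL skeleton with each statistical operation delegated to an LLM call. The `env` interface, episode count, and horizon handling below are illustrative assumptions:

```python
def llm_psrl(llm, env, prior: str, num_episodes: int, horizon: int) -> str:
    """PSRL skeleton with posterior sampling, planning, and posterior updates
    each delegated to an LLM (interfaces sketched above). `env` is an assumed
    episodic environment with reset() and step(action) -> (state, reward, done)."""
    posterior = prior
    for _ in range(num_episodes):
        # 1. Sample one hypothesis of the MDP and hold it fixed for the episode.
        hypothesis = sample_hypothesis(llm, posterior)
        trajectory = []
        state = env.reset()
        for _ in range(horizon):
            # 2. Act greedily with respect to the sampled hypothesis.
            action = act(llm, state, hypothesis)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward, next_state))
            state = next_state
            if done:
                break
        # 3. Fold the full episode back into the textual posterior.
        posterior = update_posterior(llm, posterior, trajectory)
    return posterior
```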
The authors evaluate their LLM-based PSRL on tasks that require prudent exploration, comparing it against several baseline LLM agents: Reflexion [shinn2024reflexion], In-Context Reinforcement Learning (ICRL) [monea2024LLMs], and In-Context Policy Iteration (ICPI) [brooks2023large]. The tasks include:
- A 5-armed Bernoulli bandit (horizon H=1), compared against classic Thompson Sampling (a reference sketch of this baseline follows the list).
- Deterministic natural language tasks: A Combination Lock (H=3, 8 episodes) and Wordle (H=5, 6 episodes).
- A stochastic environment: A truncated RiverSwim (3 states, H=6 episodes).
Empirical results (Figures 2, 3, 4) demonstrate that the LLM-based PSRL effectively retains the exploration benefits of classic PSRL in multi-armed bandits and deterministic natural language tasks, generally outperforming the baseline LLM agents. The paper highlights that simply prompting LLMs to perform the atomic functions of PSRL leads to strategic exploration without explicitly instructing them to explore.
A key finding regarding stochastic environments (RiverSwim) is the significant impact of the underlying LLM's capability. While GPT-4o struggled to maintain the detailed textual epistemic state in RiverSwim (Figure 6 in the appendix), upgrading to a more capable model (o1-mini) allowed the LLM-based PSRL to achieve sub-linear regret comparable to classic tabular PSRL (Figure 5). This suggests that the approach can scale gracefully with improved LLMs.
However, scalability to larger stochastic environments remains a limitation. Increasing the RiverSwim environment to 4 states causes performance to degrade back to linear regret (Figure 6). The authors attribute this to the difficulty LLMs have in accurately maintaining posterior concentration over transitions and rewards, and in performing long-term, value-based planning even when supplied with high-fidelity sampled MDPs.
The paper also briefly explores a more advanced exploration strategy, Information-Directed Sampling (IDS) [russo2018learning], implemented with LLMs (LLM-IDS) to address a known limitation of Thompson Sampling (and thus PSRL): its inability to take deliberately suboptimal actions that yield high information gain. Preliminary results on a contrived informative-action bandit and the Combination Lock (Figures 8 and 9) suggest that implementing IDS with LLMs is a promising direction for future work.
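To make the contrast with Thompson Sampling concrete, a simplified, deterministic form of the IDS criterion scores each action by its squared expected regret divided by its expected information gain and plays the minimizer. The sketch below illustrates only this criterion, with the regret and information-gain estimates assumed as inputs; it is not the paper's LLM-IDS implementation:

```python
def ids_action(expected_regret: list[float], info_gain: list[float]) -> int:
    """Deterministic information-directed sampling: play the action minimizing
    the information ratio regret(a)**2 / gain(a), trading immediate regret
    against information gained about the optimal action. The per-action
    estimates are assumed to come from the agent's current beliefs."""
    ratios = []
    for regret, gain in zip(expected_regret, info_gain):
        if regret == 0.0:
            ratios.append(0.0)           # zero regret: best possible ratio
        elif gain > 0.0:
            ratios.append(regret ** 2 / gain)  # squared regret per unit of information
        else:
            ratios.append(float("inf"))  # suboptimal and uninformative
    return min(range(len(ratios)), key=ratios.__getitem__)
```

Because the ratio can be minimized by an action with positive regret whenever its information gain is large enough, this criterion can deliberately take suboptimal actions for information, which is exactly the behavior the informative-action bandit is designed to probe.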
In conclusion, the paper establishes that explicitly implementing existing, data-efficient RL algorithms like PSRL using LLMs is a viable and effective approach for equipping LLM agents with strategic exploration capabilities, particularly in natural language domains. The success hinges on the LLMs' ability to perform the necessary probabilistic reasoning and planning functions, highlighting future research avenues in scaling these capabilities to more complex and stochastic environments.