
In-Context Policy Iteration (ICPI)

Updated 26 January 2026
  • In-Context Policy Iteration (ICPI) is a reinforcement learning method that uses frozen LLMs and prompt design to iterate policies without gradient updates.
  • It employs in-context learning for value estimation and policy improvement across discrete MDPs and continuous dynamic tasks.
  • ICPI achieves robust sample efficiency by leveraging autoregressive token queries and replay buffers to drive adaptive decision-making.

In-Context Policy Iteration (ICPI) is a reinforcement learning (RL) methodology that repurposes large pre-trained foundation models—particularly LLMs—to implement policy iteration entirely through prompt design, without any modification or gradient updates to model parameters. ICPI exploits the emergent in-context learning capabilities of modern transformers to realize policy improvement, simulation, and value estimation through sequential token queries, with all adaptation localized to growing prompt content. This paradigm has been demonstrated for both discrete, tabular Markov Decision Processes (MDPs) and continuous, high-dimensional dynamic manipulation domains, achieving competitive sample efficiency and robustness without expert demonstrations or gradient-based learning (Brooks et al., 2022, Merwe et al., 20 Aug 2025).

1. Mathematical Framework

ICPI operates within the standard RL formalism. For discrete domains, a Markov Decision Process is specified by a state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P(s' \mid s, a)$, reward function $R(s, a)$, and discount factor $\gamma \in [0, 1)$. The objective is to learn a stationary policy $\pi: \mathcal{S} \to \mathcal{A}$ that maximizes the expected discounted sum of rewards: $$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s,\, a_0 = a,\, \pi\right].$$ Classical policy iteration alternates between policy evaluation (estimation of $Q^\pi$) and policy improvement (greedily updating $\pi$ using $Q^\pi$).
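For reference, classical policy iteration on an explicit tabular MDP can be sketched as follows; this is the loop ICPI emulates with prompts, here shown on a hypothetical 2-state, 2-action MDP with made-up transition and reward values:

```python
import numpy as np

gamma = 0.9
# P[s, a, s'] transition probabilities and R[s, a] rewards (hypothetical toy MDP).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])

def evaluate(pi, tol=1e-8):
    """Policy evaluation: iterate V <- R_pi + gamma * P_pi V to a fixed point."""
    V = np.zeros(2)
    while True:
        V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V for s in range(2)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def improve(V):
    """Greedy policy improvement from the current value function."""
    Q = R + gamma * P @ V  # Q[s, a]
    return Q.argmax(axis=1)

pi = np.zeros(2, dtype=int)
for _ in range(10):  # alternate evaluation and improvement until stable
    pi = improve(evaluate(pi))
```

ICPI's contribution is replacing the explicit `P` and `R` above with rollouts simulated by a frozen LLM conditioned on a transition buffer.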

ICPI replaces explicit value function estimation by simulating rollouts using a frozen LLM as a world model. At each decision point, for every action $a \in \mathcal{A}$ in state $s$, ICPI estimates $Q(s, a)$ via Monte Carlo trajectory samples generated autoregressively by the LLM, which infers the immediate reward, next state, and termination condition. Policy improvement is realized by greedy selection: $$a_t = \operatorname*{argmax}_{a \in \mathcal{A}} \hat{Q}(s_t, a; B),$$ where $B$ is a replay buffer comprising all previously observed transitions, stored as prompts in a pre-defined textual format. For continuous domains with parameterized policies $\pi_\theta$, a trajectory error $e$ is computed after each rollout, and the LLM is prompted to produce parameter updates $\Delta\theta$ that minimize the rollout cost $C_\tau$ (Merwe et al., 20 Aug 2025).
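The Monte Carlo estimate of $Q(s, a)$ can be sketched as below, with `world_model` standing in for the frozen LLM queried autoregressively for reward, next state, termination, and next action (the interface and toy dynamics are assumptions for illustration):

```python
def estimate_q(state, action, world_model, gamma=0.5, n_rollouts=4, horizon=10):
    """Monte Carlo Q estimate: average discounted return over trajectories
    simulated step by step by `world_model` (stand-in for the frozen LLM)."""
    returns = []
    for _ in range(n_rollouts):
        s, a, ret, disc = state, action, 0.0, 1.0
        for _ in range(horizon):
            r, s_next, done, a_next = world_model(s, a)  # one in-context query
            ret += disc * r
            if done:
                break
            disc *= gamma
            s, a = s_next, a_next
        returns.append(ret)
    return sum(returns) / len(returns)

# Deterministic toy "world model" (hypothetical): reward 1 per step, chain ends at state 3.
toy_model = lambda s, a: (1.0, s + 1, s + 1 >= 3, 0)
q_hat = estimate_q(0, 0, toy_model)
```

In the actual method each `world_model` call is a separate LLM completion conditioned on examples drawn from the replay buffer.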

2. Algorithmic Realization

ICPI is implemented as an iterative, prompt-driven procedure. In the discrete/MDP regime (Brooks et al., 2022):

  1. An empty replay buffer $B$ is initialized.
  2. For each episode, the agent:
    • Observes the current state $s_t$.
    • For each action $a$, simulates a trajectory using the current buffer $B$ as in-context data, assembling sub-buffers $B_\text{Rew}$, $B_\text{Done}$, $B_\text{Obs}$, and $B_\text{Traj}$ for reward, termination, transition, and action modeling, respectively.
    • Collects rollouts via repeated LLM queries, estimating $\hat{Q}(s_t, a; B)$ as the discounted sum of sampled rewards.
    • Selects $a_t = \arg\max_a \hat{Q}(s_t, a; B)$, interacts with the environment, and appends the resulting transition to $B$.
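The episode loop above can be summarized in a few lines; the environment and Q-estimator interfaces here are hypothetical stand-ins (in ICPI proper, `q_estimate` would issue LLM rollout queries against the buffer):

```python
def icpi_episode(env_reset, env_step, actions, q_estimate, buffer):
    """One ICPI episode sketch: greedy action choice from estimated Q-values,
    with every real transition appended to the buffer that seeds future prompts."""
    s, done = env_reset(), False
    while not done:
        a = max(actions, key=lambda act: q_estimate(s, act, buffer))
        r, s_next, done = env_step(s, a)
        buffer.append((s, a, r, s_next, done))  # later rendered as in-context examples
        s = s_next
    return buffer

# Toy chain environment and a stub Q-estimator that prefers advancing (hypothetical).
step = lambda s, a: (1.0 if s + a == 3 else 0.0, s + a, s + a == 3)
buf = icpi_episode(lambda: 0, step, [0, 1], lambda s, a, b: a, buffer=[])
```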

In parameterized policy settings (Merwe et al., 20 Aug 2025), ICPI constructs a dataset $\mathcal{D} = \{(\theta^i, e^i, \Delta\theta^i)\}$ from prior rollouts. K-nearest-neighbor selection in $[\theta, e]$-space collects the $k$ most relevant prompt examples, which, together with the query $(\theta^i, e^i)$, are passed to a transformer model to generate $\Delta\theta^i$. The policy is then updated by $\theta^{i+1} = \theta^i + \Delta\theta^i$.
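The nearest-neighbor example selection can be sketched as a Euclidean search over concatenated $[\theta, e]$ features; the data layout and 1-D values below are illustrative assumptions, not the paper's exact interface:

```python
import numpy as np

def knn_prompt_examples(dataset, theta_q, e_q, k=3):
    """Pick the k nearest (theta, e, delta_theta) triples to the query in
    [theta, e]-space; these become the k-shot examples in the prompt."""
    feats = np.array([np.concatenate([t, e]) for t, e, _ in dataset])
    query = np.concatenate([theta_q, e_q])
    idx = np.argsort(np.linalg.norm(feats - query, axis=1))[:k]
    return [dataset[i] for i in idx]

# Usage sketch with 1-D parameters (hypothetical values):
data = [(np.array([0.0]), np.array([0.0]), np.array([0.1])),
        (np.array([1.0]), np.array([0.0]), np.array([0.2])),
        (np.array([5.0]), np.array([5.0]), np.array([0.3]))]
nearest = knn_prompt_examples(data, np.array([0.1]), np.array([0.0]), k=2)
# After the transformer completes delta-theta: theta_next = theta + delta_theta
```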

3. Prompt Design and Model Usage

ICPI leverages systematic prompt engineering to encode all knowledge required for policy iteration within the LLM context window. Discrete ICPI renders environment transitions as Python-style assertions (e.g., `assert state == 3`, `reward == 0.0`, `not done`, `next_state = 4`). Separate prompt subsets are designed for reward, termination, and observation queries, and each prompt is balanced to include equal numbers of terminal/non-terminal examples and reward classes to mitigate bias (Brooks et al., 2022).
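A renderer for such assertion-style transition prompts might look like the following; the exact template used by Brooks et al. differs in detail, so treat this layout as illustrative:

```python
def render_transition(state, action, reward, done, next_state):
    """Render one transition as Python-style assertion text (layout is a sketch,
    not the paper's exact prompt format)."""
    return "\n".join([
        f"assert state == {state}",
        f"assert action == {action}",
        f"assert reward == {reward}",
        "assert done" if done else "assert not done",
        f"next_state = {next_state}",
    ])

prompt_block = render_transition(3, 1, 0.0, False, 4)
```

Concatenating many such blocks from the replay buffer yields the in-context examples for each reward, termination, or transition query.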

For continuous and dynamic manipulation tasks, policy parameters, rollout errors, and target updates are tokenized as space-separated numeric values. Prompts have a standardized instruction header and a structured sequence of k-shot input-output examples, finalized with the query (current) line for completion. This design enables the LLM to generalize from prior patterns without model weight modification (Merwe et al., 20 Aug 2025).
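A prompt builder for the numeric regime could be sketched as below; the field labels, header wording, and precision are assumptions rather than the paper's exact format:

```python
def build_numeric_prompt(examples, theta_q, e_q, precision=3):
    """Assemble a k-shot numeric prompt: instruction header, one line per
    (theta, e, delta_theta) example, and a query line left open for the model
    to complete with the update."""
    fmt = lambda xs: " ".join(f"{x:.{precision}f}" for x in xs)
    header = "Given policy parameters and rollout error, output a parameter update.\n"
    shots = "".join(f"params: {fmt(t)} error: {fmt(e)} update: {fmt(d)}\n"
                    for t, e, d in examples)
    return header + shots + f"params: {fmt(theta_q)} error: {fmt(e_q)} update:"

prompt = build_numeric_prompt([([0.5, 1.0], [0.2], [-0.1, 0.05])],
                              [0.4, 1.05], [0.15])
```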

All policy improvement and state simulation occurs solely via prompt content; LLMs remain entirely frozen with no back-propagation or gradient updates.

4. Learning Mechanism and Theoretical Considerations

ICPI relies on the in-context learning capacity of large transformers to "learn" both environmental dynamics (for forward simulation) and policy structure (for action generation or parameter update suggestion). In lieu of gradient-based loss minimization, learning dynamics are embedded in the editing and organization of prompts, with relevant sub-buffers and example selection controlling the empirical distributions the model conditions on. Exploration arises not from explicit mechanisms, but from the statistical diversity and recency management of the in-context buffer or example pool.

In iterative policy update settings, ICPI implicitly minimizes the squared norm between the predicted update and the oracle policy improvement direction: $$\Delta\theta^i \approx \operatorname*{argmin}_{\Delta\theta} \left\| \Delta\theta - (\theta^* - \theta^i) \right\|^2, \qquad \theta^{i+1} = \theta^i + \Delta\theta^i$$ (Merwe et al., 20 Aug 2025). All convergence and performance characteristics arise from the model's ability to interpolate and extrapolate from in-context demonstration sets.
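A quick numeric sanity check of this objective: if the in-context model's completion approximates the oracle direction scaled by some step size $\alpha$ (the scaling is a hypothetical stand-in for the LLM's predicted update), the iterates contract geometrically toward $\theta^*$:

```python
import numpy as np

theta_star = np.array([1.0, -2.0])  # hypothetical optimum
theta = np.zeros(2)
alpha = 0.5                          # hypothetical step scale
for _ in range(30):
    delta = alpha * (theta_star - theta)  # stand-in for the model's predicted update
    theta = theta + delta                 # theta_{i+1} = theta_i + delta_theta_i
```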

5. Empirical Performance and Evaluation

ICPI achieves rapid learning and competitive performance on both synthetic MDPs and real-world dynamic manipulation tasks. Table 1 from (Merwe et al., 20 Aug 2025) summarizes results from five environments:

| Method | slide | slide-gc | rope-swing | rope-swing-gc | roll-gc-real |
|---|---|---|---|---|---|
| Random Shooting | 0.037 | 0.077 | 0.007 | 0.004 | — |
| Bayes Opt | 0.054 | 0.087 | 0.019 | 0.014 | — |
| KNN-5 | 0.106 | 0.071 | 0.020 | 0.010 | — |
| Lin. KNN-20 | 0.053 | 0.022 | 0.042 | 0.006 | 33.824 |
| ICSI | 0.030 | 0.107 | 0.042 | 0.021 | — |
| In-Weights | 0.029 | 0.102 | 0.027 | 0.022 | — |
| ICPI | 0.013 | 0.025 | 0.007 | 0.002 | 17.107 |

Results for discrete domains (Brooks et al., 2022) indicate that ICPI with Codex-davinci-001 converges to near-optimal policies within 200–400 steps across six MDPs, outperforming both memorization/nearest-neighbor and tabular Q-learning baselines. Smaller LLMs fail to meet the method's generalization demands except on the most elementary benchmark.

6. Design Sensitivities, Limitations, and Future Directions

ICPI’s success is highly sensitive to prompt engineering: balancing terminal states, reward varieties, and formatting is essential for robust world-model and policy inference. The method currently requires discrete or appropriately discretized spaces and fixed encoding schemes. In the continuous, dynamic domain, performance is bounded by the capacity of the LLM to parse numerical patterns and by the quality of error feature engineering; low-dimensional parametric policies are demonstrated, but scaling to high-DOF and pixel-based problems remains unresolved (Merwe et al., 20 Aug 2025).

The approach sidesteps the need for expert demonstrations and fully eschews gradient-based adaptation, but large-transformer inference imposes significant compute and API cost. Potential avenues include automated prompt-format discovery, integration with lightweight parameter adaptation (e.g., adapters), and specialized backbones for high-throughput or multi-modal RL.

A plausible implication is that, as LLM capacities and context lengths scale, ICPI may extend to richer domains (including vision-language and high-dimensional continuous control) and provide a new foundation for RL agents capable of generalization via prompt manipulation alone.

