In-Context Steered Policy Optimization
- ICPO is a reinforcement learning framework that uses in-context learning to provide implicit expert guidance, improving exploration and training stability.
- It employs a mixed-policy GRPO combining on-policy rollouts with off-policy implicit expert forcing and expert region filtering to enhance performance.
- The framework uses annealed reward shaping to balance early expert imitation with later autonomous exploration, boosting overall training efficacy.
In-Context Steered Policy Optimization (ICPO) is a reinforcement learning (RL) framework designed for large reasoning models (LRMs) under the paradigm of Reinforcement Learning from Verifiable Rewards (RLVR). ICPO leverages the in-context learning (ICL) capabilities of LRMs to provide implicit expert guidance within the optimization process, expanding policy coverage and enhancing training stability without reliance on external expert trajectories.
1. Motivation and Problem Setting
Group Relative Policy Optimization (GRPO) is a prevailing RLVR algorithm that drives policy improvement via reward maximization on verified correct trajectories. However, GRPO restricts exploration by sampling exclusively from the current policy’s distribution (on-policy rollouts), often resulting in inadequate trajectory diversity and convergence to local optima. Alternative methods that supplement training with trajectories sampled from stronger external expert models increase computational burden and require access to advanced resources, which may be impractical.
ICPO directly addresses these challenges by exploiting the inherent ICL ability of LRMs to supply expert guidance from existing datasets, eliminating the need for external rollouts. It establishes a scalable and effective mechanism for post-training large-scale reasoning models, with a particular focus on mathematical and symbolic reasoning domains (Huang et al., 30 Oct 2025).
2. Principal Mechanisms: Mixed-Policy GRPO with Implicit Expert Forcing
ICPO introduces Mixed-Policy GRPO, in which the policy is trained on both on-policy and off-policy rollouts:
- On-policy trajectories are sampled as usual from the current policy $\pi_\theta$.
- Off-policy trajectories are generated by conditioning the policy on few-shot, high-quality expert demonstrations (via ICL), termed Implicit Expert Forcing (IEF); a minimal prompt-construction sketch follows this list.
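A minimal sketch of what the ICL conditioning could look like in practice; the prompt template and field names are illustrative assumptions, not taken from the source:

```python
def build_icl_prompt(demonstrations, question):
    """Prepend few-shot expert demonstrations to the question so that the same
    policy samples from an expert-conditioned (IEF) distribution."""
    shots = "\n\n".join(
        f"Problem: {d['problem']}\nSolution: {d['solution']}" for d in demonstrations
    )
    return f"{shots}\n\nProblem: {question}\nSolution:"
```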
Formally, for policy $\pi_\theta$, conditioning on a few-shot demonstration set $\mathcal{D}$ induces an implicit “task vector” and defines the IEF policy $\pi_{\mathrm{IEF}}(\cdot \mid x) = \pi_\theta(\cdot \mid \mathcal{D}, x)$, i.e., the same model sampled under an expert-augmented prompt.
The Mixed-Policy GRPO objective combines the GRPO surrogate terms computed on both on-policy and IEF rollouts within a single group.
The group-normalized advantage is calculated across both trajectory types, allowing ICPO to “think outside the policy” and explore regions of the solution space inaccessible to on-policy updates alone.
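As a minimal sketch only, assuming standard GRPO notation (the group sizes $G$ and $G'$ and the clipped surrogate $\ell^{\mathrm{clip}}$ are assumptions, not the paper's exact formulation), the mixed objective can be written as:

$$
\mathcal{J}_{\mathrm{ICPO}}(\theta) \;=\;
\mathbb{E}_{x}\!\left[
\frac{1}{G + G'}\left(
\sum_{i=1}^{G} \ell^{\mathrm{clip}}_{i}(\theta)
\;+\;
\sum_{j=1}^{G'} \ell^{\mathrm{clip}}_{j}(\theta)
\right)\right],
\qquad
\tau_{1:G} \sim \pi_\theta(\cdot \mid x),\;\;
\tau_{1:G'} \sim \pi_\theta(\cdot \mid \mathcal{D}, x),
$$

where $\ell^{\mathrm{clip}}$ denotes the usual clipped, importance-weighted advantage term of GRPO.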
3. Expert Region Reject Sampling (ERRS)
Many expert-conditioned (IEF) rollouts may not yield correct or reliable solutions, and including suboptimal or noisy off-policy samples can destabilize learning. ICPO introduces Expert Region Reject Sampling (ERRS), a selective filtering mechanism:
- Define the Expert Region $\mathcal{E} = \{\tau : R(\tau) \geq \delta\}$, where $R(\tau)$ is the trajectory reward and the threshold $\delta$ is typically 1 (signifying verified correctness).
- Only expert rollouts in $\mathcal{E}$ enter the policy update; others are discarded.
- The ERRS operator restricts off-policy updates to the expert region, ensuring training stability and gradient reliability (a minimal filtering sketch follows this list).
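A minimal Python sketch of the ERRS filter, assuming a binary verifier reward and a threshold of 1; the function and argument names are illustrative, not an API from the paper:

```python
def errs_filter(ief_rollouts, reward_fn, threshold=1.0):
    """Expert Region Reject Sampling (sketch): keep only IEF rollouts whose
    verified reward reaches the correctness threshold."""
    expert_region = []
    for rollout in ief_rollouts:
        reward = reward_fn(rollout)      # e.g., 1.0 if the answer verifies, else 0.0
        if reward >= threshold:          # rollout lies inside the expert region
            expert_region.append((rollout, reward))
    return expert_region                 # only these enter the off-policy update
```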
4. Annealed Expert-Bonus Reward Shaping
ICPO deploys annealed reward shaping to balance imitation and exploration:
- During early training, an expert bonus is added to the reward of expert-region rollouts, $\tilde{R}(\tau) = R(\tau) + \lambda_t \,\mathbb{1}[\tau \in \mathcal{E}]$, with the coefficient $\lambda_t$ decaying linearly across iterations from a tunable initial scalar (a schedule sketch follows this list).
- This incentivizes initial convergence towards expert behaviors, but gradually anneals to encourage autonomous exploration, effectively preventing overfitting to demonstration data.
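A minimal Python sketch of the annealed bonus, assuming a linear decay over training iterations; the coefficient name and schedule are assumptions rather than the paper's exact choices:

```python
def expert_bonus(step, total_steps, initial_bonus=0.5):
    """Linearly annealed expert bonus (sketch): largest early in training,
    decaying to zero by the final iteration."""
    return initial_bonus * max(0.0, 1.0 - step / total_steps)

def shaped_reward(base_reward, in_expert_region, step, total_steps):
    """Add the annealed bonus only to rollouts inside the expert region."""
    return base_reward + (expert_bonus(step, total_steps) if in_expert_region else 0.0)
```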
5. Detailed Training Procedure
Algorithmically, ICPO alternates between on-policy and IEF off-policy rollouts for each RL batch:
- Generate standard on-policy rollouts using the current policy $\pi_\theta$.
- For IEF: construct input with randomly sampled demonstration examples and generate an ICL-conditioned trajectory.
- Apply ERRS: if the expert-conditioned trajectory meets the correctness threshold, it is retained for off-policy update and awarded the expert bonus.
- Mixed group-normalized advantage calculations are performed jointly over both rollout types.
- The combined policy objective (with clipping, KL regularization, and normalization) is optimized, as in the training-step sketch below.
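A condensed Python sketch of one ICPO batch, with all rollout, verification, and update helpers injected as parameters because they are hypothetical placeholders rather than a real API; the snippet only illustrates how the steps above fit together:

```python
import random

def icpo_step(policy, batch, demos, step, total_steps, *,
              sample_rollouts, build_icl_prompt, verify_reward, grpo_update,
              reward_threshold=1.0, initial_bonus=0.5, num_demos=2):
    """One ICPO iteration (sketch): mix on-policy and ICL-conditioned (IEF)
    rollouts, filter IEF rollouts with ERRS, shape their rewards with the
    annealed expert bonus, then run a group-normalized GRPO-style update."""
    bonus = initial_bonus * max(0.0, 1.0 - step / total_steps)  # annealed expert bonus
    groups = []
    for prompt in batch:
        # 1. Standard on-policy rollouts from the current policy.
        rollouts = sample_rollouts(policy, prompt)
        rewards = [verify_reward(r) for r in rollouts]

        # 2. IEF: prepend randomly sampled demonstrations and roll out again.
        icl_prompt = build_icl_prompt(random.sample(demos, num_demos), prompt)
        for rollout in sample_rollouts(policy, icl_prompt):
            score = verify_reward(rollout)
            # 3. ERRS: keep only IEF rollouts that verify as correct, and add
            #    the annealed bonus to their rewards.
            if score >= reward_threshold:
                rollouts.append(rollout)
                rewards.append(score + bonus)

        groups.append((prompt, rollouts, rewards))

    # 4. Group-normalized advantages over each mixed group, followed by the
    #    clipped policy update (delegated to the injected helper).
    return grpo_update(policy, groups)
```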
Mathematically, the overall ICPO policy objective applies the mixed-policy surrogate of Section 2 to the ERRS-filtered rollouts, with a normalization constant over the combined rollout group and the annealed expert bonus performing the reward shaping.
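As an illustrative sketch only (the symbols $\tilde{R}$, $\mathcal{G}_{\mathrm{on}}$, and $\mathcal{G}_{\mathrm{IEF}}$ are assumptions, not the paper's notation), the shaped rewards of the filtered mixed group enter the shared group normalization as:

$$
\hat{A}_i \;=\; \frac{\tilde{R}(\tau_i) \;-\; \operatorname{mean}_{\tau_j \in \mathcal{G}} \tilde{R}(\tau_j)}{\operatorname{std}_{\tau_j \in \mathcal{G}} \tilde{R}(\tau_j)},
\qquad
\mathcal{G} \;=\; \mathcal{G}_{\mathrm{on}} \,\cup\, \mathrm{ERRS}\big(\mathcal{G}_{\mathrm{IEF}}\big),
$$

with $\tilde{R}$ the bonus-shaped reward of Section 4.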
6. Experimental Results: Quantitative and Qualitative Performance
ICPO is systematically evaluated across Qwen3-1.7B and Qwen3-8B model sizes on both in-distribution (ID) and out-of-distribution (OOD) datasets, including OpenR1-Math-220k, MATH-500, and diverse reasoning/knowledge benchmarks (ARC, GPQA, MMLU).
- ID Benchmarks: ICPO outperforms GRPO, with average gains of +4.17 points (Qwen3-1.7B) and +2.15 points (Qwen3-8B); adding annealed reward shaping provides further gains, especially in expert domains.
- OOD Generalization: ICPO yields +2.37 gain on OOD tasks.
- Comparison to off-policy methods requiring external LLM rollouts (e.g., LUFFY): ICPO surpasses LUFFY without using any external data, and with its full mechanism it outperforms LUFFY by +2.79 points.
- Ablations: All three components—IEF, ERRS, reward shaping—are required for optimal performance; their removal degrades results.
ICPO consistently delivers enhanced exploration, improved reasoning correctness, and training stability across all tested regimes.
7. Impact and Implications
ICPO demonstrates that the ICL capability of LRMs is sufficient to provide internal expert guidance, eliminating dependence on large external models for trajectory generation. The combination of mixed-policy optimization, ERRS, and annealed reward shaping achieves scalable, robust RLVR post-training applicable to both ID and OOD settings. The framework’s main limitation is potential sensitivity to the diversity and quality of expert-like samples present in the demonstration corpus. Nonetheless, ICPO establishes a new standard for scalable, data-efficient, and stable RLVR training in large reasoning models, with applicability to advanced mathematical and symbolic reasoning.
| Component | Mechanism | Role/Efficacy |
|---|---|---|
| Mixed-Policy GRPO w/ IEF | Mix on-policy with ICL expert rollouts | Expands exploration |
| Expert Region Reject Sampling | Filter for high-reward expert rollouts | Stable, robust gradient updates |
| Annealed Expert-Bonus Reward Shaping | Decaying bonus for expert region | Fast initial convergence, avoids overfitting |
ICPO reframes RLVR as a process that leverages internal ICL instead of external expert policy dependence, enabling broad, stable, and efficient reasoning improvement (Huang et al., 30 Oct 2025).