In-Context Steered Policy Optimization
- ICPO is a reinforcement learning framework that uses in-context learning to provide implicit expert guidance, improving exploration and training stability.
- It employs a mixed-policy GRPO combining on-policy rollouts with off-policy implicit expert forcing and expert region filtering to enhance performance.
- The framework uses annealed reward shaping to balance early expert imitation with later autonomous exploration, boosting overall training efficacy.
In-Context Steered Policy Optimization (ICPO) is a reinforcement learning (RL) framework designed for large reasoning models (LRMs) under the paradigm of Reinforcement Learning from Verifiable Rewards (RLVR). ICPO leverages the in-context learning (ICL) capabilities of LRMs to provide implicit expert guidance within the optimization process, expanding policy coverage and enhancing training stability without reliance on external expert trajectories.
1. Motivation and Problem Setting
Group Relative Policy Optimization (GRPO) is a prevailing RLVR algorithm that drives policy improvement via reward maximization on verified correct trajectories. However, GRPO restricts exploration by sampling exclusively from the current policy’s distribution (on-policy rollouts), often resulting in inadequate trajectory diversity and convergence to local optima. Alternative methods that supplement training with trajectories sampled from stronger external expert models increase computational burden and require access to advanced resources, which may be impractical.
ICPO directly addresses these challenges by exploiting the inherent ICL ability of LRMs to supply expert guidance from existing datasets, eliminating the need for external rollouts. It establishes a scalable and effective mechanism for post-training large-scale reasoning models, with a particular focus on mathematical and symbolic reasoning domains (Huang et al., 30 Oct 2025).
2. Principal Mechanisms: Mixed-Policy GRPO with Implicit Expert Forcing
ICPO introduces Mixed-Policy GRPO, in which the policy is trained on both on-policy and off-policy rollouts:
- On-policy trajectories are sampled as usual from the current policy $\pi_\theta$.
- Off-policy trajectories are generated by conditioning the policy on few-shot, high-quality expert demonstrations (via ICL), termed Implicit Expert Forcing (IEF); a minimal prompt-construction sketch follows this list.
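A minimal sketch of what the ICL conditioning could look like in practice; the prompt template and field names are illustrative assumptions, not taken from the source:

```python
def build_icl_prompt(demonstrations, question):
    """Prepend few-shot expert demonstrations to the question so that the same
    policy samples from an expert-conditioned (IEF) distribution."""
    shots = "\n\n".join(
        f"Problem: {d['problem']}\nSolution: {d['solution']}" for d in demonstrations
    )
    return f"{shots}\n\nProblem: {question}\nSolution:"
```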
Formally, for policy $\pi_\theta$, conditioning on a few-shot demonstration set $\mathcal{D}$ induces an implicit “task vector” and defines the IEF policy $\pi_{\mathrm{IEF}}(\cdot \mid x) = \pi_\theta(\cdot \mid \mathcal{D}, x)$, i.e., the same model sampled under an expert-augmented prompt.
The Mixed-Policy GRPO objective combines the GRPO surrogate terms computed on both on-policy and IEF rollouts within a single group.
The group-normalized advantage is calculated across both trajectory types, allowing ICPO to “think outside the policy” and explore regions of the solution space inaccessible to on-policy updates alone.
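As a minimal sketch only, assuming standard GRPO notation (the group sizes $G$ and $G'$ and the clipped surrogate $\ell^{\mathrm{clip}}$ are assumptions, not the paper's exact formulation), the mixed objective can be written as:

$$
\mathcal{J}_{\mathrm{ICPO}}(\theta) \;=\;
\mathbb{E}_{x}\!\left[
\frac{1}{G + G'}\left(
\sum_{i=1}^{G} \ell^{\mathrm{clip}}_{i}(\theta)
\;+\;
\sum_{j=1}^{G'} \ell^{\mathrm{clip}}_{j}(\theta)
\right)\right],
\qquad
\tau_{1:G} \sim \pi_\theta(\cdot \mid x),\;\;
\tau_{1:G'} \sim \pi_\theta(\cdot \mid \mathcal{D}, x),
$$

where $\ell^{\mathrm{clip}}$ denotes the usual clipped, importance-weighted advantage term of GRPO.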
3. Expert Region Reject Sampling (ERRS)
Many expert-conditioned (IEF) rollouts may not yield correct or reliable solutions, and including suboptimal or noisy off-policy samples can destabilize learning. ICPO introduces Expert Region Reject Sampling (ERRS), a selective filtering mechanism:
- Define the Expert Region $\mathcal{E} = \{\tau : R(\tau) \geq \delta\}$, where $R(\tau)$ is the trajectory reward and the threshold $\delta$ is typically 1 (signifying verified correctness).
- Only expert rollouts in $\mathcal{E}$ enter the policy update; others are discarded.
- The ERRS operator restricts off-policy updates to the expert region, ensuring training stability and gradient reliability (a minimal filtering sketch follows this list).
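A minimal Python sketch of the ERRS filter, assuming a binary verifier reward and a threshold of 1; the function and argument names are illustrative, not an API from the paper:

```python
def errs_filter(ief_rollouts, reward_fn, threshold=1.0):
    """Expert Region Reject Sampling (sketch): keep only IEF rollouts whose
    verified reward reaches the correctness threshold."""
    expert_region = []
    for rollout in ief_rollouts:
        reward = reward_fn(rollout)      # e.g., 1.0 if the answer verifies, else 0.0
        if reward >= threshold:          # rollout lies inside the expert region
            expert_region.append((rollout, reward))
    return expert_region                 # only these enter the off-policy update
```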
4. Annealed Expert-Bonus Reward Shaping
ICPO deploys annealed reward shaping to balance imitation and exploration:
- During early training, an expert bonus is added to the reward of expert-region rollouts, $\tilde{R}(\tau) = R(\tau) + \lambda_t \,\mathbb{1}[\tau \in \mathcal{E}]$, with the coefficient $\lambda_t$ decaying linearly across iterations from a tunable initial scalar (a schedule sketch follows this list).
- This incentivizes initial convergence towards expert behaviors, but gradually anneals to encourage autonomous exploration, effectively preventing overfitting to demonstration data.
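A minimal Python sketch of the annealed bonus, assuming a linear decay over training iterations; the coefficient name and schedule are assumptions rather than the paper's exact choices:

```python
def expert_bonus(step, total_steps, initial_bonus=0.5):
    """Linearly annealed expert bonus (sketch): largest early in training,
    decaying to zero by the final iteration."""
    return initial_bonus * max(0.0, 1.0 - step / total_steps)

def shaped_reward(base_reward, in_expert_region, step, total_steps):
    """Add the annealed bonus only to rollouts inside the expert region."""
    return base_reward + (expert_bonus(step, total_steps) if in_expert_region else 0.0)
```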
5. Detailed Training Procedure
Algorithmically, ICPO alternates between on-policy and IEF off-policy rollouts for each RL batch:
- Generate standard on-policy rollouts using the current policy $\pi_\theta$.
- For IEF: construct input with randomly sampled demonstration examples and generate an ICL-conditioned trajectory.
- Apply ERRS: if the expert-conditioned trajectory meets the correctness threshold, it is retained for off-policy update and awarded the expert bonus.
- Mixed group-normalized advantage calculations are performed jointly over both rollout types.
- The combined policy objective (with clipping, KL regularization, and normalization) is optimized, as in the training-step sketch below.
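A condensed Python sketch of one ICPO batch, with all rollout, verification, and update helpers injected as parameters because they are hypothetical placeholders rather than a real API; the snippet only illustrates how the steps above fit together:

```python
import random

def icpo_step(policy, batch, demos, step, total_steps, *,
              sample_rollouts, build_icl_prompt, verify_reward, grpo_update,
              reward_threshold=1.0, initial_bonus=0.5, num_demos=2):
    """One ICPO iteration (sketch): mix on-policy and ICL-conditioned (IEF)
    rollouts, filter IEF rollouts with ERRS, shape their rewards with the
    annealed expert bonus, then run a group-normalized GRPO-style update."""
    bonus = initial_bonus * max(0.0, 1.0 - step / total_steps)  # annealed expert bonus
    groups = []
    for prompt in batch:
        # 1. Standard on-policy rollouts from the current policy.
        rollouts = sample_rollouts(policy, prompt)
        rewards = [verify_reward(r) for r in rollouts]

        # 2. IEF: prepend randomly sampled demonstrations and roll out again.
        icl_prompt = build_icl_prompt(random.sample(demos, num_demos), prompt)
        for rollout in sample_rollouts(policy, icl_prompt):
            score = verify_reward(rollout)
            # 3. ERRS: keep only IEF rollouts that verify as correct, and add
            #    the annealed bonus to their rewards.
            if score >= reward_threshold:
                rollouts.append(rollout)
                rewards.append(score + bonus)

        groups.append((prompt, rollouts, rewards))

    # 4. Group-normalized advantages over each mixed group, followed by the
    #    clipped policy update (delegated to the injected helper).
    return grpo_update(policy, groups)
```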
Mathematically, the overall ICPO policy objective applies the mixed-policy surrogate of Section 2 to the ERRS-filtered rollouts, with a normalization constant over the combined rollout group and the annealed expert bonus performing the reward shaping.
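As an illustrative sketch only (the symbols $\tilde{R}$, $\mathcal{G}_{\mathrm{on}}$, and $\mathcal{G}_{\mathrm{IEF}}$ are assumptions, not the paper's notation), the shaped rewards of the filtered mixed group enter the shared group normalization as:

$$
\hat{A}_i \;=\; \frac{\tilde{R}(\tau_i) \;-\; \operatorname{mean}_{\tau_j \in \mathcal{G}} \tilde{R}(\tau_j)}{\operatorname{std}_{\tau_j \in \mathcal{G}} \tilde{R}(\tau_j)},
\qquad
\mathcal{G} \;=\; \mathcal{G}_{\mathrm{on}} \,\cup\, \mathrm{ERRS}\big(\mathcal{G}_{\mathrm{IEF}}\big),
$$

with $\tilde{R}$ the bonus-shaped reward of Section 4.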
6. Experimental Results: Quantitative and Qualitative Performance
ICPO is systematically evaluated across Qwen3-1.7B and Qwen3-8B model sizes on both in-distribution (ID) and out-of-distribution (OOD) datasets, including OpenR1-Math-220k, MATH-500, and diverse reasoning/knowledge benchmarks (ARC, GPQA, MMLU).
- ID Benchmarks: ICPO outperforms GRPO, with average gains of +4.17 points (Qwen3-1.7B) and +2.15 points (Qwen3-8B); adding annealed reward shaping provides further gains, especially in expert domains.
- OOD Generalization: ICPO yields +2.37 gain on OOD tasks.
- Comparison to off-policy methods requiring external LLM rollouts (e.g., LUFFY): ICPO surpasses LUFFY without using any external data, and with its full mechanism it outperforms LUFFY by +2.79 points.
- Ablations: All three components—IEF, ERRS, reward shaping—are required for optimal performance; their removal degrades results.
ICPO consistently delivers enhanced exploration, improved reasoning correctness, and training stability across all tested regimes.
7. Impact and Implications
ICPO demonstrates that the ICL capability of LRMs is sufficient to provide internal expert guidance, eliminating dependence on large external models for trajectory generation. The combination of mixed-policy optimization, ERRS, and annealed reward shaping achieves scalable, robust RLVR post-training applicable to both ID and OOD settings. The framework’s main limitation is potential sensitivity to the diversity and quality of expert-like samples present in the demonstration corpus. Nonetheless, ICPO establishes a new standard for scalable, data-efficient, and stable RLVR training in large reasoning models, with applicability to advanced mathematical and symbolic reasoning.
| Component | Mechanism | Role/Efficacy |
|---|---|---|
| Mixed-Policy GRPO w/ IEF | Mix on-policy with ICL expert rollouts | Expands exploration |
| Expert Region Reject Sampling | Filter for high-reward expert rollouts | Stable, robust gradient updates |
| Annealed Expert-Bonus Reward Shaping | Decaying bonus for expert region | Fast initial convergence, avoids overfitting |
ICPO reframes RLVR as a process that leverages internal ICL instead of external expert policy dependence, enabling broad, stable, and efficient reasoning improvement (Huang et al., 30 Oct 2025).