Intrinsic Reward Policy Optimization (IRPO)

Updated 4 July 2026

IRPO is a reinforcement learning framework that uses intrinsic rewards solely for exploration while optimizing the base policy on extrinsic rewards.
It creates multiple exploratory policy branches through intrinsic updates and aggregates extrinsic evaluations to update the base policy effectively.
IRPO avoids the pitfalls of mixing rewards by backpropagating extrinsic gradients through intrinsic updates, enhancing credit assignment in sparse-reward settings.

Intrinsic Reward Policy Optimization (IRPO) denotes a policy-optimization framework for sparse-reward reinforcement learning in which intrinsic rewards are used strictly as exploration mechanisms, while the base policy is still optimized for the extrinsic task reward through a surrogate policy gradient. In the formulation introduced for sparse-reward environments, IRPO avoids both direct optimization of a mixed extrinsic–intrinsic reward and hierarchical pretraining of subpolicies. Instead, it creates multiple exploratory policies from a base policy, updates those exploratory policies with intrinsic rewards, evaluates them on extrinsic reward, and backpropagates the resulting extrinsic-policy gradients through the exploratory updates to improve the base policy (Cho et al., 29 Jan 2026).

1. Conceptual scope and terminological usage

In reinforcement learning, an extrinsic reward is the task-defining signal supplied by the environment, whereas an intrinsic reward is an auxiliary signal intended to guide exploration or shape learning dynamics. IRPO is specifically concerned with settings in which extrinsic rewards are so sparse that the ordinary policy gradient becomes weak or effectively uninformative, yet directly mixing intrinsic and extrinsic rewards can distort credit assignment (Cho et al., 29 Jan 2026).

The defining feature of IRPO is that intrinsic rewards are not treated as alternate objectives for the final policy. They are used to induce exploratory updates in auxiliary policy branches, and the base policy is then trained through the extrinsic evaluations of those branches. This distinguishes IRPO from additive reward shaping schemes, in which the agent optimizes a combined signal such as $r_t^{E}+\lambda r_t^{I}$, and from hierarchical methods that pretrain subpolicies under intrinsic rewards (Zheng et al., 2018).

The acronym IRPO is not unique in the recent literature. It is also used for Intergroup Relative Preference Optimization in reward modeling for language-model post-training, where the objective is to scale Bradley-Terry-style preference learning with pointwise generative reward models rather than to address sparse-reward exploration (Song et al., 2 Jan 2026). The acronym is likewise used for a GRPO-based image-restoration post-training paradigm that converts a restoration network into a stochastic policy optimized on a reward mixture reflecting structural fidelity, perceptual preference, and task-aware criteria (Liu et al., 30 Nov 2025). In technical usage, therefore, “IRPO” is context-dependent.

2. Motivation: sparse rewards, vanishing gradients, and the limits of standard remedies

The sparse-reward IRPO framework is motivated by a precise failure mode of conventional policy-gradient RL. The environment is modeled as an MDP

$M = (\mathcal S,\mathcal A,T,R,\gamma),$

with bounded reward $R:\mathcal S\times\mathcal A\to [0,R_{\max}]$ and discounted return

$g_t = \sum_{l=0}^\infty \gamma^l r_{t+l}.$

Under the paper’s sparsity assumption,

$0 < \Pr_{d^\pi}(R(s,a)>0)\le \epsilon,\qquad \epsilon\in(0,1),$

positive reward is rare, and the standard policy gradient

$\nabla_\theta J(\theta)=\mathbb E_{d^{\pi_\theta}}\!\left[Q^{\pi_\theta}(s,a)\nabla_\theta\log \pi_\theta(a\mid s)\right]$

vanishes as sparsity increases. The paper states this as Corollary 3.1:

$\|\nabla_\theta J(\theta)\|_2 \to 0 \quad \text{as } \epsilon\to 0.$

The central implication is that the true gradient may provide almost no actionable signal before the agent reaches rewarding regions of the state space (Cho et al., 29 Jan 2026).
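The vanishing-gradient effect can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a one-step softmax bandit in which a single action pays reward 1, so that the probability of seeing reward under the current policy plays the role of $\epsilon$. The function names and the logit values in `sweep` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def policy_gradient_norm(logits, goal):
    """Exact L2 norm of grad J for a one-step bandit with reward 1 on `goal`.

    Here J = pi[goal], so dJ/dlogit_i = pi[goal] * (1[i == goal] - pi[i]).
    Returns (gradient norm, probability of the rewarded action).
    """
    pi = softmax(logits)
    grad = [pi[goal] * ((1.0 if i == goal else 0.0) - p) for i, p in enumerate(pi)]
    return math.sqrt(sum(g * g for g in grad)), pi[goal]

def sweep(goal_logits=(0.0, -2.0, -4.0, -8.0), n_actions=4):
    """Lower the rewarded action's logit and record (epsilon, gradient norm)."""
    rows = []
    for gl in goal_logits:
        logits = [0.0] * n_actions
        logits[0] = gl  # action 0 is the only rewarded action
        norm, eps = policy_gradient_norm(logits, 0)
        rows.append((eps, norm))
    return rows

for eps, norm in sweep():
    print(f"epsilon={eps:.5f}  grad_norm={norm:.5f}")
```

In this toy setting the gradient norm is bounded by $\sqrt{2}\,\epsilon$, so it shrinks at least linearly as the rewarded action becomes less likely.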

The paper identifies three common responses to sparse rewards and their associated limitations. First, pure extrinsic policy gradients fail because reward observations are too infrequent. Second, augmenting extrinsic reward with intrinsic reward can improve exploration but often harms credit assignment because the optimized objective is no longer the original task reward. Third, hierarchical RL preserves extrinsic credit assignment more directly, but incurs sample inefficiency and sub-optimality through subpolicy pretraining and temporally extended actions (Cho et al., 29 Jan 2026).

This diagnosis aligns with earlier intrinsic-reward work but departs from it operationally. LIRPG, for example, learns an intrinsic reward function $r^{in}_\eta(s,a)$ for policy-gradient agents and updates the policy on the additive objective

$J^{ex+in} = \mathbb{E}_{\theta}\left[\sum_{t=0}^{\infty}\gamma^t \big(r^{ex}_t + \lambda r^{in}_\eta(s_t,a_t)\big)\right],$

while training the intrinsic-reward parameters only to improve eventual extrinsic return (Zheng et al., 2018). EIPO, by contrast, frames intrinsic-reward balancing as a constrained optimization problem and uses a Lagrange multiplier $\alpha$ to adjust the relative weight of intrinsic exploration and extrinsic exploitation dynamically (Chen et al., 2022). IRPO’s distinctive claim is that one can use intrinsic rewards for exploration without training the base policy on a mixed reward at all (Cho et al., 29 Jan 2026).
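To make the concern about additive objectives concrete, the following sketch compares two hypothetical trajectories under the extrinsic return and under a mixed return of the form $r^{ex}_t+\lambda r^{in}_t$. The trajectories, $\lambda=0.5$, and $\gamma=0.9$ are illustrative assumptions, not values from any of the cited papers.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t, accumulated backwards for numerical simplicity."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def mixed_return(ext, intr, lam, gamma):
    """Discounted return of the additive signal r_ext + lam * r_int."""
    return discounted_return([e + lam * i for e, i in zip(ext, intr)], gamma)

GAMMA, LAM = 0.9, 0.5

# Trajectory A reaches the goal at the last step and earns no intrinsic bonus.
a_ext, a_int = [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]
# Trajectory B never reaches the goal but collects an intrinsic bonus every step.
b_ext, b_int = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]

ext_a = discounted_return(a_ext, GAMMA)
ext_b = discounted_return(b_ext, GAMMA)
mix_a = mixed_return(a_ext, a_int, LAM, GAMMA)
mix_b = mixed_return(b_ext, b_int, LAM, GAMMA)

print("extrinsic prefers A:", ext_a > ext_b)
print("mixed prefers B:   ", mix_b > mix_a)
```

The ranking flips once the bonus is added, which is the sense in which a mixed objective no longer measures the original task.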

3. Core algorithmic construction

IRPO assumes access to multiple intrinsic reward functions

$\{\tilde R_k\}_{k=1}^{K}.$

For each intrinsic reward, the method creates an exploratory branch by initializing

$\tilde\theta_k^{(0)} = \theta,$

where $\theta$ parameterizes the current base policy. Each branch is then updated for $N$ steps using the intrinsic objective. At exploratory step $j\in\{0,\dots,N-1\}$,

$\tilde\theta_k^{(j+1)} = \tilde\theta_k^{(j)} + \mu\,\nabla \tilde J_k\big(\tilde\theta_k^{(j)}\big),$

with intrinsic policy gradient

$\nabla \tilde J_k(\tilde\theta)=\mathbb E_{d^{\pi_{\tilde\theta}}}\!\left[\tilde Q_k^{\pi_{\tilde\theta}}(s,a)\,\nabla_{\tilde\theta}\log \pi_{\tilde\theta}(a\mid s)\right].$

The implementation uses an actor-critic style approximation

$\nabla \tilde J_k(\tilde\theta)\approx\hat{\mathbb E}\!\left[\hat A_k(s,a)\,\nabla_{\tilde\theta}\log \pi_{\tilde\theta}(a\mid s)\right].$

Here $\mu>0$ is the exploratory step size, $\tilde J_k$ and $\tilde Q_k$ are the return and action value under $\tilde R_k$, and $\hat A_k$ is a critic-based advantage estimate. Thus, each branch performs ordinary intrinsic-reward policy improvement, but only within its own exploratory trajectory (Cho et al., 29 Jan 2026).
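The branch construction described above can be sketched on a toy problem. This is a minimal illustration, assuming a one-step softmax bandit with exact gradients in place of the sampled actor-critic estimate; `explore_branch`, `make_branches`, and the one-hot intrinsic rewards are hypothetical names and choices, not the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward_grad(theta, reward):
    """Exact gradient of sum_a pi(a) * reward[a] w.r.t. the softmax logits."""
    pi = softmax(theta)
    baseline = sum(p * r for p, r in zip(pi, reward))
    return [p * (r - baseline) for p, r in zip(pi, reward)]

def explore_branch(theta, intrinsic_reward, n_steps, step_size):
    """Clone the base parameters and take n_steps of intrinsic gradient ascent."""
    branch = list(theta)  # copy, so the base parameters are never modified
    for _ in range(n_steps):
        grad = expected_reward_grad(branch, intrinsic_reward)
        branch = [b + step_size * g for b, g in zip(branch, grad)]
    return branch

def make_branches(theta, intrinsic_rewards, n_steps, step_size):
    """One exploratory branch per intrinsic reward, all started from theta."""
    return [explore_branch(theta, r, n_steps, step_size) for r in intrinsic_rewards]

base_theta = [0.0, 0.0, 0.0]
# Hypothetical intrinsic rewards: each one favours a different action.
rewards = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
branches = make_branches(base_theta, rewards, n_steps=20, step_size=1.0)
for k, branch in enumerate(branches):
    print(k, [round(p, 3) for p in softmax(branch)])
```

Each branch drifts toward the action its own intrinsic reward favours, while `base_theta` is left unchanged for the later extrinsic update.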

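After the $N$ intrinsic updates, each final exploratory policy, with parameters $\tilde\theta_k^{(N)}$ for branch $k=1,\dots,K$, is evaluated on the extrinsic objective

$J\big(\tilde\theta_k^{(N)}\big).$

The extrinsic policy gradient for branch $k$ is estimated at the final exploratory parameters:

$\nabla_{\tilde\theta_k^{(N)}} J\big(\tilde\theta_k^{(N)}\big)=\mathbb E_{d^{\pi}}\!\left[Q^{\pi}(s,a)\,\nabla\log \pi(a\mid s)\right]\Big|_{\pi=\pi_{\tilde\theta_k^{(N)}}}.$

IRPO then backpropagates this extrinsic gradient through the entire sequence of exploratory updates by storing the Jacobians

$\frac{\partial \tilde\theta_k^{(j+1)}}{\partial \tilde\theta_k^{(j)}},\qquad j=0,\dots,N-1,$

and applying the chain rule:

$\nabla_\theta J\big(\tilde\theta_k^{(N)}\big)=\left(\frac{\partial \tilde\theta_k^{(N)}}{\partial \tilde\theta_k^{(N-1)}}\cdots\frac{\partial \tilde\theta_k^{(1)}}{\partial \tilde\theta_k^{(0)}}\right)^{\!\top}\nabla_{\tilde\theta_k^{(N)}} J\big(\tilde\theta_k^{(N)}\big).$

The surrogate IRPO gradient is then defined as

$\hat\nabla_\theta J_{\mathrm{IRPO}}(\theta)=\sum_{k=1}^{K} w_k\,\nabla_\theta J\big(\tilde\theta_k^{(N)}\big),$

with weights

$w_k=\frac{\exp\!\big(J(\tilde\theta_k^{(N)})/\tau\big)}{\sum_{k'=1}^{K}\exp\!\big(J(\tilde\theta_{k'}^{(N)})/\tau\big)},$

where $\tau>0$ is a temperature controlling whether the aggregation is closer to averaging (large $\tau$) or to selecting the best branch ($\tau\to 0$) (Cho et al., 29 Jan 2026).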
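The chain-rule step and the temperature-weighted aggregation can be checked on a deliberately small example. The sketch below assumes a scalar-parameter Bernoulli policy with $p=\sigma(\theta)$, exact gradients, and a softmax over branch returns as the weighting rule; with one parameter each "Jacobian" is a single number. All function names and constants are hypothetical and chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def extrinsic_return(theta):
    """J(theta) = p, the probability of the single rewarded action."""
    return sigmoid(theta)

def extrinsic_grad(theta):
    p = sigmoid(theta)
    return p * (1.0 - p)

def intrinsic_grad(theta, bonus):
    """Gradient of the intrinsic objective p * bonus (bonus may be negative)."""
    p = sigmoid(theta)
    return bonus * p * (1.0 - p)

def intrinsic_grad_prime(theta, bonus):
    """Derivative of intrinsic_grad w.r.t. theta, needed for the Jacobian."""
    p = sigmoid(theta)
    return bonus * p * (1.0 - p) * (1.0 - 2.0 * p)

def branch_gradient(theta, bonus, n_steps, step_size):
    """Return (J at branch end, dJ(branch end)/d theta) via stored Jacobians."""
    x = theta
    jacobians = []
    for _ in range(n_steps):
        # d x_next / d x for the update x_next = x + step_size * intrinsic_grad(x)
        jacobians.append(1.0 + step_size * intrinsic_grad_prime(x, bonus))
        x = x + step_size * intrinsic_grad(x, bonus)
    g = extrinsic_grad(x)           # extrinsic gradient at the branch endpoint
    for jac in reversed(jacobians):  # chain rule back to the base parameter
        g = jac * g
    return extrinsic_return(x), g

def irpo_surrogate_gradient(theta, bonuses, n_steps, step_size, tau):
    """Softmax-weighted sum of backpropagated branch gradients (tau > 0)."""
    results = [branch_gradient(theta, b, n_steps, step_size) for b in bonuses]
    returns = [r for r, _ in results]
    m = max(returns)
    exps = [math.exp((r - m) / tau) for r in returns]
    z = sum(exps)
    weights = [e / z for e in exps]
    return sum(w * g for w, (_, g) in zip(weights, results)), weights

grad, weights = irpo_surrogate_gradient(-4.0, [2.0, -2.0], n_steps=5, step_size=1.0, tau=0.1)
print("surrogate gradient:", grad, "weights:", weights)
```

A finite-difference check of `branch_gradient` against the branch-end return is a cheap way to confirm that the stored Jacobians are multiplied in the right order.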

The high-level loop is bi-level. At each iteration, the algorithm clones the base policy into $K$ exploratory policies, performs intrinsic updates while storing Jacobians, evaluates each final branch on extrinsic reward, backpropagates the branchwise extrinsic gradients through the intrinsic update paths, aggregates them, and updates the base policy through a trust-region step:

$\theta \leftarrow \arg\max_{\theta'}\ \hat\nabla_\theta J_{\mathrm{IRPO}}(\theta)^{\top}(\theta'-\theta)\quad\text{s.t.}\quad \bar D_{\mathrm{KL}}\big(\pi_\theta\,\|\,\pi_{\theta'}\big)\le\delta,$

where $\hat\nabla_\theta J_{\mathrm{IRPO}}(\theta)$ is the aggregated surrogate gradient and $\delta$ is the KL threshold.

The paper further notes that the Jacobian product is computed efficiently with automatic differentiation and vector-Jacobian products, reducing backward-pass complexity relative to materializing each $d\times d$ Jacobian, where $d$ is the number of parameters (Cho et al., 29 Jan 2026).
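The ordering behind the vector-Jacobian trick can be shown with plain lists. The sketch below is illustrative only: it still stores explicit Jacobians, so it demonstrates just that the chain rule can be accumulated as a sequence of transposed matrix-vector products instead of matrix-matrix products. An automatic-differentiation framework would additionally avoid forming the Jacobians at all. The matrices and names are hypothetical.

```python
def matvec_transpose(mat, vec):
    """Compute mat^T @ vec for a list-of-rows matrix."""
    n_rows, n_cols = len(mat), len(mat[0])
    return [sum(mat[i][j] * vec[i] for i in range(n_rows)) for j in range(n_cols)]

def matmul(a, b):
    """Plain matrix product a @ b for list-of-rows matrices."""
    inner, n_cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(n_cols)]
            for row in a]

def backprop_vjp(jacobians, grad_final):
    """jacobians[j] = d theta^(j+1) / d theta^(j); returns dJ / d theta^(0).

    Walks the update path backwards, applying one transposed Jacobian per step.
    """
    g = list(grad_final)
    for jac in reversed(jacobians):
        g = matvec_transpose(jac, g)
    return g

def backprop_explicit(jacobians, grad_final):
    """Same quantity, but first forms the full product J_{N-1} ... J_0."""
    total = jacobians[0]
    for jac in jacobians[1:]:
        total = matmul(jac, total)
    return matvec_transpose(total, grad_final)

jacs = [
    [[1.0, 0.1], [0.0, 1.0]],
    [[0.9, 0.0], [0.2, 1.1]],
    [[1.0, -0.3], [0.5, 1.0]],
]
g_final = [1.0, -2.0]
print(backprop_vjp(jacs, g_final))
print(backprop_explicit(jacs, g_final))
```

Both routes return the same vector; the first never builds a product of Jacobians, which is the part of the saving that carries over to the autodiff setting.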

4. Objective, bias, and formal interpretation

IRPO is explicit about the fact that its surrogate gradient is biased relative to the true gradient of $J(\theta)$. The base update does not differentiate the extrinsic return of the current base policy directly. Instead, it differentiates the extrinsic return of policies obtained after intrinsic-guided exploratory optimization. Consequently, IRPO is optimizing an implicit objective over reachable exploratory branches, not the ordinary extrinsic objective at the base parameters (Cho et al., 29 Jan 2026).

The paper formalizes this reachable-set perspective. After $N$ intrinsic updates under the $k$-th intrinsic reward $\tilde R_k$, a final exploratory policy can be written in simplified notation as

$\tilde\theta_k(\theta),\qquad k=1,\dots,K.$

The set of all exploratory policies reachable from a base parameter $\theta$ is denoted $\tilde\Theta(\theta)$, and the union across all base policies is

$\tilde\Theta=\bigcup_{\theta}\tilde\Theta(\theta).$

With temperature annealed to zero, the paper’s Remark 3.4 states that IRPO effectively searches for

$\max_{\tilde\theta\in\tilde\Theta} J(\tilde\theta),$

and if

$\tilde\theta^{\dagger}\in\arg\max_{\tilde\theta\in\tilde\Theta} J(\tilde\theta),$

then the output policy corresponds to

$\pi_{\tilde\theta^{\dagger}}.$

If the optimal policy parameters $\theta^{*}$ lie in the reachable set $\tilde\Theta$, the paper states that IRPO can recover optimality (Cho et al., 29 Jan 2026).

This interpretation clarifies both the promise and the limitation of the method. The promise is that the algorithm can “look through” exploratory updates that have already reached informative parts of the state space, even when the base-policy gradient is close to zero. The limitation is that the search space is restricted to what can be reached by $N$ intrinsic-guided updates from some base policy. Performance therefore depends on the diversity and quality of the intrinsic rewards, the number of branches $K$, and the exploration depth $N$ (Cho et al., 29 Jan 2026).
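One way to read the zero-temperature limit is through the log-sum-exp function. The identity below is standard, but applying it here assumes that the branch weights take a softmax form over branch returns with temperature $\tau$. The shorthand $J_k(\theta)$, for the extrinsic return of branch $k$ started from base parameter $\theta$, is introduced only for this illustration.

```latex
% Assumption: softmax weights over branch returns; J_k(\theta) is shorthand introduced here.
w_k(\theta) = \frac{\exp\!\big(J_k(\theta)/\tau\big)}{\sum_{k'=1}^{K}\exp\!\big(J_{k'}(\theta)/\tau\big)}
\quad\Longrightarrow\quad
\sum_{k=1}^{K} w_k(\theta)\,\nabla_\theta J_k(\theta)
= \nabla_\theta\Big[\tau\log\sum_{k=1}^{K}\exp\!\big(J_k(\theta)/\tau\big)\Big],

\max_{k} J_k(\theta)
\;\le\; \tau\log\sum_{k=1}^{K}\exp\!\big(J_k(\theta)/\tau\big)
\;\le\; \max_{k} J_k(\theta) + \tau\log K .
```

Under that assumption the aggregated direction is the exact gradient of a smooth upper bound on the best branch return, and the gap of at most $\tau\log K$ closes as $\tau\to 0$. This is consistent with describing the annealed method as a search for the best reachable exploratory policy.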

A related misconception is that IRPO simply replaces one reward with another. The sparse-reward formulation does not optimize intrinsic reward as the final target, and it does not learn a fixed intrinsic–extrinsic coefficient. By comparison, EIPO solves a constrained dual problem in which a mixed policy maximizes the combined objective $J_{E+I}$ while maintaining extrinsic optimality, with the effective reward weighting becoming $(1+\alpha)\,r^{E}+r^{I}$ and $\alpha$ updated by

$\alpha \leftarrow \alpha-\beta\,\big(J_E(\pi_{E+I})-J_E(\pi_E)\big).$

Here $\beta$ is the multiplier step size, $\pi_{E+I}$ is the mixed policy, and $\pi_E$ is the extrinsic-only policy. IRPO instead keeps intrinsic rewards inside the exploratory branches and uses extrinsic performance to determine which exploratory directions matter for the base update (Chen et al., 2022).

5. Empirical profile, ablations, and implementation trade-offs

The sparse-reward IRPO paper evaluates the method on nine tasks: three discrete environments (Maze-v1, Maze-v2, and FourRooms) and six continuous environments (PointMaze-v1, PointMaze-v2, FetchReach, AntMaze-v1, AntMaze-v2, and AntMaze-v3). Rewards are typically granted upon reaching the goal. In AntMaze, the experiments additionally impose a penalty and termination if the ant flips or jumps aggressively, and the success threshold is loosened from 0.35 to 0.5 to avoid unnatural jumping behavior (Cho et al., 29 Jan 2026).

The baselines are HRL-ALLO, DRND, PSNE, PPO, and TRPO; IRPO and HRL use the same intrinsic rewards derived from ALLO. Across these environments, the paper reports that IRPO generally achieves the highest converged performance, narrow confidence intervals, and stronger robustness across environment difficulty. It further states that IRPO outperforms all baselines in almost all environments, that HRL-ALLO is competitive in several settings but still underperforms IRPO, and that IRPO succeeds in FetchReach where HRL-ALLO completely fails. By contrast, direct extrinsic methods such as PPO, TRPO, and PSNE struggle because exploration is insufficient, while DRND provides little gain over PPO and fails in some tasks because of poor credit assignment (Cho et al., 29 Jan 2026).

The sample-efficiency picture is more qualified. The paper states that IRPO is often more sample-efficient than HRL-ALLO because it avoids pretraining separate subpolicies, but it can be less sample-efficient than simpler baselines on easier tasks because it incurs the extra cost of exploratory updates. This suggests that IRPO is best understood as a targeted response to hard sparse-reward problems rather than as a uniformly cheaper replacement for standard policy-gradient methods (Cho et al., 29 Jan 2026).

The ablations are central to the method’s characterization. Trust-region updates reduce variance and improve stability relative to standard gradient updates. For the exploration depth $N$, too small a value gives insufficient exploration and poor final performance, too large a value is costly and worsens sample efficiency, and an intermediate value provides the best balance in the reported experiments. Replacing the backpropagated surrogate with an importance-sampling estimator yields high variance and poor performance, which the paper uses to justify differentiation through the exploratory updates. Blending the IRPO gradient with the true gradient, whether abruptly or gradually, hurts performance and stability, plausibly because the two objectives induce different optimization landscapes. When random intrinsic rewards are substituted for meaningful ones, IRPO degrades but still often outperforms HRL using the same random rewards (Cho et al., 29 Jan 2026).

Several practical design choices recur in the reported implementation: annealing the temperature $\tau$ from 1 to 0 over the first 10% of training, using a small trust-region KL threshold, computing vector-Jacobian products with automatic differentiation, and selecting the exploration depth $N$ to balance exploration power against computational overhead (Cho et al., 29 Jan 2026).
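The annealing choice can be sketched as a schedule plus a weighting rule. The source states only the endpoints (1 to 0) and the window (first 10% of training); the linear shape, the softmax weighting over branch returns, and the tie handling at $\tau=0$ are assumptions made for this illustration, as are the function names.

```python
import math

def temperature(step, total_steps, tau_start=1.0, tau_end=0.0, anneal_fraction=0.1):
    """Linear anneal from tau_start to tau_end over the first anneal_fraction of training."""
    horizon = max(1, round(anneal_fraction * total_steps))
    frac = min(1.0, step / horizon)
    return tau_start + frac * (tau_end - tau_start)

def branch_weights(returns, tau):
    """Softmax over branch returns; at tau == 0 fall back to hard best-branch selection."""
    if tau <= 1e-8:
        best = max(returns)
        winners = [1.0 if r == best else 0.0 for r in returns]
        n_winners = sum(winners)
        return [w / n_winners for w in winners]  # ties share the weight equally
    m = max(returns)
    exps = [math.exp((r - m) / tau) for r in returns]
    z = sum(exps)
    return [e / z for e in exps]

branch_returns = [0.1, 0.9, 0.3]  # hypothetical extrinsic returns of three branches
for step in (0, 50, 100, 500):
    tau = temperature(step, total_steps=1000)
    print(step, tau, [round(w, 3) for w in branch_weights(branch_returns, tau)])
```

Early in training the weights stay spread across branches, and once the schedule reaches zero only the best-performing branch contributes to the base update.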

6. Relation to adjacent intrinsic-reward methods and acronym collisions

IRPO sits within a broader family of methods that treat intrinsic reward as an optimization variable rather than as a hand-crafted bonus, but the mechanisms differ substantially.

Method	Core mechanism	Relation to IRPO
LIRPG (Zheng et al., 2018)	Meta-gradient learning of $r^{in}_\eta$ for additive policy updates	Learns intrinsic reward shaping directly
EIPO (Chen et al., 2022)	Constrained optimization with dynamic multiplier $\alpha$ for extrinsic–intrinsic balance	Tunes intrinsic influence in a mixed objective
Sparse-reward IRPO (Cho et al., 29 Jan 2026)	Intrinsic-guided exploratory branches with backpropagated extrinsic gradients	Uses intrinsic reward only for exploration
VIGOR (Wen et al., 11 May 2026)	Verifier-free GRPO using intrinsic gradient-norm reward in LLM post-training	Intrinsic reward from model-internal gradient geometry

LIRPG is a meta-gradient algorithm derived for policy-gradient agents under the Optimal Rewards Framework. Its intrinsic reward module is trained only to improve the extrinsic return $J^{ex}$, but the policy itself is updated using an additive reward signal $r^{ex}+\lambda r^{in}_\eta$ (Zheng et al., 2018). EIPO, although named Extrinsic-Intrinsic Policy Optimization, is described as a constrained intrinsic-reward policy optimization method because it maximizes the combined objective $J_{E+I}$ subject to an extrinsic optimality constraint and automatically suppresses intrinsic reward when exploration is unnecessary (Chen et al., 2022). A survey and empirical study of behavior adaptation via intrinsic reward further broadens the landscape by showing that intrinsic rewards based on the amount of learning, especially Weight Change and Bayesian Surprise, can generate useful behavior when the individual learners are introspective (Linke et al., 2019). These works illuminate the design space that IRPO occupies: whether intrinsic reward should be learned, constrained, or used as a proxy for learning progress.

Recent LLM post-training work extends the same broad idea into different regimes. VIGOR defines a verifier-free intrinsic reward from the norm of the policy model’s own teacher-forced negative log-likelihood gradients, corrected by a length-dependent factor to remove length bias and rank-shaped within groups for GRPO-style optimization. The paper positions it as an intrinsic reward policy optimization method in the same broad family, but the intrinsic signal is derived from parameter-space gradient geometry rather than from environment exploration (Wen et al., 11 May 2026).

Finally, two papers use the same acronym IRPO for different topics entirely. “IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning” uses Intergroup Relative Preference Optimization to replace $O(n^2)$ pairwise GRM comparisons with pointwise scoring and $O(n)$ scaling in RL-based reward-model training (Song et al., 2 Jan 2026). “IRPO: Boosting Image Restoration via Post-training GRPO” uses IRPO for a low-level vision post-training paradigm based on hard-sample selection and composite rewards in image restoration (Liu et al., 30 Nov 2025). These usages do not concern sparse-reward exploration, but they demonstrate that the acronym has become polysemous across reinforcement learning, reward modeling, and multimodal post-training.

In that broader context, sparse-reward IRPO is most precisely characterized as a bi-level surrogate-gradient method: intrinsic rewards define exploratory update paths, extrinsic reward evaluates the endpoints of those paths, and the base policy is optimized by differentiating through the exploratory dynamics. This suggests a distinctive position within intrinsic-reward research: neither simple reward shaping nor constrained reward mixing, but extrinsic training through intrinsic exploration (Cho et al., 29 Jan 2026).