
Intrinsic Reward Policy Optimization (IRPO)

Updated 4 July 2026
  • IRPO is a reinforcement learning framework that uses intrinsic rewards solely for exploration while optimizing the base policy on extrinsic rewards.
  • It creates multiple exploratory policy branches through intrinsic updates and aggregates extrinsic evaluations to update the base policy effectively.
  • IRPO avoids the pitfalls of mixing rewards by backpropagating extrinsic gradients through intrinsic updates, enhancing credit assignment in sparse-reward settings.

Intrinsic Reward Policy Optimization (IRPO) denotes a policy-optimization framework for sparse-reward reinforcement learning in which intrinsic rewards are used strictly as exploration mechanisms, while the base policy is still optimized for the extrinsic task reward through a surrogate policy gradient. In the formulation introduced for sparse-reward environments, IRPO avoids both direct optimization of a mixed extrinsic–intrinsic reward and hierarchical pretraining of subpolicies. Instead, it creates multiple exploratory policies from a base policy, updates those exploratory policies with intrinsic rewards, evaluates them on extrinsic reward, and backpropagates the resulting extrinsic policy gradients through the exploratory updates to improve the base policy (Cho et al., 29 Jan 2026).

1. Conceptual scope and terminological usage

In reinforcement learning, an extrinsic reward is the task-defining signal supplied by the environment, whereas an intrinsic reward is an auxiliary signal intended to guide exploration or shape learning dynamics. IRPO is specifically concerned with settings in which extrinsic rewards are so sparse that the ordinary policy gradient becomes weak or effectively uninformative, yet directly mixing intrinsic and extrinsic rewards can distort credit assignment (Cho et al., 29 Jan 2026).

The defining feature of IRPO is that intrinsic rewards are not treated as alternate objectives for the final policy. They are used to induce exploratory updates in auxiliary policy branches, and the base policy is then trained through the extrinsic evaluations of those branches. This distinguishes IRPO from additive reward shaping schemes, in which the agent optimizes a combined signal such as $r_t^{E}+\lambda r_t^{I}$, and from hierarchical methods that pretrain subpolicies under intrinsic rewards (Zheng et al., 2018).

The acronym IRPO is not unique in the recent literature. It is also used for Intergroup Relative Preference Optimization in reward modeling for language-model post-training, where the objective is to scale Bradley-Terry-style preference learning with pointwise generative reward models rather than to address sparse-reward exploration (Song et al., 2 Jan 2026). The acronym is likewise used for a GRPO-based image-restoration post-training paradigm that converts a restoration network into a stochastic policy optimized on a reward mixture reflecting structural fidelity, perceptual preference, and task-aware criteria (Liu et al., 30 Nov 2025). In technical usage, therefore, “IRPO” is context-dependent.

2. Motivation: sparse rewards, vanishing gradients, and the limits of standard remedies

The sparse-reward IRPO framework is motivated by a precise failure mode of conventional policy-gradient RL. The environment is modeled as an MDP

$$M = (\mathcal S,\mathcal A,T,R,\gamma),$$

with bounded reward $R:\mathcal S\times\mathcal A\to [0,R_{\max}]$ and discounted return

$$g_t = \sum_{l=0}^\infty \gamma^l r_{t+l}.$$

Under the paper’s sparsity assumption,

$$0 < \Pr_{d^\pi}(R(s,a)>0)\le \epsilon,\qquad \epsilon\in(0,1),$$

positive reward is rare, and the standard policy gradient

$$\nabla_\theta J(\theta)=\mathbb E_{d^{\pi_\theta}}\!\left[Q^{\pi_\theta}(s,a)\nabla_\theta\log \pi_\theta(a\mid s)\right]$$

vanishes as sparsity increases. The paper states this as Corollary 3.1:

$$\|\nabla_\theta J(\theta)\|_2 \to 0 \quad \text{as } \epsilon\to 0.$$

The central implication is that the true gradient may provide almost no actionable signal before the agent reaches rewarding regions of the state space (Cho et al., 29 Jan 2026).
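The vanishing-gradient effect can be illustrated on a toy problem. The sketch below is not the paper's setting or proof: it assumes a one-step task with `n_actions` actions, a single rewarded action, and a uniform softmax policy, so that the probability of seeing positive reward is `1 / n_actions`. The function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_norm(n_actions, rewarded_action=0):
    """Exact L2 norm of dJ/dlogits for a one-step task with one rewarded action."""
    logits = [0.0] * n_actions            # uniform initial policy (assumption)
    pi = softmax(logits)
    # J = pi[a*], so dJ/dz_i = pi[a*] * (1[i == a*] - pi[i])
    grad = [
        pi[rewarded_action] * ((1.0 if i == rewarded_action else 0.0) - pi[i])
        for i in range(n_actions)
    ]
    return math.sqrt(sum(g * g for g in grad))

# epsilon = 1 / n_actions is the chance of observing positive reward.
norms = {n: policy_gradient_norm(n) for n in (2, 10, 100, 1000)}
for n, g in norms.items():
    print(f"epsilon = 1/{n:<5d} |grad J| = {g:.6f}")
```

In this toy case the gradient norm shrinks roughly in proportion to the reward probability, which is the qualitative behavior the corollary describes.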

The paper identifies three common responses to sparse rewards and their associated limitations. First, pure extrinsic policy gradients fail because reward observations are too infrequent. Second, augmenting extrinsic reward with intrinsic reward can improve exploration but often harms credit assignment because the optimized objective is no longer the original task reward. Third, hierarchical RL preserves extrinsic credit assignment more directly, but incurs sample inefficiency and sub-optimality through subpolicy pretraining and temporally extended actions (Cho et al., 29 Jan 2026).

This diagnosis aligns with earlier intrinsic-reward work but departs from it operationally. LIRPG, for example, learns an intrinsic reward function $r^{in}_\eta(s,a)$ for policy-gradient agents and updates the policy on the additive objective

$$J^{ex+in} = \mathbb{E}_{\theta}\left[\sum_{t=0}^{\infty}\gamma^t \big(r^{ex}_t + \lambda r^{in}_\eta(s_t,a_t)\big)\right],$$

while training the intrinsic-reward parameters only to improve eventual extrinsic return (Zheng et al., 2018). EIPO, by contrast, frames intrinsic-reward balancing as a constrained optimization problem and uses a Lagrange multiplier $\alpha$ to adjust the relative weight of intrinsic exploration and extrinsic exploitation dynamically (Chen et al., 2022). IRPO’s distinctive claim is that one can use intrinsic rewards for exploration without training the base policy on a mixed reward at all (Cho et al., 29 Jan 2026).
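To make the credit-assignment concern with additive objectives concrete, the sketch below evaluates a mixed discounted return on two hand-made trajectories. The trajectories, the discount, and the values of `lam` are illustrative assumptions and are not taken from any of the cited papers.

```python
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def mixed_return(ext, intr, lam, gamma):
    """Discounted return of the additive signal r_ex + lam * r_in."""
    return discounted_return(ext, gamma) + lam * discounted_return(intr, gamma)

# Hypothetical trajectories: one reaches the goal, one only collects novelty bonuses.
goal = {"ext": [0.0, 0.0, 1.0], "intr": [0.0, 0.0, 0.0]}
wander = {"ext": [0.0, 0.0, 0.0], "intr": [1.0, 1.0, 1.0]}

def prefers_goal(lam, gamma=0.9):
    g = mixed_return(goal["ext"], goal["intr"], lam, gamma)
    w = mixed_return(wander["ext"], wander["intr"], lam, gamma)
    return g > w

for lam in (0.1, 0.5):
    print(f"lambda = {lam}: mixed objective prefers goal trajectory -> {prefers_goal(lam)}")
```

With a small coefficient the goal-reaching trajectory scores higher, but a larger coefficient reverses the ranking even though the extrinsic returns are unchanged.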

3. Core algorithmic construction

IRPO assumes access to multiple intrinsic reward functions. For each intrinsic reward, the method creates an exploratory branch by initializing the branch parameters to a copy of $\theta$, the parameters of the current base policy. Each branch is then updated for a fixed number of steps using its intrinsic objective: at every exploratory step, the branch parameters take a gradient step along the intrinsic policy gradient, which the implementation estimates with an actor-critic style approximation.

Thus, each branch performs ordinary intrinsic-reward policy improvement, but only within its own exploratory trajectory (Cho et al., 29 Jan 2026).
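A minimal sketch of the branching step, under strong simplifying assumptions: the policy is reduced to a single scalar parameter, each intrinsic objective is a hypothetical quadratic with a known gradient and Hessian, and the update is plain gradient ascent rather than an actor-critic estimate. The names `explore_branch` and `make_goal_reward` are illustrative only.

```python
def explore_branch(theta, intrinsic_grad, intrinsic_hess, n_steps, lr):
    """Run n_steps of gradient ascent on one intrinsic objective, starting from theta.

    Returns the final branch parameter and the per-step Jacobians
    d(theta_next)/d(theta_current) = 1 + lr * hessian (scalar case).
    """
    branch = theta                      # the branch starts as a copy of the base parameter
    jacobians = []
    for _ in range(n_steps):
        jacobians.append(1.0 + lr * intrinsic_hess(branch))
        branch = branch + lr * intrinsic_grad(branch)
    return branch, jacobians

def make_goal_reward(center):
    """Toy intrinsic objective J_k(x) = -0.5 * (x - center)^2 (an assumption)."""
    grad = lambda x: center - x
    hess = lambda x: -1.0
    return grad, hess

base_theta = 0.0
centers = [-3.0, 3.0]                   # one exploratory branch per intrinsic reward
branches = [
    explore_branch(base_theta, *make_goal_reward(c), n_steps=5, lr=0.2)
    for c in centers
]
for c, (final, jacs) in zip(centers, branches):
    print(f"center {c:+.1f}: final parameter {final:+.4f}, stored {len(jacs)} Jacobians")
```

The base parameter is never modified inside a branch; only the branch copy moves, and the stored Jacobians are what a later backward pass would consume.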

After the intrinsic updates, each final exploratory policy is evaluated on the extrinsic objective, and the extrinsic policy gradient for that branch is estimated at the final exploratory parameters. IRPO then backpropagates this extrinsic gradient through the entire sequence of exploratory updates by storing the Jacobian of each update step with respect to the parameters it started from and applying the chain rule, which yields a gradient with respect to the base parameters $\theta$.
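The branch updates and the backpropagation step can be sketched in the following form. The notation is assumed for illustration and may differ from the paper's: $k$ indexes the intrinsic rewards, $N$ is the number of exploratory steps, $\alpha$ is a step size, $\hat g_k$ is the intrinsic policy-gradient estimate for reward $k$, and $J$ is the extrinsic objective.

$$
\begin{aligned}
\theta_k^{(0)} &= \theta, \\
\theta_k^{(j+1)} &= \theta_k^{(j)} + \alpha\,\hat g_k\big(\theta_k^{(j)}\big), \qquad j = 0,\dots,N-1, \\
\frac{\partial \theta_k^{(N)}}{\partial \theta} &= \frac{\partial \theta_k^{(N)}}{\partial \theta_k^{(N-1)}}\cdots\frac{\partial \theta_k^{(1)}}{\partial \theta_k^{(0)}}, \qquad \frac{\partial \theta_k^{(j+1)}}{\partial \theta_k^{(j)}} = I + \alpha\,\frac{\partial \hat g_k}{\partial \theta}\big(\theta_k^{(j)}\big), \\
\nabla_\theta J\big(\theta_k^{(N)}\big) &= \left(\frac{\partial \theta_k^{(N)}}{\partial \theta}\right)^{\!\top} \nabla_{\theta_k^{(N)}} J\big(\theta_k^{(N)}\big).
\end{aligned}
$$

Read this way, the last line is an ordinary chain rule: the extrinsic gradient measured at the end of a branch is pulled back through each exploratory step to the base parameters.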

The surrogate IRPO gradient is then defined as a weighted combination of these branchwise backpropagated gradients. The weights depend on a temperature that controls whether the aggregation is closer to averaging or to selecting the best branch (Cho et al., 29 Jan 2026).
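One common way to realize a temperature that interpolates between averaging and best-branch selection is a softmax over the branches' extrinsic returns. The sketch below assumes that form; the exact weighting used in the paper is not reproduced here, and the function names and numbers are hypothetical.

```python
import math

def branch_weights(extrinsic_returns, temperature):
    """Softmax over branch returns; a temperature of zero selects the best branch."""
    best = max(extrinsic_returns)
    if temperature <= 0.0:
        winners = [1.0 if r == best else 0.0 for r in extrinsic_returns]
        total = sum(winners)
        return [w / total for w in winners]
    exps = [math.exp((r - best) / temperature) for r in extrinsic_returns]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(gradients, weights):
    """Weighted combination of per-branch (scalar) gradients."""
    return sum(w * g for w, g in zip(weights, gradients))

returns = [0.1, 0.5, 0.2]               # illustrative extrinsic returns of three branches
for tau in (100.0, 1.0, 0.01, 0.0):
    weights = branch_weights(returns, tau)
    print(f"tau = {tau:>6}: weights = {[round(w, 3) for w in weights]}")
```

Large temperatures give nearly uniform weights, while small temperatures concentrate almost all weight on the branch with the highest extrinsic return.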

The high-level loop is bi-level. At each iteration, the algorithm clones the base policy into one exploratory policy per intrinsic reward, performs intrinsic updates while storing Jacobians, evaluates each final branch on extrinsic reward, backpropagates the branchwise extrinsic gradients through the intrinsic update paths, aggregates them, and updates the base policy through a trust-region step.

The paper further notes that the Jacobian product is computed efficiently with automatic differentiation and vector-Jacobian products, which lowers the backward-pass complexity as a function of the number of parameters (Cho et al., 29 Jan 2026).
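The benefit of vector-Jacobian products can be seen in a small dense example. The sketch below assumes explicit 2x2 Jacobians with made-up entries and compares pulling a gradient back one step at a time against first multiplying all Jacobians together. With dense $d \times d$ matrices, each vector-matrix product costs on the order of $d^2$ operations and each matrix-matrix product on the order of $d^3$; an autodiff framework goes further by never materializing the Jacobians.

```python
def vec_mat(v, m):
    """Row vector times matrix: returns v @ m."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def mat_mat(a, b):
    """Dense matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def backprop_through_updates(final_grad, jacobians):
    """Pull final_grad back to the base parameters.

    jacobians[j] is d(theta_{j+1})/d(theta_j), stored in forward order.
    """
    v = list(final_grad)
    for jac in reversed(jacobians):      # last update step first
        v = vec_mat(v, jac)
    return v

# Illustrative Jacobians for two exploratory steps and a final extrinsic gradient.
jacobians = [
    [[0.8, 0.1], [0.0, 0.9]],
    [[1.0, -0.2], [0.3, 0.7]],
]
final_grad = [1.0, 2.0]

via_vjp = backprop_through_updates(final_grad, jacobians)

# Reference: form the full Jacobian product explicitly, then apply it once.
full = jacobians[0]
for jac in jacobians[1:]:
    full = mat_mat(jac, full)
via_full_product = vec_mat(final_grad, full)
print(via_vjp, via_full_product)
```

Both routes give the same pulled-back gradient; the step-by-step version only ever holds a vector between steps.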

4. Objective, bias, and formal interpretation

IRPO is explicit about the fact that its surrogate gradient is biased relative to the true extrinsic policy gradient at the base parameters. The base update does not differentiate the extrinsic return of the current base policy directly. Instead, it differentiates the extrinsic return of policies obtained after intrinsic-guided exploratory optimization. Consequently, IRPO is optimizing an implicit objective over reachable exploratory branches, not the ordinary extrinsic objective at the base parameters (Cho et al., 29 Jan 2026).

The paper formalizes this reachable-set perspective. Each final exploratory policy is obtained by applying the intrinsic updates for one intrinsic reward to a base parameter. The paper then considers the set of all exploratory policies reachable from a given base parameter, together with the union of these sets across all base policies.

With temperature annealed to zero, the paper’s Remark 3.4 states that IRPO effectively searches this union for the policy with the highest extrinsic return, and it further characterizes the output policy under an additional condition.

If the optimal policy parameters lie in the reachable set, the paper states that IRPO can recover optimality (Cho et al., 29 Jan 2026).

This interpretation clarifies both the promise and the limitation of the method. The promise is that the algorithm can “look through” exploratory updates that have already reached informative parts of the state space, even when the base-policy gradient is close to zero. The limitation is that the search space is restricted to what can be reached by a fixed number of intrinsic-guided updates from some base policy. Performance therefore depends on the diversity and quality of the intrinsic rewards, the number of branches, and the exploration depth (Cho et al., 29 Jan 2026).
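The "look through" effect can be reproduced on a one-dimensional toy problem. Everything in the sketch below is an assumption made for illustration: the extrinsic objective is a narrow bump at 3, the two intrinsic objectives are quadratics centered at -3 and 3, the aggregation is a softmax over branch returns, and the base update is plain gradient ascent rather than a trust-region step.

```python
import math

def j_ext(x):
    """Sparse-like extrinsic objective: essentially flat away from x = 3."""
    return math.exp(-4.0 * (x - 3.0) ** 2)

def j_ext_grad(x):
    return -8.0 * (x - 3.0) * j_ext(x)

def surrogate_gradient(theta, centers, n_steps=5, lr=0.2, temperature=0.01):
    returns, pulled_back = [], []
    for c in centers:
        branch, jac_product = theta, 1.0
        for _ in range(n_steps):
            jac_product *= 1.0 - lr          # d(next)/d(current) for J_k(x) = -0.5 (x - c)^2
            branch += lr * (c - branch)      # intrinsic gradient ascent step
        returns.append(j_ext(branch))
        pulled_back.append(j_ext_grad(branch) * jac_product)   # chain rule back to theta
    top = max(returns)
    exps = [math.exp((r - top) / temperature) for r in returns]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * g for w, g in zip(weights, pulled_back))

centers = [-3.0, 3.0]
theta = 0.0
base_grad_at_start = j_ext_grad(theta)
surrogate_at_start = surrogate_gradient(theta, centers)
for _ in range(100):
    theta += 1.0 * surrogate_gradient(theta, centers)   # plain ascent on the surrogate

print(f"base gradient at start:      {base_grad_at_start:.3e}")
print(f"surrogate gradient at start: {surrogate_at_start:.3e}")
print(f"final base parameter:        {theta:.4f}")
```

At the starting point the direct extrinsic gradient is numerically negligible, while the surrogate gradient obtained through the branch that moved toward the rewarding region is large enough to make progress.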

A related misconception is that IRPO simply replaces one reward with another. The sparse-reward formulation does not optimize intrinsic reward as the final target, and it does not learn a fixed intrinsic–extrinsic coefficient. By comparison, EIPO solves a constrained dual problem in which a mixed policy maximizes a combined extrinsic–intrinsic objective while maintaining extrinsic optimality, with the effective reward weighting set by the Lagrange multiplier $\alpha$, which is itself updated during training.

IRPO instead keeps intrinsic rewards inside the exploratory branches and uses extrinsic performance to determine which exploratory directions matter for the base update (Chen et al., 2022).

5. Empirical profile, ablations, and implementation trade-offs

The sparse-reward IRPO paper evaluates the method on nine tasks: three discrete environments (Maze-v1, Maze-v2, and FourRooms) and six continuous environments (PointMaze-v1, PointMaze-v2, FetchReach, AntMaze-v1, AntMaze-v2, and AntMaze-v3). Rewards are typically given upon reaching the goal. In AntMaze, the experiments additionally impose a penalty and terminate the episode if the ant flips or jumps aggressively, and the success threshold is loosened from 0.35 to 0.5 to avoid unnatural jumping behavior (Cho et al., 29 Jan 2026).

The baselines are HRL-ALLO, DRND, PSNE, PPO, and TRPO; IRPO and HRL use the same intrinsic rewards derived from ALLO. Across these environments, the paper reports that IRPO generally achieves the highest converged performance, narrow confidence intervals, and stronger robustness across environment difficulty. It further states that IRPO outperforms all baselines in almost all environments, that HRL-ALLO is competitive in several settings but still underperforms IRPO, and that IRPO succeeds in FetchReach where HRL-ALLO completely fails. By contrast, direct extrinsic methods such as PPO, TRPO, and PSNE struggle because exploration is insufficient, while DRND provides little gain over PPO and fails in some tasks because of poor credit assignment (Cho et al., 29 Jan 2026).

The sample-efficiency picture is more qualified. The paper states that IRPO is often more sample-efficient than HRL-ALLO because it avoids pretraining separate subpolicies, but it can be less sample-efficient than simpler baselines on easier tasks because it incurs the extra cost of exploratory updates. This suggests that IRPO is best understood as a targeted response to hard sparse-reward problems rather than as a uniformly cheaper replacement for standard policy-gradient methods (Cho et al., 29 Jan 2026).

The ablations are central to the method’s characterization. Trust-region updates reduce variance and improve stability relative to standard gradient updates. For exploration depth, too few intrinsic updates give insufficient exploration and poor final performance, too many are costly and worsen sample efficiency, and an intermediate depth provides the best balance in the reported experiments. Replacing the backpropagated surrogate with an importance-sampling estimator yields high variance and poor performance, which the paper uses to justify differentiation through the exploratory updates. Blending the IRPO gradient with the true gradient, whether abruptly or gradually, hurts performance and stability, plausibly because the two objectives induce different optimization landscapes. When random intrinsic rewards are substituted for meaningful ones, IRPO degrades but still often outperforms HRL using the same random rewards (Cho et al., 29 Jan 2026).

Several practical design choices recur in the reported implementation: annealing the temperature from 1 to 0 over the first 10% of training, using a small trust-region KL threshold, computing vector-Jacobian products with automatic differentiation, and tuning the exploratory-update budget to balance exploration power against computational overhead (Cho et al., 29 Jan 2026).
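The temperature schedule can be sketched as a simple function of the training step. The source gives only the endpoints (1 to 0) and the window (the first 10% of training); the linear shape and the function name below are assumptions.

```python
def temperature(step, total_steps, anneal_fraction=0.10, start=1.0, end=0.0):
    """Linearly anneal from start to end over the first anneal_fraction of training."""
    anneal_steps = max(1, int(anneal_fraction * total_steps))
    progress = min(step / anneal_steps, 1.0)
    return start + (end - start) * progress

for step in (0, 50, 100, 999):
    print(f"step {step:>3}: temperature = {temperature(step, total_steps=1000):.2f}")
```

Once the schedule reaches zero, any weighting that divides by the temperature needs a separate best-branch case to avoid division by zero.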

6. Relation to adjacent intrinsic-reward methods and acronym collisions

IRPO sits within a broader family of methods that treat intrinsic reward as an optimization variable rather than as a hand-crafted bonus, but the mechanisms differ substantially.

| Method | Core mechanism | Relation to IRPO |
| --- | --- | --- |
| LIRPG (Zheng et al., 2018) | Meta-gradient learning of $r^{in}_\eta$ for additive policy updates | Learns intrinsic reward shaping directly |
| EIPO (Chen et al., 2022) | Constrained optimization with dynamic multiplier $\alpha$ for extrinsic–intrinsic balance | Tunes intrinsic influence in a mixed objective |
| Sparse-reward IRPO (Cho et al., 29 Jan 2026) | Intrinsic-guided exploratory branches with backpropagated extrinsic gradients | Uses intrinsic reward only for exploration |
| VIGOR (Wen et al., 11 May 2026) | Verifier-free GRPO using intrinsic gradient-norm reward in LLM post-training | Intrinsic reward from model-internal gradient geometry |

LIRPG is a meta-gradient algorithm derived for policy-gradient agents under the Optimal Rewards Framework. Its intrinsic reward module is trained only to improve extrinsic return, but the policy itself is updated using the additive reward signal $r^{ex}_t + \lambda r^{in}_\eta(s_t,a_t)$ (Zheng et al., 2018). EIPO, although named Extrinsic-Intrinsic Policy Optimization, is described as a constrained intrinsic-reward policy optimization method because it maximizes a combined extrinsic–intrinsic objective subject to an extrinsic optimality constraint and automatically suppresses intrinsic reward when exploration is unnecessary (Chen et al., 2022). A survey and empirical study of behavior adaptation via intrinsic reward further broadens the landscape by showing that intrinsic rewards based on the amount of learning, especially Weight Change and Bayesian Surprise, can generate useful behavior when the individual learners are introspective (Linke et al., 2019). These works illuminate the design space that IRPO occupies: whether intrinsic reward should be learned, constrained, or used as a proxy for learning progress.

Recent LLM post-training work extends the same broad idea into different regimes. VIGOR defines a verifier-free intrinsic reward from the norm of the policy model’s own teacher-forced negative log-likelihood gradients, corrected by a length-dependent factor to remove length bias and rank-shaped within groups for GRPO-style optimization. The paper positions it as an intrinsic reward policy optimization method in the same broad family, but the intrinsic signal is derived from parameter-space gradient geometry rather than from environment exploration (Wen et al., 11 May 2026).

Finally, two papers use the same acronym IRPO for different topics entirely. “IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning” uses Intergroup Relative Preference Optimization to replace pairwise GRM comparisons with pointwise scoring, with more favorable scaling in RL-based reward-model training (Song et al., 2 Jan 2026). “IRPO: Boosting Image Restoration via Post-training GRPO” uses IRPO for a low-level vision post-training paradigm based on hard-sample selection and composite rewards in image restoration (Liu et al., 30 Nov 2025). These usages do not concern sparse-reward exploration, but they demonstrate that the acronym has become polysemous across reinforcement learning, reward modeling, and multimodal post-training.

In that broader context, sparse-reward IRPO is most precisely characterized as a bi-level surrogate-gradient method: intrinsic rewards define exploratory update paths, extrinsic reward evaluates the endpoints of those paths, and the base policy is optimized by differentiating through the exploratory dynamics. This suggests a distinctive position within intrinsic-reward research: neither simple reward shaping nor constrained reward mixing, but extrinsic training through intrinsic exploration (Cho et al., 29 Jan 2026).
