
Potential-Based Intrinsic Motivation (PBIM)

Updated 6 January 2026
  • PBIM is a reinforcement learning framework that transforms complex, often non-Markovian intrinsic rewards into shaping rewards while preserving optimal policies.
  • It generalizes classical potential-based reward shaping by allowing trajectory-dependent potentials and imposing boundary conditions to prevent reward hacking.
  • Empirical results show that using PBIM with Generalized Reward Matching accelerates exploration and improves convergence in sparse-reward environments.

Potential-Based Intrinsic Motivation (PBIM) is a principled framework in reinforcement learning for converting arbitrary intrinsic motivation (IM) reward streams into optimality-preserving shaping rewards. The underlying motivation is to enable complex, often trainable IM functions—frequently non-Markovian or trajectory-dependent—to safely accelerate exploration in sparse-reward environments without distorting the set of optimal policies. PBIM achieves this by generalizing classical potential-based reward shaping (PBRS) to allow history-dependent potentials and, together with the Generalized Reward Matching (GRM) construct, establishes a comprehensive family of shaping transformations that preserve policy invariance. This approach guarantees immunity to reward-hacking, is broadly compatible with contemporary IM modules, and is underpinned by rigorous theorems, boundary conditions, and explicit algorithmic corrections (Forbes et al., 2024).

1. Foundations of Potential-Based Reward Shaping

Traditional PBRS introduces a shaping term of the form $F(s, a, s') = \gamma \Phi(s') - \Phi(s)$, where $\Phi$ is a potential function defined over states (or state–actions in some variants) and $\gamma$ is the discount factor. Ng et al. (1999) showed that such additive shaping terms shift $Q$-values by an action-independent offset and thus leave the set of optimal policies unchanged, provided that $\Phi$ does not depend on future agent actions. Episodic extensions require a terminal correction to ensure no bias at episode boundaries (Forbes et al., 2024).
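The invariance argument can be checked numerically: the shaping terms telescope along any trajectory, so shaped and unshaped returns differ only by a potential offset fixed by the start and end states. A minimal sketch, assuming a hypothetical 1-D chain environment with a distance-to-goal potential (all names and values here are illustrative):

```python
# Minimal PBRS sketch on a chain of states 0..4 with goal state 4.
GAMMA = 0.99

def potential(state: int, goal: int = 4) -> float:
    # Illustrative potential: negative distance to the goal
    return -abs(goal - state)

def shaped_reward(r: float, s: int, s_next: int) -> float:
    # F(s, s') = gamma * Phi(s') - Phi(s), added on top of the base reward
    return r + GAMMA * potential(s_next) - potential(s)

# (s, r, s') transitions of one episode walking from state 0 to the goal
traj = [(0, 0.0, 1), (1, 0.0, 2), (2, 0.0, 3), (3, 1.0, 4)]

ret_raw = sum(GAMMA**t * r for t, (_, r, _) in enumerate(traj))
ret_shaped = sum(GAMMA**t * shaped_reward(r, s, sn)
                 for t, (s, r, sn) in enumerate(traj))

# The shaping terms telescope: shaped and raw returns differ only by the
# policy-independent offset gamma^N * Phi(s_N) - Phi(s_0).
offset = GAMMA**len(traj) * potential(traj[-1][2]) - potential(traj[0][0])
assert abs((ret_shaped - ret_raw) - offset) < 1e-9
```

Because the offset does not depend on the actions taken, ranking policies by shaped return is equivalent to ranking them by raw return.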

PBIM extends classical PBRS by permitting $\Phi$ to depend on the entire trajectory up to time $t$ (i.e., non-Markovian), encompassing complex, learned IM signals such as deep curiosity or density models, or trainable policy-dependent bonuses. The critical sufficient condition for optimality preservation is that the expected difference $E_{\pi,T}[\gamma^{N-t}\Phi_N - \Phi_t]$ must be independent of the agent's action at every time step $t$ for horizon $N$.

2. Formal Definition of PBIM and Boundary Condition

Given an episodic MDP $M = (S, A, T, R, \gamma)$ of horizon $N$, PBIM shapes the reward as

$$R'(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) + F_t,$$

where $F_t$ is the IM term to be converted. PBIM constructs $F_t$ as a telescoping difference of the intrinsic discounted return, $U_t^{\pi} = \sum_{n=t}^{N-1} \gamma^{n-t} F_n$. The Bellman-like identity yields

$$F_t = U_t^{\pi} - \gamma U_{t+1}^{\pi} = \gamma \Phi_{t+1} - \Phi_t$$

with $\Phi_t = -U_t^{\pi}$, accommodating non-Markovian and trainable potential functions.
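The identity above can be verified for an arbitrary intrinsic reward stream by computing $U_t^{\pi}$ with the backward Bellman recursion and checking that $F_t = \gamma\Phi_{t+1} - \Phi_t$ holds at every step (a sketch; the random stream stands in for e.g. an RND or curiosity bonus):

```python
import numpy as np

def intrinsic_returns(F: np.ndarray, gamma: float) -> np.ndarray:
    """Backward recursion U_t = F_t + gamma * U_{t+1}, with U_N = 0."""
    U = np.zeros(len(F) + 1)
    for t in range(len(F) - 1, -1, -1):
        U[t] = F[t] + gamma * U[t + 1]
    return U

# Arbitrary intrinsic reward stream (random values stand in for a learned bonus)
rng = np.random.default_rng(0)
F = rng.random(5)
gamma = 0.99

U = intrinsic_returns(F, gamma)
Phi = -U  # trajectory-dependent potential Phi_t = -U_t

# Verify the Bellman-like identity F_t = gamma * Phi_{t+1} - Phi_t at every step
assert np.allclose(F, gamma * Phi[1:] - Phi[:-1])
```

This shows why any future-agnostic intrinsic reward already has the potential-based form once the potential is allowed to be trajectory-dependent.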

To guarantee policy invariance, PBIM enforces the boundary condition $$\forall t,\quad E_{\pi,T}[\gamma^{N-t}\Phi_N - \Phi_t] = \text{constant with respect to } a_t,$$ which, under the assumption that $F_t$ is future-agnostic (action-independent beyond $t$), always holds when using the prescribed terminal correction. This prevents the agent from converging to suboptimal policies induced by IM.
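Concretely, the terminal correction leaves every intrinsic reward in place until the last step, where the discount-adjusted total is subtracted back, so the discounted intrinsic contribution over the whole episode cancels. A minimal sketch of this special case (function name and setup are illustrative):

```python
import numpy as np

def pbim_shape(F: np.ndarray, gamma: float) -> np.ndarray:
    """Terminal-correction sketch: pass F_t through unchanged for t < N-1,
    then subtract the discount-adjusted sum of all intrinsic rewards at the
    final step, so the discounted intrinsic contribution of the episode is zero."""
    N = len(F)
    Fp = F.astype(float).copy()
    Fp[N - 1] -= sum(gamma ** (i - (N - 1)) * F[i] for i in range(N))
    return Fp

rng = np.random.default_rng(1)
F = rng.random(6)        # arbitrary future-agnostic intrinsic rewards
gamma = 0.97

Fp = pbim_shape(F, gamma)
# Discounted sum of the shaped intrinsic rewards vanishes: no net bias
assert abs(sum(gamma**t * fp for t, fp in enumerate(Fp))) < 1e-9
```

The agent still sees dense intrinsic signal during the episode, but no policy can gain net discounted return from it, which is the mechanism behind the invariance guarantee.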

3. Generalized Reward Matching (GRM): Optimality-Preserving Family

GRM generalizes PBIM by introducing a matching function $m_{t,t'} \in [0,1]$ controlling when and how much of each IM reward $F_{t'}$ is "subtracted back" at time $t \ge t'$. For future-agnostic $F_t$, GRM defines the shaped reward as

$$F'_t = F_t - \sum_{i=0}^{t} \gamma^{i-t} F_i m_{t,i},$$

with the constraints:

  • $\forall t',\ \sum_{t=t'}^{N-1} m_{t,t'} = 1$ (each $F_{t'}$ fully matched),
  • $m_{t,t'} = 0$ for $t < t'$ (no pre-subtraction).
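Under these two constraints the discounted sum of shaped rewards cancels exactly, whatever the matching schedule. A sketch with a fixed-delay matching matrix, where each $F_i$ is subtracted back $D$ steps later or at the final step (the matrix construction and delay value are illustrative):

```python
import numpy as np

def grm_shape(F: np.ndarray, m: np.ndarray, gamma: float) -> np.ndarray:
    """GRM: F'_t = F_t - sum_{i<=t} gamma^(i-t) * F_i * m[t, i].
    m[t, i] is the fraction of intrinsic reward F_i subtracted back at step t;
    requires m[t, i] = 0 for t < i and each column of m summing to 1."""
    N = len(F)
    Fp = np.empty(N)
    for t in range(N):
        Fp[t] = F[t] - sum(gamma ** (i - t) * F[i] * m[t, i]
                           for i in range(t + 1))
    return Fp

# Example matching: subtract each F_i back D steps later (or at the last step)
N, D, gamma = 8, 3, 0.98
m = np.zeros((N, N))
for i in range(N):
    m[min(i + D, N - 1), i] = 1.0

rng = np.random.default_rng(2)
F = rng.random(N)
Fp = grm_shape(F, m, gamma)

# The matched subtractions cancel the discounted intrinsic sum exactly
assert abs(sum(gamma**t * fp for t, fp in enumerate(Fp))) < 1e-9
```

Setting $D = N$ recovers the PBIM terminal correction as the special case in which all matching mass sits at the final step.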

Theorem 3 proves that GRM transformations coincide exactly with the set of all PBRS forms that preserve optimal policies (Forbes et al., 2024). GRM thus encompasses interim corrections and delays beyond the terminal adjustment used in PBIM, accommodating environment-specific credit assignment and potential normalization strategies.

4. Algorithmic Realizations and Empirical Performance

PBIM is instantiated by:

  • Collecting trajectories and IM rewards FtF_t,
  • Computing cumulative returns UtU_t for potential estimates,
  • Applying terminal corrections (or distributed matching per GRM) to form shaped rewards,
  • Normalizing intrinsic signals as needed via running means,
  • Updating RL agents (e.g., PPO+LSTM, Q-learning+RND) with these shaped returns.
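The normalization step above is typically a running mean/variance estimate over incoming intrinsic rewards. A simplified sketch using Welford's online algorithm (the class and method names are illustrative, not the paper's implementation):

```python
class RunningNormalizer:
    """Running mean/variance (Welford's online algorithm) used to rescale
    intrinsic rewards before shaping. Simplified illustrative sketch."""

    def __init__(self, eps: float = 1e-8):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.eps = eps

    def update(self, x: float) -> float:
        """Ingest one intrinsic reward; return it divided by the running std."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        std = (self.m2 / self.count) ** 0.5
        # First sample has no spread estimate yet; pass it through unscaled
        return x / (std + self.eps) if self.count > 1 else x

norm = RunningNormalizer()
scaled = [norm.update(x) for x in [0.5, 1.5, 0.25, 3.0, 0.75]]
```

Scaling by a running std keeps the intrinsic signal at a stable magnitude relative to extrinsic rewards without changing which policies the shaped objective prefers.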

Table: Example convergence metrics in MiniGrid DoorKey (higher $\gamma$, $\alpha = 0.025$):

Method            Frames to Convergence    Mean Steps
No IM             N/A                      634.8
Raw IM            6.07M                    62.0
PBIM+norm         1.26M                    51.2
GRM(D=10)+norm    1.19M                    38.0

Empirical results confirm that PBIM and GRM restore true optimality and accelerate exploration, outperforming naive IM (which induces stalling or distraction behaviors) and yielding lower variance and improved convergence speed (Forbes et al., 2024).

5. Extension: Potential-Based Intrinsic Motivation in Bayes-Adaptive MDPs

PBIM has been extended to Bayes-Adaptive Markov Decision Processes (BAMDPs), where a "belief-state" MDP models learning over the agent's knowledge posterior. Potential-based pseudo-rewards in this context (BAMDP Potential-Based Shaping Functions, or BAMPFs) guarantee optimal policy invariance and resist reward-hacking both in meta-RL and standard RL settings. Potentials $\Phi(h)$ can encode the value of information $V_I^*(s,h)$ or opportunity $V_O^*(s,h)$ for history-dependent signals. Classical PBIM thus becomes a special case of BAMDP shaping when $\Phi$ is state-only (Lidayan et al., 2024).

Meta-RL and exploration-intensive tasks such as Bernoulli bandits (potential: negative posterior entropy) and Mountain Car (potential: proximity and velocity to goal) benefit from PBIM/BAMPF shaping, which yields Bayes-optimal exploration and improved sample efficiency without policy corruption (Lidayan et al., 2024).
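A toy version of the bandit case: with the belief over the arm's success probability discretized to a finite grid, the potential $\Phi(h)$ is the negative Shannon entropy of the posterior, and the pseudo-reward for a belief transition is $\gamma\Phi(h') - \Phi(h)$. This is an illustrative simplification of the BAMPF construction (grid, uniform prior, and function names are assumptions):

```python
import numpy as np

def posterior(counts, p_grid):
    """Discrete posterior over candidate success probabilities p_grid after
    observing (successes, failures), starting from a uniform prior."""
    succ, fail = counts
    lik = p_grid**succ * (1 - p_grid)**fail
    return lik / lik.sum()

def potential(counts, p_grid):
    """BAMPF-style potential: negative Shannon entropy of the belief state."""
    q = posterior(counts, p_grid)
    return float(np.sum(q * np.log(q + 1e-12)))  # = -H(belief)

p_grid = np.linspace(0.05, 0.95, 19)
gamma = 0.99

h0 = (0, 0)   # no observations yet: maximum-entropy (uniform) belief
h1 = (3, 1)   # after 3 successes and 1 failure: sharper belief

# Pseudo-reward for the belief transition h0 -> h1: gamma*Phi(h1) - Phi(h0).
# It is positive because the observations reduced posterior entropy.
pseudo_r = gamma * potential(h1, p_grid) - potential(h0, p_grid)
assert pseudo_r > 0
```

The agent is thus rewarded exactly for information gained, and because the bonus is potential-based over belief states, it cannot change which policies are Bayes-optimal.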

6. Limitations and Open Questions

  • PBIM and GRM require future-agnostic intrinsic reward signals; empowerment-style IM (future-dependent) violates the optimality-preserving boundary condition.
  • The “reward horizon” problem in PBIM (large final correction) can dilute credit assignment; intermediate matching in GRM offers partial mitigation.
  • Fully automated design of matching functions remains unresolved; environment-specific and agent-specific adaptation of delay parameters impacts performance.
  • Scaling to very long-horizon sparse-reward domains (e.g., Montezuma’s Revenge) induces terminal penalties that can destabilize learning; action-dependent strategies such as ADOPS circumvent these issues but depart from pure potential-based forms (2505.12611).

7. Practical Significance and Outlook

PBIM and GRM jointly furnish a plug-and-play, theory-backed pipeline for composing IM rewards into the classical PBRS guarantee of policy invariance. They support complex, non-Markovian, and trainable IM modules, enabling their direct use in deep RL and exploration-heavy settings with minimal overhead and robust empirical improvements. Ensuring optimality preservation allows for more aggressive and reliable exploration bonuses, subject to tractable boundary conditions and careful normalization. Ongoing work seeks to relax future-agnosticity, address credit assignment in ultra-long episodes, and unify PBIM with broader adaptive exploration frameworks.
