Potential-Based Intrinsic Motivation (PBIM)
- PBIM is a reinforcement learning framework that transforms complex, often non-Markovian intrinsic rewards into shaping rewards while preserving optimal policies.
- It generalizes classical potential-based reward shaping by allowing trajectory-dependent potentials and imposing boundary conditions to prevent reward hacking.
- Empirical results show that using PBIM with Generalized Reward Matching accelerates exploration and improves convergence in sparse-reward environments.
Potential-Based Intrinsic Motivation (PBIM) is a principled framework in reinforcement learning for converting arbitrary intrinsic motivation (IM) reward streams into optimality-preserving shaping rewards. The underlying motivation is to enable complex, often trainable IM functions—frequently non-Markovian or trajectory-dependent—to safely accelerate exploration in sparse-reward environments without distorting the set of optimal policies. PBIM achieves this by generalizing classical potential-based reward shaping (PBRS) to allow history-dependent potentials and, together with the Generalized Reward Matching (GRM) construct, establishes a comprehensive family of shaping transformations that preserve policy invariance. This approach guarantees immunity to reward-hacking, is broadly compatible with contemporary IM modules, and is underpinned by rigorous theorems, boundary conditions, and explicit algorithmic corrections (Forbes et al., 2024).
1. Foundations of Potential-Based Reward Shaping
Traditional PBRS introduces a shaping term of the form $F(s_t, s_{t+1}) = \gamma \Phi(s_{t+1}) - \Phi(s_t)$, where $\Phi$ is a potential function defined over states (or state-actions in some variants) and $\gamma$ is the discount factor. Ng et al. (1999) showed that such additive shaping terms shift $Q$-values by an action-independent offset and thus leave the set of optimal policies unchanged, provided that $\Phi$ does not depend on future agent actions. Episodic extensions require a terminal correction (conventionally fixing $\Phi(s_T) = 0$) to ensure no bias at episode boundaries (Forbes et al., 2024).
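The classical shaping step can be sketched in a few lines. The snippet below (function name illustrative, not from the paper) applies $F_t = \gamma \Phi(s_{t+1}) - \Phi(s_t)$ to a reward sequence; with $\Phi(s_0) = \Phi(s_T) = 0$, the discounted sum of the shaping terms telescopes to zero, which is the action-independence property underlying policy invariance.

```python
import numpy as np

def shape_rewards(rewards, potentials, gamma=0.99):
    """Classical PBRS: add F_t = gamma * Phi(s_{t+1}) - Phi(s_t) to each reward.

    `potentials` holds Phi(s_0), ..., Phi(s_T); the terminal potential
    Phi(s_T) is conventionally set to zero so episode boundaries add no bias.
    """
    rewards = np.asarray(rewards, dtype=float)
    phi = np.asarray(potentials, dtype=float)
    F = gamma * phi[1:] - phi[:-1]   # one shaping term per transition
    return rewards + F

# The discounted sum of the shaping terms telescopes to
# gamma^T * Phi(s_T) - Phi(s_0), which is independent of the actions taken.
```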
PBIM extends classical PBRS by permitting the potential $\Phi_t$ to depend on the entire trajectory $h_t$ up to time $t$ (i.e., non-Markovian), encompassing complex, learned IM signals such as deep curiosity or density models, or trainable policy-dependent bonuses. The critical sufficient condition for optimality preservation is that the expected difference $\gamma \Phi_{t+1} - \Phi_t$ must be independent of the agent's action $a_t$ at every time step $t$ of the horizon $T$.
2. Formal Definition of PBIM and Boundary Condition
Given an episodic MDP of horizon $T$, PBIM shapes the reward as
$$r'_t = r_t + F_t, \qquad F_t = \gamma \Phi_{t+1} - \Phi_t,$$
where $b_t$ is the IM term to be converted. PBIM constructs $\Phi_t$ as a telescoping difference of the intrinsic discounted return, $\Phi_t = \gamma^{-t} \sum_{j=0}^{t-1} \gamma^j b_j$ (with $\Phi_0 = 0$). The Bellman-like identity yields
$$\gamma \Phi_{t+1} - \Phi_t = b_t \quad \text{for } 0 \le t < T-1,$$
with $\Phi_t$ a function of the full history $h_t$, accommodating non-Markovian and trainable potential functions.
To guarantee policy invariance, PBIM enforces the boundary condition $\Phi_T = 0$, so that the discounted sum of shaping terms telescopes to $\sum_{t=0}^{T-1} \gamma^t F_t = \gamma^T \Phi_T - \Phi_0 = 0$; under the assumption that each $b_t$ is future-agnostic (independent of actions taken after time $t$), this always holds when using the prescribed terminal correction. This prevents the agent from converging to suboptimal policies induced by IM.
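A minimal sketch of this construction, under the telescoping-potential assumption stated here (the function name is illustrative): each intrinsic reward $b_t$ passes through unchanged, and the final step subtracts back the full discounted intrinsic return, rescaled to time $T-1$, which is equivalent to enforcing $\Phi_T = 0$.

```python
import numpy as np

def pbim_shaping(intrinsic, gamma=0.99):
    """PBIM sketch: deliver each intrinsic reward b_t as-is, then apply a
    single terminal correction subtracting the discounted intrinsic return,
    so the discounted sum of shaping terms is exactly zero."""
    b = np.asarray(intrinsic, dtype=float)
    T = len(b)
    F = b.copy()
    # Terminal correction: subtract the discounted intrinsic return,
    # rescaled by gamma^-(T-1) so the cancellation holds in discounted terms.
    disc_return = sum(gamma ** j * b[j] for j in range(T))
    F[-1] -= disc_return / gamma ** (T - 1)
    return F
```

Note how this realizes the boundary condition: the agent still sees dense exploration bonuses along the way, but no policy can raise its total discounted shaped return above its unshaped value.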
3. Generalized Reward Matching (GRM): Optimality-Preserving Family
GRM generalizes PBIM by introducing a matching function $\alpha(t, u)$ controlling when and how much of each IM reward $b_t$ is "subtracted back" at a later time $u$. For future-agnostic $b_t$, GRM defines the shaped reward as
$$f_u = b_u - \sum_{t=0}^{u} \gamma^{t-u}\, \alpha(t, u)\, b_t,$$
with the constraints:
- $\sum_{u \ge t} \alpha(t, u) = 1$ (each $b_t$ fully matched),
- $\alpha(t, u) = 0$ for $u < t$ (no pre-subtraction).
Theorem 3 proves that GRM transformations coincide exactly with the set of all PBRS forms that preserve optimal policies (Forbes et al., 2024). GRM thus encompasses interim corrections and delays beyond the terminal adjustment used in PBIM, accommodating environment-specific credit assignment and potential normalization strategies.
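One concrete member of this family, consistent with the GRM(D) variants reported below, matches each intrinsic reward exactly $D$ steps after it is granted (clipped to the final step), with the discount rescaling $\gamma^{t-u}$ ensuring the cancellation holds in discounted terms. This is a hedged sketch, not the paper's reference implementation:

```python
import numpy as np

def grm_delay_shaping(intrinsic, gamma=0.99, delay=10):
    """GRM-style sketch with a fixed-delay matching function:
    b_t is granted at time t and fully subtracted back (discount-rescaled)
    `delay` steps later, or at the final step if the episode ends first.
    The discounted sum of shaping terms remains zero."""
    b = np.asarray(intrinsic, dtype=float)
    T = len(b)
    F = b.copy()
    for t in range(T):
        u = min(t + delay, T - 1)          # matching time: never before t
        F[u] -= gamma ** (t - u) * b[t]    # gamma^u * term cancels gamma^t * b_t
    return F
```

Shorter delays keep the correction close to the bonus that caused it (easing credit assignment); `delay` large enough to exceed the horizon recovers the PBIM terminal correction.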
4. Algorithmic Realizations and Empirical Performance
PBIM is instantiated by:
- Collecting trajectories and IM rewards $b_t$,
- Computing cumulative discounted intrinsic returns $\sum_j \gamma^j b_j$ for potential estimates,
- Applying terminal corrections (or distributed matching per GRM) to form shaped rewards,
- Normalizing intrinsic signals as needed via running means,
- Updating RL agents (e.g., PPO+LSTM, Q-learning+RND) with these shaped returns.
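The steps above can be sketched as a small per-episode routine; class and method names here are illustrative, and the normalizer (a running standard deviation via Welford accumulators) stands in for whatever scheme a given IM module uses:

```python
import numpy as np

class PBIMShaper:
    """Sketch of the PBIM pipeline step: normalize intrinsic rewards with a
    running standard deviation, then apply the terminal matching correction
    before handing shaped rewards to the RL learner."""

    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.count, self.mean, self.m2 = 0, 0.0, 0.0  # Welford accumulators

    def _normalize(self, b):
        for x in b:                        # update running variance online
            self.count += 1
            d = x - self.mean
            self.mean += d / self.count
            self.m2 += d * (x - self.mean)
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return np.asarray(b, dtype=float) / std   # scale only, keep sign

    def shape_episode(self, intrinsic):
        b = self._normalize(intrinsic)
        T = len(b)
        F = b.copy()
        # Terminal correction: zero out the discounted shaping sum.
        F[-1] -= sum(self.gamma ** j * b[j] for j in range(T)) / self.gamma ** (T - 1)
        return F
```

Because normalization is applied before the correction, the zero-sum property holds for the normalized signal, which is what the learner actually receives.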
Table: Example convergence metrics in MiniGrid DoorKey (fewer frames to convergence and fewer mean steps per episode are better):
| Method | Frames to Convergence | Mean Steps |
|---|---|---|
| No IM | N/A | 634.8 |
| Raw IM | 6.07M | 62.0 |
| PBIM+norm | 1.26M | 51.2 |
| GRM(D=10)+norm | 1.19M | 38.0 |
Empirical results confirm that PBIM and GRM restore true optimality and accelerate exploration, outperforming naive IM (which induces stalling or distraction behaviors) and yielding lower variance and improved convergence speed (Forbes et al., 2024).
5. Extension: Potential-Based Intrinsic Motivation in Bayes-Adaptive MDPs
PBIM has been extended to Bayes-Adaptive Markov Decision Processes (BAMDPs), where a “belief-state” MDP models learning over the agent’s knowledge posterior. Potential-based pseudo-rewards in this context (BAMDP Potential-Based Shaping Functions, or BAMPFs) guarantee optimal policy invariance and resist reward-hacking both in meta-RL and standard RL settings. Potentials can encode the value of information or opportunity for history-dependent signals. Classical PBIM thus becomes a special case of BAMDP shaping when the potential $\Phi$ is a function of state only (Lidayan et al., 2024).
Meta-RL and exploration-intensive tasks such as Bernoulli-bandits (potential: negative posterior entropy) and Mountain-Car (potential: proximity and velocity to goal) benefit from PBIM/BAMPF shaping, which yields Bayes-optimal exploration and improved sample efficiency without policy corruption (Lidayan et al., 2024).
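The Bernoulli-bandit potential mentioned above can be made concrete: take $\Phi(\text{belief}) = -H(\text{posterior})$ for a Beta posterior over the arm's success probability, and pay the shaping reward $\gamma \Phi' - \Phi$ after each pull. The sketch below is illustrative (function names are not from the paper); the Beta entropy is computed by a simple Riemann sum to keep the example dependency-free.

```python
import numpy as np
from math import lgamma

def beta_entropy(a, b, n=20000):
    """Differential entropy of Beta(a, b), approximated by a Riemann sum."""
    x = np.linspace(1e-6, 1 - 1e-6, n)
    logpdf = ((a - 1) * np.log(x) + (b - 1) * np.log(1 - x)
              + lgamma(a + b) - lgamma(a) - lgamma(b))
    pdf = np.exp(logpdf)
    return float(-np.sum(pdf * logpdf) * (x[1] - x[0]))

def bampf_shaping(observations, gamma=0.99):
    """BAMPF-style sketch for a one-armed Bernoulli bandit: the potential is
    the negative entropy of the Beta posterior over the success probability,
    so the shaping reward rewards information gained per pull."""
    a, b = 1.0, 1.0                        # uniform Beta(1, 1) prior
    phi = -beta_entropy(a, b)              # Phi(belief) = -H(posterior)
    shaping = []
    for obs in observations:               # obs in {0, 1}: failure / success
        a, b = a + obs, b + (1 - obs)      # conjugate posterior update
        phi_next = -beta_entropy(a, b)
        shaping.append(gamma * phi_next - phi)
        phi = phi_next
    return shaping
```

Early pulls shrink posterior entropy sharply and earn positive shaping rewards, while an informative posterior yields diminishing bonuses, which is the Bayes-optimal exploration incentive described above.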
6. Limitations and Open Questions
- PBIM and GRM require future-agnostic intrinsic reward signals; empowerment-style IM (future-dependent) violates the optimality-preserving boundary condition.
- The “reward horizon” problem in PBIM (large final correction) can dilute credit assignment; intermediate matching in GRM offers partial mitigation.
- Fully automated design of matching functions remains unresolved; environment-specific and agent-specific adaptation of delay parameters impacts performance.
- Scaling to very long-horizon sparse-reward domains (e.g., Montezuma’s Revenge) induces terminal penalties that can destabilize learning; action-dependent strategies such as ADOPS circumvent these issues but depart from pure potential-based forms (2505.12611).
7. Practical Significance and Outlook
PBIM and GRM jointly furnish a plug-and-play, theory-backed pipeline for converting IM rewards into shaping terms that retain the classical PBRS guarantee of policy invariance. They support complex, non-Markovian, and trainable IM modules, enabling their direct use in deep RL and exploration-heavy settings with minimal overhead and robust empirical improvements. Ensuring optimality preservation allows for more aggressive and reliable exploration bonuses, subject to tractable boundary conditions and careful normalization. Ongoing work seeks to relax future-agnosticity, address credit assignment in ultra-long episodes, and unify PBIM with broader adaptive exploration frameworks.
References:
- Forbes et al. (2024), “Potential-Based Intrinsic Motivation: Preserving Optimality With Complex, Non-Markovian Shaping Rewards”
- Forbes et al. (2024), “Potential-Based Reward Shaping For Intrinsic Motivation”
- Lidayan et al. (2024), “BAMDP Shaping: a Unified Framework for Intrinsic Motivation and Reward Shaping”
- “Action-Dependent Optimality-Preserving Reward Shaping” (arXiv:2505.12611)