Potential-Based Intrinsic Motivation (PBIM)
- PBIM is a reinforcement learning framework that transforms complex, often non-Markovian intrinsic rewards into shaping rewards while preserving optimal policies.
- It generalizes classical potential-based reward shaping by allowing trajectory-dependent potentials and imposing boundary conditions to prevent reward hacking.
- Empirical results show that using PBIM with Generalized Reward Matching accelerates exploration and improves convergence in sparse-reward environments.
Potential-Based Intrinsic Motivation (PBIM) is a principled framework in reinforcement learning for converting arbitrary intrinsic motivation (IM) reward streams into optimality-preserving shaping rewards. The underlying motivation is to enable complex, often trainable IM functions—frequently non-Markovian or trajectory-dependent—to safely accelerate exploration in sparse-reward environments without distorting the set of optimal policies. PBIM achieves this by generalizing classical potential-based reward shaping (PBRS) to allow history-dependent potentials and, together with the Generalized Reward Matching (GRM) construct, establishes a comprehensive family of shaping transformations that preserve policy invariance. This approach guarantees immunity to reward-hacking, is broadly compatible with contemporary IM modules, and is underpinned by rigorous theorems, boundary conditions, and explicit algorithmic corrections (Forbes et al., 2024).
1. Foundations of Potential-Based Reward Shaping
Traditional PBRS introduces a shaping term of the form $F(s_t, s_{t+1}) = \gamma \Phi(s_{t+1}) - \Phi(s_t)$, where $\Phi$ is a potential function defined over states (or state-actions in some variants) and $\gamma$ is the discount factor. Ng et al. (1999) showed that such additive shaping terms shift $Q$-values by an action-independent offset and thus leave the set of optimal policies unchanged, provided that $\Phi$ does not depend on future agent actions. Episodic extensions require a terminal correction (conventionally fixing $\Phi(s_T) = 0$) to ensure no bias at episode boundaries (Forbes et al., 2024).
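The classical shaping step can be sketched in a few lines. The snippet below (function name illustrative, not from the paper) applies $F_t = \gamma \Phi(s_{t+1}) - \Phi(s_t)$ to a reward sequence; with $\Phi(s_0) = \Phi(s_T) = 0$, the discounted sum of the shaping terms telescopes to zero, which is the action-independence property underlying policy invariance.

```python
import numpy as np

def shape_rewards(rewards, potentials, gamma=0.99):
    """Classical PBRS: add F_t = gamma * Phi(s_{t+1}) - Phi(s_t) to each reward.

    `potentials` holds Phi(s_0), ..., Phi(s_T); the terminal potential
    Phi(s_T) is conventionally set to zero so episode boundaries add no bias.
    """
    rewards = np.asarray(rewards, dtype=float)
    phi = np.asarray(potentials, dtype=float)
    F = gamma * phi[1:] - phi[:-1]   # one shaping term per transition
    return rewards + F

# The discounted sum of the shaping terms telescopes to
# gamma^T * Phi(s_T) - Phi(s_0), which is independent of the actions taken.
```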
PBIM extends classical PBRS by permitting the potential $\Phi_t$ to depend on the entire trajectory $h_t$ up to time $t$ (i.e., non-Markovian), encompassing complex, learned IM signals such as deep curiosity or density models, or trainable policy-dependent bonuses. The critical sufficient condition for optimality preservation is that the expected difference $\gamma \Phi_{t+1} - \Phi_t$ must be independent of the agent's action $a_t$ at every time step $t$ of the horizon $T$.
2. Formal Definition of PBIM and Boundary Condition
Given an episodic MDP of horizon $T$, PBIM shapes the reward as
$$r'_t = r_t + F_t, \qquad F_t = \gamma \Phi_{t+1} - \Phi_t,$$
where $b_t$ is the IM term to be converted. PBIM constructs $\Phi_t$ as a telescoping difference of the intrinsic discounted return, $\Phi_t = \gamma^{-t} \sum_{j=0}^{t-1} \gamma^j b_j$ (with $\Phi_0 = 0$). The Bellman-like identity yields
$$\gamma \Phi_{t+1} - \Phi_t = b_t \quad \text{for } 0 \le t < T-1,$$
with $\Phi_t$ a function of the full history $h_t$, accommodating non-Markovian and trainable potential functions.
To guarantee policy invariance, PBIM enforces the boundary condition $\Phi_T = 0$, so that the discounted sum of shaping terms telescopes to $\sum_{t=0}^{T-1} \gamma^t F_t = \gamma^T \Phi_T - \Phi_0 = 0$; under the assumption that each $b_t$ is future-agnostic (independent of actions taken after time $t$), this always holds when using the prescribed terminal correction. This prevents the agent from converging to suboptimal policies induced by IM.
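A minimal sketch of this construction, under the telescoping-potential assumption stated here (the function name is illustrative): each intrinsic reward $b_t$ passes through unchanged, and the final step subtracts back the full discounted intrinsic return, rescaled to time $T-1$, which is equivalent to enforcing $\Phi_T = 0$.

```python
import numpy as np

def pbim_shaping(intrinsic, gamma=0.99):
    """PBIM sketch: deliver each intrinsic reward b_t as-is, then apply a
    single terminal correction subtracting the discounted intrinsic return,
    so the discounted sum of shaping terms is exactly zero."""
    b = np.asarray(intrinsic, dtype=float)
    T = len(b)
    F = b.copy()
    # Terminal correction: subtract the discounted intrinsic return,
    # rescaled by gamma^-(T-1) so the cancellation holds in discounted terms.
    disc_return = sum(gamma ** j * b[j] for j in range(T))
    F[-1] -= disc_return / gamma ** (T - 1)
    return F
```

Note how this realizes the boundary condition: the agent still sees dense exploration bonuses along the way, but no policy can raise its total discounted shaped return above its unshaped value.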
3. Generalized Reward Matching (GRM): Optimality-Preserving Family
GRM generalizes PBIM by introducing a matching function $\alpha(t, u)$ controlling when and how much of each IM reward $b_t$ is "subtracted back" at a later time $u$. For future-agnostic $b_t$, GRM defines the shaped reward as
$$f_u = b_u - \sum_{t=0}^{u} \gamma^{t-u}\, \alpha(t, u)\, b_t,$$
with the constraints:
- $\sum_{u \ge t} \alpha(t, u) = 1$ (each $b_t$ fully matched),
- $\alpha(t, u) = 0$ for $u < t$ (no pre-subtraction).
Theorem 3 proves that GRM transformations coincide exactly with the set of all PBRS forms that preserve optimal policies (Forbes et al., 2024). GRM thus encompasses interim corrections and delays beyond the terminal adjustment used in PBIM, accommodating environment-specific credit assignment and potential normalization strategies.
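One concrete member of this family, consistent with the GRM(D) variants reported below, matches each intrinsic reward exactly $D$ steps after it is granted (clipped to the final step), with the discount rescaling $\gamma^{t-u}$ ensuring the cancellation holds in discounted terms. This is a hedged sketch, not the paper's reference implementation:

```python
import numpy as np

def grm_delay_shaping(intrinsic, gamma=0.99, delay=10):
    """GRM-style sketch with a fixed-delay matching function:
    b_t is granted at time t and fully subtracted back (discount-rescaled)
    `delay` steps later, or at the final step if the episode ends first.
    The discounted sum of shaping terms remains zero."""
    b = np.asarray(intrinsic, dtype=float)
    T = len(b)
    F = b.copy()
    for t in range(T):
        u = min(t + delay, T - 1)          # matching time: never before t
        F[u] -= gamma ** (t - u) * b[t]    # gamma^u * term cancels gamma^t * b_t
    return F
```

Shorter delays keep the correction close to the bonus that caused it (easing credit assignment); `delay` large enough to exceed the horizon recovers the PBIM terminal correction.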
4. Algorithmic Realizations and Empirical Performance
PBIM is instantiated by:
- Collecting trajectories and IM rewards $b_t$,
- Computing cumulative discounted intrinsic returns $\sum_j \gamma^j b_j$ for potential estimates,
- Applying terminal corrections (or distributed matching per GRM) to form shaped rewards,
- Normalizing intrinsic signals as needed via running means,
- Updating RL agents (e.g., PPO+LSTM, Q-learning+RND) with these shaped returns.
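The steps above can be sketched as a small per-episode routine; class and method names here are illustrative, and the normalizer (a running standard deviation via Welford accumulators) stands in for whatever scheme a given IM module uses:

```python
import numpy as np

class PBIMShaper:
    """Sketch of the PBIM pipeline step: normalize intrinsic rewards with a
    running standard deviation, then apply the terminal matching correction
    before handing shaped rewards to the RL learner."""

    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.count, self.mean, self.m2 = 0, 0.0, 0.0  # Welford accumulators

    def _normalize(self, b):
        for x in b:                        # update running variance online
            self.count += 1
            d = x - self.mean
            self.mean += d / self.count
            self.m2 += d * (x - self.mean)
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return np.asarray(b, dtype=float) / std   # scale only, keep sign

    def shape_episode(self, intrinsic):
        b = self._normalize(intrinsic)
        T = len(b)
        F = b.copy()
        # Terminal correction: zero out the discounted shaping sum.
        F[-1] -= sum(self.gamma ** j * b[j] for j in range(T)) / self.gamma ** (T - 1)
        return F
```

Because normalization is applied before the correction, the zero-sum property holds for the normalized signal, which is what the learner actually receives.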
Table: Example convergence metrics in MiniGrid DoorKey (fewer frames to convergence and fewer mean steps per episode are better):
| Method | Frames to Convergence | Mean Steps |
|---|---|---|
| No IM | N/A | 634.8 |
| Raw IM | 6.07M | 62.0 |
| PBIM+norm | 1.26M | 51.2 |
| GRM(D=10)+norm | 1.19M | 38.0 |
Empirical results confirm that PBIM and GRM restore true optimality and accelerate exploration, outperforming naive IM (which induces stalling or distraction behaviors) and yielding lower variance and improved convergence speed (Forbes et al., 2024).
5. Extension: Potential-Based Intrinsic Motivation in Bayes-Adaptive MDPs
PBIM has been extended to Bayes-Adaptive Markov Decision Processes (BAMDPs), where a “belief-state” MDP models learning over the agent’s knowledge posterior. Potential-based pseudo-rewards in this context (BAMDP Potential-Based Shaping Functions, or BAMPFs) guarantee optimal policy invariance and resist reward-hacking both in meta-RL and standard RL settings. Potentials can encode the value of information or opportunity for history-dependent signals. Classical PBIM thus becomes a special case of BAMDP shaping when the potential $\Phi$ is a function of state only (Lidayan et al., 2024).
Meta-RL and exploration-intensive tasks such as Bernoulli-bandits (potential: negative posterior entropy) and Mountain-Car (potential: proximity and velocity to goal) benefit from PBIM/BAMPF shaping, which yields Bayes-optimal exploration and improved sample efficiency without policy corruption (Lidayan et al., 2024).
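The Bernoulli-bandit potential mentioned above can be made concrete: take $\Phi(\text{belief}) = -H(\text{posterior})$ for a Beta posterior over the arm's success probability, and pay the shaping reward $\gamma \Phi' - \Phi$ after each pull. The sketch below is illustrative (function names are not from the paper); the Beta entropy is computed by a simple Riemann sum to keep the example dependency-free.

```python
import numpy as np
from math import lgamma

def beta_entropy(a, b, n=20000):
    """Differential entropy of Beta(a, b), approximated by a Riemann sum."""
    x = np.linspace(1e-6, 1 - 1e-6, n)
    logpdf = ((a - 1) * np.log(x) + (b - 1) * np.log(1 - x)
              + lgamma(a + b) - lgamma(a) - lgamma(b))
    pdf = np.exp(logpdf)
    return float(-np.sum(pdf * logpdf) * (x[1] - x[0]))

def bampf_shaping(observations, gamma=0.99):
    """BAMPF-style sketch for a one-armed Bernoulli bandit: the potential is
    the negative entropy of the Beta posterior over the success probability,
    so the shaping reward rewards information gained per pull."""
    a, b = 1.0, 1.0                        # uniform Beta(1, 1) prior
    phi = -beta_entropy(a, b)              # Phi(belief) = -H(posterior)
    shaping = []
    for obs in observations:               # obs in {0, 1}: failure / success
        a, b = a + obs, b + (1 - obs)      # conjugate posterior update
        phi_next = -beta_entropy(a, b)
        shaping.append(gamma * phi_next - phi)
        phi = phi_next
    return shaping
```

Early pulls shrink posterior entropy sharply and earn positive shaping rewards, while an informative posterior yields diminishing bonuses, which is the Bayes-optimal exploration incentive described above.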
6. Limitations and Open Questions
- PBIM and GRM require future-agnostic intrinsic reward signals; empowerment-style IM (future-dependent) violates the optimality-preserving boundary condition.
- The “reward horizon” problem in PBIM (large final correction) can dilute credit assignment; intermediate matching in GRM offers partial mitigation.
- Fully automated design of matching functions remains unresolved; environment-specific and agent-specific adaptation of delay parameters impacts performance.
- Scaling to very long-horizon sparse-reward domains (e.g., Montezuma’s Revenge) induces terminal penalties that can destabilize learning; action-dependent strategies such as ADOPS circumvent these issues but depart from pure potential-based forms (2505.12611).
7. Practical Significance and Outlook
PBIM and GRM jointly furnish a plug-and-play, theory-backed pipeline for converting IM rewards into shaping terms that retain the classical PBRS guarantee of policy invariance. They support complex, non-Markovian, and trainable IM modules, enabling their direct use in deep RL and exploration-heavy settings with minimal overhead and robust empirical improvements. Ensuring optimality preservation allows for more aggressive and reliable exploration bonuses, subject to tractable boundary conditions and careful normalization. Ongoing work seeks to relax future-agnosticity, address credit assignment in ultra-long episodes, and unify PBIM with broader adaptive exploration frameworks.
References:
- Forbes et al. (2024), “Potential-Based Intrinsic Motivation: Preserving Optimality With Complex, Non-Markovian Shaping Rewards”
- Forbes et al. (2024), “Potential-Based Reward Shaping For Intrinsic Motivation”
- Lidayan et al. (2024), “BAMDP Shaping: a Unified Framework for Intrinsic Motivation and Reward Shaping”
- “Action-Dependent Optimality-Preserving Reward Shaping” (arXiv:2505.12611)