Internal Model-Based Reward Shaping

Updated 23 April 2026

Internal model-based reward shaping is a reinforcement learning technique that uses learned or formal models to produce dense, informative rewards while preserving optimal policies.
It leverages potential-based frameworks and neural approaches, achieving up to 2× faster convergence and improved sample efficiency across various control tasks.
The method extends to Bayesian and meta-RL settings, maintaining theoretical guarantees and resilience against reward hacking in complex environments.

Internal model-based reward shaping is a class of methods in reinforcement learning (RL) that leverages learned or formal models of the environment’s dynamics, reward structure, or task constraints to produce dense, informative reward signals while either preserving the set of optimal policies or explicitly characterizing how the shaping affects it. This paradigm subsumes approaches ranging from learned potentials in neural architectures to automata-induced shaping for temporal logic objectives, and extends to Bayesian settings with belief-state planning. The following sections detail foundational principles, representative methodologies, theoretical properties, and empirical results across diverse model-based shaping frameworks.

1. Foundations of Potential-Based Reward Shaping

Internal model-based reward shaping is grounded in the potential-based reward shaping theorem (Ng et al., 1999): for any scalar “potential” function $\Phi$ over states, adding the shaping term

$F(s,a,s') = \gamma \Phi(s') - \Phi(s)$

to the environment reward preserves the set of optimal policies for all $0 \leq \gamma < 1$ (Sami et al., 2022, Li et al., 3 Feb 2026). Because the shaping terms telescope along a trajectory, their discounted sum depends only on the potentials of the start and end states, not on the path taken. The transformation therefore supplies dense feedback at every step, accelerating learning even in environments with sparse or delayed rewards.
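The telescoping property can be checked numerically. The following is a minimal sketch on a hypothetical five-state chain; the potential (negative distance to the goal), the trajectory, and the helper names are illustrative assumptions, not taken from the cited papers.

```python
import math

def shaped_return(states, rewards, phi, gamma):
    """Discounted return of r_t + gamma * phi(s_{t+1}) - phi(s_t) along one trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        f = gamma * phi(states[t + 1]) - phi(states[t])  # potential-based shaping term
        total += gamma ** t * (r + f)
    return total

def plain_return(rewards, gamma):
    """Discounted return of the unshaped rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical chain: integer states, goal at 4, sparse reward on arrival.
def phi(s):
    return -abs(4 - s)  # illustrative potential: negative distance to the goal

gamma = 0.9
states = [0, 1, 2, 3, 4]
rewards = [0.0, 0.0, 0.0, 1.0]
horizon = len(rewards)

lhs = shaped_return(states, rewards, phi, gamma)
# Telescoping: the shaping terms collapse to gamma^T * phi(s_T) - phi(s_0).
rhs = plain_return(rewards, gamma) + gamma ** horizon * phi(states[-1]) - phi(states[0])
print(math.isclose(lhs, rhs))
```

The correction depends only on the start state, the final state, and the horizon, which is why the ranking of policies from a given start state is unchanged.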

A critical insight is that $\Phi$ can be parameterized or derived using an internal model, such as a neural predictor, automaton, or belief-state function. Potential-based shaping admits extensions to stochastic transition models (Zhan et al., 2024), belief-state MDPs (Lidayan et al., 2024), and even sequence-level decompositions in LLMs via explainability methods (Koo et al., 22 Apr 2025).

2. Neural Approaches: Convolutional and Predictive Models

A salient instantiation is VIN-RS (Value Iteration Network for Reward Shaping), which encodes the shaping potential $\Phi_\theta(s) = \mathrm{CNN}_\theta(\mathrm{obs}(s))$ using a convolutional neural network trained on environment states or graph representations (Sami et al., 2022). VIN-RS integrates an implicit internal transition model: convolutional kernels $W^{\bar a}$ act as proxies for transition probabilities, facilitating internal value iteration and the construction of $\Phi_\theta$ without explicitly recovering the true environment dynamics. Training labels are generated via message passing in an HMM framework, using forward and backward messages ($\alpha$, $\beta$) as soft targets.
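To make the role of value iteration concrete, here is a minimal sketch of explicit tabular value iteration on a hypothetical deterministic chain, with the resulting values used as the potential. This is not the VIN-RS architecture, which replaces the explicit transition model with learned convolutional kernels; the environment and function names are assumptions for illustration.

```python
def value_iteration(n_states, goal, gamma=0.9, sweeps=100):
    """Tabular value iteration on a deterministic 1-D chain with actions left/right."""
    v = [0.0] * n_states
    for _ in range(sweeps):
        new_v = []
        for s in range(n_states):
            if s == goal:
                new_v.append(0.0)  # goal is absorbing with no further reward
                continue
            candidates = []
            for step in (-1, 1):
                s2 = min(max(s + step, 0), n_states - 1)  # clamp at the chain ends
                r = 1.0 if s2 == goal else 0.0            # sparse reward on reaching the goal
                candidates.append(r + gamma * v[s2])
            new_v.append(max(candidates))
        v = new_v
    return v

def shaping_term(s, s2, phi, gamma=0.9):
    """Potential-based term gamma * phi(s') - phi(s) for a tabular potential."""
    return gamma * phi[s2] - phi[s]

phi = value_iteration(n_states=6, goal=5)
forward = shaping_term(3, 4, phi)   # step toward the goal
backward = shaping_term(3, 2, phi)  # step away from the goal
```

With the converged values as potential, steps away from the goal receive a negative shaping term while steps along the optimal path do not, so a sparse task becomes one with informative feedback at every transition.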

Predictive-coding-based shaping (Lu et al., 2019) and reward-predictive representation learning (Hlynsson et al., 2021) instead construct internal models for mapping observations (often via CNNs or CPC encoders) to compact latent spaces. Reward shaping is produced either by negative prediction error bonuses (distance between predicted and observed state), by embedding distance to a goal, or via clustering in latent space to reward topologically meaningful progress.
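The embedding-distance variant can be sketched as follows. The encoder here is a fixed linear map standing in for a learned CNN or CPC encoder, and the function names are hypothetical; the bonus is a heuristic progress signal rather than a shaping term with a stated invariance guarantee.

```python
import math

def encode(obs):
    """Stand-in for a learned encoder; here a fixed linear map on 2-D observations."""
    x, y = obs
    return (0.5 * x + 0.5 * y, 0.5 * x - 0.5 * y)

def latent_distance(obs_a, obs_b):
    """Euclidean distance between two observations in latent space."""
    return math.dist(encode(obs_a), encode(obs_b))

def goal_distance_bonus(obs, next_obs, goal_obs):
    """Positive when the transition moves closer to the goal in latent space."""
    return latent_distance(obs, goal_obs) - latent_distance(next_obs, goal_obs)

bonus_toward = goal_distance_bonus((0, 0), (1, 1), goal_obs=(3, 3))
bonus_away = goal_distance_bonus((1, 1), (0, 0), goal_obs=(3, 3))
```

In practice the quality of such a bonus depends entirely on whether latent distance tracks task progress, which is what the representation-learning stage is meant to provide.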

Empirical results from these methods demonstrate substantial improvements in convergence rates and cumulative reward compared to unshaped or curiosity-driven approaches, on continuous-control (MuJoCo), visual RL (Atari), and navigation (MiniGrid) benchmarks (Sami et al., 2022, Lu et al., 2019, Hlynsson et al., 2021).

3. Model-Based Shaping in Planning and Inverse RL

In model-based RL, SLOPE (Shaping Landscapes with Optimistic Potential Estimates) constructs potential landscapes by regressing a distributional $Q$-function and extracting an optimistic potential $\Phi$ from it (Li et al., 3 Feb 2026). Optimism is enforced via a quantile-weighted cross-entropy loss that up-weights underestimation errors, ensuring high-confidence guidance even in regions where success is rare. The shaped reward for each transition is then

$\tilde{r}(s,a,s') = r(s,a,s') + \gamma \Phi(s') - \Phi(s)$

Empirically, SLOPE achieves superior performance in fully sparse and semi-sparse continuous-control domains by enabling planners to ascend the learned potential, overcoming gradient starvation.
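The effect of penalizing underestimation more heavily can be illustrated with the standard pinball (quantile) loss. This is a stand-in chosen for simplicity, not the quantile-weighted cross-entropy loss used by SLOPE; the return samples and function names are assumptions.

```python
def pinball_loss(prediction, targets, tau):
    """Mean pinball loss; for tau > 0.5, underestimation costs more than overestimation."""
    total = 0.0
    for y in targets:
        u = y - prediction
        total += tau * u if u >= 0 else (tau - 1.0) * u
    return total / len(targets)

def optimistic_estimate(targets, tau=0.9, grid_size=101):
    """Grid search for the constant prediction that minimizes the pinball loss."""
    lo, hi = min(targets), max(targets)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda p: pinball_loss(p, targets, tau))

# Hypothetical returns observed from one state: success is rare (2 of 10 rollouts).
returns = [0.0] * 8 + [1.0] * 2
mean_estimate = sum(returns) / len(returns)
optimistic = optimistic_estimate(returns, tau=0.9)
```

An estimate fitted this way sits well above the mean when successes are rare, so a planner following it is still drawn toward regions where success has occasionally been observed.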

In adversarial imitation learning, model-based reward shaping is used to address stochastic dynamics in AIRL. Here, the shaped reward integrates a learned transition model $\hat{P}(s' \mid s,a)$ by taking the expectation of the next-state potential under that model,

$F(s,a) = \gamma\, \mathbb{E}_{s' \sim \hat{P}(\cdot \mid s,a)}[\Phi(s')] - \Phi(s)$

guaranteeing policy invariance even under transition uncertainty (Zhan et al., 2024). Incorporating $\hat{P}$ enables both tighter theoretical error bounds (proportional to the total-variation distance between $\hat{P}$ and the true dynamics $P$) and improved empirical sample efficiency in noisy control tasks.
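A shaping term that averages the potential over a learned transition model, rather than using a single sampled next state, might look like the sketch below. The tabular model, the potential values, and the function name are illustrative assumptions; the cited work uses a learned Gaussian model in continuous spaces.

```python
def model_based_shaping(s, a, model, phi, gamma):
    """gamma * E_{s' ~ model(.|s,a)}[phi(s')] - phi(s) for a tabular model."""
    expected_next = sum(p * phi[s2] for s2, p in model[(s, a)].items())
    return gamma * expected_next - phi[s]

# Hypothetical stochastic transition: action "go" succeeds with probability 0.8.
model = {(0, "go"): {1: 0.8, 0: 0.2}}
phi = {0: 0.0, 1: 1.0}
gamma = 0.9

expected_term = model_based_shaping(0, "go", model, phi, gamma)
# Sample-based terms for comparison: one value per possible next state.
sampled_terms = [gamma * phi[s2] - phi[0] for s2 in model[(0, "go")]]
```

The expectation gives one deterministic value per state-action pair, whereas the sample-based terms vary with the realized next state, which is the source of noise the model-based form is meant to remove.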

4. Formal Models and Automata-Driven Shaping

When RL objectives are expressed via temporal logics ($\omega$-regular objectives), internal model-based shaping is realized by augmenting the environment MDP $\mathcal{M}$ with a Büchi automaton $\mathcal{A}$, forming the product $\mathcal{M} \times \mathcal{A}$ (Hahn et al., 2020). The shaped reward assigns geometric-sequence bonuses to each accepting transition using a discount parameter, yielding dense, temporally consistent feedback. This discount-scheduled scheme is proven to be strictly equivalent to optimizing the original $\omega$-regular objective: both undiscounted reachability in an auxiliary MDP and the discounted sum in the product $\mathcal{M} \times \mathcal{A}$ induce the same optimality conditions and convergence guarantees.
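The product construction can be sketched by tracking an automaton state alongside the environment and rewarding accepting transitions. The two-state automaton below (for "see a, then b, repeatedly") and the unit reward are hypothetical simplifications; the sketch does not reproduce the discount schedule of the cited scheme.

```python
def automaton_step(q, label):
    """Hypothetical 2-state automaton: state 0 waits for 'a', state 1 waits for 'b'.

    Returns (next_state, accepting), where accepting marks an accepting transition.
    """
    if q == 0:
        return (1, False) if label == "a" else (0, False)
    return (0, True) if label == "b" else (1, False)

def product_rewards(labels, q0=0):
    """Run the automaton over the label sequence emitted by the environment."""
    q, rewards = q0, []
    for label in labels:
        q, accepting = automaton_step(q, label)
        rewards.append(1.0 if accepting else 0.0)  # reward only accepting transitions
    return rewards

def discounted_sum(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

labels = ["a", "c", "b", "a", "b"]
rewards = product_rewards(labels)
value = discounted_sum(rewards, gamma=0.9)
```

An agent acting in the product observes both the environment state and the automaton state, so the reward stays Markovian even though the underlying objective depends on the history of labels.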

5. Reward Shaping in Bayesian-Adaptive and Meta-RL Settings

BAMDP-Shaping formalizes internal model-based shaping with belief-state representations and Bayes-Adaptive MDPs (Lidayan et al., 2024). Here, belief-state transitions and rewards are computed as expectations over the agent’s posterior on the environment model. Any shaping term $F(h,a,h') = \gamma \Phi(h') - \Phi(h)$ is guaranteed to leave the Bayes-optimal policy invariant, provided $\Phi$ is a scalar potential on histories $h$. This framework enables principled design of meta-RL shaping, including entropy-based (information value) or expected-opportunity-based potentials, and ensures resistance to reward hacking.
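A potential defined on histories can be sketched for a two-armed Bernoulli bandit with Beta posteriors. The potential used here, the negative sum of posterior variances, is a simple uncertainty proxy chosen for illustration and stands in for the entropy-based potentials mentioned above; all names are assumptions.

```python
def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    n = alpha + beta
    return alpha * beta / (n * n * (n + 1))

def potential(history):
    """history maps arm -> (successes, failures); each arm has a Beta(1, 1) prior."""
    return -sum(beta_variance(1 + s, 1 + f) for s, f in history.values())

def update(history, arm, reward):
    """Return a new history with one more observation for the given arm."""
    s, f = history[arm]
    new_history = dict(history)
    new_history[arm] = (s + 1, f) if reward else (s, f + 1)
    return new_history

def shaping(history, next_history, gamma):
    """Potential-based term on histories: gamma * Phi(h') - Phi(h)."""
    return gamma * potential(next_history) - potential(history)

h0 = {0: (0, 0), 1: (0, 0)}
h1 = update(h0, arm=0, reward=True)
bonus = shaping(h0, h1, gamma=0.95)
```

Because the bonus is a difference of potentials on histories, it rewards reductions in posterior uncertainty without changing which policy is Bayes-optimal.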

The following table summarizes key properties:

Framework	Internal Model Type	Policy-Invariance Guarantee	Empirical Domains
VIN-RS	CNN + convolutional MDP	Yes (PBRS)	Tabular, Atari, MuJoCo (Sami et al., 2022)
SLOPE	Distributional Q-function	Yes (PBRS)	ManiSkill, Meta-World (Li et al., 3 Feb 2026)
AIRL/Model-Enhanced	Learned transition model (Gaussian)	Yes (Soft PBRS + model)	MuJoCo (stochastic) (Zhan et al., 2024)
Automata-based	Formal Büchi Automaton	Yes	LTL, $\omega$-regular, RL (Hahn et al., 2020)
BAMDP-Shaping	Bayesian belief MDP	Yes	Meta-RL, bandits, MCar (Lidayan et al., 2024)
Predictive Coding	CPC/LSTM encoder	Heuristic/empirical	GridWorld, MuJoCo (Lu et al., 2019)

6. Extensions, Limitations, and Empirical Findings

Internal model-based shaping demonstrates broad applicability: dense reward induction from sparse feedback, acceleration of convergence on both classical and meta-RL tasks, and direct handling of complex objectives or stochastic environments. Notable observations include:

Model-based shaping often yields up to 2× faster convergence or significant sample efficiency gains compared with unshaped baselines (Sami et al., 2022, Li et al., 3 Feb 2026, Zhan et al., 2024, Lu et al., 2019).
CNN-based approaches rapidly infer structural aspects of MDPs without explicit transition estimation (Sami et al., 2022), while distributional Q-trained potentials enable directed exploration (Li et al., 3 Feb 2026).
In belief spaces (BAMDPs), shaping can be rigorously designed to inject only value-of-information or physical-state value, preventing bias and reward hacking (Lidayan et al., 2024).
Limitations include the need for domain- or model-specific potential parameterizations, sensitivity to overfitting in small-data regimes, difficulty scaling to high-dimensional or continuous domains, and, depending on the formulation, a shaping signal that vanishes as the policy approaches optimality.
Empirical ablations confirm that both the potential-based form and adequate optimism/training targets are essential for maximal effect (Li et al., 3 Feb 2026).

7. Applications, Generalization, and Future Directions

Internal model-based reward shaping frameworks have been applied to:

Atari games and MuJoCo continuous control (learning from visual or proprioceptive input)
Robotic manipulation, navigation, cyber-defense, and meta-RL exploration
RL from human feedback in LLMs, where explainability-based token reward attribution provably preserves optimality (Koo et al., 22 Apr 2025)
Temporal logic–driven control and omega-regular RL via automata products (Hahn et al., 2020)
Bayesian RL and meta-learning, where belief-dependent shaping unifies pseudo-reward and intrinsic motivation (Lidayan et al., 2024)

Directions for future development include designing generalized graph convolution potentials for irregular spaces, leveraging online label generation to reduce computation, and extending policy-invariant shaping architectures to richer, history-dependent objectives and non-Markovian settings (Sami et al., 2022, Lidayan et al., 2024). Expanding scalability to large action/state spaces and formalizing the use of model ensembles or uncertainty-aware shaping in high-dimensional regimes remain open challenges.

References:

(Sami et al., 2022, Li et al., 3 Feb 2026, Zhan et al., 2024, Hahn et al., 2020, Koo et al., 22 Apr 2025, Lidayan et al., 2024, Hlynsson et al., 2021, Lu et al., 2019).