
BE-based Intrinsic Reward in RL

Updated 22 May 2026
  • BE-based intrinsic reward, also known as empowerment, is an intrinsic motivation mechanism that quantifies an agent’s control over its future observations.
  • It employs a variational information bottleneck to approximate the mutual information between actions and subsequent sensor states, facilitating effective exploration.
  • Empirical studies in sparse-reward environments, such as MiniGrid tasks, show that incorporating the intrinsic bonus accelerates learning and improves agent performance.

A Bellman-error (BE)-based intrinsic reward is an intrinsic motivation mechanism in reinforcement learning that assigns additional, internally generated rewards to drive exploration or skill acquisition, chiefly by quantifying the influence an agent’s actions have on its future sensor states. The most thoroughly investigated form of BE-based intrinsic reward, also known as empowerment, formalizes this intuition as the mutual information between the agent’s actions and its future observations. In practical algorithms, empowerment-based bonuses augment or replace sparse or absent extrinsic rewards, giving the agent a structured incentive for behavioral diversity and effective exploration in high-dimensional or otherwise challenging environments.

1. Formal Definition of BE-based (“Empowerment”) Intrinsic Reward

Empowerment at time $t$ in environment state $e_t$ is defined as the channel capacity from action choices to subsequent sensor states:

$$\mathfrak{E}(e_t) = \max_{p(A_t|e_t)} I\bigl(A_t; S_{t+1} \mid e_t \bigr) = \max_{p(A_t|e_t)} \mathbb{E}_{a_t,s_{t+1}} \left[ \log \frac{p(s_{t+1}|a_t, e_t)}{p(s_{t+1}|e_t)} \right]$$

where $A_t$ denotes the action sequence chosen in $e_t$ and $S_{t+1}$ is the stochastic next sensor state. $I(A_t; S_{t+1} \mid e_t)$ is the conditional mutual information, measuring the potential control, or “empowerment”, provided by choosing different $A_t$ given $e_t$. This formalizes intrinsic motivation as the maximization of the agent’s influence over its perceptual future (Massari et al., 2021).
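For a small discrete problem, the channel capacity in the definition above can be computed exactly. The following is a minimal sketch using the standard Blahut-Arimoto iteration; it is not the method of Massari et al. (2021), and the function name and the toy transition tables are illustrative assumptions.

```python
import numpy as np

def _kl_per_action(channel, marginal):
    """KL(p(s'|a) || p(s')) for every action, using the convention 0*log 0 = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(channel > 0, np.log(channel / marginal), 0.0)
    return (channel * log_ratio).sum(axis=1)

def empowerment_blahut_arimoto(channel, iters=500, tol=1e-12):
    """Channel capacity (in nats) of p(s'|a) for one fixed state e_t.

    channel[a, s] = p(s_{t+1}=s | a_t=a, e_t); each row must sum to 1.
    Returns (capacity, capacity-achieving action distribution).
    """
    channel = np.asarray(channel, dtype=float)
    p_a = np.full(channel.shape[0], 1.0 / channel.shape[0])  # uniform start
    for _ in range(iters):
        kl = _kl_per_action(channel, p_a @ channel)
        new_p = p_a * np.exp(kl)      # reweight actions by their informativeness
        new_p /= new_p.sum()
        converged = np.abs(new_p - p_a).max() < tol
        p_a = new_p
        if converged:
            break
    capacity = float(p_a @ _kl_per_action(channel, p_a @ channel))
    return capacity, p_a

# Four actions that each lead deterministically to a distinct sensor state.
free_agent = np.eye(4)
# Three actions, two of which lead to the same sensor state.
blocked_agent = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

cap_free, _ = empowerment_blahut_arimoto(free_agent)
cap_blocked, _ = empowerment_blahut_arimoto(blocked_agent)
print(cap_free, cap_blocked)
```

The agent with four distinguishable outcomes has higher empowerment than the one whose actions partly collapse onto the same outcome, which is the sense in which empowerment measures control over future sensor states.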

2. Variational Information-Bottleneck Approximation

Direct computation of empowerment is intractable in practical RL settings due to the enormity of state and action spaces. Recent methods employ a variational information bottleneck (VIB) approximation:

  • Construct two networks: one predicts $s_{t+1}$ from the current state alone, the other from the current state together with the action $a_t$.
  • Each predictor is parameterized by an encoder and a decoder and is trained by minimizing a VIB loss:

[equation not recoverable from the source]

  • The empowerment-based intrinsic bonus at time $t$ is computed from the difference between the two predictors’ losses:

[equation not recoverable from the source]

  • Under the VIB bound, the difference accumulates the information about $a_t$ in predicting $s_{t+1}$. This approximates a lower bound of one-step empowerment (Massari et al., 2021).
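The idea that a difference of prediction losses captures the information actions carry about the next state can be checked on a toy problem. The sketch below assumes the bonus is the negative log-likelihood of an action-blind predictor minus that of an action-aware predictor. Count-based tables stand in for the VIB networks, and the transition table and the name `intrinsic_bonus` are hypothetical.

```python
import numpy as np

def intrinsic_bonus(nll_without_action, nll_with_action):
    """Per-transition bonus: reduction in prediction loss from knowing a_t."""
    return np.asarray(nll_without_action) - np.asarray(nll_with_action)

# Hypothetical toy channel p(s'|a) for one fixed state; rows are actions.
channel = np.array([[0.9, 0.1, 0.0],
                    [0.1, 0.8, 0.1],
                    [0.0, 0.1, 0.9]])
rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 3, size=n)  # uniform random policy
next_states = np.array([rng.choice(3, p=channel[a]) for a in actions])

# Count-based stand-ins for the two learned predictors.
joint = np.zeros((3, 3))
np.add.at(joint, (actions, next_states), 1.0)
q_with_action = joint / joint.sum(axis=1, keepdims=True)  # q(s'|a)
q_without_action = joint.sum(axis=0) / n                  # q(s')

nll_with = -np.log(q_with_action[actions, next_states])
nll_without = -np.log(q_without_action[next_states])
bonus = intrinsic_bonus(nll_without, nll_with)

# Analytic mutual information I(A; S') under the same uniform policy.
p_a = np.full(3, 1.0 / 3.0)
marginal = p_a @ channel
joint_p = p_a[:, None] * channel
ratio = channel / marginal  # marginal broadcasts across rows
mask = channel > 0
true_mi = float(np.sum(joint_p[mask] * np.log(ratio[mask])))

print(bonus.mean(), true_mi)
```

Individual bonuses can be negative, but their average tracks the mutual information between action and next state under the behavior policy. That quantity is at most the empowerment, which additionally maximizes over action distributions.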

3. Algorithmic Integration with RL Solvers

BE-based intrinsic rewards are commonly incorporated into the RL loop by modifying the agent’s reward at each timestep:

$$r_t = r_t^{\mathrm{ext}} + \beta\, r_t^{\mathrm{int}}$$

where $r_t^{\mathrm{ext}}$ is the extrinsic reward, $r_t^{\mathrm{int}}$ is the empowerment-based intrinsic reward, and $\beta$ is a hyperparameter tuning the contribution scale. Policy and value function updates are performed using the total reward $r_t$ via a standard loss (e.g., Advantage Actor-Critic, A2C):

  • Rollouts of fixed length are collected; at each frame, the intrinsic reward is computed from the predictors’ losses.
  • Returns for policy gradient updates are computed as multi-step returns of the modified reward.
  • Predictor networks are updated after each rollout using the latest transitions.
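The reward modification and return computation in the steps above can be sketched as follows. This assumes an additive combination of extrinsic and scaled intrinsic reward and a bootstrapped multi-step return, as is common in A2C implementations. The function names, the `beta` and `gamma` values, and the example numbers are illustrative, not taken from the source.

```python
import numpy as np

def total_rewards(extrinsic, intrinsic, beta):
    """Combine extrinsic and intrinsic rewards (additive form is an assumption)."""
    return np.asarray(extrinsic, dtype=float) + beta * np.asarray(intrinsic, dtype=float)

def multistep_returns(rewards, dones, bootstrap_value, gamma):
    """Discounted returns over one rollout, bootstrapped from the critic's
    value estimate of the state that follows the last frame."""
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        # A terminal frame cuts off everything that comes after it.
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns

# Three-frame rollout; the episode ends on the last frame.
rewards = total_rewards(extrinsic=[0.0, 0.0, 1.0], intrinsic=[0.5, 0.5, 0.0], beta=0.1)
returns = multistep_returns(rewards, dones=[0.0, 0.0, 1.0], bootstrap_value=10.0, gamma=0.5)
print(rewards, returns)
```

The returns would then serve as targets for the critic and, after subtracting the critic's value estimates, as advantages for the policy gradient.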

Empirical studies employ a variety of architectures, e.g., gridworld observations encoded as binary tensors and processed by fully connected encoders, and optimize both the RL and VIB components with Adam or similar optimizers (Massari et al., 2021).

4. Empirical Evaluation and Performance

In environments with sparse extrinsic rewards, BE-based intrinsic rewards improve exploratory behavior. Specifically, experiments on MiniGrid tasks (MultiRoom-N3-S4, DoorKey-8x8, KeyCorridor-S3R1) reported mean time-to-success (the number of frames until the average reward first reaches a success threshold) and showed that agents with empowerment-based intrinsic rewards (“power”) frequently outperformed Actor-Critic baselines and, in some settings, matched “curious” (surprise-based) agents. Statistically significant improvements were observed in two of three benchmark tasks, with substantial reductions in episodes required for successful navigation (Massari et al., 2021).
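A time-to-success metric of this kind can be computed from a logged learning curve. The sketch below assumes "success" means the logged average reward is at or above a threshold. The threshold value, the comparison direction, and the handling of runs that never succeed are assumptions, and both function names are hypothetical.

```python
from statistics import mean

def time_to_success(frames, avg_rewards, threshold):
    """Frame count at which the logged average reward first reaches the
    threshold, or None if it never does."""
    for frame, reward in zip(frames, avg_rewards):
        if reward >= threshold:
            return frame
    return None

def mean_time_to_success(runs, threshold):
    """Average over the runs that succeeded; also report how many did."""
    times = [time_to_success(f, r, threshold) for f, r in runs]
    solved = [t for t in times if t is not None]
    return (mean(solved) if solved else None), len(solved)

# Two hypothetical runs logged every 1000 frames.
runs = [
    ([1000, 2000, 3000], [0.1, 0.6, 0.9]),
    ([1000, 2000, 3000], [0.0, 0.2, 0.7]),
]
avg_frames, n_solved = mean_time_to_success(runs, threshold=0.5)
print(avg_frames, n_solved)
```

How unsolved runs are treated (dropped here) can change the reported mean considerably, so it is worth stating alongside the numbers.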

5. Relation to Other Intrinsic Reward Mechanisms

Intrinsic motivation in RL is broadly categorized as:

Intrinsic Reward Type    | Mechanism                             | Examples
State-novelty-based      | Tracks visitation counts/novelty      | Count-based, RND, pseudo-count
Prediction-error-based   | Predictive loss or error              | Curiosity, ICM, GIRM
Entropy-based            | Maximizes state/action entropy        | RE3 (Shannon), RISE (Rényi)
BE-based (“Empowerment”) | Maximizes control over future sensors | Empowerment, VIB power agent

Empowerment is distinguished from prediction-error or Shannon entropy methods in that it directly quantifies the agent’s causal influence over its next state, rather than novelty or unpredictability per se (Yuan, 2022).

6. Hyperparameterization and Implementation Details

Critical parameters governing implementation include:

  • Rollout length: typically [value not recoverable] frames
  • Discount factor: [value not recoverable]
  • Intrinsic reward strength: domain-dependent (e.g., [value not recoverable] for MultiRoom, DoorKey)
  • Network architectures: encoder/decoder layer sizes vary by environment; sample latent dim: 256 (MultiRoom), 16 (DoorKey), 128 (KeyCorridor)
  • Optimizer: Adam, learning rate [value not recoverable], batch size 128 for predictor updates
  • Activation functions: ReLU (hidden layers), linear (latent), sigmoid (output)
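To make the listed architecture choices concrete, the following is a minimal forward-pass sketch of one encoder/decoder predictor with ReLU hidden layers, a linear Gaussian latent, and a sigmoid output. The loss is a generic VIB objective (binary cross-entropy plus a weighted KL term to a standard normal prior), not necessarily the exact loss of Massari et al. (2021). The hidden width, input size, `kl_weight`, and all function names are assumptions, and training with Adam is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_layer(rng, n_in, n_out):
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def init_predictor(rng, n_in, n_hidden, n_latent, n_out):
    return {
        "enc_hidden": init_layer(rng, n_in, n_hidden),
        "enc_mu": init_layer(rng, n_hidden, n_latent),
        "enc_logvar": init_layer(rng, n_hidden, n_latent),
        "dec_hidden": init_layer(rng, n_latent, n_hidden),
        "dec_out": init_layer(rng, n_hidden, n_out),
    }

def dense(x, layer):
    weights, bias = layer
    return x @ weights + bias

def vib_loss(params, x, target, rng, kl_weight=1e-3):
    """Per-sample generic VIB loss: reconstruction BCE + kl_weight * KL."""
    h = relu(dense(x, params["enc_hidden"]))
    mu = dense(h, params["enc_mu"])          # linear latent mean
    logvar = dense(h, params["enc_logvar"])  # linear latent log-variance
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterize
    h_dec = relu(dense(z, params["dec_hidden"]))
    pred = np.clip(sigmoid(dense(h_dec, params["dec_out"])), 1e-7, 1.0 - 1e-7)
    bce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).sum(axis=1)
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=1)
    return bce + kl_weight * kl

rng = np.random.default_rng(0)
obs_dim = 32  # hypothetical size of the flattened binary observation
params = init_predictor(rng, n_in=obs_dim, n_hidden=64, n_latent=16, n_out=obs_dim)
batch = rng.integers(0, 2, size=(128, obs_dim)).astype(float)         # s_t
next_batch = rng.integers(0, 2, size=(128, obs_dim)).astype(float)    # s_{t+1}
losses = vib_loss(params, batch, next_batch, rng)
print(losses.shape, losses.mean())
```

An action-aware predictor would use the same structure with the action (for example, one-hot encoded) concatenated to the input, so that the two per-sample losses can be compared transition by transition.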

Ten independent runs per environment, each with up to [value not recoverable] training frames, are used in the reported experiments (Massari et al., 2021).

7. Interpretation, Impact, and Limitations

BE-based intrinsic rewards offer a theoretically principled method for driving agent exploration via maximization of sensorimotor influence. The reported MiniGrid results support their effectiveness, especially in sparse-reward domains. Comparison with alternative intrinsic rewards suggests BE-based methods are competitive, though sensitive to implementation details and hyperparameters. Limitations include the additional computational cost of the predictor networks and a possible gap between the VIB approximation and true empowerment. A plausible implication is that BE-based intrinsic motivation is best suited to scenarios where agent control or influence is a primary exploration bottleneck, complementing rather than replacing novelty-based schemes (Massari et al., 2021).
