BE-based Intrinsic Reward in RL

Updated 22 May 2026

BE-based intrinsic reward, also known as empowerment, is an intrinsic motivation mechanism that quantifies an agent’s control over its future observations.
It employs a variational information bottleneck to approximate the mutual information between actions and subsequent sensor states, facilitating effective exploration.
Empirical studies in sparse-reward environments, such as MiniGrid tasks, show that incorporating the intrinsic bonus accelerates learning and improves agent performance.

A Bellman-error (BE)-based intrinsic reward is an intrinsic motivation mechanism in reinforcement learning that assigns additional, internally generated rewards to drive exploration or skill acquisition, chiefly by quantifying the influence an agent’s actions have on its future sensor states. The most thoroughly investigated form of BE-based intrinsic reward, also known as empowerment, operationalizes this intuition as the mutual information between the agent’s actions and its future observations. In practical algorithms, empowerment-based bonuses augment or replace sparse or absent extrinsic rewards, providing a structured incentive for behavioral diversity and effective exploration in high-dimensional or otherwise challenging environments.

1. Formal Definition of BE-based (“Empowerment”) Intrinsic Reward

Empowerment at time $t$ in environment state $e_t$ is defined as the channel capacity from action choices to subsequent sensor states:

$\mathfrak{E}(e_t) = \max_{p(A_t|e_t)} I\bigl(A_t; S_{t+1} \mid e_t \bigr) = \max_{p(A_t|e_t)} \mathbb{E}_{a_t,s_{t+1}} \left[ \log \frac{p(s_{t+1}|a_t, e_t)}{p(s_{t+1}|e_t)} \right]$

where $A_t$ denotes the action chosen in $e_t$ and $S_{t+1}$ is the stochastic next sensor state. $I(A_t;S_{t+1}|e_t)$ is the conditional mutual information, which measures the potential control, or “empowerment”, that the choice of $A_t$ provides given $e_t$. The maximization ranges over action distributions $p(A_t|e_t)$, so empowerment is a property of the state rather than of any particular policy. This operationalizes intrinsic motivation as the maximization of the agent’s influence over its perceptual future (Massari et al., 2021).
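For a small discrete problem the channel capacity in this definition can be computed exactly. The sketch below is illustrative only and is not taken from the cited work: it assumes the transition table `channel[a, s']` $= p(s_{t+1}|a_t, e_t)$ for one fixed state $e_t$ is known, and it uses the classical Blahut-Arimoto iteration to carry out the maximization over $p(A_t|e_t)$. The function names are hypothetical.

```python
import numpy as np

def _log_ratio(p_a, channel):
    """log p(s'|a, e) / p(s'|e), set to 0 where p(s'|a, e) = 0."""
    p_s = p_a @ channel  # marginal p(s'|e) under the action distribution
    ratio = np.where(channel > 0, channel / np.maximum(p_s, 1e-300), 1.0)
    return np.log(ratio)

def mutual_information(p_a, channel):
    """I(A; S' | e) in nats for a given action distribution."""
    return float(np.sum(p_a[:, None] * channel * _log_ratio(p_a, channel)))

def empowerment(channel, iters=200):
    """Blahut-Arimoto: maximize I(A; S' | e) over the action distribution."""
    n_actions = channel.shape[0]
    p_a = np.full(n_actions, 1.0 / n_actions)  # start from uniform actions
    for _ in range(iters):
        # d[a] = KL( p(s'|a, e) || p(s'|e) ); reweight actions by exp(d[a])
        d = np.sum(channel * _log_ratio(p_a, channel), axis=1)
        p_a = p_a * np.exp(d)
        p_a /= p_a.sum()
    return mutual_information(p_a, channel), p_a

# Four actions, each leading deterministically to a distinct next state.
controllable = np.eye(4)
emp_controllable, _ = empowerment(controllable)

# Four actions that all induce the same next-state distribution.
uncontrollable = np.full((4, 2), 0.5)
emp_uncontrollable, _ = empowerment(uncontrollable)
```

In the first case the agent can select any of four outcomes, so the empowerment equals $\log 4$ nats; in the second the actions have no effect on the next state and the empowerment is zero.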

2. Variational Information-Bottleneck Approximation

Direct computation of empowerment is intractable in practical RL settings because of the size of the state and action spaces. Recent methods employ a variational information bottleneck (VIB) approximation:

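Construct two networks: one predicts $s_{t+1}$ from $e_t$ alone, the other from the pair $(e_t, a_t)$.
Each predictor is parameterized by an encoder $p_\theta(z \mid x)$ and a decoder $q_\phi(s_{t+1} \mid z)$, where $x$ is the predictor's input and $z$ its latent code, and is trained by minimizing the VIB loss, written here in its standard form:

$\mathcal{L}_{\mathrm{VIB}}(x, s_{t+1}) = \mathbb{E}_{z \sim p_\theta(z \mid x)}\bigl[-\log q_\phi(s_{t+1} \mid z)\bigr] + \beta\, D_{\mathrm{KL}}\bigl(p_\theta(z \mid x) \,\|\, r(z)\bigr)$

where $r(z)$ is a fixed prior over the latent code and $\beta$ weights the compression term.

The empowerment-based intrinsic bonus at time $t$ is the difference between the two predictors' losses:

$r^{\mathrm{int}}_t = \mathcal{L}_{\mathrm{VIB}}(e_t, s_{t+1}) - \mathcal{L}_{\mathrm{VIB}}\bigl((e_t, a_t), s_{t+1}\bigr)$

Under the VIB bound, the difference captures the information that $a_t$ contributes to predicting $s_{t+1}$, mirroring the log-ratio $\log \frac{p(s_{t+1}|a_t, e_t)}{p(s_{t+1}|e_t)}$ in the definition of empowerment. This approximates a lower bound on one-step empowerment (Massari et al., 2021).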
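The two-predictor construction can be sketched in a few lines of NumPy. This is a forward-pass illustration under stated assumptions, not the cited implementation: the one-hidden-layer layout, the Gaussian encoder with a standard-normal prior, the Bernoulli (sigmoid) decoder, the KL weight `beta`, and all names (`init_predictor`, `vib_loss`) are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def init_predictor(in_dim, hidden_dim, latent_dim, out_dim, rng):
    """Random weights for a one-hidden-layer VIB predictor (assumed layout)."""
    scale = 0.1
    return {
        "W_h": scale * rng.standard_normal((in_dim, hidden_dim)),
        "W_mu": scale * rng.standard_normal((hidden_dim, latent_dim)),
        "W_lv": scale * rng.standard_normal((hidden_dim, latent_dim)),
        "W_out": scale * rng.standard_normal((latent_dim, out_dim)),
    }

def vib_loss(x, target, params, beta, rng):
    """Per-sample VIB loss: Bernoulli NLL + beta * KL(encoder || N(0, I))."""
    h = np.maximum(x @ params["W_h"], 0.0)   # ReLU hidden layer
    mu = h @ params["W_mu"]                  # linear latent mean
    log_var = h @ params["W_lv"]             # linear latent log-variance
    # Reparameterized sample z ~ N(mu, diag(exp(log_var)))
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    logits = z @ params["W_out"]
    # Negative log-likelihood of binary targets under a sigmoid output,
    # in the numerically stable form softplus(logits) - target * logits
    nll = np.sum(np.logaddexp(0.0, logits) - target * logits, axis=-1)
    # Closed-form KL between the diagonal Gaussian encoder and N(0, I)
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return nll + beta * kl

rng = np.random.default_rng(0)
batch, state_dim, n_actions = 5, 8, 3
e_t = rng.integers(0, 2, size=(batch, state_dim)).astype(float)
s_next = rng.integers(0, 2, size=(batch, state_dim)).astype(float)
a_onehot = np.eye(n_actions)[rng.integers(0, n_actions, size=batch)]

state_only = init_predictor(state_dim, 32, 16, state_dim, rng)
state_action = init_predictor(state_dim + n_actions, 32, 16, state_dim, rng)

loss_without_action = vib_loss(e_t, s_next, state_only, beta=1e-3, rng=rng)
loss_with_action = vib_loss(
    np.concatenate([e_t, a_onehot], axis=1), s_next, state_action,
    beta=1e-3, rng=rng,
)
# Intrinsic bonus: how much knowing the action lowers the prediction loss
bonus = loss_without_action - loss_with_action
```

With untrained weights the bonus values carry no meaning; the point is only the shape of the computation. Once both predictors are trained, a state in which the action strongly determines $s_{t+1}$ should yield a larger bonus than one in which the action is irrelevant.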

3. Algorithmic Integration with RL Solvers

BE-based intrinsic rewards are commonly incorporated into the RL loop by modifying the agent’s reward at each timestep:

$r_t = r^{\mathrm{ext}}_t + \eta\, r^{\mathrm{int}}_t$

where $r^{\mathrm{ext}}_t$ is the extrinsic reward, $r^{\mathrm{int}}_t$ is the empowerment-based intrinsic reward, and $\eta$ is a hyperparameter that scales the intrinsic contribution. Policy and value function updates are performed using the total reward $r_t$ via a standard loss (e.g., Advantage Actor-Critic, A2C):

Rollouts of fixed length are collected; at each frame, the intrinsic reward is computed from the predictors’ losses.
Returns for policy gradient updates are computed as multi-step returns of the modified reward.
Predictor networks are updated after each rollout using the latest transitions.
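The reward modification and the multi-step returns described in these steps can be sketched as follows. This is a minimal illustration, not the cited implementation: the names `mix_rewards`, `multistep_returns`, `eta`, and `bootstrap_value` are hypothetical, and the numbers are made up for the example.

```python
def mix_rewards(r_ext, r_int, eta):
    """Total reward per step: extrinsic reward plus eta times intrinsic bonus."""
    return [re + eta * ri for re, ri in zip(r_ext, r_int)]

def multistep_returns(rewards, bootstrap_value, gamma, dones=None):
    """Discounted returns over one rollout, bootstrapped from a value estimate.

    dones[t] = True means the episode ended at step t, so nothing after
    step t is allowed to leak into the return for step t.
    """
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        if dones is not None and dones[t]:
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A three-step rollout with a sparse extrinsic reward at the last step.
r_ext = [0.0, 0.0, 1.0]
r_int = [0.5, 0.5, 0.0]
total = mix_rewards(r_ext, r_int, eta=0.1)  # [0.05, 0.05, 1.0]
returns = multistep_returns(total, bootstrap_value=0.0, gamma=0.5)
```

In an A2C-style update, these returns minus the critic's value estimates would give the advantages used in the policy gradient.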

Empirical studies employ a variety of architectures, e.g., gridworld observations encoded as binary tensors and processed by fully connected encoders, and optimize both the RL and VIB components with Adam or similar optimizers (Massari et al., 2021).

4. Empirical Evaluation and Performance

In environments with sparse extrinsic rewards, BE-based intrinsic rewards improve exploratory behavior. Specifically, experiments on MiniGrid tasks (MultiRoom-N3-S4, DoorKey-8x8, KeyCorridor-S3R1) reported mean time-to-success, measured as the number of frames until the average reward first reaches a success threshold, and showed that agents with empowerment-based intrinsic rewards (“power”) frequently outperformed Actor-Critic baselines and, in some settings, matched “curious” (surprise-based) agents. Statistically significant improvements were observed in two of the three benchmark tasks, with substantial reductions in episodes required for successful navigation (Massari et al., 2021).

5. Relation to Other Intrinsic Reward Mechanisms

Intrinsic motivation in RL is broadly categorized as:

Intrinsic Reward Type	Mechanism	Examples
State-novelty-based	Tracks visitation counts/novelty	Count-based, RND, pseudo-count
Prediction-error-based	Predictive loss or error	Curiosity, ICM, GIRM
Entropy-based	Maximizes state/action entropy	RE3 (Shannon), RISE (Rényi)
BE-based (“Empowerment”)	Maximizes control over future sensors	Empowerment, VIB power agent

Empowerment is distinguished from prediction-error or Shannon entropy methods in that it directly quantifies the agent’s causal influence over its next state, rather than novelty or unpredictability per se (Yuan, 2022).
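A toy calculation makes the distinction concrete. The sketch below is an illustration under assumed transition tables, not an experiment from the cited papers: it compares a "noisy" state, where the next state is uniformly random whatever the agent does, with a "controllable" state, where each action selects its own next state. For simplicity it evaluates the mutual information under uniformly random actions, which is a lower bound on empowerment rather than the maximized quantity; the function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mi_uniform_actions(channel):
    """I(A; S') = H(S') - H(S'|A) with actions drawn uniformly at random."""
    n_actions = channel.shape[0]
    p_a = np.full(n_actions, 1.0 / n_actions)
    p_s = p_a @ channel
    cond_entropy = sum(p_a[a] * entropy(channel[a]) for a in range(n_actions))
    return entropy(p_s) - cond_entropy

uniform_actions = np.full(4, 0.25)

# Next state is uniformly random regardless of the action taken.
noisy = np.full((4, 4), 0.25)
# Each action deterministically selects a distinct next state.
controllable = np.eye(4)

h_noisy = entropy(uniform_actions @ noisy)
h_controllable = entropy(uniform_actions @ controllable)
mi_noisy = mi_uniform_actions(noisy)
mi_controllable = mi_uniform_actions(controllable)
```

Both states produce the same next-state entropy ($\log 4$ nats), so a signal based purely on unpredictability cannot tell them apart, whereas the action-to-state mutual information is zero for the noisy state and $\log 4$ for the controllable one.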

6. Hyperparameterization and Implementation Details

Critical parameters governing implementation include:

Rollout length: a fixed number of frames per rollout (value as reported in Massari et al., 2021)
Discount factor: $\gamma$ (value as reported in Massari et al., 2021)
Intrinsic reward strength: $\eta$ (domain-dependent; set per environment, e.g., for MultiRoom and DoorKey)
Network architectures: encoder/decoder layer sizes vary by environment; sample latent dimensions: 256 (MultiRoom), 16 (DoorKey), 128 (KeyCorridor)
Optimizer: Adam (learning rate as reported in Massari et al., 2021), batch size 128 for predictor updates
Activation functions: ReLU (hidden layers), linear (latent), sigmoid (output)

Ten independent runs per environment, each capped at a maximum number of training frames, are used in the reported experiments (Massari et al., 2021).
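One way to keep these settings together is a small configuration object. The sketch below is a hypothetical layout, not the cited code: the class and field names are invented for illustration, only values stated in the list above are filled in, and every setting whose value is not given here is left as `None` so that it has to be supplied explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class PowerAgentConfig:
    env_name: str
    latent_dim: int
    predictor_batch_size: int = 128
    optimizer: str = "adam"
    hidden_activation: str = "relu"
    latent_activation: str = "linear"
    output_activation: str = "sigmoid"
    num_runs: int = 10
    # Values not stated in the text above; fill in before training.
    rollout_length: Optional[int] = None
    gamma: Optional[float] = None
    intrinsic_scale: Optional[float] = None
    learning_rate: Optional[float] = None

    def missing_fields(self) -> List[str]:
        """Names of the settings that still need a value."""
        names = ("rollout_length", "gamma", "intrinsic_scale", "learning_rate")
        return [name for name in names if getattr(self, name) is None]

configs = {
    "MultiRoom-N3-S4": PowerAgentConfig("MultiRoom-N3-S4", latent_dim=256),
    "DoorKey-8x8": PowerAgentConfig("DoorKey-8x8", latent_dim=16),
    "KeyCorridor-S3R1": PowerAgentConfig("KeyCorridor-S3R1", latent_dim=128),
}
```

Calling `configs["DoorKey-8x8"].missing_fields()` lists the settings that must be taken from the original paper before a run can be reproduced.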

7. Interpretation, Impact, and Limitations

BE-based intrinsic rewards offer a theoretically principled method for driving agent exploration via maximization of sensorimotor influence. The MiniGrid results support their effectiveness in sparse-reward domains. Comparison with alternative intrinsic rewards suggests BE-based methods are competitive, though sensitive to implementation details and hyperparameters. Limitations include the additional computational cost of training two predictor networks and a possible gap between the VIB approximation and true empowerment. A plausible implication is that BE-based intrinsic motivation is best suited to scenarios where agent control or influence is a primary exploration bottleneck, complementing rather than replacing novelty-based schemes (Massari et al., 2021).


References

Massari et al. (2021). Experimental Evidence that Empowerment May Drive Exploration in Sparse-Reward Environments.

Yuan (2022). Intrinsically-Motivated Reinforcement Learning: A Brief Introduction.
