BE-based Intrinsic Reward in RL
- BE-based intrinsic reward, also known as empowerment, is an intrinsic motivation mechanism that quantifies an agent’s control over its future observations.
- It employs a variational information bottleneck to approximate the mutual information between actions and subsequent sensor states, facilitating effective exploration.
- Empirical studies in sparse-reward environments, such as MiniGrid tasks, show that incorporating the intrinsic bonus accelerates learning and improves agent performance.
A Bellman-error (BE)-based intrinsic reward is an intrinsic motivation mechanism in reinforcement learning that assigns additional, internally generated rewards to drive exploration or skill acquisition, chiefly by quantifying the influence an agent’s actions have on its future sensor states. The most thoroughly investigated form, also known as empowerment, operationalizes this intuition as the mutual information between the agent’s actions and its future observations. In practical algorithms, empowerment-based bonuses augment or replace sparse or absent extrinsic rewards, providing a structured incentive for behavioral diversity and exploration in high-dimensional or otherwise challenging environments.
1. Formal Definition of BE-based (“Empowerment”) Intrinsic Reward
Empowerment at time $t$ in environment state $e_t$ is defined as the channel capacity from action choices to subsequent sensor states:

$$\mathfrak{E}(e_t) = \max_{p(a_t)} I(A_t; S_{t+1} \mid e_t)$$

where $A_t$ denotes the action sequence chosen in $e_t$ and $S_{t+1}$ is the stochastic next sensor state. $I(A_t; S_{t+1} \mid e_t)$ is the conditional mutual information, measuring the potential control or “empowerment” provided by choosing different $a_t$ given $e_t$. This operationalizes intrinsic motivation as the maximization of the agent’s influence over its perceptual future (Massari et al., 2021).
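For a small discrete problem, the channel capacity in this definition can be computed numerically. The sketch below uses the Blahut-Arimoto iteration, a standard capacity algorithm and not the method of Massari et al. (2021). The function name `channel_capacity_bits` and the layout of the transition table (one row per action, one column per next sensor state, for a single fixed state) are assumptions made for illustration.

```python
import math


def channel_capacity_bits(p_next_given_action, max_iters=500, tol=1e-12):
    """Blahut-Arimoto estimate of max_{p(a)} I(A; S') in bits.

    p_next_given_action[a][s] = P(S' = s | A = a) for one fixed state
    (hypothetical table layout, chosen for this sketch).
    Returns (capacity_in_bits, maximizing_action_distribution).
    """
    n_actions = len(p_next_given_action)
    n_states = len(p_next_given_action[0])
    p_a = [1.0 / n_actions] * n_actions  # start from uniform p(a)
    capacity = 0.0
    for _ in range(max_iters):
        # Marginal over next sensor states induced by the current p(a).
        p_s = [
            sum(p_a[a] * p_next_given_action[a][s] for a in range(n_actions))
            for s in range(n_states)
        ]
        # d[a] = KL( P(S'|a) || P(S') ) in nats.
        d = []
        for a in range(n_actions):
            kl = 0.0
            for s in range(n_states):
                p = p_next_given_action[a][s]
                if p > 0.0:
                    kl += p * math.log(p / p_s[s])
            d.append(kl)
        # Multiplicative update: p(a) <- p(a) * exp(d[a]) / Z.
        weights = [p_a[a] * math.exp(d[a]) for a in range(n_actions)]
        z = sum(weights)
        new_capacity = math.log(z)  # lower bound on capacity, in nats
        p_a = [w / z for w in weights]
        converged = abs(new_capacity - capacity) < tol
        capacity = new_capacity
        if converged:
            break
    return capacity / math.log(2), p_a


# Four actions, each reaching a distinct next sensor state: full control.
deterministic = [[1.0 if s == a else 0.0 for s in range(4)] for a in range(4)]
# Four actions that all induce the same outcome distribution: no control.
uncontrollable = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]

print(round(channel_capacity_bits(deterministic)[0], 6))   # 2.0
print(round(channel_capacity_bits(uncontrollable)[0], 6))  # 0.0
```

The two toy channels bracket the range: a state where actions fully determine the next sensor state has empowerment equal to the log of the number of distinguishable outcomes, and a state where actions make no difference has zero empowerment.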
2. Variational Information-Bottleneck Approximation
Direct computation of empowerment is intractable in practical RL settings due to the enormity of state and action spaces. Recent methods employ a variational information bottleneck (VIB) approximation:
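- Construct two networks: one predicts the next sensor state $s_{t+1}$ from the current sensor state $s_t$ alone, the other from the pair $(s_t, a_t)$.
- Each predictor is parameterized by an encoder $q_\phi(z \mid x)$ and a decoder $p_\theta(s_{t+1} \mid z)$, where $x$ is that predictor’s input, and is trained by minimizing the VIB loss, with $r(z)$ a fixed prior over the latent code and $\beta$ the weight of the compression term:

$$\mathcal{L}_{\mathrm{VIB}} = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[-\log p_\theta(s_{t+1} \mid z)\right] + \beta\, D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, r(z)\right)$$

- The empowerment-based intrinsic bonus at time $t$ is the loss of the state-only predictor minus the loss of the state-action predictor on the same transition:

$$r^{\mathrm{int}}_t = \mathcal{L}_{s}(t) - \mathcal{L}_{s,a}(t)$$

- Under the VIB bound, the difference captures the information about $a_t$ that helps in predicting $s_{t+1}$. This approximates a lower bound of one-step empowerment (Massari et al., 2021).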
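The loss difference described in the list above can be sketched on plain Python lists. The specifics here are assumptions for illustration rather than details confirmed by the source: a Bernoulli reconstruction term (matching binary observations and sigmoid outputs), a diagonal Gaussian encoder with a standard normal prior, and the hypothetical names `bernoulli_nll`, `gaussian_kl_to_standard_normal`, `vib_loss`, and `intrinsic_bonus`.

```python
import math


def bernoulli_nll(target_bits, predicted_probs, eps=1e-7):
    """Negative log-likelihood of a binary target under per-bit probabilities."""
    nll = 0.0
    for y, p in zip(target_bits, predicted_probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        nll -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return nll


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def vib_loss(target_bits, predicted_probs, mu, log_var, beta):
    """Reconstruction term plus beta-weighted compression term (assumed form)."""
    return bernoulli_nll(target_bits, predicted_probs) + beta * (
        gaussian_kl_to_standard_normal(mu, log_var)
    )


def intrinsic_bonus(loss_state_only, loss_state_action):
    """Bonus = loss without the action minus loss with the action."""
    return loss_state_only - loss_state_action


# Toy transition: the action-aware predictor is far more confident.
next_obs = [1, 0]
loss_s = vib_loss(next_obs, [0.5, 0.5], mu=[0.0, 0.0], log_var=[0.0, 0.0], beta=0.1)
loss_sa = vib_loss(next_obs, [0.9, 0.1], mu=[0.0, 0.0], log_var=[0.0, 0.0], beta=0.1)
print(intrinsic_bonus(loss_s, loss_sa) > 0)  # True
```

The bonus is large when knowing the action sharply improves the prediction of the next sensor state, and close to zero when the action carries no extra information. In a real agent the probabilities, means, and log-variances would come from the trained encoder and decoder networks.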
3. Algorithmic Integration with RL Solvers
BE-based intrinsic rewards are commonly incorporated into the RL loop by modifying the agent’s reward at each timestep:
$$r_t = r^{\mathrm{ext}}_t + \eta\, r^{\mathrm{int}}_t$$

where $r^{\mathrm{ext}}_t$ is the extrinsic reward, $r^{\mathrm{int}}_t$ is the empowerment-based intrinsic reward, and $\eta$ is a hyperparameter that scales the intrinsic contribution. Policy and value function updates use the total reward $r_t$ with a standard actor-critic loss (e.g., Advantage Actor-Critic, A2C):
- Rollouts of fixed length are collected; at each frame, the intrinsic reward is computed from the predictors’ losses.
- Returns for policy gradient updates are computed as multi-step returns of the modified reward.
- Predictor networks are updated after each rollout using the latest transitions.
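The rollout steps above can be sketched as two small helpers: one mixes extrinsic and intrinsic rewards, the other computes bootstrapped multi-step returns over a fixed-length rollout. The names `mix_rewards` and `multi_step_returns`, the `intrinsic_scale` argument, and the numeric values in the usage lines are hypothetical and chosen for illustration, not taken from Massari et al. (2021).

```python
def mix_rewards(extrinsic, intrinsic, intrinsic_scale):
    """Total reward per frame: extrinsic + intrinsic_scale * intrinsic."""
    return [r_ext + intrinsic_scale * r_int
            for r_ext, r_int in zip(extrinsic, intrinsic)]


def multi_step_returns(rewards, bootstrap_value, gamma, dones=None):
    """Discounted returns for one rollout, computed backwards in time.

    bootstrap_value is the critic's value estimate for the state that
    follows the last frame; dones[t] is True if the episode ended at t.
    """
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        if dones is not None and dones[t]:
            running = 0.0  # nothing to bootstrap from after a terminal frame
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(returns, values):
    """A2C-style advantage: multi-step return minus the critic's estimate."""
    return [g - v for g, v in zip(returns, values)]


# Illustrative numbers only.
total = mix_rewards([0.0, 0.0, 1.0], [0.2, 0.4, 0.0], intrinsic_scale=0.5)
print(total)  # [0.1, 0.2, 1.0]
print(multi_step_returns([0.0, 0.0, 1.0], bootstrap_value=0.0, gamma=0.5))
# [0.25, 0.5, 1.0]
```

Computing returns on the mixed reward means the critic learns a value function for the combined objective, so the intrinsic bonus shapes both the policy gradient and the baseline.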
Empirical studies employ a variety of architectures, for example encoding gridworld observations as binary tensors that feed fully connected encoders, and optimize both the RL and VIB components with Adam or similar optimizers (Massari et al., 2021).
4. Empirical Evaluation and Performance
In environments with sparse extrinsic rewards, BE-based intrinsic rewards improve exploratory behavior. Experiments on MiniGrid tasks (MultiRoom-N3-S4, DoorKey-8x8, KeyCorridor-S3R1) reported mean time-to-success, measured as the number of frames until the average reward first reaches a success threshold. Agents with empowerment-based intrinsic rewards (“power”) frequently outperformed Actor-Critic baselines and, in some settings, matched “curious” (surprise-based) agents. Statistically significant improvements were observed in two of the three benchmark tasks, with substantial reductions in the episodes required for successful navigation (Massari et al., 2021).
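A time-to-success metric of this kind can be sketched as follows. The averaging window, the threshold value, and the function name `frames_to_success` are assumptions for illustration; the source does not specify how the average is taken.

```python
from collections import deque


def frames_to_success(episodes, threshold, window):
    """Frame count at which the windowed mean return first reaches threshold.

    episodes: iterable of (frame_at_episode_end, episode_return) pairs in
    chronological order. Returns None if the threshold is never reached.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` returns
    for end_frame, episode_return in episodes:
        recent.append(episode_return)
        if len(recent) == window and sum(recent) / window >= threshold:
            return end_frame
    return None


# Illustrative training log, not data from the cited experiments.
log = [(100, 0.0), (250, 0.9), (400, 0.95), (520, 0.97)]
print(frames_to_success(log, threshold=0.9, window=2))  # 400
```

Averaging over a window rather than using a single episode guards against counting one lucky episode as success; the right window size depends on how noisy the returns are.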
5. Relation to Other Intrinsic Reward Mechanisms
Intrinsic motivation in RL is broadly categorized as:
| Intrinsic Reward Type | Mechanism | Examples |
|---|---|---|
| State-novelty-based | Tracks visitation counts/novelty | Count-based, RND, pseudo-count |
| Prediction-error-based | Predictive loss or error | Curiosity, ICM, GIRM |
| Entropy-based | Maximizes state/action entropy | RE3 (Shannon), RISE (Rényi) |
| BE-based (“Empowerment”) | Maximizes control over future sensors | Empowerment, VIB power agent |
Empowerment is distinguished from prediction-error or Shannon entropy methods in that it directly quantifies the agent’s causal influence over its next state, rather than novelty or unpredictability per se (Yuan, 2022).
6. Hyperparameterization and Implementation Details
Critical parameters governing implementation include:
- Rollout length: typically 5 frames
- Discount factor: $\gamma$
- Intrinsic reward strength: $\eta$ (domain-dependent, e.g., a shared setting for MultiRoom and DoorKey)
- Network architectures: encoder/decoder layer sizes vary by environment; example latent dimensions are 256 (MultiRoom), 16 (DoorKey), and 128 (KeyCorridor)
- Optimizer: Adam with learning rate $\alpha$; batch size 128 for predictor updates
- Activation functions: ReLU (hidden layers), linear (latent), sigmoid (output)
Ten independent runs per environment, each with a fixed maximum number of training frames, are used in reported experiments (Massari et al., 2021).
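One way to keep these predictor settings together is a small configuration object. The sketch below records only the values stated in the list above (latent dimensions, batch size, activations, optimizer) and leaves the learning rate unset so that it must be supplied explicitly. The class name `PredictorConfig` and the helpers `predictor_config` and `unset_fields` are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

# Latent dimensions per environment family, as listed in the text.
LATENT_DIMS = {"MultiRoom": 256, "DoorKey": 16, "KeyCorridor": 128}


@dataclass(frozen=True)
class PredictorConfig:
    latent_dim: int
    batch_size: int = 128
    hidden_activation: str = "relu"
    latent_activation: str = "linear"
    output_activation: str = "sigmoid"
    optimizer: str = "adam"
    # Deliberately unset in this sketch; the caller supplies a value.
    learning_rate: Optional[float] = None


def predictor_config(env_family: str) -> PredictorConfig:
    """Build a config for one of the environment families in LATENT_DIMS."""
    return PredictorConfig(latent_dim=LATENT_DIMS[env_family])


def unset_fields(cfg: PredictorConfig) -> List[str]:
    """Names of fields that still need a value before training."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]


cfg = predictor_config("DoorKey")
print(cfg.latent_dim)     # 16
print(unset_fields(cfg))  # ['learning_rate']
```

Making the dataclass frozen prevents accidental mutation between runs, which helps when results from several independent runs are compared.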
7. Interpretation, Impact, and Limitations
BE-based intrinsic rewards offer a theoretically principled method for driving agent exploration via maximization of sensorimotor influence. Empirical results confirm effectiveness, especially in sparse-reward domains. Comparison with alternative intrinsic rewards suggests BE-based methods are competitive, though sensitive to implementation details and hyperparameters. Limitations include additional computational cost from predictor networks and possible approximation gaps in VIB frameworks. A plausible implication is that BE-based intrinsic motivation is best suited to scenarios where agent control or influence is a primary exploration bottleneck, complementing rather than replacing novelty-based schemes (Massari et al., 2021).