- The paper presents an energy-based transfer method that selectively integrates teacher guidance to prevent negative transfer in reinforcement learning.
- It employs energy scores for out-of-distribution detection to decide when teacher advice is warranted, so that guidance does not bias the student's exploration in unfamiliar states.
- Empirical evaluations demonstrate significant sample efficiency gains and improved performance across both single-task and multi-task settings.
This paper introduces Energy-Based Transfer Learning (EBTL), a principled method for selective teacher-student transfer in reinforcement learning (RL). EBTL addresses the central challenge of negative transfer, which arises when a teacher policy pretrained on a source task offers guidance on a target-task distribution it has never encountered. The method leverages energy-based out-of-distribution (OOD) detection to determine when the teacher's intervention is warranted, realized via a theoretically grounded score function derived from the energy of the teacher's policy network, which correlates with the teacher's empirical state-visitation density.
Motivation and Theoretical Justification
Transfer in RL commonly suffers from a lack of selectivity: teacher advice issued outside its state distribution can bias student exploration into regions that are irrelevant or even detrimental to learning optimal behavior. EBTL counters this by monitoring a quantitative energy score, only activating teacher advice for in-distribution (ID) states—those similar to its prior experience.
The core theoretical insight is that, under standard energy-based models, the negative free energy computed from the teacher's policy network is proportional to the logarithm of the policy's state-visitation density. This is the score EBTL monitors, and it ties guidance directly to teacher familiarity: states with high energy (i.e., low score) are likely OOD, and guidance in such regions is suppressed. The result is formally supported by the connection between energy-based model densities and RL trajectory distributions.
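To make the score concrete, the following is a minimal sketch of a negative free energy computed from a discrete-action policy head, in the style of standard energy-based OOD scoring; the temperature parameter and the assumption that the teacher exposes unnormalised action logits are illustrative rather than details taken from the paper.

```python
import torch

def free_energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Negative free energy of a discrete-action policy head.

    logits: [batch, num_actions] unnormalised action scores from the teacher.
    Higher values correspond to states the teacher visited often
    (in-distribution); lower values suggest OOD states.
    """
    return temperature * torch.logsumexp(logits / temperature, dim=-1)
```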
Method
The EBTL algorithm is structured as follows:
- At each step, the student's current state is passed through the teacher's policy network to compute its negative free energy score.
- If this score exceeds a pre-computed threshold (a quantile of the scores observed on the teacher's training states), and a stochastic decay schedule permits, the student executes the teacher's proposed action; otherwise, it acts according to its own policy.
- Energy regularization is incorporated into teacher training to widen the separation between ID and OOD states, penalizing margin violations on in-sample and out-of-sample energy scores (a sketch of such a loss follows this list).
- The method employs off-policy corrections (via importance sampling ratios) to ensure that student updates remain valid under mixed-policy data.
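As a sketch of the energy-regularization term from the list above: the squared-hinge margin form and the specific margin values below are assumptions modeled on common energy-bounded regularizers, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def energy_margin_loss(score_id: torch.Tensor,
                       score_ood: torch.Tensor,
                       m_id: float = 1.0,
                       m_ood: float = -1.0) -> torch.Tensor:
    """Push in-distribution scores above m_id and OOD scores below m_ood.

    Scores are negative free energies (higher = more familiar), so the loss
    penalizes ID states whose score falls below m_id and OOD states whose
    score rises above m_ood.
    """
    loss_id = F.relu(m_id - score_id).pow(2).mean()
    loss_ood = F.relu(score_ood - m_ood).pow(2).mean()
    return loss_id + loss_ood
```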
A summary of the student update procedure is given by the following policy selection rule:
    if energy_score(state) ≥ τ and p < δ(t):
        action ← teacher_policy(state)
    else:
        action ← student_policy(state)
where τ is the energy threshold, p is a per-step random draw (e.g., uniform in [0, 1]), and δ(t) is a linearly decaying probability that controls how often guidance is issued.
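For concreteness, one way τ and δ(t) could be obtained is sketched below; the quantile level, initial guidance probability, and schedule length are illustrative assumptions rather than values reported in the paper.

```python
import numpy as np

def energy_threshold(teacher_scores: np.ndarray, quantile: float = 0.1) -> float:
    """tau: a low quantile of negative-free-energy scores collected on the
    teacher's own training states; states scoring at or above tau are
    treated as in-distribution."""
    return float(np.quantile(teacher_scores, quantile))

def guidance_probability(step: int, total_steps: int, p0: float = 1.0) -> float:
    """delta(t): guidance probability that decays linearly to zero."""
    return max(0.0, p0 * (1.0 - step / total_steps))
```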
Empirical Evaluation
EBTL is empirically evaluated on several single-task and multi-task RL settings, including navigation tasks in Minigrid and multi-recipe cooking tasks in Overcooked. The experiments are designed to induce various degrees of covariate shift between teacher and student state distributions.
Findings highlight:
- Robust Sample Efficiency Gains: Across all transfer scenarios, EBTL consistently achieves greater sample efficiency and higher final returns compared to standard baselines (No Transfer, vanilla Action Advising, parameter Fine-Tuning, Kickstarting RL, and JumpStart RL).
- Selective Guidance: The algorithm adaptively regulates guidance issuance, maximizing positive transfer when the teacher's experience is relevant and effectively suppressing it otherwise. This ability is most pronounced under moderate covariate shift, where teacher familiarity is neither ubiquitous nor entirely absent.
- Threshold Tuning: There exists an optimal energy quantile threshold balancing the benefits and risks of guidance. Excessively low thresholds result in harmful guidance in unfamiliar states; too high, and valuable advice is withheld.
- Energy Regularization Effectiveness: Augmenting the teacher's training with energy separation loss significantly improves the discrimination between ID and OOD states, yielding further improvements in transfer efficacy.
Representative empirical results demonstrate, for instance, that in navigation tasks with alternating goals, EBTL correctly assigns higher energy scores—and thus issues advice—solely in goal-room configurations encountered during teacher training. In Overcooked, under increased task and layout variation, EBTL retains stable sample efficiency, unlike baselines which degrade under greater distributional shift.
Limitations and Future Implications
While the EBTL framework demonstrates robust transfer performance, several practical caveats remain:
- Threshold Sensitivity: The approach requires specifying an energy threshold (a quantile), which may be domain- or task-specific. Although empirical analysis suggests the method is not highly sensitive to the exact value, fully unsupervised or adaptive thresholding merits further investigation.
- Type of Shift Addressed: The method is primarily suitable for situations dominated by covariate shift. Scenarios with significant label/reward function shift, where optimal actions differ despite similar states, are not directly addressed.
- Requirement for OOD Samples: Energy regularization during the teacher's training assumes access to OOD samples representative of the target state space, which may not always be available in practical settings.
Theoretically, EBTL elegantly extends the use of energy-based models from supervised OOD detection to RL transfer, introducing a quantifiable mechanism for selective advice issuance. Practically, it can be implemented atop modern PPO algorithms with minimal computational overhead, using standard neural architectures.
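To illustrate the off-policy correction mentioned in the Method section, here is a minimal sketch of a mixed-policy importance weight for rollout steps where the teacher acted; storing the behavior policy's log-probabilities and clipping the ratio are assumptions, not details reported in the paper.

```python
import torch

def mixed_policy_ratio(student_logp: torch.Tensor,
                       behavior_logp: torch.Tensor,
                       max_ratio: float = 10.0) -> torch.Tensor:
    """Importance weight pi_student(a|s) / pi_behavior(a|s).

    behavior_logp is the log-probability under whichever policy actually
    acted (teacher on guided steps, student otherwise); the ratio is
    clipped to limit variance in the student's update.
    """
    return torch.exp(student_logp - behavior_logp).clamp(max=max_ratio)
```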
Broader Implications and Future Directions
EBTL provides a template for the broader integration of uncertainty quantification and density estimation in RL transfer mechanisms. Potential extensions include:
- Adaptive or learned thresholding to further automate the selectivity mechanism.
- Generalization to label-shifted domains by composing state-action visitation densities or exploiting meta-learning.
- Combining EBTL with reward shaping or imitation learning for hybrid transfer schemes.
Overall, the paradigm of energy-based selective guidance offers a promising direction for improving transfer robustness in RL, particularly as multi-task and continual learning scenarios become more prevalent in real-world applications.