Learning a Potential-Based Shaping Term Useful Throughout Training
Develop a procedure to learn a potential function Φ(s) for potential-based reward shaping such that, when added to a recovered reward r to form r′(s,a,s′) = r(s,a) + γΦ(s′) − Φ(s), the shaped reward accelerates policy training from scratch throughout the entire training process, including early stages when the policy is still weak.
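The following is a minimal, self-contained sketch of potential-based shaping on a toy chain MDP. It is not the procedure asked for above; it only illustrates the mechanics: a hypothetical choice of potential (here Φ is set to the optimal state value of the base MDP, a common illustrative pick) and a check of the classic policy-invariance property of shaping terms of the form γΦ(s′) − Φ(s). The MDP, reward, and all function names are assumptions for illustration.

```python
import numpy as np

# Toy deterministic chain MDP (illustrative assumption): states 0..4,
# actions {0: left, 1: right}, reward +1 for entering terminal state 4.
N, GAMMA = 5, 0.9

def step(s, a):
    s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if (s2 == N - 1 and s != N - 1) else 0.0
    return s2, r

def value_iteration(reward_fn, iters=200):
    V = np.zeros(N)
    for _ in range(iters):
        for s in range(N - 1):  # state N-1 is terminal, V stays 0 there
            V[s] = max(reward_fn(s, a) + GAMMA * V[step(s, a)[0]] for a in (0, 1))
    return V

# Hypothetical potential: Phi(s) = optimal state value of the base MDP.
V_star = value_iteration(lambda s, a: step(s, a)[1])
Phi = V_star.copy()

def shaped_reward(s, a):
    s2, r = step(s, a)
    # r'(s, a, s') = r(s, a) + gamma * Phi(s') - Phi(s)
    return r + GAMMA * Phi[s2] - Phi[s]

V_shaped = value_iteration(shaped_reward)

def greedy(V, reward_fn):
    # Greedy policy w.r.t. a value function and a one-step reward model.
    return [max((0, 1), key=lambda a: reward_fn(s, a) + GAMma_V(V, s, a))
            for s in range(N - 1)]

def GAMma_V(V, s, a):
    return GAMMA * V[step(s, a)[0]]

# Potential-based shaping leaves the optimal policy unchanged.
print(greedy(V_star, lambda s, a: step(s, a)[1]) == greedy(V_shaped, shaped_reward))
```

With Φ = V*, the shaped reward equals the advantage of each action, so the shaped optimal values collapse to zero while the greedy policy is preserved; this is why shaping can speed learning without changing what is learned, and why the open challenge is choosing a Φ that helps at every stage of training, not just near convergence.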
References
Thus, in practice, we are left with an open question. Challenge 3: In practice, how do we learn a potential-based shaping term that is useful throughout the course of training from scratch?
— EvIL: Evolution Strategies for Generalisable Imitation Learning
(arXiv:2406.11905, Sapora et al., 15 Jun 2024), Section “Reward-Centric Challenges of Efficient IRL”, Challenge 3