Learn Potential-Based Shaping That Is Useful Throughout Training

Develop a procedure to learn a potential function Φ(s) for potential-based reward shaping such that the shaped reward r′(s,a) = r(s,a) + Φ(s′) − Φ(s), formed by augmenting a recovered reward r, accelerates policy training from scratch throughout the entire course of training, including the early stages when the initial policy is still weak.
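
As a concrete illustration (not part of the original problem statement), the sketch below shows how a learned potential could be plugged into this shaping rule. The helper shaped_reward and the linear potential phi are hypothetical placeholders; gamma defaults to 1 to mirror the formula above, while the standard discounted form of Ng et al. (1999) uses γΦ(s′) − Φ(s).

```python
import numpy as np

def shaped_reward(r, s, s_next, phi, gamma=1.0):
    """Shaped reward r'(s, a) = r(s, a) + gamma * phi(s_next) - phi(s).

    gamma defaults to 1 to mirror the formula in the problem statement;
    the standard discounted form uses gamma * phi(s_next). Any shaping
    term of this form leaves the optimal policy unchanged.
    """
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical stand-in for a learned potential Phi(s): a fixed linear
# function of state features. The open question is how to *learn* such a
# potential so that it accelerates training from scratch at every stage.
w = np.array([0.5, -0.2])
phi = lambda s: float(w @ np.asarray(s, dtype=float))

print(shaped_reward(r=1.0, s=[0.0, 1.0], s_next=[1.0, 1.0], phi=phi))
```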

Background

Potential-based reward shaping preserves optimal policies but can dramatically reduce the effective planning horizon, thus improving interaction efficiency. In theory, setting Φ to the optimal value function V* would yield a one-step greedy optimal policy, but computing or approximating V* uniformly over the state space is infeasible in realistic settings.
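
To make the V* remark concrete, the following minimal tabular sketch (an illustrative assumption, not taken from the paper) uses a small deterministic chain MDP: with Φ set to V*, the one-step shaped reward equals the advantage Q*(s,a) − V*(s), so greedily maximising it already picks the optimal action in every state.

```python
import numpy as np

# Tiny deterministic chain MDP: states 0..3, goal state 3 is terminal,
# reward 1 only on the transition into the goal, discount GAMMA.
GAMMA, GOAL = 0.9, 3

def step(s, a):                       # a = 0: left, a = 1: right
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    r = 1.0 if (s != GOAL and s_next == GOAL) else 0.0
    return s_next, r

# Value iteration to obtain V*, the "ideal" potential.
V = np.zeros(GOAL + 1)
for _ in range(100):
    for s in range(GOAL):
        V[s] = max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in (0, 1))

# With Phi = V*, the shaped one-step reward r + GAMMA * V*(s') - V*(s) is the
# advantage Q*(s, a) - V*(s): zero for optimal actions, negative otherwise,
# so a one-step greedy policy on the shaped reward is already optimal.
for s in range(GOAL):
    shaped = [step(s, a)[1] + GAMMA * V[step(s, a)[0]] - V[s] for a in (0, 1)]
    print(f"state {s}: shaped rewards (left, right) = {np.round(shaped, 3)}")
```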

Given the practical difficulty of obtaining an admissible heuristic or an expert critic that remains accurate beyond the expert’s state distribution, the authors highlight the need for a method that learns a shaping term that stays helpful across the full course of training when starting from scratch.

References

Thus, in practice, we are left with an open question. Challenge 3: In practice, how do we learn a potential-based shaping term that is useful throughout the course of training from scratch?

EvIL: Evolution Strategies for Generalisable Imitation Learning (2406.11905 - Sapora et al., 15 Jun 2024) in Section “Reward-Centric Challenges of Efficient IRL”, Challenge 3