
KL regularization sufficiency for missing reward components

Determine whether, in Human-Regularized PPO for multi-agent autonomous driving, Kullback–Leibler divergence regularization between the behavioral cloning reference policy and the reinforcement learning policy can compensate for missing components of the (unknown) ground-truth reward function, and, if so, specify the assumptions and conditions under which such compensation yields sound convergence behavior.


Background

The paper proposes Human-Regularized PPO (HR-PPO), which augments self-play PPO with a KL regularization term that nudges the learned policy toward a behavioral cloning policy trained on limited human demonstrations.
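Below is a minimal sketch, not the paper's implementation, of how such a KL penalty toward a frozen behavioral cloning (BC) policy can be added to a PPO-style policy loss. The function name, discrete-action setup, penalty coefficient `kl_coef`, and the KL direction (learned policy toward BC policy) are illustrative assumptions.

```python
# Sketch of a PPO clipped surrogate loss with a KL penalty toward a frozen BC policy.
# All names, shapes, and the coefficient value are assumptions for illustration.
import torch
import torch.nn.functional as F


def hr_ppo_policy_loss(logits_rl, logits_bc, actions, advantages, old_log_probs,
                       clip_eps=0.2, kl_coef=0.01):
    """PPO clipped surrogate loss plus a KL(pi_RL || pi_BC) penalty (discrete actions)."""
    log_probs_rl = F.log_softmax(logits_rl, dim=-1)
    log_probs_bc = F.log_softmax(logits_bc, dim=-1).detach()  # BC reference policy is frozen

    # Standard PPO clipped surrogate objective.
    new_log_probs = log_probs_rl.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)

    # Per-state KL(pi_RL || pi_BC), averaged over the batch, nudging the RL policy toward BC.
    kl = (log_probs_rl.exp() * (log_probs_rl - log_probs_bc)).sum(dim=-1)

    return -surrogate.mean() + kl_coef * kl.mean()
```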

Unlike some game-theoretic applications where the ground-truth reward is available, the driving setting lacks an explicit, fully specified reward capturing human conventions. The authors rely on imitation learning to implicitly supply these missing reward components, which raises the theoretical question of whether KL regularization can substitute for them.
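One way to make this concern precise is to rewrite the per-state KL penalty as an implicit reward term. The identity below is standard and not a result from the paper; the penalty coefficient λ is an assumed notation.

$$
-\lambda\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid s)\,\big\|\,\pi_{\mathrm{BC}}(\cdot\mid s)\big)
\;=\; \lambda\,\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\!\big[\log \pi_{\mathrm{BC}}(a\mid s)\big]
\;+\; \lambda\,\mathcal{H}\big(\pi_\theta(\cdot\mid s)\big).
$$

Under this view, the penalty behaves like a per-state reward bonus of λ log π_BC(a|s) plus an entropy term, and the open question is whether this implicit bonus can stand in for the unspecified components of the ground-truth reward while still admitting sound convergence guarantees.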

The authors explicitly note unresolved theoretical questions about the soundness of relying on KL loss when the true reward function is unavailable, motivating a formal investigation of the conditions under which KL regularization compensates for missing reward components.

References

Finally, there remain unresolved theoretical questions about the soundness of this approach. In contrast to other works applying this type of regularization in the game literature, we do not have access to the ground truth reward function. As such, we are relying on imitation learning to implicitly complete these portions of the reward. It is not clear if the KL loss used can compensate for these missing terms.

Human-compatible driving partners through data-regularized self-play reinforcement learning (2403.19648 - Cornelisse et al., 28 Mar 2024) in Conclusion and future work