KL regularization sufficiency for missing reward components
Determine whether, in Human-Regularized PPO for multi-agent autonomous driving, Kullback–Leibler divergence regularization between the behavioral cloning reference policy and the reinforcement learning policy can compensate for missing components of the (unknown) ground-truth reward function. If so, specify the assumptions and conditions under which such compensation yields sound convergence behavior.
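One way to make the question concrete is to note that a KL penalty toward a reference policy is equivalent to shaping the reward with the reference policy's log-likelihood. Below is a sketch of this standard decomposition, written with \lambda as the regularization weight, \pi_{\mathrm{BC}} as the behavioral cloning reference policy, and the divergence taken as D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{BC}}); the exact form and KL direction used in Human-Regularized PPO may differ:

J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_t \gamma^t\, r(s_t, a_t)\Big] \;-\; \lambda\, \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_t \gamma^t\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_{\mathrm{BC}}(\cdot \mid s_t)\big)\Big]

Expanding D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_{\mathrm{BC}}(\cdot \mid s)\big) = \mathbb{E}_{a \sim \pi_\theta}\big[\log \pi_\theta(a \mid s) - \log \pi_{\mathrm{BC}}(a \mid s)\big] gives

J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_t \gamma^t \big(r(s_t, a_t) + \lambda \log \pi_{\mathrm{BC}}(a_t \mid s_t)\big)\Big] \;+\; \lambda\, \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_t \gamma^t\, \mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big)\Big],

i.e., maximum-entropy RL on the shaped reward \tilde{r}(s,a) = r(s,a) + \lambda \log \pi_{\mathrm{BC}}(a \mid s). In this form, the question becomes whether \lambda \log \pi_{\mathrm{BC}}(a \mid s) aligns with the missing components of the ground-truth reward closely enough, for a fixed \lambda, that optimizing J(\theta) converges to behavior that is also near-optimal under the true reward.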
References
Finally, there remain unresolved theoretical questions about the soundness of this approach. In contrast to other works applying this type of regularization in the game literature, we do not have access to the ground-truth reward function. As such, we rely on imitation learning to implicitly supply the missing portions of the reward, and it is not clear whether the KL loss used can compensate for these missing terms.
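To make "the KL loss used" concrete, the following is a minimal PyTorch sketch of a PPO policy loss augmented with a KL penalty toward a frozen behavioral cloning reference policy, mirroring the decomposition above. Function and argument names (hr_ppo_loss, kl_coef, and so on) are illustrative rather than taken from the paper or its code, and a discrete action space is assumed:

import torch
from torch.distributions import Categorical, kl_divergence

def hr_ppo_loss(logits_rl, logits_bc, actions, old_log_probs, advantages,
                clip_eps=0.2, kl_coef=0.1):
    # Current RL policy and frozen behavioral cloning reference policy,
    # both over a discrete action space (illustrative sketch).
    pi_rl = Categorical(logits=logits_rl)
    pi_bc = Categorical(logits=logits_bc.detach())  # reference is not updated

    # Standard PPO clipped surrogate objective.
    log_probs = pi_rl.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # KL(pi_RL || pi_BC): the regularizer whose ability to stand in for
    # missing reward components is the open question.
    kl_penalty = kl_divergence(pi_rl, pi_bc).mean()

    return policy_loss + kl_coef * kl_penalty

Here kl_coef plays the role of \lambda above; whether any fixed value of it can substitute for the unobserved reward terms is precisely what remains unresolved.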