Closing the agent generalization gap via stronger imitation policies

Determine whether increasing the capability of the behavioral cloning reference policy, for example through more complex imitation methods such as Generative Adversarial Imitation Learning or stronger architectures such as Diffusion Policies, suffices to close the agent generalization gap observed when policies trained in self-play are evaluated against unseen human drivers in log-replay.

Background

The paper documents an agent generalization gap: policies trained via self-play (both PPO and HR-PPO) perform worse when evaluated against unseen human drivers in log-replay than when evaluated in self-play.

HR-PPO narrows this gap relative to PPO, but a residual difference remains. The authors hypothesize that further improving the imitation policy used for regularization might close the gap, while explicitly flagging this as uncertain.
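
To make the mechanism concrete, the sketch below shows a PPO-style policy loss with an added KL penalty that pulls the learned policy toward a frozen behavioral-cloning (BC) reference, which is the general form of data-regularization used by HR-PPO. It assumes a discrete action space; the function name, tensor shapes, and the exact penalty (a per-step KL with weight kl_weight) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch, not the authors' code: PPO clipped surrogate loss plus a
# KL(pi_theta || pi_BC) penalty toward a frozen behavioral-cloning reference policy.
import torch
import torch.nn.functional as F

def regularized_policy_loss(logits, old_logits, bc_logits, actions, advantages,
                            clip_eps=0.2, kl_weight=0.1):
    """All logits have shape [B, A]; actions is [B] (long); advantages is [B]."""
    log_probs = F.log_softmax(logits, dim=-1)
    old_log_probs = F.log_softmax(old_logits, dim=-1).detach()
    bc_log_probs = F.log_softmax(bc_logits, dim=-1).detach()

    # Standard PPO clipped surrogate objective.
    lp_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    old_lp_a = old_log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = torch.exp(lp_a - old_lp_a)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )

    # KL divergence from the current policy to the BC reference, keeping the
    # agent's action distribution close to human-like behavior.
    kl_to_bc = (log_probs.exp() * (log_probs - bc_log_probs)).sum(dim=-1)

    # Negated because optimizers minimize: maximize surrogate, penalize KL.
    return -(surrogate - kl_weight * kl_to_bc).mean()
```

Under this framing, the open question is whether making the reference distribution behind bc_logits more capable (e.g., via GAIL or a Diffusion Policy) is by itself enough to close the remaining gap.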

They specifically mention more complex imitation methods (e.g., GAIL) and stronger architectures (e.g., Diffusion Policies) as potential avenues, but note that it is not yet known whether these would be sufficient.

References

Additionally, it is still to be seen if the agent generalization gap can be closed simply by increasing the capability of the BC policy using more complex imitation methods such as GAIL~\citep{ho2016generative} or better architectures such as Diffusion Policies~\citep{chi2023diffusion}.

Human-compatible driving partners through data-regularized self-play reinforcement learning (2403.19648 - Cornelisse et al., 28 Mar 2024) in Conclusion and future work