DPO’s ability to output implicit rewards that induce policies outside the KL neighborhood
Determine whether Direct Preference Optimization (DPO), when minimizing its loss over a parametric policy class on pairwise preferences generated under a Bradley–Terry–Luce model, can actually output a reward function from the policy-induced implicit reward family (i.e., of the form r_{θ,β}(s,a) = β · log[π_θ(a|s) / π_{θ0}(a|s)]) that is ε-close to the true reward yet induces a policy whose distribution lies outside the KL neighborhood of the base policy.
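For concreteness, the sketch below (not taken from the cited paper) spells out the two quantities the question refers to, assuming per-response log-probabilities under π_θ and the base policy π_{θ0} are available: the policy-induced implicit reward and the per-pair DPO loss, which is the negative Bradley–Terry–Luce log-likelihood of the observed preference under those implicit rewards. All function and variable names are illustrative.

```python
import math

def implicit_reward(logp_theta, logp_ref, beta):
    """Implicit reward r_{theta,beta}(s,a) = beta * log[pi_theta(a|s) / pi_theta0(a|s)],
    computed from the two per-response log-probabilities."""
    return beta * (logp_theta - logp_ref)

def dpo_pair_loss(logp_theta_w, logp_theta_l, logp_ref_w, logp_ref_l, beta):
    """Per-pair DPO loss -log sigma(r_w - r_l): the negative Bradley-Terry-Luce
    log-likelihood of preferring a_w over a_l under the implicit rewards."""
    margin = (implicit_reward(logp_theta_w, logp_ref_w, beta)
              - implicit_reward(logp_theta_l, logp_ref_l, beta))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Toy usage with made-up log-probabilities for one preference pair.
print(dpo_pair_loss(logp_theta_w=-5.2, logp_theta_l=-6.0,
                    logp_ref_w=-5.5, logp_ref_l=-5.8, beta=0.1))
```

The open question is whether, at a minimizer of this loss, the resulting r_{θ,β} can simultaneously be ε-close to the true reward while its induced policy leaves the KL neighborhood of the base policy π_{θ0}.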
References
It is, however, unclear if such an implicit reward function can actually be output by DPO.
— Why DPO is a Misspecified Estimator and How to Fix It (Gopalan et al., arXiv:2510.20413, 23 Oct 2025), in Related work