DPO’s ability to output implicit rewards that induce policies outside the KL neighborhood
Determine whether Direct Preference Optimization (DPO), when minimizing its loss over a parametric policy class on pairwise preferences generated under a Bradley–Terry–Luce model, can actually output a reward function from the policy-induced implicit reward family (i.e., of the form r_{θ,β}(s,a) = β · log[π_θ(a|s) / π_{θ0}(a|s)]) that is ε-close to the true reward yet induces a policy whose distribution lies outside the KL neighborhood of the base policy.
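For concreteness, the sketch below (not taken from the cited paper) spells out the two quantities the question refers to, assuming per-response log-probabilities under π_θ and the base policy π_{θ0} are available: the policy-induced implicit reward and the per-pair DPO loss, which is the negative Bradley–Terry–Luce log-likelihood of the observed preference under those implicit rewards. All function and variable names are illustrative.

```python
import math

def implicit_reward(logp_theta, logp_ref, beta):
    """Implicit reward r_{theta,beta}(s,a) = beta * log[pi_theta(a|s) / pi_theta0(a|s)],
    computed from the two per-response log-probabilities."""
    return beta * (logp_theta - logp_ref)

def dpo_pair_loss(logp_theta_w, logp_theta_l, logp_ref_w, logp_ref_l, beta):
    """Per-pair DPO loss -log sigma(r_w - r_l): the negative Bradley-Terry-Luce
    log-likelihood of preferring a_w over a_l under the implicit rewards."""
    margin = (implicit_reward(logp_theta_w, logp_ref_w, beta)
              - implicit_reward(logp_theta_l, logp_ref_l, beta))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Toy usage with made-up log-probabilities for one preference pair.
print(dpo_pair_loss(logp_theta_w=-5.2, logp_theta_l=-6.0,
                    logp_ref_w=-5.5, logp_ref_l=-5.8, beta=0.1))
```

The open question is whether, at a minimizer of this loss, the resulting r_{θ,β} can simultaneously be ε-close to the true reward while its induced policy leaves the KL neighborhood of the base policy π_{θ0}.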
References
It is, however, unclear if such an implicit reward function can actually be output by DPO.
— Why DPO is a Misspecified Estimator and How to Fix It (Gopalan et al., arXiv:2510.20413, 23 Oct 2025), in Related work