- The paper demonstrates that DPO’s log-ratio reward is uniquely optimal when preference data encodes a power-law structured Differential Information Distribution.
- It introduces the DID framework that formalizes transforming a reference policy to a target policy using Bayesian updating and normalized likelihood ratios.
- Experimental results on synthetic and real-world datasets validate the theoretical findings and reveal how DID entropy influences alignment and policy capabilities.
This paper, "Differential Information: An Information-Theoretic Perspective on Preference Optimization" (2505.23761), provides a theoretical framework based on information theory to better understand the mechanics of Direct Preference Optimization (DPO) [rafailov2024direct], particularly why its characteristic log-ratio reward function is effective. The core concept introduced is the Differential Information Distribution (DID).
The paper formalizes the Differential Information Distribution (DID) from a reference policy $\pi_{\mathrm{ref}}$ to a target policy $\pi$ as the distribution over token sequences $y$ that embodies the information needed to transform $\pi_{\mathrm{ref}}$ into $\pi$ via Bayesian updating. Formally, if $\pi(y) = P(Y=y \mid \pi_{\mathrm{ref}}, X)$ where $X$ is an event conditionally independent of $\pi_{\mathrm{ref}}$ given $Y=y$, the DID is defined as $P(Y=y \mid X)$. The paper shows that this DID is equivalent to the normalized likelihood ratio:
$$P(Y=y \mid X) = \frac{\pi(y)/\pi_{\mathrm{ref}}(y)}{\sum_{y'} \pi(y')/\pi_{\mathrm{ref}}(y')}.$$
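To make the normalized-likelihood-ratio form concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) that computes the DID for toy discrete policies given as probability vectors over a shared support; the function and variable names are assumptions.

```python
import numpy as np

def differential_information_distribution(pi, pi_ref):
    """DID P(Y=y | X) = (pi(y)/pi_ref(y)) / sum_y' pi(y')/pi_ref(y')
    for discrete policies given as probability vectors on the same support."""
    ratio = pi / pi_ref          # unnormalized likelihood ratios pi(y)/pi_ref(y)
    return ratio / ratio.sum()   # normalize so the ratios form a distribution

# Toy example over a 4-sequence support.
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])   # reference policy
pi     = np.array([0.1, 0.2, 0.3, 0.4])   # target policy after the Bayesian update
print(differential_information_distribution(pi, pi_ref))
# Mass concentrates on sequences that pi up-weights relative to pi_ref.
```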
The authors then investigate when a preference dataset, typically consisting of pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, naturally encodes the differential information needed to update a reference policy $\pi_{\mathrm{ref}}$ to a target policy $\pi^*$. They show that if the DID from $\pi_{\mathrm{ref}}$ to $\pi^*$ is related to the DID from $\pi^*$ to $\pi_{\mathrm{ref}}$ via a power-law structure, specifically $P(Y=y \mid X_{\pi_{\mathrm{ref}} \to \pi^*}) \propto P(Y=y \mid X_{\pi^* \to \pi_{\mathrm{ref}}})^{\beta}$ for some $\beta > 0$, then the preference probability $p^*(y_w \succ y_l)$ can be expressed in terms of the DID.
A key theoretical result (Theorem 3.2) demonstrates that when preferences encode this differential information structure, DPO's log-ratio reward $r(y) = \beta \log(\pi(y)/\pi_{\mathrm{ref}}(y)) + C$ is *uniquely optimal* for recovering the target policy $\pi^*$ via preference optimization. This provides a theoretical justification for the standard DPO objective beyond its derivation from KL-regularized RL.
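As a hedged illustration of the objects in Theorem 3.2 (a sketch, not the authors' implementation), the snippet below evaluates the log-ratio reward $r(y) = \beta \log(\pi(y)/\pi_{\mathrm{ref}}(y))$ and the Bradley-Terry preference probability it induces, which is the per-pair quantity inside the standard DPO loss; the helper names and toy log-probabilities are invented for the example.

```python
import numpy as np

def log_ratio_reward(logp_pi, logp_ref, beta):
    """DPO's implicit reward r(y) = beta * log(pi(y)/pi_ref(y)).
    The additive constant C cancels in the Bradley-Terry comparison, so it is omitted."""
    return beta * (logp_pi - logp_ref)

def dpo_preference_prob(logp_pi_w, logp_ref_w, logp_pi_l, logp_ref_l, beta):
    """Bradley-Terry probability sigma(r(y_w) - r(y_l)) implied by the log-ratio reward."""
    margin = (log_ratio_reward(logp_pi_w, logp_ref_w, beta)
              - log_ratio_reward(logp_pi_l, logp_ref_l, beta))
    return 1.0 / (1.0 + np.exp(-margin))

# Toy sequence log-probabilities under the policy and the reference.
p = dpo_preference_prob(logp_pi_w=-12.0, logp_ref_w=-14.0,
                        logp_pi_l=-15.0, logp_ref_l=-13.0, beta=0.1)
print(p)            # > 0.5: the policy's log-ratio favors y_w over y_l
loss = -np.log(p)   # corresponding per-pair DPO loss term
```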
The paper further connects the condition for preferences encoding differential information (the power-law structure in DID) to an implicit assumption often found in preference optimization methods: an ordering of policies based on increasing log-margins. This suggests that methods optimizing for larger log-margins implicitly rely on this underlying structure of differential information.
Based on this framework, the paper derives a closed-form expression for the ideal distribution from which to sample rejected responses $y_l$, assuming preferred responses $y_w$ are sampled from $\pi_{\mathrm{ref}}$. The ideal distribution for $y_l$ is proportional to $\pi_{\mathrm{ref}}(y_l)\left(\pi_{\mathrm{ref}}(y_l)/\pi^*(y_l)\right)^{\beta}$. While requiring knowledge of the target policy $\pi^*$, this provides a theoretical target for dataset creation processes.

In the second part, the paper analyzes the entropy of the DID, $H\!\left(P(Y=y \mid X_{\pi_{\mathrm{ref}} \to \pi})\right)$. It is argued that learning low-entropy differential information reinforces the policy distribution (concentrating mass), while learning high-entropy differential information induces a smoothing effect (spreading mass); a toy numerical illustration of this reinforcement-versus-smoothing effect follows the experiment list below. This leads to an information-theoretic explanation for Log-Likelihood Displacement (LLD), the phenomenon where the log-likelihood of preferred responses can decrease during DPO training. The hypothesis is that complex, multifaceted alignment objectives (like general instruction following) encode high-entropy DID, leading to policy smoothing and thus LLD, especially for preferred responses initially in high-probability regions of $\pi_{\mathrm{ref}}$.

Experimental validation is presented using both synthetic and real-world datasets:

- **Synthetic Experiments:** Using Energy-Based Models, the authors create a synthetic setup where preferences are known to encode differential information. Experiments confirm that a policy trained to match the preference distribution converges to the expected DID and that standard DPO optimally recovers the target policy in this setting, validating the theoretical findings.
- **Real-World Experiments:** Using instruction-following datasets (UltraFeedback [cui2023ultrafeedback] and Magpie-Pro [xu2024magpie]), the authors show that training a policy to directly match the preference distribution (equivalent to learning $p^*$) results in a policy distribution with higher entropy than the reference policy, suggesting these datasets encode high-entropy DID. Concurrently, LLD is observed during standard DPO training on these datasets, supporting the link between high-entropy DID and policy smoothing.
- **DID Entropy and Capabilities:** By comparing standard DPO with a custom "DPO-Projected Gradient" (DPO-PG) method designed to encourage policy reinforcement (and thus lower DID entropy), the authors explore the relationship between DID entropy and learned capabilities. On general instruction-following benchmarks (Arena-Hard [arenahard2024], WildBench [lin2024wildbench]), standard DPO (higher entropy) performs better. On knowledge-intensive question-answering tasks, DPO-PG (lower entropy) performs better. This suggests that learning high-entropy DID is important for the diverse capabilities required in general instruction following, while low-entropy DID is more relevant for precise, knowledge-based tasks.
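To illustrate the reinforcement-versus-smoothing argument referenced above (a toy numerical sketch under assumed four-sequence distributions, not the paper's experimental setup), the snippet below applies the Bayesian update $\pi(y) \propto \pi_{\mathrm{ref}}(y)\,P(Y=y \mid X)$ with a low-entropy and a higher-entropy DID and compares the resulting policy entropies.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -np.sum(p * np.log(p))

def bayesian_update(pi_ref, did):
    """Recover the updated policy: pi(y) is proportional to pi_ref(y) * P(Y=y | X)."""
    unnorm = pi_ref * did
    return unnorm / unnorm.sum()

pi_ref   = np.array([0.4, 0.3, 0.2, 0.1])
did_low  = np.array([0.85, 0.05, 0.05, 0.05])  # low-entropy DID: information about one sequence
did_high = np.array([0.10, 0.20, 0.30, 0.40])  # higher-entropy DID, spread against pi_ref's mode

pi_low  = bayesian_update(pi_ref, did_low)     # ~[0.92, 0.04, 0.03, 0.01]
pi_high = bayesian_update(pi_ref, did_high)    # [0.2, 0.3, 0.3, 0.2]

print(entropy(did_low), entropy(did_high))                  # ~0.59 vs ~1.28 nats
print(entropy(pi_ref), entropy(pi_low), entropy(pi_high))   # ~1.28, ~0.36, ~1.37 nats
# The low-entropy DID concentrates the updated policy (reinforcement), while the
# higher-entropy DID yields an updated policy with more entropy than pi_ref (smoothing).
```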
In summary, the paper offers a unified information-theoretic perspective on DPO, linking its objective to the structure of preference data via the concept of Differential Information Distribution. It provides theoretical justification for DPO's log-ratio reward and a novel explanation for LLD, suggesting that different types of capabilities acquired through alignment correlate with the entropy of the learned differential information.
Implementation considerations include the challenge of sampling from the theoretically ideal distribution for rejected responses, which requires access to the target policy. The empirical results highlight that different alignment goals might require different training strategies that influence the learned DID entropy. The paper notes limitations regarding the assumption of sufficient data coverage for preference optimization equivalence and suggests future work could explore how data annotation protocols influence DID entropy.
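One pragmatic workaround, sketched below under strong assumptions (candidates drawn from $\pi_{\mathrm{ref}}$, access to approximate target log-probabilities, and the closed form for the ideal rejected-response distribution given earlier), is self-normalized importance resampling; this is my own illustration rather than a procedure from the paper, and all names are hypothetical.

```python
import numpy as np

def sample_rejected(candidates, logp_ref, logp_target, beta, rng=None):
    """Resample one rejected response from candidates drawn from pi_ref, targeting
    q(y) proportional to pi_ref(y) * (pi_ref(y)/pi*(y))**beta. With a pi_ref proposal,
    the self-normalized importance weight reduces to (pi_ref(y)/pi*(y))**beta."""
    rng = rng or np.random.default_rng(0)
    log_w = beta * (np.asarray(logp_ref) - np.asarray(logp_target))
    w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    probs = w / w.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Hypothetical candidate responses with sequence log-probs under pi_ref and (approximate) pi*.
cands = ["resp_a", "resp_b", "resp_c"]
y_l = sample_rejected(cands, logp_ref=[-10.0, -12.0, -11.0],
                      logp_target=[-9.0, -15.0, -11.5], beta=0.1)
print(y_l)
```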