Differential Information: An Information-Theoretic Perspective on Preference Optimization (2505.23761v1)

Published 29 May 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Direct Preference Optimization (DPO) has become a standard technique for aligning LLMs with human preferences in a supervised manner. Despite its empirical success, the theoretical justification behind its log-ratio reward parameterization remains incomplete. In this work, we address this gap by utilizing the Differential Information Distribution (DID): a distribution over token sequences that captures the information gained during policy updates. First, we show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO emerges as the uniquely optimal form for learning the target policy via preference optimization. This result naturally yields a closed-form expression for the optimal sampling distribution over rejected responses. Second, we find that the condition for preferences to encode differential information is fundamentally linked to an implicit assumption regarding log-margin ordered policies-an inductive bias widely used in preference optimization yet previously unrecognized. Finally, by analyzing the entropy of the DID, we characterize how learning low-entropy differential information reinforces the policy distribution, while high-entropy differential information induces a smoothing effect, which explains the log-likelihood displacement phenomenon. We validate our theoretical findings in synthetic experiments and extend them to real-world instruction-following datasets. Our results suggest that learning high-entropy differential information is crucial for general instruction-following, while learning low-entropy differential information benefits knowledge-intensive question answering. Overall, our work presents a unifying perspective on the DPO objective, the structure of preference data, and resulting policy behaviors through the lens of differential information.

Summary

  • The paper demonstrates that DPO’s log-ratio reward is uniquely optimal when preference data encodes a power-law structured Differential Information Distribution.
  • It introduces the DID framework that formalizes transforming a reference policy to a target policy using Bayesian updating and normalized likelihood ratios.
  • Experimental results on synthetic and real-world datasets validate the theoretical findings and reveal how DID entropy influences alignment and policy capabilities.

This paper, "Differential Information: An Information-Theoretic Perspective on Preference Optimization" (2505.23761), provides a theoretical framework based on information theory to better understand the mechanics of Direct Preference Optimization (DPO) [rafailov2024direct], particularly why its characteristic log-ratio reward function is effective. The core concept introduced is the Differential Information Distribution (DID).

The paper formalizes the Differential Information Distribution (DID) from a reference policy $\mathrm{ref}$ to a target policy $\pi$ as the distribution over token sequences $y$ that embodies the information needed to transform $\mathrm{ref}$ into $\pi$ via Bayesian updating. Formally, if $\pi(y) = P(Y = y \mid \mathrm{ref}, X)$, where $X$ is an event conditionally independent of $\mathrm{ref}$ given $Y = y$, the DID is defined as $P(Y = y \mid X)$. The paper shows that this DID is equivalent to the normalized likelihood ratio:

$$P(Y = y \mid X) = \frac{\pi(y) / \mathrm{ref}(y)}{\sum_{y'} \pi(y') / \mathrm{ref}(y')}.$$
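To make the definition concrete, here is a minimal numerical sketch (not from the paper's code; the four-sequence toy space and probability values are made up) that computes the DID as a normalized likelihood ratio:

```python
import numpy as np

# Toy space of four candidate responses with hypothetical probabilities.
ref_probs = np.array([0.40, 0.30, 0.20, 0.10])   # reference policy ref(y)
pi_probs  = np.array([0.10, 0.20, 0.30, 0.40])   # target policy pi(y)

# DID from ref to pi: the likelihood ratio pi(y)/ref(y), normalized over y.
ratio = pi_probs / ref_probs
did = ratio / ratio.sum()

print(did)        # mass concentrates on sequences that pi upweights most
print(did.sum())  # 1.0 -- the DID is itself a distribution over sequences
```

Sequences whose probability grows most under the update from $\mathrm{ref}$ to $\pi$ dominate the DID, consistent with its interpretation as the information gained during the policy update.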

The authors then investigate when a preference dataset, typically consisting of pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, naturally encodes the differential information needed to update a reference policy $\mathrm{ref}$ to a target policy $\pi^*$. They show that if the DID from $\mathrm{ref}$ to $\pi^*$ relates to the DID from $\pi^*$ to $\mathrm{ref}$ via a power-law structure, specifically

$$P(Y = y \mid X_{\mathrm{ref} \to \pi^*}) \propto P(Y = y \mid X_{\pi^* \to \mathrm{ref}})^{\beta}$$

for some $\beta > 0$, then the preference probability $p^*(y_w \succ y_l)$ can be expressed in terms of the DID.

A key theoretical result (Theorem 3.2) demonstrates that when preferences encode this differential information structure, DPO's log-ratio reward $r(y) = \beta \log(\pi(y)/\mathrm{ref}(y)) + C$ is uniquely optimal for recovering the target policy $\pi^*$ via preference optimization. This provides a theoretical justification for the standard DPO objective beyond its derivation from KL-regularized RL.
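As an illustrative sketch (hypothetical numbers, not the authors' code), the log-ratio reward plugs into a Bradley-Terry model to give the preference probability; note that the constant $C$ cancels in the reward difference:

```python
import numpy as np

def log_ratio_reward(pi_y, ref_y, beta, C=0.0):
    # DPO's reward parameterization: r(y) = beta * log(pi(y) / ref(y)) + C
    return beta * np.log(pi_y / ref_y) + C

def preference_prob(pi_w, ref_w, pi_l, ref_l, beta):
    # Bradley-Terry preference: p(y_w > y_l) = sigmoid(r(y_w) - r(y_l)).
    # The shift C cancels in the difference, so only the log-ratios matter.
    margin = (log_ratio_reward(pi_w, ref_w, beta)
              - log_ratio_reward(pi_l, ref_l, beta))
    return 1.0 / (1.0 + np.exp(-margin))

# Hypothetical per-sequence probabilities for a chosen/rejected pair.
print(preference_prob(pi_w=0.30, ref_w=0.20, pi_l=0.10, ref_l=0.40, beta=0.1))
```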

The paper further connects the condition for preferences encoding differential information (the power-law structure in DID) to an implicit assumption often found in preference optimization methods: an ordering of policies based on increasing log-margins. This suggests that methods optimizing for larger log-margins implicitly rely on this underlying structure of differential information.
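To illustrate the log-margin inductive bias, the following hypothetical check (the checkpoint probabilities are invented for illustration) tests whether a sequence of policies assigns strictly increasing log-margins to a fixed preference pair relative to the reference policy:

```python
import numpy as np

def log_margin(policy, ref, y_w, y_l):
    # Log-margin of a policy on a pair (y_w, y_l) relative to ref,
    # up to the constant scale factor beta used in the DPO reward.
    return (np.log(policy[y_w] / ref[y_w])
            - np.log(policy[y_l] / ref[y_l]))

ref = {"chosen": 0.20, "rejected": 0.40}
# Hypothetical sequence of policies (e.g., training checkpoints), each
# shifting more probability mass onto the chosen response.
policies = [
    {"chosen": 0.22, "rejected": 0.38},
    {"chosen": 0.28, "rejected": 0.30},
    {"chosen": 0.35, "rejected": 0.20},
]

margins = [log_margin(p, ref, "chosen", "rejected") for p in policies]
print(margins)                                            # increasing margins
print(all(a < b for a, b in zip(margins, margins[1:])))   # log-margin ordered
```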

Based on this framework, the paper derives a closed-form expression for the ideal distribution from which to sample rejected responses $y_l$, assuming preferred responses $y_w$ are sampled from $\mathrm{ref}$. The ideal distribution for $y_l$ is proportional to $\mathrm{ref}(y_l) \left( \frac{\mathrm{ref}(y_l)}{\pi^*(y_l)} \right)^{\beta}$. While requiring knowledge of the target policy $\pi^*$, this provides a theoretical target for dataset creation processes.

In the second part, the paper analyzes the entropy of the DID, $H\!\left(P(Y = y \mid X_{\mathrm{ref} \to \pi})\right)$. It is argued that learning low-entropy differential information reinforces the policy distribution (concentrating mass), while learning high-entropy differential information induces a smoothing effect (spreading mass). This leads to an information-theoretic explanation for Log-Likelihood Displacement (LLD), the phenomenon where the log-likelihood of preferred responses can decrease during DPO training. The hypothesis is that complex, multifaceted alignment objectives (like general instruction following) encode high-entropy DID, leading to policy smoothing and thus LLD, especially for preferred responses initially in high-probability regions of $\mathrm{ref}$. (A toy numerical sketch of the ideal rejected-response distribution and the DID entropy follows the list below.)

Experimental validation is presented using both synthetic and real-world datasets:

  • Synthetic Experiments: Using Energy-Based Models, the authors create a synthetic setup where preferences are known to encode differential information. Experiments confirm that a policy trained to match the preference distribution converges to the expected DID and that standard DPO optimally recovers the target policy in this setting, validating the theoretical findings.
  • Real-World Experiments: Using instruction-following datasets (UltraFeedback [cui2023ultrafeedback] and Magpie-Pro [xu2024magpie]), the authors show that training a policy to directly match the preference distribution (equivalent to learning $p^*$) results in a policy distribution with higher entropy than the reference policy, suggesting these datasets encode high-entropy DID. Concurrently, LLD is observed during standard DPO training on these datasets, supporting the link between high-entropy DID and policy smoothing.

  • DID Entropy and Capabilities: By comparing standard DPO with a custom "DPO-Projected Gradient" (DPO-PG) method designed to encourage policy reinforcement (and thus lower DID entropy), the authors explore the relationship between DID entropy and learned capabilities. On general instruction-following benchmarks (Arena-Hard [arenahard2024], Wild-Bench [lin2024wildbench]), standard DPO (higher entropy) performs better. On knowledge-intensive Question Answering tasks, DPO-PG (lower entropy) performs better. This suggests that learning high-entropy DID is important for the diverse capabilities required in general instruction following, while low-entropy DID is more relevant for precise, knowledge-based tasks.
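The following toy sketch (hypothetical values; not the paper's implementation) computes the two quantities referenced above: the unnormalized ideal rejected-response distribution $\mathrm{ref}(y)\,(\mathrm{ref}(y)/\pi^*(y))^{\beta}$ and the entropy of the DID from $\mathrm{ref}$ to $\pi^*$:

```python
import numpy as np

ref_probs = np.array([0.40, 0.30, 0.20, 0.10])   # reference policy ref(y)
tgt_probs = np.array([0.10, 0.20, 0.30, 0.40])   # target policy pi*(y)
beta = 0.1

# Ideal (unnormalized) distribution over rejected responses:
#   q(y_l) proportional to ref(y_l) * (ref(y_l) / pi*(y_l)) ** beta
q = ref_probs * (ref_probs / tgt_probs) ** beta
q /= q.sum()

# Entropy of the DID from ref to pi*: low entropy -> reinforcement
# (mass concentrates), high entropy -> smoothing (mass spreads).
ratio = tgt_probs / ref_probs
did = ratio / ratio.sum()
did_entropy = -np.sum(did * np.log(did))

print(q)            # rejected sampling favors regions ref prefers over pi*
print(did_entropy)  # in nats; compare to np.log(did.size), the maximum
```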
In summary, the paper offers a unified information-theoretic perspective on DPO, linking its objective to the structure of preference data via the concept of Differential Information Distribution. It provides theoretical justification for DPO's log-ratio reward and a novel explanation for LLD, suggesting that different types of capabilities acquired through alignment correlate with the entropy of the learned differential information.

Implementation considerations include the challenge of sampling from the theoretically ideal distribution for rejected responses, which requires access to the target policy. The empirical results highlight that different alignment goals may require different training strategies that influence the learned DID entropy. The paper notes limitations regarding the assumption of sufficient data coverage for preference optimization equivalence and suggests future work could explore how data annotation protocols influence DID entropy.