
Fine-Grained Direct Preference Optimization (fDPO)

Updated 1 July 2025
  • Fine-Grained Direct Preference Optimization (fDPO) extends standard DPO by using granular feedback signals for more precise alignment of generative models like LLMs.
  • By using detailed feedback, fDPO improves data efficiency and handles localized issues in diverse applications like code, text-to-speech, and visual language models.
  • Various fDPO techniques achieve granularity by modifying the loss function, incorporating kernel methods, or applying optimization at sub-sequence levels like tokens or segments.

Fine-Grained Direct Preference Optimization (fDPO) represents a class of methods that extend the principles of Direct Preference Optimization (DPO) to leverage more granular feedback or structural information during the alignment of generative models, particularly LLMs. Standard DPO (2305.18290) simplifies the complex Reinforcement Learning from Human Feedback (RLHF) pipeline by deriving a closed-form expression for the optimal policy under a KL-regularized objective, directly optimizing a preference loss over pairs of chosen and rejected responses. While effective and computationally efficient, this approach operates at the full sequence level, potentially overlooking subtle distinctions, being sensitive to localized errors, or failing to leverage richer forms of preference feedback. fDPO methods aim to address these limitations by introducing granularity into the preference signal, the optimization objective, the model architecture, or the training process. This allows for more precise alignment with nuanced human judgments, improved robustness, and enhanced data efficiency in various domains.

Foundations of Direct Preference Optimization

Direct Preference Optimization (DPO) (2305.18290) is an alternative to traditional RLHF that avoids explicit reward model training and reinforcement learning steps. Given a dataset $\mathcal{D}$ of paired preferences $(x, y_w, y_l)$, where $x$ is a prompt, $y_w$ is the preferred response, and $y_l$ is the dispreferred response, DPO directly optimizes the model policy $\pi_\theta$ initialized from a reference policy $\pi_{\text{ref}}$ (typically a supervised fine-tuned model). The objective is derived from the insight that the optimal policy $\pi_r$ for a KL-regularized reward maximization problem, $\max_{\pi} \mathbb{E}_{x, y \sim \pi}[r(x,y)] - \beta D_{\text{KL}}[\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)]$, can be related to the reward function $r(x,y)$ via $r(x,y) = \beta \log \frac{\pi_r(y|x)}{\pi_{\text{ref}}(y|x)} + \text{const}$. By substituting this relationship into the Bradley-Terry model for preference probabilities, DPO formulates a loss function that directly optimizes the policy parameters $\theta$:

$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \left[ \log \sigma \bigg(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\bigg) \right]$

where $\sigma(\cdot)$ is the sigmoid function and $\beta$ is a hyperparameter controlling the strength of the KL regularization. This objective increases the log probability ratio of the preferred response over the dispreferred one, steered by the reference policy. DPO offers stability, computational efficiency, and strong performance compared to PPO-based RLHF, demonstrating effectiveness in controlling sentiment, improving summarization quality, and enhancing dialogue responses (2305.18290). DPO has also been successfully applied to fine-tune molecular LLMs to align generations with chemist preferences, showing significant improvements in desired properties with minimal loss in validity or diversity (2310.12304).
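
The objective above is straightforward to compute once per-sequence log-probabilities under the policy and the reference model are available. Below is a minimal PyTorch-style sketch; the function name and the assumption that log-probabilities are pre-summed over tokens are illustrative choices, not taken from the cited papers.

```python
# Minimal sketch of the DPO loss, assuming per-sequence log-probabilities
# (summed over tokens) have already been computed for both models.
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each input is a tensor of shape (batch,) holding log pi(y|x)."""
    # Implicit rewards: beta times the policy-to-reference log-ratio.
    chosen = beta * (policy_logp_w - ref_logp_w)
    rejected = beta * (policy_logp_l - ref_logp_l)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen - rejected).mean()
```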

Enhancing Granularity in Preference Signals and Data

Several fDPO methods focus on incorporating finer-grained information directly into the preference data or signal used for optimization. Direct Preference Optimization with an Offset (ODPO), also referred to as fDPO in one context (2402.10571), introduces an offset $\Delta r$ into the DPO loss to account for the degree or magnitude of preference between responses, not just the binary winner/loser. The ODPO loss is given by:

$\mathcal{L}_{\text{ODPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \Delta r \right) \right]$

where $\Delta r$ is a function of the score difference between $y_w$ and $y_l$, typically $\Delta r = \alpha \cdot \mathcal{S}(\text{score}(x, y_w) - \text{score}(x, y_l))$ (2402.10571). This offset allows the model to prioritize learning from pairs with larger preference gaps, improving data efficiency, especially with limited data.
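
As a rough sketch (not the authors' reference implementation), the offset can be folded into the same margin computation; here the scaling function $\mathcal{S}$ is taken to be a log1p of the clipped score gap, which is only one plausible monotone choice.

```python
# Hedged sketch of an offset DPO loss: identical to DPO except that delta_r,
# derived from the score gap between the two responses, is subtracted inside
# the sigmoid. score_w and score_l are assumed scalar quality scores.
import torch
import torch.nn.functional as F

def odpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
              score_w, score_l, beta=0.1, alpha=1.0):
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # One plausible monotone scaling S of the score difference (an assumption).
    delta_r = alpha * torch.log1p(torch.clamp(score_w - score_l, min=0.0))
    return -F.logsigmoid(margin - delta_r).mean()
```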

For CodeLLMs, DPO's inherent preference for better code over worse code, as determined by execution-based pairwise feedback, provides a fine-grained rewarding pattern without explicit reward functions (2410.18585). The pairwise comparison implicitly captures nuanced differences between code outputs (e.g., subtly wrong vs. completely wrong) that coarse, rule-based PPO rewards might miss. A key aspect for CodeLLM alignment is the on-policy pipeline for collecting fine-grained preference pairs based on automatic execution feedback (2410.18585).
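
A schematic version of such an execution-feedback pipeline might look as follows; `run_tests` and the pairing rule are hypothetical stand-ins for the paper's actual on-policy collection procedure.

```python
# Sketch: build preference pairs from execution feedback by sampling several
# candidate programs per prompt, scoring each by its unit-test pass rate, and
# pairing strictly better candidates against worse ones.
def build_code_preference_pairs(prompt, candidates, run_tests):
    scored = [(run_tests(prompt, c), c) for c in candidates]  # pass rate in [0, 1]
    pairs = []
    for s_hi, c_hi in scored:
        for s_lo, c_lo in scored:
            if s_hi > s_lo:  # better execution outcome -> (chosen, rejected) pair
                pairs.append((prompt, c_hi, c_lo))
    return pairs
```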

In reasoning tasks, generating high-quality human-annotated feedback is challenging. Pseudo Feedback Preference Optimization (PFPO) (2411.16345) generates scalable pseudo feedback by evaluating solutions against test cases or leveraging self-consistency with frontier LLMs. This framework supports constructing both outcome-level and step-level preference pairs, enabling a form of fine-grained DPO that can optimize the model's reasoning process rather than just the final output. Given a solution prefix $\hat{y}$, an expected return $\hat{r}$ can be estimated by sampling completions and evaluating them, supporting fine-grained preference pairs for intermediate steps (2411.16345).
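
For intuition, a step-level pseudo-feedback signal can be approximated with a simple Monte Carlo rollout, as in the sketch below; `sample_completion` and `passes_tests` are hypothetical helpers, and the actual PFPO procedure is more elaborate.

```python
# Sketch: estimate the expected return r_hat of a solution prefix by sampling
# completions from the current policy and checking them against test cases.
def estimate_prefix_return(prompt, prefix, sample_completion, passes_tests, n_samples=8):
    wins = 0
    for _ in range(n_samples):
        completion = sample_completion(prompt, prefix)  # roll out from the prefix
        wins += int(passes_tests(prompt, prefix + completion))
    return wins / n_samples  # prefixes with higher r_hat are preferred over lower ones
```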

Active learning strategies can enhance DPO by selecting the most informative preference pairs for training, either online or offline (2503.01076). By linearizing the DPO objective at the last neural network layer and applying D-optimal experimental design, methods like Active DPO (ADPO) (2503.01076) select data points that maximally reduce uncertainty in policy logits. This improves statistical efficiency and targets preference judgments that are most critical for fine-grained policy refinement.
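
The flavor of D-optimal selection can be illustrated with a greedy sketch over last-layer features; this is an assumption-laden simplification (the ridge regularization, greedy selection, and feature definition are illustrative), not the ADPO algorithm itself.

```python
# Greedy sketch of D-optimal data selection: repeatedly pick the candidate whose
# feature vector has the largest uncertainty x^T A^{-1} x under the current
# information matrix A, then update A with its outer product.
import torch

def greedy_d_optimal(features, budget, ridge=1e-3):
    n, d = features.shape
    info = ridge * torch.eye(d)          # regularized information matrix
    chosen, remaining = [], set(range(n))
    for _ in range(budget):
        inv = torch.linalg.inv(info)
        scores = {i: float(features[i] @ inv @ features[i]) for i in remaining}
        best = max(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
        info = info + torch.outer(features[best], features[best])
    return chosen
```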

Fine-Grained Modeling and Regularization

Another class of fDPO methods introduces granularity into the optimization objective's structure or the model's feature space. One such generalization, also termed $f$-DPO (2309.16240), replaces the standard reverse KL divergence constraint with a broader class of $f$-divergences:

$\max_{\pi} \;\; \mathbb{E}_{\pi}[r(y|x)] - \beta D_f(\pi, \pi_0)$

where $D_f(p, q) = \mathbb{E}_{q(x)}[f(p(x)/q(x))]$ for a convex function $f$ with $f(1) = 0$. For specific $f$-divergences like Jensen-Shannon, forward KL, and certain $\alpha$-divergences, a tractable mapping between policy and reward exists, similar to the original DPO (2309.16240). Choosing different $f$-divergences allows for fine-grained control over the trade-off between alignment performance and generation diversity. Forward KL, for instance, tends to yield higher diversity than reverse KL (2309.16240).
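
For concreteness, a few standard generator functions $f(u)$ satisfying $f(1)=0$ are listed below; the exact parameterization of the $\alpha$-divergence family varies across references, so this is only one common convention.

```python
# Generators f(u) for D_f(pi, pi_0) = E_{pi_0}[f(pi/pi_0)]; each satisfies f(1) = 0.
import math

def reverse_kl(u):        # f(u) = u log u  ->  KL(pi || pi_0)
    return u * math.log(u)

def forward_kl(u):        # f(u) = -log u   ->  KL(pi_0 || pi)
    return -math.log(u)

def alpha_divergence(u, alpha=0.5):  # one common convention, alpha not in {0, 1}
    return (u**alpha - 1 - alpha * (u - 1)) / (alpha * (alpha - 1))

def jensen_shannon(u):    # generator of the Jensen-Shannon divergence
    return 0.5 * (u * math.log(u) - (1 + u) * math.log((1 + u) / 2))
```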

DPO-Kernels (2501.03271), another method referred to as fDPO, integrates kernel functions into the DPO loss to operate in richer, non-linear feature spaces, enabling finer context-sensitive preference discrimination. It uses Polynomial, RBF, Mahalanobis, and Spectral kernels and combines a probability-based contrastive loss with an embedding-based hybrid loss:

$\max_{\pi} \; \mathbb{E}_{x, y^{+}, y^{-}} \, \kappa \Biggl[ \log \frac{\pi(y^{+} \mid x)}{\pi(y^{-} \mid x)} + \gamma \log \left(\frac{e_{y^+} \mid e_x}{e_{y^-} \mid e_x}\right) \Biggr] - \alpha D\left(\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right)$

where $\kappa$ is a kernel transformation and $e_{(\cdot)}$ are embedding representations (2501.03271). This method also employs diverse divergence alternatives (JSD, Hellinger, Rényi, Bhattacharyya, Wasserstein, $f$-divergences) and a data-driven approach to automatically select the best kernel-divergence pair. A Hierarchical Mixture of Kernels (HMK) combines local (RBF, Polynomial) and global (Spectral, Mahalanobis) kernels for balanced fine-grained and global modeling (2501.03271).
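
The kernels themselves are standard; a minimal sketch of the local ones named above (RBF and Polynomial) applied to embedding vectors is shown below. How DPO-Kernels folds these into the hybrid loss and the HMK weighting is more involved than this illustration.

```python
# Standard kernel functions on embedding vectors; "local" kernels emphasize
# nearby points, while spectral/Mahalanobis kernels capture more global structure.
import torch

def rbf_kernel(a, b, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    return torch.exp(-gamma * (a - b).pow(2).sum(dim=-1))

def polynomial_kernel(a, b, degree=2, c=1.0):
    # k(a, b) = (<a, b> + c)^degree
    return ((a * b).sum(dim=-1) + c) ** degree
```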

The reference policy and the strength of the KL constraint ($\beta$) in DPO also implicitly influence the granularity of updates. A very small $\beta$ allows for larger, potentially token-level probability shifts, while a strong constraint limits deviation from the reference (2407.13709). Using stronger, compatible reference policies can improve performance, but incompatibility can be detrimental, suggesting a fine balance in leveraging the reference (2407.13709).

Segmental and Token-Level Optimization

A direct approach to fDPO involves applying the preference optimization signal at a sub-sequence level. TGDPO (Token-Level Reward Guidance for DPO) (2506.14574) integrates token-level reward signals into DPO by theoretically decomposing the sequence-level PPO objective into a series of token-level PPO problems. This enables a tractable loss formulation with token-level reward guidance, allowing different tokens to deviate from the reference policy based on their estimated rewards. The TGDPO loss includes per-token weights based on an induced token-level reward, allowing fine-grained control over updates (2506.14574).
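
A heavily simplified sketch of the general structure (per-token log-ratios weighted by token-level rewards before entering the DPO margin) is given below; the weights and their derivation in TGDPO differ from this illustration.

```python
# Sketch: per-token policy/reference log-ratio differences are weighted by
# token-level reward signals, then aggregated into a DPO-style margin.
import torch.nn.functional as F

def token_weighted_dpo_loss(pol_tok_w, ref_tok_w, weights_w,
                            pol_tok_l, ref_tok_l, weights_l, beta=0.1):
    """Inputs are (batch, seq_len) per-token log-probs and non-negative weights;
    padding positions should carry zero weight."""
    margin_w = (weights_w * (pol_tok_w - ref_tok_w)).sum(dim=-1)
    margin_l = (weights_l * (pol_tok_l - ref_tok_l)).sum(dim=-1)
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()
```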

For text-to-speech (TTS), Fine-grained Preference Optimization (FPO) (2502.02950) addresses localized issues by selectively computing the DPO-style loss only on identified problematic segments using a token indicator function $I(y^i)$:

$\mathcal{L}_{\mathrm{FPO}} = -\mathbb{E}\left[ \sum_i I(y^i) \cdot \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} \right) \right]$

This selective training enhances robustness to localized temporal or semantic-phonetic errors, improving intelligibility and reducing the "bad case ratio" with superior data efficiency compared to utterance-level DPO (2502.02950).

Adaptive Sentence-level Preference Optimization (ASPO) (2505.19100) applies fine-grained optimization at the sentence level for multimodal VLMs. It dynamically calculates adaptive rewards for each sentence in a response based on metrics like image-text similarity and textual perplexity, allowing sentence-level weighting in the DPO margin (2505.19100). This improves multimodal alignment by emphasizing well-grounded sentences and suppressing hallucinated ones without additional models (2505.19100).

DenseDPO (2506.03517) extends DPO to text-to-video diffusion models by creating video pairs via denoising corrupted copies of a ground truth video, ensuring temporal alignment and neutralizing motion bias. This alignment allows for preference labeling on short temporal segments rather than entire clips, providing a denser and more precise learning signal. The DenseDPO loss aggregates segment-level implicit rewards (2506.03517).

For VLM spatial reasoning, fDPO (2506.21656) introduces segment-specific preference granularity by decomposing responses into descriptive grounding and logical reasoning segments. It applies adaptively-balanced optimization to these segments based on a spatial reward mechanism that evaluates visual consistency, spatial grounding, and logical coherence (2506.21656). This allows prioritizing the more difficult logical reasoning segments, leading to improved spatial understanding.

Applications and Practical Considerations

fDPO methods have been applied across diverse domains, demonstrating the utility of fine-grained preference signals. In medicine, DPO is shown to be significantly better than supervised fine-tuning (SFT) for complex tasks like clinical reasoning, summarization, and triage, which require nuanced judgment, whereas SFT suffices for simpler classification (2409.12741). However, the application of DPO, and by extension fDPO, in medical settings is hampered by software gaps, including the lack of multi-GPU parallelization in open-source DPO libraries and the absence of DPO APIs in closed-source frontier models (2409.12741).

For controllable generation, the UltraGen framework (2502.12375) uses DPO in its Global Preference Optimization (GPO) stage to achieve extremely fine-grained control over dozens of attributes (soft and hard). It leverages auto-reconstruction and attribute sampling to expose the model to complex attribute combinations, mitigating position bias and attention dilution (2502.12375).

While DPO relies on paired preferences, Kahneman-Tversky Optimization (KTO) can handle single-response feedback, offering greater flexibility in distributed settings like Federated Learning (FL) where paired data may be sparse or heterogeneous (2502.14187). This highlights a practical limitation for DPO and some fDPO methods in scenarios where collecting explicit pairwise comparisons is difficult or privacy-sensitive.

Overall, fDPO approaches improve data efficiency and robustness compared to standard DPO. For example, Filtered DPO (2404.13846) uses a reward model to dynamically filter low-quality "chosen" responses during training, making DPO more robust to noisy datasets. FPO for TTS achieves similar performance to utterance-level DPO with significantly fewer data samples (2502.02950).
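
One plausible reading of the filtering rule is sketched below, under the assumption that a "chosen" response is kept only if its reward is at least that of a fresh policy sample; `reward_model` and `policy_sample` are hypothetical callables, not the paper's API.

```python
# Sketch: drop preference pairs whose "chosen" response no longer beats what the
# current policy can generate, as judged by a reward model.
def filter_preference_pairs(pairs, reward_model, policy_sample):
    kept = []
    for prompt, chosen, rejected in pairs:
        if reward_model(prompt, chosen) >= reward_model(prompt, policy_sample(prompt)):
            kept.append((prompt, chosen, rejected))
    return kept  # run standard DPO only on the retained pairs
```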

Open Challenges and Future Directions

Despite the progress, several challenges and open questions remain for fDPO:

  • Defining Granularity: The optimal level of granularity (token, segment, sentence, attribute, degree) is likely task-dependent and requires further investigation. Different fDPO methods explore various granularities, suggesting a need for frameworks to guide this choice.
  • Annotation Cost and Scalability: Obtaining fine-grained human feedback is often more expensive than coarse feedback. Developing scalable methods for automatic or pseudo feedback generation at a fine grain, like those in PFPO (2411.16345) or using VLMs for segment labels (2506.03517), is crucial. The reliability and potential biases of pseudo-feedback also need careful study (2411.16345).
  • Theoretical Understanding: While some fDPO methods provide theoretical guarantees (e.g., $f$-DPO with KKT conditions (2309.16240), ADPO logit error bounds (2503.01076), TGDPO partition function elimination (2506.14574)), a comprehensive theoretical framework unifying different fDPO approaches and their implications for policy optimization remains an area of active research.
  • Reference Policy Interaction: The impact of the reference policy on fDPO, particularly at the token level and with different divergences, needs further theoretical and empirical analysis (2407.13709). The necessity of a reference policy for stable fDPO is also an open question.
  • Integration and Combination: Exploring how different fDPO techniques (e.g., combining offset-based preference strength, segment-level loss, kernelized features, and curriculum learning) can be integrated might yield further performance gains.
  • Evaluation Metrics: Fine-grained alignment necessitates appropriate evaluation metrics that go beyond overall quality scores to assess correctness and fidelity at the segment or token level, as demonstrated by spatial reward mechanisms (2506.21656) or localized error rates (2502.02950).

Future research will likely continue to explore novel ways to incorporate fine-grained signals, develop more sophisticated automatic feedback mechanisms, refine theoretical understandings, and improve the practicality and scalability of fDPO methods across an expanding range of generative tasks and modalities.
