
MinorDPO: Preference Tuning for LLM Stability

Updated 13 January 2026
  • MinorDPO is a preference-based fine-tuning approach that modifies DPO by applying a one-sided reject penalty to improve model stability.
  • It gates the reject penalty so that it is applied only when the student model assigns higher likelihood to a dispreferred output than a frozen reference does.
  • Empirical results indicate MinorDPO can boost accuracy by up to 5 percentage points and sustain robust performance at higher learning rates compared to DPO.

MinorDPO is a preference-based fine-tuning algorithm for LLMs that addresses instability and over-penalization issues present in the Direct Preference Optimization (DPO) objective. It introduces a one-sided "reject penalty" constraint that regularizes model deviations in line with the KL term present in classical reinforcement learning from human feedback (RLHF), while maintaining the simplicity of DPO’s RL-free, cross-entropy-based formulation. The key technical innovation in MinorDPO is the modification of the penalty on dispreferred (reject) samples: the reject-side update is applied only if the student model's likelihood for the rejected sample exceeds that of a frozen reference, which removes unnecessary gradient pressure once the rejected sample is already at or below its reference likelihood.

1. Background: Preference Optimization and DPO

Standard RLHF fine-tuning for LLMs employs a two-stage pipeline: (1) a reward model is learned from human preference data on completions, and (2) the LLM is fine-tuned using a policy-optimization algorithm such as Proximal Policy Optimization (PPO), with an explicit KL constraint to prevent excessive policy deviation from a frozen reference.
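For orientation, the second stage is commonly written as a KL-regularized reward maximization. The notation below is assumed for illustration and is not taken from the text above: $r_\phi$ stands for the learned reward model, $D$ for the prompt distribution, and $\beta$ for the KL coefficient.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\bigl[ r_\phi(x, y) \bigr]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\bigl[ \pi_\theta(y \mid x) \,\|\, \pi_\text{ref}(y \mid x) \bigr]
```

The KL term is what keeps the fine-tuned policy close to the frozen reference; the same $\beta$ reappears as the margin scale in DPO-style losses.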

Direct Preference Optimization (DPO) reframes this process by dispensing with explicit reward modeling and RL altogether. Instead, DPO optimizes a binary cross-entropy loss on observed preference triplets $(x, y_w, y_l)$, where $y_w$ is the preferred response and $y_l$ is the rejected one, using log-likelihood ratios against the frozen reference as implicit rewards:

$$L_{\mathrm{DPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\left( \beta \left( \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} \right) \right) \right]$$

where $\sigma$ is the sigmoid function and $\beta$ modulates the margin sensitivity (Xie et al., 2024, Xie et al., 2024).
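A minimal per-example sketch of this loss in plain Python may help make the terms concrete. It assumes the four sequence log-likelihoods have already been computed elsewhere and measures implicit rewards as log-ratios against the frozen reference; the function names and the value `beta=0.1` are illustrative choices, not taken from the source.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log sigma(z) = -log(1 + exp(-z)); branch so exp() never overflows.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one (x, y_w, y_l) triple, given sequence log-likelihoods."""
    r_w = logp_w - ref_logp_w  # implicit reward of the preferred response
    r_l = logp_l - ref_logp_l  # implicit reward of the rejected response
    return -log_sigmoid(beta * (r_w - r_l))

# Student equals reference: the margin is zero, so the loss is log(2).
at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)
# Lowering the rejected log-likelihood keeps reducing the loss, without bound.
pushed_down = dpo_loss(-10.0, -20.0, -10.0, -12.0, beta=0.1)
print(at_reference, pushed_down)
```

The second call illustrates the property that later sections treat as a weakness: the loss always rewards pushing the rejected response further down.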

2. Technical Formulation of MinorDPO

MinorDPO introduces a modification to the DPO loss, focusing specifically on the reject-penalty term. The core idea is to activate the reject-side penalty only when the student model over-assigns probability to the dispreferred answer relative to the reference. This is formalized as:

  • Let $r_w = \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)}$,
  • Let $r_l = \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}$,
  • Define $r_l^+ = \max(0, r_l)$.

The MinorDPO loss is:

$$L_{\mathrm{MinorDPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l)\sim D} \left[ \log \sigma\left(\beta (r_w - r_l^+) \right) \right]$$

This mechanism enforces that no further pressure is applied to reduce $\pi_\theta(y_l|x)$ once it falls below $\pi_\text{ref}(y_l|x)$ (Xie et al., 2024, Xie et al., 2024).
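The clamp on the reject log-ratio can be sketched in a few lines of plain Python. This is an illustrative per-example version under the definitions of $r_w$, $r_l$, and $r_l^+ = \max(0, r_l)$ given above; the function names and numeric inputs are hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def minor_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """MinorDPO loss for one triple: the reject log-ratio is clamped at zero."""
    r_w = logp_w - ref_logp_w
    r_l = logp_l - ref_logp_l
    r_l_plus = max(0.0, r_l)  # one-sided reject penalty
    return -log_sigmoid(beta * (r_w - r_l_plus))

# Rejected response already below the reference (r_l = -3, then r_l = -18):
below = minor_dpo_loss(-10.0, -15.0, -10.0, -12.0, beta=0.1)
far_below = minor_dpo_loss(-10.0, -30.0, -10.0, -12.0, beta=0.1)
# Rejected response above the reference (r_l = +2): the penalty is active.
above = minor_dpo_loss(-10.0, -10.0, -10.0, -12.0, beta=0.1)
print(below, far_below, above)
```

Because `below` and `far_below` are identical, there is nothing to gain from suppressing the rejected response further once it sits under the reference.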

3. Comparison with DPO, IPO, and KTO

The main difference between DPO and MinorDPO lies in the handling of the reject log-ratio:

| Algorithm | Loss penalty on reject sample | KL-constraint behavior |
| DPO | $r_l$ (full log-ratio) | Always penalizes; risk of over-pushing |
| MinorDPO | $r_l^+ = \max(0, r_l)$ | Penalizes only if $r_l > 0$ |
| IPO | Full difference $r_w - r_l$ | Similar overfitting as DPO |
| KTO | Implicit via batch pairing | No explicit reject tail truncation |

Among these methods, only MinorDPO truncates the penalty on $y_l$, eliminating continued downward pressure once $\pi_\theta(y_l|x)$ has dropped to or below $\pi_\text{ref}(y_l|x)$ (Xie et al., 2024).
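The contrast between DPO and MinorDPO can be seen by sweeping the reject log-ratio $r_l$ while holding $r_w$ fixed. The sketch below works directly on log-ratios; the helper names and the chosen values of `beta`, `r_w`, and `r_l` are arbitrary illustrations.

```python
import math

def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def dpo(r_w, r_l, beta):
    return neg_log_sigmoid(beta * (r_w - r_l))

def minor_dpo(r_w, r_l, beta):
    return neg_log_sigmoid(beta * (r_w - max(0.0, r_l)))

beta, r_w = 0.1, 0.0
table = {r_l: (dpo(r_w, r_l, beta), minor_dpo(r_w, r_l, beta))
         for r_l in (4.0, 0.0, -4.0, -40.0)}
for r_l, (d, m) in table.items():
    print(f"r_l={r_l:6.1f}  DPO={d:.4f}  MinorDPO={m:.4f}")
```

For positive $r_l$ the two losses coincide. For negative $r_l$ the DPO loss keeps falling as $r_l$ decreases, while the MinorDPO loss stays at its $r_l = 0$ value.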

4. Algorithm, Gradient Structure, and Hyperparameters

Fine-tuning with MinorDPO involves the following process:

  • Collect a dataset $D$ of preference triples $(x, y_w, y_l)$.
  • For each mini-batch, compute:
    • Log-likelihoods under both $\pi_\theta$ and $\pi_\text{ref}$.
    • Rewards $r_w$ and $r_l$, and apply the $\max(0, r_l)$ gating.
    • The cross-entropy loss $L_{\mathrm{MinorDPO}}$.
  • Back-propagate gradients and update $\theta$ via Adam or a similar optimizer.
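The steps above can be sketched on a toy problem. The example below is not an LLM training loop: it assumes a hypothetical categorical policy over three candidate outputs for a single prompt, uses hand-written gradients, and substitutes plain gradient descent for Adam. All names and numbers are illustrative.

```python
import math

def log_softmax(logits):
    """Log-probabilities of a categorical policy given its logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def minor_dpo_step(theta, ref_logp, y_w, y_l, beta, lr):
    """One gradient-descent step on one preference pair; returns (theta, loss)."""
    logp = log_softmax(theta)            # log-likelihoods under the student
    probs = [math.exp(v) for v in logp]
    r_w = logp[y_w] - ref_logp[y_w]      # preferred log-ratio
    r_l = logp[y_l] - ref_logp[y_l]      # reject log-ratio
    gate = 1.0 if r_l > 0 else 0.0       # max(0, r_l) gating
    margin = beta * (r_w - gate * r_l)
    loss = -math.log(sigmoid(margin))
    coeff = -beta * sigmoid(-margin)     # d loss / d (r_w - r_l^+)
    new_theta = []
    for k, t in enumerate(theta):
        g_w = (1.0 if k == y_w else 0.0) - probs[k]  # d log pi(y_w) / d theta_k
        g_l = (1.0 if k == y_l else 0.0) - probs[k]  # d log pi(y_l) / d theta_k
        new_theta.append(t - lr * coeff * (g_w - gate * g_l))
    return new_theta, loss

ref_logp = log_softmax([0.0, 0.5, 0.0])  # frozen reference policy
theta = [0.0, 1.5, 0.0]                  # student starts with the reject over-weighted
losses = []
for _ in range(200):
    theta, loss = minor_dpo_step(theta, ref_logp, y_w=0, y_l=1, beta=0.5, lr=0.5)
    losses.append(loss)
final_r_l = log_softmax(theta)[1] - ref_logp[1]
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, final r_l {final_r_l:.3f}")
```

In this toy run the gate is active at the start, because the student over-weights the rejected output, and switches off once the reject log-ratio is no longer positive. After that, only the preferred output is promoted directly.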

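The explicit gradient of the MinorDPO objective is

$$\nabla_\theta L_{\mathrm{MinorDPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l)\sim D}\left[ \sigma\left(-\beta (r_w - r_l^+)\right) \left( \nabla_\theta \log \pi_\theta(y_w|x) - \mathbb{1}[r_l > 0]\, \nabla_\theta \log \pi_\theta(y_l|x) \right) \right]$$

where the indicator $\mathbb{1}[r_l > 0]$ "turns off" the reject update when $r_l \le 0$ (Xie et al., 2024).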
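A gated gradient of this form can be sanity-checked numerically. The sketch below differentiates $-\log \sigma(\beta (r_w - \max(0, r_l)))$ by hand for a hypothetical three-way softmax policy, with the reject term multiplied by an indicator that is 1 only when $r_l > 0$, and compares the result with central finite differences. The toy policy and all numeric values are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def loss(theta, ref_logp, y_w, y_l, beta):
    logp = log_softmax(theta)
    r_w = logp[y_w] - ref_logp[y_w]
    r_l_plus = max(0.0, logp[y_l] - ref_logp[y_l])
    # -log sigma(u) written as log(1 + exp(-u)); fine for the small u used here.
    return math.log1p(math.exp(-beta * (r_w - r_l_plus)))

def analytic_grad(theta, ref_logp, y_w, y_l, beta):
    logp = log_softmax(theta)
    probs = [math.exp(v) for v in logp]
    r_w = logp[y_w] - ref_logp[y_w]
    r_l = logp[y_l] - ref_logp[y_l]
    gate = 1.0 if r_l > 0 else 0.0            # indicator on the reject term
    u = beta * (r_w - gate * r_l)
    weight = 1.0 / (1.0 + math.exp(u))        # sigma(-u)
    grad = []
    for k in range(len(theta)):
        dlogp_w = float(k == y_w) - probs[k]  # d log pi(y_w) / d theta_k
        dlogp_l = float(k == y_l) - probs[k]  # d log pi(y_l) / d theta_k
        grad.append(-beta * weight * (dlogp_w - gate * dlogp_l))
    return grad

def numeric_grad(theta, ref_logp, y_w, y_l, beta, eps=1e-6):
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        diff = loss(up, ref_logp, y_w, y_l, beta) - loss(down, ref_logp, y_w, y_l, beta)
        grad.append(diff / (2 * eps))
    return grad

ref_logp = log_softmax([0.0, 0.5, 0.0])
errors = {}
for name, theta in (("gate on", [0.0, 1.5, 0.0]), ("gate off", [0.0, -1.0, 0.0])):
    a = analytic_grad(theta, ref_logp, 0, 1, beta=0.5)
    n = numeric_grad(theta, ref_logp, 0, 1, beta=0.5)
    errors[name] = max(abs(x - y) for x, y in zip(a, n))
    print(name, errors[name])
```

Both test points are chosen away from $r_l = 0$, where the clamp makes the loss non-differentiable and a finite-difference check would be unreliable.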

Key hyperparameters:

  • $\beta$: governs margin scaling ([…]–[…] is effective for 7B-scale models),
  • learning rate (robust up to […] for Qwen1.5-7B-Chat),
  • batch size (e.g., 64–128) (Xie et al., 2024, Xie et al., 2024).

5. Theoretical Motivation and Robustness

In DPO, the constant negative push on $\log \pi_\theta(y_l|x)$ can cause catastrophic "token collapse" (excessive shrinking of $\pi_\theta(y_l|x)$ toward zero) and excessive model drift, especially when learning rates are increased or preference pairs are ambiguous. This happens because DPO places no lower bound on how far the probability of the reject sample can be pushed down.

MinorDPO addresses this by imposing a one-sided penalty, analogous to the classical RLHF KL regularizer, which discourages $\pi_\theta$ from drifting away from the reference distribution on any output. The result is reduced gradient variance and a built-in safeguard against overfitting and instability:

  • When $r_l \le 0$, no further penalty is applied, preserving generality.
  • When $r_l > 0$, the penalty is imposed, enforcing alignment (Xie et al., 2024, Xie et al., 2024).
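Written out per example, with $r_l^+ = \max(0, r_l)$, the two cases correspond to a piecewise loss. This is a direct expansion of the clamp, shown here only as a worked restatement:

```latex
-\log \sigma\bigl(\beta (r_w - r_l^+)\bigr) =
\begin{cases}
-\log \sigma(\beta\, r_w), & r_l \le 0 \quad \text{(reject term inactive)} \\[4pt]
-\log \sigma\bigl(\beta (r_w - r_l)\bigr), & r_l > 0 \quad \text{(identical to DPO)}
\end{cases}
```

In the first branch the loss depends on $y_l$ not at all, so the only remaining gradient signal promotes $y_w$.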

A notable side effect is the "stay-close" property: model updates focus on promoting $y_w$ without unnecessary penalization of an already-suppressed $y_l$.

6. Empirical Results and Practical Recommendations

Experiments on the Qwen1.5-7B-Chat model with MetaMath_DPO_FewShot data and GSM8K evaluation reveal:

  • DPO achieves accuracy of […]–[…] as $\beta$ increases; MinorDPO attains […]–[…] (+5 percentage points).
  • At learning rates […], DPO collapses with small $\beta$ (repeated tokens); MinorDPO remains robust and yields […]–[…] accuracy.
  • MinorDPO performs on par with or better than DPOP (which employs an extra SFT penalty, […]) without additional hyperparameter tuning.
  • MinorDPO accommodates learning rates an order of magnitude higher than DPO without collapse (Xie et al., 2024).

A practical guideline is to set $\beta$ in […] for stable convergence and to use larger batch sizes and higher learning rates as model and data size grow.

The central insight underpinning MinorDPO, sample-level gating of the reject penalty, generalizes to supervised fine-tuning as well. The same mechanism motivates the MinorSFT loss, which applies the gating in single-label settings (no explicit reject) and weights sample contributions by a DPO-style sigmoid coefficient ([…]). Empirical traces show that Minor-style methods maintain smaller model deviation (measured as average log-likelihood difference over a reference set) and achieve higher accuracy on domain QA tasks relative to DPO or SFT (Xie et al., 2024).
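The deviation measure mentioned here can be illustrated with a small helper. The source does not say whether the difference is signed or absolute, so the sketch below assumes the mean absolute gap between per-sequence log-likelihoods; the function name and sample values are hypothetical.

```python
def mean_loglik_deviation(logp_student, logp_reference):
    """Average absolute per-sequence log-likelihood gap (assumed definition)."""
    if not logp_student or len(logp_student) != len(logp_reference):
        raise ValueError("inputs must be non-empty and of equal length")
    gaps = [abs(s - r) for s, r in zip(logp_student, logp_reference)]
    return sum(gaps) / len(gaps)

# Two reference-set sequences scored by the fine-tuned and the frozen model.
deviation = mean_loglik_deviation([-10.0, -12.0], [-11.0, -12.0])
print(deviation)
```

A smaller value means the fine-tuned model scores the reference set more like the frozen model does.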

MinorDPO thus provides both a robust, principled modification to DPO and a step toward more stable, reference-anchored fine-tuning of LLMs.

References (2)
