
MinorDPO: Preference Tuning for LLM Stability

Updated 13 January 2026
  • MinorDPO is a preference-based fine-tuning approach that modifies DPO by applying a one-sided reject penalty to improve model stability.
  • It employs a gating mechanism that only updates penalties when the student model assigns higher likelihood to dispreferred outputs than a frozen reference.
  • Empirical results indicate MinorDPO can boost accuracy by up to 5 percentage points and sustain robust performance at higher learning rates compared to DPO.

MinorDPO is a preference-based fine-tuning algorithm for LLMs that addresses instability and over-penalization issues present in the Direct Preference Optimization (DPO) objective. It introduces a one-sided "reject penalty" constraint that regularizes model deviations in line with the KL term present in classical reinforcement learning from human feedback (RLHF), while maintaining the simplicity of DPO’s RL-free, cross-entropy-based formulation. The key technical innovation in MinorDPO is the modification of the penalty on dispreferred (reject) samples: updates are only performed if the candidate likelihood under the student model exceeds that under a frozen reference, thereby eliminating unnecessary gradient pressure once the model’s preference aligns with the reference policy.

1. Background: Preference Optimization and DPO

Standard RLHF fine-tuning for LLMs employs a two-stage pipeline: (1) a reward model is learned from human preference data on completions, and (2) the LLM is fine-tuned using a policy-optimization algorithm such as Proximal Policy Optimization (PPO), with an explicit KL constraint to prevent excessive policy deviation from a frozen reference.

Direct Preference Optimization (DPO) reframes this process by dispensing with explicit reward modeling and RL altogether. Instead, DPO optimizes a binary cross-entropy loss on observed preference triplets $(x, y_w, y_l)$, where $y_w$ is the preferred option and $y_l$ is the reject, using log-likelihood ratios against the frozen reference $\pi_\text{ref}$ as implicit rewards:

$$L_{\mathrm{DPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\!\left( \beta \left( \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} \right) \right) \right]$$

where $\sigma$ is the sigmoid function and $\beta$ modulates the margin sensitivity (Xie et al., 2024, Xie et al., 2024).
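As a concrete illustration of this objective, the following is a minimal PyTorch sketch, assuming the summed log-likelihood of each completion under the policy and the frozen reference has already been computed (function and argument names are illustrative, not from the paper):

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Minimal DPO loss from per-response summed log-likelihoods.

    logp_w, logp_l         : log pi_theta(y_w|x) and log pi_theta(y_l|x), shape (batch,)
    ref_logp_w, ref_logp_l : the same quantities under the frozen reference pi_ref
    """
    r_w = logp_w - ref_logp_w          # chosen log-ratio
    r_l = logp_l - ref_logp_l          # reject log-ratio
    # -log sigmoid(beta * (r_w - r_l)), averaged over the batch
    return -F.logsigmoid(beta * (r_w - r_l)).mean()
```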

2. Technical Formulation of MinorDPO

MinorDPO introduces a modification to the DPO loss, focusing specifically on the reject-penalty term. The core idea is to activate the reject-side penalty only when the student model over-assigns probability to the dispreferred answer relative to the reference. This is formalized as:

  • Let $r_w = \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)}$,
  • Let $r_l = \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}$,
  • Define $r_l^+ = \max(0, r_l)$.

The MinorDPO loss is:

$$L_{\mathrm{MinorDPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l)\sim D} \left[ \log \sigma\!\left(\beta \left(r_w - r_l^+\right) \right) \right]$$

This mechanism enforces that no further pressure is applied to reduce $\pi_\theta(y_l|x)$ once it falls below $\pi_\text{ref}(y_l|x)$ (Xie et al., 2024, Xie et al., 2024).
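In code, MinorDPO differs from the DPO sketch above by a single clamp on the reject log-ratio. A minimal PyTorch version under the same assumptions (precomputed sequence log-likelihoods, illustrative names):

```python
import torch
import torch.nn.functional as F

def minor_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """MinorDPO loss: identical to DPO except that the reject log-ratio is
    clamped at zero, so no gradient keeps pushing pi_theta(y_l|x) down once
    it is already below pi_ref(y_l|x)."""
    r_w = logp_w - ref_logp_w
    r_l = logp_l - ref_logp_l
    r_l_plus = torch.clamp(r_l, min=0.0)   # r_l^+ = max(0, r_l) gating
    return -F.logsigmoid(beta * (r_w - r_l_plus)).mean()
```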

3. Comparison with DPO, IPO, and KTO

The main difference between DPO and MinorDPO lies in the handling of the reject log-ratio:

| Algorithm | Loss Penalty on Reject Sample | KL-Constraint Behavior |
|---|---|---|
| DPO | $-r_l$ | Always penalizes; risk of over-push |
| MinorDPO | $-\max(0, r_l)$ | Penalizes only if $r_l > 0$ |
| IPO | Full difference $r_w - r_l$ | Similar overfitting as DPO |
| KTO | Implicit via batch pairing | No explicit truncation of the reject tail |

MinorDPO uniquely truncates the penalty on $y_l$, eliminating continued downward pressure once $\pi_\theta(y_l|x)$ is sufficiently suppressed (Xie et al., 2024).
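The truncation shows up directly in the gradients. The toy check below (assumed scalar log-ratios, purely illustrative) verifies that the MinorDPO loss has zero gradient with respect to the reject log-ratio once $r_l < 0$, whereas DPO keeps pushing:

```python
import torch
import torch.nn.functional as F

beta, r_w = 0.1, torch.tensor(0.5)
for val in (-1.0, 0.5):                     # reject log-ratio below / above zero
    r_l = torch.tensor(val, requires_grad=True)
    dpo = -F.logsigmoid(beta * (r_w - r_l))
    (g_dpo,) = torch.autograd.grad(dpo, r_l)

    r_l2 = torch.tensor(val, requires_grad=True)
    minor = -F.logsigmoid(beta * (r_w - torch.clamp(r_l2, min=0.0)))
    (g_minor,) = torch.autograd.grad(minor, r_l2)

    print(f"r_l={val:+.1f}  dDPO/dr_l={g_dpo.item():+.4f}  dMinorDPO/dr_l={g_minor.item():+.4f}")
```

For $r_l = -1.0$ the MinorDPO gradient is exactly zero, which is the "no further pressure" behavior summarized in the table above.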

4. Algorithm, Gradient Structure, and Hyperparameters

Fine-tuning with MinorDPO involves the following process:

  • Collect a dataset $D$ of preference triples $(x, y_w, y_l)$.
  • For each mini-batch, compute:
    • Log-likelihoods under both $\pi_\theta$ and $\pi_\text{ref}$.
    • Rewards $r_w$ and $r_l$, and apply the $\max(0, r_l)$ gating.
    • The cross-entropy loss $-\log \sigma(\beta (r_w - r_l^+))$.
  • Back-propagate gradients and update $\theta$ via Adam or a similar optimizer.

The explicit gradient of the MinorDPO objective is

$$\nabla_\theta L_{\mathrm{MinorDPO}} = -\,\mathbb{E} \left[ \beta\, \sigma\!\left(-\beta \left(r_w - r_l^+\right)\right) \left( \nabla_\theta \log \pi_\theta(y_w|x) - \mathbb{I}_{r_l > 0}\,\nabla_\theta \log \pi_\theta(y_l|x) \right) \right]$$

where the indicator $\mathbb{I}_{r_l > 0}$ "turns off" the reject update when $r_l \leq 0$ (Xie et al., 2024).
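Putting the steps together, one training step might look like the sketch below. It assumes Hugging Face-style causal LMs (models whose forward pass returns `.logits`) and a batch dictionary with hypothetical `chosen_*` / `rejected_*` token-id and label fields; it is an illustrative reconstruction, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels):
    """Summed log-probability of `labels` under `model`; positions labeled
    -100 (prompt / padding) are excluded from the sum."""
    logits = model(input_ids).logits[:, :-1, :]            # predict token t+1 from t
    targets = labels[:, 1:]
    mask = targets != -100
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, 2, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)                # shape (batch,)

def minor_dpo_step(policy, ref, batch, optimizer, beta=0.1):
    """One MinorDPO update on a batch of (chosen, rejected) completions."""
    with torch.no_grad():                                  # frozen reference model
        ref_w = sequence_logprob(ref, batch["chosen_ids"], batch["chosen_labels"])
        ref_l = sequence_logprob(ref, batch["rejected_ids"], batch["rejected_labels"])
    pol_w = sequence_logprob(policy, batch["chosen_ids"], batch["chosen_labels"])
    pol_l = sequence_logprob(policy, batch["rejected_ids"], batch["rejected_labels"])

    r_w = pol_w - ref_w
    r_l_plus = torch.clamp(pol_l - ref_l, min=0.0)         # implements the I[r_l > 0] gating
    loss = -F.logsigmoid(beta * (r_w - r_l_plus)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```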

Key hyperparameters:

  • $\beta$: governs margin scaling ($\approx 0.1$–$0.2$ is effective for 7B-scale models),
  • learning rate (robust up to $10^{-5}$ for Qwen1.5-7B-Chat),
  • batch size (e.g., 64–128) (Xie et al., 2024, Xie et al., 2024).
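For orientation only, a configuration consistent with the ranges above might look as follows; the specific values are illustrative assumptions rather than a reported recipe:

```python
# Illustrative MinorDPO hyperparameters for a 7B-scale chat model.
minor_dpo_config = {
    "beta": 0.1,            # margin scaling; ~0.1-0.2 reported effective at 7B scale
    "learning_rate": 1e-5,  # MinorDPO is reported to remain stable at this rate
    "batch_size": 128,      # e.g. 64-128
    "optimizer": "adamw",   # Adam-family optimizer, as in the procedure above
}
```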

5. Theoretical Motivation and Robustness

In DPO, the constant negative push on $y_l$ can cause catastrophic "token collapse" (excessive shrinking of $\pi_\theta(y_l|x)$ toward zero) and excessive model drift, especially when learning rates are increased or preference pairs are ambiguous. This is due to the lack of a lower bound on how much the probability of the reject sample is reduced.

MinorDPO addresses this by imposing a one-sided penalty, analogous to the classical RLHF KL regularizer that discourages $\pi_\theta$ from drifting away from the reference distribution on any output. The result is reduced gradient variance and a built-in safeguard against overfitting and instability:

  • When $\pi_\theta(y_l|x) \leq \pi_\text{ref}(y_l|x)$: no further penalty, preserving generality.
  • When $\pi_\theta(y_l|x) > \pi_\text{ref}(y_l|x)$: the penalty is imposed, enforcing alignment (Xie et al., 2024, Xie et al., 2024).

A notable side effect is the "stay-close" property: model updates focus on promoting $y_w$ without unnecessary penalization of the already-suppressed $y_l$.

6. Empirical Results and Practical Recommendations

Experiments on the Qwen1.5-7B-Chat model with MetaMath_DPO_FewShot data and GSM8K evaluation reveal:

  • DPO achieves accuracy of $25\%$–$37\%$ as $\beta$ increases; MinorDPO attains $30\%$–$40\%$ (roughly +5 percentage points).
  • At a learning rate of $1 \times 10^{-5}$, DPO collapses with small $\beta$ (repeated tokens); MinorDPO remains robust and yields $38\%$–$40\%$ accuracy.
  • MinorDPO performs on par with or better than DPOP (which employs an extra SFT penalty, $\lambda = 50$) without additional hyperparameter tuning.
  • MinorDPO accommodates learning rates an order of magnitude higher than DPO without collapse (Xie et al., 2024).

A practical guideline is to set $\beta$ in $[0.1, 0.2]$ for stable convergence and to increase batch size and learning rate as model and data scale grow.

The central insight underpinning MinorDPO, sample-level gating of the reject penalty, generalizes to supervised fine-tuning as well. The same $\max(0, r)$ mechanism motivates the MinorSFT loss, which applies the gating in single-label settings (no explicit reject) and weights each sample's contribution by a DPO-style sigmoid coefficient:

$$\nabla_\theta L_{\mathrm{MinorSFT}} = -\mathbb{E}_{(x,y)}\left[ \frac{2}{m}\, \sigma\!\left( -\beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)} \right) \nabla_\theta \log \pi_\theta(y|x) \right]$$

Empirical traces show that Minor-style methods maintain smaller model deviation (measured as the average log-likelihood difference over a reference set) and achieve higher accuracy on domain QA tasks relative to DPO or SFT (Xie et al., 2024).
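A loss whose gradient matches the MinorSFT expression above can be written by detaching the sigmoid weight so that gradient flows only through $\log \pi_\theta(y|x)$. The sketch below is an assumption-level reconstruction (the normalization constant $m$ is treated as given, and all names are illustrative):

```python
import torch

def minor_sft_loss(logp, ref_logp, m, beta=0.1):
    """MinorSFT-style loss whose gradient matches the expression above.

    logp     : log pi_theta(y|x) summed over the response, shape (batch,)
    ref_logp : the same quantity under the frozen reference pi_ref
    m        : normalization constant from the source gradient (taken as given)
    """
    # Detach the weight so the per-sample gradient is
    # -(2/m) * sigmoid(-beta * log-ratio) * grad log pi_theta(y|x).
    weight = (2.0 / m) * torch.sigmoid(-beta * (logp - ref_logp))
    return -(weight.detach() * logp).mean()
```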

MinorDPO thus provides both a robust, principled modification to DPO and a paradigmatic step toward more stable, reference-anchored fine-tuning of LLMs.

References (2)
