MinorDPO: Preference Tuning for LLM Stability
- MinorDPO is a preference-based fine-tuning approach that modifies DPO by applying a one-sided reject penalty to improve model stability.
- It employs a gating mechanism that applies the reject penalty only when the student model assigns higher likelihood to the dispreferred output than a frozen reference model does.
- Empirical results indicate MinorDPO can boost accuracy by up to 5 percentage points and sustain robust performance at higher learning rates compared to DPO.
MinorDPO is a preference-based fine-tuning algorithm for LLMs that addresses instability and over-penalization issues present in the Direct Preference Optimization (DPO) objective. It introduces a one-sided "reject penalty" constraint that regularizes model deviations in line with the KL term present in classical reinforcement learning from human feedback (RLHF), while maintaining the simplicity of DPO’s RL-free, cross-entropy-based formulation. The key technical innovation in MinorDPO is the modification of the penalty on dispreferred (reject) samples: updates are only performed if the candidate likelihood under the student model exceeds that under a frozen reference, thereby eliminating unnecessary gradient pressure once the model’s preference aligns with the reference policy.
1. Background: Preference Optimization and DPO
Standard RLHF fine-tuning for LLMs employs a two-stage pipeline: (1) a reward model is learned from human preference data on completions, and (2) the LLM is fine-tuned using a policy-optimization algorithm such as Proximal Policy Optimization (PPO), with an explicit KL constraint to prevent excessive policy deviation from a frozen reference.
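Direct Preference Optimization (DPO) reframes this process by dispensing with explicit reward modeling and RL altogether. Instead, DPO optimizes a binary cross-entropy loss on observed preference triplets $(x, y_w, y_l)$, where $x$ is the prompt, $y_w$ is the preferred option, and $y_l$ is the reject, using log-likelihood ratios as implicit rewards:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is the frozen reference, $\sigma$ is the sigmoid function, and $\beta$ modulates the margin sensitivity (Xie et al., 2024, Xie et al., 2024).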
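As a minimal sketch of the DPO objective described above, the snippet below evaluates the loss for a single preference triple from sequence log-likelihoods. The function names, the example log-likelihood values, and the default `beta=0.1` are illustrative assumptions, not values taken from the source.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (x, y_w, y_l) triple.

    Inputs are sequence log-likelihoods log pi(y | x) under the trained
    model (logp_*) and the frozen reference (ref_logp_*).
    beta=0.1 is an illustrative default, not a recommendation.
    """
    r_w = beta * (logp_w - ref_logp_w)  # implicit reward of the preferred answer
    r_l = beta * (logp_l - ref_logp_l)  # implicit reward of the reject
    return -log_sigmoid(r_w - r_l)


# Model identical to the reference: the margin is 0, so the loss is log(2).
loss_at_ref = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Raising the preferred likelihood and lowering the reject likelihood
# increases the margin and lowers the loss.
loss_better = dpo_loss(-11.0, -16.0, -12.0, -15.0)
```

In this form, lowering the reject log-likelihood always reduces the loss, however far below the reference it already is.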
2. Technical Formulation of MinorDPO
MinorDPO introduces a modification to the DPO loss, focusing specifically on the reject-penalty term. The core idea is to activate the reject-side penalty only when the student model over-assigns probability to the dispreferred answer relative to the reference. This is formalized as:
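- Let $r_w = \beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}$,
- Let $r_l = \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$,
- Define $r_l^{+} = \max(0, r_l)$.

The MinorDPO loss is:

$$\mathcal{L}_{\mathrm{MinorDPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_w - r_l^{+}\right)\right]$$

This mechanism enforces that no further pressure is applied to reduce $\pi_\theta(y_l \mid x)$ once it falls below $\pi_{\mathrm{ref}}(y_l \mid x)$ (Xie et al., 2024, Xie et al., 2024).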
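The gating can be sketched for a single triple as follows. This is an illustration under stated assumptions: the function names, the sample log-likelihoods, and `beta=0.1` are hypothetical, and the reject reward is clamped with `max(0, ·)` as described above.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def minor_dpo_loss(logp_w: float, logp_l: float,
                   ref_logp_w: float, ref_logp_l: float,
                   beta: float = 0.1) -> float:
    """MinorDPO loss for one triple: the reject reward is clamped at zero."""
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(r_w - max(0.0, r_l))


# Reject already below the reference: pushing it further down changes nothing.
loss_below = minor_dpo_loss(-11.0, -16.0, -12.0, -15.0)
loss_far_below = minor_dpo_loss(-11.0, -30.0, -12.0, -15.0)

# Reject above the reference: the penalty is active and the loss is larger.
loss_above = minor_dpo_loss(-11.0, -13.0, -12.0, -15.0)
```

Because `loss_below` and `loss_far_below` coincide, an optimizer gains nothing from suppressing the reject beyond the reference level.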
3. Comparison with DPO, IPO, and KTO
The main difference between DPO and MinorDPO lies in the handling of the reject log-ratio:
| Algorithm | Loss Penalty on Reject Sample | KL Constraint Behavior |
|---|---|---|
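| DPO | $-\beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$ | Always penalizes, risk of overpush |
| MinorDPO | $-\beta \max\left(0, \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$ | Penalizes only if $\pi_\theta(y_l \mid x) > \pi_{\mathrm{ref}}(y_l \mid x)$ |
| IPO | Full difference $\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$ | Overfitting risk similar to DPO |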
| KTO | Implicit via batch pairing | No explicit reject tail truncation |
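MinorDPO uniquely truncates the penalty on the reject sample $y_l$, eliminating continued downward pressure once $\pi_\theta(y_l \mid x)$ is sufficiently suppressed, that is, once it no longer exceeds $\pi_{\mathrm{ref}}(y_l \mid x)$ (Xie et al., 2024).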
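To make the DPO and MinorDPO rows concrete, the short sketch below tabulates the reject-side term that is subtracted inside the sigmoid, for a few values of the reject log-ratio. The helper names, the chosen log-ratios, and `beta = 0.1` are assumptions made for illustration only.

```python
def reject_term_dpo(log_ratio_l: float, beta: float) -> float:
    """Term DPO subtracts inside the sigmoid: beta * log-ratio, unclamped."""
    return beta * log_ratio_l


def reject_term_minor_dpo(log_ratio_l: float, beta: float) -> float:
    """Term MinorDPO subtracts: the same quantity, clamped at zero from below."""
    return beta * max(0.0, log_ratio_l)


beta = 0.1  # illustrative value
log_ratios = [-4.0, -1.0, 0.0, 1.0, 4.0]  # log(pi_theta(y_l|x) / pi_ref(y_l|x))

rows = [
    (d, reject_term_dpo(d, beta), reject_term_minor_dpo(d, beta))
    for d in log_ratios
]

print("log-ratio   DPO term   MinorDPO term")
for d, t_dpo, t_minor in rows:
    print(f"{d:+9.1f}   {t_dpo:+8.2f}   {t_minor:+13.2f}")
```

For negative log-ratios the DPO term keeps rewarding further suppression of the reject, while the MinorDPO term stays at zero; for positive log-ratios the two agree.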
4. Algorithm, Gradient Structure, and Hyperparameters
Fine-tuning with MinorDPO involves the following process:
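- Collect a dataset $\mathcal{D}$ of preference triples $(x, y_w, y_l)$.
- For each mini-batch, compute:
  - Log-likelihoods under both $\pi_\theta$ and $\pi_{\mathrm{ref}}$.
  - Rewards $r_w$ and $r_l$ (the $\beta$-scaled log-likelihood ratios of $y_w$ and $y_l$), and apply the $\max(0, \cdot)$ gating to $r_l$.
  - The cross-entropy loss $\mathcal{L}_{\mathrm{MinorDPO}}$.
- Back-propagate gradients and update $\theta$ via Adam or a similar optimizer.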
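The forward part of these steps can be sketched in plain Python, assuming per-token log-probabilities have already been gathered for each answer. Everything named here (`sequence_logp`, `minor_dpo_batch_loss`, the toy numbers, `beta=0.1`) is hypothetical; the backward pass and optimizer update are omitted because they would normally be delegated to an autodiff framework.

```python
import math
from typing import List, Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def sequence_logp(token_logps: Sequence[float]) -> float:
    """Sequence log-likelihood as the sum of per-token log-probabilities."""
    return sum(token_logps)


def minor_dpo_batch_loss(policy_w: List[List[float]], policy_l: List[List[float]],
                         ref_w: List[List[float]], ref_l: List[List[float]],
                         beta: float = 0.1) -> float:
    """Mean MinorDPO loss over a mini-batch of token-level log-probabilities."""
    losses = []
    for pw, pl, rw, rl in zip(policy_w, policy_l, ref_w, ref_l):
        r_w = beta * (sequence_logp(pw) - sequence_logp(rw))
        r_l = beta * (sequence_logp(pl) - sequence_logp(rl))
        losses.append(-log_sigmoid(r_w - max(0.0, r_l)))  # gated reject reward
    return sum(losses) / len(losses)


# Toy mini-batch of two triples (made-up token log-probabilities).
policy_w = [[-0.5, -1.0], [-0.4, -0.6]]
ref_w = [[-1.0, -1.5], [-0.5, -1.0]]
policy_l = [[-2.0, -2.5], [-1.0, -1.0]]  # 1st below reference, 2nd above
ref_l = [[-1.5, -2.0], [-1.5, -1.5]]

batch_loss = minor_dpo_batch_loss(policy_w, policy_l, ref_w, ref_l)
```

In the toy batch, only the second triple contributes a reject penalty, because only there does the model assign the reject more likelihood than the reference.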
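The explicit gradient of the MinorDPO objective is

$$\nabla_\theta \mathcal{L}_{\mathrm{MinorDPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\sigma\left(\max(0, r_l) - r_w\right)\left(\nabla_\theta \log \pi_\theta(y_w \mid x) - \mathbb{1}[r_l > 0]\, \nabla_\theta \log \pi_\theta(y_l \mid x)\right)\right]$$

where $r_w$ and $r_l$ are the $\beta$-scaled log-likelihood ratios of $y_w$ and $y_l$ against the reference, and the indicator $\mathbb{1}[r_l > 0]$ "turns off" the reject update when $\pi_\theta(y_l \mid x) \le \pi_{\mathrm{ref}}(y_l \mid x)$ (Xie et al., 2024).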
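One way to sanity-check the gradient structure is to differentiate the per-triple loss with respect to the two sequence log-likelihoods and compare against central finite differences. The sketch below does this at one point where the gate is open and one where it is closed; the function names, test points, and `BETA = 0.1` are illustrative assumptions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss(logp_w, logp_l, ref_w, ref_l, beta):
    """Per-triple MinorDPO loss, written as log(1 + exp(-u)) = -log(sigmoid(u))."""
    u = beta * (logp_w - ref_w) - max(0.0, beta * (logp_l - ref_l))
    return math.log1p(math.exp(-u))  # adequate for the moderate u used here


def analytic_grad(logp_w, logp_l, ref_w, ref_l, beta):
    """Gradient with respect to (logp_w, logp_l), including the indicator gate."""
    r_w = beta * (logp_w - ref_w)
    r_l = beta * (logp_l - ref_l)
    coeff = sigmoid(max(0.0, r_l) - r_w)
    gate = 1.0 if r_l > 0.0 else 0.0  # indicator 1[r_l > 0]
    return -beta * coeff, beta * coeff * gate


def numeric_grad(logp_w, logp_l, ref_w, ref_l, beta, eps=1e-6):
    """Central finite differences, valid away from the kink at r_l = 0."""
    d_w = (loss(logp_w + eps, logp_l, ref_w, ref_l, beta)
           - loss(logp_w - eps, logp_l, ref_w, ref_l, beta)) / (2 * eps)
    d_l = (loss(logp_w, logp_l + eps, ref_w, ref_l, beta)
           - loss(logp_w, logp_l - eps, ref_w, ref_l, beta)) / (2 * eps)
    return d_w, d_l


REF_W, REF_L, BETA = -12.0, -15.0, 0.1
gate_open = (-11.0, -13.0)    # reject likelihood above the reference
gate_closed = (-11.0, -17.0)  # reject likelihood below the reference

grad_open = analytic_grad(*gate_open, REF_W, REF_L, BETA)
grad_closed = analytic_grad(*gate_closed, REF_W, REF_L, BETA)
```

A positive derivative with respect to `logp_l` means gradient descent lowers the reject likelihood; with the gate closed that derivative is exactly zero, so only the preferred answer is updated.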
Key hyperparameters:
- $\beta$: governs margin scaling (Xie et al., 2024 report the range that works well for 7B-scale models),
- learning rate (MinorDPO stays robust at higher values than DPO for Qwen1.5-7B-Chat),
- batch size (e.g., 64–128) (Xie et al., 2024, Xie et al., 2024).
5. Theoretical Motivation and Robustness
In DPO, the constant negative push on the reject likelihood $\pi_\theta(y_l \mid x)$ can cause catastrophic "token collapse" (excessive shrinking of $\pi_\theta(y_l \mid x)$ toward zero) and excessive model drift, especially when learning rates are increased or preference pairs are ambiguous. This happens because nothing bounds how far the probability of the reject sample can be reduced.
MinorDPO addresses this by imposing a one-sided penalty, analogous to the classical RLHF KL regularizer, which keeps $\pi_\theta$ from straying from the reference distribution on any output. The result is reduced gradient variance and a built-in safeguard against overfitting and instability:
- When $\pi_\theta(y_l \mid x) \le \pi_{\mathrm{ref}}(y_l \mid x)$, no further penalty is applied, preserving generality.
- When $\pi_\theta(y_l \mid x) > \pi_{\mathrm{ref}}(y_l \mid x)$, the penalty is imposed, enforcing alignment (Xie et al., 2024, Xie et al., 2024).
A notable side effect is the "stay-close" property: model updates focus on promoting $y_w$ without unnecessary penalization of an already-suppressed $y_l$.
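The contrast can be illustrated with a deliberately simplified toy, in which the two sequence log-likelihoods are treated as free parameters and updated by plain gradient descent. This is an assumption-laden sketch, not the training setup of the source: a real model shares parameters across outputs and normalizes probabilities, and the step count, learning rate, `beta`, and starting values below are arbitrary.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


REF_W, REF_L = -40.0, -45.0  # reference sequence log-likelihoods (made up)
LR, BETA = 1.0, 0.1          # toy step size and beta (illustrative)


def run(gated: bool, steps: int = 500, start_gap: float = 2.0):
    """Gradient descent on (logp_w, logp_l); gated=True uses the MinorDPO clamp."""
    logp_w = REF_W
    logp_l = REF_L + start_gap  # reject starts above the reference
    for _ in range(steps):
        r_w = BETA * (logp_w - REF_W)
        r_l = BETA * (logp_l - REF_L)
        penalty = max(0.0, r_l) if gated else r_l
        coeff = sigmoid(penalty - r_w)
        gate = 1.0 if (not gated or r_l > 0.0) else 0.0
        logp_w += LR * BETA * coeff         # always promote the preferred answer
        logp_l -= LR * BETA * coeff * gate  # suppress the reject only if gated on
    return logp_w, logp_l


_, dpo_l = run(gated=False)
_, minor_l = run(gated=True)

print(f"reference reject log-likelihood: {REF_L:.2f}")
print(f"DPO final reject log-likelihood: {dpo_l:.2f}")
print(f"MinorDPO final reject log-likelihood: {minor_l:.2f}")
```

In this toy, the ungated update keeps driving the reject log-likelihood down for as long as training runs, whereas the gated update stops within one step of the reference level and spends the remaining updates on the preferred answer.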
6. Empirical Results and Practical Recommendations
Experiments on the Qwen1.5-7B-Chat model with MetaMath_DPO_FewShot data and GSM8K evaluation reveal:
- As $\beta$ increases, MinorDPO's accuracy range sits about 5 percentage points above DPO's (exact ranges are reported in Xie et al., 2024).
- At higher learning rates, DPO collapses when $\beta$ is small (emitting repeated tokens), whereas MinorDPO remains robust (accuracy figures are reported in Xie et al., 2024).
- MinorDPO performs on par with or better than DPOP (which employs an extra SFT penalty term) without additional hyperparameter tuning.
- MinorDPO accommodates learning rates an order of magnitude higher than DPO without collapse (Xie et al., 2024).
A practical guideline is to set $\beta$ within the range recommended by Xie et al. (2024) for stable convergence and to use larger batch sizes and learning rates as model and data scale grow.
7. Influence on Related Methods
The central insight underpinning MinorDPO, sample-level gating of the reject penalty, generalizes to supervised fine-tuning as well. The same $\max(0, \cdot)$ mechanism motivates the MinorSFT loss, which applies the gating in single-label settings (no explicit reject) and weights sample contributions by a DPO-style sigmoid coefficient. Empirical traces show Minor-style methods maintain smaller model deviation (measured as average log-likelihood difference over a reference set) and achieve higher accuracy on domain QA tasks relative to DPO or SFT (Xie et al., 2024).
MinorDPO thus provides both a robust, principled modification to DPO and a paradigmatic step toward more stable, reference-anchored fine-tuning of LLMs.