MinorDPO: Preference Tuning for LLM Stability
- MinorDPO is a preference-based fine-tuning approach that modifies DPO by applying a one-sided reject penalty to improve model stability.
- It employs a gating mechanism that applies the reject-side penalty only when the student model assigns higher likelihood to the dispreferred output than a frozen reference model does.
- Empirical results indicate MinorDPO can boost accuracy by up to 5 percentage points and sustain robust performance at higher learning rates compared to DPO.
MinorDPO is a preference-based fine-tuning algorithm for LLMs that addresses instability and over-penalization issues present in the Direct Preference Optimization (DPO) objective. It introduces a one-sided "reject penalty" constraint that regularizes model deviations in line with the KL term present in classical reinforcement learning from human feedback (RLHF), while maintaining the simplicity of DPO’s RL-free, cross-entropy-based formulation. The key technical innovation in MinorDPO is the modification of the penalty on dispreferred (reject) samples: updates are only performed if the candidate likelihood under the student model exceeds that under a frozen reference, thereby eliminating unnecessary gradient pressure once the model’s preference aligns with the reference policy.
1. Background: Preference Optimization and DPO
Standard RLHF fine-tuning for LLMs employs a two-stage pipeline: (1) a reward model is learned from human preference data on completions, and (2) the LLM is fine-tuned using a policy-optimization algorithm such as Proximal Policy Optimization (PPO), with an explicit KL constraint to prevent excessive policy deviation from a frozen reference.
Direct Preference Optimization (DPO) reframes this process by dispensing with explicit reward modeling and RL altogether. Instead, DPO optimizes a binary cross-entropy loss on observed preference triplets $(x, y_w, y_l)$, where $y_w$ is the preferred completion and $y_l$ is the rejected one, using log-likelihood ratios as implicit rewards:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right],$$

where $\sigma$ is the sigmoid function and $\beta$ modulates the margin sensitivity (Xie et al., 2024).
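To make the objective concrete, the following is a minimal PyTorch-style sketch of the DPO loss, assuming per-sequence log-likelihoods (summed over completion tokens) have already been computed under both the trainable policy and the frozen reference; the function and tensor names are illustrative rather than taken from the paper.

```python
# Minimal sketch of the DPO loss (illustrative names, not from the paper).
# Inputs are per-sequence log-likelihoods, summed over completion tokens.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_reject_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_reject_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards r = beta * log(pi_theta / pi_ref) for chosen and reject samples
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    reject_reward = beta * (policy_reject_logps - ref_reject_logps)
    # Binary cross-entropy on the reward margin: -log sigma(r_w - r_l)
    return -F.logsigmoid(chosen_reward - reject_reward).mean()
```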
2. Technical Formulation of MinorDPO
MinorDPO introduces a modification to the DPO loss, focusing specifically on the reject-penalty term. The core idea is to activate the reject-side penalty only when the student model over-assigns probability to the dispreferred answer relative to the reference. This is formalized as:
- Let $r_w = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}$,
- Let $r_l = \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}$,
- Define the gated reject term $\tilde{r}_l = \max(0,\, r_l)$.

The MinorDPO loss is:

$$\mathcal{L}_{\text{MinorDPO}} = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(r_w - \tilde{r}_l\right)\right].$$

This mechanism ensures that no further pressure is applied to reduce $\pi_\theta(y_l \mid x)$ once it falls below $\pi_{\text{ref}}(y_l \mid x)$ (Xie et al., 2024).
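Relative to the DPO sketch above, the only change is a single clamp that implements the gate: the reject log-ratio contributes (and back-propagates) only when it is positive, i.e., when $\pi_\theta(y_l \mid x) > \pi_{\text{ref}}(y_l \mid x)$. As before, this is an illustrative sketch, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def minor_dpo_loss(policy_chosen_logps: torch.Tensor,
                   policy_reject_logps: torch.Tensor,
                   ref_chosen_logps: torch.Tensor,
                   ref_reject_logps: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    # One-sided reject term: clamp the log-ratio at zero, so the penalty (and its
    # gradient) vanishes once pi_theta(y_l|x) <= pi_ref(y_l|x).
    gated_reject_reward = beta * torch.clamp(
        policy_reject_logps - ref_reject_logps, min=0.0)
    return -F.logsigmoid(chosen_reward - gated_reject_reward).mean()
```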
3. Comparison with DPO, IPO, and KTO
The main difference between DPO and MinorDPO lies in the handling of the reject log-ratio:
| Algorithm | Loss Penalty on Reject Sample | KL Constraint Behavior |
|---|---|---|
| DPO | Always penalizes the reject log-ratio $\log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}$; risk of over-pushing $\pi_\theta(y_l \mid x)$ toward zero | Two-sided; no lower bound on how far the reject probability is suppressed |
| MinorDPO | Penalizes only if $\pi_\theta(y_l \mid x) > \pi_{\text{ref}}(y_l \mid x)$ | One-sided; truncated at the reference likelihood |
| IPO | Full (squared) log-ratio difference, no gating | Similar overfitting risk as DPO |
| KTO | Implicit via batch pairing | No explicit reject-tail truncation |
MinorDPO uniquely truncates the penalty on $\pi_\theta(y_l \mid x)$, eliminating continued downward pressure once $\pi_\theta(y_l \mid x)$ has been suppressed below $\pi_{\text{ref}}(y_l \mid x)$ (Xie et al., 2024).
4. Algorithm, Gradient Structure, and Hyperparameters
Fine-tuning with MinorDPO involves the following process:
- Collect a dataset of preference triples $(x, y_w, y_l)$.
- For each mini-batch, compute:
  - Log-likelihoods of $y_w$ and $y_l$ under both $\pi_\theta$ and $\pi_{\text{ref}}$.
  - Rewards $r_w$ and $r_l$, and apply the gating $\tilde{r}_l = \max(0, r_l)$.
  - The cross-entropy loss $\mathcal{L}_{\text{MinorDPO}}$.
- Back-propagate gradients and update $\theta$ via Adam or a similar optimizer.
The explicit gradient of the MinorDPO objective is

$$\nabla_\theta \mathcal{L}_{\text{MinorDPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\sigma\!\left(\tilde{r}_l - r_w\right)\left(\nabla_\theta \log \pi_\theta(y_w \mid x) - \mathbf{1}\!\left[\pi_\theta(y_l \mid x) > \pi_{\text{ref}}(y_l \mid x)\right]\nabla_\theta \log \pi_\theta(y_l \mid x)\right)\right],$$

where the indicator "turns off" the reject update whenever $\pi_\theta(y_l \mid x) \le \pi_{\text{ref}}(y_l \mid x)$ (Xie et al., 2024).
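The gating is visible directly in autograd. A toy check (with arbitrary log-likelihood values) shows that when the policy already assigns the reject sample lower likelihood than the reference, the reject term receives zero gradient while the chosen term still does:

```python
import torch
import torch.nn.functional as F

beta = 0.1
# Toy per-sequence log-likelihoods (arbitrary values): the reject sample is
# already less likely under the policy than under the reference.
policy_chosen = torch.tensor([-10.0], requires_grad=True)
policy_reject = torch.tensor([-15.0], requires_grad=True)
ref_chosen = torch.tensor([-11.0])
ref_reject = torch.tensor([-12.0])

r_w = beta * (policy_chosen - ref_chosen)
r_l = beta * torch.clamp(policy_reject - ref_reject, min=0.0)  # gate is closed here
loss = -F.logsigmoid(r_w - r_l).mean()
loss.backward()

print(policy_chosen.grad)  # non-zero: y_w is still promoted
print(policy_reject.grad)  # zero: no extra push on the already-suppressed y_l
```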
Key hyperparameters:
- $\beta$: governs margin scaling (values up to $0.2$ are effective for 7B-scale models),
- learning rate (MinorDPO remains robust at learning rates substantially higher than those DPO tolerates for Qwen1.5-7B-Chat),
- batch size (e.g., 64–128) (Xie et al., 2024).
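Putting the pieces together, one training loop under these hyperparameters might look like the sketch below. Here `policy`, `reference`, and `dataloader` are assumed to exist (Hugging Face-style causal LMs sharing a tokenizer, and a loader of tokenized preference pairs), the learning rate is a placeholder rather than a value from the paper, and `minor_dpo_loss` is the sketch from Section 2.

```python
import torch

def sequence_logps(model, input_ids, labels):
    """Sum of completion-token log-likelihoods (labels use -100 for ignored positions)."""
    logits = model(input_ids=input_ids).logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = labels != -100
    safe_labels = labels.masked_fill(~mask, 0)
    token_logps = torch.log_softmax(logits, dim=-1).gather(
        2, safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask.float()).sum(dim=-1)

optimizer = torch.optim.AdamW(policy.parameters(), lr=5e-7)  # placeholder value
beta = 0.2

for batch in dataloader:  # e.g., 64-128 preference pairs per batch
    pi_w = sequence_logps(policy, batch["chosen_ids"], batch["chosen_labels"])
    pi_l = sequence_logps(policy, batch["reject_ids"], batch["reject_labels"])
    with torch.no_grad():  # the reference model stays frozen
        ref_w = sequence_logps(reference, batch["chosen_ids"], batch["chosen_labels"])
        ref_l = sequence_logps(reference, batch["reject_ids"], batch["reject_labels"])
    loss = minor_dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=beta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```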
5. Theoretical Motivation and Robustness
In DPO, the constant negative push on $\pi_\theta(y_l \mid x)$ can cause catastrophic "token collapse" (excessive shrinking of $\pi_\theta(y_l \mid x)$ toward zero) and excessive model drift, especially when learning rates are increased or preference pairs are ambiguous. This is due to the lack of a lower bound on how far the probability of the reject sample can be pushed down.
MinorDPO addresses this by imposing a one-sided penalty, analogous to the classical RLHF KL-regularizer, which penalizes $\pi_\theta$ for drifting away from the reference distribution on any output. The result is reduced gradient variance and a built-in safeguard against overfitting and instability:
- When $\pi_\theta(y_l \mid x) \le \pi_{\text{ref}}(y_l \mid x)$, no further penalty is applied, preserving generality.
- When $\pi_\theta(y_l \mid x) > \pi_{\text{ref}}(y_l \mid x)$, the penalty is imposed, enforcing alignment (Xie et al., 2024).
A notable side effect is the "stay-close" property: model updates focus on promoting $y_w$ without unnecessarily penalizing an already-suppressed $y_l$.
6. Empirical Results and Practical Recommendations
Experiments on the Qwen1.5-7B-Chat model with MetaMath_DPO_FewShot data and GSM8K evaluation reveal:
- MinorDPO attains GSM8K accuracy up to $5$ percentage points higher than DPO across the evaluated settings.
- At higher learning rates, DPO collapses into degenerate outputs (repeated tokens); MinorDPO remains robust and sustains accuracy of at least $38\%$.
- MinorDPO performs on par with or better than DPOP (which adds an extra SFT-style penalty $\lambda \max\!\left(0, \log \frac{\pi_{\text{ref}}(y_w \mid x)}{\pi_\theta(y_w \mid x)}\right)$ on the preferred sample) without additional hyperparameter tuning.
- MinorDPO accommodates learning rates an order of magnitude higher than DPO without collapse (Xie et al., 2024).
A practical guideline is to set $\beta$ within the effective range noted above for stable convergence and to increase batch sizes and learning rates as model and data scale grow.
7. Influence on Related Methods
The central insight underpinning MinorDPO (sample-level gating of the reject penalty) generalizes to supervised fine-tuning as well. The same mechanism motivates the MinorSFT loss, which applies the gating in single-label settings (with no explicit reject sample) and weights each sample's cross-entropy contribution by a DPO-style sigmoid coefficient of the log-likelihood ratio between $\pi_\theta$ and $\pi_{\text{ref}}$. Empirical traces show that Minor-style methods maintain smaller model deviation (measured as the average log-likelihood difference over a reference set) and achieve higher accuracy on domain QA tasks relative to DPO or SFT (Xie et al., 2024).
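As a rough illustration of that idea (not the exact MinorSFT formulation, which follows Xie et al., 2024), one could weight each sample's SFT term by a sigmoid of the reference-versus-policy log-likelihood gap, so that examples the policy already fits better than the reference contribute little gradient:

```python
import torch

def minor_sft_style_loss(policy_logps: torch.Tensor,
                         ref_logps: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    """Sigmoid-weighted SFT loss in the spirit of MinorSFT (assumed illustrative form).

    policy_logps / ref_logps: summed log-likelihoods of the single target sequence
    under the trainable policy and the frozen reference.
    """
    with torch.no_grad():
        # Weight is large while the policy still under-fits the target relative to
        # the reference, and decays toward zero once the policy exceeds it, which
        # keeps the model close to the reference (smaller deviation).
        weight = torch.sigmoid(beta * (ref_logps - policy_logps))
    return -(weight * policy_logps).mean()
```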
MinorDPO thus provides both a robust principled modification to DPO and a paradigmatic step toward more stable, reference-anchored fine-tuning of LLMs.