MinorDPO: Preference Tuning for LLM Stability

Updated 13 January 2026

MinorDPO is a preference-based fine-tuning approach that modifies DPO by applying a one-sided reject penalty to improve model stability.
It employs a gating mechanism that applies the penalty only when the student model assigns higher likelihood to a dispreferred output than a frozen reference model does.
Empirical results indicate MinorDPO can boost accuracy by up to 5 percentage points and sustain robust performance at higher learning rates compared to DPO.

MinorDPO is a preference-based fine-tuning algorithm for LLMs that addresses instability and over-penalization issues present in the Direct Preference Optimization (DPO) objective. It introduces a one-sided "reject penalty" constraint that regularizes model deviations in line with the KL term present in classical reinforcement learning from human feedback (RLHF), while maintaining the simplicity of DPO’s RL-free, cross-entropy-based formulation. The key technical innovation in MinorDPO is the modification of the penalty on dispreferred (reject) samples: the penalty is applied only if the likelihood of the rejected sample under the student model exceeds that under a frozen reference, thereby eliminating unnecessary gradient pressure once the model already assigns the rejected sample no more probability than the reference policy does.

1. Background: Preference Optimization and DPO

Standard RLHF fine-tuning for LLMs employs a two-stage pipeline: (1) a reward model is learned from human preference data on completions, and (2) the LLM is fine-tuned using a policy-optimization algorithm such as Proximal Policy Optimization (PPO), with an explicit KL constraint to prevent excessive policy deviation from a frozen reference.

Direct Preference Optimization (DPO) reframes this process by dispensing with explicit reward modeling and RL altogether. Instead, DPO optimizes a binary cross-entropy loss on observed preference triplets $(x, y_w, y_l)$ , where $y_w$ is the preferred response and $y_l$ is the rejected one, using log-likelihood ratios against the frozen reference $\pi_\text{ref}$ as implicit rewards: $L_{\mathrm{DPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\left( \beta \left( \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} \right) \right) \right]$ where $\sigma$ is the sigmoid function and $\beta$ modulates the margin sensitivity (Xie et al., 2024, Xie et al., 2024).
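As a rough illustration, the per-example DPO loss in its standard reference-normalized form can be sketched in plain Python. The function names and the convention of passing sequence-level log-likelihoods as floats are assumptions made for this example, not part of the cited work.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-likelihoods.

    logp_*     : log pi_theta(y|x) for the preferred (w) / rejected (l) response
    ref_logp_* : the same quantities under the frozen reference model
    """
    r_w = logp_w - ref_logp_w   # implicit reward of the preferred response
    r_l = logp_l - ref_logp_l   # implicit reward of the rejected response
    return -log_sigmoid(beta * (r_w - r_l))
```

When the policy equals the reference, both log-ratios are zero and the loss is $\log 2$; raising $r_w$ or lowering $r_l$ reduces it.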

2. Technical Formulation of MinorDPO

MinorDPO introduces a modification to the DPO loss, focusing specifically on the reject-penalty term. The core idea is to activate the reject-side penalty only when the student model over-assigns probability to the dispreferred answer relative to the reference. This is formalized as:

Let $r_w = \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)}$ ,
Let $r_l = \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}$ ,
Define $r_l^+ = \max(0, r_l)$ .

The MinorDPO loss is: $L_{\mathrm{MinorDPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l)\sim D} \left[ \log \sigma\left(\beta (r_w - r_l^+) \right) \right]$ This mechanism enforces that no further pressure is applied to reduce $\pi_\theta(y_l|x)$ once it falls below $\pi_\text{ref}(y_l|x)$ (Xie et al., 2024, Xie et al., 2024).
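The definitions above translate almost directly into code. The following is a minimal sketch, assuming sequence-level log-likelihoods are already available as floats; the function names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def minor_dpo_loss(logp_w: float, logp_l: float,
                   ref_logp_w: float, ref_logp_l: float,
                   beta: float = 0.1) -> float:
    """Per-example MinorDPO loss: -log sigma(beta * (r_w - max(0, r_l)))."""
    r_w = logp_w - ref_logp_w
    r_l = logp_l - ref_logp_l
    r_l_plus = max(0.0, r_l)    # one-sided gate on the reject log-ratio
    return -log_sigmoid(beta * (r_w - r_l_plus))
```

Once $r_l \leq 0$, the loss no longer depends on $\log \pi_\theta(y_l|x)$: in this sketch, `minor_dpo_loss(-4.0, -9.0, -5.0, -7.0)` and `minor_dpo_loss(-4.0, -20.0, -5.0, -7.0)` return the same value.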

3. Comparison with DPO, IPO, and KTO

The main difference between DPO and MinorDPO lies in the handling of the reject log-ratio:

Algorithm	Loss Penalty on Reject Sample	KL Constraint Behavior
DPO	$-r_l$	Always penalizes; risks over-suppressing $y_l$
MinorDPO	$-\max(0, r_l)$	Penalizes only if $r_l > 0$
IPO	Full difference $r_w - r_l$	Similar overfitting as DPO
KTO	Implicit via batch pairing	No explicit reject tail truncation

MinorDPO uniquely truncates the penalty on $y_l$ , eliminating continued downward pressure once $\pi_\theta(y_l|x)$ has fallen to or below $\pi_\text{ref}(y_l|x)$ (Xie et al., 2024).
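To make the DPO and MinorDPO rows concrete, the short sketch below evaluates the reject-sample term of the margin for a few values of $r_l$. The helper names are hypothetical, and IPO and KTO are left out because their exact forms are not spelled out here.

```python
def reject_term_dpo(r_l: float) -> float:
    """Reject-sample contribution to the margin under DPO: -r_l."""
    return 0.0 - r_l


def reject_term_minor_dpo(r_l: float) -> float:
    """Reject-sample contribution to the margin under MinorDPO: -max(0, r_l)."""
    return 0.0 - max(0.0, r_l)


if __name__ == "__main__":
    for r_l in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"r_l={r_l:+.1f}  "
              f"DPO={reject_term_dpo(r_l):+.1f}  "
              f"MinorDPO={reject_term_minor_dpo(r_l):+.1f}")
```

For $r_l > 0$ the two terms coincide. For $r_l < 0$ the DPO term keeps growing as $y_l$ is suppressed further, which keeps rewarding additional suppression, whereas the MinorDPO term stays at zero.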

4. Algorithm, Gradient Structure, and Hyperparameters

Fine-tuning with MinorDPO involves the following process:

Collect a dataset $D$ of preference triples $(x, y_w, y_l)$ .
For each mini-batch, compute:
- Log-likelihoods under both $\pi_\theta$ and $\pi_\text{ref}$ .
- Rewards $r_w$ and $r_l$ , and apply the $\max(0, r_l)$ gating.
- The cross-entropy loss $-\log \sigma(\beta (r_w - r_l^+))$ .
Back-propagate gradients and update $\theta$ via Adam or a similar optimizer.
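The per-batch computation in the steps above might look like the following sketch. It assumes each example carries per-token log-probabilities under the policy and the reference (the dictionary keys are hypothetical) and that the sequence log-likelihood is their sum. The back-propagation and optimizer step are omitted because they require an autograd framework.

```python
import math
from typing import Dict, List, Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def minor_dpo_batch_loss(batch: List[Dict[str, Sequence[float]]],
                         beta: float = 0.1) -> float:
    """Mean MinorDPO loss over a mini-batch of preference triples.

    Each example holds per-token log-probs for the preferred ("w") and
    rejected ("l") responses under the policy and the frozen reference.
    """
    losses = []
    for ex in batch:
        # Sequence log-likelihood = sum of per-token log-probs.
        r_w = sum(ex["policy_w"]) - sum(ex["ref_w"])
        r_l = sum(ex["policy_l"]) - sum(ex["ref_l"])
        margin = beta * (r_w - max(0.0, r_l))   # gated reject term
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)
```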

The explicit gradient of the MinorDPO objective is: $\nabla_\theta L_{\mathrm{MinorDPO}} = -\, \mathbb{E} \left[ \beta\, \sigma\left(-\beta (r_w - r_l^+)\right) \left( \nabla_\theta \log \pi_\theta(y_w|x) - \mathbb{I}_{r_l > 0}\,\nabla_\theta \log \pi_\theta(y_l|x) \right) \right]$ where the indicator $\mathbb{I}_{r_l > 0}$ "turns off" the reject update when $r_l \leq 0$ (Xie et al., 2024).
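One way to sanity-check this gradient is to compare it against finite differences. The sketch below treats the two policy log-likelihoods as free scalar parameters, a simplification for illustration only, since in a real model both depend on shared weights $\theta$. All names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss(logp_w, logp_l, ref_w, ref_l, beta):
    """Per-example MinorDPO loss."""
    margin = (logp_w - ref_w) - max(0.0, logp_l - ref_l)
    return -math.log(sigmoid(beta * margin))


def analytic_grad(logp_w, logp_l, ref_w, ref_l, beta):
    """(dL/dlogp_w, dL/dlogp_l) following the gated gradient formula."""
    r_w, r_l = logp_w - ref_w, logp_l - ref_l
    coeff = beta * sigmoid(-beta * (r_w - max(0.0, r_l)))
    gate = 1.0 if r_l > 0 else 0.0      # indicator I[r_l > 0]
    return (-coeff, coeff * gate)


def numeric_grad(logp_w, logp_l, ref_w, ref_l, beta, eps=1e-6):
    """Central finite differences with respect to the two log-likelihoods."""
    d_w = (loss(logp_w + eps, logp_l, ref_w, ref_l, beta)
           - loss(logp_w - eps, logp_l, ref_w, ref_l, beta)) / (2 * eps)
    d_l = (loss(logp_w, logp_l + eps, ref_w, ref_l, beta)
           - loss(logp_w, logp_l - eps, ref_w, ref_l, beta)) / (2 * eps)
    return (d_w, d_l)
```

With $r_l > 0$ both components are nonzero; with $r_l < 0$ the component for $y_l$ vanishes, matching the role of the indicator.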

Key hyperparameters:

$\beta$ : governs margin scaling ( $\approx 0.1$ –$0.2$ is effective for 7B-scale models),
learning rate (robust up to $10^{-5}$ for Qwen1.5-7B-Chat),
batch size (e.g., 64–128) (Xie et al., 2024, Xie et al., 2024).

5. Theoretical Motivation and Robustness

In DPO, the constant negative push on $y_l$ can cause catastrophic "token collapse" (excessive shrinking of $\pi_\theta(y_l|x)$ toward zero) and excessive model drift, especially when learning rates are increased or preference pairs are ambiguous. This is due to the lack of a lower bound on how much the probability of the reject sample is reduced.

MinorDPO addresses this by imposing a one-sided penalty, analogous to the classical RLHF KL regularizer, which discourages $\pi_\theta$ from drifting away from the reference distribution. The result is reduced gradient variance and a built-in safeguard against overfitting and instability:

When $\pi_\theta(y_l|x) \leq \pi_\text{ref}(y_l|x)$ , no further penalty—preserving generality.
When $\pi_\theta(y_l|x) > \pi_\text{ref}(y_l|x)$ , penalty is imposed—enforcing alignment (Xie et al., 2024, Xie et al., 2024).

A notable side effect is the "stay-close" property: model updates focus on promoting $y_w$ without unnecessary penalization of already-suppressed $y_l$ .
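A toy experiment can illustrate the difference in drift. The sketch below is not taken from the cited work: it assumes a three-outcome categorical policy parameterized by logits, a uniform reference, plain gradient descent, and arbitrary illustrative settings (`steps`, `lr`, `beta`). It returns the final reject log-ratio $r_l$ for either objective.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(gated: bool, steps: int = 200, lr: float = 1.0, beta: float = 0.1):
    """Gradient descent on one preference pair; returns the final r_l.

    gated=False reproduces the DPO update, gated=True the MinorDPO update.
    """
    W, L = 0, 1                              # preferred / rejected outcome
    ref_logp = log_softmax([0.0, 0.0, 0.0])  # frozen uniform reference
    theta = [0.0, 0.0, 0.0]                  # policy logits
    for _ in range(steps):
        logp = log_softmax(theta)
        probs = [math.exp(v) for v in logp]
        r_w = logp[W] - ref_logp[W]
        r_l = logp[L] - ref_logp[L]
        gate = 1.0 if (not gated or r_l > 0) else 0.0
        coeff = beta * sigmoid(-beta * (r_w - gate * r_l))
        for k in range(3):
            # d log pi(j) / d theta_k = 1[k == j] - pi(k) for a softmax policy
            g_w = (1.0 if k == W else 0.0) - probs[k]
            g_l = (1.0 if k == L else 0.0) - probs[k]
            theta[k] += lr * coeff * (g_w - gate * g_l)
    return log_softmax(theta)[L] - ref_logp[L]
```

In this toy setting, `train(gated=False)` ends with a far more negative $r_l$ than `train(gated=True)`. Under the gated objective, $\pi_\theta(y_l|x)$ still decreases, but only as a side effect of promoting $y_w$ through the softmax, not because of a direct push on $y_l$.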

6. Empirical Results and Practical Recommendations

Experiments on the Qwen1.5-7B-Chat model with MetaMath_DPO_FewShot data and GSM8K evaluation reveal:

DPO achieves accuracy of $25\%$ – $37\%$ as $\beta$ increases; MinorDPO attains $30\%$ – $40\%$ (a gain of up to 5 percentage points).
At a learning rate of $1 \times 10^{-5}$ , DPO collapses with small $\beta$ (emitting repeated tokens); MinorDPO remains robust and yields $38$– $40\%$ accuracy.
MinorDPO performs on par with or better than DPOP (which employs an extra SFT penalty, $\lambda=50$ ) without additional hyperparameter tuning.
MinorDPO accommodates learning rates an order of magnitude higher than DPO without collapse (Xie et al., 2024).

A practical guideline is to set $\beta$ in $[0.1, 0.2]$ for stable convergence and to increase batch size and learning rate as the model and dataset scale up.

The central insight underpinning MinorDPO—sample-level gating of the reject penalty—generalizes to supervised fine-tuning as well. The same $\max(0, r)$ mechanism motivates the MinorSFT loss, which applies the gating on single-label settings (no explicit reject) and weights sample contributions by a DPO-style sigmoid coefficient: $\nabla_\theta L_{\mathrm{MinorSFT}} = -\mathbb{E}_{(x,y)}\left[ \frac{2}{m} \sigma( -\beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)} ) \nabla_\theta \log \pi_\theta(y|x) \right]$ Empirical traces show Minor-style methods maintain smaller model deviation (measured as average log-likelihood difference over a reference set) and achieve higher accuracy on domain QA tasks relative to DPO or SFT (Xie et al., 2024).
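The per-sample coefficient in the MinorSFT gradient can be sketched as follows. Since $m$ is not defined in this section, it is passed in as a plain parameter; the function name and argument conventions are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def minor_sft_weight(logp: float, ref_logp: float, m: float,
                     beta: float = 0.1) -> float:
    """Coefficient (2/m) * sigma(-beta * log(pi_theta / pi_ref)).

    logp, ref_logp : log-likelihood of the target y under policy / reference
    m              : normalizer from the gradient formula (left as an input)
    """
    r = logp - ref_logp
    return (2.0 / m) * sigmoid(-beta * r)
```

When the policy matches the reference, $\sigma(0) = 1/2$ and the coefficient reduces to $1/m$. As the policy's likelihood for $y$ rises above the reference the coefficient shrinks, so samples the model has already moved toward contribute less to the update.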

MinorDPO thus provides a robust, principled modification to DPO and a step toward more stable, reference-anchored fine-tuning of LLMs.