Label-Free Preference Optimization
- Label-Free Preference Optimization is a paradigm that aligns generative models using implicit reward signals derived from model internals rather than human annotations.
- It employs methods such as self-generated feedback, pseudo-labeling, and groupwise comparisons to create robust and scalable training pipelines.
- Empirical studies show that these methods improve performance in video and text generation, combinatorial optimization, and reinforcement learning while mitigating annotation costs.
Label-free preference optimization refers to a class of methods for aligning models—especially generative models, LLMs, diffusion models, and reinforcement learning (RL) policies—with desirable preference structures without requiring human-labeled or reference-based data. Instead, these methods leverage implicit, synthetic, or self-generated preference signals, statistically defined transformations of quantitative rewards, or structural invariants to drive optimization. Recent developments span fully reference-free objectives, pseudo- or self-labeling pipelines, multi-response/groupwise preference models, reward thresholding, and probabilistic inference frameworks, enabling broader scalability and robustness across supervised, unsupervised, and semi-supervised settings.
1. Foundations and Motivation
Traditional preference optimization typically relies on explicit pairwise human feedback, gold labels, or reference models to construct preference pairs for methods like Direct Preference Optimization (DPO). This data requirement imposes scalability and cost bottlenecks and can introduce distributional mismatches between the model’s generation regime and the labeled data. Label-free preference optimization addresses these limitations by eliminating the need for such labels or references. Key foundational paradigms include:
- Reference-free reward design, using the model’s own likelihood or other internally computed signals as a proxy for preferences.
- Synthetic or groupwise preference supervision, generating “win/lose” examples by editing, augmentation, or pseudo feedback (e.g., test-case evaluation, self-consistency).
- Pseudo-labeling and thresholding approaches, where unpaired or entirely unlabeled data are algorithmically assigned preference labels or scores, often leveraging statistics from a small labeled subset.
- Probabilistic inference and decoupled feedback frameworks, supporting flexible learning from positive-only, negative-only, or mixed unpaired feedback without explicit pairwise labeling.
Label-free optimization thereby enables preference alignment at scale, often with enhanced robustness to label noise and lower susceptibility to reference drift or overfitting.
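One of the foundational paradigms above, synthetic win/lose construction by editing, can be illustrated with a toy sketch (the function name `make_synthetic_negative` is hypothetical, and real pipelines use far richer edits such as temporal shuffling or noising):

```python
import random

def make_synthetic_negative(positive_tokens, seed=0):
    """Label-free 'lose' sample: corrupt a real 'win' sample by shuffling
    its token order, so (original, corrupted) forms a preference pair
    with no human annotation."""
    rng = random.Random(seed)
    negative = list(positive_tokens)
    rng.shuffle(negative)
    return negative

# The intact sequence is the implicit 'win'; the corrupted copy is the 'lose'.
positive = ["the", "cat", "sat", "on", "the", "mat"]
negative = make_synthetic_negative(positive)
```

Because the corruption only rearranges real content, the resulting pair carries an unambiguous preference direction without any labeling step.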
2. Reference-Free Reward and Preference Modeling
A central innovation in label-free preference optimization is the construction of optimization signals directly from model-internal quantities, eliminating any dependence on external reference or SFT (supervised fine-tuned) models.
- Lean Preference Optimization (LeanPO) defines the reward of a response $y$ given input $x$ as the model's own average per-token log-likelihood:

$$r_\theta(x, y) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$
This reference-free reward acts as an implicit preference signal, ensuring stability and preventing ratio-based likelihood displacement observed in DPO-based regimes, particularly in complex or redundant output domains such as video (Wang et al., 5 Jun 2025).
- Groupwise and multi-response comparisons (e.g., REFA, mDPO/mIPO) optimize over sets of candidate responses, treating the joint likelihood of positive/negative sets as a multi-preference signal—e.g., maximizing the model probability on responses scoring above the mean reward relative to those at/below the mean, with further weighting by deviation-based importance scores (Gupta et al., 2024, Wang et al., 2024).
- Augmentation and editing pipelines (e.g., discriminator-free DPO for diffusion models) synthesize negative examples (e.g., edited, temporally shuffled, or noised-out videos) from real “positive” data, sidestepping explicit label acquisition while yielding an unambiguous hierarchical structure of preferred/unpreferred outcomes (Cheng et al., 11 Apr 2025).
- Self-reflection and domain-dependent pipelines (e.g., LeanPO’s prior injection + self-reflection + augmentation cycle) continually generate high-quality preference pairs by prompting the model with domain-specific knowledge and its own outputs (Wang et al., 5 Jun 2025).
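The reference-free reward idea can be sketched in a few lines (a toy illustration, not LeanPO's actual implementation; `self_reward` and `preference_pair` are hypothetical names):

```python
import numpy as np

def self_reward(token_logprobs):
    """Reference-free reward: the model's own average per-token
    log-likelihood of a response (no SFT/reference model needed)."""
    return float(np.mean(token_logprobs))

def preference_pair(candidates):
    """Pick win/lose from a set of self-generated candidates by the
    implicit reward, mimicking a self-generated preference pipeline.
    `candidates` maps response id -> list of per-token log-probs."""
    scored = sorted(candidates.items(), key=lambda kv: self_reward(kv[1]))
    lose_id, win_id = scored[0][0], scored[-1][0]
    return win_id, lose_id

cands = {
    "a": [-0.1, -0.2, -0.1],   # fluent: high average likelihood
    "b": [-1.5, -2.0, -1.0],   # low average likelihood
    "c": [-0.6, -0.5, -0.7],
}
win, lose = preference_pair(cands)
```

Because the reward is a per-token average, longer responses are not automatically favored, which is part of what makes the signal stable as a preference proxy.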
3. Pseudo-Labeling and Self-Supervision
Label-free preference optimization often leverages pseudo-labeling frameworks, learning to assign preference labels or scores to unpaired or unlabeled samples, sometimes requiring only a small seed of labeled data:
- Semi-Supervised Preference Optimization (SSPO) employs a formal threshold theorem: a reward threshold estimated from limited labeled win/loss data robustly separates unpaired data into pseudo-win/pseudo-loss classes. KDE-based threshold estimation and a weighted objective combine to propagate preference supervision across large unlabeled sets (Lee et al., 28 Oct 2025).
- Test-case–based pseudo feedback (PFPO) converts the outcome of solution verification (e.g., test-case pass rates) into preference signals: solutions passing all tests are treated as preferred, without any human label. Both single-case (math) and multi-case (code) reasoning tasks can be optimized with such pseudo feedback, bootstrappable from stronger ("frontier") LLMs or from self-consistency schemes (Jiao et al., 2024).
- Statistical rejection sampling (RSO) employs a reward model to select preference samples most likely to represent the target policy, obviating the need for further human annotation (Liu et al., 2023).
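The thresholding step behind SSPO-style pseudo-labeling can be sketched as follows. Here the KDE-based estimate is replaced by a crude midpoint-of-means heuristic for brevity; `estimate_threshold` and `pseudo_label` are hypothetical names, not the paper's API:

```python
import numpy as np

def estimate_threshold(labeled_win_rewards, labeled_lose_rewards):
    """Crude stand-in for SSPO's KDE-based threshold: the midpoint of the
    labeled win/lose reward means, a rough density-crossing estimate for
    roughly symmetric reward distributions."""
    return 0.5 * (np.mean(labeled_win_rewards) + np.mean(labeled_lose_rewards))

def pseudo_label(unlabeled_rewards, tau):
    """Assign pseudo-win (True) / pseudo-lose (False) by thresholding,
    propagating preference supervision to unpaired data."""
    return [r > tau for r in unlabeled_rewards]

# Small labeled seed: rewards of known win/lose responses.
wins, loses = [2.1, 1.8, 2.4], [-0.5, 0.1, -0.2]
tau = estimate_threshold(wins, loses)
labels = pseudo_label([2.0, -0.3, 1.5], tau)
```

In the full method, the pseudo-labeled examples then enter a weighted preference objective alongside the labeled seed, so threshold errors on borderline samples are down-weighted rather than trusted outright.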
4. Groupwise and Multi-Preference Objectives
Label-free preference frameworks have increasingly moved toward groupwise and multi-preference losses to address issues of diversity, bias, and robustness, which are not adequately captured by pairwise or pointwise comparisons.
- REFA generalizes the preference loss to the case where, for each prompt or query, a set of sampled responses is partitioned into a positive set $\mathcal{Y}^+$ (reward above the group mean) and a negative set $\mathcal{Y}^-$ (reward at or below it). The objective maximizes the weighted sum of model probabilities for $\mathcal{Y}^+$ relative to $\mathcal{Y}^-$, further refined by deviation-based weights favoring higher-quality outliers.
Here, length normalization and an EOS-probability regularizer prevent trivial solution incentives (e.g., generating short, uninformative outputs), while deviation-based weighting amplifies the impact of responses scoring far above the mean (Gupta et al., 2024).
- Multi-sample comparison frameworks (mDPO/mIPO) apply the Bradley–Terry or squared-margin loss not to individual responses, but to groups of samples per class (chosen/rejected), measuring groupwise log-likelihood or log-ratio differences. Empirically, groupwise supervision improves diversity, resilience to label noise, and stability in generative tasks (Wang et al., 2024).
5. Algorithmic and Theoretical Advances
The table below summarizes the key mechanisms in leading label-free preference optimization methods.
| Method | Signal Source | Key Objective / Loss | Reference-Free? | Special Mechanisms |
|---|---|---|---|---|
| LeanPO | Model likelihood | Per-token log-likelihood margin | Yes | Self-generation pipeline, dynamic label smoothing |
| REFA | Group likelihood | Weighted multi-preference likelihood ratio | Yes | Length normalization, EOS penalty, deviation weights |
| SSPO | Thresholding | Supervised + pseudo-labeled cross-entropy | Effectively (small labeled seed) | KDE thresholding, semi-supervised heuristics |
| PFPO | Test cases | Pairwise DPO on test-passing solutions | Yes | Pseudo feedback via LLM/test-cases, self-consistency |
| DF-DPO | Synthetic edits | DPO margin on real vs. edited samples | No | Distribution-matched edit pipeline |
Most methods provide theoretical justification for their objectives—e.g., optimality of reward thresholding in the presence of sub-Gaussian reward distributions (SSPO), equivalence to Bayesian preference modeling (LeanPO, mDPO), or KL-constrained optimization (REFA).
6. Empirical Impact and Application Domains
Label-free preference optimization demonstrates substantial empirical improvements across a variety of settings:
- Video-LLMs: LeanPO yields relative accuracy improvements on Video-MME and surpasses DPO-based approaches in VideoChatGPT multi-turn dialogue evaluations; ablations confirm the benefits of self-reflection and dynamic label smoothing (Wang et al., 5 Jun 2025).
- Text Generation: REFA outperforms previous reference-free baselines (InfoNCA, SimPO) on AlpacaEval 2.0, improving both length-controlled and overall win rates (Gupta et al., 2024).
- Combinatorial Optimization: Preference Optimization converges faster and reduces the optimality gap versus classic RL on TSP, CVRP, and FFSP problems by modeling qualitative binary preferences derived from quantitative reward samples (Pan et al., 13 May 2025).
- LLM Prompt Optimization: Prompt Duel Optimizer (PDO) outperforms label-dependent baselines and achieves sample-efficient optimal prompt discovery through pairwise dueling-bandit acquisition strategies combined with local mutation (Wu et al., 14 Oct 2025).
- RL and Control: Reference-free methods such as DPPO and EM-based PMPO close the gap with reward-based RL, even in negative-only or mixed-only feedback regimes (An et al., 2023, Abdolmaleki et al., 2024).
7. Practical Considerations and Future Directions
Label-free preference optimization is characterized by reduced annotation overhead, the potential for infinite-scale pseudo-labeled preference data, and enhanced robustness against label noise or reference model drift. Operational considerations include:
- Efficient dataset generation pipelines (trusted augmentation, self-reflection) (Wang et al., 5 Jun 2025, Cheng et al., 11 Apr 2025).
- Careful tuning of thresholding, weighting, or regularization hyperparameters to avoid degenerate solutions (e.g., brevity collapse, reward hacking) (Gupta et al., 2024, Lee et al., 28 Oct 2025).
- Incorporation of domain knowledge or adaptive feedback (e.g., local search, learnable distortions) to improve data quality and model robustness (Pan et al., 13 May 2025, Cheng et al., 11 Apr 2025).
- Extension to new feedback modalities—e.g., self-consistency, test-case bootstrapping, or online reward/advantage thresholding to approach fully unsupervised preference alignment (Jiao et al., 2024, Lee et al., 28 Oct 2025, Abdolmaleki et al., 2024).
Ongoing challenges include generalization under adversarial or highly out-of-distribution feedback, scalability across arbitrary output domains, and unifying multi-modal or continual learning settings. Label-free preference optimization continues to evolve towards more sophisticated, theoretically justified, and practically robust frameworks for aligning modern ML systems with complex, high-dimensional, and noisy preference signals.