
Soft Preference Learning

Updated 4 July 2026
  • Soft preference learning is a framework that infers behavior from graded and ambiguous feedback rather than fixed reward functions.
  • It employs probabilistic models like Bradley–Terry and Bayesian optimization to capture noisy, partial, and heterogeneous preference signals.
  • These methods enhance robust decision-making and scalable multi-objective policy optimization in applications such as robotics, recommendation, and interactive elicitation.

Soft preference learning denotes a family of methods that infer and optimize behavior from probabilistic, graded, ambiguous, heterogeneous, or partially observed preference feedback rather than from an exactly specified reward function or a purely hard label. In the cited literature, this includes pairwise and listwise comparison models, soft labels $p \in [0,1]$, positive-only and negative-only feedback, trajectory-conditioned mixtures of reward experts, soft attributes in recommendation, soft planning constraints in robotics, and soft-hard bounds in interactive multi-objective decision support (Mu et al., 18 Jul 2025, Abdolmaleki et al., 2024, Biyik et al., 2023, Narcomey et al., 2024).

1. Early probabilistic formulations

One of the earliest explicit formulations treated preference feedback as an online learning signal rather than as a scalar reward. In "Online Learning with Preference Feedback" (Shivaswamy et al., 2011), each round presents a context $x_t$, a structured object $y_t$, and an improved object $y_t'$ supplied by the user. The latent utility is linear, $U(x,y)=w_*^\top \phi(x,y)$, and the feedback assumption is $\alpha$-informative: $U(x_t, y_t') - U(x_t, y_t) = \alpha \bigl(U(x_t, y_t^*) - U(x_t, y_t)\bigr) - \xi_t$, where $y_t^*$ is the utility-maximizing object for $x_t$, $\alpha \in (0,1]$, and the slack satisfies $\xi_t \ge 0$. The Preference Perceptron update

$$w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)$$

admits the regret bound

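$$\frac{1}{T}\sum_{t=1}^{T}\bigl(U(x_t, y_t^*) - U(x_t, y_t)\bigr) \le \frac{1}{\alpha T}\sum_{t=1}^{T}\xi_t + \frac{2R\,\|w_*\|}{\alpha\sqrt{T}},$$

where $R$ is an upper bound on the feature norm $\|\phi(x,y)\|$.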

Here, softness appears in the informativeness parameter $\alpha$ and the slack sequence $\xi_t$, rather than in an explicit reward model.
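The update rule can be illustrated with a small sketch. It assumes a single fixed context, so the joint feature map reduces to one feature vector per candidate, and a simulated user who always returns the best candidate under a hidden weight vector (the strongest form of feedback, with zero slack). The names `best_index`, `preference_perceptron`, `W_STAR`, and `FEATS` are illustrative and do not come from the cited paper.

```python
from typing import Callable, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def best_index(w: Vector, feats: List[Vector]) -> int:
    """Index of the candidate with the highest linear utility under w."""
    return max(range(len(feats)), key=lambda i: dot(w, feats[i]))


def preference_perceptron(
    feats: List[Vector],
    feedback: Callable[[int], int],
    rounds: int,
) -> Vector:
    """Present the current best candidate, receive an improved one, update w."""
    w = [0.0] * len(feats[0])
    for _ in range(rounds):
        y = best_index(w, feats)        # object presented to the user
        y_improved = feedback(y)        # user's improved object
        # w <- w + phi(improved) - phi(presented)
        w = [wi + a - b for wi, a, b in zip(w, feats[y_improved], feats[y])]
    return w


# Toy run: the hidden utility (an assumption) prefers candidate 3.
W_STAR = [1.0, -0.5]
FEATS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
w_hat = preference_perceptron(
    FEATS, lambda y: best_index(W_STAR, FEATS), rounds=5
)
print(w_hat, best_index(w_hat, FEATS))
```

In this toy run the learned weights differ from the hidden ones but induce the same top choice, which is all the regret criterion measures.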

A complementary probabilistic treatment was developed for sequential Bayesian optimization with pairwise comparison (Ignatenko et al., 2021). There, the latent user preference function is parametric and unimodal,

xtx_t3

and binary responses arise from noisy latent utilities via a probit link: xtx_t4 The agent maintains a Gaussian approximate posterior over transformed parameters and selects queries by maximizing the mutual information between the next response and the latent preference parameters. The associated normalized weighted KL-divergence, termed Remaining System Uncertainty (RSU), functions both as an acquisition objective and as a ground-truth-free performance metric.
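A probit response model of this general kind can be sketched as follows. The parameterization is an assumption chosen for illustration (independent Gaussian noise with standard deviation `noise_std` on each latent utility) and is not claimed to match the cited paper's exact form.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_preference(u_a: float, u_b: float, noise_std: float = 1.0) -> float:
    """P(a is preferred to b) when both utilities carry independent
    N(0, noise_std^2) noise, so their difference has variance 2 * noise_std^2."""
    return std_normal_cdf((u_a - u_b) / (math.sqrt(2.0) * noise_std))


p_clear = probit_preference(1.0, 0.0, noise_std=0.5)
p_noisy = probit_preference(1.0, 0.0, noise_std=5.0)
print(p_clear, p_noisy)
```

Larger noise pushes the response probability toward 0.5, which is why queries with well-separated expected utilities carry more information per answer.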

These formulations established two recurring themes that persist in later work: preference information is typically noisy and incomplete, and the learning problem is naturally posed in terms of predictive distributions rather than deterministic labels.

2. Probabilistic semantics of softness

The dominant formal language for soft preference learning is a probabilistic choice model over comparisons. In "PrefMoE" (Yuan et al., 1 May 2026), a preference dataset is written as xtx_t5, where xtx_t6 denotes preference for one segment, the other, or a tie. The Bradley–Terry likelihood is

xtx_t7

and the reward model is trained with BT cross-entropy. Ties are modeled directly in the loss. This formulation treats ambiguity as part of the data-generating process rather than as annotation error to be removed.
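A minimal sketch of a Bradley–Terry cross-entropy with ties is given below. The label encoding (1.0 when the second segment is preferred, 0.0 when the first is preferred, 0.5 for a tie) and the use of summed per-step rewards as segment scores are assumptions made for illustration, not details confirmed by the cited paper.

```python
import math
from typing import Sequence


def bt_prob(score0: float, score1: float) -> float:
    """P(segment 1 preferred over segment 0) under Bradley-Terry."""
    m = max(score0, score1)                      # subtract max for stability
    e0, e1 = math.exp(score0 - m), math.exp(score1 - m)
    return e1 / (e0 + e1)


def bt_cross_entropy(
    rewards0: Sequence[float],
    rewards1: Sequence[float],
    label: float,
) -> float:
    """Cross-entropy between a soft label and the BT prediction.

    label = 1.0: segment 1 preferred; 0.0: segment 0 preferred; 0.5: tie.
    """
    p1 = bt_prob(sum(rewards0), sum(rewards1))
    eps = 1e-12                                  # guards log(0)
    return -(label * math.log(p1 + eps) + (1.0 - label) * math.log(1.0 - p1 + eps))


tie_equal = bt_cross_entropy([0.5, 0.5], [1.0, 0.0], label=0.5)
tie_unequal = bt_cross_entropy([0.0, 0.0], [1.0, 1.0], label=0.5)
print(tie_equal, tie_unequal)
```

With a tie label, the loss is smallest when the two segments receive equal predicted return, so ties pull the reward model toward indifference instead of being discarded.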

In preference-based multi-objective reinforcement learning, the same probabilistic structure is applied to trajectory segments under a weight vector xtx_t8. "Preference-based Multi-Objective Reinforcement Learning" formalizes labels xtx_t9 over segment pairs yty_t0, and uses a Bradley–Terry/Luce softmax likelihood over discounted multi-objective utilities. Indifference yty_t1 is explicitly admitted and treated as a no-gradient label during reward-model learning (Mu et al., 18 Jul 2025).

Other formulations soften preferences at the policy objective rather than at the reward-model stage. PMPO models the probability of success as

yty_t2

derives an EM lower bound, and separates positive-only, negative-only, and mixed feedback through

yty_t3

thereby making softness a property of probabilistic outcome likelihoods and of the relative weighting of accept and reject signals (Abdolmaleki et al., 2024).
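The interplay between accept and reject weighting and a KL anchor can be illustrated on a categorical policy. The objective below is a hypothetical form written for this sketch (a weighted accept log-likelihood, minus a weighted reject log-likelihood, minus a KL penalty from a reference policy); the weights `alpha` and `beta` and the function names are assumptions, and the paper's own objective may differ.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def accept_reject_objective(
    logits: Sequence[float],
    ref_probs: Sequence[float],
    accepted: Sequence[int],
    rejected: Sequence[int],
    alpha: float,
    beta: float,
) -> float:
    """alpha * mean log pi(accepted) - (1 - alpha) * mean log pi(rejected)
    - beta * KL(ref || pi), for a categorical policy pi = softmax(logits)."""
    log_pi = log_softmax(logits)
    acc = sum(log_pi[i] for i in accepted) / len(accepted) if accepted else 0.0
    rej = sum(log_pi[i] for i in rejected) / len(rejected) if rejected else 0.0
    kl = sum(q * (math.log(q) - lp) for q, lp in zip(ref_probs, log_pi) if q > 0.0)
    return alpha * acc - (1.0 - alpha) * rej - beta * kl


REF = [0.5, 0.5]
mild, extreme = [1.0, 0.0], [10.0, 0.0]   # both move mass away from action 1

# Negative-only feedback (alpha = 0): action 1 is rejected.
for beta in (0.0, 4.0):
    print(beta,
          accept_reject_objective(mild, REF, [], [1], 0.0, beta),
          accept_reject_objective(extreme, REF, [], [1], 0.0, beta))
```

In this toy case, without the KL term the negative-only objective keeps rewarding ever more extreme logits, whereas a sufficiently large `beta` makes the mild policy score higher, which mirrors the stabilizing role attributed to the KL term.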

A different perspective is given by the win-rate analysis of preference learning. "Preference learning made easy: Everything should be understood through win rate" proves that the only grounded evaluator satisfying preference-consistency and prevalence-consistency is an $h$-win rate,

yty_t5

and argues that soft, calibrated per-pair preference probabilities are the correct primitive signal for learning and evaluation (Zhang et al., 14 Feb 2025).
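The difference between a soft and a hard win rate can be seen in a small sketch. It assumes a list of per-pair probabilities that the model's response is preferred to a competitor's; the function names and the example values are illustrative only.

```python
from typing import Sequence


def soft_win_rate(pref_probs: Sequence[float]) -> float:
    """Average per-pair probability that the model's response is preferred."""
    return sum(pref_probs) / len(pref_probs)


def hard_win_rate(pref_probs: Sequence[float]) -> float:
    """Win rate after thresholding each pair; exact ties count as half a win."""
    wins = sum(1.0 if p > 0.5 else 0.5 if p == 0.5 else 0.0 for p in pref_probs)
    return wins / len(pref_probs)


# Two narrow wins and one clear loss (made-up numbers).
probs = [0.6, 0.6, 0.1]
print(soft_win_rate(probs), hard_win_rate(probs))
```

Thresholding reports the model as winning two thirds of the time, while the calibrated probabilities put its win rate below one half; the soft signal retains the margin information that hard labels discard.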

Across these formulations, softness is not restricted to label noise. It includes ties, probabilistic confidence, partial supervision, prevalence-aware evaluation, and uncertainty retained in the predictive distribution.

3. Preference-derived rewards and control policies

A major branch of soft preference learning learns a reward or utility representation from comparisons and then optimizes a policy under that learned signal. In multi-objective RL, Pb-MORL defines a vector-valued reward yty_t6, a return vector

yty_t7

and scalarized utility yty_t8. The framework learns a multi-objective reward model yty_t9 aligned with soft pairwise preferences, proves that weighted-optimal policies are Pareto-optimal, shows that varying yty_t'0 recovers the convex frontier, and gives a unit-weight method for identifying non-convex Pareto-optimal policies (Mu et al., 18 Jul 2025). Empirically, Pb-MORL matches or exceeds an oracle that has access to the true reward function: it matches the oracle in expected utility on Deep Sea Treasure, matches expected utility and surpasses hypervolume on Fruit Tree, surpasses the oracle in expected utility on multi-energy management, and surpasses the oracle in both expected utility and hypervolume on the multi-lane highway task.
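The scalarization step can be sketched in a few lines. The sketch assumes a discounted sum of per-step reward vectors and a linear weighting; the names `discounted_return_vector` and `scalarized_utility` and the example numbers are hypothetical.

```python
from typing import List, Sequence


def discounted_return_vector(
    rewards: Sequence[Sequence[float]], gamma: float
) -> List[float]:
    """Sum of gamma^t * r_t over a trajectory of vector-valued rewards."""
    dim = len(rewards[0])
    total = [0.0] * dim
    for t, r in enumerate(rewards):
        for i in range(dim):
            total[i] += (gamma ** t) * r[i]
    return total


def scalarized_utility(
    weights: Sequence[float], rewards: Sequence[Sequence[float]], gamma: float
) -> float:
    """Weighted sum of the discounted multi-objective return."""
    g = discounted_return_vector(rewards, gamma)
    return sum(w * gi for w, gi in zip(weights, g))


trajectory = [[1.0, 0.0], [0.0, 2.0]]   # two steps, two objectives
print(discounted_return_vector(trajectory, gamma=0.5))
print(scalarized_utility([0.25, 0.75], trajectory, gamma=0.5))
```

Sweeping the weight vector over the simplex and optimizing each scalarized utility traces out policies on the convex part of the Pareto frontier, which is why non-convex regions need a separate treatment.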

"PrefMoE" addresses a different failure mode: heterogeneous and partially conflicting supervision. It decomposes segment reward into a convex mixture of yty_t'1 expert rewards,

yty_t'2

with trajectory-level soft routing and a load-balancing regularizer. Under 100-annotator noise on D4RL locomotion, the Gym-average improves from yty_t'3 for PrefMMT to yty_t'4 for PrefMoE, and the paper reports that at yty_t'5 PrefMoE retains yty_t'6 of baseline performance whereas PrefMMT retains yty_t'7 (Yuan et al., 1 May 2026).
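A convex mixture with soft routing can be sketched as below. The gate is a softmax over per-trajectory routing logits, and the load-balancing penalty shown (squared deviation of the mean gate from uniform) is a stand-in chosen for this example, since the paper's regularizer is not reproduced here. All names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def mixture_reward(
    expert_rewards: Sequence[float], routing_logits: Sequence[float]
) -> Tuple[float, List[float]]:
    """Convex combination of expert rewards with softmax gate weights."""
    gates = softmax(routing_logits)
    return sum(g * r for g, r in zip(gates, expert_rewards)), gates


def load_balance_penalty(gate_batch: Sequence[Sequence[float]]) -> float:
    """Assumed regularizer: squared gap between mean gate usage and uniform."""
    k = len(gate_batch[0])
    mean_gate = [sum(g[i] for g in gate_batch) / len(gate_batch) for i in range(k)]
    return sum((m - 1.0 / k) ** 2 for m in mean_gate)


reward, gates = mixture_reward([1.0, 3.0], [0.0, 0.0])
print(reward, gates, load_balance_penalty([gates, softmax([2.0, 0.0])]))
```

Because the gates are non-negative and sum to one, the mixed reward always lies between the smallest and largest expert reward, and the penalty discourages routing every trajectory to the same expert.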

PFM removes reward modeling altogether. "Preference Alignment with Flow Matching" learns a vector field that transports dispreferred samples toward preferred ones, starting from a reference policy and training with conditional flow matching rather than with explicit or implicit rewards. On a conditional MNIST preference task, the reported preference scores are yty_t'8 for PFM, yty_t'9 for RLHF fine-tuning, U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)0 for DPO fine-tuning, and U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)1 for iterative PFM (Kim et al., 2024). The same paper reports average normalized returns U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)2 for PFM, U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)3 for DPO, and U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)4 for behavior cloning across 12 D4RL Gym-MuJoCo datasets.

PMPO extends this landscape by showing that soft preference learning need not require pairwise winners and losers. It supports positive-only, negative-only, and mixed data, with stability in negative-only regimes emerging from the KL term to U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)5; on control tasks, the paper states that U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)6 was needed for stable negative-only learning, and on RGB Stacking the reported average rewards are U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)7 for BC alone, U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)8 for Reject+BC, and U(x,y)=wϕ(x,y)U(x,y)=w_*^\top \phi(x,y)9 for Accept+Reject+BC (Abdolmaleki et al., 2024).

4. Direct preference optimization with soft labels and diversity control

Another major branch optimizes policies directly from preference data. "Soft Preference Optimization" defines a pairwise preference probability under the model,

α\alpha0

and a softness-parameterized loss α\alpha1. Under a Bradley–Terry assumption and asymptotically large data, minimizing this loss yields

α\alpha2

with α\alpha3 controlling distributional softness and a separate KL regularizer anchoring the full output distribution to a reference model (Sharifnassab et al., 2024). In the reported TinyStories experiment, SPO reaches a peak win rate of approximately α\alpha4 against the reference, while DPO peaks at approximately α\alpha5.

Geometric-averaged DPO extends soft labels to pairwise alignment. The proposed GDPO loss multiplies the internal DPO logit by α\alpha6, where α\alpha7 is a soft preference label, so that ambiguous pairs contribute vanishing gradient. The paper reports that on Anthropic Helpful, binary win versus GPT-4 improves from α\alpha8 for DPO to α\alpha9 for GDPO, and on Plasma Plan from (U(xt,yt)U(xt,yt))=α(U(xt,yt)U(xt,yt))ξt,(U(x_t, y_t') - U(x_t, y_t)) = \alpha (U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t,0 to (U(xt,yt)U(xt,yt))=α(U(xt,yt)U(xt,yt))ξt,(U(x_t, y_t') - U(x_t, y_t)) = \alpha (U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t,1 (Furuta et al., 2024).
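The effect of scaling the DPO logit by a soft label can be sketched numerically. The scale factor used here, `2 * soft_label - 1`, is an assumption chosen because it equals 1 for a fully confident label and 0 for a perfectly ambiguous one, matching the behavior described above; the function names are hypothetical.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_logit(
    logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float, beta: float
) -> float:
    """beta * (log-ratio of the preferred response - log-ratio of the other)."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def soft_label_dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float,
    soft_label: float,
) -> float:
    """DPO loss with the logit scaled by (2 * soft_label - 1) (assumed form)."""
    scale = 2.0 * soft_label - 1.0
    return neg_log_sigmoid(scale * dpo_logit(logp_w, logp_l, ref_logp_w, ref_logp_l, beta))


args = (-1.0, -3.0, -2.0, -2.0, 0.1)
for p in (1.0, 0.8, 0.5):
    print(p, soft_label_dpo_loss(*args, soft_label=p))
```

At `soft_label = 0.5` the loss is the constant log 2 whatever the policy log-probabilities are, so such pairs contribute no gradient; at `soft_label = 1.0` the expression reduces to the standard DPO loss.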

SPL in "Diverse Preference Learning for Capabilities and Alignment" decouples the entropy and cross-entropy terms that are coupled inside the standard KL penalty. Its RL-style objective is

(U(xt,yt)U(xt,yt))=α(U(xt,yt)U(xt,yt))ξt,(U(x_t, y_t') - U(x_t, y_t)) = \alpha (U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t,2

and the optimal policy satisfies

(U(xt,yt)U(xt,yt))=α(U(xt,yt)U(xt,yt))ξt,(U(x_t, y_t') - U(x_t, y_t)) = \alpha (U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t,3

The paper attributes diversity loss in KL-regularized RLHF and DPO to the exponentiation of majority preferences, and reports that SPL Pareto-dominates token-level temperature scaling on diversity-quality trade-offs while improving best-of-$N$ accuracy, calibration, and viewpoint diversity (Slocum et al., 29 Oct 2025).
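The decoupling can be made concrete for a categorical policy. The sketch assumes an objective of the form expected reward plus `alpha` times the policy entropy minus `beta` times the cross-entropy to the reference; under that assumption the maximizer is proportional to the reference raised to the power `beta / alpha` times `exp(reward / alpha)`. The symbols `alpha` and `beta` and the function names are introduced here for illustration and may not match the paper's notation.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decoupled_optimum(
    rewards: Sequence[float], ref_probs: Sequence[float], alpha: float, beta: float
) -> List[float]:
    """argmax of E[r] + alpha * H(pi) + beta * E_pi[log ref] over categorical pi."""
    return softmax([(r + beta * math.log(q)) / alpha for r, q in zip(rewards, ref_probs)])


rewards = [1.0, 0.0, 0.0]
ref = [0.5, 0.3, 0.2]
coupled = decoupled_optimum(rewards, ref, alpha=1.0, beta=1.0)   # standard KL case
diverse = decoupled_optimum(rewards, ref, alpha=4.0, beta=1.0)   # more entropy
print(coupled, entropy(coupled))
print(diverse, entropy(diverse))
```

Setting `alpha = beta` recovers the familiar KL-regularized solution, while raising `alpha` alone increases entropy without changing how strongly the reference is weighted relative to the reward.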

PFP shifts softness from scalar preference probabilities to explicit distributions over preference features. It extracts five feature dimensions with five discrete sub-features each—Style, Tone, Harmlessness, Background knowledge, and Informativeness—trains per-dimension classifiers, and solves a distribution-preserving optimization with Sinkhorn-Knopp so that the empirical feature distribution in each online batch matches the seed distribution. On Mistral-7B, the reported AlpacaEval 2.0 length-controlled win rates are (U(xt,yt)U(xt,yt))=α(U(xt,yt)U(xt,yt))ξt,(U(x_t, y_t') - U(x_t, y_t)) = \alpha (U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t,5 for Iterative DPO, (U(xt,yt)U(xt,yt))=α(U(xt,yt)U(xt,yt))ξt,(U(x_t, y_t') - U(x_t, y_t)) = \alpha (U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t,6 for SPA, and (U(xt,yt)U(xt,yt))=α(U(xt,yt)U(xt,yt))ξt,(U(x_t, y_t') - U(x_t, y_t)) = \alpha (U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t,7 for PFP; Anthropic-HHH harmlessness rises from (U(xt,yt)U(xt,yt))=α(U(xt,yt)U(xt,yt))ξt,(U(x_t, y_t') - U(x_t, y_t)) = \alpha (U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t,8 to (U(xt,yt)U(xt,yt))=α(U(xt,yt)U(xt,yt))ξt,(U(x_t, y_t') - U(x_t, y_t)) = \alpha (U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t,9 and honestness from α(0,1]\alpha \in (0,1]0 to α(0,1]\alpha \in (0,1]1 across iterations (Kim et al., 6 Jun 2025).
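The distribution-matching step can be illustrated with a plain Sinkhorn-Knopp iteration. The sketch assumes a strictly positive matrix of classifier probabilities (rows are samples, columns are values of one preference feature) and rescales it so that each sample keeps equal mass while the column totals match a target seed distribution. The matrix values and names are made up for this example and are not taken from the paper.

```python
from typing import List, Sequence


def sinkhorn_knopp(
    kernel: Sequence[Sequence[float]],
    row_marginals: Sequence[float],
    col_marginals: Sequence[float],
    iters: int = 200,
) -> List[List[float]]:
    """Alternately rescale rows and columns of a positive matrix until its
    row and column sums match the requested marginals."""
    n, m = len(kernel), len(kernel[0])
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [row_marginals[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marginals[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Three samples, one binary feature; the classifier leans toward value 0.
classifier_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
plan = sinkhorn_knopp(classifier_probs, [1 / 3, 1 / 3, 1 / 3], [0.5, 0.5])
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(2)]
print(row_sums, col_sums)
```

The resulting soft assignment stays close to the classifier's beliefs while forcing the batch-level feature distribution back to the target, which is the sense in which the distribution is preserved.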

Anchored soft-label formulations generalize this further. ADPO introduces soft teacher probabilities α(0,1]\alpha \in (0,1]2, reference-policy anchoring, and listwise Plackett–Luce objectives. In contextual bandits, the paper reports WinMass improvements of α(0,1]\alpha \in (0,1]3 over standard DPO, and under heavy-tailed contamination a KDE-smoothed listwise variant reaches α(0,1]\alpha \in (0,1]4 versus α(0,1]\alpha \in (0,1]5 for standard DPO (Zixian, 21 Oct 2025).
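For reference, a listwise Plackett–Luce log-likelihood can be sketched as follows. This is the generic textbook form over item scores; how ADPO combines it with soft teacher probabilities and anchoring is not reproduced, and the function names are illustrative.

```python
import math
from typing import Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def plackett_luce_log_likelihood(
    scores: Sequence[float], ranking: Sequence[int]
) -> float:
    """Log-probability of a full ranking (best first): at each position the
    next item is chosen by a softmax over the items not yet placed."""
    total = 0.0
    for k, item in enumerate(ranking):
        remaining = [scores[j] for j in ranking[k:]]
        total += scores[item] - log_sum_exp(remaining)
    return total


scores = [2.0, 0.5, -1.0]
print(plackett_luce_log_likelihood(scores, [0, 1, 2]))   # agrees with the scores
print(plackett_luce_log_likelihood(scores, [2, 1, 0]))   # reversed ranking
```

With two items the expression reduces to the Bradley–Terry log-likelihood, so pairwise objectives are the length-two special case of the listwise one.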

5. Interactive, structured, and domain-specific formulations

Soft preference learning is also an elicitation problem. "Interactive Multi-Objective Probabilistic Preference Learning with Soft and Hard Bounds" introduces soft bounds α(0,1]\alpha \in (0,1]6 as aspirational targets and hard bounds α(0,1]\alpha \in (0,1]7 as strict feasibility limits. The Soft-Hard Utility Function assigns utility α(0,1]\alpha \in (0,1]8 at the soft bound, α(0,1]\alpha \in (0,1]9 at the hard bound, and ξt0\xi_t \ge 00 below the hard bound, while maintaining Gaussian posteriors over ξt0\xi_t \ge 01 and ξt0\xi_t \ge 02 and a posterior over scalarization weights ξt0\xi_t \ge 03. Active-MoSH couples posterior sampling with GP-UCB-style acquisition and submodular sparsification; T-MoSH adds a sensitivity-based search to expose overlooked high-value regions. In the user study on AI-generated image selection, Likert trust scores are reported as significantly higher for Active-T-MoSH than all baselines, for example mean ξt0\xi_t \ge 04 versus full ranking mean ξt0\xi_t \ge 05 (Chen et al., 27 Jun 2025).

Bayesian interactive elicitation appears in a different form in the Monte Carlo tree search framework of "Preference Construction" (Wang et al., 19 Mar 2025). Preferences are represented by an additive value function ξt0\xi_t \ge 06 with Dirichlet prior and variational posterior, pairwise likelihood follows Bradley–Terry, and questioning is cast as a finite-horizon MDP optimized by MCTS for cumulative uncertainty reduction. The reparameterization trick reduces median gradient variance on the TC dataset from ξt0\xi_t \ge 07 without RT to ξt0\xi_t \ge 08 with RT, and the MCTS policy achieves the lowest uncertainty metrics across ξt0\xi_t \ge 09 instances.

In recommendation, softness often refers to semantic ambiguity. "Preference Elicitation with Soft Attributes in Interactive Recommendation" learns concept activation vectors wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)0 for tags such as “funny,” “inspiring,” or “thought-provoking,” uses Bayesian uncertainty wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)1, and combines item queries with attribute critiques. On RecSim NG, the paper reports CAV test quality wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)2, Spearman correlation approximately wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)3 between predicted and ground-truth tags, and wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)4 gains in cosine and NDCG from modeling CAV uncertainty. On MovieLens 20M, accounting for wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)5 improves NDCG by wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)6 (Biyik et al., 2023).

Vague and dynamic user feedback leads to another soft formulation. VPPL computes instantaneous scores

wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)7

then applies time-aware decay

wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)8

so that non-clicked options lose mass without being removed. On Yelp, the reported wt+1=wt+ϕ(xt,yt)ϕ(xt,yt)w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)9 rises from xtx_t00 for MCMIPL to xtx_t01 for AVPPL; on MovieLens, xtx_t02 reaches xtx_t03 with average turns xtx_t04 (Zhang et al., 2023).

Structured representations and planning semantics push softness beyond scalar reward even further. LRHP learns a vector representation xtx_t05 from preference pairs using a special token xtx_t06, then uses those representations for preference data selection and continuous preference margin prediction; the reported Spearman correlation for margin prediction is xtx_t07, and on LLaMA-3-8B PDS-LRHP reaches reward-model accuracy xtx_t08 on Helpful versus xtx_t09 for Vanilla (Wang et al., 2024). In robotics, "Learning Human Preferences Over Robot Behavior as Soft Planning Constraints" decomposes the preference space as xtx_t10, treats task goals as hard constraints and user desires as soft PDDL preferences, and models noisy human comparisons by

xtx_t11

The paper reports that, on noiseless data, distribution supervision reaches approximately xtx_t12 and xtx_t13, while on noisy data the learned models can exceed the Perfectly Rational baseline at matching noise levels (Narcomey et al., 2024).

6. Recurring debates, limitations, and open directions

A persistent debate concerns what preference learning should optimize. The win-rate framework argues that grounded evaluation must be a form of win rate and that common non-WRO methods such as DPO and SFT on preferred samples lack win rate–correspondence and win rate–consistency (Zhang et al., 14 Feb 2025). A related critique appears in GDPO, which attributes standard DPO’s failures partly to over-optimization on ambiguous pairs, and in SPL, which attributes diversity collapse to the KL penalty’s coupling of entropy and cross-entropy (Furuta et al., 2024, Slocum et al., 29 Oct 2025).

Another recurrent issue is heterogeneity. PrefMoE addresses disagreement and inconsistency through trajectory-level soft routing over multiple experts, while PFP addresses online feature-distribution drift through distribution preservation over interpretable preference features (Yuan et al., 1 May 2026, Kim et al., 6 Jun 2025). This suggests that scalar reward surrogates are often too narrow when annotator populations are diverse, partially conflicting, or distributionally shifting.

Scalability remains central. Pb-MORL notes that query complexity grows combinatorially with the number of objectives xtx_t14, Active-MoSH highlights computational overhead and cognitive demands, soft-attribute recommendation notes semantic subjectivity and tag sparsity, and sequential Bayesian optimization emphasizes model misspecification and higher-dimensional sampling cost (Mu et al., 18 Jul 2025, Chen et al., 27 Jun 2025, Biyik et al., 2023, Ignatenko et al., 2021). In many settings, softness improves robustness but does not remove the need for active querying, posterior approximation, or careful weight-space design.

A final limitation is semantic and normative. Soft labels, soft attributes, soft bounds, and soft planning constraints all preserve uncertainty, but they do not by themselves resolve which uncertainty should be preserved, whose preferences should count, or how conflicting groups should be represented. The cited works typically address this through probabilistic modeling, regularization, or explicit diversity control rather than through a single consensus scalar. Within the current literature, soft preference learning is therefore best understood not as one algorithmic template, but as a general design principle: preserve graded preference information as long as possible, and postpone hard commitment until the downstream optimization or decision rule explicitly requires it.
