Soft Preference Learning

Updated 4 July 2026

Soft preference learning is a framework that infers behavior from graded and ambiguous feedback rather than fixed reward functions.
It employs probabilistic models like Bradley–Terry and Bayesian optimization to capture noisy, partial, and heterogeneous preference signals.
These methods enhance robust decision-making and scalable multi-objective policy optimization in applications such as robotics, recommendation, and interactive elicitation.

Soft preference learning denotes a family of methods that infer and optimize behavior from probabilistic, graded, ambiguous, heterogeneous, or partially observed preference feedback rather than from an exactly specified reward function or a purely hard label. In the cited literature, this includes pairwise and listwise comparison models, soft labels $p \in [0,1]$, positive-only and negative-only feedback, trajectory-conditioned mixtures of reward experts, soft attributes in recommendation, soft planning constraints in robotics, and soft-hard bounds in interactive multi-objective decision support (Mu et al., 18 Jul 2025, Abdolmaleki et al., 2024, Biyik et al., 2023, Narcomey et al., 2024).

1. Early probabilistic formulations

One of the earliest explicit formulations treated preference feedback as an online learning signal rather than as a scalar reward. In "Online Learning with Preference Feedback" (Shivaswamy et al., 2011), each round presents a context $x_t$, a structured object $y_t$, and an improved object $y_t'$ supplied by the user. The latent utility is linear, $U(x,y)=w_*^\top \phi(x,y)$, and the feedback assumption is $\alpha$-informative: $U(x_t, y_t') - U(x_t, y_t) = \alpha\,(U(x_t, y_t^*) - U(x_t, y_t)) - \xi_t$, with $\alpha \in (0,1]$, slack $\xi_t \ge 0$, and $y_t^*$ the utility-maximizing object for $x_t$. The Preference Perceptron update

$w_{t+1} = w_t + \phi(x_t, y_t') - \phi(x_t, y_t)$

admits the regret bound

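$\mathrm{REG}_T \le \frac{1}{\alpha T}\sum_{t=1}^{T}\xi_t + \frac{2R\,\lVert w_*\rVert}{\alpha\sqrt{T}},$

where $\mathrm{REG}_T = \frac{1}{T}\sum_{t=1}^{T}\big(U(x_t, y_t^*) - U(x_t, y_t)\big)$ is the average regret and $R$ bounds $\lVert \phi(x,y) \rVert$.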

Here, softness appears in the informativeness parameter $\alpha$ and the slack sequence $\xi_t$, rather than in an explicit reward model.
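The update rule above can be sketched in a few lines. This is a toy illustration, not the authors' code: the candidate feature vectors, the hidden `w_star`, and the simulated user (who always returns the best candidate, i.e. $\alpha = 1$ with zero slack) are all assumptions made for the example.

```python
from typing import List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def argmax_object(w: Sequence[float], candidates: Sequence[Sequence[float]]) -> int:
    """Index of the candidate with the highest utility under w (first one on ties)."""
    return max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))

def preference_perceptron_step(w: Sequence[float],
                               phi_presented: Sequence[float],
                               phi_improved: Sequence[float]) -> List[float]:
    """w <- w + phi(x, y') - phi(x, y)."""
    return [wi + fi - fp for wi, fp, fi in zip(w, phi_presented, phi_improved)]

# Hypothetical toy problem: one context, three candidate objects.
w_star = [1.0, -0.5]                                  # hidden user utility (assumed)
candidates = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]     # phi(x, y) for each y

w = [0.0, 0.0]
for _ in range(5):
    shown = argmax_object(w, candidates)              # learner's proposal y_t
    improved = argmax_object(w_star, candidates)      # simulated feedback y_t'
    w = preference_perceptron_step(w, candidates[shown], candidates[improved])

final_choice = argmax_object(w, candidates)
```

After the first round the learner already proposes the user's preferred object, so later updates add a zero vector and leave `w` unchanged.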

A complementary probabilistic treatment was developed for sequential Bayesian optimization with pairwise comparison (Ignatenko et al., 2021). There, the latent user preference function is parametric and unimodal, and binary responses arise from noisy latent utilities via a probit link. The agent maintains a Gaussian approximate posterior over transformed parameters and selects queries by maximizing the mutual information between the next response and the latent preference parameters. The associated normalized weighted KL-divergence, termed Remaining System Uncertainty (RSU), functions both as an acquisition objective and as a ground-truth-free performance metric.
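As a rough illustration of the two ingredients named above, the sketch below pairs a generic probit response model with a sample-based mutual-information score. It assumes independent Gaussian noise of standard deviation `noise_std` on each utility and a set of posterior samples supplied by the caller; the paper's specific parametric preference function and posterior approximation are not reproduced here.

```python
import math
from typing import Sequence

def std_normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_preference_prob(u_a: float, u_b: float, noise_std: float = 1.0) -> float:
    """P(a preferred to b) when both utilities carry independent Gaussian noise."""
    return std_normal_cdf((u_a - u_b) / (math.sqrt(2.0) * noise_std))

def binary_entropy(p: float) -> float:
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def mutual_information(sample_probs: Sequence[float]) -> float:
    """I(response; parameters) = H(mean p) - mean H(p), with one response
    probability per posterior sample of the preference parameters."""
    mean_p = sum(sample_probs) / len(sample_probs)
    mean_entropy = sum(binary_entropy(p) for p in sample_probs) / len(sample_probs)
    return binary_entropy(mean_p) - mean_entropy
```

A query on which all posterior samples agree scores zero, while a query on which samples confidently disagree scores close to $\log 2$, which is why maximizing this quantity targets informative comparisons.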

These formulations established two recurring themes that persist in later work: preference information is typically noisy and incomplete, and the learning problem is naturally posed in terms of predictive distributions rather than deterministic labels.

2. Probabilistic semantics of softness

The dominant formal language for soft preference learning is a probabilistic choice model over comparisons. In "PrefMoE" (Yuan et al., 1 May 2026), a preference dataset is written as $\mathcal{D}=\{(\sigma^0,\sigma^1,y)\}$, where $y \in \{0, 1, 0.5\}$ denotes preference for one segment, the other, or a tie. The Bradley–Terry likelihood is

$P_\psi[\sigma^1 \succ \sigma^0] = \frac{\exp \hat r_\psi(\sigma^1)}{\exp \hat r_\psi(\sigma^0) + \exp \hat r_\psi(\sigma^1)},$

where $\hat r_\psi(\sigma)$ is the learned reward of segment $\sigma$, and the reward model is trained with BT cross-entropy. Ties are modeled directly in the loss. This formulation treats ambiguity as part of the data-generating process rather than as annotation error to be removed.
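A minimal sketch of a Bradley–Terry cross-entropy that accepts ties is given below. The label encoding (1 for the second segment, 0 for the first, 0.5 for a tie) and the scalar segment rewards are assumptions of this example rather than details taken from the paper.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_cross_entropy(reward_0: float, reward_1: float, label: float) -> float:
    """Cross-entropy between a soft label and the Bradley-Terry prediction.

    label = 1.0: segment 1 preferred; 0.0: segment 0 preferred; 0.5: tie.
    """
    diff = reward_1 - reward_0
    log_p1 = log_sigmoid(diff)     # log P(segment 1 preferred)
    log_p0 = log_sigmoid(-diff)    # log P(segment 0 preferred)
    return -(label * log_p1 + (1.0 - label) * log_p0)
```

With a tie label the loss is smallest when the two segment rewards are equal, so ties pull the rewards together instead of being discarded.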

In preference-based multi-objective reinforcement learning, the same probabilistic structure is applied to trajectory segments under a weight vector over objectives. "Preference-based Multi-Objective Reinforcement Learning" formalizes preference labels over pairs of trajectory segments and uses a Bradley–Terry/Luce softmax likelihood over discounted multi-objective utilities. Indifference between the two segments is explicitly admitted and treated as a no-gradient label during reward-model learning (Mu et al., 18 Jul 2025).

Other formulations soften preferences at the policy objective rather than at the reward-model stage. PMPO models the probability of a successful outcome under the policy, derives an EM lower bound, and handles positive-only, negative-only, and mixed feedback within a single objective, thereby making softness a property of probabilistic outcome likelihoods and of the relative weighting of accept and reject signals (Abdolmaleki et al., 2024).

A different perspective is given by the win-rate analysis of preference learning. "Preference learning made easy: Everything should be understood through win rate" proves that the only grounded evaluator satisfying preference-consistency and prevalence-consistency is an $h$-win rate, and argues that soft, calibrated per-pair preference probabilities are the correct primitive signal for learning and evaluation (Zhang et al., 14 Feb 2025).
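To make the role of per-pair probabilities concrete, the sketch below averages a transformation of preference probabilities over sampled response pairs. The function name `win_rate`, the argument `h`, and the example probabilities are hypothetical, and the paper's exact definition is not reproduced here.

```python
from typing import Callable, Sequence

def win_rate(pair_probs: Sequence[float],
             h: Callable[[float], float] = lambda p: p) -> float:
    """Average of h(p) over response pairs, where p is the probability that the
    policy's response is preferred to the comparison response."""
    return sum(h(p) for p in pair_probs) / len(pair_probs)

# Hypothetical per-pair preference probabilities for three prompts.
probs = [0.9, 0.6, 0.3]
soft = win_rate(probs)                                        # uses calibrated probabilities
hard = win_rate(probs, h=lambda p: 1.0 if p > 0.5 else 0.0)   # thresholded wins
```

The soft and thresholded versions can differ noticeably on the same data, which is one reason the choice of per-pair signal matters for evaluation.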

Across these formulations, softness is not restricted to label noise. It includes ties, probabilistic confidence, partial supervision, prevalence-aware evaluation, and uncertainty retained in the predictive distribution.

3. Preference-derived rewards and control policies

A major branch of soft preference learning learns a reward or utility representation from comparisons and then optimizes a policy under that learned signal. In multi-objective RL, Pb-MORL defines a vector-valued reward $\mathbf{r}(s,a) \in \mathbb{R}^m$, a return vector

$\mathbf{G} = \sum_{t} \gamma^{t}\, \mathbf{r}(s_t, a_t),$

and scalarized utility $w^\top \mathbf{G}$ for a weight vector $w$. The framework learns a multi-objective reward model $\hat{\mathbf{r}}$ aligned with soft pairwise preferences, proves that weighted-optimal policies are Pareto-optimal, shows that varying $w$ recovers the convex frontier, and gives a unit-weight method for identifying non-convex Pareto-optimal policies (Mu et al., 18 Jul 2025). Empirically, Pb-MORL matches or exceeds an oracle that has access to the true reward function: it matches the oracle in expected utility on Deep Sea Treasure, matches expected utility and surpasses hypervolume on Fruit Tree, surpasses the oracle in expected utility on multi-energy management, and surpasses the oracle in both expected utility and hypervolume on the multi-lane highway task.
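The quantities in this paragraph can be illustrated with a small sketch: a discounted return vector, its linear scalarization, and a Bradley–Terry preference between two segments under a given weight vector. The function names and the toy segments are assumptions of the example, not the paper's implementation.

```python
import math
from typing import List, Sequence

def discounted_return_vector(rewards: Sequence[Sequence[float]], gamma: float) -> List[float]:
    """Sum of gamma^t * r_t, computed separately for each objective."""
    totals = [0.0] * len(rewards[0])
    for t, reward in enumerate(rewards):
        for i, value in enumerate(reward):
            totals[i] += (gamma ** t) * value
    return totals

def scalarized_utility(weights: Sequence[float], return_vector: Sequence[float]) -> float:
    return sum(w * g for w, g in zip(weights, return_vector))

def preference_probability(weights, segment_a, segment_b, gamma: float) -> float:
    """Bradley-Terry probability that segment_a is preferred under these weights."""
    u_a = scalarized_utility(weights, discounted_return_vector(segment_a, gamma))
    u_b = scalarized_utility(weights, discounted_return_vector(segment_b, gamma))
    return 1.0 / (1.0 + math.exp(u_b - u_a))

# Two hypothetical segments with two objectives each.
segment_a = [[1.0, 0.0], [0.0, 2.0]]
segment_b = [[2.0, 0.0], [0.0, 0.0]]
```

The same pair of segments can be preferred in either direction depending on the weight vector, which is why the labels are conditioned on the weights.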

"PrefMoE" addresses a different failure mode: heterogeneous and partially conflicting supervision. It decomposes segment reward into a convex mixture of $K$ expert rewards,

$\hat r(\sigma) = \sum_{k=1}^{K} g_k(\sigma)\, \hat r_k(\sigma), \qquad g_k(\sigma) \ge 0, \quad \sum_{k=1}^{K} g_k(\sigma) = 1,$

with trajectory-level soft routing and a load-balancing regularizer. Under 100-annotator noise on D4RL locomotion, the paper reports that the Gym-average improves from PrefMMT to PrefMoE, and it compares the fraction of baseline performance that each method retains under noise (Yuan et al., 1 May 2026).
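A convex mixture with soft routing can be sketched as a softmax gate over per-expert scores. The gate logits and expert rewards below are placeholders; how PrefMoE computes them, and the form of its load-balancing regularizer, are not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

def softmax(logits: Sequence[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_reward(gate_logits: Sequence[float],
                   expert_rewards: Sequence[float]) -> Tuple[float, List[float]]:
    """Convex combination of expert rewards for one segment.

    gate_logits: one routing score per expert for this trajectory segment.
    expert_rewards: each expert's reward for the same segment.
    """
    gates = softmax(gate_logits)
    reward = sum(g * r for g, r in zip(gates, expert_rewards))
    return reward, gates

reward, gates = mixture_reward([0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
```

Because the gates are non-negative and sum to one, the mixed reward always lies between the smallest and largest expert reward.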

PFM removes reward modeling altogether. "Preference Alignment with Flow Matching" learns a vector field that transports dispreferred samples toward preferred ones, starting from a reference policy and training with conditional flow matching rather than with explicit or implicit rewards. On a conditional MNIST preference task, the paper reports preference scores for PFM, RLHF fine-tuning, DPO fine-tuning, and iterative PFM (Kim et al., 2024). The same paper reports average normalized returns for PFM, DPO, and behavior cloning across 12 D4RL Gym-MuJoCo datasets.

PMPO extends this landscape by showing that soft preference learning need not require pairwise winners and losers. It supports positive-only, negative-only, and mixed data, with stability in negative-only regimes emerging from the KL term to the reference policy; on control tasks, the paper states the condition that was needed for stable negative-only learning, and on RGB Stacking it reports average rewards for BC alone, Reject+BC, and Accept+Reject+BC (Abdolmaleki et al., 2024).

4. Direct preference optimization with soft labels and diversity control

Another major branch optimizes policies directly from preference data. "Soft Preference Optimization" defines a pairwise preference probability under the model,

$P_\pi(y_1 \succ y_2 \mid x) = \frac{\pi(y_1 \mid x)^{\alpha}}{\pi(y_1 \mid x)^{\alpha} + \pi(y_2 \mid x)^{\alpha}},$

and a softness-parameterized loss $\mathcal{L}_\alpha$. Under a Bradley–Terry assumption with reward $r$ and asymptotically large data, minimizing this loss yields

$\pi(y \mid x) \propto \exp\!\big(r(x,y)/\alpha\big),$

with $\alpha$ controlling distributional softness and a separate KL regularizer anchoring the full output distribution to a reference model (Sharifnassab et al., 2024). In the reported TinyStories experiment, the paper compares the peak win rates of SPO and DPO against the reference.
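The link between a softness exponent and a softmax-of-rewards policy can be checked numerically. The sketch assumes the model's preference probability is a power ratio, $\pi(y_1 \mid x)^{\alpha} / (\pi(y_1 \mid x)^{\alpha} + \pi(y_2 \mid x)^{\alpha})$; this functional form and the names below are assumptions of the example, not a reproduction of the paper's code.

```python
import math

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def model_preference_prob(logp_1: float, logp_2: float, alpha: float) -> float:
    """pi(y1)^alpha / (pi(y1)^alpha + pi(y2)^alpha), computed from log-probabilities."""
    return sigmoid(alpha * (logp_1 - logp_2))

def bradley_terry_prob(reward_1: float, reward_2: float) -> float:
    return sigmoid(reward_1 - reward_2)

# If log pi(y) = r(y) / alpha + const, the model's preference probability
# coincides with the Bradley-Terry probability under r (the constant cancels).
alpha = 2.0
rewards = (1.0, 2.5)
log_probs = tuple(r / alpha for r in rewards)
```

Under this assumption, a larger `alpha` corresponds to a flatter policy for the same rewards, which is the sense in which the exponent controls softness.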

Geometric-averaged DPO extends soft labels to pairwise alignment. The proposed GDPO loss multiplies the internal DPO logit by $2\hat p - 1$, where $\hat p$ is a soft preference label, so that ambiguous pairs with $\hat p \approx 0.5$ contribute vanishing gradient. The paper reports that binary win versus GPT-4 improves from DPO to GDPO on both Anthropic Helpful and Plasma Plan (Furuta et al., 2024).
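The effect of scaling the DPO logit by a soft label can be seen in a short sketch. It assumes the scaling factor is $2\hat p - 1$ and uses the standard DPO logit $\beta\,[(\log \pi - \log \pi_{\mathrm{ref}})_{\text{chosen}} - (\log \pi - \log \pi_{\mathrm{ref}})_{\text{rejected}}]$; function names are hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_logit(logp_chosen: float, logp_rejected: float,
              ref_logp_chosen: float, ref_logp_rejected: float, beta: float) -> float:
    """beta times the difference of policy-vs-reference log-ratios."""
    return beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))

def dpo_loss(logit: float) -> float:
    return -log_sigmoid(logit)

def soft_label_dpo_loss(logit: float, soft_label: float) -> float:
    """DPO loss with the logit scaled by (2 * soft_label - 1) (assumed form)."""
    return -log_sigmoid((2.0 * soft_label - 1.0) * logit)
```

With `soft_label = 0.5` the loss is the constant $\log 2$ regardless of the logit, so such pairs produce no gradient; with `soft_label = 1.0` the loss reduces to ordinary DPO.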

SPL in "Diverse Preference Learning for Capabilities and Alignment" decouples the entropy and cross-entropy terms that are coupled inside the standard KL penalty. Its RL-style objective is

$\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big] + \alpha\, H\big(\pi(\cdot \mid x)\big) - \beta\, H\big(\pi(\cdot \mid x), \pi_{\mathrm{ref}}(\cdot \mid x)\big),$

where $H(\pi)$ is the entropy, $H(\pi, \pi_{\mathrm{ref}})$ is the cross-entropy, and $\alpha = \beta$ recovers the standard KL penalty, and the optimal policy satisfies

$\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)^{\beta/\alpha} \exp\!\big(r(x,y)/\alpha\big).$

The paper attributes diversity loss in KL-regularized RLHF and DPO to the exponentiation of majority preferences, and reports that SPL Pareto-dominates token-level temperature scaling on diversity-quality trade-offs while improving best-of-$N$ accuracy, calibration, and viewpoint diversity (Slocum et al., 29 Oct 2025).
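Decoupling the two terms can be illustrated on a small discrete response set. The sketch assumes the objective $\mathbb{E}_\pi[r] + \alpha H(\pi) - \beta H(\pi, \pi_{\mathrm{ref}})$, whose maximizer is $\pi^* \propto \pi_{\mathrm{ref}}^{\beta/\alpha} \exp(r/\alpha)$; the symbols $\alpha$, $\beta$ and the toy numbers are assumptions of this example.

```python
import math
from typing import List, Sequence

def decoupled_optimal_policy(ref_probs: Sequence[float], rewards: Sequence[float],
                             alpha: float, beta: float) -> List[float]:
    """pi*(y) proportional to pi_ref(y)^(beta/alpha) * exp(r(y)/alpha)."""
    log_weights = [(beta / alpha) * math.log(p) + r / alpha
                   for p, r in zip(ref_probs, rewards)]
    top = max(log_weights)
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)

ref = [0.7, 0.2, 0.1]        # hypothetical reference policy over three responses
rewards = [1.0, 0.0, 0.0]    # hypothetical rewards favoring the majority response

coupled = decoupled_optimal_policy(ref, rewards, alpha=1.0, beta=1.0)   # KL-regularized case
diverse = decoupled_optimal_policy(ref, rewards, alpha=2.0, beta=1.0)   # larger entropy weight
```

Raising the entropy weight while holding the cross-entropy weight fixed spreads probability mass across responses without changing the reward or the reference policy.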

PFP shifts softness from scalar preference probabilities to explicit distributions over preference features. It extracts five feature dimensions with five discrete sub-features each (Style, Tone, Harmlessness, Background knowledge, and Informativeness), trains per-dimension classifiers, and solves a distribution-preserving optimization with Sinkhorn-Knopp so that the empirical feature distribution in each online batch matches the seed distribution. On Mistral-7B, the paper reports AlpacaEval 2.0 length-controlled win rates for Iterative DPO, SPA, and PFP, and Anthropic-HHH harmlessness and honesty both rise across iterations (Kim et al., 6 Jun 2025).

Anchored soft-label formulations generalize this further. ADPO introduces soft teacher probabilities, reference-policy anchoring, and listwise Plackett–Luce objectives. In contextual bandits, the paper reports WinMass improvements over standard DPO, and it compares a KDE-smoothed listwise variant with standard DPO under heavy-tailed contamination (Zixian, 21 Oct 2025).
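For readers unfamiliar with the listwise model mentioned here, the sketch below evaluates a Plackett–Luce log-likelihood for a ranking given per-item scores. It shows only the generic model; ADPO's anchoring and soft teacher targets are not reproduced, and the scores are placeholders.

```python
import math
from itertools import permutations
from typing import Sequence

def log_sum_exp(values: Sequence[float]) -> float:
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def plackett_luce_log_likelihood(scores: Sequence[float], ranking: Sequence[int]) -> float:
    """Log-probability of a ranking (item indices, best first) under Plackett-Luce.

    At each position, the chosen item competes against all items not yet ranked.
    """
    total = 0.0
    for position, item in enumerate(ranking):
        remaining = [scores[i] for i in ranking[position:]]
        total += scores[item] - log_sum_exp(remaining)
    return total

scores = [0.2, 1.0, -0.5]    # hypothetical item scores
total_mass = sum(math.exp(plackett_luce_log_likelihood(scores, list(order)))
                 for order in permutations(range(len(scores))))
```

With two items the model reduces to Bradley–Terry, which is why listwise objectives are a natural extension of the pairwise losses discussed earlier in this section.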

5. Interactive, structured, and domain-specific formulations

Soft preference learning is also an elicitation problem. "Interactive Multi-Objective Probabilistic Preference Learning with Soft and Hard Bounds" introduces soft bounds as aspirational targets and hard bounds as strict feasibility limits. The Soft-Hard Utility Function assigns distinct utility levels at the soft bound, at the hard bound, and below the hard bound, while the method maintains Gaussian posteriors over the bounds and a posterior over scalarization weights. Active-MoSH couples posterior sampling with GP-UCB-style acquisition and submodular sparsification; T-MoSH adds a sensitivity-based search to expose overlooked high-value regions. In the user study on AI-generated image selection, Likert trust scores are reported as significantly higher for Active-T-MoSH than for all baselines, including full ranking (Chen et al., 27 Jun 2025).

Bayesian interactive elicitation appears in a different form in the Monte Carlo tree search framework of "Preference Construction" (Wang et al., 19 Mar 2025). Preferences are represented by an additive value function with Dirichlet prior and variational posterior, pairwise likelihood follows Bradley–Terry, and questioning is cast as a finite-horizon MDP optimized by MCTS for cumulative uncertainty reduction. The reparameterization trick (RT) reduces median gradient variance on the TC dataset relative to training without it, and the MCTS policy achieves the lowest uncertainty metrics across the evaluated instances.

In recommendation, softness often refers to semantic ambiguity. "Preference Elicitation with Soft Attributes in Interactive Recommendation" learns concept activation vectors (CAVs) for tags such as “funny,” “inspiring,” or “thought-provoking,” uses Bayesian uncertainty over them, and combines item queries with attribute critiques. On RecSim NG, the paper reports CAV test quality, the Spearman correlation between predicted and ground-truth tags, and gains in cosine and NDCG from modeling CAV uncertainty. On MovieLens 20M, it also reports an NDCG improvement (Biyik et al., 2023).

Vague and dynamic user feedback leads to another soft formulation. VPPL computes instantaneous preference scores and then applies time-aware decay, so that non-clicked options lose mass without being removed. On Yelp, the reported metric rises from MCMIPL to AVPPL; results on MovieLens are reported together with the average number of turns (Zhang et al., 2023).

Structured representations and planning semantics push softness beyond scalar reward even further. LRHP learns a vector representation from preference pairs using a special token, then uses those representations for preference data selection and continuous preference margin prediction; the paper reports the Spearman correlation for margin prediction and, on LLaMA-3-8B, compares reward-model accuracy on Helpful for PDS-LRHP and Vanilla (Wang et al., 2024). In robotics, "Learning Human Preferences Over Robot Behavior as Soft Planning Constraints" decomposes the preference space into components, treats task goals as hard constraints and user desires as soft PDDL preferences, and models noisy human comparisons probabilistically. The paper reports results for distribution supervision on noiseless data, and finds that on noisy data the learned models can exceed the Perfectly Rational baseline at matching noise levels (Narcomey et al., 2024).

6. Recurring debates, limitations, and open directions

A persistent debate concerns what preference learning should optimize. The win-rate framework argues that grounded evaluation must be a form of win rate and that common non-WRO methods such as DPO and SFT on preferred samples lack win rate–correspondence and win rate–consistency (Zhang et al., 14 Feb 2025). A related critique appears in GDPO, which attributes standard DPO’s failures partly to over-optimization on ambiguous pairs, and in SPL, which attributes diversity collapse to the KL penalty’s coupling of entropy and cross-entropy (Furuta et al., 2024, Slocum et al., 29 Oct 2025).

Another recurrent issue is heterogeneity. PrefMoE addresses disagreement and inconsistency through trajectory-level soft routing over multiple experts, while PFP addresses online feature-distribution drift through distribution preservation over interpretable preference features (Yuan et al., 1 May 2026, Kim et al., 6 Jun 2025). This suggests that scalar reward surrogates are often too narrow when annotator populations are diverse, partially conflicting, or distributionally shifting.

Scalability remains central. Pb-MORL notes that query complexity grows combinatorially with the number of objectives, Active-MoSH highlights computational overhead and cognitive demands, soft-attribute recommendation notes semantic subjectivity and tag sparsity, and sequential Bayesian optimization emphasizes model misspecification and higher-dimensional sampling cost (Mu et al., 18 Jul 2025, Chen et al., 27 Jun 2025, Biyik et al., 2023, Ignatenko et al., 2021). In many settings, softness improves robustness but does not remove the need for active querying, posterior approximation, or careful weight-space design.

A final limitation is semantic and normative. Soft labels, soft attributes, soft bounds, and soft planning constraints all preserve uncertainty, but they do not by themselves resolve which uncertainty should be preserved, whose preferences should count, or how conflicting groups should be represented. The cited works typically address this through probabilistic modeling, regularization, or explicit diversity control rather than through a single consensus scalar. Within the current literature, soft preference learning is therefore best understood not as one algorithmic template, but as a general design principle: preserve graded preference information as long as possible, and postpone hard commitment until the downstream optimization or decision rule explicitly requires it.