Selective Preference Optimization (SePO)

Updated 25 December 2025
  • Selective Preference Optimization (SePO) is a framework that refines LLM alignment by focusing on the most informative training signals via prompt, token, and sample selection.
  • It employs methods like PVar-guided prompt selection, selective token-level optimization, and adaptive sample weighting to accelerate convergence and enhance performance.
  • Empirical studies show SePO reduces computational overhead while improving post-alignment effectiveness, achieving faster training and higher accuracy across benchmarks.

Selective Preference Optimization (SePO) denotes a family of methodologies for aligning LLMs by selectively focusing optimization on the most informative, high-leverage elements of the training signal during preference-based fine-tuning. SePO covers prompt-level, token-level, and sample-weighting approaches, all designed to improve efficiency, effectiveness, and compute utilization when optimizing LLMs with human or reward-model preferences. Across these methods, the central paradigm is to downweight or omit uninformative samples, tokens, or pairs, which has been empirically shown to accelerate training and improve post-alignment performance across a range of LLMs and benchmarks (Guo et al., 14 Oct 2025, Dong, 10 Jul 2025, Yang et al., 24 Aug 2024, Ma et al., 30 Dec 2024).

1. Motivation and Core Concepts

Direct Preference Optimization (DPO) and related paradigms have established the Bradley–Terry style loss over pairs of responses as a principal mechanism for LLM alignment. However, not all elements within preference data contribute equally to gradient signal or model improvement. Standard DPO typically uses uniform weighting across prompts, preference pairs, or tokens, regardless of informativeness.

SePO was introduced to address three main inefficiencies:

  • Prompt-level redundancy: Many prompts yield low-variance (uninformative) comparisons due to consistent model behavior or reward model judgments.
  • Token-level redundancy: The majority of sequence tokens offer weak alignment signal, with key semantic or factual tokens driving human preference.
  • Difficulty-blindness: Uniform treatment of “easy” and “hard” examples results in wasted gradient budget on well-learned or trivial samples.

By making selection or weighting explicit—at the level of prompts, tokens, or training pairs—SePO methods aim to discover and exploit the portion of the dataset where model learning is maximally productive (Guo et al., 14 Oct 2025, Dong, 10 Jul 2025, Yang et al., 24 Aug 2024, Ma et al., 30 Dec 2024).

2. Prompt-Level Selection via Preference Variance

Preference variance (PVar) is defined for a prompt $x$ under current (or initialization) model parameters $\theta$ as

$$\operatorname{PVar}_\theta[x] = \operatorname{Var}_{y_1, y_2 \sim \pi_\theta(\cdot\mid x)}\big[\,p_\theta(x; y_1, y_2)\,\big]$$

where $p_\theta(x; y_1, y_2)$ is the pairwise preference probability from DPO. Using an external reward model $r_\phi$, $\widehat{\operatorname{PVar}}[x]$ is estimated by sampling $n$ responses, computing pairwise preference scores, and evaluating

$$\widehat{\operatorname{PVar}}[x] = \frac{1}{n(n-1)} \sum_{i \neq j} \left( \hat p_{ij} - \tfrac{1}{2} \right)^2$$

with $\hat p_{ij} = \sigma\big(r_\phi(x, y_i) - r_\phi(x, y_j)\big)$.

(Guo et al., 14 Oct 2025) introduced SePO as a prompt selection algorithm where only the top $\alpha M$ prompts by $\widehat{\operatorname{PVar}}[x]$ are used for DPO fine-tuning, with all others omitted. Theoretical analysis showed that the gradient norm for a prompt is upper-bounded by a constant times $\operatorname{PVar}_\theta[x]^{1/3}$, hence low-PVar prompts induce near-zero updates and are largely redundant. Experimentally, focusing on the upper quantiles of PVar (e.g., top 10–50%) results in equal or superior model alignment using a fraction of the original annotation or preference cost.
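
As an illustration only, the following Python sketch implements this filtering step directly from the formulas above; the `generate_responses` sampler and scalar `reward_model` scorer are hypothetical placeholders for whatever generation and reward-scoring interfaces are available.

```python
import math
from itertools import permutations

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def estimate_pvar(prompt, generate_responses, reward_model, n: int = 8) -> float:
    """Estimate PVar[x]: sample n responses, average (p_ij - 1/2)^2 over ordered pairs i != j."""
    responses = generate_responses(prompt, n)            # hypothetical sampler for y ~ pi_theta(.|x)
    rewards = [reward_model(prompt, y) for y in responses]
    total = 0.0
    for i, j in permutations(range(n), 2):                # n(n-1) ordered pairs
        p_ij = sigmoid(rewards[i] - rewards[j])           # pairwise preference probability
        total += (p_ij - 0.5) ** 2
    return total / (n * (n - 1))

def select_high_pvar_prompts(prompts, generate_responses, reward_model,
                             alpha: float = 0.1, n: int = 8):
    """Keep only the top alpha*M prompts by estimated preference variance."""
    scored = [(estimate_pvar(x, generate_responses, reward_model, n), x) for x in prompts]
    scored.sort(key=lambda t: t[0], reverse=True)
    keep = max(1, int(alpha * len(prompts)))
    return [x for _, x in scored[:keep]]
```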

Key findings include:

  • Prompts with higher preference variance produce faster convergence and higher win-rates on AlpacaEval 2.0 and Arena-Hard.
  • Training with only the top 10% of human-annotated high-PVar prompts (from UltraFeedback) outperforms using the full dataset.
  • PVar-based selection remains effective when computed using small (1B parameter) reward models (Guo et al., 14 Oct 2025).

3. Token-Level Selection and Selective-DPO

Standard DPO applies a reward signal aggregated over the entire response. SePO extends this by identifying the key tokens in a sequence that are most responsible for preference distinctions and restricting optimization to those positions.

For each candidate token $y_i$ (in either “win” or “lose” sequences), token importance is measured with respect to log-probability differences between the current policy $\pi_\theta$ and a reference model $\pi_{\mathrm{ref}}$:

$$s(y_i) = (-1)^{\mathbb{I}[y_i \in y^l]} \left[ \log \pi_{\mathrm{ref}}(y_i \mid x, y_{<i}) - \log \pi_\theta(y_i \mid x, y_{<i}) \right]$$

The top $k\%$ of tokens by $s(y_i)$ are selected for each response. The Selective-DPO objective then replaces the standard sequence-level reward with a sum over the selected token positions:

$$R_{\mathrm{sel}}(y) = \sum_{i \in \mathcal{I}_y} \log\frac{\pi_\theta(y_i \mid x, y_{<i})}{\pi_{\mathrm{ref}}(y_i \mid x, y_{<i})}$$

$$L_{\mathrm{Selective\text{-}DPO}} = -\,\mathbb{E}_{(x,y^w,y^l)} \left[ \log\sigma\!\left(\beta\,\big(R_{\mathrm{sel}}(y^w)-R_{\mathrm{sel}}(y^l)\big)\right) \right]$$

Empirical results demonstrate that Selective-DPO with top-40% token selection and a strong reference model yields the largest gains, particularly for larger student models and more challenging benchmarks (e.g., Arena-Hard, MT-Bench) (Dong, 10 Jul 2025). This method reduces computational overhead, since gradients flow only through a subset of tokens, and avoids the noise amplification associated with indiscriminate token-level optimization.
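
For concreteness, the sketch below implements the token-selection step and the selective objective as written above, assuming per-token log-probabilities for each response have already been gathered from the policy and reference models; the 1-D tensor layout and helper names are illustrative assumptions, not code from the cited work.

```python
import torch
import torch.nn.functional as F

def select_token_mask(logp_ref: torch.Tensor, logp_policy: torch.Tensor,
                      is_rejected: bool, k: float = 0.4) -> torch.Tensor:
    """Boolean mask keeping the top-k% of positions by importance score s(y_i)."""
    sign = -1.0 if is_rejected else 1.0                  # (-1)^{I[y_i in y^l]}
    scores = sign * (logp_ref - logp_policy)             # s(y_i)
    num_keep = max(1, int(k * scores.numel()))
    idx = torch.topk(scores, num_keep).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[idx] = True
    return mask

def selective_dpo_loss(logp_policy_w, logp_ref_w, logp_policy_l, logp_ref_l,
                       beta: float = 0.1, k: float = 0.4) -> torch.Tensor:
    """Selective-DPO loss for one (chosen, rejected) pair of per-token log-prob vectors."""
    mask_w = select_token_mask(logp_ref_w, logp_policy_w, is_rejected=False, k=k)
    mask_l = select_token_mask(logp_ref_l, logp_policy_l, is_rejected=True, k=k)
    r_w = (logp_policy_w - logp_ref_w)[mask_w].sum()     # R_sel(y^w)
    r_l = (logp_policy_l - logp_ref_l)[mask_l].sum()     # R_sel(y^l)
    return -F.logsigmoid(beta * (r_w - r_l))
```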

4. Oracle-Based Token Selection and Reference-Free Contrastive SePO

A further refinement, as seen in (Yang et al., 24 Aug 2024), is SePO via oracle-based token reward estimation. Here, a compact “oracle” model $\pi_\phi$ is trained (via DPO) on a modest set of response-level preferences to estimate token-level rewards:

$$r_\phi(y_i \mid q, y_{<i}) \propto \log\frac{\pi_\phi(y_i \mid q, y_{<i})}{\pi_{\mathrm{ref}}(y_i \mid q, y_{<i})}$$

Key tokens are selected as the top $k\%$ in preferred responses and the bottom $k\%$ in rejected responses. Policy fine-tuning is performed using a reference-free, contrastive loss over only the selected tokens:

$$\mathcal{L}_{\mathrm{SePO}}(\theta) = -\,\mathbb{E}_{(q,y_w,y_l)} \left[ \log \sigma\big(\hat u_w(q,y_w;\theta) - \hat u_l(q,y_l;\theta)\big) \right]$$

where $\hat u_w$ and $\hat u_l$ are averages of log-probabilities over the selected tokens.

This approach achieves superior performance versus DPO and other baselines using ∼30% of tokens, supports weak-oracle-to-strong-policy generalization, and is robust to weak out-of-distribution data. The major efficiency gain is due to reduced backpropagation cost (∼70% lower FLOPs), while generalization is improved by filtering out non-informative tokens (Yang et al., 24 Aug 2024).
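
The following sketch renders the oracle-based variant under the same assumption of precomputed per-token log-probabilities from the policy, oracle, and reference models; all names are hypothetical and the snippet is a schematic reading of the loss above, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def oracle_token_rewards(logp_oracle: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Token-level reward proxy: log pi_phi(y_i|...) - log pi_ref(y_i|...), per position."""
    return logp_oracle - logp_ref

def selected_mean_logprob(logp_policy: torch.Tensor, rewards: torch.Tensor,
                          k: float = 0.3, pick_top: bool = True) -> torch.Tensor:
    """Average policy log-prob over the top-k% (chosen) or bottom-k% (rejected) reward tokens."""
    num_keep = max(1, int(k * rewards.numel()))
    idx = torch.topk(rewards, num_keep, largest=pick_top).indices
    return logp_policy[idx].mean()                        # \hat u term in the contrastive loss

def sepo_contrastive_loss(logp_policy_w, logp_oracle_w, logp_ref_w,
                          logp_policy_l, logp_oracle_l, logp_ref_l,
                          k: float = 0.3) -> torch.Tensor:
    """Reference-free contrastive SePO loss over selected tokens of one preference pair."""
    u_w = selected_mean_logprob(logp_policy_w, oracle_token_rewards(logp_oracle_w, logp_ref_w),
                                k=k, pick_top=True)       # top-k% of preferred response
    u_l = selected_mean_logprob(logp_policy_l, oracle_token_rewards(logp_oracle_l, logp_ref_l),
                                k=k, pick_top=False)      # bottom-k% of rejected response
    return -F.logsigmoid(u_w - u_l)
```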

5. Difficulty-Adaptive Sample Weighting

(Ma et al., 30 Dec 2024) introduced a plug-and-play SePO strategy centered on sample weighting, applicable to any pairwise preference optimization protocol. For each prompt $x$, the model generates $N$ responses under temperature $T$; $P_e$ counts the incorrect responses (and $P_c$ the correct ones), and a weight $w(x)$ is computed:

$$w(x) = \begin{cases} 1 + \alpha \cdot \dfrac{P_e}{N}, & P_c = 0 \\[6pt] \max\!\left( 1,\; 1 + \alpha \cdot \dfrac{P_e}{P_c + \epsilon} \cdot \dfrac{1}{N} \right), & P_c > 0 \end{cases}$$

The weights $w(x)$ are then used to upweight hard (unstable or frequently incorrect) samples in the DPO loss:

$$L = -\,\mathbb{E}_{(x,y_w,y_l)}\left[ w(x)\left( \log\frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$
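
As a rough illustration, the weight in the case expression above can be computed as follows; the `is_correct` verifier is a hypothetical stand-in for whatever answer-checking procedure the task provides.

```python
def difficulty_weight(prompt, responses, is_correct,
                      alpha: float = 1.0, eps: float = 1e-6) -> float:
    """Adaptive sample weight w(x) from N sampled responses to a prompt."""
    n = len(responses)
    p_c = sum(1 for y in responses if is_correct(prompt, y))   # number of correct responses
    p_e = n - p_c                                              # number of incorrect responses
    if p_c == 0:
        return 1.0 + alpha * p_e / n                           # all wrong: weight by error rate
    return max(1.0, 1.0 + alpha * (p_e / (p_c + eps)) / n)     # otherwise: error-to-correct ratio

# The resulting scalar simply multiplies the per-sample DPO loss term,
# leaving well-mastered prompts (all responses correct) at weight 1.
```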

Sample-weighted SePO yields consistent accuracy and stability improvements across mathematical reasoning tasks and models. For example, on MATH500, Qwen2-7B-DPO+SePO achieves 57.6% accuracy (+6.6%), outperforming unweighted DPO’s 55.8% (Ma et al., 30 Dec 2024).

6. Experimental Results and Benchmarks

The following table summarizes key SePO configurations and their empirical impact, as reported in the cited literature:

Approach | Selection Granularity | Efficiency | Notable Gain/Result | Reference
PVar-guided SePO | Prompt | ≤10–50% of prompts | Top 10% PVar: best eval. with 10× less data | (Guo et al., 14 Oct 2025)
Selective-DPO | Token (log-prob gap) | 30–50% of tokens per seq. | 33B ref, 3B student: +1.2% WR over DPO | (Dong, 10 Jul 2025)
Oracle-contrastive SePO | Token (oracle reward) | 30% of tokens per seq. | +0.7–1.6 pp WR over SimPO/DPO, lower FLOPs | (Yang et al., 24 Aug 2024)
SePO reweighting | Sample | All (with adaptive weight) | +2–4 pp math accuracy vs. unweighted DPO/SimPO | (Ma et al., 30 Dec 2024)

On all tested benchmarks (AlpacaEval 2.0, Arena-Hard, MT-Bench, GSM8K, MATH500, UltraFeedback), SePO variants outperform baseline DPO, with benefits in convergence speed, sample efficiency, and robustness to distributional shift.

7. Limitations and Open Questions

Known constraints of SePO methodologies include:

  • Reward/Reference Model Quality: The quality of selection or weighting in prompt- or token-level SePO is bounded by the strength of the reward or reference model; poor proxies can degrade performance (Guo et al., 14 Oct 2025, Dong, 10 Jul 2025, Yang et al., 24 Aug 2024).
  • Selection Ratio Sensitivity: Empirical performance plateaus for token selection ratios $k \in [30\%, 50\%]$; overly aggressive pruning (small $k$) misses key signals, while overly lax inclusion (large $k$) erodes the efficiency gains (Yang et al., 24 Aug 2024, Dong, 10 Jul 2025).
  • Cross-Vocabulary Selection: Present SePO approaches assume shared tokenization between oracle/reference and policy; heterogeneous vocabulary settings remain open (Yang et al., 24 Aug 2024).
  • Dynamicity: Preference variance and token informativeness may shift as the policy evolves, motivating periodic or online re-selection (Guo et al., 14 Oct 2025, Yang et al., 24 Aug 2024).
  • Generalization Beyond Pairwise Labels: Current SePO frameworks focus on pairwise or binary preference supervision, with multi-response or continuous-feedback generalizations an open direction (Guo et al., 14 Oct 2025).
  • Possible Instruction-following Trade-offs: Token-level SePO may reduce strict adherence to prompts in favor of optimizing for preference signal, which might necessitate hybrid or mixed objectives (Dong, 10 Jul 2025).

A plausible implication is that future work will move towards dynamic, streaming, or human-in-the-loop SePO, as well as better mechanisms for adapting selection to non-standard or evolving data regimes.


For further implementation detail and experimental replication, see the corresponding works: (Guo et al., 14 Oct 2025, Dong, 10 Jul 2025, Yang et al., 24 Aug 2024, Ma et al., 30 Dec 2024).
