Optimal Transport-Based Token Weighting (OTPO)

Updated 15 April 2026
  • OTPO is a technique that uses optimal transport to compute adaptive, semantic-aware token weights, mitigating noise and length bias in preference optimization.
  • It formulates token weighting as an optimal transport problem using unbalanced Sinkhorn iterations and regularization for stable, scalable alignment.
  • Empirical benchmarks demonstrate that OTPO improves alignment metrics in LLM tasks like instruction-following and summarization, achieving win-rate gains up to 8.6%.

Optimal Transport-Based Token Weighting (OTPO) frameworks constitute a class of techniques in preference optimization that leverage optimal transport (OT) theory to assign adaptive, semantic-aware importance to tokens when aligning LLMs to human preferences. These approaches replace uniform token weighting found in standard Direct Preference Optimization (DPO) with OT-derived weights, focusing the optimization objective on the most informative and consequential token alignments. By formulating reward differences or preference losses as optimal transport problems over token embeddings, OTPO schemes mitigate the impact of irrelevant or noisy tokens, reduce length bias, and yield more contrastive and robust alignment between model outputs and human intent (Li et al., 24 May 2025).

1. Motivation and Conceptual Foundations

Standard DPO directly optimizes the log-likelihood difference between a preferred (chosen) and a non-preferred (rejected) sequence. In this regime, each token’s log-ratio contribution is weighted uniformly: $\Delta_r = \sum_i q_c^i - \sum_j q_r^j$, where $q_c^i$ and $q_r^j$ denote the log-likelihood ratios of tokens in the chosen and rejected responses, respectively. Uniform weighting allows irrelevant or high-frequency tokens to dominate the difference, confounding alignment and inducing length bias.
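To make the length-bias point concrete, the following is a minimal sketch of the uniform difference $\Delta_r$. The function name and the per-token values are hypothetical and chosen only for illustration; they are not taken from the cited work.

```python
def uniform_reward_difference(q_chosen, q_rejected):
    """Delta_r = sum_i q_c^i - sum_j q_r^j, with every token weighted equally."""
    return sum(q_chosen) - sum(q_rejected)


# Hypothetical per-token log-likelihood ratios (illustrative numbers only).
q_chosen = [0.5, 0.25, 0.25]
q_rejected = [0.5, 0.25]
delta_short = uniform_reward_difference(q_chosen, q_rejected)

# Appending filler tokens with small positive ratios to the chosen response
# inflates Delta_r even though no informative token has changed.
q_chosen_padded = q_chosen + [0.125] * 8
delta_padded = uniform_reward_difference(q_chosen_padded, q_rejected)

print(delta_short, delta_padded)
```

In this toy case the difference grows purely because the chosen response got longer, which is the behavior that token reweighting is meant to dampen.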

OTPO introduces a semantically informed reweighting mechanism. Instead of treating all token positions equally, OTPO assigns higher weights to token pairs that are semantically aligned (typically assessed via hidden-state similarity) and down-weights less relevant, poorly matched tokens. The weighting is determined by solving an OT problem between the token-level embeddings of the chosen and rejected responses, yielding an alignment plan that focuses model optimization on meaningful distinctions (Li et al., 24 May 2025).

The approach connects to broader applications of OT-based token weighting, including interpretable semantic similarity (Lee et al., 2022) and global distributional alignment in preference learning (Zhu et al., 2 Apr 2026), but distinguishes itself via integration into DPO-style direct preference objectives.

2. Mathematical Formulation and Algorithm

The OTPO workflow is formalized as follows (Li et al., 24 May 2025):

Token Embedding Extraction

Obtain last-layer hidden states for the chosen response $y_c = [y_c^1, \dots, y_c^m]$ and the rejected response $y_r = [y_r^1, \dots, y_r^n]$: $h_c^i,\, h_r^j \in \mathbb{R}^d$.

Cost Matrix Construction

Define $C \in \mathbb{R}^{m \times n}$ by Euclidean distances: $C_{ij} = \| h_c^i - h_r^j \|_2$.
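As an illustration of this step, here is a small dependency-free sketch that builds $C$ from two lists of hidden-state vectors. The function name and the toy 2-dimensional vectors are assumptions made for the example; in practice the hidden states would come from the model's last layer.

```python
import math


def euclidean_cost_matrix(h_chosen, h_rejected):
    """Return C with C[i][j] = ||h_c^i - h_r^j||_2 as an m x n nested list."""
    return [[math.dist(hc, hr) for hr in h_rejected] for hc in h_chosen]


# Toy hidden states with d = 2 (illustrative values only).
h_chosen = [[0.0, 0.0], [3.0, 4.0]]                # m = 2 chosen tokens
h_rejected = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]  # n = 3 rejected tokens

C = euclidean_cost_matrix(h_chosen, h_rejected)
for row in C:
    print(row)
```

Identical hidden states get zero cost, so tokens that the model represents similarly in both responses are cheap to match.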

Regularized Unbalanced OT Objective

Solve for the optimal transport plan $\Gamma^{*} \in \mathbb{R}_{\ge 0}^{m \times n}$: $\Gamma^* = \arg\min_{\Gamma \ge 0} \sum_{ij} \Gamma_{ij} C_{ij} + \epsilon_1 \sum_{ij} \Gamma_{ij} \log \Gamma_{ij} + \epsilon_2 \bigl( \mathrm{KL}(\Gamma \mathbf{1}_n \,\|\, \mathbf{1}_m) + \mathrm{KL}(\Gamma^\top \mathbf{1}_m \,\|\, \mathbf{1}_n) \bigr)$. The entropy regularizer (weighted by $\epsilon_1$) controls the smoothness of the plan, and the KL terms (weighted by $\epsilon_2$) control marginal fidelity. Because the marginals are penalized rather than enforced, the plan can carry “unbalanced” mass.
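Objectives of this shape are commonly solved with the scaling form of unbalanced Sinkhorn, which alternates updates of two scaling vectors $u$ and $v$ with exponent $\epsilon_2 / (\epsilon_1 + \epsilon_2)$. The sketch below is a minimal pure-Python version under stated assumptions: the function name, the default values of `eps1`, `eps2` and `n_iters`, and the toy cost matrix are illustrative and not taken from the source, and whether the authors' solver uses exactly this variant is not confirmed. It follows the usual entropy convention, under which the plan should differ from the minimizer of the stated objective at most by a global scale.

```python
import math


def unbalanced_sinkhorn(C, eps1=0.1, eps2=1.0, n_iters=50):
    """Entropic unbalanced OT with all-ones target marginals (scaling form).

    Sketch only: a log-domain implementation would be more robust when
    C / eps1 is large enough for exp() to underflow to zero.
    """
    m, n = len(C), len(C[0])
    # Gibbs kernel K = exp(-C / eps1).
    K = [[math.exp(-C[i][j] / eps1) for j in range(n)] for i in range(m)]
    power = eps2 / (eps2 + eps1)  # softens the marginal constraints
    u, v = [1.0] * m, [1.0] * n
    for _ in range(n_iters):
        Kv = [sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        u = [(1.0 / Kv[i]) ** power for i in range(m)]
        Ktu = [sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
        v = [(1.0 / Ktu[j]) ** power for j in range(n)]
    # Transport plan Gamma = diag(u) K diag(v).
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]


# Toy cost matrix: tokens 0 and 1 each have a close match, token 2 has none.
C = [[0.1, 2.0], [2.0, 0.1], [3.0, 3.0]]
gamma = unbalanced_sinkhorn(C, eps1=0.5, eps2=1.0, n_iters=100)
row_mass = [sum(row) for row in gamma]
print(row_mass)
```

In the toy example the third chosen token is far from every rejected token, so it receives much less transported mass than the two well-matched tokens. This is the mechanism by which poorly aligned tokens end up down-weighted.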

Token Weight Computation

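Row and column sums of $\Gamma^*$ yield raw weights: $\omega_c^i = \sum_j \Gamma^*_{ij}$ for chosen tokens and $\omega_r^j = \sum_i \Gamma^*_{ij}$ for rejected tokens. Normalize both to a fixed weight budget for stability.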
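The weight computation can be sketched as follows. This assumes that "normalizing to a budget" means rescaling each side so its weights sum to the budget; the source does not spell out the exact rule, and the function name, the `budget` parameter name, and the example plan are hypothetical.

```python
def token_weights(gamma, budget=1.0):
    """Turn a transport plan into per-token weights for both responses.

    Assumption: each side is rescaled so that its weights sum to `budget`.
    """
    m, n = len(gamma), len(gamma[0])
    raw_chosen = [sum(gamma[i]) for i in range(m)]                         # row sums
    raw_rejected = [sum(gamma[i][j] for i in range(m)) for j in range(n)]  # column sums
    total_chosen, total_rejected = sum(raw_chosen), sum(raw_rejected)
    w_chosen = [budget * x / total_chosen for x in raw_chosen]
    w_rejected = [budget * x / total_rejected for x in raw_rejected]
    return w_chosen, w_rejected


# Illustrative 3 x 2 plan (values are made up for the example).
gamma = [[0.4, 0.0], [0.1, 0.3], [0.0, 0.2]]
w_chosen, w_rejected = token_weights(gamma, budget=2.0)
print(w_chosen, w_rejected)
```

Because both sides are scaled to the same total, the magnitude of the weighted difference no longer grows simply because one response has more tokens.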

OT-Weighted Preference Difference

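Form the OT-weighted reward difference: $\hat{\Delta}_r = \sum_i \omega_c^i q_c^i - \sum_j \omega_r^j q_r^j$, where $\omega_c^i$ and $\omega_r^j$ are the normalized token weights of the chosen and rejected responses.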

Loss Objective

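Use the same sigmoid cross-entropy as DPO, applied to the OT-weighted reward difference $\hat{\Delta}_r$: $\mathcal{L} = -\log \sigma(\beta\, \hat{\Delta}_r)$, where $\sigma$ is the logistic sigmoid and $\beta$ is the DPO temperature.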
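A small sketch of the weighted difference and the DPO-style loss for a single preference pair is given below. The function name, the `beta` value, and the toy inputs are assumptions for illustration. With all weights equal to 1 the weighted difference reduces to the uniform DPO difference.

```python
import math


def otpo_pair_loss(q_chosen, q_rejected, w_chosen, w_rejected, beta=0.1):
    """Return (weighted difference, -log sigmoid(beta * difference)) for one pair."""
    delta = (sum(w * q for w, q in zip(w_chosen, q_chosen))
             - sum(w * q for w, q in zip(w_rejected, q_rejected)))
    x = beta * delta
    # Numerically stable -log(sigmoid(x)) = log(1 + exp(-x)).
    if x >= 0:
        loss = math.log1p(math.exp(-x))
    else:
        loss = -x + math.log1p(math.exp(x))
    return delta, loss


# Toy log-likelihood ratios; unit weights recover the uniform DPO difference.
q_chosen, q_rejected = [1.0, 2.0], [1.0]
delta_uniform, loss_uniform = otpo_pair_loss(q_chosen, q_rejected, [1.0, 1.0], [1.0], beta=1.0)

# Down-weighting the first chosen token shrinks its contribution to the difference.
delta_weighted, loss_weighted = otpo_pair_loss(q_chosen, q_rejected, [0.5, 1.0], [1.0], beta=1.0)
print(delta_uniform, loss_uniform, delta_weighted, loss_weighted)
```

In a training setting the same computation would run on tensors inside an automatic-differentiation framework so the loss can be backpropagated. Whether gradients should flow through the weights themselves is not stated in the source.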

Algorithmic Steps

  1. Compute the hidden states $h_c^i$, $h_r^j$ for each paired response.
  2. Form the cost matrix $C$.
  3. Solve OT via unbalanced Sinkhorn iterations.
  4. Sum and normalize weights.
  5. Compute the token log-likelihood ratios $q_c^i$, $q_r^j$ from model/reference.
  6. Compute the OT-weighted reward difference and the loss.
  7. Backpropagate the loss.

Hyperparameters

  • $\epsilon_1$: entropy regularization strength.
  • $\epsilon_2$: KL marginal fidelity strength, typically held fixed.
  • Weight budget: the total to which the token weights are normalized.
  • Sinkhorn iterations: the number of scaling updates per OT solve.
  • Complexity: quadratic in sequence length per instance; little overhead compared to the Transformer forward pass.

3. Comparison to Alternative OT-Based Token Weighting

OTPO is part of a broader movement toward leveraging optimal transport for token- or distribution-level alignment in NLP:

  • RCMD/CLRCMD (Lee et al., 2022): “Relaxed Contextualized Mover’s Distance” computes OT-based sentence distances for semantic similarity using cost matrices over token embeddings with (typically cosine) distance, but enforces relaxed marginal constraints. While RCMD emphasizes interpretability and efficient, sparse matching, OTPO emphasizes precise, semantically weighted contributions to the reward difference in RLHF.
  • PLOT (Zhu et al., 2 Apr 2026): “Preference Learning via Optimal Transport” applies OT at the vocabulary-distribution level, comparing the model's output pseudo-distribution to a data-defined target using a cost matrix derived from absolute pairwise differences of embedding norms. The final loss is a scalar (Wasserstein-1) distance, not a position-weighted objective. While PLOT uses a global OT loss, OTPO explicitly focuses on token-level assignment and gradient routing.

A summary table contrasting representative OT-based schemes:

| Method | OT Domain | Weighting Granularity |
|---|---|---|
| OTPO (Li et al., 24 May 2025) | Token embeddings | Token/position-level |
| RCMD (Lee et al., 2022) | Token embeddings | Token/nearest-neighbor |
| PLOT (Zhu et al., 2 Apr 2026) | Vocab distributions | Token-frequency/global |

4. Implementation Details and Practical Considerations

OTPO can be implemented with established OT numerical libraries, such as the unbalanced Sinkhorn solvers in POT (Python Optimal Transport). The chosen distance is typically Euclidean, although others (e.g., cosine) may be substituted. Budget normalization keeps reward-difference magnitudes consistent across responses of varying length, addressing length bias.

Algorithmic stability is maintained through the regularization hyperparameters ($\epsilon_1$, $\epsilon_2$) and through robust normalization of weights. Empirically, OTPO's loss and rewards remain stable across wide hyperparameter intervals.

Runtime overhead is minimal: the extra work per step is dominated by building the cost matrix and running Sinkhorn, both quadratic in sequence length, which is marginal compared to the Transformer forward/backward passes.

Ablation studies reveal that naive weighting (e.g., uniform or heuristic embedding similarity) fails to match performance or reliability of full OT-based weighting.

5. Empirical Benchmarks and Comparative Performance

Extensive experiments in (Li et al., 24 May 2025) demonstrate that OTPO delivers robust gains in instruction-following, summarization, and alignment. Key findings include:

  • On AlpacaEval2 (Llama-3-8B + UltraFeedback), OTPO improves the LC win rate over DPO.
  • For Llama-3.2-3B, OTPO likewise improves over DPO.
  • In TL;DR summarization with Qwen-2.5-3B, OTPO achieves a win-rate improvement over the best baseline.
  • Additional absolute win-rate gains over other DPO variants are observed on HelpSteer2 across Qwen-2.5-3B and Mistral-7B.

Empirical reward margins and model alignment are robust to regularizer settings. Replacing OT with non-adaptive or purely similarity-based