Optimal Transport-Based Token Weighting (OTPO)
- OTPO is a technique that uses optimal transport to compute adaptive, semantic-aware token weights, mitigating noise and length bias in preference optimization.
- It formulates token weighting as an optimal transport problem using unbalanced Sinkhorn iterations and regularization for stable, scalable alignment.
- Empirical benchmarks demonstrate that OTPO improves alignment metrics in LLM tasks like instruction-following and summarization, achieving win-rate gains up to 8.6%.
Optimal Transport-Based Token Weighting (OTPO) frameworks constitute a class of techniques in preference optimization that leverage optimal transport (OT) theory to assign adaptive, semantic-aware importance to tokens when aligning LLMs to human preferences. These approaches replace uniform token weighting found in standard Direct Preference Optimization (DPO) with OT-derived weights, focusing the optimization objective on the most informative and consequential token alignments. By formulating reward differences or preference losses as optimal transport problems over token embeddings, OTPO schemes mitigate the impact of irrelevant or noisy tokens, reduce length bias, and yield more contrastive and robust alignment between model outputs and human intent (Li et al., 24 May 2025).
1. Motivation and Conceptual Foundations
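Standard DPO directly optimizes the log-likelihood difference between a preferred (chosen) and a non-preferred (rejected) sequence. In this regime, each token’s log-ratio contribution is weighted uniformly:
$$\Delta r = \beta\left(\sum_{i=1}^{m} r_c^i - \sum_{j=1}^{n} r_r^j\right),$$
where $r_c^i = \log\frac{\pi_\theta(y_c^i \mid x, y_c^{<i})}{\pi_{\mathrm{ref}}(y_c^i \mid x, y_c^{<i})}$ and $r_r^j$ (defined analogously on $y_r$) denote log-likelihood ratios of tokens in the chosen and rejected responses, respectively, and $m$, $n$ are the response lengths. Uniform weighting allows irrelevant or high-frequency tokens to dominate the difference, confounding alignment and inducing length bias.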
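The length sensitivity of uniform weighting can be illustrated with a minimal sketch. The per-token log-ratios below are made-up numbers, and `beta=0.1` is only an assumed, commonly used value; the function name is hypothetical.

```python
def dpo_reward_difference(chosen_logratios, rejected_logratios, beta=0.1):
    """Uniformly weighted DPO reward difference: every token counts equally."""
    return beta * (sum(chosen_logratios) - sum(rejected_logratios))

# Hypothetical per-token log-ratios log(pi_theta / pi_ref); values are invented.
chosen = [0.9, 0.1, 0.1]   # one informative token, two filler tokens
rejected = [-0.4, 0.1]

base = dpo_reward_difference(chosen, rejected)

# Appending filler tokens to the chosen response shifts the difference,
# even though no preference-relevant content was added.
padded = dpo_reward_difference(chosen + [0.1, 0.1, 0.1], rejected)

print(base, padded)
```

Because every token enters the sum with weight one, the extra filler tokens move the reward difference on their own, which is the length bias that token weighting is meant to dampen.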
OTPO paradigms introduce a semantically informed reweighting mechanism. Instead of treating all token positions equally, OTPO assigns higher weights to token pairs that are semantically aligned—typically assessed via hidden state similarities—and down-weights less relevant, less matched tokens. This weighting is determined by solving an OT problem between the token-level embeddings of candidate and reference responses, yielding an alignment plan that focuses model optimization on meaningful distinctions (Li et al., 24 May 2025).
The approach connects to broader applications of OT-based token weighting, including interpretable semantic similarity (Lee et al., 2022) and global distributional alignment in preference learning (Zhu et al., 2 Apr 2026), but distinguishes itself via integration into DPO-style direct preference objectives.
2. Mathematical Formulation and Algorithm
The OTPO workflow is formalized as follows (Li et al., 24 May 2025):
Token Embedding Extraction
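Obtain last-layer hidden states for the chosen and rejected responses: $H_c = [h_c^1, \dots, h_c^m] \in \mathbb{R}^{m \times d}$ and $H_r = [h_r^1, \dots, h_r^n] \in \mathbb{R}^{n \times d}$, where $m$ and $n$ are the response lengths and $d$ is the hidden size.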
Cost Matrix Construction
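Define the cost matrix $C \in \mathbb{R}^{m \times n}$ by Euclidean distances between the hidden state $h_c^i$ of chosen token $i$ and the hidden state $h_r^j$ of rejected token $j$: $C_{ij} = \lVert h_c^i - h_r^j \rVert_2$.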
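As a small illustration of the cost matrix, the sketch below computes pairwise Euclidean distances between two toy sets of 2-dimensional vectors. The vectors stand in for hidden states and are invented; plain Python lists are used so the example runs without dependencies, whereas a training loop would operate on tensors.

```python
import math

def cost_matrix(h_chosen, h_rejected):
    """C[i][j] = Euclidean distance between chosen token i and rejected token j."""
    return [[math.dist(hc, hr) for hr in h_rejected] for hc in h_chosen]

# Toy 2-d stand-ins for last-layer hidden states (hypothetical values).
H_c = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]   # 3 chosen tokens
H_r = [[0.0, 0.0], [1.0, 0.0]]               # 2 rejected tokens

C = cost_matrix(H_c, H_r)
for row in C:
    print(row)
```

A zero entry marks a pair of tokens with identical representations; large entries mark pairs that the transport plan will tend to avoid.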
Regularized Unbalanced OT Objective
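Solve for the optimal transport plan $\Gamma^*$ over the cost matrix $C$:
$$\Gamma^* = \arg\min_{\Gamma \ge 0}\; \langle C, \Gamma \rangle + \varepsilon\, \Omega(\Gamma) + \tau \left[\mathrm{KL}(\Gamma \mathbf{1} \,\|\, a) + \mathrm{KL}(\Gamma^\top \mathbf{1} \,\|\, b)\right],$$
where $\Omega(\Gamma) = \sum_{i,j} \Gamma_{ij} \log \Gamma_{ij}$ is the negative entropy and $a$, $b$ are the reference marginals over chosen and rejected tokens. The entropy regularizer ($\varepsilon$) and KL terms ($\tau$) control the smoothness and marginal fidelity, supporting “unbalanced” mass.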
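A minimal sketch of the scaling iterations commonly used for entropic unbalanced OT is shown below. It is not taken from the paper: the uniform marginals, the default values of `eps` and `tau`, and the iteration count are all assumptions, and the plain exponential kernel is only suitable for toy inputs (a log-domain variant would be safer for small `eps`).

```python
import math

def unbalanced_sinkhorn(C, a, b, eps=0.5, tau=1.0, n_iters=200):
    """Scaling iterations for entropic unbalanced OT; returns the plan as nested lists."""
    m, n = len(a), len(b)
    # Gibbs kernel built from the cost matrix.
    K = [[math.exp(-C[i][j] / eps) for j in range(n)] for i in range(m)]
    # Exponent that softens the marginal constraints; it tends to 1 as tau grows,
    # which recovers the balanced Sinkhorn update.
    power = tau / (tau + eps)
    u, v = [1.0] * m, [1.0] * n
    for _ in range(n_iters):
        u = [(a[i] / sum(K[i][j] * v[j] for j in range(n))) ** power for i in range(m)]
        v = [(b[j] / sum(K[i][j] * u[i] for i in range(m))) ** power for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]

# Toy cost matrix: chosen token 0 matches rejected token 0, and 1 matches 1.
C = [[0.0, 1.0], [1.0, 0.0]]
a = [0.5, 0.5]   # assumed uniform marginal over chosen tokens
b = [0.5, 0.5]   # assumed uniform marginal over rejected tokens

plan = unbalanced_sinkhorn(C, a, b)
for row in plan:
    print([round(x, 3) for x in row])
```

In this toy case the plan places more mass on the zero-cost diagonal pairs than on the off-diagonal pairs, and the row sums are no longer forced to equal `a` exactly because the KL penalty is soft.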
Token Weight Computation
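Row and column sums of the transport plan $\Gamma^*$ yield raw weights: $w_c^i = \sum_{j} \Gamma^*_{ij}$ for chosen token $i$ and $w_r^j = \sum_{i} \Gamma^*_{ij}$ for rejected token $j$. Normalize both weight vectors to a budget $B$ for stability, giving $\hat{w}_c$ and $\hat{w}_r$.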
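The marginalization step can be sketched as follows. The transport plan is a hand-written toy matrix, and rescaling each weight vector so that it sums to the budget is one plausible reading of "normalize to a budget", stated here as an assumption rather than the paper's exact rule.

```python
def token_weights(plan, budget=1.0):
    """Row/column sums of a transport plan, each rescaled to sum to `budget`."""
    row_mass = [sum(row) for row in plan]          # one weight per chosen token
    col_mass = [sum(col) for col in zip(*plan)]    # one weight per rejected token

    def rescale(weights):
        total = sum(weights)
        return [budget * w / total for w in weights]

    return rescale(row_mass), rescale(col_mass)

# Hypothetical 3x2 plan: chosen token 1 is barely matched to anything.
plan = [
    [0.30, 0.02],
    [0.01, 0.01],
    [0.03, 0.25],
]

w_c, w_r = token_weights(plan, budget=2.0)
print(w_c, w_r)
```

The poorly matched chosen token receives a small weight, so its log-ratio contributes little to the reward difference, while both weight vectors carry the same total mass regardless of response length.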
OT-Weighted Preference Difference
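Form the OT-weighted reward difference:
$$\Delta r_{\mathrm{OT}} = \beta\left(\sum_{i} \hat{w}_c^i\, r_c^i - \sum_{j} \hat{w}_r^j\, r_r^j\right),$$
where $\hat{w}_c^i$, $\hat{w}_r^j$ are the normalized token weights and $r_c^i$, $r_r^j$ are the token log-likelihood ratios of the chosen and rejected responses.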
Loss Objective
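Use the same sigmoid cross-entropy as DPO, applied to the OT-weighted reward difference $\Delta r_{\mathrm{OT}}$:
$$\mathcal{L}_{\mathrm{OTPO}} = -\mathbb{E}_{(x,\, y_c,\, y_r)}\left[\log \sigma\left(\Delta r_{\mathrm{OT}}\right)\right],$$
where $\sigma$ is the logistic sigmoid.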
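The weighted reward difference and the loss for a single preference pair can be sketched as below. The weights and log-ratios are invented, `beta=0.1` is an assumed value, and the function name is hypothetical; setting all weights to one reduces the computation to the uniformly weighted DPO case.

```python
import math

def otpo_pair_loss(w_c, r_c, w_r, r_r, beta=0.1):
    """Return (weighted reward difference, -log sigmoid of that difference)."""
    delta = beta * (
        sum(w * r for w, r in zip(w_c, r_c))
        - sum(w * r for w, r in zip(w_r, r_r))
    )
    # -log(sigmoid(delta)) = softplus(-delta), written in a numerically stable form.
    loss = max(-delta, 0.0) + math.log1p(math.exp(-abs(delta)))
    return delta, loss

# Hypothetical token log-ratios for a chosen and a rejected response.
r_c = [0.9, 0.1, 0.1]
r_r = [-0.4, 0.1]

# Uniform weights: identical to the standard DPO reward difference.
uniform_delta, uniform_loss = otpo_pair_loss([1.0, 1.0, 1.0], r_c, [1.0, 1.0], r_r)

# Hypothetical OT-derived weights that emphasize the first token of each response.
weighted_delta, weighted_loss = otpo_pair_loss([2.4, 0.3, 0.3], r_c, [1.7, 0.3], r_r)

print(uniform_delta, uniform_loss)
print(weighted_delta, weighted_loss)
```

With the mass concentrated on the informative tokens, the reward difference grows and the loss shrinks for this toy pair, which is the intended effect of focusing the objective on well-matched tokens.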
Algorithmic Steps
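- Compute the hidden states $H_c$, $H_r$ for each paired response.
- Form the cost matrix $C$.
- Solve OT via unbalanced Sinkhorn iterations.
- Sum and normalize weights.
- Compute the token log-ratios $r_c^i$, $r_r^j$ from the policy and reference models.
- Compute the OT-weighted reward difference $\Delta r_{\mathrm{OT}}$.
- Backpropagate the loss $\mathcal{L}_{\mathrm{OTPO}}$.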
Hyperparameters
- $\varepsilon$: entropy regularization strength.
- $\tau$: KL marginal-fidelity strength, typically held fixed.
- $B$: weight budget.
- Number of Sinkhorn iterations.
- Complexity: $O(L^2)$ per instance ($L$ is sequence length); little overhead compared to a Transformer forward pass.
3. Comparison to Alternative OT-Based Token Weighting
OTPO is part of a broader movement toward leveraging optimal transport for token- or distribution-level alignment in NLP:
- RCMD/CLRCMD (Lee et al., 2022): “Relaxed Contextualized Mover’s Distance” computes OT-based sentence distances for semantic similarity from cost matrices over token embeddings (typically with cosine distance), but enforces looser (relaxed) marginal constraints. While RCMD emphasizes interpretability and efficient, sparse matching, OTPO emphasizes precise, semantically weighted contributions to the reward difference in preference optimization.
- PLOT (Zhu et al., 2 Apr 2026): “Preference Learning via Optimal Transport” applies OT at the vocabulary-distribution level, comparing the model's output pseudo-distribution to a data-defined target distribution using a cost matrix derived from absolute pairwise differences of embedding norms. The final loss is a scalar (Wasserstein-1) distance, not a position-weighted objective. While PLOT uses a global OT loss, OTPO explicitly focuses on token-level assignment and gradient routing.
A summary table contrasting representative OT-based schemes:
| Method | OT Domain | Weighting Granularity |
|---|---|---|
| OTPO (Li et al., 24 May 2025) | Token embeddings | Token/position-level |
| RCMD (Lee et al., 2022) | Token embeddings | Token/nearest-neighbor |
| PLOT (Zhu et al., 2 Apr 2026) | Vocab distributions | Token-frequency/global |
4. Implementation Details and Practical Considerations
OTPO leverages established OT numerical libraries such as the unbalanced-OT and Sinkhorn solvers in POT (Python Optimal Transport). The chosen distance is typically Euclidean, although others (e.g., cosine) may be substituted. Normalizing the token weights to a fixed budget ensures reward difference magnitudes remain consistent across responses of different lengths, addressing length bias.
Algorithmic stability is maintained through the regularization hyperparameters (the entropy and KL strengths) and through robust normalization of weights. Empirically, OTPO's reward margins remain stable across wide hyperparameter ranges.
Runtime overhead is minimal: the extra compute per step relative to standard DPO is dominated by the $O(L^2)$ cost of building the cost matrix and running Sinkhorn (with $L$ the sequence length), which is marginal compared to Transformer forward/backward passes.
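A back-of-the-envelope estimate gives a feel for why the OT step is cheap relative to the model itself. Every number below is an assumption chosen for illustration: sequence length 512, hidden size 4096, an 8B-parameter model, 100 Sinkhorn iterations, and the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass. Backward passes and the reference model are ignored, which only makes the comparison more conservative.

```python
def ot_overhead_ratio(seq_len=512, hidden=4096, n_params=8e9, sinkhorn_iters=100):
    """Rough ratio of OT FLOPs to policy forward-pass FLOPs for one preference pair."""
    pairs = seq_len * seq_len                      # entries in the cost matrix
    cost_flops = 3 * pairs * hidden                # subtract, square, accumulate per dimension
    # Two matrix-vector products per iteration, about 2 FLOPs per matrix entry each.
    sinkhorn_flops = sinkhorn_iters * 2 * 2 * pairs
    # Forward pass over both responses: ~2 FLOPs per parameter per token.
    forward_flops = 2 * n_params * (2 * seq_len)
    return (cost_flops + sinkhorn_flops) / forward_flops

ratio = ot_overhead_ratio()
print(f"OT FLOPs / forward FLOPs ~ {ratio:.1e}")
```

Under these assumptions the OT step costs a small fraction of a percent of a single forward pass, which is consistent with the qualitative claim that the overhead is marginal.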
Ablation studies reveal that naive weighting (e.g., uniform or heuristic embedding similarity) fails to match performance or reliability of full OT-based weighting.
5. Empirical Benchmarks and Comparative Performance
Extensive experiments in (Li et al., 24 May 2025) demonstrate that OTPO delivers robust gains in instruction-following, summarization, and alignment. Key findings include:
- On AlpacaEval2 (Llama-3-8B + UltraFeedback), OTPO improves the length-controlled (LC) win rate over DPO.
- For Llama-3.2-3B, OTPO likewise improves over DPO.
- In TL;DR summarization with Qwen-2.5-3B, OTPO improves the win rate over the best baseline.
- Additional absolute gains over other DPO variants are observed on HelpSteer2 across Qwen-2.5-3B and Mistral-7B.
Empirical reward margins and model alignment are robust to regularizer settings. Replacing OT with non-adaptive or purely similarity-based