Token-Weighted Loss Methods

Updated 19 April 2026

Token-Weighted Loss is an objective function for sequence models that scales each token's contribution using data- and context-specific weights.
It employs various weighting schemes such as frequency-based, entropy-based, and model confidence methods to emphasize important tokens during training.
Applications in neural machine translation, language modeling, and ASR demonstrate improved performance metrics and robustness through targeted token emphasis.

A token-weighted loss is any objective function for sequence modeling in which the contribution of each target token's prediction to the overall optimization criterion is explicitly scaled by a token-specific (often data- or context-dependent) weight. In contrast to standard maximum likelihood training—which uniformly weights all tokens—token-weighting enables differential emphasis on semantically, structurally, or statistically important tokens, supports robustness to noisy supervision, and steers optimization toward particular learning goals. Contemporary token-weighted loss approaches span autoregressive language modeling, neural machine translation, long-context LMs, direct preference optimization, sequence transduction, and generative recommendation, with weighting functions designed using frequency heuristics, dynamic model confidence, entropy, importance sampling, semantic information gain, optimal transport, or task-specific error rates.

1. Mathematical Foundations of Token-Weighted Loss

Formally, in a sequence model with input $x$ and ground-truth target sequence $y = (y_1, \dots, y_T)$, the standard loss is the negative log-likelihood: $\mathcal{L}_{\mathrm{CE}}(x, y) = -\sum_{t=1}^T \log P_\theta(y_t \mid y_{<t}, x)$. The token-weighted variant introduces non-negative per-token weights $w_t \in \mathbb{R}_{\geq 0}$: $\mathcal{L}_{\mathrm{TW}}(x, y) = -\sum_{t=1}^T w_t \cdot \log P_\theta(y_t \mid y_{<t}, x)$. Setting $w_t = 1$ for all $t$ recovers the standard loss. The weighting scheme $w_t$ may be static (frequency, position, structural property) or dynamic (function of model confidence, entropy, or external reward).
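As a minimal sketch of the two objectives above, the following pure-Python function computes the loss for a single sequence from the probabilities the model assigns to the gold tokens. The function name and the example numbers are illustrative assumptions, not taken from any cited work.

```python
import math

def token_weighted_nll(target_probs, weights=None):
    """Return -sum_t w_t * log p_t for one sequence.

    target_probs: probabilities P(y_t | y_<t, x) assigned to each gold token.
    weights: non-negative per-token weights; None means w_t = 1 (plain NLL).
    """
    if weights is None:
        weights = [1.0] * len(target_probs)
    if len(weights) != len(target_probs):
        raise ValueError("weights and target_probs must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return -sum(w * math.log(p) for w, p in zip(weights, target_probs))

# Hypothetical gold-token probabilities for a three-token target.
probs = [0.9, 0.5, 0.1]
ce = token_weighted_nll(probs)                    # uniform weights
tw = token_weighted_nll(probs, [1.0, 1.0, 2.0])   # emphasize the hard third token
```

Doubling the weight of the third token adds one extra copy of its log-loss to the total, so that token contributes twice as much gradient as it would under the uniform objective.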

Several generalizations and specializations exist:

Soft cross-entropy targets: replace the Dirac delta on $y_t$ with a “cloud” $q_t(\cdot)$ over the vocabulary, yielding $\mathcal{L} = -\sum_{t=1}^T \sum_{w} q_t(w) \log P_\theta(w \mid h_t)$, where $h_t$ denotes the model's context representation at step $t$ (Elbayad et al., 2018).
Multiplicative reward or information content: the weight $w_t$ is obtained by applying a transformation $f$ to a per-token reward or information-content score, where $f$ may be a power, exponential, or bounded transformation.
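The soft-target generalization can be illustrated for a single time step as follows. This is a sketch under assumed toy distributions; the vocabulary, the probabilities, and the function name are hypothetical.

```python
import math

def soft_cross_entropy(q, p):
    """Return -sum_w q(w) * log p(w) for one time step.

    q: target distribution over tokens (dict token -> probability).
    p: model distribution over tokens (dict token -> probability).
    """
    return -sum(q_w * math.log(p[w]) for w, q_w in q.items() if q_w > 0.0)

# Hypothetical model distribution at one position.
p = {"cat": 0.6, "kitten": 0.3, "car": 0.1}

# A one-hot target recovers the usual negative log-likelihood of "cat".
hard = soft_cross_entropy({"cat": 1.0}, p)

# A soft "cloud" shares some target mass with a near-synonym.
soft = soft_cross_entropy({"cat": 0.8, "kitten": 0.2}, p)
```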

For preference learning, token weightings enter sequence-level objectives—such as Direct Preference Optimization (DPO)—by replacing the sum over token log-ratio differences with a weighted sum (Yang et al., 26 May 2025, Li et al., 24 May 2025, Liu et al., 2024).
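The sketch below shows one way a weighted sum of token log-ratios can enter a DPO-style loss for a single preference pair. It assumes the weights are supplied as constants and uses the standard DPO form with temperature `beta`. It is not the exact objective of any of the cited papers, and all argument names are illustrative.

```python
import math

def weighted_dpo_loss(pol_w, ref_w, pol_l, ref_l, weights_w, weights_l, beta=0.1):
    """Token-weighted DPO-style loss for one (preferred, rejected) pair.

    pol_* / ref_*: per-token log-probabilities under the policy and the
    reference model for the preferred (w) and rejected (l) responses.
    weights_*: per-token weights, treated as constants.
    """
    # Weighted sums of token-level log-ratios replace the plain sums.
    margin_w = sum(w * (p - r) for w, p, r in zip(weights_w, pol_w, ref_w))
    margin_l = sum(w * (p - r) for w, p, r in zip(weights_l, pol_l, ref_l))
    z = beta * (margin_w - margin_l)
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# With all weights equal to 1 this reduces to the standard DPO loss.
loss = weighted_dpo_loss(
    pol_w=[-1.0, -2.0], ref_w=[-1.5, -2.0],
    pol_l=[-1.0], ref_l=[-0.5],
    weights_w=[1.0, 1.0], weights_l=[1.0],
)
```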

2. Weight Construction and Instantiation

Token weights $w_t$ can be engineered or learned via several methodologies:

A. Frequency-based weighting:

Low-frequency (rare) tokens are upweighted to correct corpus imbalance. For example, the weight of a target token $y_t$ can be defined as a function of its corpus count and the median count, controlled by a small set of hyperparameters. Monotonicity and bounded expectation are maintained, and the weights are constructed so that no token, however common, receives a weight below 1 (Gu et al., 2020).
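A minimal sketch of a frequency-based weight with these properties follows. The exponential form `1 + amplitude * exp(-decay * count / median)` and the parameter names `amplitude` and `decay` are assumptions chosen for illustration. They are not claimed to be the formula used by Gu et al. (2020).

```python
import math
from collections import Counter
from statistics import median

def frequency_weights(corpus_tokens, amplitude=1.0, decay=1.0):
    """Map each token type to a weight that is >= 1 and larger for rarer tokens.

    Assumed form: w = 1 + amplitude * exp(-decay * count / median_count).
    """
    counts = Counter(corpus_tokens)
    med = median(counts.values())
    return {
        tok: 1.0 + amplitude * math.exp(-decay * c / med)
        for tok, c in counts.items()
    }

# Toy corpus: one frequent, one mid-frequency, and one rare token type.
corpus = ["the"] * 8 + ["cat"] * 3 + ["axolotl"]
weights = frequency_weights(corpus)
```

Because the exponential term is strictly positive, every weight stays above 1, and the weight decreases monotonically with the count.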

B. Difficulty/Entropy-based weighting:

Model “difficulty” is quantified as the entropy $H_t$ of the predicted distribution at step $t$, and the weight increases with $H_t$, with a tunable focus parameter controlling how strongly (Su et al., 2023). This directs more gradient to uncertain (high-entropy) tokens.
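To make the entropy-based idea concrete, the sketch below computes the entropy of a predicted distribution and raises it to a power. The plain power form and the parameter name `gamma` are assumptions for illustration; the exact functional form in Su et al. (2023) may differ.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a probability distribution given as a list."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def entropy_weight(dist, gamma=1.0):
    """Assumed weight: entropy of the predicted distribution raised to gamma."""
    return entropy(dist) ** gamma

# A confident prediction receives a small weight, an uncertain one a large weight.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
w_confident = entropy_weight(confident, gamma=2.0)
w_uncertain = entropy_weight(uncertain, gamma=2.0)
```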

C. Model confidence weighting:

Infer token weights from a teacher or student model’s token-wise confidence $c_t$, e.g., as an increasing function of $c_t$ with a hyperparameter controlling “stiffness” (Keren et al., 2024). This is useful in semi-supervised or noisy-label conditions.

D. Information gain/semantic gain weighting:

Estimate conditional semantic information gain from adding a token (e.g., by change in feature-space dispersion or prefix-conditional uncertainty reduction) (Chiu et al., 25 Jan 2026).

E. Dynamic difficulty (self-weighting):

Weight tokens by the model's own prediction confidence during training, e.g., with a weight that decreases as $p_t$ grows, where $p_t = P_\theta(y_t \mid y_{<t}, x)$ is the model probability of the target token. This suppresses learning on trivial tokens and accentuates hard cases (Jiang et al., 2020).
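One hyperparameter-free weight with this behavior is a shifted cosine of the target probability, sketched below. The specific form `cos(pi * p) + 1` is an assumption used for illustration and is not claimed to be the exact function from Jiang et al. (2020). In a real training loop the weight would typically be computed from the current model output at every step.

```python
import math

def self_weight(p_target):
    """Assumed weight: 2 when the target has probability 0, falling to 0 at probability 1."""
    if not 0.0 <= p_target <= 1.0:
        raise ValueError("p_target must be a probability")
    return math.cos(math.pi * p_target) + 1.0

def self_weighted_nll(target_probs):
    """Sum of -w_t * log p_t with weights derived from the same probabilities."""
    return -sum(self_weight(p) * math.log(p) for p in target_probs)

# An easy token (p = 0.95) is almost ignored; a hard token (p = 0.1) is emphasized.
easy_w = self_weight(0.95)
hard_w = self_weight(0.1)
```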

F. Importance sampling and contrastive model difference:

Estimate token importance from log-probability differences between “preferred” and “non-preferred” models or via contrastive prompts, e.g., by deriving each token's weight from the gap between the log-probabilities that the two models assign to it (Liu et al., 2024).
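The following sketch turns a per-token log-probability gap into a weight by clamping the gap and exponentiating it. The clamped-exponential form and the parameters `mu`, `low`, and `high` are illustrative assumptions rather than the published TIS-DPO estimator.

```python
import math

def contrastive_weights(logp_pos, logp_neg, mu=1.0, low=-1.0, high=1.0):
    """Per-token weights from the log-probability gap between two models.

    logp_pos / logp_neg: log-probabilities of each token under the
    "preferred" and "non-preferred" model. The gap is clamped to
    [low, high] so that a single extreme token cannot dominate.
    """
    weights = []
    for lp_pos, lp_neg in zip(logp_pos, logp_neg):
        gap = min(max(lp_pos - lp_neg, low), high)
        weights.append(math.exp(mu * gap))
    return weights

# Hypothetical log-probabilities for a three-token response.
w = contrastive_weights([-0.5, -2.0, -0.1], [-0.5, -0.7, -4.0])
```

Tokens on which the two models agree receive weight 1, and the clamp bounds every weight between `exp(mu * low)` and `exp(mu * high)`.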

G. Optimal transport-derived weights:

Learn inter-response token attributions using an optimal transport plan between preferred/rejected responses; aggregate row and column marginal flows as token weights for loss scaling (Li et al., 24 May 2025).

3. Implementation, Algorithmic Structure, and Tuning

The introduction of token-weighted loss entails minimal changes to established training routines:

Per-token weights are precomputed (static) or derived dynamically per batch.
The forward pass accumulates weighted token losses; standard autograd handles gradient scaling.
Normalization (e.g., mean per-batch scaling to 1) is often essential for numerically stable training (Keren et al., 2024).
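The steps above can be sketched for a padded batch as follows. This is a framework-free illustration that assumes per-token log-probabilities of the gold tokens are already available. In an autograd framework the same arithmetic would be applied to tensors, and the helper names here are hypothetical.

```python
def normalize_weights(weights, mask):
    """Rescale weights so their mean over real (unmasked) tokens equals 1."""
    active = [w for row_w, row_m in zip(weights, mask)
              for w, m in zip(row_w, row_m) if m]
    mean_w = sum(active) / len(active)
    if mean_w <= 0.0:
        raise ValueError("mean weight over real tokens must be positive")
    return [[w / mean_w if m else 0.0 for w, m in zip(row_w, row_m)]
            for row_w, row_m in zip(weights, mask)]

def batch_weighted_loss(target_log_probs, weights, mask):
    """Mean weighted negative log-likelihood over the real tokens of a batch."""
    norm = normalize_weights(weights, mask)
    n_tokens = sum(1 for row in mask for m in row if m)
    total = 0.0
    for row_w, row_lp, row_m in zip(norm, target_log_probs, mask):
        for w, lp, m in zip(row_w, row_lp, row_m):
            if m:  # padding positions contribute nothing
                total -= w * lp
    return total / n_tokens

# Two sequences of length 2; the last position of the second one is padding.
log_probs = [[-1.0, -2.0], [-3.0, 0.0]]
raw_weights = [[1.0, 3.0], [2.0, 5.0]]
mask = [[1, 1], [1, 0]]
loss = batch_weighted_loss(log_probs, raw_weights, mask)
```

Normalizing to a mean of 1 keeps the overall loss scale comparable to the unweighted objective, so an existing learning rate does not need to be retuned solely because of the weighting.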

Hyperparameter search (e.g., over the focus or stiffness parameters of the weight functions) is routinely performed on held-out validation sets to maximize downstream task metrics. Sane initialization and regularization prevent dominance by outliers or noisy signals.

4. Applications and Empirical Benefits

Token-weighted loss, in its diverse forms, is now adopted across multiple domains and architectures:

Neural machine translation and text generation:

Frequency/difficulty weighting improves recall of rare tokens and lexical diversity, with BLEU gains reported on rare-heavy sentence buckets (Gu et al., 2020, Su et al., 2023).
Entropy-based weights (MiLe Loss) consistently outperform classic cross-entropy in low- and mid-frequency tokens and across reasoning benchmarks (Su et al., 2023).

Preference optimization in LLMs:

Token-Importance Guided DPO (TI-DPO) supplants uniform log-ratio sums in DPO loss with gradient-based importance weights, yielding measurable gains on generalization and speed of convergence relative to DPO and other RLHF baselines (Yang et al., 26 May 2025).
TIS-DPO leverages token-level importance sampling from contrastive LLM predictions, attaining large increases in safety and helpfulness scores in alignment as measured by external rewards (Liu et al., 2024).
OTPO employs optimal transport to adapt weights per instance, achieving superior length-controlled win-rates and interpretable attributions (Li et al., 24 May 2025).

Long-context LMs:

Token-wise weights reflecting differential context reliance lead to strong improvements on long-context benchmarks, using a simple two-step scoring/postprocessing framework (Helm et al., 12 Mar 2025).

ASR and sequence transduction:

Confidence-weighted token losses in RNN-T enable recovery of 64–99% of accuracy lost to corrupted transcripts, with up to 38% relative reduction in WER (Keren et al., 2024).

Speaker change detection:

Edit-distance-based token weighting allows explicit penalization of rare error types, dramatically increasing recall (e.g., 16.8% relative recall gain without harming precision) (Zhao et al., 2022).

Repetition reduction in NLG:

TLDR shows that reweighting tokens by model-difficulty, even with no extra hyperparameters, roughly halves repetition metrics with negligible quality drop (Jiang et al., 2020).

Generative recommender systems:

Multi-objective curriculum training with semantic and frequency-aware token weights achieves +6% to +7% absolute Hit@5/NDCG@5 improvement and enhanced robustness on tail items (Chiu et al., 25 Jan 2026).

5. Comparative Analysis and Method Families

The space of token-weighted losses encompasses a rich variety of methodologically distinct approaches, yet several structural themes emerge:

Family	Key Weight Function	Primary Domain(s)
Frequency-based	$w_t$ increases as token frequency decreases	NMT, text generation
Difficulty/entropy-based	$w_t$ increases with predictive entropy	LM pretraining, NLG
Confidence/teacher-based	$w_t$ increases with confidence in $y_t$	ASR, SSL, NLU
Dynamic/hardness (self)	$w_t$ decreases as the target-token probability increases	NLG, anti-repetition
Information gain/OT	$w_t$ from semantic or context gain/flow	LLM alignment, recsys
Contrastive/prob-difference	$w_t$ from log-probability difference between contrastive models	Preference/RLHF
Sequence/soft targets	$q_t$ from similarity/embedding proximity	captioning/translation

While all approaches modulate gradient flow at the token level, frequency and entropy weights address structural and statistical corpus biases, contrastive and OT-based methods target semantic alignment, and confidence or hardness weights enhance robustness to noise or optimize learning efficiency.

6. Limiting Factors, Open Problems, and Extensions

Notwithstanding empirically demonstrated benefits, token-weighted losses introduce new design and tuning dimensions:

Overweighting highly uncertain or noisy positions may amplify annotation errors (Su et al., 2023).
Excessive upweighting of rare items risks destabilizing optimization or sacrificing performance on frequent classes (Gu et al., 2020).
Down-weighting trivial tokens may compromise calibration/perplexity metrics even as downstream accuracy rises.
In RLHF, the construction of contrastive models or optimal transport plans incurs extra computation, although empirical evidence suggests such costs are manageable (Li et al., 24 May 2025, Liu et al., 2024).

Future research explores adaptive or learned weight functions (possibly end-to-end), fusion with structured sequence-level preferences, and broader application to tasks with complex token-informative structures such as code generation, automatic evaluation, or dialog act prediction.

7. References to Key Works

Token-level adaptive objectives for NMT: (Gu et al., 2020)
MiLe Loss (entropy-based token weighting): (Su et al., 2023)
TLDR (dynamic self-weighting): (Jiang et al., 2020)
Token-weighted RNN-T for noisy ASR/SSL: (Keren et al., 2024)
OTPO (optimal transport-based preference optimization): (Li et al., 24 May 2025)
TI-DPO (gradient-based importance weights in DPO): (Yang et al., 26 May 2025)
TIS-DPO (contrastive model-based token importance in DPO): (Liu et al., 2024)
Token-weighted multi-target recommendation: (Chiu et al., 25 Jan 2026)
Token weighting for long-range LM: (Helm et al., 12 Mar 2025)
Speaker change detection (SCD) with token-level loss: (Zhao et al., 2022)
Token-level and sequence-level smoothing: (Elbayad et al., 2018)