
Sequence Alignment (un)Likelihood Training (SALT)

Updated 7 February 2026
  • Sequence Alignment (un)Likelihood Training (SALT) is a method that integrates token-level human edits into seq2seq models to selectively reinforce correct tokens and penalize errors.
  • It employs the Needleman–Wunsch algorithm to align model outputs with human edits, creating precise token masks for likelihood and unlikelihood objectives.
  • Empirical evaluations show SALT outperforms both maximum likelihood training and traditional RLHF, achieving higher ROUGE and UMLS-F1 scores in summarization tasks.

Sequence Alignment (un)Likelihood Training (SALT) is a method for refining sequence-to-sequence models using fine-grained supervision derived from token-level human edits. By integrating human-corrected feedback and systematically penalizing tokens tagged as erroneous, SALT outperforms both standard maximum likelihood training and traditional reinforcement learning from human feedback (RLHF) in the context of both general-domain and medical summarization. This approach harnesses explicit alignment between model predictions and human revisions to enable selective likelihood and unlikelihood training, maximizing correspondence with validated outputs while minimizing reiterated mistakes (Yao et al., 2023).

1. Motivation and Conceptual Foundations

Traditional fine-tuning using maximum likelihood estimation does not differentiate between the significance of errors: minor grammatical shifts and substantial factual discrepancies contribute equally to the loss. In authentic summarization workflows, users often manually edit model-generated summaries, providing explicit token-level information on what to delete, retain, or insert. Such edits encapsulate nuanced, context-sensitive preferences not readily captured by simple preference-based ranking feedback.

SALT operationalizes this feedback by intertwining two complementary objectives:

  • The likelihood objective reinforces tokens in the human-edited summary, thereby encouraging retention and generation of user-verified and newly inserted content.
  • The unlikelihood objective penalizes the model for predicting tokens that users have excised or replaced, explicitly dissuading repeated mistakes.

This dual mechanism provides a more targeted supervisory signal, simultaneously optimizing for correction and factual improvement (Yao et al., 2023).

2. Technical Formalism of SALT

Given an input $x$ (for example, the source document or conversation), the model generates an initial summary $y^{\text{gen}} = S_{AI}$. A human then edits this summary, yielding $y^{\text{edit}} = S_E$. The model parameterizes a distribution $P_\theta$ over output tokens.

The objectives comprising SALT are defined as follows:

  • Likelihood Loss:

$$L_{\text{likelihood}} = -\,\mathbb{E}_{(x,\,y^{\text{edit}})}\bigl[\log P_\theta(y^{\text{edit}} \mid x)\bigr]$$

  • Unlikelihood Loss:

$$L_{\text{unlikelihood}} = -\,\mathbb{E}_{(x,\,y^{\text{gen}})}\Bigl[\sum_{t=1}^{T} \log\bigl(1 - P_\theta(y^{\text{gen}}_t \mid x,\, y^{\text{gen}}_{<t})\bigr)\Bigr]$$

  • Combined Objective:

$$L_{\text{SALT}} = L_{\text{likelihood}} + \lambda\, L_{\text{unlikelihood}}$$

with $\lambda \geq 0$ a trade-off parameter balancing reinforcement of user-confirmed tokens against penalization of erroneous ones.
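As a concrete illustration, the combined objective for a single example can be computed from per-token probabilities and a changed-token mask. The following is a minimal sketch in plain Python; the function name and interface are illustrative, not the paper's implementation:

```python
import math

def salt_loss(p_edit, p_gen, changed_mask, lam=1.0):
    """Sketch of the combined SALT objective on one example.

    p_edit       -- model probabilities P_theta(y_t^edit | x, y_<t^edit)
                    for each token of the human-edited summary
    p_gen        -- model probabilities P_theta(y_t^gen | x, y_<t^gen)
                    for each token of the model's own summary
    changed_mask -- 1 where the user changed/deleted the generated
                    token (AI-C), 0 where it was kept
    lam          -- trade-off weight lambda (>= 0)
    """
    # Likelihood term: reinforce every token of the edited summary.
    likelihood = -sum(math.log(p) for p in p_edit)
    # Unlikelihood term: penalize only tokens the user removed or replaced.
    unlikelihood = -sum(m * math.log(1.0 - p)
                        for p, m in zip(p_gen, changed_mask))
    return likelihood + lam * unlikelihood
```

A real implementation would compute these terms from the model's logits and backpropagate through them; the sketch only makes the masking explicit.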

After aligning $S_{AI}$ and $S_E$ with the Needleman–Wunsch algorithm, per-token losses are further weighted using binary indicators:

  • $1_{\text{AI-C}}(t) = 1$ if token $t$ in $S_{AI}$ was changed or deleted by the user,
  • $1_{\text{AI-NC}}(t) = 1$ if token $t$ in $S_{AI}$ was not changed,
  • $1_{\text{E-C}}, 1_{\text{E-NC}}$ defined analogously for $S_E$.

The sequence-level losses aggregate these per-token contributions, weighted by learned coefficients (Yao et al., 2023).

3. Sequence Alignment and Mask Construction

Critical to SALT is the precise localization of candidate tokens for reinforcement or penalization. The Needleman–Wunsch global alignment algorithm aligns $S_{AI}$ and $S_E$ at the token level, labeling each pairwise position as a correspondence, insertion, deletion, or substitution. This enables SALT to derive explicit binary masks for "changed/deleted" ($\text{AI-C}$), "not changed" ($\text{AI-NC}$), "edited" ($\text{E-C}$), and "not edited" ($\text{E-NC}$) positions.

These masks directly inform the per-token application of likelihood and unlikelihood losses: only those model tokens actually removed or replaced by the user are penalized, and only truly validated tokens are reinforced. This selective, fine-grained supervision distinguishes SALT from rewards over entire sequences or implicit preference modeling (Yao et al., 2023).
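The alignment-and-masking step can be sketched as follows. The scoring parameters (`match`, `mismatch`, `gap`) are illustrative defaults, not the paper's settings, and the function returns only the "changed" masks over each sequence:

```python
def align_and_mask(s_ai, s_e, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch alignment of generated (s_ai) and edited (s_e)
    token lists; returns binary changed-token masks over each sequence."""
    n, m = len(s_ai), len(s_e)
    # DP table of best global alignment scores.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if s_ai[i-1] == s_e[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Traceback: mark each token changed (1) or kept (0).
    ai_changed, e_changed = [1] * n, [1] * m
    i, j = n, m
    while i > 0 and j > 0:
        diag = score[i-1][j-1] + (match if s_ai[i-1] == s_e[j-1] else mismatch)
        if score[i][j] == diag:
            if s_ai[i-1] == s_e[j-1]:   # correspondence: token kept
                ai_changed[i-1] = e_changed[j-1] = 0
            i, j = i - 1, j - 1         # substitution stays marked changed
        elif score[i][j] == score[i-1][j] + gap:
            i -= 1                      # deletion: S_AI token removed by user
        else:
            j -= 1                      # insertion: new token added by user
    return ai_changed, e_changed
```

For example, aligning `["the", "cat", "sat"]` with the edit `["the", "dog", "sat"]` marks only the substituted middle token as changed on both sides; the $\text{AI-NC}$ and $\text{E-NC}$ masks are simply the complements.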

4. Imitation Edits and Smoothing Strategies

Because curating large volumes of human-edited data $S_E$ is cost-prohibitive, SALT introduces imitation edits. Here, any ground truth summary $S_I$ (e.g., from datasets such as CNN/DailyMail or XSum) is treated as if it were a user edit of $S_{AI}$, with alignment and mask construction proceeding identically.

To mitigate the instability inherent in this approach (since $S_I$ is not a genuine user edit), two smoothing heuristics are applied:

  • Only runs of two or more consecutive deleted tokens are penalized, reducing spurious gradients from function words or punctuation.
  • Examples where over 60% of $S_{AI}$ tokens would be marked "changed" are discarded, capping the effect of extreme divergences.

These stabilizations maintain the integrity of the loss signal in the absence of genuine human corrections (Yao et al., 2023).
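A minimal sketch of these two heuristics, operating on a binary changed-token mask; the function name and the `min_run`/`max_changed_frac` parameters are hypothetical, with thresholds taken from the text above:

```python
def smooth_changed_mask(mask, min_run=2, max_changed_frac=0.6):
    """Apply the two imitation-edit stabilizations to a changed-token mask.
    Returns the smoothed mask, or None when the example should be discarded."""
    if mask and sum(mask) / len(mask) > max_changed_frac:
        return None                          # >60% changed: drop the example
    smoothed, run = [0] * len(mask), []
    for idx, bit in enumerate(mask + [0]):   # trailing 0 flushes the last run
        if bit:
            run.append(idx)
        else:
            if len(run) >= min_run:          # keep only runs of >=2 deletions
                for k in run:
                    smoothed[k] = 1
            run = []
    return smoothed
```

An isolated changed token (e.g., a dropped function word) is zeroed out, while a run of two or more survives; a mask that flags most of the summary causes the example to be skipped entirely.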

5. SALT Training Loop

The SALT training process follows these core steps:

  1. Initialize the model from a pre-trained checkpoint.
  2. Generate $S_{AI}$ predictions for each input (offline or on-the-fly).
  3. For each minibatch:
    • Fetch $(S_{AI}, S_{\text{edit}})$ pairs, where $S_{\text{edit}}$ is a human edit ($S_E$) or an imitation edit ($S_I$).
    • Align $S_{AI}$ and $S_{\text{edit}}$ to derive masks.
    • Compute masked per-token likelihood and unlikelihood losses.
    • Compute the final loss $L = L_{\text{likelihood}} + \lambda L_{\text{unlikelihood}}$.
    • Backpropagate the loss and update parameters.

This workflow centralizes sequence alignment and masking in each training iteration, ensuring targeted updating based on edit provenance (Yao et al., 2023).
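The per-example computation inside this loop can be sketched end to end. Here Python's `difflib.SequenceMatcher` stands in for Needleman–Wunsch alignment, and `p_of` is a stand-in for the model's per-token probability; both are simplifications for illustration, not the paper's implementation:

```python
import difflib
import math

def salt_step(s_ai, s_edit, p_of, lam=1.0):
    """One SALT step on a single (generated, edited) token-list pair.
    Returns the scalar loss a real trainer would backpropagate."""
    # Align the two summaries and build the AI-C mask: 1 = changed/deleted.
    ai_changed = [1] * len(s_ai)
    matcher = difflib.SequenceMatcher(a=s_ai, b=s_edit)
    for block in matcher.get_matching_blocks():
        for k in range(block.a, block.a + block.size):
            ai_changed[k] = 0            # kept tokens are not penalized
    # Masked likelihood over the edit, unlikelihood over changed tokens.
    likelihood = -sum(math.log(p_of(t)) for t in s_edit)
    unlikelihood = -sum(m * math.log(1 - p_of(t))
                        for t, m in zip(s_ai, ai_changed))
    return likelihood + lam * unlikelihood
```

In a real trainer, `p_of` would be replaced by teacher-forced probabilities from $P_\theta$ and the returned loss fed to the optimizer in step 3's final bullet.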

6. Comparative Performance and Empirical Validation

SALT is empirically validated against Direct Preference Optimization (DPO), a recent RLHF method in which $S_{AI}$ is designated as "rejected" and $S_E$ as "chosen" under a preference-maximizing loss. On the CCUser medical summarization data, SALT$_{l+u}$ achieves superior results: ROUGE-1 of 0.394 vs. 0.379 (±0.015) for DPO$_{\beta=0.1}$, and Meteor of 0.320 vs. 0.301; reward accuracy is also higher for SALT.

The approach is also tested on T5-small and T5-large models, using the CCUser, CNN/DailyMail, and XSum datasets. Key metrics include ROUGE-1 F1, UMLS-F1, GPT-4 human-preference MRR, and SAGE scores quantifying old mistakes ($G_{w1}$), new information ($G_{w2}$), and verified tokens ($G_{w3}$). On CCUser_eval, SALT$_{l+u}$ improves ROUGE-1 from 57.77 to 58.39 and UMLS-F1 from 61.02 to 62.13. Augmentation with imitation edits (SALT$_{l+u}$+RSALT$_{l+u}$) further raises ROUGE-1 to 36.26 on CC_eval, surpassing the baseline at 36.07 (Yao et al., 2023).

7. Ablation Studies, Analysis, and Practical Recommendations

Ablation experiments delineate the contribution of each SALT variant. Using only the unlikelihood loss (SALT$_u$) sharply reduces repeated mistakes ($G_{w1}$, by 17%) but does not increase novel, informative tokens. Conversely, increasing the weight on user-kept tokens (SALT$_{l_i}$) raises the proportion of new tokens ($G_{w2}$) by 4.3%. The combined SALT$_{l+u}$ variant reduces repeated mistakes by 5.4% while boosting verified content by 2.9%.

Tuning $\lambda \approx 1$ yields an optimal trade-off; excessive penalization ($\lambda \gg 1$) impairs factual recall. Further, imitation edit-based training yields ROUGE-1 improvements of 0.6–1.3 points over likelihood alone, and alignment smoothing enhances training stability by approximately 0.2 ROUGE points. A plausible implication is that fine-grained, token-level supervision extends readily to abundant or replayed data via imitation edits without compromising training effectiveness (Yao et al., 2023).


In summary, SALT leverages explicit token-level judgments embedded in human edits to provide targeted reinforcement and penalization within the sequence modeling loop. The combination of global sequence alignment, per-token masking, and flexible incorporation of both human and imitation edits offers measurable improvements over both naïve likelihood training and preference-based RLHF, particularly in domains such as medical summarization demanding precision supervision (Yao et al., 2023).
