Contrastive Unlikelihood Training

Updated 23 April 2026
  • Contrastive Unlikelihood Training is a technique that integrates likelihood-based and contrastive objectives to suppress problematic tokens and reduce text degeneration in language models.
  • It employs token-level loss functions that impose a sharp margin between desired and undesirable outputs, ensuring stable gradients and improved generation quality.
  • Empirical evaluations show that CUT significantly lowers repetition rates and improves model alignment, outperforming traditional unlikelihood and reward-based methods.

Contrastive Unlikelihood Training (CUT) is a class of training objectives for language models (LMs) that explicitly penalize undesirable outputs by contrasting the probabilities assigned to correct and problematic tokens or behaviors. CUT combines contrastive and unlikelihood principles with standard likelihood-based objectives, enabling more targeted suppression of degenerate or misaligned outputs without loss of general modeling capacity. While the original formalism addressed text degeneration via token-level repetition, recent work extends CUT to fine-grained alignment with natural language judgments, providing a general approach to both improving generation quality and aligning instruction-following models (Jiang et al., 2022, Xu et al., 2023).

1. Theoretical Foundations and Loss Formulation

CUT operates by distinguishing between positive (desirable) and negative (problematic or misaligned) tokens, contrasting their predicted probabilities within the loss computation. In the original formulation for text degeneration, at time step $t$ with context $x_{<t}$, the loss consists of two components:

  • Cross-Entropy (CE):

$$L_{\mathrm{CE}}^t = -\log p(x_t \mid x_{<t})$$

where $x_t$ is the label token.

  • Contrastive Unlikelihood (CUT) Loss:

$$L_{\mathrm{CUT}}^t = \log\left(1 + \sum_{n \in S_N^t} \exp\left(z_n^{(t)} - z_{x_t}^{(t)}\right)\right)$$

Here, $S_N^t$ denotes a small, selected set of negative tokens (e.g., recent context tokens likely to cause repetition), and $z_j^{(t)}$ is the pre-softmax logit of token $j$ at step $t$. This imposes a margin between the label and negative token logits.

  • Combined Objective: The full per-step loss is

$$L^t = L_{\mathrm{CE}}^t + L_{\mathrm{CUT}}^t$$
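As a minimal sketch of the per-step objective, the following pure-Python snippet evaluates the cross-entropy term, the CUT term, and their sum for a single time step. The function names and the toy logits are illustrative assumptions, not part of the cited work, and the plain `exp` call is not hardened against very large logit differences.

```python
import math

def log_softmax_at(logits, index):
    """Return log p(index) under softmax(logits), using a max shift for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[index] - log_z

def cut_step_loss(logits, label, negatives):
    """Per-step (CE, CUT, CE + CUT) for one time step.

    logits    : pre-softmax scores z^(t), one per vocabulary item
    label     : index of the label token x_t
    negatives : indices forming the negative set S_N^t
    """
    ce = -log_softmax_at(logits, label)
    # log(1 + sum_n exp(z_n - z_label)); the label itself is never a negative.
    cut = math.log1p(sum(math.exp(logits[n] - logits[label])
                         for n in negatives if n != label))
    return ce, cut, ce + cut

# Toy example (hypothetical values): 4-token vocabulary, label 0, one negative.
logits = [2.0, 0.5, 1.5, -1.0]
ce, cut, total = cut_step_loss(logits, label=0, negatives=[2])
```

With an empty negative set the CUT term is exactly zero, so the objective reduces to plain cross-entropy.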

Recent extensions to model alignment with natural language judgments employ a more sophisticated contrastive signal:

  • Contrastive Unlikelihood Loss: Identify positions $U$ in response $y$ where negative judgments would over-justify a token, then suppress those tokens via the unlikelihood term:

[Equations defining the position set $U$ and the unlikelihood term are missing from the source.]

  • Contrastive Likelihood Loss: Ensures model gives correct responses higher probability when only positive judgments are provided, and restricts likelihood of inappropriate responses conditioned on negative judgments:

[Equation for the contrastive likelihood loss is missing from the source.]

  • Total Loss:

[Equation for the total loss is missing from the source.]

2. Motivation and Conceptual Advancements

The chief motivation for CUT is the inadequacy of standard cross-entropy and vanilla unlikelihood training (UL-T) in addressing text degeneration, particularly repetition and incoherence. CE treats all negatives equally, lacking discrimination between harmful (e.g., repeated) and irrelevant tokens. UL-T suppresses repeated tokens but does so in a manner that interacts non-locally due to the summation inside the log, with potential instability and indirect effects on irrelevant tokens.

CUT directly contrasts true tokens with a targeted negative set, introducing a sharp margin in the logits and producing more stable gradients that do not affect the bulk of the vocabulary. In the context of alignment, CUT leverages full-text judgments to localize the suppression of problematic behaviors at the token level, contrasting model outputs under positive and negative feedback conditions (Jiang et al., 2022, Xu et al., 2023).
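The claim that gradients leave the bulk of the vocabulary untouched can be checked by differentiating the token-level CUT term $L_{\mathrm{CUT}}^t = \log\left(1 + \sum_{n \in S_N^t} \exp(z_n^{(t)} - z_{x_t}^{(t)})\right)$ with respect to the logits. The short derivation below is a sketch that assumes the label $x_t$ is not itself a member of $S_N^t$; the shorthand $D$ is introduced here for convenience.

```latex
\begin{aligned}
D &= 1 + \sum_{n \in S_N^t} \exp\!\left(z_n^{(t)} - z_{x_t}^{(t)}\right), \\
\frac{\partial L_{\mathrm{CUT}}^t}{\partial z_n^{(t)}}
  &= \frac{\exp\!\left(z_n^{(t)} - z_{x_t}^{(t)}\right)}{D}
  \qquad (n \in S_N^t), \\
\frac{\partial L_{\mathrm{CUT}}^t}{\partial z_{x_t}^{(t)}}
  &= -\frac{D - 1}{D}, \\
\frac{\partial L_{\mathrm{CUT}}^t}{\partial z_j^{(t)}}
  &= 0
  \qquad (j \notin S_N^t \cup \{x_t\}).
\end{aligned}
```

Only the label logit (pushed up) and the selected negative logits (pushed down) receive gradient from this term, and the magnitudes are bounded by 1.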

3. Implementation Methodology

Token-Level Degeneration Suppression

For standard language modeling:

  1. Run the model on the context $x_{<t}$ to obtain the logits $z^{(t)}$.
  2. Identify the label token $x_t$ and the negative set $S_N^t$ (typically the last $M$ context tokens).
  3. Calculate $L_{\mathrm{CUT}}^t$ and $L_{\mathrm{CE}}^t$.
  4. Backpropagate total loss.
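The steps above can be sketched for a whole sequence as follows. This is an illustrative pure-Python version under stated assumptions: the negative set is taken to be the distinct tokens among the previous `window` context tokens with the label removed, and the per-step losses are averaged over positions. Both choices, and all function names, are assumptions made for the example. Backpropagation (step 4) would be handled by an automatic differentiation framework and is not shown.

```python
import math

def cross_entropy(logits, label):
    """-log softmax(logits)[label], computed with a max shift."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[label]

def cut_term(logits, label, negatives):
    """log(1 + sum over negatives of exp(z_n - z_label))."""
    return math.log1p(sum(math.exp(logits[n] - logits[label]) for n in negatives))

def sequence_loss(step_logits, tokens, window):
    """Average of CE + CUT over all positions of one sequence.

    step_logits[t] : logits used to predict tokens[t] from tokens[:t]
    tokens         : list of token ids
    window         : number of preceding context tokens used as negatives
    """
    total = 0.0
    for t, label in enumerate(tokens):
        context = tokens[max(0, t - window):t]
        negatives = {tok for tok in context if tok != label}   # step 2
        total += cross_entropy(step_logits[t], label)           # step 3
        total += cut_term(step_logits[t], label, negatives)     # step 3
    return total / len(tokens)

# Toy example (hypothetical): 3-token vocabulary, uniform logits at every step.
tokens = [0, 1, 1]
step_logits = [[0.0, 0.0, 0.0] for _ in tokens]
loss = sequence_loss(step_logits, tokens, window=2)
```

Setting `window=0` empties every negative set, so the function then returns the ordinary average cross-entropy.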

Judgment-Based Alignment

For alignment tasks:

  1. Build triplets consisting of an instruction, a response $y$, and a judgment (positive or negative).
  2. Construct three example types: Align-P (correct response, positive judgment), Align-N (incorrect, negative judgment), Misalign (incorrect, fake positive judgment).
  3. For each batch:
    • For negative judgments, find paired fake positives.
    • Compute per-token probabilities under both judgments.
    • Detect inappropriately justified tokens and apply unlikelihood.
    • Apply contrastive likelihood loss for overall sequence.
    • Backpropagate the total loss.
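The token-detection and unlikelihood steps can be illustrated with a small sketch. The exact detection criterion and loss weighting are not recoverable from this page, so the code below makes two explicit assumptions: a position is flagged when its token probability under the negative judgment exceeds `lam` times its probability under the paired fake positive judgment, and flagged positions contribute `-alpha * log(1 - p)`. The names `lam`, `alpha`, and both functions are hypothetical.

```python
import math

def detect_inappropriate(p_neg, p_pos, lam):
    """Return positions that look over-justified by the negative judgment.

    p_neg[t] : probability of response token t given the negative judgment
    p_pos[t] : probability of the same token given the fake positive judgment
    lam      : assumed detection threshold (hypothetical parameter)
    """
    return [t for t, (pn, pp) in enumerate(zip(p_neg, p_pos)) if pn > lam * pp]

def unlikelihood_loss(probs, flagged, alpha):
    """Assumed unlikelihood penalty: -alpha * sum of log(1 - p) over flagged positions."""
    return -alpha * sum(math.log(1.0 - probs[t]) for t in flagged)

# Toy example (hypothetical probabilities for a 3-token response).
p_neg = [0.9, 0.2, 0.8]
p_pos = [0.8, 0.3, 0.1]
flagged = detect_inappropriate(p_neg, p_pos, lam=1.2)
penalty = unlikelihood_loss(p_pos, flagged, alpha=1.0)
```

In this toy case only the last token is flagged, because it is likely under the negative judgment but unlikely under the positive one. Unflagged tokens receive no unlikelihood penalty, which reflects the fine-grained targeting described in the steps.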

Hyperparameters and Practical Considerations

For text degeneration, the optimal negative window $M$ is around [...] of the sequence length (e.g., [...] for max len [...]), with the loss applied to the first [...] positions. For alignment, key parameters include the hyperparameter used for token detection, the unlikelihood weight, batch size, and learning rate; empirical defaults are established for LLaMA2 and GPT-2 variants (Jiang et al., 2022, Xu et al., 2023).

4. Empirical Evaluation and Results

Language Modeling and Text Degeneration

On Wikitext-103 (GPT-2 Small, M=60), CUT achieves the following (greedy decoding, 100-token continuations):

  • 1-gram repetition rate drops from 71.0% (CE) to 22.1% (CUT)
  • 4-gram repetition rate from 50.9% to 0.8%
  • Unique unigrams increase from 12,787 to 22,832
  • Perplexity rises moderately (18.01→18.72)
  • Human evaluation: CUT preferred over top-k and SimCTG-CS by >55% (Jiang et al., 2022)
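For readers who want to reproduce this kind of measurement, the sketch below computes an n-gram repetition rate. It assumes the common definition rep-n = 1 - (unique n-grams) / (total n-grams); the exact metric definition used in the cited experiments is not stated here, so treat this as one plausible reading. The function name and example sentence are illustrative.

```python
def rep_n(tokens, n):
    """Fraction of duplicated n-grams in a token sequence.

    Assumed definition: 1 - |unique n-grams| / |all n-grams|.
    Returns 0.0 when the sequence is shorter than n.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# Toy example: 9 whitespace-separated tokens with some repetition.
text = "the cat sat on the mat the cat sat".split()
rep1 = rep_n(text, 1)   # 5 unique unigrams out of 9
rep2 = rep_n(text, 2)   # 6 unique bigrams out of 8
rep4 = rep_n(text, 4)   # all 6 four-grams are distinct
```

A corpus-level figure would typically aggregate over many generated continuations rather than a single sentence.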

Open-Domain Dialogue

With BlenderBot 400M (5 datasets):

  • rep-1: 9.2% (CUT) vs. 25.8% (baseline), rep-4: 0.05% vs. 6.6%
  • Unique tokens: 6,404 vs. 5,955
  • Perplexity: 13.26→14.70 (Jiang et al., 2022)

Model Alignment with Judgments

With only 1,317 judgment examples (off-the-shelf Shepherd), LLaMA2-13B fine-tuned with CUT attains an AlpacaEval score of 62.56, compared with 1.87 for the base LLaMA2-13B and 10.22 for the best prior baseline, outperforming the largest reward-based models, including DaVinci003. Iterative online alignment raises the AlpacaEval score to 91.36 with LLaMA2-chat-13B. On TruthfulQA, accuracy improves from 36.28% to 49.36% (Xu et al., 2023).

Model/Approach                AlpacaEval
----------------------------  ----------------------------------------
Base LLaMA2-13B               1.87
Demonstration (MLE only)      7.56
Hindsight                     10.22
DaVinci003 (175B)             <62.56
CUT (LLaMA2-13B)              62.56
CUT (LLaMA2-chat-13B warm)    87.24 (→ 91.36 after iterative alignment)

CUT also increases ROUGE-L for summarization and maintains or improves performance with minimal examples (Xu et al., 2023).

5. Comparative Analysis and Ablation Studies

CUT yields advantages over UL-T and reward-based alignment methods (e.g., DPO):

  • Gradient Focus: CUT gradients suppress only selected negative tokens, preventing undesirable redistribution of probability mass or indirect boosting of irrelevant tokens.
  • Fine-Grained Suppression: Judgment-based CUT targets only tokens explicitly justified by negative judgments; ablations show that removing the contrast or the fine-grained detection leads to large performance drops or divergence.
  • Empirical Superiority: On AlpacaEval, CUT (LLaMA2-chat, re-annotated judgments) achieves 86.36 vs. 62.89 for DPO with UltraFeedback; CUT dominates generation-based evaluation, though DPO slightly edges out on ranking tasks (Xu et al., 2023).
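The gradient-focus property of the token-level CUT term can be verified numerically. The sketch below uses central finite differences on $\log\left(1 + \sum_{n} \exp(z_n - z_{x_t})\right)$ and confirms that logits outside the negative set and the label receive zero gradient. The helper names and toy values are assumptions for illustration only.

```python
import math

def cut_term(logits, label, negatives):
    """log(1 + sum over negatives of exp(z_n - z_label))."""
    return math.log1p(sum(math.exp(logits[n] - logits[label]) for n in negatives))

def numeric_grad(f, logits, eps=1e-6):
    """Central finite-difference gradient of f with respect to each logit."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

# Toy example: 5-token vocabulary, label 0, a single negative token 2.
logits = [2.0, 0.5, 1.5, -1.0, 0.3]
grads = numeric_grad(lambda z: cut_term(z, label=0, negatives=[2]), logits)
```

Here `grads[1]`, `grads[3]`, and `grads[4]` are exactly zero because those logits never enter the loss, while the label and the negative receive equal and opposite gradients.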

Ablations reveal both contrastive and unlikelihood terms are indispensable. Overapplying unlikelihood (dropping token detection) causes divergence.

6. Integration Strategies and Scalability

CUT is applicable to autoregressive and encoder-decoder LMs of varying sizes. For large-scale pre-training or fine-tuning:

  • Loss can be applied after initial stabilization under cross-entropy.
  • Scaling is orthogonal to base model size.
  • Practical hyperparameters are established for various regimes; learning rates may be higher than for other unlikelihood systems without destabilization.
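One simple way to apply the loss only after an initial cross-entropy phase is a step-based switch on the CUT term. The sketch below is a hypothetical scheduling helper: the hard 0/1 switch, the `warmup_steps` parameter, and the function names are assumptions, and a gradual ramp would be an equally plausible choice.

```python
def cut_weight(step, warmup_steps):
    """Weight on the CUT term: 0 during CE-only warmup, 1 afterwards (assumed hard switch)."""
    return 0.0 if step < warmup_steps else 1.0

def scheduled_loss(ce, cut, step, warmup_steps):
    """Combine per-step CE and CUT values according to the warmup schedule."""
    return ce + cut_weight(step, warmup_steps) * cut

# Toy example with hypothetical loss values.
early = scheduled_loss(ce=2.0, cut=0.5, step=10, warmup_steps=100)
late = scheduled_loss(ce=2.0, cut=0.5, step=100, warmup_steps=100)
```

During warmup the objective is pure cross-entropy; from `warmup_steps` onward it matches the combined per-step loss.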

In model alignment, CUT integrates seamlessly with LoRA adapters, standard templates, and batch sizes, making it conducive to efficient and effective alignment workflows.

7. Future Directions and Significance

Contrastive Unlikelihood Training provides a principled, targeted mechanism for mitigating text degeneration and aligning generative models using high-bandwidth supervision such as natural language judgments. Experiments demonstrate that CUT can close much of the gap to reward-based pipelines with much less data, delivering superior specificity, factuality, and alignment with user intent (Jiang et al., 2022, Xu et al., 2023). The mechanisms enable both quality improvement in generation and fine-grained controllable alignment, suggesting broad applicability to ongoing and future work in controllable and safe LLM development.
