
Contrastive Decoding: Open-ended Text Generation as Optimization (2210.15097v2)

Published 27 Oct 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Given a language model (LM), maximum probability is a poor decoding objective for open-ended generation, because it produces short and repetitive text. On the other hand, sampling can often produce incoherent text that drifts from the original topics. We propose contrastive decoding (CD), a reliable decoding approach that optimizes a contrastive objective subject to a plausibility constraint. The contrastive objective returns the difference between the likelihood under a large LM (called the expert, e.g. OPT-13B) and a small LM (called the amateur, e.g. OPT-125M), and the constraint ensures that the outputs are plausible. CD is inspired by the fact that the failures of larger LMs (e.g., repetition, incoherence) are even more prevalent in smaller LMs, and that this difference signals which texts should be preferred. CD requires zero additional training, and produces higher quality text than decoding from the larger LM alone. It also works across model scales (OPT-13B and GPT2-1.5B) and significantly outperforms four strong decoding algorithms (e.g., nucleus, top-k) in automatic and human evaluations across Wikipedia, news and story domains.

Introduction to Contrastive Decoding

The paper introduces contrastive decoding (CD), an approach designed to address common failure modes of open-ended text generation with language models (LMs). Maximum-likelihood decoding tends to produce short, repetitive text, while straightforward sampling often drifts into incoherence and off-topic content. CD leverages both a large LM (the "expert") and a small LM (the "amateur"), using the discrepancy between their predictions to steer generation toward coherent text without sacrificing lexical diversity.
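In notation paraphrased from the abstract (not necessarily the paper's exact symbols), write p_EXP and p_AMA for the expert and amateur distributions, x_pre for the prompt, x_cont for the generated continuation, and alpha for a plausibility threshold. The decoding objective can then be sketched as:

```latex
% Contrastive objective with the plausibility constraint
% (notation paraphrased from the abstract; alpha is the plausibility threshold).
\max_{x_{\text{cont}}}\;
  \log p_{\text{EXP}}(x_{\text{cont}} \mid x_{\text{pre}})
  - \log p_{\text{AMA}}(x_{\text{cont}} \mid x_{\text{pre}})
\quad \text{s.t.} \quad
x_i \in \mathcal{V}_{\text{head}}(x_{<i})
  = \bigl\{\, w : p_{\text{EXP}}(w \mid x_{<i}) \ge \alpha \max_{w'} p_{\text{EXP}}(w' \mid x_{<i}) \,\bigr\}
\ \text{for every generated token } x_i .
```

The constraint restricts each step to tokens the expert itself considers reasonably likely, so the contrastive score cannot reward a token merely because the amateur assigns it near-zero probability.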

Understanding the Approach

CD builds on the observation that smaller LMs exhibit failure modes such as repetition and incoherence even more frequently than their larger counterparts. By scoring continuations with the difference in log probabilities between the large and small LM, and searching this space under a plausibility constraint on the expert's distribution, CD downweights patterns the amateur also favors (such as repetition) while avoiding tokens the expert itself finds implausible; a minimal sketch follows below. Notably, the approach requires no additional training on top of the existing pre-trained models and adapts readily across scales and architectures, such as the OPT and GPT-2 series.
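The sketch below implements the scoring rule just described as a greedy decoder with the Hugging Face transformers library. It is an illustration under stated assumptions rather than the authors' implementation: the model pair (facebook/opt-1.3b as a stand-in expert, facebook/opt-125m as the amateur), the value of alpha, and the greedy search are illustrative choices; the paper contrasts larger experts such as OPT-13B with OPT-125M and searches with beam search.

```python
# Minimal sketch of contrastive decoding (greedy variant).
# Assumptions: torch + transformers installed; model names and alpha are illustrative.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

EXPERT, AMATEUR = "facebook/opt-1.3b", "facebook/opt-125m"  # stand-ins for the paper's pairs
tokenizer = AutoTokenizer.from_pretrained(EXPERT)           # OPT sizes share one tokenizer
expert = AutoModelForCausalLM.from_pretrained(EXPERT).eval()
amateur = AutoModelForCausalLM.from_pretrained(AMATEUR).eval()

@torch.no_grad()
def contrastive_decode(prompt: str, max_new_tokens: int = 50, alpha: float = 0.1) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Next-token log-probabilities under expert and amateur.
        exp_logp = torch.log_softmax(expert(ids).logits[0, -1], dim=-1)
        ama_logp = torch.log_softmax(amateur(ids).logits[0, -1], dim=-1)

        # Plausibility constraint: keep only tokens whose expert probability is
        # at least alpha times that of the expert's most likely token.
        plausible = exp_logp >= exp_logp.max() + math.log(alpha)

        score = exp_logp - ama_logp           # contrastive objective
        score[~plausible] = float("-inf")     # discard implausible tokens

        next_id = score.argmax().view(1, 1)   # greedy step (the paper uses beam search)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(contrastive_decode("Barack Obama was born in Honolulu, Hawaii. He"))
```

Swapping in a GPT-2 pair (for example gpt2-xl as expert and gpt2 as amateur) requires no other changes, which reflects the training-free, architecture-agnostic nature of the method.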

Empirical Validation

The method outperforms several strong decoding baselines, including nucleus, top-k, and typical sampling, across the Wikipedia, news, and story domains. In automatic evaluations, CD achieves higher coherence scores while maintaining fluency comparable to the other methods, and human evaluators likewise prefer its outputs. The gap between CD and the sampling baselines narrows somewhat as model size increases, but CD still provides gains at the scales tested (GPT2-1.5B and OPT-13B).

Advantages and Extensions

CD's use of contrasting probabilities from models of different capacities shows that such discrepancies can be harnessed without any re-training or fine-tuning, which makes the method inexpensive to deploy in practice. The paper also suggests several avenues for further exploration, such as contrasting early and late checkpoints of the same LM or extending the contrastive approach to task-oriented language generation.

In conclusion, contrastive decoding uses existing LMs of different capacities to improve the quality of open-ended text generation. Its ability to produce text that aligns more closely with a given topic while preserving fluent, natural language represents a significant step forward for generative AI.

Authors (8)
  1. Xiang Lisa Li (18 papers)
  2. Ari Holtzman (39 papers)
  3. Daniel Fried (69 papers)
  4. Percy Liang (239 papers)
  5. Jason Eisner (56 papers)
  6. Tatsunori Hashimoto (80 papers)
  7. Luke Zettlemoyer (225 papers)
  8. Mike Lewis (78 papers)
Citations (261)