
Contrastive Decoding: Open-ended Text Generation as Optimization (2210.15097v2)

Published 27 Oct 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Given a language model (LM), maximum probability is a poor decoding objective for open-ended generation, because it produces short and repetitive text. On the other hand, sampling can often produce incoherent text that drifts from the original topics. We propose contrastive decoding (CD), a reliable decoding approach that optimizes a contrastive objective subject to a plausibility constraint. The contrastive objective returns the difference between the likelihood under a large LM (called the expert, e.g. OPT-13B) and a small LM (called the amateur, e.g. OPT-125M), and the constraint ensures that the outputs are plausible. CD is inspired by the fact that the failures of larger LMs (e.g., repetition, incoherence) are even more prevalent in smaller LMs, and that this difference signals which texts should be preferred. CD requires zero additional training, and produces higher quality text than decoding from the larger LM alone. It also works across model scales (OPT-13B and GPT2-1.5B) and significantly outperforms four strong decoding algorithms (e.g., nucleus, top-k) in automatic and human evaluations across Wikipedia, news and story domains.

Introduction to Contrastive Decoding

The paper introduces contrastive decoding (CD), an approach designed to address common failure modes of open-ended text generation with language models (LMs). Maximum-likelihood decoding tends to produce short, repetitive text, while straightforward sampling often drifts into incoherence and off-topic content. CD leverages both a large LM (the "expert") and a small LM (the "amateur"), using the discrepancy between their predictions to steer generation toward coherent text without sacrificing lexical diversity.
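In notation paraphrased from the abstract (not necessarily the paper's exact symbols), write p_EXP and p_AMA for the expert and amateur distributions, x_pre for the prompt, x_cont for the generated continuation, and alpha for a plausibility threshold. The decoding objective can then be sketched as:

```latex
% Contrastive objective with the plausibility constraint
% (notation paraphrased from the abstract; alpha is the plausibility threshold).
\max_{x_{\text{cont}}}\;
  \log p_{\text{EXP}}(x_{\text{cont}} \mid x_{\text{pre}})
  - \log p_{\text{AMA}}(x_{\text{cont}} \mid x_{\text{pre}})
\quad \text{s.t.} \quad
x_i \in \mathcal{V}_{\text{head}}(x_{<i})
  = \bigl\{\, w : p_{\text{EXP}}(w \mid x_{<i}) \ge \alpha \max_{w'} p_{\text{EXP}}(w' \mid x_{<i}) \,\bigr\}
\ \text{for every generated token } x_i .
```

The constraint restricts each step to tokens the expert itself considers reasonably likely, so the contrastive score cannot reward a token merely because the amateur assigns it near-zero probability.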

Understanding the Approach

CD builds on the observation that smaller LMs exhibit failure modes such as repetition and incoherence even more frequently than their larger counterparts. By scoring continuations with the difference in log probabilities between the large and small LM, and searching this space under a plausibility constraint on the expert's distribution, CD downweights patterns the amateur also favors (such as repetition) while avoiding tokens the expert itself finds implausible; a minimal sketch follows below. Notably, the approach requires no additional training on top of the existing pre-trained models and adapts readily across scales and architectures, such as the OPT and GPT-2 series.
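The sketch below implements the scoring rule just described as a greedy decoder with the Hugging Face transformers library. It is an illustration under stated assumptions rather than the authors' implementation: the model pair (facebook/opt-1.3b as a stand-in expert, facebook/opt-125m as the amateur), the value of alpha, and the greedy search are illustrative choices; the paper contrasts larger experts such as OPT-13B with OPT-125M and searches with beam search.

```python
# Minimal sketch of contrastive decoding (greedy variant).
# Assumptions: torch + transformers installed; model names and alpha are illustrative.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

EXPERT, AMATEUR = "facebook/opt-1.3b", "facebook/opt-125m"  # stand-ins for the paper's pairs
tokenizer = AutoTokenizer.from_pretrained(EXPERT)           # OPT sizes share one tokenizer
expert = AutoModelForCausalLM.from_pretrained(EXPERT).eval()
amateur = AutoModelForCausalLM.from_pretrained(AMATEUR).eval()

@torch.no_grad()
def contrastive_decode(prompt: str, max_new_tokens: int = 50, alpha: float = 0.1) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Next-token log-probabilities under expert and amateur.
        exp_logp = torch.log_softmax(expert(ids).logits[0, -1], dim=-1)
        ama_logp = torch.log_softmax(amateur(ids).logits[0, -1], dim=-1)

        # Plausibility constraint: keep only tokens whose expert probability is
        # at least alpha times that of the expert's most likely token.
        plausible = exp_logp >= exp_logp.max() + math.log(alpha)

        score = exp_logp - ama_logp           # contrastive objective
        score[~plausible] = float("-inf")     # discard implausible tokens

        next_id = score.argmax().view(1, 1)   # greedy step (the paper uses beam search)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(contrastive_decode("Barack Obama was born in Honolulu, Hawaii. He"))
```

Swapping in a GPT-2 pair (for example gpt2-xl as expert and gpt2 as amateur) requires no other changes, which reflects the training-free, architecture-agnostic nature of the method.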

Empirical Validation

The method outperforms several strong decoding baselines, including nucleus, top-k, and typical sampling, across the Wikipedia, news, and story domains. In automatic evaluations, CD achieves higher coherence scores while maintaining fluency comparable to the other methods, and human evaluators likewise prefer its outputs. The gap between CD and the sampling baselines narrows somewhat as model size increases, but CD still provides gains at the scales tested (GPT2-1.5B and OPT-13B).

Advantages and Extensions

CD's use of contrasting probabilities from models of different capacities shows that such discrepancies can be harnessed without any re-training or fine-tuning, which makes the method inexpensive to deploy in practice. The paper also suggests several avenues for further exploration, such as contrasting early and late checkpoints of the same LM or extending the contrastive approach to task-oriented language generation.

In conclusion, contrastive decoding uses existing LMs of different capacities to improve the quality of open-ended text generation. Its ability to produce text that aligns more closely with a given topic while preserving fluent, natural language represents a significant step forward for generative AI.

Authors (8)
  1. Xiang Lisa Li (18 papers)
  2. Ari Holtzman (39 papers)
  3. Daniel Fried (69 papers)
  4. Percy Liang (239 papers)
  5. Jason Eisner (56 papers)
  6. Tatsunori Hashimoto (80 papers)
  7. Luke Zettlemoyer (225 papers)
  8. Mike Lewis (78 papers)
Citations (261)