Contrastive Decoding Improves Reasoning in Large Language Models (2309.09117v2)

Published 17 Sep 2023 in cs.CL and cs.AI

Abstract: We demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al 2022 -- achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between strong and weak models. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler modes such as copying sections of the input during chain-of-thought. Overall, Contrastive Decoding outperforms nucleus sampling for long-form generation and greedy decoding for reasoning tasks, making it a powerful general purpose method for generating text from LLMs.

An Overview of Contrastive Decoding in LLMs

The paper "Contrastive Decoding Improves Reasoning in LLMs" presents an empirical investigation into using Contrastive Decoding (CD) as a method to enhance reasoning tasks in LLMs. Historically, text generation from LLMs has required distinct approaches depending on the nature of the task: truncated sampling for open-ended tasks and greedy decoding for reasoning-focused tasks. This paper highlights that these divisions might be unnecessary with the introduction of a unified generation method called Contrastive Decoding.

Methodology and Results

Contrastive Decoding, first introduced by Li et al. (2022), contrasts the token-level predictions of a stronger 'expert' model with those of a weaker 'amateur' model. Specifically, CD selects continuations that maximize a weighted difference between the expert's and the amateur's log-likelihoods, restricted to tokens the expert already considers plausible, which mitigates common pitfalls such as repetitive or generic outputs. Despite being training-free and computationally light, CD delivers significant improvements in reasoning over traditional greedy decoding.
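
To make this concrete, below is a minimal PyTorch sketch of a single greedy CD step under the common formulation of an expert-plausibility mask (controlled by alpha) and a beta-weighted expert-minus-amateur score. The function name, default hyperparameter values, and overall structure are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_decode_step(expert_logits: torch.Tensor,
                            amateur_logits: torch.Tensor,
                            alpha: float = 0.1,
                            beta: float = 0.5) -> int:
    """One greedy contrastive-decoding step (illustrative sketch).

    expert_logits / amateur_logits: 1-D vocabulary logits from the strong and
    weak model at the current position. alpha controls the plausibility mask
    over expert tokens; beta weights the amateur penalty. Default values are
    placeholders, not the paper's settings.
    """
    expert_logprobs = F.log_softmax(expert_logits, dim=-1)
    amateur_logprobs = F.log_softmax(amateur_logits, dim=-1)

    # Keep only tokens the expert finds plausible:
    # log p_exp(x) >= log(alpha) + max_x' log p_exp(x')
    cutoff = torch.log(torch.tensor(alpha)) + expert_logprobs.max()
    plausible = expert_logprobs >= cutoff

    # Contrastive score: reward expert likelihood, penalize amateur likelihood.
    scores = (1 + beta) * expert_logprobs - beta * amateur_logprobs
    scores = scores.masked_fill(~plausible, float("-inf"))
    return int(scores.argmax())
```

Setting beta to zero recovers ordinary greedy decoding on the expert model, which is one way to see CD as a generalization of the usual decoding strategy for reasoning tasks.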

This paper empirically validates CD on a variety of reasoning benchmarks. Notably, LLaMA-65B with CD outperforms LLaMA 2, GPT-3.5, and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and LLaMA 2, GPT-3.5, and PaLM-540B on the GSM8K math word reasoning benchmark. These results extend prior findings, establishing CD's advantage over both nucleus sampling for long-form text generation and greedy decoding for reasoning tasks.

Experimental Design and Analysis

The researchers examined the parameter sensitivities of CD, pairing LLaMA model variants with controlled experimental setups to ensure robust evaluation. Across tasks such as algebraic problem solving and commonsense reasoning, CD consistently yielded improvements, though the size of the gain depended on task-specific factors such as the choice of amateur model and the contrastive strength. Importantly, CD was shown to reduce surface-level copying from the prompt and to omit fewer reasoning steps during chain-of-thought inference, leading to improved accuracy.

To further characterize these benefits, an error analysis showed that CD-decoded models made fewer semantic misunderstandings and omitted fewer reasoning steps than baseline models, even though arithmetic errors persisted. Applying CD to factual recall tasks such as OpenBookQA and TriviaQA produced slight performance degradations, suggesting that these tasks benefit less from the contrast between expert and amateur probabilities.

Implications and Future Directions

The method not only unifies previously separate generation strategies into one versatile approach, but is also computationally inexpensive, adding minimal overhead compared with reasoning-enhancement techniques such as self-consistency, which requires decoding many sampled chains. This makes CD particularly attractive when stronger reasoning is desired without a proportional increase in compute.
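
As a rough back-of-the-envelope illustration of that overhead claim (the model sizes and sample count below are hypothetical, not figures from the paper), per-token compute can be compared as follows:

```python
# Illustrative per-token cost comparison; parameter counts and sample count are assumptions.
expert_params = 65e9        # strong expert model (e.g. a 65B-parameter LLM)
amateur_params = 1.5e9      # much smaller amateur model (hypothetical size)
k_samples = 8               # chains sampled for self-consistency (hypothetical)

greedy_cost = expert_params                        # one expert forward pass per token
cd_cost = expert_params + amateur_params           # expert pass + amateur pass per token
self_consistency_cost = k_samples * expert_params  # k full expert decodes

print(f"CD overhead vs greedy:               {cd_cost / greedy_cost:.2f}x")
print(f"Self-consistency overhead vs greedy: {self_consistency_cost / greedy_cost:.0f}x")
```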

Looking forward, the paper leaves several avenues open for further research and model refinement. The exploration of other model architectures, particularly those outside the LLaMA family, remains future work. Additional studies could also investigate whether CD's gains compound with complementary techniques, such as embedding augmentation or adversarial training, to further reinforce the robustness of LLMs under diverse operational conditions.

In conclusion, the paper underscores CD's potential to transcend the traditional division between text generation methodologies while advancing the reasoning capabilities of LLMs. It provides a solid foundation for further exploration of adaptive decoding strategies, paving the way for more capable, contextually aware natural language generation systems.

References (36)
  1. PaLM 2 technical report, 2023.
  2. PIQA: Reasoning about physical commonsense in natural language, 2019.
  3. DoLa: Decoding by contrasting layers improves factuality in large language models, 2023.
  4. Scaling instruction-finetuned language models, 2022.
  5. BoolQ: Exploring the surprising difficulty of natural yes/no questions, 2019.
  6. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.
  7. Training verifiers to solve math word problems, 2021.
  8. Hierarchical neural story generation, 2018.
  9. Chain-of-Thought Hub: A continuous effort to measure large language models' reasoning performance, 2023.
  10. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies, 2021.
  11. ROSCOE: A suite of metrics for scoring step-by-step reasoning, 2022.
  12. Measuring massive multitask language understanding, 2021a.
  13. Measuring mathematical problem solving with the MATH dataset, 2021b.
  14. The curious case of neural text degeneration, 2020.
  15. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension, 2017.
  16. Scaling laws for neural language models, 2020.
  17. Discriminator-guided multi-step reasoning with language models, 2023.
  18. Contrastive decoding: Open-ended text generation as optimization, 2022.
  19. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 158–167, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1015.
  20. DExperts: Decoding-time controlled text generation with experts and anti-experts, 2021.
  21. Locally typical sampling, 2023.
  22. A diverse corpus for evaluating and developing English math word problem solvers, 2021.
  23. Can a suit of armor conduct electricity? A new dataset for open book question answering, 2018.
  24. OpenAI. GPT-4 technical report, 2023.
  25. Are NLP models really able to solve simple math word problems?, 2021.
  26. Reasoning with language model prompting: A survey, 2023.
  27. WinoGrande: An adversarial Winograd Schema Challenge at scale, 2019.
  28. SocialIQA: Commonsense reasoning about social interactions, 2019.
  29. CommonsenseQA: A question answering challenge targeting commonsense knowledge, 2019.
  30. LLaMA: Open and efficient foundation language models, 2023.
  31. Towards understanding chain-of-thought prompting: An empirical study of what matters, 2023a.
  32. Self-consistency improves chain of thought reasoning in language models, 2023b.
  33. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  34. FUDGE: Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.naacl-main.276.
  35. Surfacing biases in large language models using contrastive input decoding, 2023.
  36. HellaSwag: Can a machine really finish your sentence?, 2019.
Authors (2)
  1. Sean O'Brien (29 papers)
  2. Mike Lewis (78 papers)
Citations (25)