Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation (2402.14874v2)

Published 21 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We propose a straightforward approach called Distillation Contrastive Decoding (DCD) to enhance the reasoning capabilities of LLMs during inference. In contrast to previous approaches that relied on smaller amateur models or analysis of hidden state differences, DCD employs Contrastive Chain-of-thought Prompting and advanced distillation techniques, including Dropout and Quantization. This approach effectively addresses the limitations of Contrastive Decoding (CD), which typically requires both an expert and an amateur model, thus increasing computational resource demands. By integrating contrastive prompts with distillation, DCD obviates the need for an amateur model and reduces memory usage. Our evaluations demonstrate that DCD significantly enhances LLM performance across a range of reasoning benchmarks, surpassing both CD and existing methods in the GSM8K and StrategyQA datasets.

The paper, "Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation," presents a novel method termed Distillation Contrastive Decoding (DCD) aimed at enhancing reasoning abilities in LLMs. The proposed approach addresses the limitations of traditional Contrastive Decoding (CD), which typically relies on both an expert model and a smaller, amateur model during inference. This reliance not only increases computational demands but also becomes impractical when no smaller model of the same architecture is available.

Key Contributions:

  1. Distillation Contrastive Decoding (DCD): Unlike traditional CD methods, which require loading two models, DCD combines Contrastive Chain-of-thought (CoT) prompts with distillation techniques such as dropout and quantization to obtain the amateur reasoning signal from the expert model itself. This eliminates the need for a separate amateur LLM while maintaining or enhancing performance and significantly reducing memory usage during inference (a minimal decoding-step sketch follows this list).
  2. Improved Reasoning Benchmarks: The method demonstrates superior performance across various reasoning benchmarks, significantly outperforming both CD and Chain-of-thought Prompting (CP) methods. Specifically, on arithmetic reasoning tasks (GSM8K), DCD boosts the performance of Llama2 models by as much as 3.79% and exceeds CD by 1.89%. In commonsense reasoning tasks (StrategyQA), DCD surpasses traditional methods and enhances Llama2 models' performance by up to 5.9%.
  3. Methodology and Abstraction: DCD does not depend on the availability of a specific amateur model, making it adaptable across different architectures, including Llama2, Mistral-7B, and DeepSeek-7B. It uses distillation techniques to simulate a weaker, amateur-like model internally, achieving effective reasoning while using computational resources efficiently.

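To make the first contribution concrete, the sketch below shows one DCD-style decoding step, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`. The expert logits come from the ordinary deterministic pass, the amateur logits from the same weights with dropout re-enabled at inference time (quantization could serve the same purpose), and the two are combined as in standard Contrastive Decoding. The hyperparameters `alpha`, `beta`, and `dropout_p` are illustrative rather than the paper's reported values, and for brevity both passes see the same `input_ids`; in DCD the contrastive CoT prompt design discussed below determines what each pass actually sees.

```python
import torch
import torch.nn.functional as F

def dcd_next_token_logits(model, input_ids, alpha=0.1, beta=0.5, dropout_p=0.2):
    """One DCD-style decoding step (illustrative sketch, not the authors' code).

    Expert logits: ordinary deterministic forward pass.
    Amateur logits: the same model with dropout active at inference time,
    standing in for the separate amateur model that classic CD would load.
    """
    model.eval()
    with torch.no_grad():
        expert_logits = model(input_ids).logits[:, -1, :]

        # Re-enable only the Dropout modules for the second pass, so the
        # same weights behave like a weaker, "distilled" amateur.
        saved = []
        for m in model.modules():
            if isinstance(m, torch.nn.Dropout):
                saved.append((m, m.p))
                m.p = dropout_p
                m.train()
        amateur_logits = model(input_ids).logits[:, -1, :]
        for m, p in saved:  # restore the original inference behaviour
            m.p = p
            m.eval()

    # Plausibility mask (as in standard CD): keep only tokens to which the
    # expert assigns at least alpha times its top probability.
    log_probs = F.log_softmax(expert_logits, dim=-1)
    cutoff = torch.log(torch.tensor(alpha)) + log_probs.max(dim=-1, keepdim=True).values

    # Contrast expert against amateur, then mask implausible tokens.
    scores = (1 + beta) * expert_logits - beta * amateur_logits
    return scores.masked_fill(log_probs < cutoff, float("-inf"))
```

Greedy decoding would then take `scores.argmax(-1)` at each step.
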
Technical Insights:

  • Contrastive Decoding Limitations: Traditional CD's dependency on an amateur model presents both logistical and computational challenges, especially when smaller model variants are unavailable or impractical to deploy in larger architectures.
  • Contrastive CoT Prompting: DCD leverages several contrastive CoT prompt designs, pairing correct and incorrect reasoning exemplars, to improve performance on logical tasks by steering decoding away from erroneous reasoning paths (an illustrative prompt construction appears after this list).
  • Distillation Techniques: Dropout rates, applied judiciously during inference, play a critical role in achieving optimal model performance. Experimentation has shown that a moderate dropout rate, between 0.2 and 0.4, generally yields the best results for both arithmetic and commonsense tasks.
  • Performance Correlation: The paper notes a correlation between high scores on tasks such as MMLU and the enhancement provided by DCD, indicating that models with a strong foundational knowledge base particularly benefit from the DCD approach.

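As a concrete illustration of contrastive CoT prompting, the snippet below assembles a prompt that pairs a correct reasoning exemplar with a deliberately flawed one ahead of the target question. The exemplar wording and the helper `build_contrastive_cot_prompt` are invented for illustration and are not taken from the paper's prompt set.

```python
# Illustrative contrastive CoT exemplars; the text here is hypothetical.
CORRECT_DEMO = (
    "Q: Ali has 3 boxes with 4 apples each. How many apples does he have?\n"
    "Correct reasoning: 3 boxes times 4 apples per box is 3 * 4 = 12.\n"
    "A: 12\n"
)
INCORRECT_DEMO = (
    "Q: Ali has 3 boxes with 4 apples each. How many apples does he have?\n"
    "Incorrect reasoning: adding the numbers gives 3 + 4 = 7.\n"
    "A: 7\n"
)

def build_contrastive_cot_prompt(question: str) -> str:
    """Concatenate correct and incorrect exemplars ahead of the new question."""
    return (
        CORRECT_DEMO + "\n" + INCORRECT_DEMO + "\n"
        + f"Q: {question}\nCorrect reasoning:"
    )
```

Depending on the prompt design being evaluated, such exemplars could be prepended to the expert pass, the amateur pass, or both.
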
In conclusion, this work provides a significant step in advancing the effectiveness and efficiency of LLM reasoning capabilities, offering a robust solution to the constraints of traditional Contrastive Decoding methods. By eliminating the need for external amateur models and reducing resource demands, DCD emerges as a viable strategy for improving logic and reasoning tasks in LLMs. Further research may explore the application of DCD to even more complex reasoning scenarios and larger, more sophisticated model architectures.

Authors: Phuc Phan, Hieu Tran, Long Phan