The paper, "Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation," presents a novel method termed Distillation Contrastive Decoding (DCD) aimed at enhancing reasoning abilities in LLMs. The proposed approach addresses the limitations associated with traditional Contrastive Decoding (CD), which typically relies on both an expert and a smaller, amateur model for effective inference. This reliance not only increases computational demands but also presents challenges when relatively smaller models of the same architecture are unavailable.
Key Contributions:
- Distillation Contrastive Decoding (DCD): Unlike traditional CD, which requires loading two models, DCD combines contrastive Chain-of-thought (CoT) prompts with lightweight distillation techniques, such as dropout and quantization, to elicit the amateur's reasoning behavior from the expert model itself (see the decoding sketch after this list). This removes the need for a separate amateur LLM while matching or exceeding CD's performance and significantly reducing memory usage during inference.
- Improved Reasoning Benchmarks: The method demonstrates superior performance across various reasoning benchmarks, significantly outperforming both CD and Chain-of-thought Prompting (CP) methods. Specifically, on arithmetic reasoning tasks (GSM8K), DCD boosts the performance of Llama2 models by as much as 3.79% and exceeds CD by 1.89%. In commonsense reasoning tasks (StrategyQA), DCD surpasses traditional methods and enhances Llama2 models' performance by up to 5.9%.
- Methodology and Abstraction: DCD does not depend on the availability of a specific amateur model, making it adaptable across different model architectures, including Llama2, Mistral-7B, and DeepSeek-7B. By using distillation techniques to simulate an amateur model internally, it preserves reasoning quality while making efficient use of computational resources.
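To make the mechanism concrete, below is a minimal sketch of a single DCD decoding step, assuming a Hugging Face-style causal LM whose dropout probability has been set to a nonzero value. The function name `dcd_next_token_logits` and the specific `alpha`/`beta` scoring are illustrative; they follow the general contrastive decoding formulation rather than the paper's exact recipe.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def dcd_next_token_logits(model, expert_ids, amateur_ids, alpha=0.1, beta=0.5):
    """Illustrative single DCD decoding step (not the paper's exact recipe).

    expert_ids  : prompt built from valid chain-of-thought exemplars.
    amateur_ids : the same question preceded by contrastive (flawed) exemplars.
    The amateur pass keeps dropout active so the same weights behave like a
    weaker, 'distilled' model, removing the need for a second LLM.
    """
    # Expert pass: normal inference mode, dropout disabled.
    model.eval()
    expert_logp = F.log_softmax(model(expert_ids).logits[:, -1, :], dim=-1)

    # Amateur pass: same weights, but dropout left on to degrade predictions.
    model.train()
    amateur_logp = F.log_softmax(model(amateur_ids).logits[:, -1, :], dim=-1)
    model.eval()

    # Adaptive plausibility constraint from contrastive decoding: only tokens
    # the expert already finds reasonably likely remain eligible.
    cutoff = math.log(alpha) + expert_logp.max(dim=-1, keepdim=True).values
    scores = (1 + beta) * expert_logp - beta * amateur_logp
    return scores.masked_fill(expert_logp < cutoff, float("-inf"))
```

Greedy or sampled decoding then proceeds over these contrasted scores instead of the raw expert logits.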
Technical Insights:
- Contrastive Decoding Limitations: Traditional CD's dependence on a separate amateur model creates both logistical and computational challenges, especially when smaller variants are unavailable or impractical to deploy alongside larger models.
- Contrastive CoT Prompting: DCD uses contrastive CoT prompts that pair correct reasoning exemplars with deliberately flawed ones; contrasting against the flawed rationale suppresses the error patterns it encodes and improves performance on logical tasks (an illustrative prompt pair is sketched after this list).
- Distillation Techniques: Dropout, applied judiciously during inference, plays a critical role in how well the simulated amateur works. The paper's experiments indicate that a moderate dropout rate, between 0.2 and 0.4, generally yields the best results for both arithmetic and commonsense tasks (a small dropout helper is sketched after this list).
- Performance Correlation: The paper observes a correlation between a model's score on knowledge-heavy benchmarks such as MMLU and the gain DCD provides, suggesting that models with a stronger foundational knowledge base benefit most from the approach.
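As a concrete illustration of the contrastive CoT design mentioned above, the exemplars below are invented for this summary (not drawn from the paper's prompt sets): the expert prompt carries a valid rationale, the amateur prompt a deliberately flawed one, and the two feed the `expert_ids`/`amateur_ids` of the earlier sketch.

```python
# Invented exemplars for illustration only; the paper supplies its own prompt sets.
EXPERT_EXEMPLAR = (
    "Q: Tom has 3 boxes with 4 apples in each. How many apples does he have?\n"
    "A: There are 3 boxes and each holds 4 apples, so 3 * 4 = 12. The answer is 12.\n\n"
)

AMATEUR_EXEMPLAR = (  # deliberately flawed reasoning
    "Q: Tom has 3 boxes with 4 apples in each. How many apples does he have?\n"
    "A: Tom has 3 boxes and 4 apples, so 3 + 4 = 7. The answer is 7.\n\n"
)

def build_prompts(question: str) -> tuple[str, str]:
    """Return (expert_prompt, amateur_prompt) for a new question."""
    suffix = f"Q: {question}\nA:"
    return EXPERT_EXEMPLAR + suffix, AMATEUR_EXEMPLAR + suffix
```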
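For the dropout point above, a helper like the hypothetical `enable_inference_dropout` below is one way to keep only the dropout layers stochastic at inference time, with `p` set in the 0.2-0.4 range the paper reports as most effective. Note that some released checkpoints configure dropout to 0, so the probability must be raised explicitly.

```python
import torch.nn as nn

def enable_inference_dropout(model: nn.Module, p: float = 0.3) -> None:
    """Hypothetical helper: keep the model in eval mode overall, but switch its
    Dropout modules to training mode with probability p, so the amateur pass
    behaves like a degraded copy of the expert."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p
            module.train()
```

Calling this before the amateur pass in the earlier sketch (and restoring eval mode afterwards) is a more targeted alternative to flipping the whole model into training mode.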
In conclusion, this work takes a significant step toward making LLM reasoning both more effective and more efficient, offering a practical answer to the constraints of traditional Contrastive Decoding. By eliminating the need for an external amateur model and reducing resource demands, DCD emerges as a viable strategy for improving performance on logic and reasoning tasks. Further research may explore applying DCD to more complex reasoning scenarios and to larger, more sophisticated model architectures.