
MCC-KD: Multi-CoT Consistent Knowledge Distillation (2310.14747v3)

Published 23 Oct 2023 in cs.CL
Abstract: LLMs have showcased remarkable capabilities in complex reasoning through chain of thought (CoT) prompting. Recently, there has been a growing interest in transferring these reasoning abilities from LLMs to smaller models. However, achieving both the diversity and consistency in rationales presents a challenge. In this paper, we focus on enhancing these two aspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to efficiently distill the reasoning capabilities. In MCC-KD, we generate multiple rationales for each question and enforce consistency among the corresponding predictions by minimizing the bidirectional KL-divergence between the answer distributions. We investigate the effectiveness of MCC-KD with different model architectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both mathematical reasoning and commonsense reasoning benchmarks. The empirical results not only confirm MCC-KD's superior performance on in-distribution datasets but also highlight its robust generalization ability on out-of-distribution datasets.

Analyses and Insights on "MCC-KD: Multi-CoT Consistent Knowledge Distillation"

In the paper titled "MCC-KD: Multi-CoT Consistent Knowledge Distillation," Chen et al. propose Multi-CoT Consistent Knowledge Distillation (MCC-KD), a novel approach for transferring complex reasoning capabilities from LLMs to smaller models. The approach addresses two key challenges in knowledge distillation for reasoning tasks: the diversity and the consistency of the generated rationales.

Background

Chain of Thought (CoT) prompting has been recognized as an effective method to enhance the reasoning abilities of LLMs. This technique decomposes complex tasks into a series of intermediate steps, or rationales. However, prior studies have shown that such advanced reasoning capabilities only emerge in models with over 100 billion parameters. Models of that scale demand significant computational resources, which limits their deployment on resource-constrained platforms.

MCC-KD Approach

Rationale Extraction and Filtering

MCC-KD leverages CoT prompting to generate multiple rationales for each question using a teacher LLM such as GPT-3.5-Turbo. To ensure diversity, it filters the sampled rationales by computing Jaccard similarity over their N-gram segments and retaining only sufficiently dissimilar ones. This filtering mitigates the similar or repetitive rationales that teacher models tend to produce even at higher sampling temperatures.
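The filtering step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the word-level n-grams, the greedy keep-or-reject pass, and the similarity threshold are all assumptions, since the summary only states that Jaccard similarity over N-gram segments is used to retain diverse rationales.

```python
def ngrams(text, n=3):
    """Split a rationale into word-level n-gram segments."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two n-gram sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_diverse(rationales, k=5, n=3, threshold=0.7):
    """Greedily keep up to k rationales whose pairwise n-gram
    Jaccard similarity with every kept rationale stays below
    the threshold (hypothetical selection rule)."""
    kept = []
    for r in rationales:
        grams = ngrams(r, n)
        if all(jaccard(grams, ngrams(s, n)) < threshold for s in kept):
            kept.append(r)
        if len(kept) == k:
            break
    return kept
```

Near-duplicate rationales sampled from the teacher share most of their n-grams and are rejected, while genuinely different reasoning paths pass the threshold.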

Incorporating Consistency

To ensure that the distilled reasoning capabilities are stable and generalizable, MCC-KD enforces consistency among the predictions of the student model. This is done by minimizing the bidirectional KL-divergence between the answer distributions generated from diverse rationales for the same question. The training objective combines the traditional cross-entropy loss with this consistency-enforcing KL-divergence term, weighted by a hyperparameter α.
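The combined objective can be sketched as below. The bidirectional KL term and the α weighting follow the description above; the averaging of the two cross-entropy terms and of the two KL directions is an assumption for illustration, as the summary does not specify those details.

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for discrete answer distributions, with a small
    epsilon for numerical stability."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mcc_kd_loss(p, q, ce_p, ce_q, alpha=0.1):
    """Sketch of the training objective: cross-entropy on the gold
    answer for each of the two rationales, plus the bidirectional
    (symmetrized) KL divergence between their answer distributions,
    weighted by the hyperparameter alpha."""
    bi_kl = 0.5 * (kl_div(p, q) + kl_div(q, p))
    return 0.5 * (ce_p + ce_q) + alpha * bi_kl
```

When the two rationales yield identical answer distributions, the KL term vanishes and only the cross-entropy remains; divergent predictions for the same question are penalized in proportion to α.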

Numerical Results and Implications

Performance Evaluation

Empirical results demonstrate that MCC-KD outperforms existing CoT-based knowledge distillation methods across different model architectures (LLaMA and FlanT5) and scales (3B, 7B, 11B, and 13B parameters) on both mathematical and commonsense reasoning benchmarks. For instance, with LLaMA-7B, MCC-KD achieves an accuracy improvement from 38.01 to 41.58 on the GSM8K dataset. In out-of-distribution (OOD) generalization tasks, which are crucial for robustness, MCC-KD also demonstrates substantial improvements. For example, the accuracy on the ASDiv OOD dataset improves from 47.69 to 49.52 when utilizing FlanT5-XXL.

Ablation Studies

Ablation studies show that removing the multi-CoT consistency constraint significantly degrades performance, indicating its critical role in enhancing model stability. Similarly, omitting the rationale filtering step results in noticeable performance drops, affirming the importance of diversity in the rationales. The paper also experimentally substantiates the effectiveness of leveraging multiple rationales (with K set to 5), balancing computational efficiency and model performance.

Implications and Future Directions

The findings of this paper imply that modeling consistency among diverse reasoning paths is pivotal for effective knowledge distillation in reasoning tasks. This insight could catalyze further research into consistency regularization techniques in other domains of AI and machine learning.

Theoretical Implications

The theoretically grounded approach (using KL-divergence for consistency) enhances our understanding of how reasoning paths contribute to model generalizability and stability.

Practical Implications

Practically, MCC-KD facilitates the deployment of resource-efficient models capable of performing complex reasoning tasks, widening the accessibility to such advanced AI capabilities in real-world applications.

Future Research

Future developments could explore the application of MCC-KD to a wider array of tasks beyond mathematical and commonsense reasoning. Comparative studies with different LLM architectures as teachers, alongside evaluations on additional OOD datasets, could provide deeper insights. Given the strong empirical results, extending this work to consider hybrid models including pre-trained and specialized modules might reveal new possibilities for enhancing inference capabilities.

In summary, the proposed MCC-KD framework underscores the significance of maintaining rationale diversity and consistency when distilling reasoning abilities into smaller models, marking a valuable contribution to the field of knowledge distillation. The methodology and findings pave the way for more efficient yet capable AI systems, which are crucial for broadening the practical applicability of state-of-the-art AI technologies.

Authors (6)
  1. Hongzhan Chen
  2. Siyue Wu
  3. Xiaojun Quan
  4. Rui Wang
  5. Ming Yan
  6. Ji Zhang