
MCC-KD: Multi-CoT Consistent Knowledge Distillation (2310.14747v3)

Published 23 Oct 2023 in cs.CL
Abstract: LLMs have showcased remarkable capabilities in complex reasoning through chain of thought (CoT) prompting. Recently, there has been a growing interest in transferring these reasoning abilities from LLMs to smaller models. However, achieving both the diversity and consistency in rationales presents a challenge. In this paper, we focus on enhancing these two aspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to efficiently distill the reasoning capabilities. In MCC-KD, we generate multiple rationales for each question and enforce consistency among the corresponding predictions by minimizing the bidirectional KL-divergence between the answer distributions. We investigate the effectiveness of MCC-KD with different model architectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both mathematical reasoning and commonsense reasoning benchmarks. The empirical results not only confirm MCC-KD's superior performance on in-distribution datasets but also highlight its robust generalization ability on out-of-distribution datasets.

Analyses and Insights on "MCC-KD: Multi-CoT Consistent Knowledge Distillation"

In the paper titled "MCC-KD: Multi-CoT Consistent Knowledge Distillation," Chen et al. propose Multi-CoT Consistent Knowledge Distillation (MCC-KD), a novel approach for transferring complex reasoning capabilities from LLMs to smaller models. The approach addresses two key challenges in knowledge distillation for reasoning tasks: the diversity and the consistency of the generated rationales.

Background

Chain of Thought (CoT) prompting has been recognized as an effective method to enhance the reasoning abilities of LLMs. This technique decomposes complex tasks into a series of intermediate steps, or rationales. However, prior studies have shown that such advanced reasoning capabilities only emerge in models with over 100 billion parameters. Models of that scale demand significant computational resources, which limits their deployment on resource-constrained platforms.

MCC-KD Approach

Rationale Extraction and Filtering

MCC-KD leverages CoT prompting to generate multiple rationales for each question using a teacher LLM such as GPT-3.5-Turbo. To ensure diversity, it filters the sampled rationales by computing Jaccard similarity over their N-gram segments and retaining only sufficiently dissimilar ones. This filtering mitigates the similar or repetitive rationales that teacher models tend to produce even at higher sampling temperatures.
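The filtering step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the word-level n-grams, the greedy keep-or-reject pass, and the similarity threshold are all assumptions, since the summary only states that Jaccard similarity over N-gram segments is used to retain diverse rationales.

```python
def ngrams(text, n=3):
    """Split a rationale into word-level n-gram segments."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two n-gram sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_diverse(rationales, k=5, n=3, threshold=0.7):
    """Greedily keep up to k rationales whose pairwise n-gram
    Jaccard similarity with every kept rationale stays below
    the threshold (hypothetical selection rule)."""
    kept = []
    for r in rationales:
        grams = ngrams(r, n)
        if all(jaccard(grams, ngrams(s, n)) < threshold for s in kept):
            kept.append(r)
        if len(kept) == k:
            break
    return kept
```

Near-duplicate rationales sampled from the teacher share most of their n-grams and are rejected, while genuinely different reasoning paths pass the threshold.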

Incorporating Consistency

To ensure that the distilled reasoning capabilities are stable and generalizable, MCC-KD enforces consistency among the predictions of the student model. This is done by minimizing the bidirectional KL-divergence between the answer distributions generated from diverse rationales for the same question. The training objective combines the traditional cross-entropy loss with this consistency-enforcing KL-divergence term, weighted by a hyperparameter α.
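The combined objective can be sketched as below. The bidirectional KL term and the α weighting follow the description above; the averaging of the two cross-entropy terms and of the two KL directions is an assumption for illustration, as the summary does not specify those details.

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for discrete answer distributions, with a small
    epsilon for numerical stability."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mcc_kd_loss(p, q, ce_p, ce_q, alpha=0.1):
    """Sketch of the training objective: cross-entropy on the gold
    answer for each of the two rationales, plus the bidirectional
    (symmetrized) KL divergence between their answer distributions,
    weighted by the hyperparameter alpha."""
    bi_kl = 0.5 * (kl_div(p, q) + kl_div(q, p))
    return 0.5 * (ce_p + ce_q) + alpha * bi_kl
```

When the two rationales yield identical answer distributions, the KL term vanishes and only the cross-entropy remains; divergent predictions for the same question are penalized in proportion to α.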

Numerical Results and Implications

Performance Evaluation

Empirical results demonstrate that MCC-KD outperforms existing CoT-based knowledge distillation methods across different model architectures (LLaMA and FlanT5) and scales (3B, 7B, 11B, and 13B parameters) on both mathematical and commonsense reasoning benchmarks. For instance, with LLaMA-7B, MCC-KD achieves an accuracy improvement from 38.01 to 41.58 on the GSM8K dataset. In out-of-distribution (OOD) generalization tasks, which are crucial for robustness, MCC-KD also demonstrates substantial improvements. For example, the accuracy on the ASDiv OOD dataset improves from 47.69 to 49.52 when utilizing FlanT5-XXL.

Ablation Studies

Ablation studies show that removing the multi-CoT consistency constraint significantly degrades performance, indicating its critical role in enhancing model stability. Similarly, omitting the rationale filtering step results in noticeable performance drops, affirming the importance of diversity in the rationales. The paper also experimentally substantiates the effectiveness of leveraging multiple rationales (with K set to 5), balancing computational efficiency and model performance.

Implications and Future Directions

The findings of this paper imply that modeling consistency among diverse reasoning paths is pivotal for effective knowledge distillation in reasoning tasks. This insight could catalyze further research into consistency regularization techniques in other domains of AI and machine learning.

Theoretical Implications

The theoretically grounded approach (using KL-divergence for consistency) enhances our understanding of how reasoning paths contribute to model generalizability and stability.

Practical Implications

Practically, MCC-KD facilitates the deployment of resource-efficient models capable of performing complex reasoning tasks, widening the accessibility to such advanced AI capabilities in real-world applications.

Future Research

Future developments could explore the application of MCC-KD to a wider array of tasks beyond mathematical and commonsense reasoning. Comparative studies with different LLM architectures as teachers, alongside evaluations on additional OOD datasets, could provide deeper insights. Given the strong empirical results, extending this work to consider hybrid models including pre-trained and specialized modules might reveal new possibilities for enhancing inference capabilities.

In summary, the proposed MCC-KD framework underscores the significance of maintaining rationale diversity and consistency when distilling reasoning abilities into smaller models, marking a valuable contribution to the field of knowledge distillation. The methodology and findings pave the way for more efficient yet capable AI systems, which are crucial for broadening the practical applicability of state-of-the-art AI technologies.

Authors (6)
  1. Hongzhan Chen
  2. Siyue Wu
  3. Xiaojun Quan
  4. Rui Wang
  5. Ming Yan
  6. Ji Zhang