Keypoint-based Progressive Chain-of-Thought Distillation for LLMs (2405.16064v1)

Published 25 May 2024 in cs.CL

Abstract: Chain-of-thought distillation is a powerful technique for transferring reasoning abilities from LLMs to smaller student models. Previous methods typically require the student to mimic the step-by-step rationale produced by LLMs, and often face the following challenges: (i) Tokens within a rationale vary in significance, and treating them equally may fail to accurately mimic keypoint tokens, leading to reasoning errors. (ii) They usually distill knowledge by consistently predicting all the steps in a rationale, which falls short in distinguishing the learning order of step generation. This diverges from the human cognitive progression of starting with easy tasks and advancing to harder ones, resulting in sub-optimal outcomes. To this end, we propose a unified framework, called KPOD, to address these issues. Specifically, we propose a token weighting module that utilizes mask learning to encourage accurate mimicry of keypoint tokens by the student during distillation. In addition, we develop an in-rationale progressive distillation strategy, starting by training the student to generate the final reasoning steps and gradually extending to cover the entire rationale. To accomplish this, a weighted token generation loss is proposed to assess step reasoning difficulty, and a value function is devised to schedule the progressive distillation by considering both step difficulty and question diversity. Extensive experiments on four reasoning benchmarks demonstrate that KPOD outperforms previous methods by a large margin.
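The abstract's two core ideas can be sketched in a few lines: a generation loss where each rationale token contributes in proportion to a keypoint weight, and a curriculum window that first supervises only the final reasoning steps and extends backwards to the full rationale. This is a minimal illustration assuming simple forms of both components; the function names (`weighted_token_loss`, `progressive_window`) and the linear schedule are assumptions for illustration, not the paper's actual formulation.

```python
import math

def weighted_token_loss(token_logprobs, token_weights):
    """Negative log-likelihood where each token contributes in
    proportion to its keypoint weight (higher weight = more emphasis)."""
    assert len(token_logprobs) == len(token_weights)
    total = sum(token_weights)
    return -sum(w * lp for w, lp in zip(token_weights, token_logprobs)) / total

def progressive_window(num_steps, stage, num_stages):
    """Step indices supervised at a curriculum stage: start from the
    last reasoning steps and extend backwards until the whole rationale
    is covered (a simple linear schedule, assumed for illustration)."""
    k = max(1, math.ceil(num_steps * (stage + 1) / num_stages))
    return list(range(num_steps - k, num_steps))

# Example: a 4-step rationale distilled over 3 curriculum stages.
for stage in range(3):
    print(stage, progressive_window(4, stage, 3))
# stage 0 -> [2, 3]; stage 1 -> [1, 2, 3]; stage 2 -> [0, 1, 2, 3]
```

In the paper, the per-step difficulty estimated by the weighted loss and a question-diversity term feed a value function that decides how fast this window grows; the fixed linear schedule above stands in for that scheduler.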

Authors (6)
  1. Kaituo Feng (14 papers)
  2. Changsheng Li (37 papers)
  3. Xiaolu Zhang (39 papers)
  4. Jun Zhou (370 papers)
  5. Ye Yuan (274 papers)
  6. Guoren Wang (79 papers)
Citations (1)