
Large Language Models Are Reasoning Teachers (2212.10071v2)

Published 20 Dec 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Recent works have shown that chain-of-thought (CoT) prompting can elicit LLMs to solve complex reasoning tasks, step-by-step. However, prompt-based CoT methods are dependent on very large models such as GPT-3 175B which are prohibitive to deploy at scale. In this paper, we use these large models as reasoning teachers to enable complex reasoning in smaller models and reduce model size requirements by several orders of magnitude. We propose Fine-tune-CoT, a method that generates reasoning samples from very large teacher models to fine-tune smaller models. We evaluate our method on a wide range of public models and complex tasks. We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks. Additionally, we extend our method by leveraging the teacher model's ability to generate multiple distinct rationales for each original sample. Enriching the fine-tuning data with such diverse reasoning results in a substantial performance boost across datasets, even for very small models. We conduct ablations and sample studies to understand the emergence of reasoning capabilities of student models. Our code implementation and data are available at https://github.com/itsnamgyu/reasoning-teacher.

Insights from "LLMs Are Reasoning Teachers"

The paper "LLMs Are Reasoning Teachers" proposes an approach to distill complex reasoning capabilities from very LLMs to significantly smaller models using a method termed Fine-tune-CoT. This method leverages the reasoning abilities of large teacher models, such as GPT-3 175B, to enhance the reasoning capabilities of small student models through fine-tuning on generated reasoning samples. This not only addresses the computational and economic infeasibility of deploying large models at scale but also significantly reduces the required model size while maintaining or even improving performance on complex reasoning tasks.

Methodology and Key Findings

The core of the approach lies in chain-of-thought (CoT) reasoning, where LLMs generate step-by-step rationales to arrive at solutions for complex tasks. Fine-tune-CoT exploits this capability by having large teacher models generate reasoning examples and then using those examples to fine-tune smaller student models. The authors provide extensive experiments showing that student models fine-tuned in this way significantly outperform prompt-based baselines across a wide variety of reasoning tasks. Intriguingly, the students not only surpass prompt-based methods but in some cases exceed the accuracy of the teacher model itself, especially when the fine-tuning data is enriched with diverse reasoning paths.
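
To make the pipeline concrete, the sketch below (a rough illustration under stated assumptions, not the authors' code) prompts a teacher model in Zero-shot-CoT style for a rationale and a final answer, keeps only rationales whose answer matches the gold label, and formats the survivors as prompt-completion fine-tuning samples. The `query_teacher` callable, the prompt templates, and the `###`/`END` delimiters are assumptions made for illustration; the exact prompts and data formats are available in the paper's linked repository.

```python
# Minimal sketch of the Fine-tune-CoT data pipeline, assuming a generic teacher API.
# `query_teacher` is a hypothetical callable standing in for a call to a very large
# teacher model (e.g. GPT-3 175B); delimiters and JSONL format are illustrative.
import json
from typing import Callable, Dict, Iterable, List


def generate_cot_samples(
    dataset: Iterable[Dict[str, str]],    # items like {"question": ..., "answer": ...}
    query_teacher: Callable[[str], str],  # prompt -> teacher completion (assumed)
) -> List[Dict[str, str]]:
    """Collect teacher rationales and keep only those that reach the gold answer."""
    samples = []
    for item in dataset:
        # Step 1: elicit a step-by-step rationale (Zero-shot-CoT style prompting).
        rationale = query_teacher(
            f"Q: {item['question']}\nA: Let's think step by step."
        ).strip()
        # Step 2: ask the teacher to state its final answer given the rationale.
        prediction = query_teacher(
            f"Q: {item['question']}\nA: Let's think step by step. {rationale}\n"
            "Therefore, the answer is"
        ).strip()
        # Filter: discard rationales whose final answer misses the gold label,
        # so the student is never fine-tuned on incorrect reasoning.
        if item["answer"].lower() in prediction.lower():
            samples.append(
                {
                    "prompt": f"{item['question']} ###",
                    "completion": f" {rationale} --> {item['answer']} END",
                }
            )
    return samples


def write_finetune_file(samples: List[Dict[str, str]], path: str) -> None:
    """Serialize samples to the JSONL format commonly used by fine-tuning APIs."""
    with open(path, "w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")
```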

The method was tested on 12 datasets spanning arithmetic, symbolic, commonsense, and other reasoning types, with student models showing substantial performance improvements across the board and particularly strong gains in certain reasoning categories.

Practical Implications and Speculation on Future Developments

The implications of this work are manifold, both practical and theoretical. On the practical side, the ability to scale down complex reasoning capabilities while preserving their effectiveness democratizes access to logical reasoning in machine learning, making it feasible on modest computational resources and broadening the usability of AI in resource-constrained environments. For industrial deployments in real-world applications, the approach provides a cost-effective means to leverage advanced reasoning without the prohibitive costs associated with LLMs like GPT-3.

Theoretically, this work nudges the research community toward a deeper understanding of how reasoning emerges in neural networks and lays groundwork for future work on model distillation and efficiency limits. It also raises an intriguing question about how reasoning capabilities can be generalized and transferred across dissimilar model architectures, leaving ample room for further exploration.

Discussion on Fine-tuning Performance and Scalability

The paper highlights the scalability of Fine-tune-CoT along several dimensions: additional data, student model size, and the number of diverse reasoning examples. This scalability is crucial, as it shows that reasoning performance can be improved further by augmenting any of these factors. The discussion also addresses the relevant trade-offs and design choices, pointing to promising avenues such as stronger teacher models and better curation of the rationales extracted from them.
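
As a rough sketch of how diverse reasoning enlarges the fine-tuning set, the snippet below samples several rationales per question with stochastic decoding and keeps each correct, non-duplicate one. The `sample_rationale` callable, the sample count, and the temperature are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of the diverse-reasoning extension: sample several stochastic rationales
# per question from the teacher and keep every one that reaches the correct answer.
from typing import Callable, Dict, List, Tuple


def diverse_reasoning_samples(
    item: Dict[str, str],                                       # {"question": ..., "answer": ...}
    sample_rationale: Callable[[str, float], Tuple[str, str]],  # (prompt, temperature) -> (rationale, predicted answer); assumed
    num_samples: int = 8,
    temperature: float = 0.7,
) -> List[Dict[str, str]]:
    """Return one fine-tuning sample per distinct correct rationale for this question."""
    prompt = f"Q: {item['question']}\nA: Let's think step by step."
    samples, seen = [], set()
    for _ in range(num_samples):
        rationale, prediction = sample_rationale(prompt, temperature)
        # Keep only correct, non-duplicate rationales to diversify the training data.
        if item["answer"].lower() in prediction.lower() and rationale not in seen:
            seen.add(rationale)
            samples.append(
                {
                    "prompt": f"{item['question']} ###",
                    "completion": f" {rationale} --> {item['answer']} END",
                }
            )
    return samples
```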

Conclusion and Future Directions

In conclusion, the paper demonstrates a practical way to bridge the gap between large-scale machine reasoning and real-world deployability. While it does not claim dramatic theoretical breakthroughs, it provides a valuable method that pursues efficiency without forfeiting reasoning proficiency. Future work could enhance diverse reasoning, leverage more sophisticated CoT methods, and explore connections with knowledge distillation. Proposals that expand on these findings may further amplify the reasoning competence of small models, setting the stage for advances in how AI systems infer, learn, and generalize knowledge.

Authors (3)
  1. Namgyu Ho (10 papers)
  2. Laura Schmid (5 papers)
  3. Se-Young Yun (114 papers)
Citations (257)