Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes (2305.02301v2)

Published 3 May 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Deploying LLMs is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with much fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning the same T5 model struggles to match even by using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step .

Distilling Step-by-Step: Enhancing NLP Models with Reduced Data and Model Sizes

The paper "Distilling Step-by-Step! Outperforming Larger LLMs with Less Training Data and Smaller Model Sizes" addresses a crucial challenge in the field of NLP: the deployment inefficiencies of LLMs. High memory usage and computational requirements make LLMs impractical for many real-world applications. Smaller, task-specific models present a viable alternative but traditionally require extensive data for finetuning or distillation to reach comparable performance.

Key Contributions

The authors introduce a mechanism termed "Distilling step-by-step," which substantially reduces both the amount of training data and the model size typically needed for finetuning or distillation. The method prompts an LLM to produce rationales — natural-language explanations that accompany its predictions — and uses them as additional supervision within a multi-task training framework. By learning to predict labels and generate rationales jointly, smaller models can be trained to outperform the teacher LLM using only a fraction of the data and parameters.
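To make the training objective concrete, here is a minimal sketch of how the multi-task setup could be written with a Hugging Face T5 checkpoint: the student learns both to predict the label and to generate the teacher's rationale, and the two losses are summed. The task prefixes, the rationale weight lam, and the single-example step are simplifying assumptions for illustration, not the authors' released implementation (see their GitHub repository for that).

```python
# Minimal sketch of the multi-task distillation objective, assuming a
# Hugging Face T5 checkpoint. Prefixes, the weight `lam`, and single-example
# steps are illustrative simplifications, not the paper's exact configuration.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(question: str, label: str, rationale: str, lam: float = 1.0) -> float:
    """One update: label-prediction loss plus rationale-generation loss."""
    losses = []
    for prefix, target in (("[label] ", label), ("[rationale] ", rationale)):
        enc = tokenizer(prefix + question, return_tensors="pt", truncation=True)
        dec = tokenizer(target, return_tensors="pt", truncation=True)
        out = model(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask,
                    labels=dec.input_ids)
        losses.append(out.loss)
    loss = losses[0] + lam * losses[1]  # L = L_label + lam * L_rationale
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

train_step(
    "premise: A soccer game with multiple males playing. "
    "hypothesis: Some men are playing a sport.",
    "entailment",
    "Playing a soccer game is playing a sport, so the hypothesis follows.",
)
```

Because the rationale head is used only during training, the student predicts labels directly at inference time, so the rationale supervision adds no cost at deployment.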

Experimental Outcomes

The paper evaluates the approach on four NLP benchmarks — e-SNLI, ANLI, CommonsenseQA (CQA), and SVAMP — and reports three main findings:

  • Data Efficiency: The method reduces the required training examples by more than 50% on average; on e-SNLI, for instance, it surpasses standard finetuning on the full dataset while using only 12.5% of the training data.
  • Model Efficiency: Distilling step-by-step enables models far smaller than the teacher — such as a 770M-parameter T5 — to exceed the performance of the few-shot prompted 540B-parameter PaLM model, while also using less data.
  • Comparison with Traditional Methods: Against both standard finetuning and conventional distillation, the approach improves performance consistently across all four datasets while lowering both data and computational overhead (the rationale-extraction step underlying these comparisons is sketched below).
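The rationales used as supervision are elicited from the teacher LLM with few-shot chain-of-thought prompting: each input is appended to a handful of worked exemplars, and the teacher's completion is split into a rationale and a label. The sketch below shows one plausible way to assemble such a prompt and parse the output; the exemplar text, the "So the answer is" convention, and the teacher_generate callable are illustrative assumptions, not the paper's exact prompts or API.

```python
# Illustrative sketch of the rationale-extraction stage: build a few-shot
# chain-of-thought prompt for a teacher LLM, then split its completion into
# a rationale and a label. The exemplar, the "So the answer is" convention,
# and `teacher_generate` are assumptions, not the paper's exact setup.
FEW_SHOT_EXEMPLARS = """\
Q: Can a penny conduct electricity? Answer choices: (a) yes (b) no
A: Pennies are made of metal, and metals conduct electricity.
So the answer is (a).
"""

def build_prompt(question: str) -> str:
    """Prepend worked exemplars so the teacher emits a rationale before its label."""
    return f"{FEW_SHOT_EXEMPLARS}\nQ: {question}\nA:"

def parse_completion(completion: str) -> tuple[str, str]:
    """Split the teacher's free-text answer into (rationale, label)."""
    rationale, _, label = completion.partition("So the answer is")
    return rationale.strip(), label.strip(" .\n")

def extract_supervision(question: str, teacher_generate) -> tuple[str, str]:
    # `teacher_generate` stands in for whatever text-completion call is
    # available: it takes a prompt string and returns the model's completion.
    return parse_completion(teacher_generate(build_prompt(question)))
```

Each extracted (rationale, label) pair then feeds the multi-task objective sketched under Key Contributions.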

Implications and Future Directions

From a practical standpoint, the implications of this work are substantial. By reducing the dependency on large-scale datasets and massive computational infrastructure, this approach democratizes access to advanced NLP capabilities. Organizations with limited resources can deploy high-performance models without investing excessively in hardware or acquiring vast amounts of annotated data.

Theoretically, this work advances our understanding of knowledge distillation and highlights LLM-generated rationales as a valuable training signal. Future research could integrate these techniques into other domains and further refine the quality of the extracted rationales.

There is also potential to extend the method to other complex NLP tasks and to deploy the resulting smaller models in resource-constrained environments. Because the framework is not tied to a particular teacher, evaluating it with different LLMs and student architectures would further validate its robustness.

Conclusion

Distilling step-by-step offers a compelling strategy for advancing NLP model efficiency, providing a pragmatic path forward in addressing the computational challenges inherent in current LLM architectures. Its innovative approach marks a step toward more sustainable, scalable, and accessible AI applications.

Authors (9)
  1. Cheng-Yu Hsieh (23 papers)
  2. Chun-Liang Li (60 papers)
  3. Chih-Kuan Yeh (23 papers)
  4. Hootan Nakhost (10 papers)
  5. Yasuhisa Fujii (18 papers)
  6. Alexander Ratner (24 papers)
  7. Ranjay Krishna (116 papers)
  8. Chen-Yu Lee (48 papers)
  9. Tomas Pfister (89 papers)
Citations (413)