Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes (2305.02301v2)

Published 3 May 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Deploying LLMs is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with much fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning the same T5 model struggles to match even by using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step .

Distilling Step-by-Step: Enhancing NLP Models with Reduced Data and Model Sizes

The paper "Distilling Step-by-Step! Outperforming Larger LLMs with Less Training Data and Smaller Model Sizes" addresses a crucial challenge in the field of NLP: the deployment inefficiencies of LLMs. High memory usage and computational requirements make LLMs impractical for many real-world applications. Smaller, task-specific models present a viable alternative but traditionally require extensive data for finetuning or distillation to reach comparable performance.

Key Contributions

The authors introduce a mechanism termed "Distilling step-by-step," which substantially reduces both the amount of training data and the model size typically needed for finetuning or distillation. The method prompts an LLM to produce rationales — natural-language explanations that accompany its predictions — and uses them as additional supervision within a multi-task training framework. By learning to predict labels and generate rationales jointly, smaller models can be trained to outperform the teacher LLM using only a fraction of the data and parameters.
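To make the training objective concrete, here is a minimal sketch of how the multi-task setup could be written with a Hugging Face T5 checkpoint: the student learns both to predict the label and to generate the teacher's rationale, and the two losses are summed. The task prefixes, the rationale weight lam, and the single-example step are simplifying assumptions for illustration, not the authors' released implementation (see their GitHub repository for that).

```python
# Minimal sketch of the multi-task distillation objective, assuming a
# Hugging Face T5 checkpoint. Prefixes, the weight `lam`, and single-example
# steps are illustrative simplifications, not the paper's exact configuration.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(question: str, label: str, rationale: str, lam: float = 1.0) -> float:
    """One update: label-prediction loss plus rationale-generation loss."""
    losses = []
    for prefix, target in (("[label] ", label), ("[rationale] ", rationale)):
        enc = tokenizer(prefix + question, return_tensors="pt", truncation=True)
        dec = tokenizer(target, return_tensors="pt", truncation=True)
        out = model(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask,
                    labels=dec.input_ids)
        losses.append(out.loss)
    loss = losses[0] + lam * losses[1]  # L = L_label + lam * L_rationale
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

train_step(
    "premise: A soccer game with multiple males playing. "
    "hypothesis: Some men are playing a sport.",
    "entailment",
    "Playing a soccer game is playing a sport, so the hypothesis follows.",
)
```

Because the rationale head is used only during training, the student predicts labels directly at inference time, so the rationale supervision adds no cost at deployment.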

Experimental Outcomes

The paper evaluates the approach on four NLP benchmarks — e-SNLI, ANLI, CommonsenseQA (CQA), and SVAMP — and reports three main findings:

  • Data Efficiency: The method reduces the required training examples by more than 50% on average; on e-SNLI, for instance, it surpasses standard finetuning on the full dataset while using only 12.5% of the training data.
  • Model Efficiency: Distilling step-by-step enables models far smaller than the teacher — such as a 770M-parameter T5 — to exceed the performance of the few-shot prompted 540B-parameter PaLM model, while also using less data.
  • Comparison with Traditional Methods: Against both standard finetuning and conventional distillation, the approach improves performance consistently across all four datasets while lowering both data and computational overhead (the rationale-extraction step underlying these comparisons is sketched below).
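The rationales used as supervision are elicited from the teacher LLM with few-shot chain-of-thought prompting: each input is appended to a handful of worked exemplars, and the teacher's completion is split into a rationale and a label. The sketch below shows one plausible way to assemble such a prompt and parse the output; the exemplar text, the "So the answer is" convention, and the teacher_generate callable are illustrative assumptions, not the paper's exact prompts or API.

```python
# Illustrative sketch of the rationale-extraction stage: build a few-shot
# chain-of-thought prompt for a teacher LLM, then split its completion into
# a rationale and a label. The exemplar, the "So the answer is" convention,
# and `teacher_generate` are assumptions, not the paper's exact setup.
FEW_SHOT_EXEMPLARS = """\
Q: Can a penny conduct electricity? Answer choices: (a) yes (b) no
A: Pennies are made of metal, and metals conduct electricity.
So the answer is (a).
"""

def build_prompt(question: str) -> str:
    """Prepend worked exemplars so the teacher emits a rationale before its label."""
    return f"{FEW_SHOT_EXEMPLARS}\nQ: {question}\nA:"

def parse_completion(completion: str) -> tuple[str, str]:
    """Split the teacher's free-text answer into (rationale, label)."""
    rationale, _, label = completion.partition("So the answer is")
    return rationale.strip(), label.strip(" .\n")

def extract_supervision(question: str, teacher_generate) -> tuple[str, str]:
    # `teacher_generate` stands in for whatever text-completion call is
    # available: it takes a prompt string and returns the model's completion.
    return parse_completion(teacher_generate(build_prompt(question)))
```

Each extracted (rationale, label) pair then feeds the multi-task objective sketched under Key Contributions.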

Implications and Future Directions

From a practical standpoint, the implications of this work are substantial. By reducing the dependency on large-scale datasets and massive computational infrastructure, this approach democratizes access to advanced NLP capabilities. Organizations with limited resources can deploy high-performance models without investing excessively in hardware or acquiring vast amounts of annotated data.

Theoretically, this work advances our understanding of knowledge distillation and highlights LLM-generated rationales as a valuable training signal. Future research could integrate these techniques into other domains and further refine the quality of the extracted rationales.

There is also potential to extend the method to other complex NLP tasks and to deploy the resulting smaller models in resource-constrained environments. Because the framework is not tied to a particular teacher, evaluating it with different LLMs and student architectures would further validate its robustness.

Conclusion

Distilling step-by-step offers a compelling strategy for advancing NLP model efficiency, providing a pragmatic path forward in addressing the computational challenges inherent in current LLM architectures. Its innovative approach marks a step toward more sustainable, scalable, and accessible AI applications.

Authors (9)
  1. Cheng-Yu Hsieh (23 papers)
  2. Chun-Liang Li (60 papers)
  3. Chih-Kuan Yeh (23 papers)
  4. Hootan Nakhost (10 papers)
  5. Yasuhisa Fujii (18 papers)
  6. Alexander Ratner (24 papers)
  7. Ranjay Krishna (116 papers)
  8. Chen-Yu Lee (48 papers)
  9. Tomas Pfister (89 papers)
Citations (413)