Can a student Large Language Model perform as well as its teacher? (2310.02421v1)
Abstract: The burgeoning complexity of contemporary deep learning models, while achieving unparalleled accuracy, has inadvertently introduced deployment challenges in resource-constrained environments. Knowledge distillation, a technique aiming to transfer knowledge from a high-capacity "teacher" model to a streamlined "student" model, emerges as a promising solution to this dilemma. This paper provides a comprehensive overview of the knowledge distillation paradigm, emphasizing its foundational principles such as the utility of soft labels and the significance of temperature scaling. Through meticulous examination, we elucidate the critical determinants of successful distillation, including the architecture of the student model, the caliber of the teacher, and the delicate balance of hyperparameters. While acknowledging its profound advantages, we also delve into the complexities and challenges inherent in the process. Our exploration underscores knowledge distillation's potential as a pivotal technique in optimizing the trade-off between model performance and deployment efficiency.
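The soft labels and temperature scaling highlighted in the abstract refer to the standard distillation objective: the student is trained to match the teacher's temperature-softened output distribution in addition to the ground-truth labels. A minimal PyTorch sketch of that loss follows; the hyperparameter names `temperature` and `alpha` are illustrative choices for this example, not values reported in the paper.

```python
# Minimal sketch of the classic soft-label distillation loss (Hinton et al., 2015).
# `temperature` and `alpha` are illustrative hyperparameters, not settings from the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend the soft-label (teacher) term with the usual hard-label cross-entropy."""
    # Temperature-scaled soft targets from the teacher; a higher temperature spreads
    # probability mass over more classes, exposing the teacher's relative class similarities.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened distributions, scaled by T^2 so the gradient
    # magnitude stays comparable across temperatures.
    soft_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Here `alpha` weights the teacher-matching term against the hard-label term; tuning it, along with the temperature, is part of the hyperparameter balance the abstract describes.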