Knowledge Distillation Using Frontier Open-source LLMs: Generalizability and the Role of Synthetic Data (2410.18588v1)

Published 24 Oct 2024 in cs.LG

Abstract: Leading open-source LLMs such as Llama-3.1-Instruct-405B are extremely capable at generating text, answering questions, and solving a variety of natural language understanding tasks. However, they incur higher inference cost and latency compared to smaller LLMs. Knowledge distillation provides a way to use outputs from these large, capable teacher models to train smaller student models which can be used for inference at lower cost and latency, while retaining comparable accuracy. We investigate the efficacy of distillation using the Llama-3.1-405B-Instruct teacher and the smaller Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct student models. Contributions of this work include (a) We evaluate the generalizability of distillation with the above Llama-3.1 teacher-student pairs across different tasks and datasets (b) We show that using synthetic data during distillation significantly improves the accuracy of 8B and 70B models, and when used with reasoning chains, even matches or surpasses the zero-shot accuracy of 405B model on some datasets (c) We empirically show that distillation enables 8B and 70B models to internalize 405B's reasoning ability by using only standard fine-tuning (without customizing any loss function). This allows cost and latency-efficient student model inference. (d) We show pitfalls in evaluation of distillation, and present task-specific evaluation, including both human and LLM-grading, and ground-truth based traditional accuracy benchmarks. This methodical study brings out the fundamental importance of synthetic data quality in knowledge distillation, and of combining multiple, task-specific ways of accuracy and quality evaluation in assessing the effectiveness of distillation.

Knowledge Distillation Using Frontier Open-Source LLMs: Generalizability and the Role of Synthetic Data

The paper investigates knowledge distillation for LLMs, focusing on the Llama-3.1-Instruct series. With Llama-3.1-405B-Instruct as the teacher model, the authors examine the efficiency and effectiveness of distilling knowledge into smaller student models, namely Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct. The paper outlines a methodological framework aimed at reducing the computational cost of large, capable models while maintaining high performance on a variety of natural language tasks.

The methodological contributions begin with an evaluation of the distillation process across different tasks and datasets, with the aim of ensuring that the smaller student models retain the reasoning and comprehension capabilities of the teacher. The authors advocate the use of synthetic data, which significantly improved the performance of the student models, often reaching or surpassing the zero-shot accuracy of the much larger teacher on specific datasets.

Methodology

The authors describe a systematic two-step process for knowledge distillation: generating outputs from the teacher model using advanced, task-specific prompts, and then fine-tuning the student models on those outputs. The task-specific synthetic data produced in the first step is central to training quality; the prompting strategies notably include chain-of-thought (CoT) and chain-of-density prompting, designed so that crucial nuances and reasoning steps are reproduced by the distilled models. A minimal sketch of this two-step pipeline is given below.
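
The sketch below is an illustrative reconstruction of this recipe rather than the authors' code: the teacher endpoint URL, the prompt wording, and the JSONL chat format are assumptions, and the resulting file is meant to feed any standard supervised fine-tuning toolkit (the paper relies on standard fine-tuning with no custom loss).

```python
# Minimal sketch of the two-step distillation recipe: (1) collect reasoning-chain
# outputs from the teacher, (2) prepare them as ordinary SFT targets for the student.
# Assumptions (not from the paper): the 405B teacher is served behind an
# OpenAI-compatible endpoint, and the prompt template / file format are illustrative.
import json
from openai import OpenAI

# Hypothetical local endpoint serving the teacher model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

COT_TEMPLATE = (
    "Answer the question. Think step by step, then give the final answer.\n\n"
    "Question: {question}"
)

def generate_teacher_output(question: str) -> str:
    """Step 1: collect a reasoning-chain response from the teacher model."""
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-405B-Instruct",  # illustrative identifier
        messages=[{"role": "user", "content": COT_TEMPLATE.format(question=question)}],
        temperature=0.0,
    )
    return response.choices[0].message.content

def build_sft_dataset(questions: list[str], path: str) -> None:
    """Step 2 (data prep): write (prompt, teacher output) pairs for standard
    supervised fine-tuning of the 8B/70B student -- no custom loss needed."""
    with open(path, "w") as f:
        for q in questions:
            record = {
                "messages": [
                    {"role": "user", "content": q},
                    {"role": "assistant", "content": generate_teacher_output(q)},
                ]
            }
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    build_sft_dataset(["What is 17 * 24?"], "distillation_sft.jsonl")
```

In the paper's setting, the student (Llama-3.1-8B/70B-Instruct) is then fine-tuned on such data with an off-the-shelf supervised fine-tuning pipeline, so the teacher's reasoning chains become ordinary training targets that the student internalizes.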

Results

The paper provides experimental evidence that supports the viability and effectiveness of the proposed distillation strategy:

  • Summarization Tasks: Distillation with chain-of-density prompting produced student models whose summaries reached notably higher entity density than the teacher LLM's vanilla-prompted outputs (a sketch of an entity-density metric follows this list).
  • Conversational Tasks: On conversational datasets, the distilled 70B model surpassed the base teacher model's performance on certain evaluation metrics, showing strong alignment with desired response qualities.
  • Natural Language Understanding Tasks: In NLU tasks, particularly natural language inference and question answering, student models distilled on CoT outputs often outperformed students distilled on vanilla-prompted outputs. However, on more complex mathematical reasoning tasks, direct CoT prompting remained essential, pointing to the limits of distillation for such problems.
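
As a concrete illustration of the summarization evaluation mentioned above, the following snippet computes a simple entity-density score (named entities per token) for a summary, a minimal sketch in the spirit of chain-of-density evaluation. The use of spaCy and its en_core_web_sm model is an assumption made here for illustration; the paper's exact entity-extraction tooling is not specified in this summary.

```python
# Minimal sketch of an entity-density metric for summaries.
# Assumes spaCy is installed and the small English model has been downloaded
# (python -m spacy download en_core_web_sm); the actual tooling may differ.
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_density(summary: str) -> float:
    """Named entities per token: higher values indicate a denser summary."""
    doc = nlp(summary)
    return len(doc.ents) / max(len(doc), 1)

if __name__ == "__main__":
    sparse = "The company announced a new product this week."
    dense = "Apple announced the iPhone 16 at its Cupertino event on Monday."
    print(f"sparse: {entity_density(sparse):.3f}, dense: {entity_density(dense):.3f}")
```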

Implications

The implications of these findings are multifaceted. From a practical perspective, the approach offers a substantial reduction in inference cost and latency, which is critical for deployment at scale, without sacrificing performance. Theoretically, the work underscores the potential of synthetic data to carry complex reasoning through the knowledge-transfer process. Furthermore, the research highlights possible limitations of distillation for intricate problem solving, which may still require direct inference from the most capable models for optimal accuracy.

Future Directions

Future explorations could deepen the understanding of how different synthetic data generation strategies impact diverse LLM capabilities. Refining evaluation frameworks to better capture nuanced competency in conversational agents or further expanding the range of tasks might provide additional insights. The integration of more advanced or mixed distillation methods, with improved faithfulness and comprehension, remains a promising avenue for research.

Overall, this paper offers significant insights and a robust framework for leveraging knowledge distillation to optimize the efficiency of LLMs, with substantial benefits for real-world applications.

Authors (5)
  1. Anup Shirgaonkar
  2. Nikhil Pandey
  3. Nazmiye Ceren Abay
  4. Tolga Aktas
  5. Vijay Aski