Knowledge Distillation of Black-Box Large Language Models (2401.07013v2)

Published 13 Jan 2024 in cs.CL

Abstract: Given the exceptional performance of proprietary LLMs like GPT-4, recent research has increasingly focused on boosting the capabilities of smaller models through knowledge distillation (KD) from these powerful yet black-box teachers. While leveraging the high-quality outputs of these teachers is advantageous, the inaccessibility of their internal states often limits effective knowledge transfer. To overcome this limitation, we introduce Proxy-KD, a novel method that uses a proxy model to facilitate the efficient transfer of knowledge from black-box LLMs to smaller models. Our experiments show that Proxy-KD not only enhances the performance of KD from black-box teacher models but also surpasses traditional white-box KD techniques. This approach presents a compelling new avenue for distilling knowledge from advanced LLMs.
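
The abstract only outlines the idea at a high level: a white-box proxy model is first aligned with the black-box teacher's outputs, and the student is then distilled from the proxy, whose internal distributions are accessible. The sketch below illustrates that two-stage flow under stated assumptions; the toy models, alignment loss, and KL-based distillation loss are illustrative choices, not the paper's exact training procedure.

```python
# Minimal sketch of proxy-based distillation (assumptions, not the paper's exact method):
# Stage 1 aligns a white-box proxy with the black-box teacher using only the
# teacher's sampled tokens; Stage 2 runs ordinary white-box KD from the proxy
# to a smaller student via KL divergence on token-level distributions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 100, 32

def tiny_lm(dim):
    # Stand-in "language model": embedding + linear head over a toy vocabulary.
    return nn.Sequential(nn.Embedding(VOCAB, dim), nn.Linear(dim, VOCAB))

proxy, student = tiny_lm(DIM), tiny_lm(DIM // 2)
tokens = torch.randint(0, VOCAB, (8, 16))          # toy input batch
teacher_tokens = torch.randint(0, VOCAB, (8, 16))  # tokens sampled from the black-box teacher

# Stage 1 (assumed form): fit the proxy to the teacher's generated tokens,
# since the black-box teacher exposes no logits or hidden states.
opt_p = torch.optim.Adam(proxy.parameters(), lr=1e-3)
align_loss = F.cross_entropy(proxy(tokens).reshape(-1, VOCAB),
                             teacher_tokens.reshape(-1))
align_loss.backward()
opt_p.step()

# Stage 2 (assumed form): white-box KD from the aligned proxy to the student.
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
with torch.no_grad():
    proxy_logp = F.log_softmax(proxy(tokens), dim=-1)
student_logp = F.log_softmax(student(tokens), dim=-1)
kd_loss = F.kl_div(student_logp, proxy_logp, log_target=True, reduction="batchmean")
kd_loss.backward()
opt_s.step()
```

In a real pipeline, both stages would iterate over prompts and teacher generations rather than a single random batch; the key point is only that the proxy gives the student access to full output distributions that the black-box teacher cannot provide.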

Authors (7)
  1. Hongzhan Chen (6 papers)
  2. Xiaojun Quan (52 papers)
  3. Ming Yan (190 papers)
  4. Ji Zhang (176 papers)
  5. Ruijun Chen (12 papers)
  6. Yuqi Yi (2 papers)
  7. Chenliang Li (92 papers)
Citations (2)