Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model (2402.14035v3)

Published 21 Feb 2024 in cs.LG and cs.AI

Abstract: Recent advancements in foundation models have yielded impressive performance across a wide range of tasks. Meanwhile, for specific applications, practitioners have been developing specialized application models. To enjoy the benefits of both kinds of models, one natural path is to transfer the knowledge in foundation models into specialized application models, which are generally more efficient to serve. Techniques from knowledge distillation may be applied here, where the application model learns to mimic the foundation model. However, specialized application models and foundation models differ substantially: they have large gaps in capacity, employ distinct architectures, take input features from different modalities, and are optimized on different distributions. These differences in model characteristics pose significant challenges for distillation methods. In this work, we propose creating a teaching committee comprising both foundation model teachers and complementary teachers. Complementary teachers possess model characteristics akin to the student's, aiming to bridge the gap between the foundation model and specialized application models for a smoother knowledge transfer. Further, to accommodate the dissimilarity among the teachers in the committee, we introduce DiverseDistill, which allows the student to understand the expertise of each teacher and extract task knowledge. Our evaluations demonstrate that adding complementary teachers enhances student performance. Finally, DiverseDistill consistently outperforms baseline distillation methods, regardless of the teacher choices, resulting in significantly improved student performance.
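
To make the committee idea concrete, here is a minimal PyTorch sketch of multi-teacher ("committee") distillation: the student is trained against a softmax-weighted mixture of KL-divergence terms toward a foundation teacher and a complementary teacher, plus the usual supervised cross-entropy term. The Student network, the scalar per-teacher gates, the temperature, and the loss weighting are illustrative assumptions; the paper's DiverseDistill uses its own mechanism for letting the student assess each teacher's expertise, which is not reproduced here.

```python
# Minimal sketch of committee-based knowledge distillation (illustrative, not the
# paper's DiverseDistill). Two frozen teachers -- a "foundation" teacher and a
# "complementary" teacher with student-like characteristics -- supervise one student.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Student(nn.Module):
    """A small specialized application model (placeholder architecture)."""

    def __init__(self, in_dim: int = 32, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def committee_distill_loss(student_logits, teacher_logits_list, gate_logits,
                           labels, temperature=2.0, alpha=0.5):
    """Weighted sum of per-teacher KL terms plus supervised cross-entropy.

    gate_logits holds one learnable scalar per teacher; its softmax decides how
    much the student trusts each committee member (a stand-in for DiverseDistill's
    teacher-expertise modeling).
    """
    weights = F.softmax(gate_logits, dim=0)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = student_logits.new_zeros(())
    for w, t_logits in zip(weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        kd = kd + w * F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


if __name__ == "__main__":
    # Toy usage: random tensors stand in for the (frozen) teachers' output logits.
    x = torch.randn(8, 32)
    labels = torch.randint(0, 10, (8,))
    student = Student()
    gate_logits = nn.Parameter(torch.zeros(2))          # one gate per committee member
    foundation_logits = torch.randn(8, 10)              # e.g. from a large foundation model
    complementary_logits = torch.randn(8, 10)           # e.g. from a student-like teacher
    loss = committee_distill_loss(student(x), [foundation_logits, complementary_logits],
                                  gate_logits, labels)
    loss.backward()
    print(f"committee distillation loss: {loss.item():.4f}")
```

In a real pipeline the gates (and, in the paper's setting, a richer model of each teacher's expertise) would be trained jointly with the student while both teachers stay frozen.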
