GOVERN: Gradient Orientation Vote Ensemble for Multi-Teacher Reinforced Distillation (2405.03764v2)

Published 6 May 2024 in cs.CL and cs.IR

Abstract: Pre-trained LLMs have become an integral component of question-answering systems, achieving remarkable performance. However, for practical deployment, it is crucial to perform knowledge distillation to maintain high performance while operating under computational constraints. In this paper, we address a key question: given the importance of unsupervised distillation for student model performance, how can knowledge from multiple teacher models be effectively ensembled during this stage without the guidance of labels? We propose a novel algorithm, GOVERN, to tackle this issue. GOVERN has demonstrated significant improvements in both offline and online experiments, enabling the student model to achieve results comparable to those of teacher ensembles. Remarkably, our experiments show that GOVERN requires a mere 1% of the ensemble method's inference budget to achieve 99.5% of its performance. The proposed algorithm has been successfully deployed in a real-world commercial question-answering system, demonstrating its practical applicability.
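
The abstract describes the idea only at a high level. As one illustration of what a gradient-orientation vote over multiple teachers could look like during label-free distillation, the sketch below aggregates teacher soft labels by comparing each teacher's implied update direction on the student's logits against a per-example majority vote. The function name govern_style_soft_labels, the agreement threshold, and the fallback averaging are assumptions made for this sketch; the paper's actual GOVERN algorithm may differ.

# Hypothetical sketch of a gradient-orientation vote over teacher soft labels.
# Names, shapes, and the voting rule are assumptions inferred from the method's
# name, not the paper's verified implementation.
import torch
import torch.nn.functional as F


def govern_style_soft_labels(student_logits: torch.Tensor,
                             teacher_logits: list[torch.Tensor]) -> torch.Tensor:
    """Aggregate teacher soft labels for an unlabeled batch.

    For soft-label distillation, the gradient of the KD loss w.r.t. the student
    logits is proportional to (student_prob - teacher_prob), so the sign of
    (teacher_prob - student_prob) acts as a per-class "gradient orientation".
    Each teacher votes with that sign; teachers agreeing most with the
    per-example majority orientation are kept and their distributions averaged.
    """
    student_prob = student_logits.softmax(dim=-1)                # (B, C)
    teacher_prob = torch.stack([t.softmax(dim=-1)
                                for t in teacher_logits])        # (T, B, C)

    # Per-teacher orientation: +1 pushes a class probability up, -1 pushes it down.
    orientation = torch.sign(teacher_prob - student_prob.unsqueeze(0))   # (T, B, C)

    # Majority orientation across teachers for every (example, class) pair.
    majority = torch.sign(orientation.sum(dim=0, keepdim=True))          # (1, B, C)

    # A teacher's per-example agreement score: fraction of classes where its
    # orientation matches the majority vote.
    agreement = (orientation == majority).float().mean(dim=-1)           # (T, B)

    # Keep teachers at or above the median agreement for each example and
    # average their distributions (at least half the teachers survive).
    keep = (agreement >= agreement.median(dim=0, keepdim=True).values).float()  # (T, B)
    weights = keep / keep.sum(dim=0, keepdim=True).clamp_min(1e-8)               # (T, B)
    return (weights.unsqueeze(-1) * teacher_prob).sum(dim=0)                     # (B, C)


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, classes, n_teachers = 4, 3, 5
    student = torch.randn(batch, classes)
    teachers = [torch.randn(batch, classes) for _ in range(n_teachers)]
    target = govern_style_soft_labels(student, teachers)
    kd_loss = F.kl_div(student.log_softmax(-1), target, reduction="batchmean")
    print(target.shape, kd_loss.item())

The voted target distribution would then replace a single teacher's soft label in a standard distillation loss; how the vote interacts with the "reinforced" selection in the paper's title is not specified in the abstract.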

