Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models (2303.09639v2)

Published 16 Mar 2023 in cs.CL

Abstract: Large pretrained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. However, KD can be ineffective when the student is manually selected from a set of existing options, since it can be a sub-optimal choice within the space of all possible student architectures. We develop multilingual KD-NAS, the use of Neural Architecture Search (NAS) guided by KD to find the optimal student architecture for task-agnostic distillation from a multilingual teacher. In each episode of the search process, a NAS controller predicts a reward based on the distillation loss and latency of inference. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture(s) with the highest reward are selected and distilled on the full training corpus. KD-NAS can automatically trade off efficiency and effectiveness, and recommends architectures suitable to various latency budgets. Using our multi-layer hidden state distillation process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% of the teacher's performance, and has been deployed in 3 software offerings requiring high throughput, low latency, and deployment on CPU.
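
The abstract describes the search loop only at a high level: candidates are scored by a reward that trades off distillation loss against inference latency, the top candidates are distilled on a proxy set, and the best one is distilled on the full corpus. The sketch below illustrates that loop under stated assumptions; the candidate space, the `proxy_distill` stand-in, and the linear loss-plus-latency `reward` are illustrative placeholders, not the paper's actual controller, reward formula, or training procedure.

```python
# Minimal sketch of a KD-NAS-style search loop (illustrative only).
# The reward formula, candidate space, and proxy evaluation below are
# assumptions for illustration, not the paper's exact method.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    """A student architecture described by a few hyperparameters."""
    num_layers: int
    hidden_size: int
    num_heads: int

def sample_candidates(n: int) -> list[Candidate]:
    """Randomly sample student architectures from a small search space."""
    return [
        Candidate(
            num_layers=random.choice([4, 6, 8]),
            hidden_size=random.choice([384, 512, 768]),
            num_heads=random.choice([6, 8, 12]),
        )
        for _ in range(n)
    ]

def proxy_distill(cand: Candidate) -> tuple[float, float]:
    """Stand-in for distilling `cand` from the teacher on a small proxy set.

    Returns (distillation_loss, latency_ms). Both are simulated here; in
    practice they would come from a short distillation run and a latency
    benchmark on the target hardware.
    """
    loss = random.uniform(0.5, 2.0) / cand.num_layers ** 0.5
    latency = 0.002 * cand.num_layers * cand.hidden_size
    return loss, latency

def reward(loss: float, latency: float, lam: float = 0.1) -> float:
    """Illustrative reward trading off distillation loss against latency."""
    return -(loss + lam * latency)

def search(episodes: int = 5, per_episode: int = 8) -> Candidate:
    """Run a few search episodes and return the highest-reward architecture."""
    best, best_r = None, float("-inf")
    for _ in range(episodes):
        for cand in sample_candidates(per_episode):
            loss, latency = proxy_distill(cand)
            r = reward(loss, latency)
            if r > best_r:
                best, best_r = cand, r
    return best

if __name__ == "__main__":
    winner = search()
    print("Selected student architecture for full distillation:", winner)
```

In the full method, the selected architecture would then be distilled from the teacher on the complete training corpus rather than the proxy set.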

Authors (6)
  1. Aashka Trivedi (9 papers)
  2. Takuma Udagawa (18 papers)
  3. Michele Merler (10 papers)
  4. Rameswar Panda (79 papers)
  5. Yousef El-Kurdi (6 papers)
  6. Bishwaranjan Bhattacharjee (18 papers)
Citations (5)
