m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers (2402.16918v3)
Abstract: Modular neural architectures are gaining attention for their strong generalization and efficient adaptation to new domains. However, training these models is challenging because their intrinsically sparse connectivity makes optimization difficult. Leveraging knowledge from monolithic models through techniques such as knowledge distillation can ease training and enable the integration of diverse knowledge. Nevertheless, conventional knowledge distillation approaches are not tailored to modular models and struggle with their unique architectures and enormous parameter counts. Motivated by these challenges, we propose module-to-module knowledge distillation (m2mKD) for transferring knowledge between modules. m2mKD separately combines teacher modules of a pretrained monolithic model and student modules of a modular model with a shared meta model, encouraging each student module to mimic the behaviour of its corresponding teacher module. We evaluate m2mKD on two modular neural architectures: Neural Attentive Circuits (NACs) and Vision Mixture-of-Experts (V-MoE). Applying m2mKD to NACs yields significant improvements in IID accuracy on Tiny-ImageNet (up to 5.6%) and OOD robustness on Tiny-ImageNet-R (up to 4.2%). Additionally, the V-MoE-Base model trained with m2mKD achieves 3.5% higher accuracy than end-to-end training on ImageNet-1k. Code is available at https://github.com/kamanphoebe/m2mKD.
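The abstract describes m2mKD only at a high level. As a rough illustration of the idea, the PyTorch sketch below pairs one frozen teacher module with one trainable student module through a shared, frozen meta model and trains the student to match the teacher module's output. The `SharedMeta` and `ModuleKD` classes, the MSE objective, and the toy module definitions are illustrative assumptions, not the released implementation (see the repository above for the authors' code).

```python
# Minimal sketch of module-to-module distillation (m2mKD-style), assuming:
# - the teacher is one block of a pretrained monolithic network (frozen),
# - the student is one module of a modular model,
# - both are attached to a shared, frozen "meta" encoder/decoder pair,
# - the student mimics the teacher module's output via an MSE loss.
# All class and variable names here are illustrative, not the authors' API.
import torch
import torch.nn as nn


class SharedMeta(nn.Module):
    """Frozen meta model providing a common input/output interface for modules."""
    def __init__(self, dim: int):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)   # stand-in for the meta encoder
        self.decoder = nn.Linear(dim, dim)   # stand-in for the meta decoder
        for p in self.parameters():
            p.requires_grad_(False)


class ModuleKD(nn.Module):
    """Wraps one teacher module and one student module with the shared meta model."""
    def __init__(self, teacher_module: nn.Module, student_module: nn.Module, meta: SharedMeta):
        super().__init__()
        self.teacher = teacher_module.eval()
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.student = student_module
        self.meta = meta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.meta.encoder(x)                        # shared input interface
        with torch.no_grad():
            t_out = self.meta.decoder(self.teacher(h))  # teacher target (no gradients)
        s_out = self.meta.decoder(self.student(h))      # student prediction
        return nn.functional.mse_loss(s_out, t_out)     # module-level distillation loss


# Toy usage: distill one teacher block into one student module.
dim = 64
meta = SharedMeta(dim)
teacher_block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
student_module = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
pair = ModuleKD(teacher_block, student_module, meta)
opt = torch.optim.AdamW(student_module.parameters(), lr=1e-3)

for _ in range(10):                      # a few illustrative training steps
    x = torch.randn(8, dim)
    loss = pair(x)
    opt.zero_grad()
    loss.backward()                      # gradients flow only into the student module
    opt.step()
```

In this sketch, each teacher/student pair could be distilled independently (and in parallel), with only the small student modules receiving gradients; the monolithic teacher and the shared meta model stay frozen.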
- Do deep nets really need to be deep? In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929 [cs].
- Efficient knowledge distillation from an ensemble of teachers. In Interspeech.
- Coordination among neural modules through a shared global workspace. arXiv preprint arXiv:2103.01197.
- DEMix layers: Disentangling domains for modular language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5557–5576, Seattle, United States. Association for Computational Linguistics.
- Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis & Machine Intelligence, 44(11):7436–7456.
- Fastermoe: Modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’22, page 120–134, New York, NY, USA. Association for Computing Machinery.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
- Distilling the knowledge in a neural network.
- Parameter-efficient transfer learning for NLP. arXiv preprint arXiv:1902.00751.
- LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations.
- Tutel: Adaptive mixture-of-experts at scale.
- Scaling up visual and vision-language representation learning with noisy text supervision. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR.
- Paraphrasing complex network: network compression via factor transfer. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 2765–2774.
- Sparse upcycling: Training mixture-of-experts from dense checkpoints. International Conference on Learning Representations.
- Learning multiple layers of features from tiny images. Technical report, University of Toronto.
- Adaptive knowledge distillation based on entropy. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7409–7413.
- Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- Prefix-tuning: Optimizing continuous prompts for generation. Annual Meeting of the Association for Computational Linguistics.
- Module-wise adaptive distillation for multimodality foundation models.
- Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393.
- Multimodal contrastive learning with LIMoE: the language-image mixture of experts.
- Deep Incubation: Training Large Models by Divide-and-Conquering. arXiv:2212.04129 [cs].
- EvoMoE: An evolutional mixture-of-experts training framework via dense-to-sparse gate.
- Modular deep learning. arXiv preprint arXiv:2302.11529.
- Contextual parameter generation for universal neural machine translation. arXiv preprint arXiv:1808.08493.
- Efficient parametrization of multi-domain deep neural networks. arXiv preprint arXiv:1803.10082.
- Scaling vision with sparse mixture of experts. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems, volume 34, pages 8583–8595. Curran Associates, Inc.
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.
- ModuleFormer: Learning modular large language models from uncurated data. arXiv preprint arXiv:2306.04640.
- Densely guided knowledge distillation using multiple teacher assistants. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9375–9384.
- Training data-efficient image transformers & distillation through attention. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR.
- The Caltech-UCSD Birds-200-2011 Dataset.
- Neural Attentive Circuits.
- BERT-of-Theseus: Compressing BERT by progressive module replacing. In Conference on Empirical Methods in Natural Language Processing.
- Deep model reassembly. arXiv preprint arXiv:2210.17409.
- Reinforced multi-teacher selection for knowledge distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16):14284–14291.
- MoEfication: Transformer feed-forward layers are mixtures of experts. In Findings of the Association for Computational Linguistics: ACL 2022.
- Knowledge Distillation via Module Replacing for Automatic Speech Recognition with Recurrent Neural Network Transducer. In Proc. Interspeech 2022, pages 4436–4440.
- ST-MoE: Designing stable and transferable sparse expert models.
Authors: Ka Man Lo, Yiming Liang, Wenyu Du, Yuantao Fan, Zili Wang, Wenhao Huang, Lei Ma, Jie Fu