Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation (2403.01479v3)
Abstract: The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation. Knowledge Distillation (KD) enhances efficiency by transferring knowledge from a teacher model to a more compact student model. However, KD approaches for the Transformer architecture often rely on heuristics, particularly when deciding which teacher layers to distill from. In this paper, we introduce the 'Align-to-Distill' (A2D) strategy, designed to address the feature mapping problem by adaptively aligning student attention heads with their teacher counterparts during training. The Attention Alignment Module in A2D performs a dense head-by-head comparison between student and teacher attention heads across layers, turning the combinatorial mapping heuristics into a learning problem. Our experiments show the efficacy of A2D, demonstrating gains of up to +3.61 and +0.63 BLEU points for WMT-2022 De->Dsb and WMT-2014 En->De, respectively, compared to Transformer baselines.
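The abstract does not include implementation details, but the core idea of learning a dense head-by-head mapping instead of hand-picking teacher layers can be sketched in PyTorch. The sketch below is illustrative, not the authors' code: the module name `AttentionAlignmentModule`, the softmax-weighted mixture over teacher attention maps, and the KL-based matching loss are assumptions chosen to make the idea concrete.

```python
# Minimal sketch (not the authors' implementation): learn a dense mapping from
# all teacher attention maps to each student attention map, then match the
# student to the aligned teacher attention with a KL-divergence loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAlignmentModule(nn.Module):
    """Hypothetical alignment module: each student attention map is matched
    against a learned convex combination of all teacher attention maps, so the
    student-teacher head correspondence is trained rather than hand-picked."""

    def __init__(self, n_teacher_maps: int, n_student_maps: int):
        super().__init__()
        # One logit per (student map, teacher map) pair.
        self.logits = nn.Parameter(torch.zeros(n_student_maps, n_teacher_maps))

    def forward(self, teacher_attn: torch.Tensor, student_attn: torch.Tensor) -> torch.Tensor:
        # teacher_attn: [B, n_teacher_maps, Q, K] attention probabilities
        #               (all teacher layers/heads stacked along dim 1)
        # student_attn: [B, n_student_maps, Q, K] attention probabilities
        w = self.logits.softmax(dim=-1)                       # [S_maps, T_maps]
        aligned = torch.einsum("st,btqk->bsqk", w, teacher_attn)
        # KL(aligned teacher || student): pull each student head toward its
        # learned mixture of teacher heads.
        return F.kl_div(student_attn.clamp_min(1e-9).log(), aligned,
                        reduction="batchmean")


# Example usage with dummy attention tensors (teacher: 12 layers x 4 heads,
# student: 3 layers x 4 heads -- purely illustrative sizes).
if __name__ == "__main__":
    B, Q, K = 2, 10, 10
    teacher = torch.rand(B, 48, Q, K).softmax(dim=-1)
    student = torch.rand(B, 12, Q, K).softmax(dim=-1)
    aam = AttentionAlignmentModule(n_teacher_maps=48, n_student_maps=12)
    print(aam(teacher, student))  # scalar alignment loss; gradients flow to aam.logits
```

Because the mixing weights are trained jointly with the student, the question of which teacher layer or head each student head should imitate becomes part of the optimization, which is the "learning problem" framing described in the abstract.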
Authors:
- Heegon Jin
- Seonil Son
- Jemin Park
- Youngseok Kim
- Hyungjong Noh
- Yeonsoo Lee