Heterogeneous Encoders Scaling In The Transformer For Neural Machine Translation (2312.15872v1)
Abstract: Although the Transformer is currently the best-performing architecture in the homogeneous configuration (self-attention only) for Neural Machine Translation, many state-of-the-art models in Natural Language Processing combine several different Deep Learning approaches. However, these models often combine only a couple of techniques, and it is unclear why some methods are chosen over others. In this work, we investigate the effectiveness of integrating an increasing number of heterogeneous methods. Based on a simple combination strategy and performance-driven synergy criteria, we design the Multi-Encoder Transformer, which consists of up to five diverse encoders. Results show that our approach can improve translation quality across a variety of languages and dataset sizes, and it is particularly effective for low-resource languages, where we observed a maximum increase of 7.16 BLEU compared to the single-encoder model.
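As a rough illustration of the idea only, the sketch below combines the outputs of several heterogeneous encoders (here a self-attention, a convolutional, and a recurrent encoder) by simple averaging before feeding a standard Transformer decoder. The specific encoder choices, the averaging combination, and all module names are assumptions made for this sketch; the paper's actual combination strategy, synergy criteria, and set of up to five encoders are described in the full text.

```python
import torch
import torch.nn as nn

class MultiEncoderTransformerSketch(nn.Module):
    """Illustrative mock-up of a multi-encoder Transformer: several
    heterogeneous encoders are run in parallel and their outputs are
    merged (here by averaging) into a single memory for the decoder.
    This is NOT the paper's exact architecture, only a hedged sketch."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Three example heterogeneous encoders (the paper uses up to five).
        self.encoders = nn.ModuleList([
            nn.TransformerEncoder(                          # self-attention encoder
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
                num_layers),
            nn.Sequential(                                  # convolutional encoder
                nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
                nn.ReLU()),
            nn.GRU(d_model, d_model, num_layers=1, batch_first=True),  # recurrent encoder
        ])
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def encode(self, src_tokens):
        x = self.embed(src_tokens)                          # (batch, src_len, d_model)
        outputs = []
        for enc in self.encoders:
            if isinstance(enc, nn.GRU):
                h, _ = enc(x)                               # GRU returns (output, hidden)
            elif isinstance(enc, nn.Sequential):
                # Conv1d expects (batch, channels, length), so transpose in and out.
                h = enc(x.transpose(1, 2)).transpose(1, 2)
            else:
                h = enc(x)
            outputs.append(h)
        # Simple combination: average the encoder outputs (an assumption here).
        return torch.stack(outputs, dim=0).mean(dim=0)

    def forward(self, src_tokens, tgt_tokens):
        memory = self.encode(src_tokens)
        tgt = self.embed(tgt_tokens)
        return self.out(self.decoder(tgt, memory))          # per-token vocabulary logits
```

Averaging is used here only because it keeps the decoder interface identical to the single-encoder Transformer; other merge strategies (concatenation, gating, stacked cross-attention) would fit the same skeleton.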
- Jia Cheng Hu
- Roberto Cavicchioli
- Giulia Berardinelli
- Alessandro Capotondi