Rewiring the Transformer with Depth-Wise LSTMs (2007.06257v2)
Abstract: Stacking non-linear layers allows deep neural networks to model complicated functions, and including residual connections in Transformer layers is beneficial for convergence and performance. However, residual connections may make the model "forget" distant layers and fail to fuse information from previous layers effectively. Selectively managing the representation aggregation of Transformer layers may lead to better performance. In this paper, we present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers. We show that the layer normalization and feed-forward computation within a Transformer layer can be absorbed into the depth-wise LSTMs that connect pure Transformer attention layers. Our experiments with the 6-layer Transformer show significant BLEU improvements on both the WMT 14 English-German and English-French tasks and on the OPUS-100 many-to-many multilingual NMT task, and our deep Transformer experiments demonstrate the effectiveness of depth-wise LSTMs for the convergence and performance of deep Transformers.
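
To make the rewiring described in the abstract concrete, below is a minimal PyTorch sketch of the idea: the depth-wise LSTM's hidden and cell states flow across layer depth rather than across time steps, its gating plays the role of the residual connection, and its non-linear candidate transform stands in for the feed-forward sub-layer. This is an illustrative reading of the abstract, not the authors' implementation; the module and parameter names, the single cell shared across depth, the state initialization, and the placement of layer normalization are all assumptions.

```python
# Minimal sketch (assumed structure, not the paper's code): a Transformer encoder
# stack where residual connection + feed-forward sub-layer are replaced by a
# depth-wise LSTM cell whose "time" dimension is layer depth.
import torch
import torch.nn as nn


class DepthWiseLSTMEncoder(nn.Module):
    def __init__(self, num_layers: int, d_model: int, nhead: int, dropout: float = 0.1):
        super().__init__()
        # One self-attention sub-layer per depth step.
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
             for _ in range(num_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_layers)])
        # A single LSTM cell shared across depth (an illustrative choice) connects
        # consecutive layers: its gates decide how much of the shallower-layer state
        # to keep, and its non-linear transform replaces the feed-forward sub-layer.
        self.depth_cell = nn.LSTMCell(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). Flatten tokens so every position carries
        # its own depth-wise hidden and cell state.
        bsz, seq_len, d_model = x.shape
        h = x.reshape(bsz * seq_len, d_model)  # depth-wise hidden state (init: embeddings)
        c = torch.zeros_like(h)                # depth-wise cell state (init: zeros, assumed)
        for attn, norm in zip(self.attn_layers, self.norms):
            inp = norm(h.reshape(bsz, seq_len, d_model))
            attn_out, _ = attn(inp, inp, inp, need_weights=False)
            attn_out = self.dropout(attn_out).reshape(bsz * seq_len, d_model)
            # Depth-wise LSTM step: fuse the new attention output with the states
            # carried over from shallower layers.
            h, c = self.depth_cell(attn_out, (h, c))
        return h.reshape(bsz, seq_len, d_model)


if __name__ == "__main__":
    enc = DepthWiseLSTMEncoder(num_layers=6, d_model=512, nhead=8)
    out = enc(torch.randn(2, 10, 512))
    print(out.shape)  # torch.Size([2, 10, 512])
```

Whether the LSTM parameters are shared across depth, and how decoder cross-attention sub-layers are wired in, are design details not recoverable from the abstract alone; the sketch only shows the encoder-side rewiring.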
Authors: Hongfei Xu, Yang Song, Qiuhui Liu, Josef van Genabith, Deyi Xiong