LegoNN: Building Modular Encoder-Decoder Models (2206.03318v2)
Abstract: State-of-the-art encoder-decoder models (e.g., for machine translation (MT) or automatic speech recognition (ASR)) are constructed and trained end-to-end as an atomic unit. No component of the model can be (re-)used without the others, making it impossible to share parts, e.g. a high-resourced decoder, across tasks. We describe LegoNN, a procedure for building encoder-decoder architectures so that their parts can be applied to other tasks without any fine-tuning. To achieve this reusability, the interface between encoder and decoder modules is grounded in a sequence of marginal distributions over a pre-defined discrete vocabulary. We present two approaches for ingesting these marginals: one is differentiable, allowing gradients to flow across the entire network; the other is gradient-isolating. To enable the portability of decoder modules between MT tasks with different source languages and across other tasks such as ASR, we introduce a modality-agnostic encoder with a length-control mechanism that dynamically adapts the encoder's output length to match the expected input-length range of pre-trained decoders. We present several experiments to demonstrate the effectiveness of LegoNN models: a language-generation LegoNN decoder module trained on the German-English (De-En) MT task can be reused without any fine-tuning for the Europarl English ASR and the Romanian-English (Ro-En) MT tasks, matching or beating the performance of the baselines. After fine-tuning, LegoNN models improve the Ro-En MT task by 1.5 BLEU points and achieve a 12.5% relative WER reduction on the Europarl ASR task. To show how the approach generalizes, we compose a LegoNN ASR model from three modules, each learned within a different end-to-end trained model on a different dataset, achieving an overall WER reduction of 19.5%.
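The abstract's key idea is that modules communicate through per-position marginal distributions over a shared vocabulary rather than opaque hidden vectors. A minimal NumPy sketch of the two ingestion styles described above, under our own assumptions (function names and shapes are illustrative, not the paper's API): the differentiable path embeds each position as the expectation of token embeddings under the marginal, while the gradient-isolating path collapses each marginal to its argmax token before embedding, cutting the gradient path back to the encoder.

```python
import numpy as np

def ingest_marginals_differentiable(marginals, embedding):
    """Differentiable ingestion: each position's input embedding is the
    expectation of token embeddings under the encoder's marginal
    distribution, so (in an autodiff framework) gradients would flow
    back through `marginals` to the encoder."""
    # marginals: (T, V), rows summing to 1; embedding: (V, d)
    return marginals @ embedding  # (T, d)

def ingest_marginals_gradient_isolated(marginals, embedding):
    """Gradient-isolating ingestion: collapse each marginal to its most
    probable token and embed that token, severing the gradient path."""
    ids = marginals.argmax(axis=-1)  # (T,) hard token choices
    return embedding[ids]            # (T, d)

# Toy example: T positions, vocabulary of size V, embedding dim d.
rng = np.random.default_rng(0)
T, V, d = 4, 10, 8
logits = rng.normal(size=(T, V))
marginals = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
E = rng.normal(size=(V, d))

soft = ingest_marginals_differentiable(marginals, E)
hard = ingest_marginals_gradient_isolated(marginals, E)
assert soft.shape == hard.shape == (T, d)
```

Because the interface is a distribution over a fixed, pre-defined vocabulary, any decoder trained against that vocabulary can consume the output of any encoder producing it, which is what makes the modules swappable across MT and ASR tasks.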