Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing (2309.15826v1)
Abstract: Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing that leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose an ST/MT multi-tasking framework with hard parameter sharing, in which all model parameters are shared cross-modally. Our method reduces the speech-text modality gap via a pre-processing stage that converts speech and text inputs into two discrete token sequences of similar length -- this allows models to process both modalities indiscriminately, simply using a joint vocabulary. With experiments on MuST-C, we demonstrate that our multi-tasking framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU without any external MT data. Further, we show that this framework incorporates external MT data, yielding +0.8 BLEU, and also improves transfer learning from pre-trained textual models, yielding +1.8 BLEU.
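The core idea in the abstract — mapping speech and text into discrete token sequences of similar length over a joint vocabulary so one set of hard-shared parameters can consume either modality — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the cluster IDs stand in for speech units (e.g., from k-means over self-supervised features), the subwords stand in for SentencePiece output, and the names `deduplicate` and `JointVocab` are assumptions made here for clarity.

```python
from dataclasses import dataclass, field


def deduplicate(units):
    """Collapse consecutive repeats of speech cluster IDs (run-length
    deduplication), shortening the sequence toward text-like lengths."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


@dataclass
class JointVocab:
    """A single vocabulary covering both speech units and text subwords,
    so one shared model can embed tokens from either modality."""
    token_to_id: dict = field(default_factory=dict)

    def _add(self, token):
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.token_to_id)
        return self.token_to_id[token]

    def encode_speech(self, cluster_ids):
        # Wrap speech units in distinct symbols so they never collide
        # with text subwords inside the joint vocabulary.
        return [self._add(f"<unit_{c}>") for c in deduplicate(cluster_ids)]

    def encode_text(self, subwords):
        return [self._add(sw) for sw in subwords]


vocab = JointVocab()
# Hypothetical cluster IDs for a short utterance: 10 frames collapse to 4 units.
speech_ids = vocab.encode_speech([5, 5, 5, 12, 12, 7, 7, 7, 7, 3])
# Hypothetical subword tokenization of the paired text.
text_ids = vocab.encode_text(["▁hel", "lo", "▁world"])
print(speech_ids)  # 4 tokens instead of 10 frames
print(text_ids)
```

After this pre-processing, both sequences are just integer token streams of comparable length, so ST and MT examples can be mixed in one training batch for the same encoder-decoder.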
Authors: Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe