Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing (2309.15826v1)

Published 27 Sep 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose an ST/MT multi-tasking framework with hard parameter sharing in which all model parameters are shared cross-modally. Our method reduces the speech-text modality gap via a pre-processing stage which converts speech and text inputs into two discrete token sequences of similar length -- this allows models to indiscriminately process both modalities simply using a joint vocabulary. With experiments on MuST-C, we demonstrate that our multi-tasking framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU without any external MT data. Further, we show that this framework incorporates external MT data, yielding +0.8 BLEU, and also improves transfer learning from pre-trained textual models, yielding +1.8 BLEU.
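
To make the pre-processing idea in the abstract concrete, here is a minimal, illustrative sketch (not the paper's code): speech frames are mapped to discrete units via k-means over dummy stand-in features (a real system would cluster self-supervised features, e.g. HuBERT-style), text is mapped to subword-like tokens with a toy whitespace tokenizer standing in for SentencePiece, and both sequences are re-indexed into one joint vocabulary so a single shared model could consume either modality. All helper names, the toy tokenizer, and the dummy features are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

N_SPEECH_UNITS = 8          # size of the discrete speech-unit inventory (toy value)
rng = np.random.default_rng(0)

# --- speech side: feature frames -> discrete unit IDs -> collapse repeats ---
def speech_to_units(features: np.ndarray, kmeans: KMeans) -> list[int]:
    """Assign each frame to its nearest cluster, then collapse consecutive
    repeats so the unit sequence is closer in length to a text sequence."""
    frame_units = kmeans.predict(features)
    deduped = [int(frame_units[0])]
    for u in frame_units[1:]:
        if int(u) != deduped[-1]:
            deduped.append(int(u))
    return deduped

# --- text side: toy whitespace tokenizer (stand-in for a subword model) ---
def build_text_vocab(sentences: list[str]) -> dict[str, int]:
    vocab: dict[str, int] = {}
    for s in sentences:
        for tok in s.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def text_to_tokens(sentence: str, vocab: dict[str, int]) -> list[int]:
    return [vocab[tok] for tok in sentence.lower().split()]

# --- joint vocabulary: speech units occupy [0, K), text tokens are offset by K ---
def to_joint_ids(ids: list[int], modality: str, n_units: int) -> list[int]:
    offset = 0 if modality == "speech" else n_units
    return [i + offset for i in ids]

if __name__ == "__main__":
    # Dummy 20-frame, 16-dim feature matrix standing in for SSL speech features.
    feats = rng.normal(size=(20, 16))
    kmeans = KMeans(n_clusters=N_SPEECH_UNITS, n_init=10, random_state=0).fit(feats)

    sentences = ["hello world", "speech translation with shared parameters"]
    text_vocab = build_text_vocab(sentences)

    speech_ids = to_joint_ids(speech_to_units(feats, kmeans), "speech", N_SPEECH_UNITS)
    text_ids = to_joint_ids(text_to_tokens(sentences[0], text_vocab), "text", N_SPEECH_UNITS)

    # Both sequences now live in one ID space of size K + |text vocab|, so a
    # single encoder-decoder could be trained on either input stream.
    print("speech token IDs:", speech_ids)
    print("text token IDs:  ", text_ids)

Under this view, the multi-tasking framework described above reduces to training one model on interleaved ST and MT examples drawn from the same token space, which is what enables hard parameter sharing without secondary encoders.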

Authors (5)
  1. Brian Yan (40 papers)
  2. Xuankai Chang (61 papers)
  3. Antonios Anastasopoulos (111 papers)
  4. Yuya Fujita (16 papers)
  5. Shinji Watanabe (416 papers)