
Pushing the Limits of Zero-shot End-to-End Speech Translation (2402.10422v2)

Published 16 Feb 2024 in cs.CL

Abstract: Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems, thus hindering their performance. Prior work has attempted to mitigate these challenges by leveraging external MT data and optimizing distance metrics that bring closer the speech-text representations. However, achieving competitive results typically requires some ST data. For this reason, we introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data. Leveraging a novel CTC compression and Optimal Transport, we train a speech encoder using only ASR data, to align with the representation space of a massively multilingual MT model. The speech encoder seamlessly integrates with the MT model at inference, enabling direct translation from speech to text, across all languages supported by the MT model. Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority over not only previous zero-shot models, but also supervised ones, achieving state-of-the-art results.

Bridging Speech and Text: A Zero-Shot Approach to End-to-End Speech Translation

Introduction to ZeroSwot

End-to-end models have become a focus of Speech Translation (ST) research because they are more compact than cascade systems and avoid propagating recognition errors into the translation step. Two obstacles remain: parallel ST corpora are scarce, and speech and text representations differ (the modality gap). The paper introduces ZeroSwot, a method for zero-shot ST that aligns a speech encoder with the representation space of a pre-trained, massively multilingual Machine Translation (MT) model.

Addressing Data Scarcity and Modality Gap

ZeroSwot arrives as end-to-end approaches increasingly replace the conventional cascade of speech recognition followed by MT. End-to-end models, however, normally depend on parallel ST data. ZeroSwot sidesteps that requirement by training only on Automatic Speech Recognition (ASR) data and reusing an existing MT model.

The method combines Connectionist Temporal Classification (CTC) compression with Optimal Transport to map speech representations into the embedding space of the target MT model. No ST data is needed, and the paper reports state-of-the-art results on MuST-C and CoVoST, ahead of both previous zero-shot models and supervised ones.
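As a rough illustration of what CTC-based compression does in general, the sketch below averages consecutive frames that share the same predicted CTC label and discards blank frames, so a long frame sequence shrinks to roughly one vector per predicted symbol. This is a generic version of the idea, not the paper's specific variant; the function name `ctc_compress`, the `blank` index, and the toy inputs are assumptions made for this example.

```python
import numpy as np


def ctc_compress(frames, labels, blank=0):
    """Average runs of frames with the same CTC label and drop blank runs.

    frames: (T, d) array of frame-level representations.
    labels: length-T sequence of per-frame argmax CTC labels.
    Returns the compressed (N, d) array and the N kept labels.
    """
    frames = np.asarray(frames, dtype=float)
    labels = np.asarray(labels)
    if frames.ndim != 2 or len(frames) != len(labels):
        raise ValueError("frames must be (T, d) and labels must have length T")

    compressed, kept = [], []
    start = 0
    for t in range(1, len(labels) + 1):
        # A run ends at the end of the sequence or when the label changes.
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != blank:
                compressed.append(frames[start:t].mean(axis=0))
                kept.append(int(labels[start]))
            start = t

    if compressed:
        return np.stack(compressed), kept
    return np.zeros((0, frames.shape[1])), kept


# Toy usage: six frames collapse to two vectors (labels 2 and 3).
toy_frames = [[1, 1], [3, 3], [0, 0], [5, 5], [7, 7], [9, 9]]
toy_labels = [2, 2, 0, 3, 3, 3]
vectors, symbols = ctc_compress(toy_frames, toy_labels)
```

Averaging is only one possible pooling choice; the point is that the output length follows the predicted symbols rather than the audio duration, which brings it closer to the length of a text sequence.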

Technical Insights

ZeroSwot addresses the modality gap through its model architecture and its training procedure:

  • Model Architecture: ZeroSwot uses a dual-branch design with a speech branch and a text branch. The speech branch maps the speech signal to embeddings that lie close to the text branch's embeddings of the paired text. It encodes the audio with wav2vec 2.0, then applies a CTC-based compression mechanism and a novel compression adapter so that its output is compatible with the MT model's subword tokenization.
  • Optimal Transport for Modality Bridging: During training, the method uses Optimal Transport to minimize the Wasserstein distance between the speech and text representations. This step aligns the high-dimensional representations of the two modalities.
  • Zero-Shot ST Inference: At inference, the trained speech encoder replaces the embedding layer of the MT model, enabling direct translation from speech to text across all languages supported by the MT model.
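To make the Optimal Transport step more concrete, here is a minimal sketch of an entropy-regularised Wasserstein distance between a sequence of speech vectors and a sequence of text vectors, computed with Sinkhorn iterations. It assumes uniform mass on every position and a squared Euclidean cost; the names `sinkhorn_distance`, `eps`, and `n_iters` are hypothetical, and the paper's exact loss and hyperparameters may differ.

```python
import numpy as np


def sinkhorn_distance(speech, text, eps=0.1, n_iters=200):
    """Approximate the OT cost between two sets of vectors.

    speech: (n, d) array, text: (m, d) array. n and m may differ.
    Returns the transport cost (in squared-distance units) and the plan.
    """
    speech = np.asarray(speech, dtype=float)
    text = np.asarray(text, dtype=float)

    # Pairwise squared Euclidean cost between every speech/text position.
    cost = ((speech[:, None, :] - text[None, :, :]) ** 2).sum(axis=-1)

    # Rescale by the largest cost so kernel entries stay >= exp(-1/eps).
    scale = cost.max()
    if scale == 0:
        scale = 1.0
    kernel = np.exp(-(cost / scale) / eps)

    # Uniform mass on each position (an assumption of this sketch).
    a = np.full(len(speech), 1.0 / len(speech))
    b = np.full(len(text), 1.0 / len(text))

    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)

    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan


# Toy usage: aligned sequences are much closer than shifted ones.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
near, _ = sinkhorn_distance(points, points)
far, _ = sinkhorn_distance(points, points + 5.0)
```

In a training setup this quantity would be computed on differentiable tensors and minimised with respect to the speech encoder's parameters; the NumPy version only shows how the distance itself is obtained.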

Experiments and Results

ZeroSwot is evaluated on MuST-C and CoVoST, where it surpasses existing zero-shot models and also outperforms supervised ST models in most of the languages tested. It also performs well in massively multilingual ST, and retrieval experiments indicate that it effectively bridges the modality gap.
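One simple way to picture a cross-modal retrieval check is the sketch below: each utterance and each sentence is reduced to a single vector, and an utterance counts as correctly retrieved when its most similar sentence (by cosine similarity) is its own paired text. The pooling to one vector, the cosine metric, and the name `retrieval_accuracy` are assumptions for illustration; the summary does not describe the exact protocol used in the paper.

```python
import numpy as np


def retrieval_accuracy(speech_vecs, text_vecs):
    """Top-1 accuracy of retrieving the paired text for each utterance.

    speech_vecs, text_vecs: (n, d) arrays where row i of each is a pair.
    """
    s = np.asarray(speech_vecs, dtype=float)
    t = np.asarray(text_vecs, dtype=float)

    # L2-normalise rows so the dot product equals cosine similarity.
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)

    predicted = (s @ t.T).argmax(axis=1)
    return float((predicted == np.arange(len(s))).mean())


# Toy usage: slightly perturbed copies of the text vectors.
text_side = np.eye(3)
speech_side = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.7]])
accuracy = retrieval_accuracy(speech_side, text_side)
```

A higher accuracy means the speech vectors land nearer to their own text than to other sentences, which is the intuition behind using retrieval as evidence that the two modalities share a space.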

The Path Ahead

ZeroSwot shows that competitive ST is possible without paired ST data, which addresses both data scarcity and the modality gap. This result suggests that zero-shot approaches may apply more broadly in natural language processing. Low-resource and spoken-only languages remain an open direction for ST research, one that methods such as ZeroSwot could help advance.

Authors (4)
  1. Ioannis Tsiamas (12 papers)
  2. Gerard I. Gállego (19 papers)
  3. José A. R. Fonollosa (23 papers)
  4. Marta R. Costa-jussà (73 papers)