
WACO: Word-Aligned Contrastive Learning for Speech Translation (2212.09359v3)

Published 19 Dec 2022 in cs.CL, cs.SD, and eess.AS

Abstract: End-to-end Speech Translation (E2E ST) aims to directly translate source speech into target text. Existing ST methods perform poorly when only an extremely small amount of parallel speech-text data is available for training. We observe that an ST model's performance closely correlates with the embedding similarity between its speech and source-transcript representations. In this paper, we propose Word-Aligned COntrastive learning (WACO), a simple and effective method for extremely low-resource speech-to-text translation. Our key idea is to bridge word-level representations of the speech and text modalities via contrastive learning. We evaluate WACO and other methods on the MuST-C dataset, a widely used ST benchmark, and on the low-resource Maltese-English direction from IWSLT 2023. Our experiments demonstrate that WACO outperforms the best baseline by more than 9 BLEU points with only 1 hour of parallel ST data. Code is available at https://github.com/owaski/WACO.
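
The abstract's core mechanism, pulling together word-level embeddings from the speech and text modalities with a contrastive objective, can be illustrated with a short sketch. The sketch below is one plausible reading under stated assumptions (mean pooling over word spans, an InfoNCE loss with in-batch negatives, temperature 0.1); the span format and pooling here are illustrative, not the authors' exact implementation, which lives in the linked repository.

```python
# A minimal PyTorch sketch of word-aligned contrastive learning as the
# abstract describes it: pool frame-level speech features and token-level
# text features into per-word embeddings, then pull matching words together
# with an InfoNCE loss. Mean pooling, the span format, and the temperature
# are assumptions for illustration, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def pool_words(features, spans):
    """Mean-pool features over each word's [start, end) span."""
    return torch.stack([features[s:e].mean(dim=0) for s, e in spans])

def word_aligned_contrastive_loss(speech_frames, text_tokens,
                                  speech_spans, text_spans, temperature=0.1):
    """Each speech word should be most similar to its own transcript word
    among all words in the batch (InfoNCE with in-batch negatives)."""
    s = F.normalize(pool_words(speech_frames, speech_spans), dim=-1)
    t = F.normalize(pool_words(text_tokens, text_spans), dim=-1)
    logits = s @ t.T / temperature      # pairwise cosine similarities
    targets = torch.arange(s.size(0))   # diagonal holds the positive pairs
    return F.cross_entropy(logits, targets)

# Toy example: 3 words spanning 50 speech frames and 5 subword tokens.
speech = torch.randn(50, 256, requires_grad=True)
text = torch.randn(5, 256, requires_grad=True)
loss = word_aligned_contrastive_loss(
    speech, text,
    speech_spans=[(0, 15), (15, 32), (32, 50)],  # e.g. from a forced aligner
    text_spans=[(0, 2), (2, 3), (3, 5)],         # subword-to-word grouping
)
loss.backward()
```

The diagonal of the similarity matrix holds the matched speech-text word pairs, so minimizing cross-entropy against identity targets directly increases the cross-modal embedding similarity that the abstract reports as correlating with ST performance.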

Authors (3)
  1. Siqi Ouyang (15 papers)
  2. Rong Ye (20 papers)
  3. Lei Li (1293 papers)
Citations (22)