End-to-End Simultaneous Speech Translation with Differentiable Segmentation (2305.16093v2)

Published 25 May 2023 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: End-to-end simultaneous speech translation (SimulST) outputs translation while receiving streaming speech inputs (a.k.a. streaming speech translation), and hence needs to segment the speech inputs and then translate based on the speech received so far. However, segmenting the speech inputs at unfavorable moments can disrupt acoustic integrity and adversely affect the performance of the translation model. Therefore, learning to segment the speech inputs at moments that help the translation model produce high-quality translation is the key to SimulST. Existing SimulST methods, which use either fixed-length segmentation or an external segmentation model, separate segmentation from the underlying translation model; this gap yields segmentation outcomes that are not necessarily beneficial to the translation process. In this paper, we propose Differentiable Segmentation (DiSeg) for SimulST to learn segmentation directly from the underlying translation model. DiSeg makes hard segmentation differentiable through the proposed expectation training, enabling it to be trained jointly with the translation model and thereby learn translation-beneficial segmentation. Experimental results demonstrate that DiSeg achieves state-of-the-art performance and exhibits superior segmentation capability.
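The abstract's core idea, making a hard segmentation decision differentiable so it can be trained jointly with the translation loss, can be illustrated with a minimal toy sketch. This is not the paper's actual DiSeg formulation; the gate parameters (`w`, `b`), the frame shapes, and the target token count of 3 are all illustrative assumptions. The sketch uses a sigmoid gate per speech frame: the running sum of boundary probabilities acts as a soft segment index, and the expected segment count can be matched to the number of target tokens, so gradients from a downstream loss can reach the segmentation decision.

```python
import numpy as np

def soft_segment(frames, w, b):
    """Toy differentiable-segmentation sketch (hypothetical, not DiSeg itself).

    Each frame receives a boundary probability from a sigmoid gate; the
    cumulative sum of those probabilities serves as a *soft* segment index,
    so a downstream (translation) loss could backpropagate into the gate.
    """
    logits = frames @ w + b               # per-frame boundary scores
    p = 1.0 / (1.0 + np.exp(-logits))     # boundary probabilities in (0, 1)
    soft_index = np.cumsum(p)             # differentiable segment index per frame
    expected_segments = p.sum()           # expected number of segments
    return p, soft_index, expected_segments

rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4))     # 10 speech frames, 4-dim features (toy sizes)
w, b = rng.standard_normal(4), 0.0        # hypothetical gate parameters
p, idx, n = soft_segment(frames, w, b)

# An expectation-style auxiliary loss in the spirit of the abstract:
# pull the expected segment count toward the target token count (here, 3).
loss_seg = (n - 3.0) ** 2
```

Because every step is a smooth function of the gate parameters, minimizing such a loss adjusts *where* boundaries tend to fall, which is the property that lets segmentation be learned from the translation model rather than fixed in advance.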

Authors (2)
  1. Shaolei Zhang (36 papers)
  2. Yang Feng (231 papers)
Citations (15)
