Improving Speech Translation Accuracy and Time Efficiency with Fine-tuned wav2vec 2.0-based Speech Segmentation (2304.12659v2)
Abstract: Speech translation (ST) automatically converts utterances in a source language into text in another language. Splitting continuous speech into shorter segments, known as speech segmentation, plays an important role in ST. Recent segmentation methods trained to mimic the segmentation of ST corpora have surpassed traditional approaches. Tsiamas et al. proposed a segmentation frame classifier (SFC) based on a pre-trained speech encoder called wav2vec 2.0. Their method, named SHAS, retains 95-98% of the BLEU score achieved with the ST corpus segmentation. However, the segments generated by SHAS differ substantially from the ST corpus segmentation and tend to be longer, combining multiple utterances. This is due to SHAS's reliance on length heuristics: it splits speech into segments of easily translatable length without fully considering the potential for ST improvement by splitting them into even shorter segments. Longer segments often degrade both translation quality and the time efficiency of ST. In this study, we extended SHAS to improve ST accuracy and time efficiency by splitting speech into shorter segments that correspond to sentences. We introduced a simple segmentation algorithm that uses the moving average of SFC predictions without relying on length heuristics, and we explored wav2vec 2.0 fine-tuning for improved speech segmentation prediction. Our experimental results reveal that our speech segmentation method significantly improved the quality and the time efficiency of speech translation compared to SHAS.
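The moving-average idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the SFC yields one probability per audio frame that the frame lies inside a segment, smooths those probabilities with a trailing moving average, and cuts wherever the smoothed value falls below a threshold. The function names, the trailing window, the window size, and the 0.5 threshold are illustrative assumptions, not the algorithm or settings reported in the paper.

```python
from typing import List, Tuple


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average; early frames average over the frames seen so far."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the frame that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def segment_frames(
    probs: List[float], window: int = 3, threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive.

    A frame is kept when the smoothed probability is >= threshold
    (hypothetical rule; no length heuristic is applied).
    """
    smoothed = moving_average(probs, window)
    segments = []
    start = None
    for i, p in enumerate(smoothed):
        inside = p >= threshold
        if inside and start is None:
            start = i  # a segment opens here
        elif not inside and start is not None:
            segments.append((start, i))  # the segment closes before frame i
            start = None
    if start is not None:
        segments.append((start, len(smoothed)))
    return segments


if __name__ == "__main__":
    # Hypothetical frame-level predictions: speech, a pause, speech again.
    probs = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
    print(segment_frames(probs, window=3, threshold=0.5))
```

Because the cut depends on the average over several frames rather than on any single frame, one isolated low-probability frame does not split a segment, while a sustained pause does. Frame indices can be converted to seconds by multiplying by the frame duration of the encoder in use.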
- I. Tsiamas, G. I. Gállego, J. A. R. Fonollosa, and M. R. Costa-jussà, “SHAS: Approaching optimal Segmentation for End-to-End Speech Translation,” in Proc. Interspeech 2022, 2022, pp. 106–110.
- M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, “MuST-C: a Multilingual Speech Translation Corpus,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 2012–2017. [Online]. Available: https://aclanthology.org/N19-1202
- D. Wan, C. Kedzie, F. Ladhak, E. Turcan, P. Galuščáková, E. Zotkina, Z. P. Jiang, P. Bell, and K. McKeown, “Segmenting subtitles for correcting ASR segmentation errors,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 2842–2854.
- M. Sinclair, P. Bell, A. Birch, and F. McInnes, “A semi-Markov model for speech segmentation with an utterance-break prior,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
- M. Gaido, M. Negri, M. Cettolo, and M. Turchi, “Beyond voice activity detection: Hybrid audio segmentation for direct speech translation,” in Proceedings of the Fourth International Conference on Natural Language and Speech Processing (ICNLSP 2021). Trento, Italy: Association for Computational Linguistics, 12–13 Nov. 2021, pp. 55–62. [Online]. Available: https://aclanthology.org/2021.icnlsp-1.7
- M. Paulik, S. Rao, I. Lane, S. Vogel, and T. Schultz, “Sentence segmentation and punctuation recovery for spoken language translation,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 5105–5108.
- E. Cho, J. Niehues, and A. Waibel, “Segmentation and punctuation prediction in speech language translation using a monolingual translation system,” in Proceedings of the 9th International Workshop on Spoken Language Translation: Papers, Hong Kong, Dec. 6-7 2012, pp. 252–259. [Online]. Available: https://aclanthology.org/2012.iwslt-papers.15
- A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, 2020.
- S. Mansour, “MorphTagger: HMM-based Arabic segmentation for statistical machine translation,” in Proceedings of the 7th International Workshop on Spoken Language Translation: Papers, 2010.
- T. Nguyen and S. Vogel, “Context-based Arabic morphological analysis for machine translation,” in CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning. Manchester, England: Coling 2008 Organizing Committee, Aug. 2008, pp. 135–142. [Online]. Available: https://aclanthology.org/W08-2118
- W. Lu and H. T. Ng, “Better punctuation prediction with dynamic conditional random fields,” in Proceedings of the 2010 conference on empirical methods in natural language processing, 2010, pp. 177–186.
- M. Diab, K. Hacioglu, and D. Jurafsky, “Automatic tagging of Arabic text: From raw text to base phrase chunks,” in Proceedings of HLT-NAACL 2004: Short Papers. Boston, Massachusetts, USA: Association for Computational Linguistics, May 2 - May 7 2004, pp. 149–152. [Online]. Available: https://aclanthology.org/N04-4038
- F. Sadat and N. Habash, “Combination of Arabic preprocessing schemes for statistical machine translation,” in Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia: Association for Computational Linguistics, Jul. 2006, pp. 1–8. [Online]. Available: https://aclanthology.org/P06-1001
- E. Matusov, D. Hillard, M. Magimai-Doss, D. Hakkani-Tur, M. Ostendorf, and H. Ney, “Improving speech translation with automatic boundary prediction,” in Proceedings of Interspeech 2007, 2007, pp. 2449–2452.
- V. K. Rangarajan Sridhar, J. Chen, S. Bangalore, A. Ljolje, and R. Chengalvarayan, “Segmentation strategies for streaming speech translation,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, Georgia: Association for Computational Linguistics, Jun. 2013, pp. 230–238. [Online]. Available: https://aclanthology.org/N13-1023
- M. Gaido, M. Negri, M. Cettolo, and M. Turchi, “Beyond voice activity detection: Hybrid audio segmentation for direct speech translation,” CoRR, vol. abs/2104.11710, 2021. [Online]. Available: https://arxiv.org/abs/2104.11710
- H. Inaguma, B. Yan, S. Dalmia, P. Guo, J. Shi, K. Duh, and S. Watanabe, “ESPnet-ST IWSLT 2021 offline speech translation system,” in Proceedings of the 18th International Conference on Spoken Language Translation. Bangkok, Thailand (online): Association for Computational Linguistics, Aug. 2021, pp. 100–109. [Online]. Available: https://aclanthology.org/2021.iwslt-1.10
- G. I. Gállego, I. Tsiamas, C. Escolano, J. A. Fonollosa, and M. R. Costa-jussà, “End-to-end speech translation with pre-trained models and adapters: UPC at IWSLT 2021,” in Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), 2021, pp. 110–119.
- T. Yoshimura, T. Hayashi, K. Takeda, and S. Watanabe, “End-to-end automatic speech recognition integrated with CTC-based voice activity detection,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6999–7003.
- E. Cho, J. Niehues, K. Kilgour, and A. Waibel, “Punctuation insertion for real-time spoken language translation,” in Proceedings of the Eleventh International Workshop on Spoken Language Translation, 2015.
- T.-L. Ha, J. Niehues, E. Cho, M. Mediani, and A. Waibel, “The KIT translation systems for IWSLT 2015,” in Proceedings of the Eleventh International Workshop on Spoken Language Translation, 2015.
- E. Cho, J. Niehues, and A. Waibel, “NMT-Based Segmentation and Punctuation Insertion for Real-Time Spoken Language Translation,” in Proceedings of Interspeech 2017, 2017, pp. 2645–2649.
- A. Stolcke and E. Shriberg, “Automatic linguistic segmentation of conversational speech,” in Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP’96, vol. 2. IEEE, 1996, pp. 1005–1008.
- X. Wang, A. Finch, M. Utiyama, and E. Sumita, “An efficient and effective online sentence segmenter for simultaneous interpretation,” in Proceedings of the 3rd Workshop on Asian Translation (WAT2016), 2016, pp. 139–148.
- X. Wang, M. Utiyama, and E. Sumita, “Online sentence segmentation for simultaneous interpretation using multi-shifted recurrent neural network,” in Proceedings of Machine Translation Summit XVII Volume 1: Research Track, 2019, pp. 1–11.
- J. Iranzo-Sánchez, A. Giménez Pastor, J. A. Silvestre-Cerdà, P. Baquero-Arnal, J. Civera Saiz, and A. Juan, “Direct segmentation models for streaming speech translation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 2599–2611. [Online]. Available: https://aclanthology.org/2020.emnlp-main.206
- R. Fukuda, K. Sudoh, and S. Nakamura, “Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation,” in Proc. Interspeech 2022, 2022, pp. 121–125.
- N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” in International Conference on Machine Learning. PMLR, 2019, pp. 2790–2799.
- J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, “Towards a unified view of parameter-efficient transfer learning,” in International Conference on Learning Representations, 2021.
- Y.-L. Sung, J. Cho, and M. Bansal, “LST: Ladder side-tuning for parameter and memory efficient transfer learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 12991–13005, 2022.
- K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111. [Online]. Available: https://aclanthology.org/W14-4012
- G. U. Yule, “The applications of the method of correlation to social and economic statistics,” Journal of the Royal Statistical Society, vol. 72, no. 4, pp. 721–730, 1909.
- I. Tsiamas, G. I. Gállego, C. Escolano, J. Fonollosa, and M. R. Costa-jussà, “Pretrained speech encoders and efficient fine-tuning methods for speech translation: UPC at IWSLT 2022,” in Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022). Dublin, Ireland (in-person and online): Association for Computational Linguistics, May 2022, pp. 265–276. [Online]. Available: https://aclanthology.org/2022.iwslt-1.23
- J. Iranzo-Sánchez, J. A. Silvestre-Cerdà, J. Jorge, N. Roselló, A. Giménez, A. Sanchis, J. Civera, and A. Juan, “Europarl-ST: A multilingual corpus for speech translation of parliamentary debates,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8229–8233.
- E. Matusov, G. Leusch, O. Bender, and H. Ney, “Evaluating machine translation output with automatic sentence segmentation,” in Proceedings of the Second International Workshop on Spoken Language Translation, 2005.
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040
- M. Post, “A call for clarity in reporting BLEU scores,” in Proceedings of the Third Conference on Machine Translation: Research Papers. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 186–191. [Online]. Available: https://aclanthology.org/W18-6319
- T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” arXiv preprint arXiv:1904.09675, 2019.
- T. Sellam, D. Das, and A. Parikh, “BLEURT: Learning robust metrics for text generation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 7881–7892. [Online]. Available: https://aclanthology.org/2020.acl-main.704
- A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale,” in Proc. Interspeech 2022, 2022, pp. 2278–2282.
- Y. Tang, J. Pino, X. Li, C. Wang, and D. Genzel, “Improving speech translation by understanding and learning from the auxiliary text translation task,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 4252–4261. [Online]. Available: https://aclanthology.org/2021.acl-long.328
- C. Wang, Y. Tang, X. Ma, A. Wu, D. Okhonko, and J. Pino, “Fairseq S2T: Fast speech-to-text modeling with fairseq,” in Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations. Suzhou, China: Association for Computational Linguistics, Dec. 2020, pp. 33–39. [Online]. Available: https://aclanthology.org/2020.aacl-demo.6
- Y. Liu, H. Xiong, J. Zhang, Z. He, H. Wu, H. Wang, and C. Zong, “End-to-end speech translation with knowledge distillation,” Proc. Interspeech 2019, pp. 1128–1132, 2019.
- M. Gaido, M. A. Di Gangi, M. Negri, and M. Turchi, “On knowledge distillation for direct speech translation,” Computational Linguistics CLiC-it 2020, p. 211, 2020.
- M. Gaido, M. A. Di Gangi, M. Negri, M. Cettolo, and M. Turchi, “Contextualized Translation of Automatically Segmented Speech,” in Proc. Interspeech 2020, 2020, pp. 1471–1475. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2020-2860
- B. Zhang, I. Titov, B. Haddow, and R. Sennrich, “Beyond sentence-level end-to-end speech translation: Context helps,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 2566–2578. [Online]. Available: https://aclanthology.org/2021.acl-long.200
Authors: Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura