Direct Punjabi to English speech translation using discrete units (2402.15967v1)
Abstract: Speech-to-speech translation has yet to reach the same level of coverage as text-to-text translation systems. Current speech technology covers only a small fraction of the more than 7,000 languages spoken worldwide, leaving more than half of the population without access to such technology and the shared experiences it enables. With voice-assisted technology (such as social robots and speech-to-text apps) and auditory content (such as podcasts and lectures) on the rise, ensuring that the technology is available to all is more important than ever. Speech translation can play a vital role in mitigating this technological disparity and creating a more inclusive society. To contribute to speech translation research for low-resource languages, our work presents a direct speech-to-speech translation model from Punjabi, an Indic language, to English. Additionally, we explore the performance of using a discrete representation of speech, known as discrete acoustic units, as input to a Transformer-based translation model. The model, which we call Unit-to-Unit Translation (U2UT), takes a sequence of discrete units in the source language (the language being translated from) and outputs a sequence of discrete units in the target language (the language being translated to). Our results show that the U2UT model outperforms the Speech-to-Unit Translation (S2UT) model by 3.69 BLEU.
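To make the unit-to-unit idea concrete, below is a minimal sketch of such a model, assuming the source and target speech have already been converted to sequences of discrete acoustic units (e.g., by clustering self-supervised HuBERT features with k-means). The class name `UnitToUnitTranslator`, the vocabulary sizes, and the model dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a unit-to-unit Transformer over discrete acoustic units.
# Assumes unit extraction (HuBERT + k-means) has already been done; all names,
# sizes, and hyperparameters here are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn

class UnitToUnitTranslator(nn.Module):
    def __init__(self, src_units=100, tgt_units=100, d_model=256, nhead=4, layers=3):
        super().__init__()
        self.src_embed = nn.Embedding(src_units, d_model)   # source unit IDs -> vectors
        self.tgt_embed = nn.Embedding(tgt_units, d_model)   # target unit IDs -> vectors
        # Positional encodings are omitted here for brevity.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_units)            # predict next target unit

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each target position attends only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(
            self.src_embed(src_ids), self.tgt_embed(tgt_ids), tgt_mask=tgt_mask
        )
        return self.out(hidden)                              # (batch, tgt_len, tgt_units)

# Toy usage: batches of source-side (Punjabi) and target-side (English) unit IDs.
model = UnitToUnitTranslator()
src = torch.randint(0, 100, (2, 50))   # 2 utterances, 50 source units each
tgt = torch.randint(0, 100, (2, 40))   # (shifted) target unit sequences
logits = model(src, tgt)
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), tgt.reshape(-1))
```

In practice the predicted target units would be passed to a unit-based vocoder to synthesize the output speech; the training loop, positional encodings, and decoding strategy are left out of this sketch.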