
Speech Translation with Large Language Models: An Industrial Practice (2312.13585v1)

Published 21 Dec 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Given the great success of LLMs across various tasks, we introduce LLM-ST, a novel and effective speech translation model built upon a pre-trained LLM. By integrating the LLM with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long audio inputs. Furthermore, our findings indicate that Chain-of-Thought (CoT) prompting can yield advantages in the context of LLM-ST. Through rigorous experimentation on English and Chinese datasets, we showcase the exceptional performance of LLM-ST, establishing a new benchmark in the field of speech translation. Demo: https://speechtranslation.github.io/LLM-st/.
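The core design described in the abstract, a speech encoder whose outputs are prepended to a text instruction before being fed to a decoder-only LLM, can be sketched as follows. This is a minimal illustrative mock-up with hypothetical dimensions and randomly initialized weights, not the paper's actual architecture or parameter sizes:

```python
import numpy as np

# Hypothetical dimensions for illustration -- not the paper's actual sizes.
D_AUDIO, D_MODEL, VOCAB = 80, 256, 1000

rng = np.random.default_rng(0)

def speech_encoder(features):
    """Project log-mel frames (T, D_AUDIO) into the LLM embedding space.
    A single linear layer stands in for a real speech encoder."""
    W = rng.standard_normal((D_AUDIO, D_MODEL)) * 0.02
    return features @ W                      # shape (T, D_MODEL)

def embed_tokens(token_ids):
    """Look up text-token embeddings for the instruction prompt."""
    E = rng.standard_normal((VOCAB, D_MODEL)) * 0.02
    return E[token_ids]                      # shape (L, D_MODEL)

def build_llm_input(audio_features, instruction_ids):
    """Prepend the encoded speech to the instruction embeddings, the way
    a decoder-only LLM would consume them under multi-task instruction
    tuning. A CoT-style target would then have the model first emit the
    timestamped transcript, then the translation."""
    speech = speech_encoder(audio_features)
    prompt = embed_tokens(instruction_ids)
    return np.concatenate([speech, prompt], axis=0)

# A 120-frame utterance plus a 6-token "transcribe then translate" prompt.
x = build_llm_input(rng.standard_normal((120, D_AUDIO)),
                    np.array([3, 14, 15, 92, 6, 5]))
print(x.shape)  # (126, 256)
```

The key design point this illustrates is that the speech and text modalities share one embedding space, so the LLM attends over both jointly; the multi-task and CoT behaviors come from the instruction and target formatting rather than from architectural changes.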

Authors (7)
  1. Zhichao Huang (17 papers)
  2. Rong Ye (20 papers)
  3. Tom Ko (31 papers)
  4. Qianqian Dong (19 papers)
  5. Shanbo Cheng (23 papers)
  6. Mingxuan Wang (83 papers)
  7. Hang Li (277 papers)
Citations (11)