
MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation (2403.12408v1)

Published 19 Mar 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Research interest in speech-to-speech translation (S2ST), which translates utterances from one language to another, has been growing, along with advances in the field. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without relying on text training data, our model supports multilingual S2ST while preserving speaker style.

