
Boosting Large Language Model for Speech Synthesis: An Empirical Study (2401.00246v1)

Published 30 Dec 2023 in cs.CL, cs.SD, and eess.AS

Abstract: LLMs have made significant advancements in natural language processing and are concurrently extending their language abilities to other modalities, such as speech and vision. Nevertheless, most previous work focuses on prompting LLMs with perception abilities such as auditory comprehension, and an effective approach for augmenting LLMs with speech synthesis capabilities remains unclear. In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining the pre-trained LLMs LLaMA/OPT with the text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech synthesis models: directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using the LLM as a powerful text encoder. Experimental results show that directly fine-tuning LLMs with the LoRA method to boost speech synthesis capability does not work well, whereas superposing LLMs and VALL-E improves the quality of generated speech in both speaker similarity and word error rate (WER). Among the three methods, the coupled method, which leverages the LLM as a text encoder, achieves the best performance, outperforming the original speech synthesis model with consistently better speaker similarity and a significant (10.9%) WER reduction.
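The best-performing "coupled" configuration described in the abstract can be pictured as a pipeline in which a pre-trained LLM encodes the input text and a VALL-E-style autoregressive decoder predicts discrete audio codec tokens conditioned on the LLM's hidden states. The sketch below is a minimal, hypothetical PyTorch illustration of that wiring; all module names, dimensions, and the use of generic Transformer layers as stand-ins for LLaMA/OPT and VALL-E are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CoupledLLMSpeechSynthesizer(nn.Module):
    """Hypothetical sketch: a frozen LLM as text encoder, coupled to a
    VALL-E-style decoder over discrete neural codec tokens."""

    def __init__(self, llm_dim=32, dec_dim=32, n_codec_tokens=1024, n_layers=2):
        super().__init__()
        # Stand-in for a frozen pre-trained LLaMA/OPT encoder (a real system
        # would load pre-trained weights and keep them frozen).
        self.llm_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True),
            num_layers=n_layers,
        )
        for p in self.llm_encoder.parameters():
            p.requires_grad = False
        # Project LLM hidden states into the decoder's embedding space.
        self.bridge = nn.Linear(llm_dim, dec_dim)
        # VALL-E-style autoregressive decoder over codec token indices.
        self.codec_embed = nn.Embedding(n_codec_tokens, dec_dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=n_layers,
        )
        self.head = nn.Linear(dec_dim, n_codec_tokens)

    def forward(self, text_embeds, codec_tokens):
        # text_embeds: (batch, text_len, llm_dim) embeddings of the text prompt
        # codec_tokens: (batch, audio_len) previously generated codec indices
        memory = self.bridge(self.llm_encoder(text_embeds))
        tgt = self.codec_embed(codec_tokens)
        out = self.decoder(tgt, memory)
        # Logits over the next codec token at each audio position.
        return self.head(out)

model = CoupledLLMSpeechSynthesizer()
logits = model(torch.randn(2, 5, 32), torch.randint(0, 1024, (2, 7)))
print(tuple(logits.shape))  # (batch, audio_len, n_codec_tokens)
```

Only the bridge, codec embedding, decoder, and output head would be trained here, which mirrors the abstract's framing of the LLM as a fixed, powerful text encoder rather than a module to be fully fine-tuned.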

Authors (7)
  1. Hongkun Hao (11 papers)
  2. Long Zhou (57 papers)
  3. Shujie Liu (101 papers)
  4. Jinyu Li (164 papers)
  5. Shujie Hu (36 papers)
  6. Rui Wang (996 papers)
  7. Furu Wei (291 papers)
Citations (13)