Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness (2404.06714v3)

Published 10 Apr 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Recent advancements in NLP have seen large language models (LLMs) excel at producing high-quality text for various purposes. Notably, in Text-To-Speech (TTS) systems, the integration of BERT for semantic token generation has underscored the importance of semantic content in producing coherent speech outputs. Despite this, the specific utility of LLMs in enhancing TTS synthesis remains considerably limited. This research introduces an innovative approach, Llama-VITS, which enhances TTS synthesis by enriching the semantic content of text using an LLM. Llama-VITS integrates semantic embeddings from Llama2 with the VITS model, a leading end-to-end TTS framework. By leveraging Llama2 for the primary speech synthesis process, our experiments demonstrate that Llama-VITS matches the naturalness of the original VITS (ORI-VITS) and of variants incorporating BERT (BERT-VITS) on the LJSpeech dataset, a substantial collection of neutral, clear speech. Moreover, our method significantly enhances emotive expressiveness on the EmoV_DB_bea_sem dataset, a curated selection of emotionally consistent speech from the EmoV_DB dataset, highlighting its potential to generate emotive speech.
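The fusion described in the abstract, conditioning a TTS encoder on a sentence-level semantic embedding from an LLM, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fuse_semantic_embedding`, the learned projection matrix, the additive fusion, and the dimensions (a Llama2-7B hidden size of 4096, a VITS-style encoder width of 192) are all assumptions for the sketch.

```python
import numpy as np

def fuse_semantic_embedding(phoneme_emb, sentence_emb, proj):
    """Broadcast a global sentence-level semantic vector onto every
    phoneme-level embedding after projecting it to the TTS hidden size.

    phoneme_emb  : (T, d_tts)  per-phoneme embeddings from the TTS text encoder
    sentence_emb : (d_llm,)    pooled semantic embedding from an LLM (e.g. Llama2)
    proj         : (d_llm, d_tts)  learned projection into the TTS embedding space
    """
    semantic = sentence_emb @ proj      # (d_tts,) projected global vector
    return phoneme_emb + semantic       # broadcast-add over all T time steps

# Toy example with assumed dimensions.
rng = np.random.default_rng(0)
phonemes = rng.standard_normal((50, 192))      # 50 phonemes, encoder width 192
llm_vec = rng.standard_normal(4096)            # pooled LLM hidden state
W = rng.standard_normal((4096, 192)) * 0.01    # small init for the projection

fused = fuse_semantic_embedding(phonemes, llm_vec, W)
print(fused.shape)  # (50, 192)
```

Because the semantic vector is global, every phoneme position receives the same additive bias; token-level (per-phoneme) conditioning would instead concatenate or cross-attend, at the cost of aligning LLM tokens to phonemes.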

Authors (2)
  1. Xincan Feng
  2. Akifumi Yoshimoto
Citations (2)