
Enhancing TTS Stability in Hebrew using Discrete Semantic Units (2410.21502v1)

Published 28 Oct 2024 in cs.SD and eess.AS

Abstract: This study introduces a refined approach to Text-to-Speech (TTS) generation that significantly enhances sampling stability across languages, with a particular focus on Hebrew. By leveraging discrete semantic units with higher phonetic correlation, obtained from a self-supervised model, our method addresses the inherent instability often encountered in TTS systems, especially those dealing with non-diacriticized scripts such as Hebrew. Using HuBERT codes, our model generates discrete representations that are optimized for TTS tasks, thereby reducing the dependency on diacritic-based text processing. This advancement not only simplifies the language modeling process but also improves the robustness of the speech output and enables controllability, owing to the disentanglement properties of the semantic units. The inclusion of a speaker embedding in the vocoder further helps capture the unique vocal characteristics of the speaker, contributing to the naturalness of the synthesized speech. Our experimental results demonstrate that this approach not only maintains high performance in Hebrew but also adapts well to English, underscoring its effectiveness in enhancing the stability of TTS systems across languages. Our method, named LoTHM (Language of The Hebrew Man), outperforms existing methods in terms of stability while achieving naturalness and speaker similarity on par with previous methods, making it a compelling choice for future speech synthesis applications. Samples can be found on our project page: pages.cs.huji.ac.il/adiyoss-lab/LoTHM .
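The abstract describes a two-stage pipeline: continuous features from a self-supervised encoder (e.g. HuBERT) are quantized into discrete semantic units, which a text-to-unit model predicts and a speaker-conditioned vocoder then turns back into a waveform. The quantization step can be sketched as below; this is a minimal k-means-style illustration with toy data, not the paper's implementation — all shapes, names, and the codebook size are assumptions for illustration.

```python
# Minimal sketch of mapping continuous self-supervised features to
# discrete semantic units via nearest-centroid (k-means-style) lookup.
# Toy random data stands in for HuBERT features and a learned codebook.
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 100   # number of discrete units (illustrative, not from the paper)
FEATURE_DIM = 16      # toy feature dimension (real SSL features are larger)

def quantize_features(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each frame's feature vector to its nearest codebook centroid."""
    # Squared Euclidean distances: shape (n_frames, codebook_size)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # one discrete unit index per frame

# Placeholder "encoder output": 50 frames of 16-dim features.
features = rng.normal(size=(50, FEATURE_DIM))
codebook = rng.normal(size=(CODEBOOK_SIZE, FEATURE_DIM))

units = quantize_features(features, codebook)
```

In the actual system, the resulting unit sequence is what the language model operates on (instead of diacritic-dependent text), and the vocoder consumes these units together with a speaker embedding to recover speaker identity.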

