Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization (2407.02243v1)

Published 2 Jul 2024 in cs.CL, cs.SD, and eess.AS

Abstract: In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed reverse inference, based on the Bayesian principle that high-quality generated speech should itself be usable as a prompt for subsequent generation by the same TTS model. By using reverse inference as the criterion for selecting RLHF exemplars from the speech samples generated by the TTS system itself, RIO steers the subsequent optimization towards enhancing TTS robustness. The RIO framework, comprising sampling, automatic annotating, and learning, obviates the need for a reward model or pairwise preference data, and significantly improves the stability of zero-shot TTS performance by reducing the discrepancies between training and inference conditions. Our experimental results verify that RIO can effectively improve both subjective and objective metrics, including mean opinion scores, word error rates, and speaker similarity. Remarkably, RIO also reduces the incidence of bad outputs to nearly zero percent, rivalling the robustness achieved when using ground-truth speech as the prompt.
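The three-stage loop described in the abstract (sampling, automatic annotating via reverse inference, learning) can be illustrated with a short sketch. Everything below is hypothetical scaffolding, not the paper's code: `tts_generate` is a deterministic stand-in for an autoregressive zero-shot TTS model, and its scalar `quality` field stands in for the objective metrics (word error rate, speaker similarity) the paper actually measures. Only the control flow mirrors the described framework.

```python
import hashlib
import random

def tts_generate(text, prompt, seed=0):
    """Hypothetical stand-in for an autoregressive zero-shot TTS model.

    A real system would return synthesized audio; here we return a token
    plus a scalar quality proxy, derived deterministically from the inputs.
    """
    h = int(hashlib.md5(f"{text}|{prompt}|{seed}".encode()).hexdigest(), 16)
    rng = random.Random(h)
    quality = rng.random()  # proxy for WER / speaker-similarity metrics
    return {"audio": f"<speech:{seed}:{quality:.3f}>", "quality": quality}

def reverse_inference_score(text, candidate):
    """Score a candidate by the Bayesian-inspired reverse-inference check:
    feed the candidate back in as the prompt and re-synthesize. A robust
    sample should support a good second-pass generation."""
    second_pass = tts_generate(text, prompt=candidate["audio"])
    return second_pass["quality"]

def rio_round(text, prompt, n_samples=8):
    # 1. Sampling: draw several candidates from the same TTS model.
    candidates = [tts_generate(text, prompt, seed=i) for i in range(n_samples)]
    # 2. Automatic annotating: rank candidates by reverse-inference score,
    #    with no reward model and no human preference labels.
    ranked = sorted(candidates,
                    key=lambda c: reverse_inference_score(text, c),
                    reverse=True)
    # 3. Learning: the best/worst exemplars would drive an RLHF-style
    #    preference update on the TTS model (omitted here); this sketch
    #    just returns the selected pair.
    return ranked[0], ranked[-1]

best, worst = rio_round("hello world", "<reference speech>")
```

The key property the sketch captures is that the annotation signal comes from the generator itself: no external reward model is queried, matching the abstract's claim that RIO obviates reward models and pairwise preference data.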

Authors (5)
  1. Yuchen Hu
  2. Chen Chen
  3. Siyin Wang
  4. Eng Siong Chng
  5. Chao Zhang
Citations (2)
