VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (2403.16973v3)

Published 25 Mar 2024 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD

Abstract: We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS, our model outperforms prior SotA models including VALL-E and the popular commercial model XTTS-v2. Crucially, the models are evaluated on challenging and realistic datasets that consist of diverse accents, speaking styles, recording conditions, and background noise and music, and our model performs consistently well compared to other models and real recordings. In particular, for speech editing evaluation, we introduce a high quality, challenging, and realistic dataset named RealEdit. We encourage readers to listen to the demos at https://jasonppy.github.io/VoiceCraft_web.

References (64)
  1. CM3: A causal masked multimodal model of the internet. arXiv, abs/2201.07520.
  2. MusicLM: Generating music from text. arXiv, abs/2301.11325.
  3. A3T: Alignment-aware acoustic and text pretraining for speech synthesis and editing. In International Conference on Machine Learning.
  4. Mathieu Bernard and Hadrien Titeux. 2021. Phonemizer: Text to phones transcription for multiple languages in Python. Journal of Open Source Software, 6(68):3958.
  5. AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2523–2533.
  6. SpeechPainter: Text-conditioned speech inpainting. In Interspeech.
  7. SoundStorm: Efficient parallel audio generation. arXiv, abs/2305.09636.
  8. YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In International Conference on Machine Learning.
  9. WavMark: Watermarking for audio generation. arXiv, abs/2308.12770.
  10. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Proc. Interspeech 2021.
  11. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16:1505–1518.
  12. 100,000 podcasts: A spoken English document corpus. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5903–5917, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  13. Simple and controllable music generation. arXiv, abs/2306.05284.
  14. Coqui. 2023. XTTS-v2. https://huggingface.co/coqui/XTTS-v2.
  15. High fidelity neural audio compression. arXiv, abs/2210.13438.
  16. SingSong: Generating musical accompaniments from singing. arXiv, abs/2301.12662.
  17. UniCATS: A unified context-aware text-to-speech framework with contextual VQ-diffusion and vocoding. In AAAI.
  18. VALL-T: Decoder-only generative transducer for robust and decoding-controllable text-to-speech.
  19. VampNet: Music generation via masked acoustic token modeling. arXiv, abs/2307.04686.
  20. PromptTTS: Controllable text-to-speech with text descriptions. In ICASSP 2023, pages 1–5.
  21. The curious case of neural text degeneration. In International Conference on Learning Representations.
  22. Text-free image-to-speech synthesis using learned segmental units. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021.
  23. Keith Ito and Linda Johnson. 2017. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/.
  24. TextrolSpeech: A text style control speech corpus with codec language text-to-speech models. arXiv, abs/2308.14430.
  25. Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv, abs/2306.03509.
  26. FluentSpeech: Stutter-oriented automatic speech editing with context-aware diffusion models. In Annual Meeting of the Association for Computational Linguistics.
  27. VoCo: Text-based insertion and replacement in audio narration. In International Conference on Computer Graphics and Interactive Techniques.
  28. Text-free prosody-aware generative spoken language modeling. arXiv, abs/2109.03264.
  29. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. Transactions of the Association for Computational Linguistics, 11:1703–1718.
  30. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. arXiv, abs/2106.06103.
  31. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  32. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. arXiv, abs/2010.05646.
  33. AudioGen: Textually guided audio generation. arXiv, abs/2209.15352.
  34. Robert F. Kubichek. 1993. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1:125–128.
  35. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354.
  36. Voicebox: Text-guided multilingual universal speech generation at scale. arXiv, abs/2306.15687.
  37. Feiteng Li. 2023. An unofficial PyTorch implementation of VALL-E. https://github.com/lifeiteng/vall-e.
  38. PromptStyle: Controllable style transfer for text-to-speech with natural language descriptions. arXiv, abs/2305.19522.
  39. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. In International Conference on Learning Representations.
  40. Daniel Lyth and Simon King. 2024. Natural language guidance of high-fidelity text-to-speech with synthetic annotations. arXiv, abs/2402.01912.
  41. Matthias Mauch and Simon Dixon. 2014. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In ICASSP 2014, pages 659–663.
  42. librosa: Audio and music signal analysis in Python. In SciPy.
  43. Context-aware prosody correction for text-based speech editing. In ICASSP 2021, pages 7038–7042.
  44. Generative spoken dialogue language modeling. Transactions of the Association for Computational Linguistics, 11:250–266.
  45. Robust speech recognition via large-scale weak supervision. arXiv, abs/2212.04356.
  46. Language models are unsupervised multitask learners.
  47. Proactive detection of voice cloning with localized watermarking. arXiv, abs/2401.17264.
  48. ELLA-V: Stable neural codec language modeling with alignment-guided sequence reordering.
  49. EditSpeech: A text based speech editing system using partial inference and bidirectional fusion. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2021, pages 626–633.
  50. Neural discrete representation learning. arXiv, abs/1711.00937.
  51. Attention is all you need. In Neural Information Processing Systems.
  52. Neural codec language models are zero-shot text to speech synthesizers. arXiv, abs/2301.02111.
  53. Context-aware mask prediction network for end-to-end text-based speech editing. In ICASSP 2022, pages 6082–6086.
  54. VioLA: Unified codec language models for speech recognition, synthesis, and translation. arXiv, abs/2305.16107.
  55. SpeechX: Neural codec language model as a versatile speech transformer. arXiv, abs/2308.06873.
  56. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92).
  57. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. arXiv, abs/2109.00537.
  58. InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt. arXiv, abs/2301.13662.
  59. Zipformer: A faster and better encoder for automatic speech recognition. In ICLR.
  60. RetrieverTTS: Modeling decomposed factors for text-based speech insertion. In Interspeech.
  61. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507.
  62. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. In Interspeech.
  63. One-class learning towards synthetic voice spoofing detection. IEEE Signal Processing Letters, 28:937–941.
  64. Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv, abs/2303.03926.

Summary

  • The paper introduces VoiceCraft, a novel token infilling neural codec language model for zero-shot speech editing and TTS.
  • It employs a Transformer decoder architecture with a token rearrangement procedure, combining causal masking and delayed stacking, that enables generation within an existing sequence.
  • VoiceCraft achieves state-of-the-art results on the RealEdit speech editing dataset: listeners prefer its edited speech over the original recordings 48% of the time, and it attains lower WER than competing models.

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

The paper "VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild" focuses on advancing neural codec LLMs (NCLMs) for performing state-of-the-art speech editing and zero-shot text-to-speech (TTS) in a variety of challenging real-world conditions. It specifically targets applications involving diverse accents, speaking styles, and intricate background conditions, utilizing a novel token infilling model to achieve its objectives.

Model Architecture and Methodology

VoiceCraft is a token infilling neural codec language model built on a Transformer decoder architecture. Its core innovation is a token rearrangement procedure comprising two steps: causal masking and delayed stacking.

  1. Causal Masking: Spans to be edited are replaced with mask tokens and moved to the end of the sequence, so that a strictly left-to-right autoregressive decoder can condition on both the left and right context of the edit before generating the masked content. This rearrangement is what makes infilling possible for speech editing and zero-shot TTS.

    Figure 1: An example of the token rearrangement procedure and modeling framework. The rearrangement procedure involves two steps: (1) Causal masking, where masked spans are replaced with mask tokens and moved to the end, and (2) Delayed stacking, where tokens are shifted in the time dimension based on their codebook index.

  2. Delayed Stacking: Each time step carries one token per codebook, and tokens from codebook k are shifted k steps later in the time dimension, so that the prediction of codebook k at a given step can condition on codebook k-1 generated at the previous step. A toy implementation of both steps is sketched below.
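
The following is a minimal NumPy sketch of the two rearrangement steps on a toy token grid with T time steps and K codebooks. The PAD filler and the integer ids standing in for the mask and end-of-sequence tokens are placeholders for illustration, not values from the paper.

```python
import numpy as np

PAD, MASK, EOS = -1, 1000, 1001  # placeholder ids; the paper uses learned special tokens

def causal_mask_rearrange(tokens, span):
    """Step 1 (causal masking): mark the edited span with a mask token and
    move its contents to the end, so a left-to-right decoder sees the full
    surrounding context before infilling. tokens: (T, K); span: (start, end)."""
    s, e = span
    mask_row = np.full((1, tokens.shape[1]), MASK)
    eos_row = np.full((1, tokens.shape[1]), EOS)
    context = np.concatenate([tokens[:s], mask_row, tokens[e:]])  # span removed
    infill = np.concatenate([mask_row, tokens[s:e], eos_row])     # span at the end
    return np.concatenate([context, infill])

def delay_stack(tokens):
    """Step 2 (delayed stacking): shift codebook k down by k time steps, so
    codebook k at one output step follows codebook k-1 of the previous step."""
    T, K = tokens.shape
    out = np.full((T + K - 1, K), PAD)
    for k in range(K):
        out[k:k + T, k] = tokens[:, k]
    return out

# Example: 6 time steps, 4 codebooks; edit time steps 2 and 3.
tokens = np.arange(24).reshape(6, 4)
rearranged = delay_stack(causal_mask_rearrange(tokens, (2, 4)))
```

Chaining the two functions, as in the last line, produces the rearranged grid that an autoregressive decoder would then consume step by step.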

The model is trained autoregressively on the rearranged sequences, using a multi-codebook cross-entropy loss that weights the earlier codebooks, which encode the coarser structure of the signal, more heavily than the later ones.
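
As a concrete illustration of that weighting, here is a minimal PyTorch sketch of a weighted multi-codebook loss. The specific weight values below are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative per-codebook weights (earlier codebooks weighted more
# heavily); the exact values are an assumption, not the paper's.
CODEBOOK_WEIGHTS = (1.0, 0.8, 0.6, 0.4)

def multi_codebook_loss(logits, targets):
    """logits: list of K tensors of shape (B, T, V), one per codebook.
    targets: (B, T, K) integer codec-token ids."""
    loss = torch.tensor(0.0)
    for k, w in enumerate(CODEBOOK_WEIGHTS):
        # cross_entropy expects the class dimension second: (B, V, T)
        loss = loss + w * F.cross_entropy(logits[k].transpose(1, 2), targets[..., k])
    return loss
```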

Evaluation and Dataset

The paper introduces a novel dataset called RealEdit, designed to evaluate the practicality and robustness of speech editing models. RealEdit includes 310 speech editing examples constructed from a diverse range of sources such as audiobooks, YouTube videos, and Spotify podcasts, making it a highly representative and challenging dataset.

Speech Editing Performance

On RealEdit, VoiceCraft improves significantly over prior state-of-the-art models. Human evaluations show its edited speech is nearly indistinguishable from the original recordings in naturalness: in side-by-side comparisons, listeners prefer VoiceCraft's edited speech over the original unedited speech 48% of the time.

Figure 2: Speech editing with VoiceCraft. Human listeners prefer VoiceCraft edited speech over the original real recording 48% of the time in side-by-side naturalness comparison.

Zero-Shot Text-to-Speech Synthesis

VoiceCraft's capability extends to zero-shot TTS, where it outperforms competing state-of-the-art models such as VALL-E and the commercial XTTS-v2. The model demonstrates superior performance on objective metrics such as WER, as well as on human-rated naturalness and intelligibility, without any fine-tuning for voices unseen during training.
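
For reference, WER is the word-level edit distance between the ASR transcript of the generated speech and the target text, normalized by the reference length: WER = (S + D + I) / N. A self-contained implementation of the standard metric (not the paper's evaluation code) looks like this:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level Levenshtein distance:
    (substitutions + deletions + insertions) / reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ≈ 0.167
```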

Experimental Results and Comparative Analysis

The paper presents thorough experimental results highlighting VoiceCraft's strengths in both speech editing and zero-shot TTS. Mean Opinion Score (MOS) evaluations, alongside automatic metrics, underline the superiority of VoiceCraft across scenarios and editing types.

Figure 3: Breakdown of side-by-side human preference on naturalness comparing VoiceCraft and FluentSpeech on speech editing, grouped by edit type (left) and edit span length (right).

Implementation Considerations and Challenges

The computational demands of training such a model are managed with optimized training schedules and loss-weighting strategies that balance intelligibility against prosody. Failure modes such as occasional long silences and misalignments between generated speech and target text are addressed, but the paper acknowledges the need for further refinement in these areas.

Conclusion

VoiceCraft advances the field of NCLMs by providing a robust, high-quality solution for speech editing and zero-shot TTS, validated by extensive empirical data on diverse datasets. Its innovative token rearrangement methodology offers a potent toolset for enhancing speech synthesis technologies across numerous applications.

Implications and Future Directions

The paper provides a critical foundation for future research to explore more seamless integrations of speech synthesis and editing technologies while stressing the importance of ethical considerations, such as misuse through voice cloning. Future developments could focus on refining VoiceCraft's inference capabilities and expanding its applicability to additional languages and styles.
