
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (2306.15687v2)

Published 23 Jun 2023 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found at https://voicebox.metademolab.com.

Overview of "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale"

The paper presents the development and evaluation of Voicebox, a text-guided multilingual speech generation model that handles multiple tasks at scale. Voicebox introduces a novel approach to speech synthesis and editing, drawing parallels to the generative capabilities of large-scale text and image models such as GPT and DALL-E. It aims to close the gap between existing speech generation models and their text and vision counterparts through an approach that allows for extensive linguistic and contextual adaptability.

Technical Approach

Voicebox is built on a non-autoregressive flow-matching model trained to perform speech infilling: predicting masked speech segments from the surrounding audio context and the text transcript. The training data spans over 50,000 hours of speech across multiple languages, a substantial increase over the limited, curated datasets typical in the field, which strengthens Voicebox's ability to generalize across tasks. By using conditional flow matching with optimal-transport paths, Voicebox efficiently models the distribution of masked speech and generates coherent, intelligible audio.
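
To make the training objective concrete, here is a minimal sketch of conditional flow matching with optimal-transport paths applied to masked infilling. The model signature, masking convention, and the `SIGMA_MIN` value are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of the OT-path conditional flow-matching loss for infilling.
import torch

SIGMA_MIN = 1e-5  # assumed small terminal noise scale for the OT path

def ot_flow_matching_loss(model, x1, text_emb, mask):
    """x1: clean speech features (B, T, D); mask: 1 where frames are masked (B, T)."""
    B = x1.shape[0]
    t = torch.rand(B, device=x1.device).view(B, 1, 1)   # flow time ~ U[0, 1]
    x0 = torch.randn_like(x1)                           # noise sample
    # OT conditional path: x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    xt = (1 - (1 - SIGMA_MIN) * t) * x0 + t * x1
    # Target velocity field for this path
    ut = x1 - (1 - SIGMA_MIN) * x0
    # Audio context: only unmasked frames are visible to the model
    ctx = x1 * (1 - mask.unsqueeze(-1))
    vt = model(xt, ctx, text_emb, t.view(B))            # predicted velocity
    # Regress the velocity only on the masked (to-be-infilled) frames
    sq_err = ((vt - ut) ** 2).mean(-1)
    return (sq_err * mask).sum() / mask.sum().clamp(min=1)
```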

Task Versatility

The model's versatility is demonstrated through zero-shot text-to-speech (TTS) synthesis, cross-lingual synthesis, noise removal, content editing, and style conversion. These tasks are performed via in-context learning, similar to LLMs, but with added flexibility because the model can also condition on future context. Empirically, Voicebox outperforms state-of-the-art models such as VALL-E in zero-shot TTS, achieving lower word error rates and higher audio similarity scores while being up to 20 times faster at inference. Voicebox further extends to cross-lingual TTS across six languages without relying on style labels or multilingual embeddings, unlike previous models that suffered substantial performance loss in cross-lingual scenarios.
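
To illustrate how zero-shot TTS falls out of infilling, below is a hedged sketch of the sampling loop: the enrollment prompt is kept as visible audio context, the continuation is left masked, and a sample is drawn by integrating the learned velocity field from noise. Plain Euler steps stand in for the paper's ODE solver; all names and shapes here are assumptions.

```python
# Hedged sketch: zero-shot TTS cast as infilling with flow-based sampling.
import torch

@torch.no_grad()
def zero_shot_tts(model, prompt_mel, text_emb, gen_len, n_steps=32):
    """prompt_mel: enrollment audio features (B, Tp, D) carrying the target voice."""
    B, Tp, D = prompt_mel.shape
    device = prompt_mel.device
    T = Tp + gen_len
    ctx = torch.zeros(B, T, D, device=device)
    ctx[:, :Tp] = prompt_mel                   # visible context: the prompt frames
    x = torch.randn(B, T, D, device=device)    # start from noise at flow time t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((B,), i * dt, device=device)
        v = model(x, ctx, text_emb, t)         # predicted velocity field
        x = x + dt * v                         # Euler step along the learned flow
    return x[:, Tp:]                           # continuation in the prompt's voice
```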

Evaluation Metrics

The paper employs a variety of metrics to assess Voicebox's performance: word error rate for correctness and intelligibility, and audio similarity scores computed with established speaker embeddings. The evaluation also covers diversity and quality, adapting the Fréchet Inception Distance (FID) used in image generation into a Fréchet Speech Distance (FSD). This metric measures how closely the distribution of generated samples matches that of real speech, reflecting both diversity and quality.
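
For reference, FSD follows the standard Fréchet formula between Gaussians fitted to embeddings of real and generated speech. The sketch below assumes pooled per-utterance embeddings from some self-supervised encoder (an assumption, not the paper's exact pipeline) and mirrors the usual FID computation.

```python
# Hedged sketch of a Frechet-style distance over speech embeddings.
import numpy as np
from scipy import linalg

def frechet_distance(real_emb, gen_emb):
    """real_emb, gen_emb: (N, D) arrays of pooled speech embeddings."""
    mu_r, mu_g = real_emb.mean(0), gen_emb.mean(0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)      # matrix square root of the product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # discard numerical imaginary parts
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2 * covmean)
```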

Implications and Future Developments

Voicebox's capacity to generate speech conditioned on both past and future context broadens its practical applications, from real-time synthesis in varied linguistic environments to adaptive voice interfaces. The results also suggest room for further gains by scaling diverse multilingual datasets, which could address current limitations such as reduced performance on conversational or less-scripted speech.

The paper's advances point to a potential shift in how speech synthesis and editing are approached, highlighting the power of large, diverse datasets combined with flow-matching models. The authors suggest that future work may focus on disentangling the controls for different stylistic attributes of audio, which would enable finer-grained manipulation and generation of speech beyond current capabilities.

In summary, "Voicebox" represents a significant stride towards more generalized and effective speech generation models that could parallel the advancements seen in text and image processing, opening new avenues for speech technology applications and research.

Authors (11)
  1. Matthew Le
  2. Apoorv Vyas
  3. Bowen Shi
  4. Brian Karrer
  5. Rashel Moritz
  6. Mary Williamson
  7. Vimal Manohar
  8. Yossi Adi
  9. Jay Mahadeokar
  10. Wei-Ning Hsu
  11. Leda Sari