
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering (2401.07333v1)

Published 14 Jan 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the output synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges of fine-grained control over the synthesized speech with autoregressive (AR) language models; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. The experimental findings reveal that our model outperforms VALL-E in terms of accuracy and delivers more stable results using both greedy and sampling-based decoding strategies. The code of ELLA-V will be open-sourced after cleanups. Audio samples are available at https://ereboas.github.io/ELLAV/.
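The alignment-guided interleaving described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the phoneme symbols, the integer acoustic-token IDs, and the assumption that a forced aligner has already grouped acoustic tokens per phoneme are all hypothetical.

```python
# Sketch of ELLA-V-style sequence reordering: each phoneme token is placed
# ahead of the acoustic tokens aligned to it, giving the AR language model
# an explicit alignment constraint between text and audio.
def interleave(phonemes, acoustic_by_phoneme):
    """phonemes: list of phoneme tokens (one per aligned segment).
    acoustic_by_phoneme: list of acoustic-token lists, one per phoneme,
    as produced by some external forced alignment (assumed given here)."""
    seq = []
    for ph, acoustics in zip(phonemes, acoustic_by_phoneme):
        seq.append(ph)         # phoneme token appears ahead of ...
        seq.extend(acoustics)  # ... its corresponding acoustic tokens
    return seq

# Illustrative tokens only: two phonemes with 2 and 1 acoustic tokens.
print(interleave(["HH", "AY"], [[101, 102], [201]]))
# -> ['HH', 101, 102, 'AY', 201]
```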

Authors (5)
  1. Yakun Song (9 papers)
  2. Zhuo Chen (319 papers)
  3. Xiaofei Wang (138 papers)
  4. Ziyang Ma (73 papers)
  5. Xie Chen (166 papers)
Citations (26)