ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering (2401.07333v1)
Abstract: The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges in fine-grained control over the synthesized speech with autoregressive (AR) language models; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework that enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving the sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of their corresponding acoustic tokens. Experimental results show that our model outperforms VALL-E in terms of accuracy and delivers more stable results under both greedy and sampling-based decoding strategies. The code of ELLA-V will be open-sourced after cleanups. Audio samples are available at https://ereboas.github.io/ELLAV/.
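To make the interleaving idea concrete, the sketch below shows one way such a sequence could be assembled, assuming a forced alignment (e.g., from the Montreal Forced Aligner) that maps each phoneme to a contiguous span of codec frames. The function name, the PHONE/ACOUSTIC/EOP markers, and the exact sequence layout are illustrative assumptions, not the paper's specification.

```python
# Illustrative sketch of alignment-guided interleaving (assumed layout, not the
# authors' implementation): each phoneme token is emitted immediately before the
# acoustic (codec) tokens aligned to it, followed by a hypothetical end-of-phoneme marker.

def interleave(phonemes, acoustic_tokens, spans):
    """phonemes:        phoneme symbols, e.g. ["HH", "AH", "L", "OW"]
    acoustic_tokens: codec token ids for the whole utterance
    spans:           (start, end) frame indices per phoneme from a forced aligner
    """
    sequence = []
    for phoneme, (start, end) in zip(phonemes, spans):
        sequence.append(("PHONE", phoneme))            # phoneme token first
        for token in acoustic_tokens[start:end]:       # then its aligned acoustic tokens
            sequence.append(("ACOUSTIC", token))
        sequence.append(("EOP", phoneme))              # assumed end-of-phoneme marker
    return sequence


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]
    acoustic = [101, 102, 103, 104, 105, 106, 107, 108]   # toy codec token ids
    spans = [(0, 2), (2, 4), (4, 6), (6, 8)]              # toy alignment spans
    for item in interleave(phonemes, acoustic, spans):
        print(item)
```

Under a layout like this, the AR model sees each phoneme immediately before the acoustic tokens it must produce, which is how interleaving supplies the local alignment constraint that the abstract says plain phoneme-then-audio sequences lack.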
- Flamingo: A visual language model for few-shot learning. Proc. NeurIPS.
- Deep Voice: Real-time neural text-to-speech. In Proc. ICML.
- UniLMv2: Pseudo-masked language models for unified language model pre-training. In Proc. ICML.
- AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- Language models are few-shot learners. In Proc. NeurIPS.
- WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing.
- Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proc. ACM SIGKDD.
- PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
- High fidelity neural audio compression. Transactions on Machine Learning Research.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL.
- TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. INTERSPEECH.
- Decoder-only or encoder-decoder? Interpreting language model as a regularized encoder-decoder. arXiv preprint arXiv:2304.04052.
- Conformer: Convolution-augmented Transformer for speech recognition. In Proc. INTERSPEECH.
- Denoising diffusion probabilistic models. Proc. NeurIPS.
- The curious case of neural text degeneration. In Proc. ICLR.
- Hierarchical generative modeling for controllable speech synthesis. In Proc. ICLR.
- Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models. In Proc. ICML.
- ProDiff: Progressive fast diffusion model for high-quality text-to-speech. In Proc. ACM MM.
- Comparison of diverse decoding methods from conditional language models. In Proc. ACL.
- Diff-TTS: A denoising diffusion model for text-to-speech. In Proc. INTERSPEECH.
- Libri-Light: A benchmark for ASR with limited or no supervision. In Proc. ICASSP. IEEE.
- Investigating the utility of surprisal from large language models for speech synthesis prosody. In 12th Speech Synthesis Workshop (SSW).
- Speak, Read and Prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540.
- Guided-TTS: A diffusion model for text-to-speech via classifier guidance. In Proc. ICML. PMLR.
- Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Proc. NeurIPS.
- Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proc. ICML. PMLR.
- AudioGen: Textually guided audio generation. In Proc. ICLR.
- Voicebox: Text-guided multilingual universal speech generation at scale. In Proc. NeurIPS.
- Bidirectional variational inference for non-autoregressive text-to-speech. In Proc. ICLR.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. ACL.
- Flow matching for generative modeling. In Proc. ICLR.
- Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Proc. INTERSPEECH.
- Flow-TTS: A non-autoregressive network for text to speech based on flow. In Proc. ICASSP. IEEE.
- WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
- LibriSpeech: An ASR corpus based on public domain audio books. In Proc. ICASSP. IEEE.
- Grad-TTS: A diffusion probabilistic model for text-to-speech. In Proc. ICML, pages 8599–8608. PMLR.
- WaveGlow: A flow-based generative network for speech synthesis. In Proc. ICASSP, pages 3617–3621. IEEE.
- Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
- Exploring the limits of transfer learning with a unified text-to-text Transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
- FastSpeech: Fast, robust and controllable text to speech. Proc. NeurIPS, 32.
- High-resolution image synthesis with latent diffusion models. In Proc. CVPR.
- AudioPaLM: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925.
- NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116.
- Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. ICML. PMLR.
- Improved techniques for training score-based generative models. Proc. NeurIPS.
- Score-based generative modeling through stochastic differential equations. In Proc. ICLR.
- LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
- Multimodal few-shot learning with frozen language models. Proc. NeurIPS.
- Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In Speech Synthesis Workshop (SSW), pages 146–152.
- Attention is all you need. Proc. NeurIPS, 30.
- Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.
- LauraGPT: Listen, attend, understand, and regenerate audio with GPT. arXiv preprint arXiv:2310.04673.
- SpeechX: Neural codec language model as a versatile speech transformer. arXiv preprint arXiv:2308.06873.
- Tacotron: Towards end-to-end speech synthesis. In Proc. INTERSPEECH.
- Language models with image descriptors are strong few-shot video-language learners. Proc. NeurIPS.
- LM-VC: Zero-shot voice conversion via speech generation based on language models. arXiv preprint arXiv:2306.10521.
- ItôWave: Itô stochastic differential equation is all you need for wave generation. In Proc. ICASSP. IEEE.
- Zero-shot video question answering via frozen bidirectional language models. Proc. NeurIPS, 35:124–141.
- XLNet: Generalized autoregressive pretraining for language understanding. Proc. NeurIPS.
- Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Sixth European Conference on Speech Communication and Technology.
- Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research.
- Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470–4474. IEEE.
- Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064.
- Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint arXiv:2303.03926.
Authors: Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen