UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding (2306.07547v6)
Abstract: Discrete speech tokens, divided into semantic tokens and acoustic tokens, have been shown to surpass traditional mel-spectrogram acoustic features in naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models such as VALL-E and SPEAR-TTS allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models can generate speech only in a left-to-right direction, making them unsuitable for speech editing, where both preceding and following contexts are provided. Furthermore, they rely on acoustic tokens, whose audio quality is bounded by the performance of the underlying audio codec. In this study, we propose UniCATS, a unified context-aware TTS framework capable of both speech continuation and editing. UniCATS comprises two components: an acoustic model, CTX-txt2vec, and a vocoder, CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and concatenate seamlessly with the surrounding speech. CTX-vec2wav then uses contextual vocoding to convert these semantic tokens into waveforms while taking the acoustic context into consideration. Our experiments demonstrate that CTX-vec2wav outperforms HiFi-GAN and AudioLM in speech resynthesis from semantic tokens, and that UniCATS achieves state-of-the-art performance in both speech continuation and editing.
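The abstract names contextual VQ-diffusion without spelling out the mechanism. As a rough illustration, the following is a minimal NumPy sketch of the mask-and-replace forward corruption from VQ-diffusion (Gu et al., CVPR 2022, cited below), which CTX-txt2vec builds on; the codebook size, schedules, and step count here are illustrative assumptions, not the paper's settings, and the transition is slightly simplified relative to the original transition matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 512          # codebook size for semantic tokens (illustrative assumption)
MASK = K         # extra [MASK] index, an absorbing state
T = 100          # number of diffusion steps (illustrative assumption)

# Illustrative schedules: keep-probability alpha decays, mask-probability
# gamma grows, and the remaining mass is spread uniformly over the K tokens.
alpha = np.linspace(1.0, 0.0, T + 1)[1:]
gamma = np.linspace(0.0, 0.9, T + 1)[1:]
beta = (1.0 - alpha - gamma) / K   # per-token uniform replace probability

def corrupt_step(x, t):
    """One mask-and-replace forward step q(x_t | x_{t-1}) on a token array.

    Each non-mask token is kept with probability alpha[t], replaced by a
    uniformly random codebook token with total probability K * beta[t], or
    masked with probability gamma[t]. [MASK] is absorbing: once a position
    is masked, it stays masked.
    """
    x = x.copy()
    live = x != MASK
    u = rng.random(x.shape)
    replace = live & (u < K * beta[t])
    mask = live & (u >= K * beta[t]) & (u < K * beta[t] + gamma[t])
    x[replace] = rng.integers(0, K, size=replace.sum())
    x[mask] = MASK
    return x

# Example: corrupt a toy "semantic token" sequence all the way to step T.
x0 = rng.integers(0, K, size=50)
xt = x0
for t in range(T):
    xt = corrupt_step(xt, t)
print("masked fraction at t=T:", np.mean(xt == MASK))
```

At inference time, per the abstract, CTX-txt2vec would run the learned reverse (denoising) process over such corrupted tokens while conditioning on the provided semantic context, so that the generated segment concatenates seamlessly with its surroundings for both continuation and editing.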
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. arXiv preprint arXiv:1910.05453.
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In NeurIPS, volume 33, 12449–12460.
- AudioLM: A Language Modeling Approach to Audio Generation. arXiv preprint arXiv:2209.03143.
- YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone. In ICML, volume 162, 2709–2720.
- w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training. In IEEE ASRU, 244–250.
- High Fidelity Neural Audio Compression. arXiv preprint arXiv:2210.13438.
- VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature. In ISCA Interspeech, 1596–1600.
- A Pitch Extraction Algorithm Tuned for Automatic Speech Recognition. In IEEE ICASSP, 2494–2498.
- Vector Quantized Diffusion Model for Text-to-Image Synthesis. In IEEE/CVF CVPR, 10686–10696.
- Conformer: Convolution-augmented Transformer for Speech Recognition. In ISCA Interspeech, 5036–5040.
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process., 29: 3451–3460.
- Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision. arXiv preprint arXiv:2302.03540.
- Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. In NeurIPS, volume 33, 8067–8077.
- Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
- HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In NeurIPS, volume 33, 17022–17033.
- On Generative Spoken Language Modeling from Raw Audio. Trans. Assoc. Comput. Linguistics, 9: 1336–1354.
- DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism. In AAAI, 11020–11028.
- DiffVoice: Text-to-Speech with Latent Diffusion. In IEEE ICASSP, 1–5.
- Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations. In ISCA Interspeech, 3615–3619.
- Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech. In ICML, volume 139, 8599–8608.
- Robust Speech Recognition via Large-Scale Weak Supervision. In ICML, volume 202, 28492–28518.
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv preprint arXiv:2006.04558.
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In IEEE ICASSP, 4779–4783.
- NaturalSpeech 2: Latent Diffusion Models Are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv preprint arXiv:2304.09116.
- X-Vectors: Robust DNN Embeddings for Speaker Recognition. In IEEE ICASSP, 5329–5333.
- EdiTTS: Score-Based Editing for Controllable Text-to-Speech. In ISCA Interspeech, 421–425.
- Flowtron: An Autoregressive Flow-Based Generative Network for Text-to-Speech Synthesis. arXiv preprint arXiv:2005.05957.
- Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers. arXiv preprint arXiv:2301.02111.
- ESPnet: End-to-End Speech Processing Toolkit. In ISCA Interspeech, 2207–2211.
- InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt. arXiv preprint arXiv:2301.13662.
- RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion. In ISCA Interspeech, 1571–1575.
- SoundStream: An End-to-End Neural Audio Codec. IEEE/ACM Trans. Audio Speech Lang. Process., 30: 495–507.
- LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In ISCA Interspeech, 1526–1530.
- Chenpeng Du
- Yiwei Guo
- Feiyu Shen
- Zhijun Liu
- Zheng Liang
- Xie Chen
- Shuai Wang
- Hui Zhang
- Kai Yu