Intelli-Z: Toward Intelligible Zero-Shot TTS (2401.13921v1)
Abstract: Although numerous recent studies have proposed new frameworks for zero-shot TTS trained on large-scale, real-world data, studies that focus on the intelligibility of zero-shot TTS are relatively scarce. Zero-shot TTS demands additional effort to ensure clear pronunciation and speech quality, because it inherently requires replacing a core parameter (the speaker embedding or acoustic prompt) with a new one at inference time. In this study, we propose a zero-shot TTS model focused on intelligibility, which we refer to as Intelli-Z. Intelli-Z learns speaker embeddings using multi-speaker TTS as its teacher and is trained with a cycle-consistency loss so that mismatched text-speech pairs can be included in training. Additionally, it selectively aggregates speaker embeddings along the temporal dimension to minimize interference from the text content of the reference speech at inference time. We substantiate the effectiveness of the proposed methods with an ablation study. The Mean Opinion Score (MOS) for unseen speakers increases by 9% when the first two methods are applied, and improves by a further 16% when selective temporal aggregation is also applied.
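The abstract describes selective temporal aggregation only at a high level, so the snippet below is a minimal, hypothetical sketch of the idea: instead of mean-pooling frame-level speaker embeddings over time, a learned scorer weights each frame before pooling, so frames dominated by the reference text content can be suppressed. The module name, scoring layer, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional


class SelectiveTemporalAggregation(nn.Module):
    """Pools frame-level speaker embeddings with learned per-frame weights
    instead of a uniform temporal mean, so that frames dominated by the
    reference text content can be down-weighted (illustrative design)."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Scores how informative each frame is for speaker identity
        # (assumed scoring layer; not specified in the abstract).
        self.scorer = nn.Linear(embed_dim, 1)

    def forward(
        self,
        frame_embeds: torch.Tensor,           # (batch, time, embed_dim)
        mask: Optional[torch.Tensor] = None,  # (batch, time), True = valid frame
    ) -> torch.Tensor:
        scores = self.scorer(frame_embeds).squeeze(-1)         # (batch, time)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))  # ignore padded frames
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)      # (batch, time, 1)
        return (weights * frame_embeds).sum(dim=1)             # (batch, embed_dim)


if __name__ == "__main__":
    agg = SelectiveTemporalAggregation(embed_dim=256)
    frames = torch.randn(2, 120, 256)  # frame-level embeddings from a reference utterance
    speaker_embedding = agg(frames)
    print(speaker_embedding.shape)     # torch.Size([2, 256])
```

Weighted pooling of this kind keeps the aggregated embedding the same size as a mean-pooled one, so it can drop into an existing zero-shot TTS pipeline without changing downstream layers.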
- Sunghee Jung
- Won Jang
- Jaesam Yoon
- Bongwan Kim