Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations (2401.02014v2)
Abstract: Zero-shot multi-speaker TTS aims to synthesize speech in the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, struggle to adapt to new speakers in out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose a negation feature learning paradigm that models decoupled speaker attributes as deviations from the complete audio representation via a subtraction operation. By removing superfluous content information from the speaker representation, this negation scheme not only mitigates content leakage, thereby improving synthesis robustness, but also enhances speaker fidelity. In addition, to facilitate the learning of diverse speaker attributes, we leverage multi-stream Transformers, which retain multiple hypotheses and induce a training paradigm akin to ensemble learning. To unify these hypotheses into the final speaker representation, we employ attention pooling. Finally, since the target text must be rendered in the desired voice, we adopt adaptive layer normalization to fuse the speaker representation with the target text representations, rather than merely concatenating the text and audio modalities. Extensive experiments substantiate the efficacy of the proposed approach in preserving and harnessing speaker-specific attributes compared with alternative baseline models.
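The abstract names three concrete mechanisms: a speaker representation obtained by subtracting a content estimate from the full audio representation, multi-stream Transformers whose per-stream hypotheses are unified by attention pooling, and adaptive layer normalization for conditioning text features on the speaker embedding. The following PyTorch sketch is a minimal illustration of these ideas under stated assumptions; all module names, layer counts, and dimensions (e.g., `NegatedSpeakerEncoder`, `content_proj`, `dim=256`, single-layer streams) are invented for clarity and do not reproduce the paper's actual implementation.

```python
import torch
import torch.nn as nn

class NegatedSpeakerEncoder(nn.Module):
    """Illustrative sketch: derive a speaker embedding as the full audio
    representation minus an estimated content component (the 'negation'
    idea), keep several parallel hypotheses via multiple Transformer
    streams, and unify them with attention pooling."""

    def __init__(self, dim=256, n_streams=4):
        super().__init__()
        # One (simplified, single-layer) Transformer encoder per stream:
        # each stream forms its own hypothesis of the speaker attributes,
        # giving an ensemble-like training signal.
        self.streams = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(n_streams)
        ])
        self.content_proj = nn.Linear(dim, dim)  # assumed content estimator
        self.attn_query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn_pool = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, audio_feats):             # (B, T, dim) frame features
        hyps = []
        for stream in self.streams:
            h = stream(audio_feats)             # (B, T, dim)
            # Negation via subtraction: remove the estimated content
            # component so the residual is dominated by speaker attributes.
            speaker_h = h - self.content_proj(h)
            hyps.append(speaker_h.mean(dim=1))  # (B, dim) per-stream hypothesis
        hyps = torch.stack(hyps, dim=1)         # (B, n_streams, dim)
        # Attention pooling collapses the hypotheses into one speaker vector.
        q = self.attn_query.expand(hyps.size(0), -1, -1)
        spk, _ = self.attn_pool(q, hyps, hyps)  # (B, 1, dim)
        return spk.squeeze(1)                   # (B, dim)

class AdaptiveLayerNorm(nn.Module):
    """Conditioning via adaptive layer normalization: scale and shift of the
    normalized text features are predicted from the speaker embedding,
    instead of concatenating text and speaker representations."""

    def __init__(self, dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, text_feats, spk):         # (B, L, dim), (B, dim)
        scale, shift = self.to_scale_shift(spk).chunk(2, dim=-1)
        return self.norm(text_feats) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Usage (shapes only):
# spk = NegatedSpeakerEncoder()(torch.randn(2, 120, 256))   # reference audio
# out = AdaptiveLayerNorm()(torch.randn(2, 50, 256), spk)   # conditioned text
```

The adaptive layer normalization step lets the speaker vector modulate every conditioned layer through predicted per-channel scale and shift, rather than being appended once as extra input channels, which is the contrast with concatenation that the abstract draws.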
- Yejin Jeon
- Yunsu Kim
- Gary Geunbae Lee