Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction (2401.01498v1)
Abstract: We propose a novel text-to-speech (TTS) framework centered on a neural transducer. Our approach divides the TTS pipeline into two stages, semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling, using discrete semantic tokens obtained from wav2vec 2.0 embeddings. For robust and efficient alignment modeling, we employ a neural transducer, called the token transducer, for semantic token prediction, benefiting from its hard monotonic alignment constraints. A non-autoregressive (NAR) speech generator then efficiently synthesizes waveforms from these semantic tokens. Additionally, a reference speech sample controls the temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing each stage to focus on semantic or acoustic modeling, respectively. Our experimental results on zero-shot adaptive TTS show that our model surpasses the baseline in speech quality and speaker similarity, both objectively and subjectively. We also examine the inference speed and prosody-control capabilities of our approach, highlighting the potential of neural transducers in TTS frameworks.
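The two-stage design can be sketched in a few lines of plain Python. This is a minimal illustrative toy, not the authors' implementation: the greedy decoding loop shows the transducer's hard monotonic alignment (a blank symbol consumes one encoder frame, so the alignment can only move forward in time), and a stand-in function plays the role of the NAR speech generator. The `toy_step` scoring function, the function names, and the frame-expansion factor are all assumptions made for the sketch.

```python
# Illustrative sketch of the two-stage pipeline described in the abstract.
# Stage 1: a transducer-style greedy decoder emits discrete semantic tokens
# under a hard monotonic alignment over encoder frames (blank = advance).
# Stage 2: a stand-in non-autoregressive generator expands tokens to frames.
from typing import Callable, List

BLANK = -1  # special symbol: consume the current encoder frame

def greedy_transducer_decode(
    num_frames: int,
    step: Callable[[int, List[int]], int],
    max_symbols_per_frame: int = 3,
) -> List[int]:
    """Greedy RNN-T-style decoding: the time index t never moves backward."""
    tokens: List[int] = []
    t = 0
    while t < num_frames:
        emitted = 0
        while emitted < max_symbols_per_frame:
            sym = step(t, tokens)   # predict from frame t and token history
            if sym == BLANK:
                break               # done with frame t
            tokens.append(sym)
            emitted += 1
        t += 1                      # hard monotonic advance
    return tokens

def nar_speech_generator(tokens: List[int], frames_per_token: int = 2) -> List[float]:
    """Stand-in NAR stage: expand each semantic token into acoustic frames."""
    return [float(tok) for tok in tokens for _ in range(frames_per_token)]

# Toy step function: emit one token per frame, then blank.
def toy_step(t: int, history: List[int]) -> int:
    return t if len(history) <= t else BLANK

semantic_tokens = greedy_transducer_decode(num_frames=4, step=toy_step)
waveform = nar_speech_generator(semantic_tokens)
```

Because the inner loop can only emit or advance, repeated or skipped text (the classic attention failure modes in seq2seq TTS) cannot arise from the alignment itself; the NAR second stage then runs in a single parallel pass over the token sequence.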
Authors: Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Semin Kim, Joun Yeop Lee, Nam Soo Kim