SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection (2408.17432v2)

Published 30 Aug 2024 in eess.AS and cs.LG

Abstract: Synthesizing the voices of unseen speakers remains a persistent challenge in multi-speaker text-to-speech (TTS). Existing methods model speaker characteristics through speaker conditioning during training, leading to increased model complexity and limiting reproducibility and accessibility. A lower-complexity method would enable speech synthesis research with limited computational and data resources to reach a wider audience. To this end, we propose SelectTTS, a simple and effective alternative. SelectTTS selects appropriate frames from the target speaker and decodes them using frame-level self-supervised learning (SSL) features. We demonstrate that this approach can effectively capture speaker characteristics for unseen speakers and achieves performance comparable to state-of-the-art multi-speaker TTS frameworks on both objective and subjective metrics. By directly selecting frames from the target speaker's speech, SelectTTS enables generalization to unseen speakers with significantly lower model complexity. Compared to baselines such as XTTS-v2 and VALL-E, SelectTTS achieves better speaker similarity while reducing model parameters by over 8x and training data requirements by 270x.

Summary

  • The paper introduces a two-stage TTS pipeline that decouples semantic unit prediction and frame selection, reducing complexity while maintaining synthesis quality.
  • The methodology leverages discrete semantic unit prediction along with sub-sequence matching and inverse k-means sampling to accurately capture prosody and speaker attributes.
  • Experimental results demonstrate that SelectTTS achieves superior speaker similarity and naturalness with significantly fewer parameters and reduced training data compared to baselines.

SelectTTS: Synthesizing Anyone’s Voice via Discrete Unit-Based Frame Selection

SelectTTS, presented by Ismail Rasim Ulgen, Shreeram Suresh Chandra, Junchen Lu, and Berrak Sisman, explores a streamlined approach to multi-speaker text-to-speech (TTS) synthesis for unseen speakers. The work departs from conventional methods, which typically increase model complexity through speaker conditioning. Instead, SelectTTS selects frames from the target speaker's reference speech using self-supervised learning (SSL) features, achieving high-quality voice synthesis with lower model complexity.

Overview of SelectTTS

The core contribution of SelectTTS lies in its two-stage pipeline: semantic unit prediction from text and frame selection from reference speech. This method simplifies the integration of speaker attributes into the TTS process by decoupling the tasks of semantic prediction and speaker modeling.

  1. Text-to-Semantic Unit Prediction: A non-autoregressive model predicts frame-level discrete semantic units from input text. Utilizing discrete units rather than continuous features reduces complexity and aids more effective frame selection.
  2. Frame Selection: The proposed algorithms, sub-sequence matching and inverse k-means sampling, select the reference-speech frames that correspond to the predicted semantic units (see the sketch after this list). Sub-sequence matching preserves segment-level prosody by copying longer contiguous chunks of reference speech, while inverse k-means sampling recovers continuous SSL features from their discrete counterparts.
  3. Vocoding: The selected frame-level SSL features are mapped to the speech waveform using a customized HiFi-GAN vocoder.
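
The paper describes these two algorithms at a high level; the following is a minimal sketch of how the frame-selection stage might look, assuming SSL features have been discretized with a k-means codebook (HuBERT-style units are a common choice). The function name, the `min_match` threshold, and the greedy longest-match search are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_frames(pred_units, ref_units, ref_feats, min_match=3, rng=None):
    """Pick one reference SSL frame per predicted unit.

    pred_units: (T,)   discrete units predicted from text
    ref_units:  (N,)   k-means units of the reference utterance
    ref_feats:  (N, D) continuous SSL features of the reference
    """
    rng = rng or np.random.default_rng(0)
    out = np.zeros((len(pred_units), ref_feats.shape[1]), dtype=ref_feats.dtype)
    # Index reference frames by unit for the inverse k-means fallback.
    by_unit = {int(u): np.flatnonzero(ref_units == u) for u in np.unique(ref_units)}

    t = 0
    while t < len(pred_units):
        # Sub-sequence matching: longest run starting at t that also
        # occurs contiguously in the reference unit sequence.
        best_len, best_pos = 0, -1
        for s in range(len(ref_units)):
            k = 0
            while (t + k < len(pred_units) and s + k < len(ref_units)
                   and pred_units[t + k] == ref_units[s + k]):
                k += 1
            if k > best_len:
                best_len, best_pos = k, s
        if best_len >= min_match:
            # Copy a contiguous chunk of frames to preserve local prosody.
            out[t:t + best_len] = ref_feats[best_pos:best_pos + best_len]
            t += best_len
        else:
            # Inverse k-means sampling: draw any reference frame whose unit
            # matches, recovering a continuous feature for this unit.
            cands = by_unit.get(int(pred_units[t]))
            if cands is not None and len(cands) > 0:
                out[t] = ref_feats[rng.choice(cands)]
            # If the unit never occurs in the reference, this toy sketch
            # leaves a zero frame; a real system needs a better fallback.
            t += 1
    return out  # (T, D) frame-level features for the vocoder
```

Because every output frame is copied from the target speaker's actual speech, the vocoder only has to invert SSL features into a waveform; no speaker embedding or conditioning is needed at synthesis time.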

Experimental Setup

The experimental setup involved training the text-to-semantic unit model and the vocoder on the LibriSpeech train-clean-100 dataset. Baselines for comparison included XTTS-v2, YourTTS, and an implementation of VALL-E. Performance was assessed with objective metrics, Word Error Rate (WER), UTMOS, and Speaker Encoder Cosine Similarity (SECS), as well as subjective Mean Opinion Score (MOS) and speaker similarity Mean Opinion Score (SMOS) evaluations.
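
For concreteness, SECS is the cosine similarity between speaker embeddings of the reference and synthesized utterances. A minimal sketch, assuming some pretrained speaker encoder `encoder` that maps a waveform to a 1-D embedding (the paper's specific encoder is not assumed here):

```python
import numpy as np

def secs(encoder, ref_wav, syn_wav):
    """Speaker Encoder Cosine Similarity between two utterances.

    `encoder` is any pretrained speaker encoder returning a 1-D
    embedding vector (hypothetical interface for illustration).
    """
    a, b = encoder(ref_wav), encoder(syn_wav)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```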

Key Results and Findings

  • Performance and Efficiency: SelectTTS demonstrated competitive performance against substantially larger baselines. It outperformed all baselines in SECS, indicating superior speaker similarity, and achieved near state-of-the-art results for speech naturalness and WER.
  • Model Complexity: Compared to XTTS-v2 and VALL-E, SelectTTS requires over 8x fewer parameters and approximately 270x less training data, a far more resource-efficient approach that does not compromise synthesis quality.
  • Reference Speech Duration: SelectTTS remained robust with limited reference speech, maintaining competitive SECS and acceptable WER with as little as 30 seconds of reference audio.
  • Subjective Evaluation: In listening tests, SelectTTS achieved strong MOS and SMOS results, confirming its ability to generate natural, speaker-similar speech. Sub-sequence matching in particular enhanced speech naturalness.

Implications and Future Directions

Theoretical Implications:

  • The SelectTTS methodology challenges the prevailing paradigm of complex, large-scale models in multi-speaker TTS by demonstrating the viability of simpler, frame selection-based approaches.

Practical Implications:

  • SelectTTS holds substantial promise in applications where efficient and scalable TTS synthesis is critical, such as conversational AI, content creation, and personalized voice assistants.

Future Directions:

  • Further research may extend the frame selection algorithms to align larger syntactic and semantic contexts from the reference speech, potentially improving prosody and intonation. Integrating SSL models with more extensive pre-training and fine-tuning could raise quality further.

In conclusion, SelectTTS presents a compelling alternative to existing multi-speaker TTS methods: a synthesis framework that prioritizes efficiency and simplicity without compromising quality. By lowering the complexity and data barriers of multi-speaker TTS, this work paves the way for more reproducible and scalable TTS systems.
