- The paper introduces a novel encoder-decoder Transformer TTS model using preference alignment and classifier-free guidance to reduce hallucinations and improve control.
- It employs ASR and speaker verification models to provide reward signals, optimizing transcription accuracy and target speaker similarity.
- Koel-TTS surpasses state-of-the-art models in intelligibility and naturalness, demonstrating robust zero-shot performance even on smaller datasets.
Overview of Koel-TTS: Enhancing LLM-Based Speech Generation
The paper introduces Koel-TTS, a novel suite of encoder-decoder Transformer TTS models, which aim to address the common challenges associated with LLM-based text-to-speech (TTS) systems, such as hallucinations and lack of control. By leveraging preference alignment and Classifier-Free Guidance (CFG), the authors propose an innovative framework that significantly improves the quality of speech synthesis in terms of speaker similarity, intelligibility, and naturalness.
Koel-TTS builds on the transformative impact of LLMs in creating more natural and contextually adaptive speech outputs. However, existing LLM-based TTS systems struggle with issues such as repetition, skipped words, and inconsistent output quality, problems that mirror the hallucinations seen in LLM text generation. To mitigate these issues, the authors apply preference alignment techniques, a strategy already established in text generation, to fine-tune TTS outputs so they align more closely with human preferences.
The work explores the use of automatic speech recognition (ASR) and speaker verification (SV) models to provide reward signals for preference alignment. This methodology evaluates generated speech on transcription accuracy and target speaker similarity, ranking outputs to guide model optimization. Preference alignment is conducted through Direct Preference Optimization (DPO) and Reward-aware Preference Optimization (RPO), effectively steering the Koel-TTS outputs towards more desirable qualities.
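The reward signal described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes transcription accuracy is measured as 1 minus character error rate and speaker similarity as cosine similarity of embeddings, with a hypothetical weighting `w_asr`; the function names and the linear combination are illustrative.

```python
def char_error_rate(ref: str, hyp: str) -> float:
    """Character-level edit distance, normalized by reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # one row of the edit-distance table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

def cosine(a, b) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def reward(target_text, asr_hypothesis, target_emb, generated_emb, w_asr=0.5):
    """Combine ASR accuracy and speaker similarity into one scalar reward.

    Generated samples can then be ranked by this score to build the
    preference pairs used for DPO/RPO-style alignment.
    """
    accuracy = 1.0 - char_error_rate(target_text, asr_hypothesis)
    similarity = cosine(target_emb, generated_emb)
    return w_asr * accuracy + (1 - w_asr) * similarity
```

In practice the ASR hypothesis and speaker embeddings would come from pretrained ASR and SV models; here they are treated as given inputs.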
To strengthen adherence to conditioning inputs without training a separate classifier, the authors adapt CFG, a technique traditionally used in diffusion models, to the TTS setting. During training, conditioning inputs are randomly dropped so the model also learns an unconditional distribution; during inference, conditional and unconditional logits are blended, thereby refining the synthesis process.
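The two sides of CFG described above can be sketched in a few lines. This is a generic sketch under standard CFG assumptions, not the paper's exact formulation: the linear logit interpolation and the null-embedding dropout are the common recipe, and the function names are illustrative.

```python
import numpy as np

def cfg_logits(cond_logits: np.ndarray, uncond_logits: np.ndarray,
               scale: float) -> np.ndarray:
    """Inference-side CFG: blend conditional and unconditional logits.

    scale = 1.0 recovers purely conditional decoding; larger values push
    the output distribution further toward the conditioning signal.
    """
    return uncond_logits + scale * (cond_logits - uncond_logits)

def maybe_drop_condition(cond_emb: np.ndarray, null_emb: np.ndarray,
                         drop_prob: float,
                         rng: np.random.Generator) -> np.ndarray:
    """Training-side CFG: with probability drop_prob, replace the
    conditioning embedding (e.g., text/speaker context) with a learned
    null embedding so the model also sees unconditional examples."""
    return null_emb if rng.random() < drop_prob else cond_emb
```

At decoding time, `cfg_logits` would be applied at each autoregressive step before sampling the next speech token.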
The experiments show that Koel-TTS, even when trained on a much smaller dataset, surpassed state-of-the-art models on target speaker similarity and intelligibility metrics. In particular, CFG and preference alignment together improved speaker similarity and speech naturalness across languages, enabling robust zero-shot TTS performance. A comprehensive comparative evaluation demonstrates that both the English-focused and multilingual Koel-TTS models achieve competitive, if not superior, intelligibility and naturalness relative to existing models, notwithstanding the smaller training set.
Practically, Koel-TTS has substantial implications for applications demanding high-fidelity, rapid, and adaptable speech synthesis, such as conversational agents and multimedia content creation. Theoretical implications extend to the potential for CFG to be more broadly explored within other LLM-based generative tasks, beyond TTS.
Future research avenues may include delving deeper into inference-time strategies to mitigate hallucinations robustly, especially with complex and challenging inputs. Additionally, the exploration of preference-based alignment with large-scale and diverse datasets could further fine-tune generative model outputs to better match nuanced human auditory preferences.
The paper's contributions are pivotal in steering the development of LLM-based TTS towards higher accuracy and applicability across languages, marking a step forward in the evolution of voice synthesis technologies.