Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis (2406.02009v2)
Abstract: Recent language model (LM) based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codecs that contain fine-grained acoustic details. The TTS model thus focuses solely on linguistic modeling during autoregressive training, thereby reducing the error propagation that otherwise occurs in autoregressive decoding. Both objective and subjective evaluations validate the effectiveness of our proposed method.
- Kun Zhou
- Shengkui Zhao
- Yukun Ma
- Chong Zhang
- Hao Wang
- Dianwen Ng
- Chongjia Ni
- Nguyen Trung Hieu
- Jia Qi Yip
- Bin Ma
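The abstract describes a two-stage pipeline: an autoregressive language model predicts phonetically rich self-supervised units from text, and a non-autoregressive model then maps those units to acoustic codec tokens carrying fine-grained detail. The minimal PyTorch sketch below illustrates that division of labor only; all module names, layer sizes, and vocabulary sizes are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of the two-stage design: AR phonetic LM + NAR codec predictor.
# All dimensions and vocabularies are assumed values for illustration.
import torch
import torch.nn as nn

TEXT_VOCAB = 256       # assumed text/phoneme token vocabulary
PHONETIC_VOCAB = 512   # assumed discretized SSL (phonetic) unit vocabulary
CODEC_VOCAB = 1024     # assumed acoustic codec codebook size
D_MODEL = 256


class ARPhoneticLM(nn.Module):
    """Stage 1: autoregressive LM whose targets are phonetic SSL units.

    In a real system, generated phonetic units (offset by TEXT_VOCAB)
    would be appended to the text prefix and decoded step by step; here
    we just run one causal forward pass over the text prompt.
    """

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + PHONETIC_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, PHONETIC_VOCAB)

    def forward(self, tokens):
        # Causal mask: True above the diagonal blocks attention to the future.
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)  # next-phonetic-unit logits per position


class NARCodecPredictor(nn.Module):
    """Stage 2: non-autoregressive map from phonetic units to codec tokens."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(PHONETIC_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, CODEC_VOCAB)

    def forward(self, phonetic_units):
        # No causal mask: all codec tokens are predicted in parallel,
        # so a mistake at one step cannot propagate to later steps.
        return self.head(self.backbone(self.embed(phonetic_units)))


# Toy forward pass: text tokens -> phonetic units -> codec token logits.
text = torch.randint(0, TEXT_VOCAB, (1, 20))
units = ARPhoneticLM()(text).argmax(-1)
codec_logits = NARCodecPredictor()(units)
print(codec_logits.shape)  # torch.Size([1, 20, 1024])
```

The key design point the sketch tries to make concrete is the abstract's split of responsibilities: only the compact phonetic units are generated autoregressively, while the acoustic detail is filled in non-autoregressively, limiting how far a single unit-prediction error can compound.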