LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
The paper "LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes," authored by Trung Dang, David Aponte, Dung Tran, and Kazuhito Koishida from Microsoft Corp, introduces a novel approach for zero-shot text-to-speech (TTS), specifically designed to address the challenges of low-latency streaming applications. This paper presents a significant advancement in leveraging autoregressive models to enhance TTS systems' efficiency and performance.
Problem Statement and Contributions
The primary challenge addressed in this research is adapting zero-shot TTS to low-latency scenarios, which is crucial for applications such as live communication, speech-to-speech translation, accent conversion, and disfluency removal. Existing models either suffer from high inference time per decoding step or are non-autoregressive and must generate the entire utterance at once, making them unsuitable for streaming.
LiveSpeech introduces a fully autoregressive transformer model that significantly reduces latency while maintaining competitive performance. This is achieved through two key innovations:
- Adaptive Codebook Weights: The model uses a loss weighting mechanism that adapts based on the contribution of each codebook to the generated frame, focusing initially on high-level codes to ensure content accuracy and progressively shifting attention to lower-level codes to enhance voice quality.
- Parallel Codebook Group Heads: Codebooks are partitioned into groups that are processed by parallel heads, allowing the model to predict the tokens of all codebooks for a frame in a single decoding step without adding sequential inference cost.
Model Architecture
LiveSpeech employs a combination of a neural audio codec, a speech encoder, and a transformer decoder:
- Neural Audio Codec: This component encodes raw audio into discrete codes using residual vector quantization (RVQ), which ensures efficient audio tokenization.
- Speech Encoder: It maps a variable-length voice prompt to a fixed-length embedding that conditions generation on the target speaker's voice.
- Transformer Decoder: Adopting a GPT-style architecture, the decoder processes discrete tokens autoregressively, leveraging the adaptive codebook weights and parallel codebook group heads for efficient token generation (see the sketch after this list).
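To make the interaction between these components concrete, the following is a minimal, hypothetical sketch of the inference loop. The component interfaces used here (`speaker_encoder(...)`, `decoder.next_frame(...)`, `codec.decode(...)`) are illustrative assumptions for this summary, not the paper's actual API.

```python
import torch

@torch.no_grad()
def generate_speech(codec, speaker_encoder, decoder, text_tokens,
                    prompt_audio, max_frames=500):
    """Hypothetical inference loop; component interfaces are assumed
    for illustration, not taken from the paper."""
    # Fixed-length speaker embedding from a variable-length voice prompt.
    speaker_emb = speaker_encoder(prompt_audio)

    frames = []  # each entry: the discrete codes of one frame, shape (Q,)
    for _ in range(max_frames):
        # One autoregressive step predicts the tokens of *all* codebooks
        # for the next frame (via the parallel codebook group heads).
        next_frame, is_eos = decoder.next_frame(text_tokens, speaker_emb, frames)
        if is_eos:
            break
        frames.append(next_frame)
        # In a streaming setup, accumulated frames can be decoded to audio
        # incrementally instead of waiting for the full utterance.

    codes = torch.stack(frames, dim=0)  # (T, Q) frame-by-codebook token grid
    return codec.decode(codes)
```

Because every codebook of a frame is emitted in one decoding step, the per-frame latency does not grow with the number of codebooks, which is what makes streaming output feasible.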
Technical Details and Implementation
Adaptive Codebook Weights
The adaptive weighting mechanism redistributes the model's focus dynamically during training to balance the trade-off between content accuracy and voice quality. Initially, higher weights are assigned to high-level codebooks; as training progresses and the high-level predictions become accurate, the focus shifts to lower-level codebooks. This behavior is controlled by a hyperparameter λ, which sets the weight decay rate across codebook levels, and a probability threshold p_max, which treats sufficiently confident ("easy") predictions as solved so that the weights of low-level codes do not vanish.
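The paper's exact weighting formula is not reproduced here; the snippet below is a minimal sketch of one plausible instantiation consistent with the description above, in which a geometric decay (rate λ) over codebook levels is combined with masking out predictions that already exceed p_max.

```python
import torch

def adaptive_codebook_weights(p_correct: torch.Tensor,
                              lam: float = 0.5,
                              p_max: float = 0.9) -> torch.Tensor:
    """Illustrative sketch, not the paper's exact formula.

    p_correct: (..., Q) probability the model assigns to the ground-truth
        token of each codebook, ordered from high-level (index 0) to
        low-level (index Q-1).
    lam:   base decay rate; smaller values concentrate weight on high levels.
    p_max: confidence threshold above which a prediction is treated as
        "easy" and its loss is ignored, freeing weight for lower levels.
    """
    num_codebooks = p_correct.shape[-1]

    # Base geometric decay: high-level codes get the largest weights.
    levels = torch.arange(num_codebooks, dtype=p_correct.dtype,
                          device=p_correct.device)
    base = lam ** levels

    # Ignore codebooks the model already predicts confidently (p > p_max).
    active = (p_correct <= p_max).to(p_correct.dtype)
    weights = base * active

    # Renormalize per frame so weight shifts to the remaining (typically
    # lower-level) codebooks instead of vanishing as high levels get easy.
    total = weights.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return weights / total
```

During training, such weights would multiply the per-codebook cross-entropy terms, so that the effective loss gradually moves from the content-bearing high-level codes to the low-level codes that refine voice quality.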
Parallel Codebook Group Heads
By grouping multiple codebooks and processing them in parallel, the model effectively reduces the computational overhead associated with predicting multiple tokens in each autoregressive step. This approach allows the transformer to maintain high inference speed even with an increased number of codebooks, making it suitable for low-latency streaming.
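As an illustration of the grouping idea (the exact head architecture in the paper may differ), the sketch below gives each group of codebooks its own projection over the shared decoder hidden state, so the logits for every codebook of a frame are produced in a single decoding step.

```python
import torch
import torch.nn as nn

class ParallelGroupHeads(nn.Module):
    """Hypothetical parallel codebook group heads.

    Each group of codebooks is predicted by its own projection applied to
    the same transformer hidden state, so adding codebooks widens the
    parallel computation instead of adding sequential decoding steps.
    """
    def __init__(self, d_model: int, num_codebooks: int,
                 group_size: int, vocab_size: int):
        super().__init__()
        assert num_codebooks % group_size == 0
        self.group_size = group_size
        self.vocab_size = vocab_size
        self.heads = nn.ModuleList(
            nn.Linear(d_model, group_size * vocab_size)
            for _ in range(num_codebooks // group_size)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d_model) hidden states from the transformer decoder.
        # Each head emits logits for all codebooks in its group at once.
        logits = [head(h).view(*h.shape[:-1], self.group_size, self.vocab_size)
                  for head in self.heads]
        # (B, T, Q, vocab_size): logits for every codebook of each frame.
        return torch.cat(logits, dim=-2)
```

Because every head reads the same hidden state, the number of sequential decoding steps stays equal to the number of frames rather than frames × codebooks.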
Experimental Evaluation
The paper conducts comprehensive experiments using the LibriLight dataset for pretraining and the LibriTTS test set for evaluation. Objective metrics such as Character Error Rate (CER), Word Error Rate (WER), Phoneme Error Rate (PER), speaker similarity (SS), and the objective perceptual speech quality score (O-MOS) are used to evaluate the model's performance. In addition, subjective evaluations (S-MOS) are carried out to gauge perceptual quality.
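For readers unfamiliar with these metrics, the snippet below illustrates how content error rates and speaker similarity are commonly computed from an ASR transcript and speaker embeddings; the `asr` and `spk_encoder` models and the use of the `jiwer` package are illustrative choices, not necessarily the paper's evaluation toolchain.

```python
import jiwer
import torch.nn.functional as F

def objective_metrics(generated_wav, reference_text, prompt_wav, asr, spk_encoder):
    """Illustrative metric computation (placeholder models, not the
    paper's actual evaluation pipeline).

    asr(wav) -> transcript string; spk_encoder(wav) -> speaker embedding.
    """
    hyp = asr(generated_wav)

    # Content accuracy: edit-distance-based error rates against the input text.
    wer = jiwer.wer(reference_text, hyp)
    cer = jiwer.cer(reference_text, hyp)

    # Speaker similarity: cosine similarity between embeddings of the
    # generated speech and the voice prompt.
    ss = F.cosine_similarity(spk_encoder(generated_wav),
                             spk_encoder(prompt_wav), dim=-1).item()
    return {"WER": wer, "CER": cer, "SS": ss}
```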
Results
- Performance: LiveSpeech achieves scores close to or better than state-of-the-art baselines. Notably, the adaptive codebook weights and parallel group heads yield a significant reduction in CER and an improvement in speaker similarity (SS) compared with the baseline methods.
- Speed and Latency: LiveSpeech reaches a real-time factor (RTF) comparable to strong TTS models while operating with a 200 ms delay, making it well suited to real-time applications.
- Speech Quality: Both objective and subjective measures indicate that LiveSpeech maintains high speech quality at low latency. Applying an enhancer further improves perceptual scores, bringing them close to those of the reference audio.
Implications and Future Directions
The advancements presented in LiveSpeech have practical implications for real-time TTS, improving the user experience in live communication scenarios by reducing latency while maintaining high-quality synthesis. More broadly, these innovations may inspire further research on efficient autoregressive modeling for other sequential prediction tasks.
Future developments could include exploring more advanced neural audio codecs for better compression and quality trade-offs, scaling the transformer architecture for even faster inference, and extending the adaptive weighting mechanism to other domains such as video or multimodal generation.
In conclusion, LiveSpeech represents a significant step forward in low-latency zero-shot TTS, offering a robust and efficient solution for real-time speech synthesis with promising applications across real-time communication platforms.