LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
The paper "LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes," authored by Trung Dang, David Aponte, Dung Tran, and Kazuhito Koishida from Microsoft Corp, introduces a novel approach for zero-shot text-to-speech (TTS), specifically designed to address the challenges of low-latency streaming applications. This paper presents a significant advancement in leveraging autoregressive models to enhance TTS systems' efficiency and performance.
Problem Statement and Contributions
The primary challenge addressed in this research is adapting zero-shot TTS to low-latency scenarios, which is crucial for applications such as live communication, speech-to-speech translation, accent conversion, and disfluency removal. Existing models either suffer from high inference time per decoding step or are non-autoregressive and must generate the entire utterance at once, making them unsuitable for streaming.
LiveSpeech introduces a fully autoregressive transformer model that significantly reduces latency while maintaining competitive performance. This is achieved through two key innovations:
- Adaptive Codebook Weights: The model uses a loss weighting mechanism that adapts based on the contribution of each codebook to the generated frame, focusing initially on high-level codes to ensure content accuracy and progressively shifting attention to lower-level codes to enhance voice quality.
- Parallel Codebook Group Heads: Codebooks are partitioned into groups that are processed by parallel heads, allowing the model to predict the tokens of all codebooks for a frame in a single decoding step without adding sequential inference cost.
Model Architecture
LiveSpeech employs a combination of a neural audio codec, a speech encoder, and a transformer decoder:
- Neural Audio Codec: This component encodes raw audio into discrete codes using residual vector quantization (RVQ), which ensures efficient audio tokenization.
- Speech Encoder: It maps a variable-length voice prompt to a fixed-length embedding that conditions generation on the target speaker's voice.
- Transformer Decoder: Adopting a GPT-style architecture, the decoder processes discrete tokens autoregressively, leveraging the adaptive codebook weights and parallel codebook group heads for efficient token generation (see the sketch after this list).
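To make the interaction between these components concrete, the following is a minimal, hypothetical sketch of the inference loop. The component interfaces used here (`speaker_encoder(...)`, `decoder.next_frame(...)`, `codec.decode(...)`) are illustrative assumptions for this summary, not the paper's actual API.

```python
import torch

@torch.no_grad()
def generate_speech(codec, speaker_encoder, decoder, text_tokens,
                    prompt_audio, max_frames=500):
    """Hypothetical inference loop; component interfaces are assumed
    for illustration, not taken from the paper."""
    # Fixed-length speaker embedding from a variable-length voice prompt.
    speaker_emb = speaker_encoder(prompt_audio)

    frames = []  # each entry: the discrete codes of one frame, shape (Q,)
    for _ in range(max_frames):
        # One autoregressive step predicts the tokens of *all* codebooks
        # for the next frame (via the parallel codebook group heads).
        next_frame, is_eos = decoder.next_frame(text_tokens, speaker_emb, frames)
        if is_eos:
            break
        frames.append(next_frame)
        # In a streaming setup, accumulated frames can be decoded to audio
        # incrementally instead of waiting for the full utterance.

    codes = torch.stack(frames, dim=0)  # (T, Q) frame-by-codebook token grid
    return codec.decode(codes)
```

Because every codebook of a frame is emitted in one decoding step, the per-frame latency does not grow with the number of codebooks, which is what makes streaming output feasible.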
Technical Details and Implementation
Adaptive Codebook Weights
The adaptive weighting mechanism redistributes the model's focus dynamically during training to balance the trade-off between content accuracy and voice quality. Initially, higher weights are assigned to high-level codebooks; as training progresses and the high-level predictions become accurate, the focus shifts to lower-level codebooks. This behavior is controlled by a hyperparameter λ, which sets the weight decay rate across codebook levels, and a probability threshold p_max, which treats sufficiently confident ("easy") predictions as solved so that the weights of low-level codes do not vanish.
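The paper's exact weighting formula is not reproduced here; the snippet below is a minimal sketch of one plausible instantiation consistent with the description above, in which a geometric decay (rate λ) over codebook levels is combined with masking out predictions that already exceed p_max.

```python
import torch

def adaptive_codebook_weights(p_correct: torch.Tensor,
                              lam: float = 0.5,
                              p_max: float = 0.9) -> torch.Tensor:
    """Illustrative sketch, not the paper's exact formula.

    p_correct: (..., Q) probability the model assigns to the ground-truth
        token of each codebook, ordered from high-level (index 0) to
        low-level (index Q-1).
    lam:   base decay rate; smaller values concentrate weight on high levels.
    p_max: confidence threshold above which a prediction is treated as
        "easy" and its loss is ignored, freeing weight for lower levels.
    """
    num_codebooks = p_correct.shape[-1]

    # Base geometric decay: high-level codes get the largest weights.
    levels = torch.arange(num_codebooks, dtype=p_correct.dtype,
                          device=p_correct.device)
    base = lam ** levels

    # Ignore codebooks the model already predicts confidently (p > p_max).
    active = (p_correct <= p_max).to(p_correct.dtype)
    weights = base * active

    # Renormalize per frame so weight shifts to the remaining (typically
    # lower-level) codebooks instead of vanishing as high levels get easy.
    total = weights.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return weights / total
```

During training, such weights would multiply the per-codebook cross-entropy terms, so that the effective loss gradually moves from the content-bearing high-level codes to the low-level codes that refine voice quality.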
Parallel Codebook Group Heads
By grouping multiple codebooks and processing them in parallel, the model effectively reduces the computational overhead associated with predicting multiple tokens in each autoregressive step. This approach allows the transformer to maintain high inference speed even with an increased number of codebooks, making it suitable for low-latency streaming.
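As an illustration of the grouping idea (the exact head architecture in the paper may differ), the sketch below gives each group of codebooks its own projection over the shared decoder hidden state, so the logits for every codebook of a frame are produced in a single decoding step.

```python
import torch
import torch.nn as nn

class ParallelGroupHeads(nn.Module):
    """Hypothetical parallel codebook group heads.

    Each group of codebooks is predicted by its own projection applied to
    the same transformer hidden state, so adding codebooks widens the
    parallel computation instead of adding sequential decoding steps.
    """
    def __init__(self, d_model: int, num_codebooks: int,
                 group_size: int, vocab_size: int):
        super().__init__()
        assert num_codebooks % group_size == 0
        self.group_size = group_size
        self.vocab_size = vocab_size
        self.heads = nn.ModuleList(
            nn.Linear(d_model, group_size * vocab_size)
            for _ in range(num_codebooks // group_size)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d_model) hidden states from the transformer decoder.
        # Each head emits logits for all codebooks in its group at once.
        logits = [head(h).view(*h.shape[:-1], self.group_size, self.vocab_size)
                  for head in self.heads]
        # (B, T, Q, vocab_size): logits for every codebook of each frame.
        return torch.cat(logits, dim=-2)
```

Because every head reads the same hidden state, the number of sequential decoding steps stays equal to the number of frames rather than frames × codebooks.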
Experimental Evaluation
The paper conducts comprehensive experiments using the LibriLight dataset for pretraining and the LibriTTS test set for evaluation. Objective metrics such as Character Error Rate (CER), Word Error Rate (WER), Phoneme Error Rate (PER), speaker similarity (SS), and the objective perceptual speech quality score (O-MOS) are used to evaluate the model's performance. In addition, subjective evaluations (S-MOS) are carried out to gauge perceptual quality.
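For readers unfamiliar with these metrics, the snippet below illustrates how content error rates and speaker similarity are commonly computed from an ASR transcript and speaker embeddings; the `asr` and `spk_encoder` models and the use of the `jiwer` package are illustrative choices, not necessarily the paper's evaluation toolchain.

```python
import jiwer
import torch.nn.functional as F

def objective_metrics(generated_wav, reference_text, prompt_wav, asr, spk_encoder):
    """Illustrative metric computation (placeholder models, not the
    paper's actual evaluation pipeline).

    asr(wav) -> transcript string; spk_encoder(wav) -> speaker embedding.
    """
    hyp = asr(generated_wav)

    # Content accuracy: edit-distance-based error rates against the input text.
    wer = jiwer.wer(reference_text, hyp)
    cer = jiwer.cer(reference_text, hyp)

    # Speaker similarity: cosine similarity between embeddings of the
    # generated speech and the voice prompt.
    ss = F.cosine_similarity(spk_encoder(generated_wav),
                             spk_encoder(prompt_wav), dim=-1).item()
    return {"WER": wer, "CER": cer, "SS": ss}
```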
Results
- Performance: LiveSpeech achieves scores close to or better than state-of-the-art baselines. Notably, the adaptive codebook weights and parallel group heads yield a significant reduction in CER and an improvement in speaker similarity (SS) compared with the baseline methods.
- Speed and Latency: LiveSpeech reaches a real-time factor (RTF) comparable to strong TTS models while operating with a 200 ms delay, making it well suited to real-time applications.
- Speech Quality: Both objective and subjective measures indicate that LiveSpeech maintains high speech quality at low latency. Applying an enhancer further improves perceptual scores, bringing them close to those of the reference audio.
Implications and Future Directions
The advancements presented in LiveSpeech have practical implications for real-time TTS, improving the user experience in live communication scenarios by reducing latency while maintaining high-quality synthesis. More broadly, these innovations may inspire further research on efficient autoregressive modeling for other sequential prediction tasks.
Future developments could include exploring more advanced neural audio codecs for better compression and quality trade-offs, scaling the transformer architecture for even faster inference, and extending the adaptive weighting mechanism to other domains such as video or multimodal generation.
In conclusion, LiveSpeech represents a significant step forward in low-latency zero-shot TTS, offering a robust and efficient solution for real-time speech synthesis with promising applications across real-time communication platforms.