- The paper introduces RWKVTTS, a novel TTS system that replaces Transformer models with RWKV-7 to enhance temporal modeling and reduce resource usage.
- It employs a two-stage pipeline using VQ-VAE for audio tokenization followed by sequential token prediction through RWKV-7.
- Evaluations show that RWKVTTS approaches human-recorded speech quality and scores slightly higher than a Transformer-based baseline on Content Enjoyment, making it a resource-efficient alternative for TTS applications.
This paper introduces RWKVTTS, a Text-to-Speech (TTS) system that utilizes the RWKV-7 architecture, an RNN-based model, as its core LLM component, replacing the more common Transformer-based LLMs used in systems like CosyVoice, Fish-Speech, and MegaTTS 3 (2411.01156, 2502.18924). The motivation is to leverage RWKV-7's computational efficiency and potential for better temporal modeling compared to Transformers, aiming for high-quality speech synthesis with lower resource requirements.
Methodology
The RWKVTTS pipeline follows a standard two-stage approach common in modern LLM-based TTS:
- VQ-VAE (Vector Quantized Variational Autoencoder): This component encodes reference audio into discrete audio tokens.
- LLM: This component predicts audio tokens based on input text and potentially reference audio tokens. The core innovation is using RWKV-7 as this LLM.
The process involves:
- Processing input text and reference audio into text tokens and audio tokens, respectively.
- Converting these tokens into embeddings.
- Concatenating text and audio embeddings as input to the RWKV-7 model.
- The RWKV-7 model processes the embeddings sequentially, generating hidden states.
- An audio head translates these hidden states into predicted audio tokens.
- A decoding loop generates the final sequence of audio tokens (sketched below).
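A minimal sketch of this generation loop in PyTorch, assuming a hypothetical `rwkv7` backbone that maps an embedding sequence to hidden states, an `audio_head` projection to the audio-token vocabulary, and an `audio_embed` lookup table (all names are illustrative, not taken from the paper):

```python
import torch

@torch.no_grad()
def generate_audio_tokens(rwkv7, audio_head, audio_embed,
                          text_emb, prompt_audio_emb, max_len=2048, eos_id=0):
    """Autoregressively predict audio tokens conditioned on text and reference-audio embeddings."""
    # Concatenate text and prompt-audio embeddings as the conditioning prefix: (1, T, D).
    seq = torch.cat([text_emb, prompt_audio_emb], dim=1)
    out_tokens = []
    for _ in range(max_len):
        hidden = rwkv7(seq)                    # (1, T, D) hidden states
        logits = audio_head(hidden[:, -1])     # project the last hidden state to the audio vocab
        next_tok = logits.argmax(dim=-1)       # greedy pick; sampling is equally valid
        if next_tok.item() == eos_id:
            break
        out_tokens.append(next_tok.item())
        # Feed the predicted token back in through its embedding for the next step.
        seq = torch.cat([seq, audio_embed(next_tok).unsqueeze(1)], dim=1)
    return out_tokens
```

In a real RWKV implementation the recurrent state would be carried across steps instead of re-running the growing prefix each iteration; the loop above trades that efficiency for readability.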
The paper details an implementation adapting RWKV-7 to the CosyVoice 2.0 framework (2412.10117). This involves a specific input data layout comprising SOS embeddings, text embeddings, task-ID embeddings, reference audio embeddings, and last audio embeddings. The forward pass feeds these inputs through RWKV-7 to produce logits for the target speech tokens, with prompt audio tokens and a special <|endofprompt|> token used for guidance.
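A sketch of how that input layout might be assembled; only the ordering follows the description above, and all tensor names and shapes are illustrative assumptions:

```python
import torch

def build_prompt_embeddings(sos_emb, text_emb, task_id_emb,
                            prompt_audio_emb, last_audio_emb):
    """Assemble the conditioning sequence in the order
    [SOS] [text] [task ID] [reference audio] [last audio] -> (batch, time, dim)."""
    return torch.cat(
        [sos_emb, text_emb, task_id_emb, prompt_audio_emb, last_audio_emb], dim=1
    )

# Assumed shapes (D = model width):
#   sos_emb          (1, 1, D)   learned start-of-sequence embedding
#   text_emb         (1, Tt, D)  embedded text tokens
#   task_id_emb      (1, 1, D)   embedding of the task-ID token
#   prompt_audio_emb (1, Tp, D)  embedded VQ-VAE tokens of the reference audio
#   last_audio_emb   (1, Tl, D)  embeddings of the most recently generated audio tokens
```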
Data preparation involves collecting reference audio, preparing multilingual text data (e.g., from Wikipedia), and using the VQ-VAE to generate audio tokens corresponding to text and reference audio pairs. Training utilizes distributed techniques, and inference is evaluated via zero-shot synthesis.
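At its core, the VQ-VAE tokenization used in data preparation reduces to a nearest-codebook lookup over encoder features; a minimal sketch, with the encoder and codebook as stand-ins rather than the paper's actual model:

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous encoder features (T, D) to discrete token ids (T,)
    by choosing the nearest codebook vector for each frame."""
    dists = torch.cdist(features, codebook)  # pairwise distances, shape (T, K)
    return dists.argmin(dim=-1)              # one token id per frame

# Usage (hypothetical encoder and codebook):
#   features = vqvae_encoder(waveform)                  # (T, D)
#   audio_tokens = quantize(features, vqvae_codebook)   # (T,)
```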
Evaluation
RWKVTTS was evaluated against human-recorded speech (Ground Truth) and FireRedTTS-1S (2503.20499), a Transformer-based TTS model. Human annotators scored synthesized speech on:
- Production Quality: How natural and clear the speech is.
- Production Complexity: How well the speech handles varied prosody and intonation.
- Content Enjoyment: How engaging the speech is.
- Content Usefulness: How suitable the speech is for practical applications.
Results:
| Metric | Ground Truth | RWKVTTS | FireRedTTS-1S |
|---|---|---|---|
| Production Quality | 7.80 | 7.73 | 7.82 |
| Production Complexity | 1.53 | 1.53 | 1.51 |
| Content Enjoyment | 6.20 | 6.11 | 6.06 |
| Content Usefulness | 6.52 | 6.46 | 6.61 |
RWKVTTS achieved scores very close to Ground Truth and was competitive with FireRedTTS-1S. Notably, it slightly outperformed the Transformer-based model in Content Enjoyment, potentially due to RWKV's RNN structure enhancing temporal modeling for expressiveness. All models scored low on Production Complexity, suggesting a general limitation in current TTS technology regarding intricate human prosody.
Discussion and Future Work
The results indicate that RWKV-7 is a viable and efficient alternative to Transformers in TTS, offering comparable quality with potential benefits in engagement and computational cost. This makes it suitable for resource-constrained environments. The structured input data format (as in CosyVoice 2.0) is crucial for guiding the model effectively.
Limitations include the difficulty in capturing the full complexity of human speech and potential challenges in generalizing to low-resource languages or unconditional synthesis (without reference audio).
Future work includes:
- Integrating special control tokens (laughter, breath, dialect) for enhanced expressiveness.
- Enabling unconditional generation, possibly by randomly dropping prompt audio tokens during training.
- Developing streaming capabilities for real-time applications.
- A parameter-efficient approach applying PiSSA (Principal Singular values and Singular vectors Adaptation) (2411.11057) to the MLP layers of RWKV-7. This would allow fine-tuning for speech synthesis while keeping the core language-model parameters frozen, enabling parameter sharing between TTS and text dialogue tasks for efficient deployment (sketched below).
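A minimal sketch of that last idea, assuming a PiSSA-style low-rank adapter wrapped around one MLP (channel-mixing) linear layer; the class and hyperparameters are illustrative, not the paper's implementation. The adapter factors are initialized from the top singular components of the pretrained weight, the residual weight is frozen, and only the small A/B factors would be trained for TTS:

```python
import torch
import torch.nn as nn

class PiSSALinear(nn.Module):
    """PiSSA-style adapter: W is split into a frozen residual plus a trainable
    low-rank product B @ A initialized from W's top-r singular components."""
    def __init__(self, base: nn.Linear, rank: int = 64):
        super().__init__()
        W = base.weight.data                                      # (out_dim, in_dim)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        s_sqrt = S[:rank].sqrt()
        self.A = nn.Parameter(s_sqrt.unsqueeze(1) * Vh[:rank])    # (rank, in_dim), trainable
        self.B = nn.Parameter(U[:, :rank] * s_sqrt)               # (out_dim, rank), trainable
        self.register_buffer("W_res", W - self.B.data @ self.A.data)  # frozen residual
        self.register_buffer("bias", None if base.bias is None else base.bias.data.clone())

    def forward(self, x):
        out = x @ (self.W_res + self.B @ self.A).T
        return out if self.bias is None else out + self.bias
```

Under this scheme only `A` and `B` receive gradients, so the TTS-specific weights become a small add-on to a base RWKV-7 checkpoint that can still serve text dialogue unchanged.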
In conclusion, RWKVTTS demonstrates the potential of RWKV-7 to create efficient, high-quality TTS systems, paving the way for more accessible speech synthesis solutions.