RWKVTTS: Yet another TTS based on RWKV-7 (2504.03289v1)

Published 4 Apr 2025 in cs.SD, cs.CL, and eess.AS

Abstract: Human-AI interaction thrives on intuitive and efficient interfaces, among which voice stands out as a particularly natural and accessible modality. Recent advancements in transformer-based text-to-speech (TTS) systems, such as Fish-Speech, CosyVoice, and MegaTTS 3, have delivered remarkable improvements in quality and realism, driving a significant evolution in the TTS domain. In this paper, we introduce RWKV-7 (Peng et al., 2025), a cutting-edge RNN-based architecture tailored for TTS applications. Unlike traditional transformer models, RWKV-7 leverages the strengths of recurrent neural networks to achieve greater computational efficiency and scalability, while maintaining high-quality output. Our comprehensive benchmarks demonstrate that RWKV-7 outperforms transformer-based models across multiple key metrics, including synthesis speed, naturalness of speech, and resource efficiency. Furthermore, we explore its adaptability to diverse linguistic contexts and low-resource environments, showcasing its potential to democratize TTS technology. These findings position RWKV-7 as a powerful and innovative alternative, paving the way for more accessible and versatile voice synthesis solutions in real-world applications. Our code and weights are available at https://github.com/yynil/RWKVTTS and https://huggingface.co/spaces/RWKV-Red-Team.

Summary

  • The paper introduces RWKVTTS, a novel TTS system that replaces Transformer models with RWKV-7 to enhance temporal modeling and reduce resource usage.
  • It employs a two-stage pipeline using VQ-VAE for audio tokenization followed by sequential token prediction through RWKV-7.
  • Evaluations reveal that RWKVTTS achieves near-human speech quality and improved content enjoyment, making it a resource-efficient alternative for TTS applications.

This paper introduces RWKVTTS, a Text-to-Speech (TTS) system that utilizes the RWKV-7 architecture, an RNN-based model, as its core LLM component, replacing the more common Transformer-based LLMs used in systems like CosyVoice, Fish-Speech, and MegaTTS 3 (2411.01156, 2502.18924). The motivation is to leverage RWKV-7's computational efficiency and potential for better temporal modeling compared to Transformers, aiming for high-quality speech synthesis with lower resource requirements.

Methodology

The RWKVTTS pipeline follows a standard two-stage approach common in modern LLM-based TTS:

  1. VQ-VAE (Vector Quantized Variational Autoencoder): This component encodes reference audio into discrete audio tokens.
  2. LLM: This component predicts audio tokens based on input text and potentially reference audio tokens. The core innovation is using RWKV-7 as this LLM.

The process involves:

  • Processing input text and reference audio into text tokens and audio tokens, respectively.
  • Converting these tokens into embeddings.
  • Concatenating text and audio embeddings as input to the RWKV-7 model.
  • The RWKV-7 model processes the embeddings sequentially, generating hidden states.
  • An audio head translates these hidden states into predicted audio tokens.
  • A decoding loop generates the final sequence of audio tokens (sketched below).
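To make the decoding loop concrete, the minimal sketch below runs a prompt of text and reference-audio embeddings through a recurrent backbone and then autoregressively predicts audio tokens with an audio head. A GRU stands in for RWKV-7 purely for illustration; the vocabulary sizes, the end-of-audio token id, and greedy decoding are assumptions, not details taken from the paper or the RWKVTTS repository.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB, DIM, AUDIO_EOS = 1000, 4096, 256, 4095

text_emb   = nn.Embedding(TEXT_VOCAB, DIM)       # text token -> embedding
audio_emb  = nn.Embedding(AUDIO_VOCAB, DIM)      # VQ-VAE audio token -> embedding
backbone   = nn.GRU(DIM, DIM, batch_first=True)  # stand-in for the RWKV-7 backbone
audio_head = nn.Linear(DIM, AUDIO_VOCAB)         # hidden state -> audio-token logits

@torch.no_grad()
def synthesize(text_ids, ref_audio_ids, max_len=200):
    """Greedy decode: text + reference-audio tokens -> new audio tokens."""
    # Concatenate text and audio embeddings and run the prompt once.
    prefix = torch.cat([text_emb(text_ids), audio_emb(ref_audio_ids)], dim=1)
    out, state = backbone(prefix)
    hidden = out[:, -1:, :]                      # last hidden state feeds the audio head
    generated = []
    for _ in range(max_len):
        logits = audio_head(hidden)              # predict the next audio token
        token = logits.argmax(dim=-1)            # greedy choice (sampling is also possible)
        if token.item() == AUDIO_EOS:            # assumed end-of-audio token
            break
        generated.append(token.item())
        hidden, state = backbone(audio_emb(token), state)  # feed the token back in
    return generated                             # handed to the VQ-VAE decoder for waveform synthesis

tokens = synthesize(torch.randint(0, TEXT_VOCAB, (1, 12)),
                    torch.randint(0, AUDIO_VOCAB, (1, 30)))
```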

The paper details an implementation that adapts RWKV-7 to the CosyVoice 2.0 framework (2412.10117). The input data layout concatenates SOS embeddings, text embeddings, task-ID embeddings, reference audio embeddings, and last audio embeddings. The forward pass feeds this sequence through RWKV-7 to produce logits over the target speech tokens, with prompt audio tokens and a special <|endofprompt|> token providing guidance.
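The sketch below only illustrates how such a segmented input sequence might be assembled before the forward pass; the module names, single learned SOS/task-ID vectors, and segment shapes are hypothetical placeholders, and the actual layout lives in the RWKVTTS repository.

```python
import torch
import torch.nn as nn

DIM = 256
sos_emb     = nn.Parameter(torch.zeros(1, 1, DIM))  # learned start-of-sequence embedding (assumed)
task_id_emb = nn.Parameter(torch.zeros(1, 1, DIM))  # learned task-ID embedding (assumed)

def build_lm_input(text_embs, ref_audio_embs, last_audio_embs):
    """Concatenate segments in the order used for the forward pass:
    SOS | text | task ID | reference audio | last audio."""
    return torch.cat(
        [sos_emb, text_embs, task_id_emb, ref_audio_embs, last_audio_embs], dim=1
    )

# e.g. 12 text tokens, 30 prompt-audio tokens, 5 already-available audio tokens
lm_input = build_lm_input(torch.randn(1, 12, DIM),
                          torch.randn(1, 30, DIM),
                          torch.randn(1, 5, DIM))
# lm_input is fed through RWKV-7 to obtain logits over the target speech tokens.
```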

Data preparation involves collecting reference audio, preparing multilingual text data (e.g., from Wikipedia), and using the VQ-VAE to generate audio tokens corresponding to text and reference audio pairs. Training utilizes distributed techniques, and inference is evaluated via zero-shot synthesis.
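As a rough, self-contained illustration of how a training example pairs text with discrete audio tokens, the snippet below uses a toy nearest-codebook quantizer as a stand-in for the real VQ-VAE; the codebook size, feature dimensions, and field names are assumptions, not the repository's API.

```python
import torch

codebook = torch.randn(512, 80)  # toy codebook: 512 entries over 80-dim frames

def toy_vqvae_encode(features):
    """Map each frame to the index of its nearest codebook entry."""
    dists = torch.cdist(features, codebook)  # (frames, 512) pairwise distances
    return dists.argmin(dim=-1)              # discrete audio tokens

example = {
    "text_tokens": torch.randint(0, 1000, (12,)),                   # tokenized text
    "prompt_audio_tokens": toy_vqvae_encode(torch.randn(30, 80)),   # reference clip
    "target_audio_tokens": toy_vqvae_encode(torch.randn(120, 80)),  # clip matching the text
}
```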

Evaluation

RWKVTTS was evaluated against human-recorded speech (Ground Truth) and FireRedTTS-1S (2503.20499), a Transformer-based TTS model. Human annotators scored synthesized speech on:

  • Production Quality: How natural and clear the speech is.
  • Production Complexity: How well the speech handles varied prosody and intonation.
  • Content Enjoyment: How engaging the speech is.
  • Content Usefulness: How suitable the speech is for practical applications.

Results:

Metric                  Ground Truth   RWKVTTS   FireRedTTS-1S
Production Quality      7.80           7.73      7.82
Production Complexity   1.53           1.53      1.51
Content Enjoyment       6.20           6.11      6.06
Content Usefulness      6.52           6.46      6.61

RWKVTTS achieved scores very close to Ground Truth and was competitive with FireRedTTS-1S. Notably, it slightly outperformed the Transformer-based model in Content Enjoyment, potentially due to RWKV's RNN structure enhancing temporal modeling for expressiveness. All models scored low on Production Complexity, suggesting a general limitation in current TTS technology regarding intricate human prosody.

Discussion and Future Work

The results indicate that RWKV-7 is a viable and efficient alternative to Transformers in TTS, offering comparable quality with potential benefits in engagement and computational cost. This makes it suitable for resource-constrained environments. The structured input data format (as in CosyVoice 2.0) is crucial for guiding the model effectively.

Limitations include the difficulty in capturing the full complexity of human speech and potential challenges in generalizing to low-resource languages or unconditional synthesis (without reference audio).

Future work includes:

  • Integrating special control tokens (laughter, breath, dialect) for enhanced expressiveness.
  • Enabling unconditional generation, possibly by randomly dropping prompt audio tokens during training (see the sketch after this list).
  • Developing streaming capabilities for real-time applications.
  • A novel approach using Parameter-Efficient Sparse Adaptation (PISSA) (2411.11057) on the MLP layer of RWKV-7. This would allow fine-tuning for speech synthesis while keeping the core LM parameters unchanged, enabling parameter sharing between TTS and text dialogue tasks for efficient deployment.
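The prompt-dropping idea from the second bullet can be sketched as follows: during training, the reference-audio segment is omitted with some probability so the model also learns to generate speech without a prompt. The drop probability and segment shapes are assumptions for illustration only.

```python
import random
import torch

DROP_PROB = 0.3  # assumed probability of dropping the prompt-audio segment

def build_training_sequence(text_embs, prompt_audio_embs, target_audio_embs):
    segments = [text_embs]
    if random.random() > DROP_PROB:          # keep the prompt most of the time
        segments.append(prompt_audio_embs)   # conditional (zero-shot cloning) case
    segments.append(target_audio_embs)       # tokens the model is trained to predict
    return torch.cat(segments, dim=1)

seq = build_training_sequence(torch.randn(1, 12, 256),
                              torch.randn(1, 30, 256),
                              torch.randn(1, 120, 256))
```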

In conclusion, RWKVTTS demonstrates the potential of RWKV-7 to create efficient, high-quality TTS systems, paving the way for more accessible speech synthesis solutions.