Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis (2410.23320v1)
Abstract: Neural codec language models have achieved state-of-the-art performance in text-to-speech (TTS) synthesis, leveraging scalable architectures like autoregressive transformers and large-scale speech datasets. By framing voice cloning as a prompt continuation task, these models excel at cloning voices from short audio samples. However, this approach is limited in its ability to handle numerous or lengthy speech excerpts, since the concatenated source and target speech must fit within the maximum context length set during training. In this work, we introduce Lina-Speech, a model that replaces traditional self-attention mechanisms with emerging recurrent architectures such as Gated Linear Attention (GLA). Building on the success of initial-state tuning on RWKV, we extend this technique to voice cloning, enabling the use of multiple speech samples and full utilization of the context window at synthesis time. This approach is fast, easy to deploy, and achieves performance comparable to fine-tuned baselines when the dataset size ranges from 3 to 15 minutes. Notably, Lina-Speech matches or outperforms state-of-the-art baselines, including some with up to four times as many parameters or trained end-to-end. We release our code and checkpoints. Audio samples are available at https://theodorblackbird.github.io/blog/demo_lina/.
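To make the two ingredients named in the abstract concrete, here is a minimal sketch of a naive per-step gated linear attention recurrence whose initial state is the only trainable parameter (the "initial-state tuning" idea applied to voice cloning). The function name, tensor shapes, and gate parameterization are illustrative assumptions, not the released Lina-Speech implementation, which builds on hardware-efficient Triton kernels (FLA).

```python
# Minimal sketch (assumed shapes and naming, not the released code) of a
# gated linear attention step with a tunable initial state.
import torch

def gla_recurrence(q, k, v, alpha, S0):
    """Naive per-step GLA recurrence.

    q, k:  (T, d_k)   queries / keys
    v:     (T, d_v)   values
    alpha: (T, d_k)   data-dependent forget gates in (0, 1)
    S0:    (d_v, d_k) initial state; made trainable for voice cloning
                      while the rest of the backbone stays frozen
    Returns outputs of shape (T, d_v).
    """
    S, outs = S0, []
    for t in range(q.shape[0]):
        # Decay the running state column-wise, then accumulate the
        # new key/value outer product.
        S = S * alpha[t].unsqueeze(0) + torch.outer(v[t], k[t])
        outs.append(S @ q[t])
    return torch.stack(outs)

# Initial-state tuning: only S0 (one small matrix per layer/head in this
# sketch) would receive gradients during speaker adaptation.
d_k, d_v, T = 64, 64, 128
S0 = torch.nn.Parameter(torch.zeros(d_v, d_k))
q, k = torch.randn(T, d_k), torch.randn(T, d_k)
v = torch.randn(T, d_v)
alpha = torch.sigmoid(torch.randn(T, d_k))
out = gla_recurrence(q, k, v, alpha, S0)
```

Because adaptation under this scheme updates only the initial state rather than the backbone weights, and the recurrent state has constant size regardless of how much reference speech is consumed, it is consistent with the abstract's claims of parameter efficiency and of using multiple or lengthy speech samples without exhausting the context window.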