Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis (2411.01156v2)
Abstract: Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech, capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence-generation tasks. This architecture improves codebook-processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages large language models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion, thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we developed FF-GAN, which uses GFSQ to achieve superior compression ratios and near-100% codebook utilization. Our approach addresses key limitations of current TTS systems while providing a foundation for more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech significantly outperforms baseline models on complex linguistic scenarios and voice-cloning tasks, demonstrating its potential to advance TTS technology in AI applications. The implementation is open source at https://github.com/fishaudio/fish-speech.
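To make the GFSQ idea concrete: finite scalar quantization replaces a learned vector-quantization codebook with a fixed per-dimension grid, so every grid point is a valid code and utilization naturally approaches 100%; "grouped" here refers to splitting the latent channels into groups that are quantized independently. The following is a minimal numpy sketch of that mechanism under those assumptions; the group count, level count, and tanh bounding are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def gfsq_quantize(z, num_groups=4, levels=7):
    """Grouped finite scalar quantization (illustrative sketch).

    z: array of shape (time, channels); channels must be divisible
    by num_groups. Each dimension is bounded with tanh and rounded
    to one of `levels` uniform grid values (odd `levels` keeps the
    grid symmetric around zero).
    """
    half = (levels - 1) / 2.0
    quantized, indices = [], []
    for g in np.split(z, num_groups, axis=-1):
        bounded = np.tanh(g) * half       # values in (-half, half)
        q = np.round(bounded)             # snap to integer grid points
        quantized.append(q / half)        # rescale codes into [-1, 1]
        indices.append((q + half).astype(int))  # level index in [0, levels-1]
    # Every index combination is a legal code, so there is no
    # codebook-collapse problem as in learned VQ codebooks.
    return np.concatenate(quantized, axis=-1), np.stack(indices, axis=0)
```

In a real codec the rounding step would use a straight-through estimator so gradients flow through quantization during training; the sketch above only shows the inference-time mapping from continuous latents to discrete codes.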