SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation (2401.13527v2)
Abstract: Benefiting from effective speech modeling, current Speech LLMs (SLLMs) have demonstrated exceptional capabilities in in-context speech generation and efficient generalization to unseen speakers. However, the prevailing information modeling process is encumbered by certain redundancies, leading to inefficiencies in speech generation. We propose Chain-of-Information Generation (CoIG), a method for decoupling semantic and perceptual information in large-scale speech generation. Building on this, we develop SpeechGPT-Gen, an 8-billion-parameter SLLM that is efficient in semantic and perceptual information modeling. It comprises an autoregressive model based on an LLM for semantic information modeling and a non-autoregressive model employing flow matching for perceptual information modeling. Additionally, we introduce the novel approach of infusing semantic information into the prior distribution to enhance the efficiency of flow matching. Extensive experimental results demonstrate that SpeechGPT-Gen markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue, underscoring CoIG's remarkable proficiency in capturing and modeling speech's semantic and perceptual dimensions. Code and models are available at https://github.com/0nutation/SpeechGPT.
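The idea of infusing semantic information into the flow-matching prior can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the prior is a Gaussian centered on a semantic feature vector (rather than on zero), uses a straight-line probability path, and replaces the learned network with an oracle velocity field. The names `sample_prior`, `flow_matching_pair`, and `euler_integrate` are hypothetical.

```python
import random


def sample_prior(semantic, sigma, rng):
    """Draw x0 from N(semantic, sigma^2 I) instead of the usual N(0, I).

    Assumption: "infusing semantic information into the prior" is modeled
    here as shifting the prior mean to the semantic features.
    """
    return [s + sigma * rng.gauss(0.0, 1.0) for s in semantic]


def flow_matching_pair(x0, x1, t):
    """Return the point x_t on the straight path x0 -> x1 and its velocity.

    With a linear path x_t = (1 - t) * x0 + t * x1, the regression target
    for the velocity network is the constant vector x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_integrate(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
semantic = [rng.gauss(0.0, 1.0) for _ in range(8)]          # toy semantic features
target = [s + 0.1 * rng.gauss(0.0, 1.0) for s in semantic]  # toy perceptual target

x0 = sample_prior(semantic, sigma=0.5, rng=rng)
x_t, v = flow_matching_pair(x0, target, t=0.25)

# Oracle velocity field standing in for a trained network.
x1 = euler_integrate(x0, lambda x, t: v, steps=10)
```

Because the prior sample already starts near the semantic content in this toy setup, the path to the target is shorter than it would be from a zero-mean sample, which is one intuitive reading of the efficiency claim in the abstract.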