SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation (2401.13527v2)

Published 24 Jan 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Benefiting from effective speech modeling, current Speech LLMs (SLLMs) have demonstrated exceptional capabilities in in-context speech generation and efficient generalization to unseen speakers. However, the prevailing information modeling process is encumbered by certain redundancies, leading to inefficiencies in speech generation. We propose Chain-of-Information Generation (CoIG), a method for decoupling semantic and perceptual information in large-scale speech generation. Building on this, we develop SpeechGPT-Gen, an 8-billion-parameter SLLM efficient in semantic and perceptual information modeling. It comprises an autoregressive model based on LLM for semantic information modeling and a non-autoregressive model employing flow matching for perceptual information modeling. Additionally, we introduce the novel approach of infusing semantic information into the prior distribution to enhance the efficiency of flow matching. Extensive experimental results demonstrate that SpeechGPT-Gen markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue, underscoring CoIG's remarkable proficiency in capturing and modeling speech's semantic and perceptual dimensions. Code and models are available at https://github.com/0nutation/SpeechGPT.

Summary

  • The paper introduces a novel two-stage framework that disentangles semantic and perceptual information for improved speech synthesis.
  • It leverages an 8-billion-parameter architecture, pairing an autoregressive LLM over SpeechTokenizer semantic tokens with a non-autoregressive flow-matching model for perceptual detail.
  • Experimental results demonstrate superior performance in zero-shot text-to-speech, voice conversion, and dialogue tasks, surpassing established benchmarks.

Introduction

Recent work on Speech LLMs (SLLMs) has predominantly modeled semantic and perceptual information in speech jointly. This integrated approach introduces redundancy and blurs the distinction between semantic content and perceptual characteristics such as voice timbre. This paper introduces SpeechGPT-Gen, an 8-billion-parameter SLLM built on Chain-of-Information Generation (CoIG), a framework that decouples semantic processing from perceptual modeling and handles the two sequentially, as sketched below.
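
Conceptually, CoIG chains two generators, with the output of the semantic stage conditioning the perceptual stage. The sketch below illustrates this flow; every component name (semantic_lm, perceptual_flow, vocoder) is a hypothetical stand-in, not the released SpeechGPT-Gen interface.

```python
# A conceptual sketch of the Chain-of-Information pipeline, with
# hypothetical duck-typed components rather than the real API.
def chain_of_information_generate(text_tokens, prompt_speech,
                                  semantic_lm, perceptual_flow, vocoder):
    """Generate speech in two decoupled stages: semantics, then perception."""
    # Stage 1: an autoregressive LLM predicts discrete semantic tokens
    # (e.g., SpeechTokenizer first-layer RVQ indices) from the input text.
    semantic_tokens = semantic_lm.generate(text_tokens)

    # Stage 2: a non-autoregressive flow-matching model adds perceptual
    # information (timbre, prosody), conditioned on the semantic tokens
    # and a short speech prompt from the target speaker.
    perceptual_latents = perceptual_flow.sample(semantic_tokens, prompt_speech)

    # Decode the completed representation back to a waveform.
    return vocoder(perceptual_latents)
```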

Semantic-Perceptual Disentanglement

CoIG casts speech synthesis as a two-stage process. Semantic content is represented as discrete tokens extracted by SpeechTokenizer and modeled autoregressively by an LLM; perceptual characteristics are then modeled non-autoregressively via flow matching, conditioned on the semantic output. By addressing each dimension of speech in turn, SpeechGPT-Gen achieves more natural and accurate generation. In addition, SpeechGPT-Gen infuses semantic information into the prior distribution of the flow-matching model, improving both representational fidelity and sampling efficiency, as illustrated below.
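
The semantic-infused prior can be illustrated with a few lines of conditional flow matching. This minimal PyTorch sketch uses a linear interpolation path and a small MLP vector field; the dimensions, the network, and the exact way semantic information enters the prior are illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn as nn

dim = 256  # illustrative latent dimensionality
# Vector field v(x_t, semantic, t); a small MLP stands in for the real network.
vector_field = nn.Sequential(
    nn.Linear(dim * 2 + 1, 512), nn.SiLU(), nn.Linear(512, dim)
)

def flow_matching_loss(x1: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
    """x1: target perceptual latents (B, dim); semantic: semantic embeddings (B, dim)."""
    # Standard flow matching draws the source x0 from N(0, I); here the
    # prior is shifted toward the semantic embedding, so the flow only has
    # to transport the residual perceptual information.
    x0 = semantic + torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1)            # random time step in [0, 1]
    xt = (1 - t) * x0 + t * x1               # linear interpolation path
    target_v = x1 - x0                       # constant velocity along that path
    pred_v = vector_field(torch.cat([xt, semantic, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

# One illustrative training step on random tensors:
loss = flow_matching_loss(torch.randn(8, dim), torch.randn(8, dim))
loss.backward()
```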

Experimental Validation

SpeechGPT-Gen was evaluated on three tasks: zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue. On each, the model produced high-quality audio that was semantically faithful and perceptually consistent with the prompt speaker. The quantitative measures employed, Word Error Rate, Speaker Similarity, Quality Mean Opinion Score, and Speech Mean Opinion Score, all showed SpeechGPT-Gen to be a significant step forward in speech synthesis quality.
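
For concreteness, here is a hedged sketch of how two of these objective metrics are commonly computed; the specific ASR recognizer and speaker encoder used in the paper's evaluation are not reproduced here.

```python
import torch
import torch.nn.functional as F
from jiwer import wer  # pip install jiwer

# Word Error Rate: transcribe the generated speech with an off-the-shelf
# ASR model (e.g., Whisper) and compare against the reference text.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumps over a lazy dog"
print(f"WER: {wer(reference, hypothesis):.3f}")  # 1 substitution / 9 words

# Speaker Similarity: cosine similarity between speaker embeddings of the
# prompt and the generated speech. The embeddings would come from a
# pretrained speaker-verification encoder; random vectors keep this runnable.
emb_prompt, emb_generated = torch.randn(192), torch.randn(192)
similarity = F.cosine_similarity(emb_prompt, emb_generated, dim=0).item()
print(f"Speaker similarity: {similarity:.3f}")
```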

Conclusion and Implications

The findings indicate that CoIG substantially improves speech generation, outperforming established baselines in zero-shot settings and producing coherent speech-to-speech dialogue. Beyond advancing voice synthesis, SpeechGPT-Gen points toward SLLM development centered on efficient, scalable modeling, enabling more natural spoken interaction between humans and AI. As an early demonstration of the value of modeling semantic and perceptual speech information separately, it establishes a foundation for future research to build upon.
