SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation (2401.13527v2)

Published 24 Jan 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Benefiting from effective speech modeling, current Speech LLMs (SLLMs) have demonstrated exceptional capabilities in in-context speech generation and efficient generalization to unseen speakers. However, the prevailing information modeling process is encumbered by certain redundancies, leading to inefficiencies in speech generation. We propose Chain-of-Information Generation (CoIG), a method for decoupling semantic and perceptual information in large-scale speech generation. Building on this, we develop SpeechGPT-Gen, an 8-billion-parameter SLLM efficient in semantic and perceptual information modeling. It comprises an autoregressive model based on LLM for semantic information modeling and a non-autoregressive model employing flow matching for perceptual information modeling. Additionally, we introduce the novel approach of infusing semantic information into the prior distribution to enhance the efficiency of flow matching. Extensive experimental results demonstrate that SpeechGPT-Gen markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue, underscoring CoIG's remarkable proficiency in capturing and modeling speech's semantic and perceptual dimensions. Code and models are available at https://github.com/0nutation/SpeechGPT.

Summary

  • The paper introduces a novel two-stage framework that disentangles semantic and perceptual information for improved speech synthesis.
  • It leverages an 8-billion-parameter architecture, pairing an autoregressive LLM over SpeechTokenizer semantic tokens with a non-autoregressive flow-matching model for perceptual detail.
  • Experimental results demonstrate superior performance in zero-shot text-to-speech, voice conversion, and dialogue tasks, surpassing established benchmarks.

Introduction

Recent work on Speech LLMs (SLLMs) has predominantly modeled semantic and perceptual information in speech jointly. This integrated approach introduces redundancy and blurs the distinction between semantic content and perceptual characteristics such as voice timbre. This paper introduces SpeechGPT-Gen, an 8-billion-parameter SLLM built on Chain-of-Information Generation (CoIG), a framework that decouples semantic processing from perceptual modeling and handles the two sequentially, as sketched below.
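
Conceptually, CoIG chains two generators, with the output of the semantic stage conditioning the perceptual stage. The sketch below illustrates this flow; every component name (semantic_lm, perceptual_flow, vocoder) is a hypothetical stand-in, not the released SpeechGPT-Gen interface.

```python
# A conceptual sketch of the Chain-of-Information pipeline, with
# hypothetical duck-typed components rather than the real API.
def chain_of_information_generate(text_tokens, prompt_speech,
                                  semantic_lm, perceptual_flow, vocoder):
    """Generate speech in two decoupled stages: semantics, then perception."""
    # Stage 1: an autoregressive LLM predicts discrete semantic tokens
    # (e.g., SpeechTokenizer first-layer RVQ indices) from the input text.
    semantic_tokens = semantic_lm.generate(text_tokens)

    # Stage 2: a non-autoregressive flow-matching model adds perceptual
    # information (timbre, prosody), conditioned on the semantic tokens
    # and a short speech prompt from the target speaker.
    perceptual_latents = perceptual_flow.sample(semantic_tokens, prompt_speech)

    # Decode the completed representation back to a waveform.
    return vocoder(perceptual_latents)
```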

Semantic-Perceptual Disentanglement

CoIG casts speech synthesis as a two-stage process. Semantic content is represented as discrete tokens extracted by SpeechTokenizer and modeled autoregressively by an LLM; perceptual characteristics are then modeled non-autoregressively via flow matching, conditioned on the semantic output. By addressing each dimension of speech in turn, SpeechGPT-Gen achieves more natural and accurate generation. In addition, SpeechGPT-Gen infuses semantic information into the prior distribution of the flow-matching model, improving both representational fidelity and sampling efficiency, as illustrated below.
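
The semantic-infused prior can be illustrated with a few lines of conditional flow matching. This minimal PyTorch sketch uses a linear interpolation path and a small MLP vector field; the dimensions, the network, and the exact way semantic information enters the prior are illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn as nn

dim = 256  # illustrative latent dimensionality
# Vector field v(x_t, semantic, t); a small MLP stands in for the real network.
vector_field = nn.Sequential(
    nn.Linear(dim * 2 + 1, 512), nn.SiLU(), nn.Linear(512, dim)
)

def flow_matching_loss(x1: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
    """x1: target perceptual latents (B, dim); semantic: semantic embeddings (B, dim)."""
    # Standard flow matching draws the source x0 from N(0, I); here the
    # prior is shifted toward the semantic embedding, so the flow only has
    # to transport the residual perceptual information.
    x0 = semantic + torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1)            # random time step in [0, 1]
    xt = (1 - t) * x0 + t * x1               # linear interpolation path
    target_v = x1 - x0                       # constant velocity along that path
    pred_v = vector_field(torch.cat([xt, semantic, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

# One illustrative training step on random tensors:
loss = flow_matching_loss(torch.randn(8, dim), torch.randn(8, dim))
loss.backward()
```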

Experimental Validation

SpeechGPT-Gen was evaluated on three tasks: zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue. On each, the model produced high-quality audio that was semantically faithful and perceptually consistent with the prompt speaker. The quantitative measures employed, Word Error Rate, Speaker Similarity, Quality Mean Opinion Score, and Speech Mean Opinion Score, all showed SpeechGPT-Gen to be a significant step forward in speech synthesis quality.
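
For concreteness, here is a hedged sketch of how two of these objective metrics are commonly computed; the specific ASR recognizer and speaker encoder used in the paper's evaluation are not reproduced here.

```python
import torch
import torch.nn.functional as F
from jiwer import wer  # pip install jiwer

# Word Error Rate: transcribe the generated speech with an off-the-shelf
# ASR model (e.g., Whisper) and compare against the reference text.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumps over a lazy dog"
print(f"WER: {wer(reference, hypothesis):.3f}")  # 1 substitution / 9 words

# Speaker Similarity: cosine similarity between speaker embeddings of the
# prompt and the generated speech. The embeddings would come from a
# pretrained speaker-verification encoder; random vectors keep this runnable.
emb_prompt, emb_generated = torch.randn(192), torch.randn(192)
similarity = F.cosine_similarity(emb_prompt, emb_generated, dim=0).item()
print(f"Speaker similarity: {similarity:.3f}")
```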

Conclusion and Implications

The findings indicate that CoIG substantially improves speech generation, outperforming established baselines in zero-shot settings and producing coherent speech-to-speech dialogue. Beyond advancing voice synthesis, SpeechGPT-Gen points toward SLLM development centered on efficient, scalable modeling, enabling more natural spoken interaction between humans and AI. As an early demonstration of the value of modeling semantic and perceptual speech information separately, it establishes a foundation for future research to build upon.
