- The paper demonstrates that integrating LLMs with finite scalar quantization achieves scalable, low-latency TTS synthesis with human-parity quality.
- The model simplifies architecture by removing the text encoder and speaker embeddings, enabling direct use of pre-trained LLMs for improved token alignment.
- Evaluation shows that CosyVoice 2 seamlessly unifies streaming and non-streaming synthesis, setting new benchmarks in content consistency and expressiveness.
An Overview of CosyVoice 2: Scalable Streaming Speech Synthesis with LLMs
The paper "CosyVoice 2: Scalable Streaming Speech Synthesis with LLMs" presents an evolved version of a zero-shot text-to-speech (TTS) synthesis model, building upon the foundational work of CosyVoice. With increasing interest and advancements in multi-modal LLMs, this paper explores enhancements that address real-time interaction demands through effective streaming synthesis.
Innovations in Architecture and Techniques
CosyVoice 2 introduces several architectural and methodological changes to improve the efficacy of TTS models. A primary enhancement is the use of finite scalar quantization (FSQ) in the speech tokenizer, improving codebook utilization: unlike conventional vector quantization, where many codebook entries can go unused, FSQ exploits the full codebook capacity by construction, which the authors credit with better retention of the semantic information crucial for natural speech synthesis.
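To make the FSQ idea concrete, here is a minimal PyTorch sketch of the quantizer. The level configuration, class, and method names are illustrative assumptions rather than the paper's implementation; the key point is that rounding bounded scalars yields an implicit codebook in which every entry is reachable.

```python
import torch
import torch.nn as nn

class FSQ(nn.Module):
    """Minimal finite scalar quantization sketch (after Mentzer et al., 2023).

    Each latent dimension is bounded and rounded to a small odd number of
    levels; the implicit codebook is the Cartesian product of per-dimension
    levels, so every code is reachable by construction. The level
    configuration below is illustrative, not the paper's exact setting.
    """
    def __init__(self, levels=(9, 9, 9, 9)):  # 9^4 = 6561 implicit codes
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))

    def forward(self, z):
        # z: (batch, time, dim) continuous latents, with dim == len(levels)
        half = (self.levels - 1) / 2              # integer for odd level counts
        bounded = torch.tanh(z) * half            # bound each dim to (-half, half)
        quantized = torch.round(bounded)          # snap to the nearest level
        # straight-through estimator: gradients flow as if rounding were identity
        return bounded + (quantized - bounded).detach()

    def to_indices(self, quantized):
        # Collapse per-dimension integers into one token id per frame.
        half = (self.levels - 1) / 2
        digits = (quantized + half).long()        # shift into the range [0, L-1]
        bases = torch.cumprod(
            torch.cat([torch.ones(1, device=self.levels.device),
                       self.levels[:-1]]), dim=0).long()
        return (digits * bases).sum(dim=-1)       # (batch, time) speech token ids
```

Because the "codebook" is just this product of rounding grids, there are no codebook embeddings to learn and no dead codes to revive, which is the utilization advantage the paper highlights.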
The text-to-speech language model is also significantly restructured. The authors simplify the architecture by removing the text encoder and speaker embeddings, allowing a pre-trained textual LLM to serve directly as the backbone. This change strengthens the alignment between speech tokens and text and leverages the existing LLM's context-understanding capabilities.
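A rough sketch of the resulting input layout, assuming illustrative special tokens and an extended vocabulary (the token names and offsets below are assumptions for exposition, not the paper's exact format):

```python
import torch

# Illustrative special-token ids; the real model defines its own.
SOS, TURN, EOS = 0, 1, 2   # start, text-to-speech switch, end of speech

def build_lm_input(text_ids: torch.Tensor, speech_ids: torch.Tensor,
                   speech_offset: int) -> torch.Tensor:
    """Lay text and speech tokens out as one causal sequence.

    Speech token ids are shifted by `speech_offset` so they occupy a range
    appended to the pre-trained LLM's text vocabulary; after TURN the
    decoder-only backbone predicts speech tokens autoregressively, with no
    separate text encoder or speaker embedding in the loop.
    """
    return torch.cat([
        torch.tensor([SOS]),
        text_ids,                    # tokens from the LLM's own text tokenizer
        torch.tensor([TURN]),
        speech_ids + speech_offset,  # quantized speech tokens in the extended vocab
        torch.tensor([EOS]),
    ])
```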
CosyVoice 2 unifies streaming and non-streaming synthesis through a hybrid text-speech LLM and a chunk-aware causal flow matching model. This enables seamless switching between the two modes with virtually lossless quality, accommodating the variable latency requirements of real-time applications.
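The streaming behavior can be pictured as a fixed interleaving of text and speech tokens in the LM sequence. The sketch below uses an illustrative N:M ratio and omits the chunk-aware flow matching stage entirely; the paper's exact ratio and masking details may differ.

```python
from typing import List

def interleave(text: List[int], speech: List[int],
               n_text: int = 5, n_speech: int = 15) -> List[int]:
    """Mix text and speech tokens in a fixed N:M pattern (values illustrative).

    In streaming mode the LM consumes N text tokens, then emits M speech
    tokens, repeating until the text runs out; any remaining speech tokens
    follow at the end. Non-streaming is the degenerate case in which all
    text precedes all speech, so a single model serves both modes.
    """
    out: List[int] = []
    t = s = 0
    while t < len(text):
        out.extend(text[t:t + n_text])
        out.extend(speech[s:s + n_speech])
        t += n_text
        s += n_speech
    out.extend(speech[s:])          # trailing speech after the last text chunk
    return out
```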
Evaluation and Performance
The authors extensively evaluate CosyVoice 2 across several benchmarks. The model achieves strong content consistency, measured by word error rate (WER), and speaker similarity (SS) compared to both its predecessor and contemporary TTS models such as ChatTTS and GPT-SoVITS. Notably, it exhibits human-parity synthesis quality, with several metrics even surpassing those of natural human speech in controlled settings.
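For context on the metrics: content consistency is typically scored by transcribing the synthesized audio with an ASR system and computing the word error rate against the input text, while speaker similarity is the cosine similarity between speaker embeddings of the prompt and the synthesized speech. A minimal WER routine, in the standard edit-distance formulation (not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                 # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                 # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. wer("hello world again", "hello word again") == 1/3
```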
Moreover, the paper evaluates the model's capacity for instructed generation. CosyVoice 2 adapts to a range of natural language instructions, emotional expressions, and speaking styles, setting a new standard for expressive TTS synthesis without sacrificing coherence or intelligibility.
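As a purely illustrative sketch of what instructed input might look like (the separator token and inline tags below are assumptions, not the paper's exact format), an instruction can be prepended to the synthesis text, with fine-grained cues embedded inline:

```python
# Hypothetical prompt layout for instructed generation; the separator token
# and inline tags are illustrative assumptions, not the paper's exact format.
instruction = "Speak with a cheerful tone, at a slightly faster pace."
text = "Welcome back! [laughter] It's so good to see you again."
lm_prompt = f"{instruction}<|endofprompt|>{text}"
```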
Implications and Future Directions
CosyVoice 2 demonstrates the feasibility of leveraging state-of-the-art LLMs to produce high-fidelity, contextually consistent speech. The paper underscores quantization strategies and LLM integration as effective techniques for enhancing TTS systems. The unified approach to streaming and non-streaming synthesis within a single architecture opens the door to more responsive, adaptive voice interaction systems in real-time environments such as voice-driven interfaces and virtual assistants.
Looking forward, the scalability and extensibility of CosyVoice 2's framework suggest potential future applications, including multilingual synthesis and more nuanced paralinguistic controls such as rhythm and intonation patterns. Current limitations, such as language coverage and control over acoustic characteristics, remain open research questions whose resolution could significantly advance the capabilities of TTS models.
In conclusion, CosyVoice 2 emerges as a notable advancement in speech synthesis, adeptly marrying the inherent capabilities of LLMs with intricate speech generation processes to meet contemporary needs for real-time, natural, and expressive speech.