Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts (2309.11977v3)

Published 21 Sep 2023 in cs.SD and eess.AS

Abstract: Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveforms into discrete acoustic tokens and modeling these tokens with a language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt from an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone a speaker's personal speaking style. In this paper, we propose a novel zero-shot TTS model with multi-scale acoustic prompts, based on the neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme level from a style prompt consisting of multiple sentences. Following that, a VALL-E-based acoustic decoder is used to model the timbre from the timbre prompt at the frame level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling to a longer style prompt.

Authors (11)
  1. Shun Lei (21 papers)
  2. Yixuan Zhou (30 papers)
  3. Liyang Chen (33 papers)
  4. Dan Luo (25 papers)
  5. Zhiyong Wu (171 papers)
  6. Xixin Wu (85 papers)
  7. Shiyin Kang (27 papers)
  8. Tao Jiang (274 papers)
  9. Yahui Zhou (18 papers)
  10. Yuxing Han (40 papers)
  11. Helen Meng (204 papers)
Citations (1)