High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model (2406.17310v1)

Published 25 Jun 2024 in eess.AS

Abstract: We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic contents and alignment, and the Speaking module, which captures the timbre of the target voice to generate acoustic tokens from semantic tokens, enriching speech reconstruction. The Interpreting stage employs a transducer for its robustness in aligning text to speech. In contrast, the Speaking stage utilizes a Conformer-based architecture integrated with a Grouped Masked Language Model (G-MLM) to boost computational efficiency. Our experiments verify that this innovative structure surpasses conventional models in the zero-shot scenario in terms of speech quality and speaker similarity.
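
To make the two-stage pipeline concrete, below is a minimal structural sketch in PyTorch of the Interpreting stage (text + speech prompt → semantic tokens) feeding the Speaking stage (semantic tokens → acoustic tokens). All class names, vocabulary sizes, dimensions, and the plain Transformer encoders used here are illustrative assumptions, not the authors' implementation: the paper uses a token transducer for alignment and a Conformer with a grouped masked-LM objective, both of which are replaced by simple stand-ins.

```python
# Structural sketch only; hyperparameters and module internals are assumptions.
import torch
import torch.nn as nn


class InterpretingModule(nn.Module):
    """Text + speech prompt -> semantic tokens.
    The paper aligns text to speech with a token transducer; here a plain
    Transformer encoder with greedy token selection stands in for that stage."""

    def __init__(self, text_vocab=256, sem_vocab=512, d_model=256):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.prompt_proj = nn.Linear(d_model, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_semantic = nn.Linear(d_model, sem_vocab)

    def forward(self, text_ids, prompt_feats):
        # Concatenate text embeddings with projected speech-prompt features.
        x = torch.cat([self.text_emb(text_ids), self.prompt_proj(prompt_feats)], dim=1)
        h = self.encoder(x)
        # Greedy selection over text positions stands in for transducer decoding.
        logits = self.to_semantic(h[:, : text_ids.size(1)])
        return logits.argmax(dim=-1)


class SpeakingModule(nn.Module):
    """Semantic tokens -> acoustic tokens.
    The paper uses a Conformer trained with a grouped masked-LM (G-MLM)
    objective; a Transformer encoder predicting all positions in parallel
    stands in for that iterative masked decoding."""

    def __init__(self, sem_vocab=512, ac_vocab=1024, d_model=256):
        super().__init__()
        self.sem_emb = nn.Embedding(sem_vocab, d_model)
        self.mask_emb = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_acoustic = nn.Linear(d_model, ac_vocab)

    def forward(self, semantic_tokens):
        h = self.sem_emb(semantic_tokens) + self.mask_emb
        return self.to_acoustic(self.encoder(h)).argmax(dim=-1)


if __name__ == "__main__":
    text_ids = torch.randint(0, 256, (1, 20))   # dummy character ids
    prompt_feats = torch.randn(1, 10, 256)      # dummy speech-prompt features
    semantic = InterpretingModule()(text_ids, prompt_feats)
    acoustic = SpeakingModule()(semantic)       # would then feed a codec decoder
    print(semantic.shape, acoustic.shape)
```

In the actual system the acoustic tokens would be decoded back to a waveform by a neural codec, and the grouped masked-LM decoding would fill masked acoustic positions over a few parallel refinement steps rather than in one pass as sketched here.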

Authors (6)
  1. Joun Yeop Lee (10 papers)
  2. Myeonghun Jeong (12 papers)
  3. Minchan Kim (18 papers)
  4. Ji-Hyun Lee (9 papers)
  5. Hoon-Young Cho (16 papers)
  6. Nam Soo Kim (47 papers)
Citations (2)