Conversational End-to-End TTS for Voice Agent (2005.10438v2)

Published 21 May 2020 in cs.SD and eess.AS

Abstract: End-to-end neural TTS has achieved superior performance on reading-style speech synthesis. However, building a high-quality conversational TTS system remains challenging due to limitations in both corpora and modeling capability. This study aims at building a conversational TTS for a voice agent under a sequence-to-sequence modeling framework. We first construct a spontaneous conversational speech corpus well designed for the voice agent, with a new recording scheme ensuring both recording quality and conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach with an auxiliary encoder and a conversational context encoder that reinforce information about the current utterance and its context in the conversation. Experimental results show that the proposed methods produce more natural prosody in accordance with the conversational context, with significant preference gains at both the utterance level and the conversation level. Moreover, we find that the model can express some spontaneous behaviors, such as fillers and repeated words, which makes the conversational speaking style more realistic.

The paper addresses the challenge of building a high-quality conversational Text-to-Speech (TTS) system for voice agents. It introduces a spontaneous conversational speech corpus and a conversation context-aware end-to-end TTS approach. The approach employs an auxiliary encoder and a conversational context encoder to capture utterance and context information within a conversation. The paper finds that the model can express spontaneous behaviors, enhancing the realistic nature of the conversational speaking style.

The paper identifies two key problems in building a conversational TTS system: developing a conversational speech corpus and creating a high-performance TTS model that captures prosody in conversations. To address the first problem, the paper introduces a new recording scheme for building spontaneous conversational corpora:

  • Conversational scenarios and transcripts are designed to ensure content variety and conversational context.
  • Speakers perform according to the scripts, modifying content and adding spontaneous behaviors.
  • The speakers' actual speech, including any modifications, is transcribed so that the text matches the audio and the pronunciations are correct.

The corpus includes the following spontaneous behaviors:

  • Fillers such as "um", "oh", "aha", "uh"
  • Repeated words or phrases
  • False starts
  • Reduced speech rate or pauses

The paper proposes a conversation context-aware end-to-end TTS approach, which uses an auxiliary encoder and a conversational context encoder.

The end-to-end TTS system is based on Tacotron2. The encoder consists of an embedding layer, three 1-D convolution layers each followed by batch normalization and a ReLU activation, and a BLSTM layer. Dropout is applied in all convolution and LSTM layers. The decoder is an auto-regressive module with a pre-net and two Zoneout-LSTM layers; the output of the second LSTM layer feeds the attention module. The PostNet is a post-filter with five 1-D convolution layers. Stepwise monotonic attention is used, and Parallel WaveNet is adopted as the neural vocoder.
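
To make the architecture concrete, here is a minimal PyTorch sketch of the encoder described above. The layer sizes (512-dim embeddings, 512 convolution channels, kernel size 5, 0.5 dropout) follow common Tacotron2 configurations and are assumptions, not values stated in the paper.

```python
import torch
import torch.nn as nn

class TacotronEncoder(nn.Module):
    """Embedding -> 3x (Conv1d + BatchNorm + ReLU + Dropout) -> BLSTM."""

    def __init__(self, n_symbols, emb_dim=512, channels=512, lstm_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        layers = []
        for _ in range(3):
            layers += [
                nn.Conv1d(channels, channels, kernel_size=5, padding=2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.Dropout(0.5),
            ]
        self.convs = nn.Sequential(*layers)
        self.blstm = nn.LSTM(channels, lstm_dim, batch_first=True,
                             bidirectional=True)

    def forward(self, phoneme_ids):                # (batch, time)
        x = self.embedding(phoneme_ids)            # (batch, time, emb_dim)
        x = self.convs(x.transpose(1, 2))          # Conv1d expects (batch, ch, time)
        outputs, _ = self.blstm(x.transpose(1, 2))
        return outputs                             # (batch, time, 2 * lstm_dim)
```

The decoder, PostNet, stepwise monotonic attention, and Parallel WaveNet vocoder follow the standard designs cited in the paper and are omitted here.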

The auxiliary encoder operates on BERT (Bidirectional Encoder Representations from Transformers) embeddings together with statistical features that represent the syntactic structure:

  • $F_1$: the number of characters in the current sentence
  • $F_2$: the relative position of the current character in the current sentence
  • $F_3$: the number of characters in the current utterance
  • $F_4$: the relative position of the current character in the current utterance
  • $F_5$: the number of sentences in the current utterance
  • $F_6$: the relative position of the current sentence in the current utterance

The auxiliary encoder uses a pre-net and a CBHG module. The features are up-sampled from character-level to phoneme-level and combined with the encoder outputs using addition.
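
A short sketch of how the six statistical features could be computed per character and then up-sampled to the phoneme level may help; the position indexing and the lack of normalization here are assumptions, as the paper does not specify these details.

```python
def utterance_features(sentences):
    """Compute F1-F6 for every character. `sentences` is a list of
    sentences, each given as a list of characters."""
    n_sent = len(sentences)
    n_utt_chars = sum(len(s) for s in sentences)
    feats, utt_pos = [], 0
    for si, sent in enumerate(sentences, start=1):
        for ci, _ in enumerate(sent, start=1):
            utt_pos += 1
            feats.append([
                len(sent),              # F1: characters in the current sentence
                ci / len(sent),         # F2: relative position in the sentence
                n_utt_chars,            # F3: characters in the current utterance
                utt_pos / n_utt_chars,  # F4: relative position in the utterance
                n_sent,                 # F5: sentences in the current utterance
                si / n_sent,            # F6: relative position of the sentence
            ])
    return feats  # one 6-dim feature vector per character

def upsample_to_phonemes(char_feats, phones_per_char):
    """Repeat each character's features once per phoneme it maps to, so the
    sequence aligns with the encoder outputs before the additive fusion."""
    out = []
    for feat, n in zip(char_feats, phones_per_char):
        out.extend([feat] * n)
    return out
```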

The conversational context encoder extracts prosody-related features from sentence embeddings. BERT is used to extract a sentence representation for each utterance, and each embedding is concatenated with a one-hot speaker-ID vector. The conversational context encoder passes the sequence of sentence embeddings $E_{t-c:t}$ through a linear layer. A GRU (Gated Recurrent Unit) layer encodes the history $E_{t-c:t-1}$ into a state vector $S_t$. $S_t$ and $E_t$ are then concatenated and fed to a linear output layer.
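
The following is a minimal PyTorch sketch of the conversational context encoder as described; the dimensions (768-dim BERT sentence embeddings, two-speaker one-hot IDs, 128-dim GRU and output) are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ConversationContextEncoder(nn.Module):
    def __init__(self, emb_dim=768, n_speakers=2, hidden_dim=128, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(emb_dim + n_speakers, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, embeddings, speaker_ids):
        # embeddings: (batch, c + 1, emb_dim), sentence embeddings E_{t-c:t}
        # speaker_ids: (batch, c + 1, n_speakers), one-hot speaker vectors
        x = self.proj(torch.cat([embeddings, speaker_ids], dim=-1))
        # Encode the history E_{t-c:t-1}; the final GRU state is S_t.
        _, s_t = self.gru(x[:, :-1, :])
        # Concatenate S_t with the current sentence E_t and project.
        return self.out(torch.cat([s_t[-1], x[:, -1, :]], dim=-1))
```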

The training corpus consists of 45 conversations between two native Chinese speakers (6 hours total, 3 hours per speaker). The agent speech data, containing about 2,000 utterances (3 hours), is used to train the TTS model. The encoder and decoder are pre-trained with a standard TTS corpus containing 6 hours of Chinese reading-style speech.

Three models are used in the subjective evaluation:

  • $M_1$: baseline model
  • $M_2$: $M_1$ plus the auxiliary encoder
  • $M_3$: $M_2$ plus the conversational context encoder

For all TTS models, the phoneme sequence contains phonemes, punctuation marks, and inter-word and inter-syllable symbols. The output is a mel spectrogram extracted from audio sampled at 16 kHz. The Adam optimizer is used with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and the learning rate decays exponentially from $10^{-3}$ to $10^{-5}$ after 50,000 iterations.
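
One plausible reading of this schedule, as a PyTorch sketch: Adam with the stated betas, plus a multiplicative schedule that decays the learning rate exponentially from $10^{-3}$ to a $10^{-5}$ floor reached at 50,000 iterations. The exact shape of the decay curve is an assumption; the paper only gives the endpoints.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the actual TTS model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

def lr_lambda(step, decay_steps=50_000, peak=1e-3, floor=1e-5):
    # Exponential interpolation from peak to floor over decay_steps,
    # then held at the floor. Returns a multiplier on the base lr (peak).
    t = min(step / decay_steps, 1.0)
    return (floor / peak) ** t

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# After each optimizer.step(), call scheduler.step() to advance the decay.
```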

In comparative mean opinion score (CMOS) listening tests with 20 native Chinese speakers, the auxiliary encoder improves over the baseline by a CMOS score of 0.22 with a 42.9% preference rate at the utterance level, and by a CMOS score of 0.62 with a 59.0% preference rate at the conversation level. The conversational context encoder further improves prosody expression, with a CMOS score of 0.18 and a 42.1% preference rate at the utterance level, and a CMOS score of 0.39 and a 57.0% preference rate at the conversation level. The models can also express spontaneous behaviors such as fillers and repeated words.

Authors (5)
  1. Haohan Guo
  2. Shaofei Zhang
  3. Frank K. Soong
  4. Lei He
  5. Lei Xie