
Controllable neural text-to-speech synthesis using intuitive prosodic features

Published 14 Sep 2020 in eess.AS, cs.CL, cs.LG, and cs.SD | (2009.06775v1)

Abstract: Modern neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the prosody of generated utterances often represents the average prosodic style of the database instead of having wide prosodic variation. Moreover, the generated prosody is solely defined by the input text, which does not allow for different styles for the same sentence. In this work, we train a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles, while maintaining similar mean opinion score (4.23) to our Tacotron baseline (4.26).

Citations (64)

Summary

  • The paper presents a controllable TTS model that integrates a prosody encoder to isolate and manipulate key speech features.
  • It leverages acoustic features like pitch, duration, and energy to intuitively adjust and diversify speaking styles.
  • Objective and subjective evaluations confirm enhanced prosodic expressivity while maintaining the natural quality of synthesized speech.

Controllable Neural Text-to-Speech Synthesis Using Intuitive Prosodic Features

Introduction

The paper "Controllable neural text-to-speech synthesis using intuitive prosodic features" introduces a method for prosody modeling in neural text-to-speech (TTS) systems that generates diverse speaking styles while maintaining naturalness. Modern TTS systems can produce speech closely resembling natural speech; however, producing varied prosodic styles remains a challenge because current seq2seq models such as Tacotron tend to average the prosodic styles present in the training data. This work addresses that limitation with a model that leverages acoustic speech features to predict and control prosodic dimensions, allowing intuitive and meaningful variation in prosody.

Model Architecture

The proposed architecture is based on an encoder-decoder model with attention, similar to Tacotron 2, but augmented with a prosody encoder that predicts prosodic features such as pitch, pitch range, duration, energy, and spectral tilt. These features are disentangled and independently controllable, providing intuitive manipulation of the prosodic style.

Figure 1: An overview of the proposed prosody modeling encoder-decoder with attention architecture. The model is divided into training and inference phases.

During training, both prosody and text encoders are used to predict ground-truth prosodic features, which are combined with decoder inputs (teacher-forcing). For inference, the prosody encoder generates feature predictions which subsequently condition the decoder output. This configuration accommodates prosody control, allowing modifications through an additional bias mechanism.
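The sentence-level conditioning and the inference-time bias knob described above can be sketched as follows. This is a minimal numpy illustration, not the authors' code: the shapes, feature values, and function name are hypothetical, and the real model concatenates the prosody vector with learned encoder states before attention.

```python
import numpy as np

def condition_decoder_input(text_encoding, prosody, bias=None):
    """Broadcast a sentence-level prosody vector across encoder timesteps
    and concatenate it with the text encoding. `bias` stands in for the
    user-supplied offset applied at inference to shift a prosodic dimension."""
    p = np.asarray(prosody, dtype=float)
    if bias is not None:
        p = p + np.asarray(bias, dtype=float)   # prosody control knob
    timesteps = text_encoding.shape[0]
    tiled = np.tile(p, (timesteps, 1))          # (T, n_features)
    return np.concatenate([text_encoding, tiled], axis=1)

# Toy shapes: 4 encoder timesteps, 8-dim text encoding, 5 prosodic features
# (pitch, pitch range, duration, energy, spectral tilt).
enc = np.zeros((4, 8))
pred = np.array([0.1, -0.2, 0.0, 0.3, 0.05])   # predicted features at inference
out = condition_decoder_input(enc, pred, bias=[0.5, 0, 0, 0, 0])  # raise pitch
print(out.shape)  # (4, 13)
```

During training, `pred` would be replaced by the ground-truth features extracted from the recording (teacher forcing), with `bias` left unset.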

Prosodic Features and Their Impact

The prosodic space is defined by features that provide robust and disentangled prosody modeling, which simplifies the mapping between perceived and generated prosody. Acoustic features are extracted using established methods: automatic speech recognition for phone durations, spectral tilt for voice quality, and the combination of multiple pitch trackers for precise pitch estimation. Feature normalization confines values to a consistent range, facilitating the mapping between intended and realized prosody.
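The normalization step might look like the z-score-and-clip sketch below. This is an assumption for illustration: the paper does not specify this exact scheme, and the corpus statistics shown are invented placeholder values.

```python
import numpy as np

def normalize_features(values, mean, std, clip=3.0):
    """Z-score normalize sentence-level prosodic features against corpus
    statistics, then clip outliers so every dimension shares a comparable
    range for conditioning."""
    z = (np.asarray(values, dtype=float) - mean) / std
    return np.clip(z, -clip, clip)

# Hypothetical corpus statistics for (pitch [Hz], pitch range [Hz],
# phone duration [s], energy [dB], spectral tilt [dB]).
corpus_mean = np.array([120.0, 30.0, 0.09, -20.0, -12.0])
corpus_std = np.array([25.0, 10.0, 0.02, 4.0, 3.0])

sentence = np.array([150.0, 45.0, 0.11, -18.0, -10.0])
print(normalize_features(sentence, corpus_mean, corpus_std))
```

With all dimensions on a shared scale, a bias of, say, +1.0 means "one standard deviation above the corpus average" regardless of which feature it is applied to.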

The prosodic model predicts features directly from text, suggesting that provided with suitable data, the system can synthesize text with appropriately contextual prosody.

Experimental Evaluation

The experiments compare three systems: a baseline Tacotron and two versions of the prosody control model trained on different datasets, illustrating how well each captures and controls prosodic aspects. Objective assessment showed good correspondence between target and realized prosody across all dimensions.

Figure 2: Means and standard deviations of the measured prosodic features with respect to target bias values.

Subjective tests using MOS confirmed similar quality across systems, though the baseline was preferred in AB tests, likely owing to its simpler prosody prediction mechanism. Despite some degradation at the extremes of the prosodic feature ranges, particularly for the system trained with broader voice coverage, prosody-modified syntheses were preferred in realistic use cases, demonstrating a practical gain in prosodic expressivity.

Figure 3: Means and 95% confidence intervals of MOS measured over all features and target bias values.

Discussion

The proposed model excels at generating varied speaking styles, enhancing prosodic control over synthetic speech. The architecture provides broad coverage of the prosodic dimensions while keeping them disentangled from nuisance acoustic factors. The main remaining challenge is finer-grained control: conditioning is currently applied only at the sentence level, which suggests future work on cascaded generation approaches or finer-grained, decoder-level interventions.

Conclusions

This research proposes a neural TTS system equipped for intuitive prosody control, shown to generate speech that balances quality with versatile prosodic variation. The model marks a step toward controllable prosody synthesis for applications where expressivity and nuance are paramount, and opens avenues for further work on expressive text-to-speech modeling and improved user experience.
