- The paper finds that using low-quality data with explicit channel and dialect modeling significantly improves the naturalness of synthesized speech.
- The study shows that artificial speaker augmentation via vocal tract length perturbation (VTLP) offers limited benefit for dialect identification compared to the low-quality-data approach.
- The research highlights that achieving high speaker similarity for unseen voices remains challenging, prompting further exploration of adaptation techniques.
An Insightful Analysis of Speaker Augmentation in Multi-Speaker End-to-End TTS
The paper "Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?" by Erica Cooper et al. investigates approaches for enhancing multi-speaker text-to-speech (TTS) synthesis, focusing on the benefits of speaker augmentation. The authors explore two augmentation strategies: creating artificial speakers by perturbing existing audio, and incorporating low-quality audio data into training.
Motivation and Background
Traditional TTS systems face difficulties in two key areas: modeling a large number of speakers within a single model and adapting efficiently to new speakers from minimal data. Existing speaker adaptation methods largely rely on fine-tuning or on external speaker embeddings extracted from automatic speaker verification (ASV) models. Each approach, however, has limitations in speaker similarity, particularly for unseen speakers. This paper addresses that gap by examining speaker augmentation as a viable alternative or complement to conventional adaptation techniques.
Methodology
Cooper et al. modify a Tacotron-based model to incorporate channel and dialect information. Two speaker augmentation methods are proposed:
- Artificial Speaker Augmentation: Perturbing the original high-quality audio, for example by resampling, to create "artificial" speakers that expand the diversity of the training dataset.
- Speaker Augmentation Using Low-Quality Data: Here, non-TTS datasets, which often contain a broader variety of speakers and dialects but are of lower quality, are integrated into the training process. This necessitates modifications to the Tacotron architecture, including a channel-aware postnet and a dialect encoder to handle variations introduced by different channel and dialect characteristics.
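As a rough illustration of the first strategy, resampling a waveform and playing the result back at the original rate shifts its apparent vocal-tract characteristics, producing a new "speaker." The sketch below is a minimal, dependency-free approximation (linear interpolation; a real pipeline would band-limit, and VTLP proper warps the frequency axis more carefully); the function name and warp factors are illustrative, not from the paper.

```python
import numpy as np

def perturb_speaker(wav: np.ndarray, alpha: float) -> np.ndarray:
    """Resample by factor alpha and play back at the original rate:
    alpha > 1 compresses the signal (formants shift up), alpha < 1
    stretches it (formants shift down)."""
    n_out = int(round(len(wav) / alpha))
    old_t = np.arange(len(wav))
    new_t = np.linspace(0, len(wav) - 1, n_out)
    return np.interp(new_t, old_t, wav)

# Two artificial variants of one (synthetic) 1-second utterance.
rng = np.random.default_rng(0)
wav = rng.standard_normal(16000)  # noise as a stand-in for speech
print(len(perturb_speaker(wav, 1.1)), len(perturb_speaker(wav, 0.9)))
# → 14545 17778
```

Each warp factor yields one extra pseudo-speaker per original speaker, which is how perturbation expands training-set diversity.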
In addition, a warm-start training strategy is employed, initializing the multi-speaker model from pre-trained single-speaker systems. This step-by-step training approach is designed to seamlessly incorporate both new data sources and augmentation strategies.
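The warm-start idea can be pictured as copying every parameter that exists in both the pre-trained single-speaker model and the new multi-speaker model, while genuinely new modules (such as a speaker or dialect embedding table) keep their fresh initialization. The sketch below uses plain dictionaries of arrays in place of real checkpoints; all parameter names are hypothetical, not from the paper.

```python
import numpy as np

def warm_start(target: dict, pretrained: dict) -> dict:
    """Overwrite target params with pretrained ones of matching name
    and shape; params unique to the new model are left untouched."""
    out = dict(target)
    for name, value in pretrained.items():
        if name in out and out[name].shape == value.shape:
            out[name] = value
    return out

# Hypothetical parameter dictionaries standing in for model state.
single = {"encoder.w": np.ones((4, 4)), "decoder.w": np.ones((4, 2))}
multi = {"encoder.w": np.zeros((4, 4)),
         "decoder.w": np.zeros((4, 2)),
         "speaker_emb": np.zeros((10, 8))}  # new module, no match
warmed = warm_start(multi, single)
print(warmed["encoder.w"][0, 0], warmed["speaker_emb"].sum())
# → 1.0 0.0  (encoder copied; speaker_emb still fresh)
```

Framework checkpoint loaders offer equivalent partial-loading behavior; the dictionary version just makes the name-and-shape matching explicit.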
Results and Analysis
The efficacy of the proposed speaker augmentation techniques was evaluated through large-scale listening tests measuring naturalness, speaker similarity, and dialect identification accuracy. The paper finds that artificial speaker augmentation through Vocal Tract Length Perturbation (VTLP) does not significantly improve the synthesized output, whereas leveraging low-quality data with carefully modeled channel and dialect factors yields notable improvements.
- Naturalness: Significant improvements in the naturalness of synthesized speech were observed for seen speakers when low-quality data coupled with explicit channel and dialect modeling was utilized.
- Dialect Identification: Including dialect-related embeddings brings listeners' dialect perception of synthetic speech closer to that of natural speech, as evidenced by Frobenius distance metrics comparing the two.
- Speaker Similarity: Despite enhancements in naturalness and dialect representation, challenges remain in achieving high speaker similarity, especially for unseen test speakers.
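One plausible reading of the Frobenius-distance evaluation above is that listeners' dialect judgments are tallied into confusion matrices for natural and synthetic speech, and the matrix difference is measured; that interpretation, and all numbers below, are illustrative assumptions rather than results from the paper.

```python
import numpy as np

def frobenius_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Frobenius norm of the difference between two row-normalized
    dialect confusion matrices; smaller means synthetic speech is
    perceived more like natural speech."""
    return float(np.linalg.norm(a - b, ord="fro"))

# Hypothetical 3-dialect confusion matrices (rows: true dialect,
# columns: dialect identified by listeners).
natural = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.7, 0.2],
                    [0.2, 0.2, 0.6]])
synthetic = np.array([[0.7, 0.2, 0.1],
                      [0.2, 0.6, 0.2],
                      [0.2, 0.3, 0.5]])
print(round(frobenius_distance(natural, synthetic), 3))
# → 0.245
```

A perfect match would give a distance of zero, so comparing this value across systems ranks how faithfully each preserves dialect cues.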
Implications and Future Directions
The paper’s findings highlight the potential of non-traditional augmentation techniques in TTS systems, specifically using low-quality datasets. However, achieving balanced improvements across naturalness, speaker similarity, and dialect fidelity remains complex. The paper suggests that while speaker augmentation strategies can address some gaps, further research is required to refine techniques for improving speaker similarity, potentially through more advanced adaptation mechanisms or larger training datasets.
The insights gained from this research can guide future work on reconciling these trade-offs in multi-speaker TTS and on the interplay between augmented training data and perceived speaker variability and model robustness, with potential benefits for voice cloning and personalized TTS applications.