An Overview of AISHELL-3: A Multi-speaker Mandarin TTS Corpus and its Baselines
The paper introduces AISHELL-3, a significant contribution to multi-speaker speech synthesis for Mandarin Chinese. The dataset comprises roughly 85 hours of high-quality recordings from 218 native Mandarin speakers. The recordings, made in a controlled acoustic environment, cover text from several domains, including smart-home commands, news, and geographic information; this topical range broadens the corpus's applicability across diverse TTS systems. An essential feature of the dataset is its manually checked transcriptions at both the Chinese-character and pinyin levels, paired with per-speaker metadata such as gender, age group, and regional accent.
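To make the data layout concrete, the sketch below parses a parallel character/pinyin transcript of the kind the corpus provides. The `content.txt` filename and the interleaved character/pinyin token layout are illustrative assumptions, not guarantees from the paper; consult the corpus's own documentation for the exact format.

```python
from pathlib import Path

def load_transcripts(content_path: Path) -> dict[str, tuple[str, str]]:
    """Parse a transcript file into {utterance_id: (characters, pinyin)}.

    Assumes (hypothetically) that each line interleaves Chinese
    characters with tone-numbered pinyin syllables, e.g.:
        SSB00050001.wav 广 guang3 州 zhou1 女 nv3 ...
    """
    transcripts = {}
    for line in content_path.read_text(encoding="utf-8").splitlines():
        utt_id, *tokens = line.split()
        chars = "".join(tokens[0::2])    # even positions: characters
        pinyin = " ".join(tokens[1::2])  # odd positions: pinyin syllables
        transcripts[utt_id.removesuffix(".wav")] = (chars, pinyin)
    return transcripts
```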
AISHELL-3 helps close a gap in the resources available for TTS in non-English languages. Given Mandarin's tonal complexity and phonetic variation, the corpus is an indispensable resource for developing TTS systems capable of reproducing diverse speaker characteristics, and the structured provision of speaker attributes facilitates robust model training.
The authors develop a baseline multi-speaker TTS system built on the Tacotron-2 framework and integrated with a speaker verification model. By leveraging a speaker-embedding feedback constraint, the system targets zero-shot voice cloning, i.e., adapting to speaker voices never encountered during training. The baseline comprises three subsystems: a speaker-agnostic frontend, a Tacotron-2-based acoustic model, and a neural vocoder; the data flow between them is sketched below. Prosody prediction and text preprocessing in the frontend help the system handle variations in speech rhythm and intonation, which is especially relevant for Mandarin.
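The following is a minimal sketch of that three-subsystem data flow, assuming the usual mel-spectrogram interfaces between stages. The stub functions, pooling, and shapes are illustrative placeholders standing in for the paper's actual networks, not an implementation of them.

```python
import numpy as np

def frontend(text: str) -> list[str]:
    """Speaker-agnostic frontend: text -> symbol sequence (characters here)."""
    return list(text)

def speaker_encoder(reference_mel: np.ndarray) -> np.ndarray:
    """Speaker-verification model stand-in: reference mel -> fixed-size embedding."""
    return reference_mel.mean(axis=0)  # mean-pooling placeholder, not a real network

def acoustic_model(symbols: list[str], spk: np.ndarray) -> np.ndarray:
    """Tacotron-2 stand-in: symbols + speaker embedding -> mel frames.
    During training, the feedback constraint re-encodes the *generated*
    mel and penalizes its distance from spk."""
    return np.tile(spk, (len(symbols) * 5, 1))  # ~5 frames per symbol, placeholder

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Neural vocoder stand-in: mel frames -> waveform samples."""
    return mel.ravel()

def synthesize(text: str, reference_mel: np.ndarray) -> np.ndarray:
    spk = speaker_encoder(reference_mel)  # any utterance of the target voice;
    mel = acoustic_model(frontend(text), spk)  # unseen voices enable zero-shot cloning
    return vocoder(mel)

ref = np.random.rand(200, 80)  # (frames, mel bins) reference utterance
wav = synthesize("你好世界", ref)
```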
Objective evaluations, including speaker-verification equal error rate (SV-EER) and cosine-similarity measures, show promising voice similarity for both seen and unseen speakers. For instance, the system achieves an EER of 4.56% on validation-set speakers, rising to 9.46% on test-set speakers: a clear performance drop on new speaker identities, but one that still leaves reasonable similarity.
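For reference, SV-EER is conventionally computed from similarity scores over same-speaker and different-speaker trial pairs: the equal error rate is the operating point where the false-accept and false-reject rates coincide. The generic sketch below shows one way to compute it; it is not the paper's exact evaluation script.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: threshold where false-accept rate equals false-reject rate.

    scores: similarity score of each trial pair; labels: 1 = same speaker.
    """
    thresholds = np.sort(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))  # closest crossing point
    return float((far[idx] + frr[idx]) / 2)
```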
The paper also highlights several dataset-preparation techniques aimed at improving training efficacy and generalization: silence trimming, long-form sentence augmentation, and prosodic label prediction. These mitigate familiar problems such as alignment instability on longer utterances and monotonous prosody, supporting more nuanced and natural synthetic speech.
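As one concrete example, trimming leading and trailing silence is commonly done with an energy threshold. The librosa-based sketch below uses an assumed top_db value, since the paper's exact trimming parameters are not restated here.

```python
import librosa

def trim_silence(wav_path: str, top_db: float = 30.0):
    """Trim leading/trailing silence below top_db (an assumed threshold,
    not the paper's setting) relative to the signal's peak energy."""
    y, sr = librosa.load(wav_path, sr=None)          # keep original sample rate
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    return y_trimmed, sr
```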
From a theoretical perspective, the research underscores the value of language-specific corpora for TTS improvements, especially in tonal languages such as Mandarin. Practically, AISHELL-3 supports a broad range of applications, from commercial voice assistants to automated narration systems in Mandarin-speaking contexts.
Future research may explore narrowing the similarity gap between synthetic and real speech for unseen speakers, possibly through stronger feedback constraints or more sophisticated speaker-representation techniques. The authors' work also invites further investigation into prosodic and phonetic modeling to raise the naturalness and adaptability of TTS systems trained on AISHELL-3.
In conclusion, the paper delivers a crucial dataset and baseline system, equipping researchers and practitioners to advance multi-speaker TTS research attuned to the intricacies of Mandarin. AISHELL-3 is positioned as a potentially pivotal resource for subsequent developments in speech synthesis.