An Overview of AISHELL-3: A Multi-speaker Mandarin TTS Corpus and its Baselines
The paper introduces AISHELL-3, a significant contribution to multi-speaker speech synthesis for Mandarin Chinese. The dataset comprises roughly 85 hours of high-quality recordings from 218 native Mandarin speakers. The recordings, made in a controlled acoustic environment, cover text from several domains, including smart-home commands, news, and geographic information; this topical range broadens the corpus's applicability across diverse TTS systems. An essential feature of the dataset is its manually checked transcriptions at both the Chinese-character and pinyin levels, paired with per-speaker metadata such as gender, age group, and regional accent.
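To make the data layout concrete, the sketch below parses a parallel character/pinyin transcript of the kind the corpus provides. The `content.txt` filename and the interleaved character/pinyin token layout are illustrative assumptions, not guarantees from the paper; consult the corpus's own documentation for the exact format.

```python
from pathlib import Path

def load_transcripts(content_path: Path) -> dict[str, tuple[str, str]]:
    """Parse a transcript file into {utterance_id: (characters, pinyin)}.

    Assumes (hypothetically) that each line interleaves Chinese
    characters with tone-numbered pinyin syllables, e.g.:
        SSB00050001.wav 广 guang3 州 zhou1 女 nv3 ...
    """
    transcripts = {}
    for line in content_path.read_text(encoding="utf-8").splitlines():
        utt_id, *tokens = line.split()
        chars = "".join(tokens[0::2])    # even positions: characters
        pinyin = " ".join(tokens[1::2])  # odd positions: pinyin syllables
        transcripts[utt_id.removesuffix(".wav")] = (chars, pinyin)
    return transcripts
```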
AISHELL-3 helps close a gap in the resources available for TTS in non-English languages. Given Mandarin's tonal complexity and phonetic variation, the corpus is an indispensable resource for developing TTS systems capable of reproducing diverse speaker characteristics, and the structured provision of speaker attributes facilitates robust model training.
The authors develop a baseline multi-speaker TTS system built on the Tacotron-2 framework and integrated with a speaker verification model. By leveraging a speaker-embedding feedback constraint, the system targets zero-shot voice cloning, i.e., adapting to speaker voices never encountered during training. The baseline comprises three subsystems: a speaker-agnostic frontend, a Tacotron-2-based acoustic model, and a neural vocoder; the data flow between them is sketched below. Prosody prediction and text preprocessing in the frontend help the system handle variations in speech rhythm and intonation, which is especially relevant for Mandarin.
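The following is a minimal sketch of that three-subsystem data flow, assuming the usual mel-spectrogram interfaces between stages. The stub functions, pooling, and shapes are illustrative placeholders standing in for the paper's actual networks, not an implementation of them.

```python
import numpy as np

def frontend(text: str) -> list[str]:
    """Speaker-agnostic frontend: text -> symbol sequence (characters here)."""
    return list(text)

def speaker_encoder(reference_mel: np.ndarray) -> np.ndarray:
    """Speaker-verification model stand-in: reference mel -> fixed-size embedding."""
    return reference_mel.mean(axis=0)  # mean-pooling placeholder, not a real network

def acoustic_model(symbols: list[str], spk: np.ndarray) -> np.ndarray:
    """Tacotron-2 stand-in: symbols + speaker embedding -> mel frames.
    During training, the feedback constraint re-encodes the *generated*
    mel and penalizes its distance from spk."""
    return np.tile(spk, (len(symbols) * 5, 1))  # ~5 frames per symbol, placeholder

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Neural vocoder stand-in: mel frames -> waveform samples."""
    return mel.ravel()

def synthesize(text: str, reference_mel: np.ndarray) -> np.ndarray:
    spk = speaker_encoder(reference_mel)  # any utterance of the target voice;
    mel = acoustic_model(frontend(text), spk)  # unseen voices enable zero-shot cloning
    return vocoder(mel)

ref = np.random.rand(200, 80)  # (frames, mel bins) reference utterance
wav = synthesize("你好世界", ref)
```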
Objective evaluations, including speaker-verification equal error rate (SV-EER) and cosine-similarity measures, show promising voice similarity for both seen and unseen speakers. For instance, the system achieves an EER of 4.56% on validation-set speakers, rising to 9.46% on test-set speakers: a clear performance drop on new speaker identities, but one that still leaves reasonable similarity.
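For reference, SV-EER is conventionally computed from similarity scores over same-speaker and different-speaker trial pairs: the equal error rate is the operating point where the false-accept and false-reject rates coincide. The generic sketch below shows one way to compute it; it is not the paper's exact evaluation script.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: threshold where false-accept rate equals false-reject rate.

    scores: similarity score of each trial pair; labels: 1 = same speaker.
    """
    thresholds = np.sort(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))  # closest crossing point
    return float((far[idx] + frr[idx]) / 2)
```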
The paper also highlights several dataset-preparation techniques aimed at improving training efficacy and generalization: silence trimming, long-form sentence augmentation, and prosodic label prediction. These mitigate familiar problems such as alignment instability on longer utterances and monotonous prosody, supporting more nuanced and natural synthetic speech.
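As one concrete example, trimming leading and trailing silence is commonly done with an energy threshold. The librosa-based sketch below uses an assumed top_db value, since the paper's exact trimming parameters are not restated here.

```python
import librosa

def trim_silence(wav_path: str, top_db: float = 30.0):
    """Trim leading/trailing silence below top_db (an assumed threshold,
    not the paper's setting) relative to the signal's peak energy."""
    y, sr = librosa.load(wav_path, sr=None)          # keep original sample rate
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    return y_trimmed, sr
```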
From a theoretical perspective, the research underscores the value of language-specific corpora for TTS improvements, especially in tonal languages such as Mandarin. Practically, AISHELL-3 supports a broad range of applications, from commercial voice assistants to automated narration systems in Mandarin-speaking contexts.
Future research may explore narrowing the similarity gap between synthetic and real speech for unseen speakers, possibly through stronger feedback constraints or more sophisticated speaker-representation techniques. The authors' work also invites further investigation into prosodic and phonetic modeling to raise the naturalness and adaptability of TTS systems trained on AISHELL-3.
In conclusion, the paper delivers a crucial dataset and baseline system, equipping researchers and practitioners to advance multi-speaker TTS research attuned to the intricacies of Mandarin. AISHELL-3 is positioned as a potentially pivotal resource for subsequent developments in speech synthesis.