Overview of "Almost Unsupervised Text to Speech and Automatic Speech Recognition"
This paper presents an innovative approach to the challenges inherent in Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems, particularly in scenarios where aligned speech and text data are scarce. Recognizing that TTS and ASR are dual tasks, the authors introduce a framework for "almost unsupervised" learning that requires only a small amount of paired data alongside larger quantities of unpaired speech and text.
The method comprises four primary components: denoising auto-encoders, dual transformation, bidirectional sequence modeling, and a unified Transformer-based model architecture shared across both tasks. The denoising auto-encoders build representations of speech and text from unpaired data alone by reconstructing corrupted inputs. The dual transformation process, akin to back-translation in machine translation, allows the TTS and ASR tasks to inform and refine each other by generating pseudo-paired data from unpaired samples. Bidirectional sequence modeling mitigates the error propagation typically associated with generating long sequences by enabling generation in both left-to-right and right-to-left directions.
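To make the dual transformation idea concrete, the following sketch shows how each model's output becomes pseudo-paired training data for the other. This is a minimal illustration, not the paper's implementation: the TTSModel and ASRModel stand-ins, tensor shapes, and loss choices are assumptions, and the denoising auto-encoder and supervised losses on the small paired set are omitted for brevity.

import torch
import torch.nn as nn

MEL_DIM, VOCAB = 80, 40          # mel-spectrogram bins, phoneme vocabulary size

class TTSModel(nn.Module):
    """Toy text-to-mel stand-in (the paper uses a shared Transformer)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 64)
        self.out = nn.Linear(64, MEL_DIM)
    def forward(self, tokens):                 # (B, T) -> (B, T, MEL_DIM)
        return self.out(self.emb(tokens))

class ASRModel(nn.Module):
    """Toy mel-to-text stand-in."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(MEL_DIM, 64)
        self.out = nn.Linear(64, VOCAB)
    def forward(self, mels):                   # (B, T, MEL_DIM) -> (B, T, VOCAB)
        return self.out(self.proj(mels))

tts, asr = TTSModel(), ASRModel()
opt = torch.optim.Adam(list(tts.parameters()) + list(asr.parameters()), lr=1e-4)
mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()

# Unpaired data: text without audio, audio without text (random for the demo).
unpaired_text = torch.randint(0, VOCAB, (8, 32))        # (B, T)
unpaired_speech = torch.randn(8, 32, MEL_DIM)           # (B, T, MEL_DIM)

for step in range(10):
    # TTS -> ASR direction: synthesize pseudo speech for unpaired text,
    # then train ASR to recover the original text from it.
    with torch.no_grad():
        pseudo_mel = tts(unpaired_text)
    asr_loss = ce(asr(pseudo_mel).transpose(1, 2), unpaired_text)

    # ASR -> TTS direction: transcribe unpaired speech into pseudo text,
    # then train TTS to regenerate the original spectrogram from it.
    with torch.no_grad():
        pseudo_text = asr(unpaired_speech).argmax(-1)
    tts_loss = mse(tts(pseudo_text), unpaired_speech)

    opt.zero_grad()
    (asr_loss + tts_loss).backward()
    opt.step()

Detaching the teacher model's output with torch.no_grad mirrors back-translation: in each direction, only the model consuming the pseudo-paired data receives gradients, so the pair of models bootstraps itself from unpaired data.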
In empirical evaluations on the LJSpeech dataset with minimal paired examples (200 paired sequences, equating to approximately 20 minutes of audio), the model yields a remarkable word-level intelligibility rate of 99.84%. For TTS quality, the model achieves a mean opinion score (MOS) of 2.68, and for ASR it attains a phoneme error rate (PER) of 11.7%. These outcomes significantly outperform baseline models trained solely on the limited paired data.
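For context on how the ASR figure is measured, the snippet below computes a phoneme error rate in the standard way: Levenshtein edit distance between hypothesis and reference phoneme sequences, normalized by reference length. The example sequences are invented for illustration; the paper does not detail its scoring script.

def phoneme_error_rate(ref: list, hyp: list) -> float:
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = ["HH", "AH", "L", "OW"]   # "hello"
hyp = ["HH", "AH", "L", "UW"]   # one substitution out of four phonemes
print(f"PER: {phoneme_error_rate(ref, hyp):.1%}")  # PER: 25.0%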
Key Results and Implications
The numerical results highlight the potential of the proposed almost unsupervised approach. Achieving such high intelligibility and competitive MOS in TTS, alongside a respectable PER in ASR, demonstrates the feasibility of deploying advanced speech technologies in low-resource environments.
From a practical standpoint, this paper suggests a viable pathway for improving TTS and ASR systems in languages or dialects lacking extensive labeled datasets. The approach could foster the development of more inclusive language technologies adaptable to diverse linguistic contexts.
Theoretically, the paper's methodology contributes to the broader field of sequence-to-sequence learning by demonstrating how unsupervised components, typically explored in natural language processing, can be adapted and extended to speech processing.
Future Directions
The research opens several avenues for future exploration. First, the transition from an "almost unsupervised" setup to a fully unsupervised framework could be pursued, potentially harnessing more sophisticated pre-training strategies, as the authors suggest. Additionally, replacing the Griffin-Lim algorithm with an advanced neural vocoder such as WaveNet could further improve audio synthesis quality, addressing the limitations the authors identify in their current setup.
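As a point of reference, the sketch below shows what Griffin-Lim synthesis looks like with librosa. The sample rate, FFT size, and iteration count are typical values rather than the paper's exact configuration, and the sine tone stands in for a spectrogram produced by the TTS decoder.

import librosa
import numpy as np

sr, n_fft = 22050, 1024
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t)   # 1 s, 440 Hz tone as stand-in audio

# Pretend this mel spectrogram came from the TTS model's decoder.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, n_mels=80)

# Invert the mel filterbank to a linear magnitude spectrogram, then estimate
# phase iteratively with Griffin-Lim; more iterations trade speed for quality.
linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
audio = librosa.griffinlim(linear, n_iter=60, hop_length=n_fft // 4)

# A neural vocoder such as WaveNet would replace this phase-estimation step,
# generating the waveform directly from the mel spectrogram.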
In conclusion, the work establishes a practical framework for expanding TTS and ASR capabilities in low-resource settings and offers valuable insights into leveraging dual-task relationships within deep learning architectures.