- The paper presents an unsupervised method that discovers discrete linguistic units to effectively separate speech content from speaker identity.
- It employs an ASR-TTS autoencoder with Multilabel-Binary Vector (MBV) encoding and adversarial training to achieve high-quality voice conversion under low-bitrate constraints.
- Experimental results show stronger speaker-identity disentanglement (measured via speaker verification) and higher speaker similarity, with acceptable naturalness and intelligibility, outperforming traditional one-hot and continuous vector encodings.
Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion
The paper "Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion" presents a significant advance in unsupervised learning for voice conversion. The researchers propose a method for discovering discrete subword units from speech without any labeled data, a noteworthy shift from traditional paradigms that rely on parallel speech and text transcription pairs.
Methodology
Central to the paper's methodology is an ASR-TTS autoencoder: an ASR-Encoder discovers common linguistic units, and a TTS-Decoder synthesizes speech with different speaker characteristics. The model trains in a fully unsupervised framework, eschewing text labels and parallel data and thus capitalizing on the abundance of unlabeled speech. Its key component is the Multilabel-Binary Vector (MBV), a discrete yet differentiable encoding that is crucial for separating speech content from speaker characteristics; because each dimension is binarized independently, a D-dimensional MBV can represent up to 2^D distinct units while remaining trainable end to end. Linguistic content is thus extracted automatically, and voice conversion is achieved by pairing that content with a different speaker identity.
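The MBV binarization can be sketched as follows. This is a minimal illustration of the forward pass only: the function name and threshold are our own, and the differentiable training trick the paper relies on is only noted in a comment, not implemented.

```python
import numpy as np

def mbv_encode(logits, threshold=0.0):
    """Binarize encoder logits into a Multilabel-Binary Vector (MBV).

    Each dimension is independently set to 0 or 1, so a D-dim code can
    represent up to 2**D distinct linguistic units -- far more compact
    than a one-hot code of equal capacity. (Forward pass only; during
    training the binarization must stay differentiable, e.g. via a
    straight-through-style estimator.)
    """
    return (logits > threshold).astype(np.float32)

# Toy example: a 6-dim code gives 2**6 = 64 possible discrete units.
frame_logits = np.array([1.2, -0.7, 0.3, -2.1, 0.9, -0.1])
code = mbv_encode(frame_logits)
print(code)  # [1. 0. 1. 0. 1. 0.]
```

The independent per-dimension bits are what make MBV more expressive than a one-hot code of the same width, which is the property the paper exploits for low-bitrate content encoding.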
Adversarial training further enhances conversion quality: a TTS-Patcher is trained against a discriminator to refine the TTS-Decoder's output, compensating for the mismatch between training (self-reconstruction) and testing (conversion to a different speaker).
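The patching stage's adversarial objective can be sketched as a standard GAN-style loss pair. This is our notation, not the paper's exact formulation: the discriminator scores how "real" target-speaker speech sounds, and the patcher is trained to push converted speech toward high scores.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """Discriminator objective (illustrative, not the paper's exact loss).

    Maximizes log D(real) + log(1 - D(converted)), expressed here as a
    quantity to minimize. Inputs are D's probability outputs in (0, 1).
    """
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def patcher_loss(d_fake, eps=1e-8):
    """Patcher objective: fool D by maximizing log D(converted)."""
    return -np.mean(np.log(d_fake + eps))

# Toy usage with hypothetical discriminator outputs per utterance.
d_real = np.array([0.9, 0.8])   # scores on genuine target-speaker speech
d_fake = np.array([0.2, 0.1])   # scores on patched converted speech
d_obj = discriminator_loss(d_real, d_fake)
p_obj = patcher_loss(d_fake)
```

As the patched output becomes harder to distinguish from genuine target-speaker speech, `patcher_loss` decreases, which is the pressure that narrows the training-testing gap.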
Results
The experimental setup involves comprehensive evaluations that substantiate the efficacy of the proposed method. The approach is evaluated along several axes: speaker verification accuracy as a measure of disentanglement, subjective human ratings of naturalness and speaker similarity, and objective encoding-quality measures such as character error rate (CER) and bitrate (BR).
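The bitrate of a discrete-unit encoding can be estimated, in the spirit of the ZeroSpeech bitrate metric, as the symbol rate multiplied by the empirical entropy of the symbol distribution. The sketch below is a simplification under that assumption; the official metric definition differs in details.

```python
import math
from collections import Counter

def empirical_bitrate(symbols, duration_s):
    """Approximate bitrate (bits/second) of a discrete unit sequence:
    symbol rate times the empirical entropy of the symbols.
    (Simplified sketch, not the official ZeroSpeech scoring code.)
    """
    counts = Counter(symbols)
    n = len(symbols)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return (n / duration_s) * entropy

# Toy example: 8 units over 0.5 s, drawn from 2 equally likely symbols,
# gives 16 symbols/s * 1 bit/symbol.
units = ["a", "b", "a", "b", "a", "b", "a", "b"]
print(empirical_bitrate(units, 0.5))  # 16.0
```

This makes the low-bitrate trade-off concrete: a small, reused inventory of units drives entropy (and thus bitrate) down, while intelligibility, tracked via CER, limits how aggressively the inventory can shrink.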
The paper reports compelling outcomes, with the proposed MBV encoding achieving superior speaker identity disentanglement compared to one-hot and continuous vector encodings. Subjective assessments highlighted significant improvements in speaker similarity, albeit with a minor compromise in naturalness. Notably, in the ZeroSpeech 2019 Challenge, the method achieves low-bitrate encoding while maintaining acceptable intelligibility, securing a high ranking on the Surprise dataset leaderboard.
Implications and Future Directions
The application of novel discrete linguistic representations points to refinements in unsupervised speech processing. MBVs as linguistic unit representations could pave the way for improved voice conversion systems and may apply to broader tasks requiring content-style disentanglement without labeled data. The ability to achieve high-quality voice conversion under low-bitrate constraints also suggests significant implications for data compression in speech technologies.
Future research directions could explore the extension of these techniques to multilingual contexts, the integration with existing speech synthesis systems for enhanced performance, and the refinement of adversarial elements to further improve naturalness without losing content fidelity. The scalability of the approach in real-world applications remains an intriguing prospect, given its reliance on unlabeled data, which is easily obtainable.
In conclusion, this paper provides a robust framework for voice conversion using unsupervised discrete unit discovery, indicating promising avenues for future developments in AI-driven speech processing.