A Multi-Discriminator CycleGAN for Unsupervised Non-Parallel Speech Domain Adaptation: A Technical Overview
The paper by Hosseini-Asl et al. introduces an approach to unsupervised speech domain adaptation that requires no parallel data. The method leverages a Multi-Discriminator CycleGAN, an extension of the traditional CycleGAN architecture, to improve automatic speech recognition (ASR) performance on gender-adaptation tasks. Specifically, the proposed model addresses the challenge of adapting speech recognition systems to work effectively across domains with disparate non-linguistic features without parallel recordings of the same utterances in both domains, which are often difficult to acquire.
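The CycleGAN framework sidesteps the need for parallel data by training a pair of generators, one mapping each domain to the other, under a cycle-consistency constraint: mapping a spectrogram to the other domain and back should reconstruct the original. As a minimal sketch of that loss, the toy "generators" below are simple invertible functions standing in for the paper's neural networks; the function names `G` and `F` and the L1 form of the loss follow the standard CycleGAN formulation, not details taken from this paper.

```python
import numpy as np

# Toy stand-ins for the two generators. In the actual model these are
# neural networks mapping power spectrograms between domains
# (e.g. female -> male speech and back).
def G(x):
    """Hypothetical source -> target mapping."""
    return 2.0 * x + 1.0

def F(y):
    """Hypothetical target -> source mapping (inverse of G here)."""
    return (y - 1.0) / 2.0

def cycle_consistency_loss(x, y):
    """L1 cycle loss: x -> G -> F should recover x, y -> F -> G should recover y."""
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

rng = np.random.default_rng(0)
x = rng.random((4, 128))  # batch of source-domain spectrogram frames
y = rng.random((4, 128))  # batch of target-domain spectrogram frames
loss = cycle_consistency_loss(x, y)
print(loss)  # near zero, since F exactly inverts G in this toy setup
```

In training, this cycle term is added to the adversarial losses from the discriminators, which is what lets the mapping be learned without paired examples.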
The core innovation presented in this research is the integration of multiple independent discriminators, each tasked with focusing on different frequency bands of the power spectrogram. This enhancement enables the discriminators to capture fine-grained spectral details, thus allowing the generator to produce more realistic domain-adjusted spectrograms. By adapting the ASR models to these refined representations, the system demonstrates significant improvements in recognizing speech from unfamiliar domains, as validated through experiments on the TIMIT and WSJ datasets.
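The band-splitting idea itself can be sketched in a few lines: partition the spectrogram along the frequency axis and give each slice its own discriminator, then aggregate their judgments. The `BandDiscriminator` class below is a hypothetical stand-in (a fixed random linear scorer rather than the CNN discriminators a real implementation would use); only the per-band decomposition reflects the paper's design, and the band count and aggregation by averaging are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_into_bands(spec, n_bands):
    """Split a power spectrogram (freq_bins x time) along the frequency axis."""
    return np.array_split(spec, n_bands, axis=0)

class BandDiscriminator:
    """Hypothetical per-band discriminator: a fixed random linear scorer
    standing in for a small neural network."""
    def __init__(self, band_shape):
        self.w = rng.standard_normal(band_shape)

    def score(self, band):
        # Logistic "real vs. generated" probability for this band.
        return 1.0 / (1.0 + np.exp(-np.sum(self.w * band) / band.size))

n_bands = 3
spec = rng.random((129, 40))          # 129 frequency bins x 40 time frames
bands = split_into_bands(spec, n_bands)
discs = [BandDiscriminator(b.shape) for b in bands]

# Each discriminator judges only its own frequency band; the generator's
# adversarial signal aggregates the per-band scores (here, a simple mean).
scores = [d.score(b) for d, b in zip(discs, bands)]
print(len(bands), [b.shape[0] for b in bands], sum(scores) / n_bands)
```

Because each discriminator sees only a narrow slice of the spectrum, it can specialize in the spectral detail of that band rather than averaging its judgment over the full frequency range.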
Key results reported include a notable reduction in phoneme error rate (7.41% relative improvement) and word error rate (11.10% relative improvement) compared to baseline models. These gains are demonstrated on gender-adaptation tasks: an ASR model trained on female speech, with test inputs adapted by this approach, outperforms several baselines when evaluated on male speech, and vice versa.
The implications of this research extend beyond the immediate improvements in ASR systems. The successful application of multi-discriminator architectures suggests that similar techniques could enhance domain adaptation in other areas of machine learning, particularly where parallel data is scarce or unavailable. As AI continues to evolve, the method introduced in this paper may inspire future explorations into more advanced adversarial architectures for increasingly complex domain adaptation tasks.
Future work might vary the number and configuration of discriminators to assess their impact on different types of domain adaptation challenges. Extending the architecture to modalities beyond speech could also yield insight into the general applicability and robustness of multi-discriminator models. As the field progresses, such innovations will be important for advancing unsupervised learning methodologies and improving the accuracy and adaptability of AI systems across diverse use cases.