Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator (2403.16464v1)
Abstract: A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because it is fast, lightweight, and produces high-quality speech. However, this data-driven model requires a large amount of training data, incurring high data-collection costs. This fact motivates us to train a GAN-based vocoder on limited data. A promising solution is to augment the training data to avoid overfitting. However, a standard discriminator is unconditional and insensitive to distributional changes caused by data augmentation. Thus, augmented speech (which can deviate substantially from natural speech) may be judged as real speech. To address this issue, we propose an augmentation-conditional discriminator (AugCondD) that receives the augmentation state as input in addition to speech, thereby assessing the input speech according to the augmentation state without inhibiting the learning of the original, non-augmented distribution. Experimental results indicate that AugCondD improves speech quality under limited-data conditions while achieving comparable speech quality under sufficient-data conditions. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/.
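To make the core idea concrete, below is a minimal, hypothetical sketch in PyTorch of a discriminator conditioned on an augmentation state. The layer sizes, the embedding-based conditioning, the two-state labeling (0 = non-augmented, 1 = augmented), and the additive-noise stand-in augmentation are all illustrative assumptions, not the paper's actual architecture or augmentation scheme.

```python
# Sketch of an augmentation-conditional discriminator (AugCondD-style):
# the discriminator receives an augmentation state alongside the speech
# waveform, so augmented and non-augmented inputs are judged against
# different criteria. All design details here are illustrative assumptions.
import torch
import torch.nn as nn


class AugCondDiscriminator(nn.Module):
    def __init__(self, num_aug_states: int = 2, channels: int = 64):
        super().__init__()
        # Toy-scale 1D conv stack over the raw waveform.
        self.body = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
        )
        # The augmentation state enters as a learned embedding added to the
        # features, analogous to conditional-GAN conditioning (an assumption;
        # the paper may condition differently).
        self.aug_embed = nn.Embedding(num_aug_states, channels)
        self.head = nn.Conv1d(channels, 1, kernel_size=3, padding=1)

    def forward(self, wav: torch.Tensor, aug_state: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples); aug_state: (batch,) integer labels.
        h = self.body(wav)
        h = h + self.aug_embed(aug_state).unsqueeze(-1)  # broadcast over time
        return self.head(h)  # per-frame real/fake logits


# Toy usage: natural speech is scored with aug_state=0; an augmented copy
# (simple additive noise as a stand-in augmentation) with aug_state=1.
if __name__ == "__main__":
    disc = AugCondDiscriminator()
    wav = torch.randn(4, 1, 16000)
    logits_real = disc(wav, torch.zeros(4, dtype=torch.long))
    logits_aug = disc(wav + 0.05 * torch.randn_like(wav),
                      torch.ones(4, dtype=torch.long))
    print(logits_real.shape, logits_aug.shape)
```

In a full GAN training loop, real and generated speech would each be scored under the augmentation state actually applied to them, so the original non-augmented distribution is still learned through the aug_state = 0 branch, matching the abstract's motivation.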
- Takuhiro Kaneko
- Hirokazu Kameoka
- Kou Tanaka