Enhancing GAN-Based Vocoders with Contrastive Learning Under Data-limited Condition (2309.09088v2)
Abstract: Vocoder models have recently made substantial progress in generating realistic audio comparable in quality to human recordings, while significantly reducing memory requirements and inference time. However, these generative models are data-hungry, requiring large-scale audio corpora to learn good representations. In this paper, we apply contrastive learning to vocoder training to improve the vocoder's perceptual quality without modifying its architecture or adding data. We design an auxiliary task with mel-spectrogram contrastive learning to enhance the utterance-level quality of the vocoder model under data-limited conditions. We further extend the task to include waveforms, improving the model's multi-modal comprehension and mitigating discriminator overfitting. We optimize the auxiliary task jointly with the GAN training objectives. Our results show that these tasks substantially improve model performance in data-limited settings.
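The abstract summarizes the approach without giving a loss formulation, so the snippet below is only a minimal sketch of the kind of objective it describes: a standard symmetric InfoNCE contrastive loss over paired embeddings (e.g., projections of a mel-spectrogram and its matching waveform), optimized jointly with the GAN generator loss. The names `info_nce_loss`, `lambda_cl`, `z_mel`, and `z_wav` are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    z_a, z_b: (batch, dim) projections of two views of the same utterance,
    e.g., a mel-spectrogram encoding and its matching waveform encoding.
    Matching rows are positives; every other row in the batch is a negative.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Cross-entropy pulls each diagonal (positive) pair together and pushes
    # off-diagonal (negative) pairs apart; averaging both directions
    # keeps the loss symmetric in the two modalities.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Hypothetical joint objective: weight the auxiliary contrastive term
# against the usual GAN generator loss and optimize them together.
# g_loss = gan_generator_loss + lambda_cl * info_nce_loss(z_mel, z_wav)
```

In a setup like this, only lightweight projection heads producing `z_mel` and `z_wav` would need to be added, which is consistent with the stated goal of leaving the vocoder architecture and training data unchanged.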