Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder (2311.14957v1)

Published 25 Nov 2023 in cs.SD and eess.AS

Abstract: Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to advance GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution is fixed across the spectrogram, making it ill-suited to signals such as singing voices that require flexible attention across frequency bands. Motivated by this, our study utilizes the Constant-Q Transform (CQT), whose resolution varies dynamically with frequency, enabling better modeling of pitch accuracy and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves. Experiments conducted on both speech and singing voices confirm the effectiveness of the proposed method. Moreover, we verify that CQT-based and STFT-based discriminators are complementary under joint training: enhanced by the proposed MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN improves from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.
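The property the abstract leans on is that a CQT uses geometrically spaced frequency bins with a constant ratio of center frequency to bandwidth, so low octaves get fine frequency resolution and high octaves get fine time resolution, unlike the fixed grid of an STFT. Below is a minimal sketch of the discriminator's front end only, assuming librosa for the CQT; the hop lengths, bins-per-octave values, and the 7-octave span are illustrative assumptions, not the paper's hyperparameters. It computes CQT magnitudes at multiple scales and slices each spectrogram into per-octave sub-bands, the representation an MS-SB-CQT discriminator branch would consume.

```python
# Sketch (not the authors' implementation): multi-scale CQT analysis with
# octave-wise sub-band splitting. All hyperparameters below are assumptions.
import numpy as np
import librosa


def multi_scale_subband_cqt(y, sr):
    """For each scale, return a list of per-octave CQT magnitude sub-bands."""
    # Two illustrative analysis scales; the paper's scales may differ.
    scales = (
        {"hop_length": 256, "bins_per_octave": 24},
        {"hop_length": 512, "bins_per_octave": 36},
    )
    n_octaves = 7                        # assumed span: 7 octaves above fmin
    fmin = librosa.note_to_hz("C1")      # ~32.7 Hz starting frequency
    outputs = []
    for cfg in scales:
        bpo = cfg["bins_per_octave"]
        C = np.abs(librosa.cqt(
            y, sr=sr,
            hop_length=cfg["hop_length"],
            fmin=fmin,
            n_bins=bpo * n_octaves,
            bins_per_octave=bpo,
        ))
        # Sub-band processing: one slice of `bpo` bins per octave, so each
        # discriminator branch can attend to one frequency band separately.
        subbands = [C[i * bpo:(i + 1) * bpo] for i in range(n_octaves)]
        outputs.append(subbands)
    return outputs


if __name__ == "__main__":
    y, sr = librosa.load(librosa.example("trumpet"), sr=22050)
    for scale_idx, subbands in enumerate(multi_scale_subband_cqt(y, sr)):
        print(f"scale {scale_idx}: {len(subbands)} octave sub-bands, "
              f"shape per band = {subbands[0].shape}")
```

In a full discriminator, each octave slice would then feed its own convolutional sub-discriminator at each scale, mirroring the sub-band-per-octave processing the abstract describes; that adversarial machinery is omitted here.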
