An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder (2404.17161v1)

Published 26 Apr 2024 in cs.SD, eess.AS, and eess.SP

Abstract: Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it incompatible with signals like singing voices that require dynamic attention over different frequency bands and different time intervals. Motivated by this, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both CQT and CWT have a dynamic TF resolution across frequency bands; between the two, CQT better models pitch information, while CWT better models short-time transients. Experiments conducted on both speech and singing voices confirm the effectiveness of the proposed discriminators. Moreover, the STFT-, CQT-, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators can boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
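The abstract's core contrast is between the STFT's constant TF resolution and the frequency-dependent resolution of CQT and CWT. As a rough illustration (not the paper's implementation), the sketch below computes all three representations for a toy waveform using librosa and PyWavelets, the toolkits cited in the references; the waveform, bin, hop, and scale settings are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import librosa
import pywt

sr = 24000
# Placeholder waveform: a 1-second 440 Hz tone stands in for real speech/singing audio.
y = librosa.tone(440.0, sr=sr, duration=1.0)

# STFT: constant TF resolution with linearly spaced center frequencies.
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# CQT: geometrically spaced bins; finer frequency resolution at low frequencies,
# which suits pitch modeling.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12))

# CWT (Morlet): per-scale resolution; sharper time localization at small scales,
# which suits short-time transients.
scales = np.geomspace(2, 256, num=64)
cwt_coeffs, _ = pywt.cwt(y, scales, "morl", sampling_period=1.0 / sr)
cwt = np.abs(cwt_coeffs)

print(stft.shape, cqt.shape, cwt.shape)  # e.g. (513, 94) (84, 94) (64, 24000)
```

A TFR-based discriminator would take magnitude (and possibly phase) maps like these as input; the proposed MS-SB-CQT and MS-TC-CWT discriminators additionally operate at multiple scales and over sub-bands or temporally compressed views, which this sketch does not reproduce.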

References (61)
  1. J. Liu, C. Li, Y. Ren, F. Chen, and Z. Zhao, “DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism,” in AAAI, 2022, pp. 11020–11028.
  2. J. Hwang, S. Lee, and S. Lee, “HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models,” CoRR, vol. abs/2306.06814, 2023.
  3. Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” in ICLR, 2021.
  4. K. Shen, Z. Ju, X. Tan, Y. Liu, Y. Leng, L. He, T. Qin, S. Zhao, and J. Bian, “NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers,” CoRR, vol. abs/2304.09116, 2023.
  5. A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High Fidelity Neural Audio Compression,” arXiv, vol. abs/2210.13438, 2022.
  6. Z. Du, S. Zhang, K. Hu, and S. Zheng, “FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec,” CoRR, vol. abs/2309.07405, 2023.
  7. H. Kawahara, “STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds,” AST, pp. 349–353, 2006.
  8. M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans Inf Syst, vol. 99, no. 7, pp. 1877–1884, 2016.
  9. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” in SSW, 2016, p. 125.
  10. N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient Neural Audio Synthesis,” in ICML, 2018, pp. 2415–2424.
  11. A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg et al., “Parallel wavenet: Fast high-fidelity speech synthesis,” in ICML, 2018, pp. 3918–3926.
  12. W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech,” in ICLR, 2019.
  13. W. Ping, K. Peng, K. Zhao, and Z. Song, “WaveFlow: A Compact Flow-based Model for Raw Audio,” in ICML, vol. 119, 2020, pp. 7706–7716.
  14. R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A Flow-based Generative Network for Speech Synthesis,” in ICASSP, 2019, pp. 3617–3621.
  15. L. Juvela, B. Bollepalli, V. Tsiaras, and P. Alku, “Glotnet—a raw waveform model for the glottal excitation in statistical parametric speech synthesis,” TASLP, vol. 27, no. 6, pp. 1019–1030, 2019.
  16. J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in ICASSP, 2019, pp. 5891–5895.
  17. Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A Versatile Diffusion Model for Audio Synthesis,” in ICLR, 2021.
  18. T. D. Nguyen, J.-H. Kim, Y. Jang, J. Kim, and J. S. Chung, “FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder,” arXiv:2401.10032, 2024.
  19. X. Wang, S. Takaki, and J. Yamagishi, “Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis,” in ICASSP, 2019, pp. 5916–5920.
  20. C. Yu and G. Fazekas, “Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables,” in ISMIR, 2023, pp. 667–675.
  21. R. Yamamoto, E. Song, and J. Kim, “Parallel Wavegan: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram,” in ICASSP, 2020, pp. 6199–6203.
  22. K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” in NeurIPS, 2019.
  23. W. Jang, D. Lim, and J. Yoon, “Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains,” arXiv, vol. abs/2011.09631, 2020.
  24. J. Su, Z. Jin, and A. Finkelstein, “HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks,” in INTERSPEECH, 2020, pp. 4506–4510.
  25. J. Kim, S. Lee, J. Lee, and S. Lee, “Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis,” in INTERSPEECH, 2021, pp. 2197–2201.
  26. R. Huang, C. Cui, F. Chen, Y. Ren, J. Liu, Z. Zhao, B. Huai, and Z. Wang, “SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation,” in ACM MM, 2022, pp. 2525–2535.
  27. S. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A Universal Neural Vocoder with Large-Scale Training,” in ICLR, 2023.
  28. T. Shibuya, Y. Takida, and Y. Mitsufuji, “BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network,” CoRR, vol. abs/2309.02836, 2023.
  29. D. Wu, W. Hsiao, F. Yang, O. Friedman, W. Jackson, S. Bruzenak, Y. Liu, and Y. Yang, “DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation,” in ISMIR, 2022, pp. 76–83.
  30. Z. Liu, T. Hartwig, and M. Ueda, “Neural networks fail to learn periodic functions and how to fix it,” NeurIPS, vol. 33, pp. 1583–1594, 2020.
  31. S. Li, S. Liu, L. Zhang, X. Li, Y. Bian, C. Weng, Z. Wu, and H. Meng, “SnakeGAN: A Universal Vocoder Leveraging DDSP Prior Knowledge and Periodic Inductive Bias,” in ICME, 2023, pp. 1703–1708.
  32. J. You, D. Kim, G. Nam, G. Hwang, and G. Chae, “GAN Vocoder: Multi-Resolution Discriminator Is All You Need,” in INTERSPEECH, 2021, pp. 2177–2181.
  33. J. Allen, “Short term spectral analysis, synthesis, and modification by discrete Fourier transform,” TASLP, vol. 25, no. 3, pp. 235–238, 1977.
  34. J. C. Brown and M. Puckette, “An efficient algorithm for the calculation of a constant Q transform,” JASA, vol. 92, pp. 2698–2701, 1992.
  35. C. Schörkhuber and A. Klapuri, “Constant-Q transform toolbox for music processing,” in SMCC, 2010, pp. 3–64.
  36. O. Rioul and P. Duhamel, “Fast algorithms for discrete and continuous wavelet transforms,” IEEE Trans. Inf. Theory, vol. 38, no. 2, pp. 569–586, 1992.
  37. Y. Gu, X. Zhang, L. Xue, and Z. Wu, “Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder,” in ICASSP, 2024, pp. 10616–10620.
  38. E. Hewitt and R. Hewitt, “The Gibbs-Wilbraham phenomenon: An episode in Fourier analysis,” Arch. Hist. Exact Sci., vol. 21, pp. 129–160, 1979.
  39. S. Seneff, “Pitch and spectral analysis of speech based on an auditory synchrony model,” Ph.D. dissertation, Massachusetts Institute of Technology, Research Laboratory of Electronics, 1985.
  40. J. P. Stautner, “Analysis and synthesis of music using the auditory transform,” Ph.D. dissertation, Massachusetts Institute of Technology, 1983.
  41. M. Taner, “Joint time/frequency analysis, Q quality factor and dispersion computation using Gabor-Morlet wavelets or the Gabor-Morlet transform,” RSI, pp. 1–5, 1983.
  42. R. Navarro and A. Tabernero, “Gaussian wavelet transform: two alternative fast implementations for images,” MSSP, vol. 2, pp. 421–436, 1991.
  43. G. B. Folland and A. Sitaram, “The uncertainty principle: a mathematical survey,” JFAA, vol. 3, pp. 207–238, 1997.
  44. B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and Music Signal Analysis in Python,” in SciPy, 2015.
  45. K. W. Cheuk, H. Anderson, K. Agres, and D. Herremans, “nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks,” IEEE Access, pp. 161981–162003, 2020.
  46. G. R. Lee, R. Gommers, F. Waselewski, K. Wohlfahrt, and A. O'Leary, “PyWavelets: A Python package for wavelet analysis,” JOSS, vol. 4, no. 36, p. 1237, 2019.
  47. L. Zhang, R. Li, S. Wang, L. Deng, J. Liu, Y. Ren, J. He, R. Huang, J. Zhu, X. Chen, and Z. Zhao, “M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus,” in NeurIPS, 2022.
  48. J. Koguchi, S. Takamichi, and M. Morise, “PJS: phoneme-balanced Japanese singing-voice corpus,” in APSIPA, 2020, pp. 487–491.
  49. Y. Wang, X. Wang, P. Zhu, J. Wu, H. Li, H. Xue, Y. Zhang, L. Xie, and M. Bi, “Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis,” in INTERSPEECH, 2022.
  50. R. Huang, F. Chen, Y. Ren, J. Liu, C. Cui, and Z. Zhao, “Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus,” in ACM MM, 2021.
  51. S. Choi, W. Kim, S. Park, S. Yong, and J. Nam, “Children’s song dataset for singing voice research,” in ISMIR, 2020.
  52. H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,” in INTERSPEECH, 2019, pp. 1526–1530.
  53. K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  54. J. Yamagishi, C. Veaux, K. MacDonald et al., “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” CSTR, 2019.
  55. I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in ICLR, 2019.
  56. Y. Ai and Z. Ling, “APNet: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra,” TASLP, pp. 2145–2157, 2023.
  57. X. Zhang, L. Xue, Y. Wang, Y. Gu, X. Chen, Z. Fang, H. Chen, L. Zou, C. Wang, J. Han, K. Chen, H. Li, and Z. Wu, “Amphion: An Open-Source Audio, Music and Speech Generation Toolkit,” arXiv, vol. abs/2312.09911, 2023.
  58. W.-C. Huang, L. P. Violeta, S. Liu, J. Shi, Y. Yasuda, and T. Toda, “The Singing Voice Conversion Challenge 2023,” arXiv, vol. abs/2306.14422, 2023.
  59. T. Salimans and D. P. Kingma, “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks,” in NeurIPS, 2016, p. 901.
  60. A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in ICASSP, 2001, pp. 749–752.
  61. M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. C. Courville, and Y. Bengio, “Chunked Autoregressive GAN for Conditional Waveform Synthesis,” in ICLR, 2022.
Authors (5)
  1. Yicheng Gu (10 papers)
  2. Xueyao Zhang (32 papers)
  3. Liumeng Xue (24 papers)
  4. Haizhou Li (286 papers)
  5. Zhizheng Wu (45 papers)
Citations (1)

