MusicHiFi: Fast High-Fidelity Stereo Vocoding (2403.10493v4)

Published 15 Mar 2024 in cs.SD, eess.AS, and eess.SP

Abstract: Diffusion-based audio and music generation models commonly perform generation by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their usefulness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth extension, and upmixes to stereophonic audio. Compared to past work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.

MusicHiFi: A New Frontier in High-Fidelity Stereo Vocoding

Introduction

High-quality vocoding remains a significant challenge in music generation and audio processing. Despite recent advances, typical vocoders produce monophonic audio at lower sampling rates (e.g., 16-24 kHz), which limits their usefulness. MusicHiFi, an efficient high-fidelity stereophonic vocoder, addresses this gap: using a cascade of three generative adversarial networks (GANs), it transforms low-resolution mel-spectrograms into high-resolution stereophonic audio. Compared to prior methods, it offers comparable or better audio quality, better spatialization control, and significantly faster inference.
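At a high level, inference chains the three stages: the vocoder turns a mel-spectrogram into a low-rate mono waveform, the bandwidth-extension stage raises it to full resolution, and the upmixer produces the stereo output. A hypothetical sketch of this flow (the function names are placeholders, not the released API):

```python
def musichifi_infer(mel, vocoder, bwe, m2s):
    """Hypothetical inference flow for the three-stage cascade."""
    mono_low = vocoder(mel)    # stage 1: mel-spectrogram -> low-rate mono waveform
    mono_full = bwe(mono_low)  # stage 2: bandwidth extension to the full sampling rate
    stereo = m2s(mono_full)    # stage 3: mono -> stereo upmix
    return stereo
```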

Methodology

MusicHiFi employs a unified approach across its three stages: vocoding, bandwidth extension (BWE), and mono-to-stereo upmixing (M2S). Each stage utilizes a GAN-based generator and discriminator architecture, with adaptations to meet the specific requirements of each task.

  • Vocoding (MusicHiFi-V): Converts low-resolution mel-spectrograms into audio waveforms using the shared GAN-based generator and discriminator design.
  • Bandwidth Extension (MusicHiFi-BWE): Transforms low-resolution audio into high-resolution output. A residual connection around an upsampling step lets the module focus on generating only the missing high-frequency content (see the residual sketch after this list).
  • Mono-to-Stereo Upmixing (MusicHiFi-M2S): Uses mid-side encoding to produce stereo audio from a mono input, which preserves the original monophonic content in the downmix and gives direct control over the spatial width of the audio (see the mid-side sketch after this list).
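To make the BWE residual design concrete, the sketch below upsamples the low-rate waveform with a fixed resampler and lets a learned generator add only the missing high-frequency content. This is a minimal sketch under assumptions, not the paper's implementation: the generator module, the sampling rates, and the use of torchaudio's resampler are illustrative choices.

```python
import torch
import torch.nn as nn
import torchaudio


class ResidualBWE(nn.Module):
    """Minimal sketch of a residual bandwidth-extension stage (not the paper's code)."""

    def __init__(self, generator: nn.Module, in_rate: int = 22050, out_rate: int = 44100):
        super().__init__()
        # Fixed DSP resampler for the skip path; `generator` stands in for
        # whatever GAN generator predicts the high-band residual.
        self.upsample = torchaudio.transforms.Resample(in_rate, out_rate)
        self.generator = generator

    def forward(self, x_low: torch.Tensor) -> torch.Tensor:
        x_up = self.upsample(x_low)      # existing low-band content at the target rate
        residual = self.generator(x_up)  # learned high-frequency detail only
        return x_up + residual           # residual connection around the generator
```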
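The downmix-compatibility of the M2S stage follows directly from mid-side decoding: if the model predicts only a side signal and left/right are formed as mid ± side, averaging the two channels returns the mono input exactly. A minimal numpy sketch, where the random `side` signal stands in for the model's prediction and the `width` parameter is an illustrative knob for spatial width:

```python
import numpy as np


def upmix_mid_side(mono: np.ndarray, side: np.ndarray, width: float = 1.0) -> np.ndarray:
    """Mid-side decode: left/right from a mono (mid) signal and a predicted side signal."""
    s = width * side                       # scaling the side channel controls stereo width
    return np.stack([mono + s, mono - s])  # shape (2, num_samples)


# Downmix-compatibility check: the channel average recovers the mono input,
# because the side contributions cancel.
mono = np.random.randn(44100).astype(np.float32)
side = 0.1 * np.random.randn(44100).astype(np.float32)  # stand-in for the model's side output
stereo = upmix_mid_side(mono, side, width=0.8)
assert np.allclose(stereo.mean(axis=0), mono, atol=1e-5)
```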

Experiment and Results

MusicHiFi was evaluated against standard benchmarks and baselines using objective metrics and subjective listening tests. For vocoding, it outperformed baselines on Mel-D, STFT-D, and ViSQOL while matching them on SI-SDR, with significantly faster inference. The BWE module performed on par with or better than AERO and markedly better than AudioSR, while running hundreds of times faster than the baseline models. The M2S module outperformed conventional DSP-based decorrelation methods in objective assessments, demonstrating its efficiency in producing high-quality stereo audio.
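For context, SI-SDR (the scale-invariant signal-to-distortion ratio of Le Roux et al., reference 37) can be computed as below. This is the standard definition, not the paper's evaluation code.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference so a global gain change has no effect.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference  # "signal" part of the estimate
    noise = estimate - target   # everything else counts as distortion
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))
```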

Implications and Future Directions

MusicHiFi represents a notable advance in stereo vocoding, offering an efficient, high-quality solution for audio and music generation tasks. Its design addresses key challenges in the field: generation speed, audio quality, and spatialization control. Looking ahead, the model can be paired with mel-spectrogram-based music generators, used to enhance the fidelity of low-resolution recordings, or used to spatialize monophonic music. Furthermore, the unified GAN-based architecture offers a robust framework that could inform future work in audio processing and generative modeling.

Conclusion

MusicHiFi opens new avenues for generating high-fidelity stereophonic audio. By leveraging a cascaded GAN approach, it efficiently transforms low-resolution mel-spectrograms into high-quality stereophonic audio, with comparable or better audio quality, better spatialization control, and faster inference than existing methods. Its implementation and evaluation underscore both its immediate applicability and its promise as a basis for future work in audio and music generation.

References (40)
  1. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Neural Information Processing Systems (NeurIPS), 2020.
  2. A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” 2022. [Online]. Available: https://arxiv.org/abs/2204.06125
  3. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  4. S. Forsgren and H. Martiros, “Riffusion - Stable diffusion for real-time music generation,” 2022. [Online]. Available: https://riffusion.com/about
  5. H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” in International Conference on Machine Learning (ICML), 2023.
  6. Q. Huang, D. S. Park, T. Wang, T. I. Denk, A. Ly, N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. Frank et al., “Noise2Music: Text-conditioned music generation with diffusion models,” 2023. [Online]. Available: https://arxiv.org/abs/2302.03917
  7. C. Hawthorne, I. Simon, A. Roberts, N. Zeghidour, J. Gardner, E. Manilow, and J. Engel, “Multi-instrument music synthesis with spectrogram diffusion,” in International Society for Music Information Retrieval (ISMIR), 2022.
  8. K. Chen, Y. Wu, H. Liu, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov, “MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,” 2023. [Online]. Available: https://arxiv.org/abs/2308.01546
  9. S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan, “Music ControlNet: Multiple time-varying controls for music generation,” 2023. [Online]. Available: https://arxiv.org/abs/2311.07069
  10. Y. Wang, Z. Ju, X. Tan, L. He, Z. Wu, J. Bian et al., “AUDIT: Audio editing by following instructions with latent diffusion models,” Neural Information Processing Systems (NeurIPS), 2024.
  11. Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “DITTO: diffusion inference-time T-optimization for music generation,” 2024. [Online]. Available: https://arxiv.org/abs/2401.12179
  12. Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A versatile diffusion model for audio synthesis,” in International Conference on Learning Representations (ICLR), 2020.
  13. K. Kumar, R. Kumar, T. De Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. De Brebisson, Y. Bengio, and A. C. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,” Neural Information Processing Systems (NeurIPS), 2019.
  14. J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Neural Information Processing Systems (NeurIPS), 2020.
  15. S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” in International Conference on Learning Representations (ICLR), 2023.
  16. J. Su, Y. Wang, A. Finkelstein, and Z. Jin, “Bandwidth extension is all you need,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
  17. S. Han and J. Lee, “NU-Wave 2: A general neural audio upsampling model for various sampling rates,” in Interspeech, 2022.
  18. K. Zhang, Y. Ren, C. Xu, and Z. Zhao, “WSRGlow: A glow-based waveform generative model for audio super-resolution,” in Interspeech, 2021.
  19. S. E. Eskimez and K. Koishida, “Speech super resolution generative adversarial network,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
  20. R. Kumar, K. Kumar, V. Anand, Y. Bengio, and A. Courville, “NU-GAN: High resolution neural upsampling with GAN,” 2020. [Online]. Available: https://arxiv.org/abs/2010.11362
  21. S. Hu, B. Zhang, B. Liang, E. Zhao, and S. Lui, “Phase-aware music super-resolution using generative adversarial networks,” in Interspeech, 2020.
  22. M. Mandel, O. Tal, and Y. Adi, “AERO: Audio super resolution in the spectral domain,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  23. H. Liu, K. Chen, Q. Tian, W. Wang, and M. D. Plumbley, “AudioSR: Versatile audio super-resolution at scale,” 2023. [Online]. Available: https://arxiv.org/abs/2309.07314
  24. J. Serrà, D. Scaini, S. Pascual, D. Arteaga, J. Pons, J. Breebaart, and G. Cengarle, “Mono-to-stereo through parametric stereo generation,” in International Society for Music Information Retrieval (ISMIR), 2023.
  25. J. D. Johnston and A. J. Ferreira, “Sum-difference stereo transform coding,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1992.
  26. R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Neural Information Processing Systems (NeurIPS), 2023.
  27. L. Ziyin, T. Hartwig, and M. Ueda, “Neural networks fail to learn periodic functions and how to fix it,” Neural Information Processing Systems (NeurIPS), 2020.
  28. N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2021.
  29. A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” 2022. [Online]. Available: https://arxiv.org/abs/2210.13438
  30. X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in IEEE International Conference on Computer Vision (ICCV), 2017.
  31. M. R. Schroeder, “An artificial stereophonic effect obtained from a single audio signal,” Journal of the Audio Engineering Society (JAES), 1958.
  32. D. Fitzgerald, “Upmixing from mono - a source separation approach,” in IEEE International Conference on Digital Signal Processing (DSP), 2011.
  33. s3a spatialaudio, “s3a decorrelator,” https://github.com/s3a-spatialaudio/s3a-decorrelation-toolbox, 2019.
  34. M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” in International Society for Music Information Retrieval (ISMIR), 2017.
  35. A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, “The 2016 signal separation evaluation campaign,” in International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), P. Tichavský, M. Babaie-Zadeh, O. J. Michel, and N. Thirion-Moreau, Eds. Springer International Publishing, 2017.
  36. A. Hines, J. Skoglund, A. C. Kokaram, and N. Harte, “ViSQOL: an objective speech quality model,” EURASIP Journal on Audio, Speech, and Music Processing, 2015.
  37. J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
  38. ITU-R, “Recommendation ITU-R BS.1534-3,” 2014.
  39. N. Jillings, D. Moffat, B. De Man, and J. D. Reiss, “Web Audio Evaluation Tool: A browser-based listening test environment,” in Sound and Music Computing Conference, 2015.
  40. S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, 1979.
Authors (4)
  1. Ge Zhu (17 papers)
  2. Juan-Pablo Caceres (5 papers)
  3. Zhiyao Duan (53 papers)
  4. Nicholas J. Bryan (23 papers)
Citations (4)