Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach (2306.01433v2)
Abstract: Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/
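To make the sampling procedure sketched in the abstract concrete, the minimal PyTorch example below alternates a reverse-diffusion step with a gradient update of the unknown lowpass parameters, which is the general pattern of diffusion posterior sampling with an inferred degradation operator. Everything here is an illustrative assumption rather than the authors' implementation: the score network `score_model`, the sigmoid-mask `lowpass` parametrization, and the step sizes `zeta` and `lr_theta` are all hypothetical placeholders.

```python
import torch

def lowpass(x, theta, sharpness=50.0):
    """Differentiable soft lowpass in the frequency domain.
    `theta` is a normalized cutoff in [0, 1]; this sigmoid mask is an
    assumed, simplified stand-in for the paper's filter parametrization."""
    X = torch.fft.rfft(x)
    f = torch.linspace(0.0, 1.0, X.shape[-1])
    mask = torch.sigmoid(sharpness * (theta - f))  # ~1 below cutoff, ~0 above
    return torch.fft.irfft(X * mask, n=x.shape[-1])

def blind_posterior_sampling(score_model, y, sigmas, zeta=0.3, lr_theta=1e-2):
    """Sketch of blind diffusion posterior sampling: jointly draw the
    wideband signal and iteratively infer the degradation parameter."""
    x = sigmas[0] * torch.randn_like(y)            # start from pure noise
    theta = torch.tensor(0.1, requires_grad=True)  # initial cutoff guess

    for i in range(len(sigmas) - 1):
        x = x.detach().requires_grad_(True)

        # Denoised estimate (Tweedie) from the pretrained unconditional model
        x0_hat = x + sigmas[i] ** 2 * score_model(x, sigmas[i])

        # Data-consistency loss through the *parametrized* degradation
        loss = (y - lowpass(x0_hat, theta)).pow(2).sum()
        g_x, g_theta = torch.autograd.grad(loss, (x, theta))

        # Euler step of the reverse diffusion plus posterior-sampling guidance
        d = (x - x0_hat) / sigmas[i]
        x = (x + (sigmas[i + 1] - sigmas[i]) * d - zeta * g_x).detach()

        # Refine the unknown degradation alongside the sample
        with torch.no_grad():
            theta -= lr_theta * g_theta

    return x, theta.detach()
```

Under these assumptions, `y` would be the bandlimited observation and `sigmas` a decreasing noise schedule of the pretrained diffusion model; the returned `theta` is the estimate of the unknown lowpass degradation recovered during sampling.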
Authors: Eloi Moliner, Filip Elvander, Vesa Välimäki