
Solving Audio Inverse Problems with a Diffusion Model (2210.15228v3)

Published 27 Oct 2022 in eess.AS and cs.SD

Abstract: This paper presents CQT-Diff, a data-driven generative audio model that can, once trained, be used for solving various different audio inverse problems in a problem-agnostic setting. CQT-Diff is a neural diffusion model with an architecture that is carefully constructed to exploit pitch-equivariant symmetries in music. This is achieved by preconditioning the model with an invertible Constant-Q Transform (CQT), whose logarithmically-spaced frequency axis represents pitch equivariance as translation equivariance. The proposed method is evaluated with objective and subjective metrics in three different and varied tasks: audio bandwidth extension, inpainting, and declipping. The results show that CQT-Diff outperforms the compared baselines and ablations in audio bandwidth extension and, without retraining, delivers competitive performance against modern baselines in audio inpainting and declipping. This work represents the first diffusion-based general framework for solving inverse problems in audio processing.

Citations (44)

Summary

  • The paper introduces CQT-Diff, a diffusion-based model that leverages CQT preconditioning to address diverse audio inverse problems.
  • It employs a U-Net architecture with dilated convolutions in the frequency domain, outperforming baselines in both objective and subjective metrics.
  • The approach demonstrates robust versatility in real-world audio restoration tasks, including bandwidth extension, inpainting, and declipping.

Solving Audio Inverse Problems with a Diffusion Model

The paper "Solving Audio Inverse Problems with a Diffusion Model" introduces CQT-Diff, a diffusion-based neural model for audio restoration. The model addresses audio inverse problems in a problem-agnostic manner, handling tasks such as bandwidth extension, inpainting, and declipping across varied degradation scenarios. Notably, CQT-Diff employs an invertible Constant-Q Transform (CQT) as a preconditioning step, whose logarithmically spaced frequency axis turns the pitch-equivariant symmetries of music into translation equivariance that the network can exploit.

Model Architecture and Methodology

CQT-Diff builds on diffusion models, which have proven effective across multiple data modalities, including images and audio. Its architecture exploits the invertibility and pitch equivariance provided by the CQT. This transform is well suited to music because of its logarithmic frequency representation, which turns pitch shifts into translations along the frequency axis.
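The log-frequency property can be checked numerically. The sketch below is illustrative only (the parameter values `f_min` and `B` are assumptions, not the paper's settings): with CQT center frequencies spaced logarithmically, a pitch shift in semitones corresponds exactly to a translation by a fixed number of bins.

```python
import numpy as np

# Center frequencies of a CQT with B bins per octave: f_k = f_min * 2**(k / B).
# On this logarithmic axis, a pitch shift of s semitones (with B = 12, one bin
# per semitone) is exactly a translation by s bins -- the equivariance that
# CQT-Diff's architecture is designed to exploit.
f_min = 32.70  # C1 in Hz (illustrative choice)
B = 12         # bins per octave
k = np.arange(4 * B)               # four octaves of bins
freqs = f_min * 2.0 ** (k / B)

# Shifting up by 7 semitones (a perfect fifth) multiplies frequency by
# 2**(7/12), which on this axis is the same as indexing 7 bins higher.
shift = 7
np.testing.assert_allclose(freqs[shift:], freqs[:-shift] * 2.0 ** (shift / B))
```

Because pitch shifts become translations, ordinary (translation-equivariant) convolutions along the CQT frequency axis inherit pitch equivariance for free.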

The paper adopts an unconditional diffusion model as a generative prior, so different audio restoration tasks can be handled without retraining. Conditioning is performed at sampling time through techniques such as data consistency and reconstruction guidance, allowing the model to adapt to new problems seamlessly. The neural network itself is a U-Net with dilated convolutions in the frequency domain, which helps it capture harmonic signal structures and maintain high-quality audio generation.
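The reconstruction-guidance idea can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: `denoise` is a dummy stand-in for the trained score model, the degradation is restricted to a linear operator so the guidance gradient is analytic, and the Euler update, step sizes, and `guidance_weight` are illustrative assumptions.

```python
import numpy as np

def denoise(x_t, sigma):
    # Placeholder for the diffusion model's denoised estimate x_hat(x_t, sigma).
    return x_t

def guided_step(x_t, sigma, sigma_next, A, y, guidance_weight=1.0):
    """One toy reverse-diffusion step steered toward the observation y = A @ x."""
    x_hat = denoise(x_t, sigma)
    # Gradient of 0.5 * ||y - A x_hat||^2 w.r.t. x_hat; reconstruction guidance
    # nudges the denoised estimate toward consistency with the observation.
    grad = A.T @ (A @ x_hat - y)
    x_hat = x_hat - guidance_weight * grad
    # Plain Euler update of the probability-flow ODE toward noise level sigma_next.
    d = (x_t - x_hat) / sigma
    return x_t + (sigma_next - sigma) * d

# Inpainting-style example: only the first half of a length-8 signal is observed.
rng = np.random.default_rng(0)
A = np.eye(8)[:4]            # masking operator keeping samples 0..3
y = A @ np.ones(8)           # observed (clean) samples
x = rng.normal(size=8)       # current noisy iterate
x = guided_step(x, sigma=1.0, sigma_next=0.5, A=A, y=y)
```

The appeal of this scheme, as the summary notes, is that only the degradation operator changes between tasks; the prior (the trained diffusion model) stays fixed.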

Experimental Results and Evaluation

CQT-Diff exhibits substantial advancements over existing methods across various audio restoration tasks:

  1. Bandwidth Extension: The model shows superior performance on objective metrics such as Log-Spectral Distance (LSD) and Fréchet Audio Distance (FAD), as well as on subjective MUSHRA scores. Notably, CQT-Diff performed best with the CQT representation, outperforming alternatives based on the STFT or raw waveforms.
  2. Audio Inpainting: In tasks involving long audio segment restoration, CQT-Diff, through reconstruction guidance, provided coherent and plausible results, outperforming alternative methods like GACELA and Catch-A-Waveform.
  3. Declipping: CQT-Diff leveraged its generative capabilities to handle severe clipping, outperforming baselines such as A-SPADE and SS-PEW in low Signal-to-Distortion Ratio (SDR) conditions.
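To make the declipping task concrete, the sketch below shows the hard-clipping degradation and a simple data-consistency step that re-imposes the reliably observed samples on an estimate. This is an assumption-laden simplification: the helper names are hypothetical, and the paper's guidance-based conditioning is more elaborate than this direct projection.

```python
import numpy as np

def clip(x, c):
    # Hard-clipping degradation: samples outside [-c, c] are saturated.
    return np.clip(x, -c, c)

def data_consistency(x_est, y, c):
    """Re-impose unclipped observations on an estimate (toy consistency step)."""
    reliable = np.abs(y) < c           # samples that were not saturated
    out = x_est.copy()
    out[reliable] = y[reliable]        # trust the observation where it is valid
    return out

c = 0.5
x_clean = np.array([0.2, 0.9, -0.7, 0.1])
y = clip(x_clean, c)                         # -> [0.2, 0.5, -0.5, 0.1]
x_est = data_consistency(np.zeros(4), y, c)  # restores the reliable 0.2 and 0.1
```

Only the saturated samples (here 0.9 and -0.7) are genuinely missing; the generative prior's job is to resynthesize them consistently with the surviving signal.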

Conclusion and Future Implications

This work demonstrates the practicality of diffusion-based models in audio applications, presenting CQT-Diff as a flexible and powerful tool for diverse restoration tasks. Its problem-agnostic approach suggests broader applicability to other audio restoration problems, particularly those that lack paired training data.

The research motivates further exploration of inductive biases for diffusion models through time-frequency representations such as the CQT. Future work could extend the model to a wider range of audio data, potentially generalizing to non-music signals and increasing its utility in real-world scenarios. The advances outlined in this paper make a substantial contribution to audio restoration methodology by combining the strengths of diffusion processes with well-chosen data representations.
