- The paper introduces a novel two-stage method that combines Mel-spectrogram transformation with a neural vocoder to effectively restore clean guitar audio.
- The method significantly outperforms prior approaches, with better objective scores (e.g., lower FAD and higher SI-SDR) and better subjective ratings.
- This advancement enhances audio processing for Music Information Retrieval tasks and paves the way for improved handling of real-world distorted recordings.
An Expert Review of "Distortion Recovery"
This paper introduces a novel two-stage methodology for recovering audio signals that have been subjected to distortion effects, specifically targeting electric guitar recordings. The authors combine a Mel-spectrogram transformation with a neural vocoder to recover a cleaner, more authentic undistorted signal, a significant improvement over previous methods. Their approach, tested on audio processed with commercial-grade VST plugins, offers substantial gains in both subjective and objective evaluations.
Introduction
Electric guitar effects, particularly distortion, are prevalent across musical genres and are crucial in defining the aesthetic qualities of music. However, these effects pose significant challenges for Music Information Retrieval (MIR) tasks such as automatic transcription, source separation, and automatic mixing. Distortion recovery aims to remove these effects from recorded tracks so that MIR processing becomes more accurate and straightforward. Previous research has approached this task with methods akin to source separation or enhancement, but it focused primarily on synthetic distortions, which lack the depth and nuance found in real-world scenarios.
Methodology
The authors propose a two-stage process for distortion recovery. First, a "Mel Denoiser" operates in the Mel-spectrogram domain, transforming the distorted signal's Mel-spectrogram into its non-distorted counterpart. Then, a neural vocoder synthesizes the waveform of the clean guitar sound from the processed Mel-spectrogram. This combination aims to capture both high-level and fine-grained audio characteristics, preserving the integrity of the original signal.
Mel Denoiser
In the first stage, the distorted waveform is converted into a sequence of Mel-spectrogram frames. Using a Transformer-based architecture adapted from advances in voice conversion and synthesis, the Mel Denoiser processes these frames to approximate the Mel-spectrogram of the clean, dry signal.
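The review does not state the paper's analysis settings, so the following is only a minimal sketch of the Mel-scale framing that the first stage relies on. It assumes the common HTK-style Mel formula; the function names and the band count, sample rate, window, and hop values mentioned afterwards are illustrative assumptions, not the authors' configuration.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style Mel scale (assumed here): mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_mels + 2 evenly spaced points span [lo, hi]; the interior ones are centers.
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def frame_count(n_samples: int, win_length: int, hop_length: int) -> int:
    """Number of full analysis frames when no padding is applied."""
    if n_samples < win_length:
        return 0
    return 1 + (n_samples - win_length) // hop_length
```

Under these assumed settings, `mel_center_frequencies(80, 0.0, 11025.0)` gives the band centers for an 80-band Mel-spectrogram of 22,050 Hz audio, and `frame_count(22050, 1024, 256)` gives the length of the frame sequence the denoiser would see for one second of input.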
Neural Vocoder
The second stage employs the HiFi-GAN neural vocoder for waveform reconstruction. HiFi-GAN generates high-fidelity audio; during training, its multi-period and multi-scale discriminators push the generator to capture both periodic and large-scale audio dynamics. The vocoder converts the output of the Mel Denoiser into a waveform, bringing the processed signal closer to the original clean guitar sound.
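To illustrate what "multi-period" means here, the sketch below folds a 1D waveform into rows of a fixed period, which is the reshaping step a HiFi-GAN-style multi-period discriminator applies before its 2D convolutions. This is a simplified, pure-Python illustration of that one step, not the paper's or HiFi-GAN's actual code; the function name and the example periods are assumptions.

```python
from typing import List, Sequence


def reshape_by_period(samples: Sequence[float], period: int) -> List[List[float]]:
    """Fold a 1D waveform into rows of `period` samples.

    After folding, each column holds samples spaced `period` apart, so a
    convolution running down a column sees periodic structure in the signal.
    """
    if period < 1:
        raise ValueError("period must be positive")
    padded = list(samples)
    remainder = len(padded) % period
    if remainder:
        pad = period - remainder
        if len(padded) <= pad:
            raise ValueError("input too short for reflect padding")
        # Reflect padding: mirror the tail without repeating the last sample.
        padded.extend(padded[-2:-2 - pad:-1])
    return [padded[i:i + period] for i in range(0, len(padded), period)]
```

For instance, folding with several different small periods (say 2, 3, and 5) gives each sub-discriminator a different view of the same waveform, which is how periodic patterns at different spacings can be examined separately.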
Experimental Setup
The experiments used two datasets: one of VST-derived data produced with Positive Grid's BIAS FX2 ToneCloud presets, and another with synthetic distortion effects applied via the Pedalboard library. Objective performance was measured with Fréchet Audio Distance (FAD), Error-to-Signal Ratio (ESR), Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), and multi-resolution STFT (MR-STFT) distance. In addition, expert listeners provided Mean Opinion Scores (MOS) for subjective audio quality and for the effectiveness of distortion removal.
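Two of these objective metrics are simple enough to write out. The sketch below uses the commonly cited definitions of ESR and SI-SDR on plain Python lists; it is meant to show what the numbers measure, and it is an assumption that the paper's evaluation code follows exactly these formulas (for example, it may add pre-emphasis filtering or mean removal, which the review does not specify).

```python
import math
from typing import Sequence


def esr(target: Sequence[float], estimate: Sequence[float]) -> float:
    """Error-to-Signal Ratio: residual energy over target energy (lower is better)."""
    energy = sum(t * t for t in target)
    if energy == 0.0:
        raise ValueError("target has zero energy")
    error = sum((t - e) ** 2 for t, e in zip(target, estimate))
    return error / energy


def si_sdr(target: Sequence[float], estimate: Sequence[float]) -> float:
    """Scale-Invariant SDR in dB (higher is better)."""
    energy = sum(t * t for t in target)
    if energy == 0.0:
        raise ValueError("target has zero energy")
    # Project the estimate onto the target to find the optimal scaling.
    alpha = sum(t * e for t, e in zip(target, estimate)) / energy
    projection = [alpha * t for t in target]
    signal = sum(p * p for p in projection)
    noise = sum((e - p) ** 2 for e, p in zip(estimate, projection))
    if noise == 0.0:
        return math.inf  # estimate is an exact scaled copy of the target
    return 10.0 * math.log10(signal / noise)
```

The difference between the two is worth noting: ESR penalizes any gain mismatch between the recovered and reference signals, whereas SI-SDR ignores overall gain and only measures the part of the estimate that cannot be explained by rescaling the target.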
Results
The proposed model significantly outperformed existing methods, including Demucs V3 and DCUnet, in both subjective and objective evaluations. Notably, it achieved a lower FAD and a higher SI-SDR, indicating that it recovers the clean signal more accurately. The subjective evaluations corroborated these findings: the proposed model received higher MOS ratings for both audio quality and dryness level, reflecting its effectiveness in removing distortion while preserving the sound's natural characteristics.
The paper also identifies areas for future work, such as extending the approach to more challenging real-world settings like YouTube recordings and exploring the model's applicability to downstream MIR tasks.
Conclusion
This paper presents a robust and effective methodology for recovering distorted guitar recordings. The two-stage approach, which pairs Mel-spectrogram processing with neural vocoder-based reconstruction, marks a significant improvement over prior methods. The authors' detailed experimental validation and the superior performance of their model point to a promising direction for both academic research and practical applications in audio processing and MIR systems, making the work a valuable contribution to the field.