Distortion Recovery: A Two-Stage Method for Guitar Effect Removal (2407.16639v1)

Published 23 Jul 2024 in cs.SD and eess.AS

Abstract: Removing audio effects from electric guitar recordings facilitates post-production and sound editing. An audio distortion recovery model not only improves the clarity of the guitar sounds but also opens up new opportunities for creative adjustments in mixing and mastering. While progress has been made in creating such models, previous efforts have largely focused on synthetic distortions that may be too simplistic to capture the complexities of real-world recordings. In this paper, we tackle the task using a dataset of guitar recordings rendered with commercial-grade audio effect VST plugins. Moreover, we introduce a novel two-stage methodology for audio distortion recovery. The idea is to first process the audio signal in the Mel-spectrogram domain, and then use a neural vocoder to generate the pristine original guitar sound from the processed Mel-spectrogram. We report a set of experiments demonstrating the advantage of our approach over existing methods, using both subjective and objective evaluation metrics.

Summary

The paper introduces a two-stage method that processes the distorted signal in the Mel-spectrogram domain and then uses a neural vocoder to restore clean guitar audio.
The method outperforms prior approaches on objective metrics (e.g., lower FAD and higher SI-SDR) and in subjective listening ratings.
The authors position the work as a step toward cleaner inputs for Music Information Retrieval tasks and toward handling real-world distorted recordings, both of which they leave for future work.

An Expert Review of "Distortion Recovery"

This paper introduces a two-stage method for recovering the clean signal from audio that has passed through distortion effects, specifically electric guitar recordings. The authors combine Mel-spectrogram-domain processing with a neural vocoder to reconstruct cleaner, more faithful guitar sound than previous methods. Evaluated on recordings rendered with commercial-grade VST plugins, the approach improves on prior work in both subjective and objective evaluations.

Introduction

Electric guitar effects, particularly distortion, are prevalent across many musical genres and are central to a track's sonic character. However, they complicate Music Information Retrieval (MIR) tasks such as automatic transcription, source separation, and automatic mixing. Distortion recovery, that is, estimating the clean (dry) signal from a distorted recording, aims to remove this obstacle and make MIR processing more accurate and straightforward. Previous research has approached the task in ways similar to source separation or enhancement, but focused primarily on synthetic distortions, which lack the depth and nuance of real-world recordings.

Methodology

The authors propose a two-stage process. First, a "Mel Denoiser" operates in the Mel-spectrogram domain, mapping the Mel-spectrogram of the distorted signal to that of its undistorted counterpart. Then, a neural vocoder synthesizes the waveform of the clean guitar sound from the processed Mel-spectrogram. Because a Mel-spectrogram discards phase and fine spectral detail, this split lets the first stage concentrate on the coarse spectral content that distortion alters, while the vocoder regenerates the fine-grained detail needed for a natural-sounding waveform.
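The data flow of the two stages can be sketched in Python as a composition of three callables. This is a hypothetical illustration, not the authors' code: `TwoStageRecovery`, `toy_to_mel`, `toy_denoise`, and `toy_vocode` are invented names, and the toy stand-ins exist only so the sketch runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]
MelFrames = List[List[float]]  # indexed [frame][mel_band]


@dataclass
class TwoStageRecovery:
    """Chains the two stages; the callables are hypothetical placeholders."""
    to_mel: Callable[[Waveform], MelFrames]    # distorted waveform -> Mel-spectrogram
    denoise: Callable[[MelFrames], MelFrames]  # stage 1: "Mel Denoiser"
    vocode: Callable[[MelFrames], Waveform]    # stage 2: neural vocoder (e.g. HiFi-GAN)

    def __call__(self, distorted: Waveform) -> Waveform:
        mel = self.to_mel(distorted)
        clean_mel = self.denoise(mel)
        return self.vocode(clean_mel)


# Toy stand-ins so the sketch runs; they are NOT real models.
HOP = 4


def toy_to_mel(x: Waveform) -> MelFrames:
    # Split into non-overlapping HOP-sample frames (a real front end uses an STFT + Mel filterbank).
    return [x[i:i + HOP] for i in range(0, len(x) - HOP + 1, HOP)]


def toy_denoise(mel: MelFrames) -> MelFrames:
    # Identity: a trained Transformer would map distorted frames to clean ones here.
    return [list(frame) for frame in mel]


def toy_vocode(mel: MelFrames) -> Waveform:
    # Concatenate frames (a real vocoder generates new samples from Mel features).
    return [s for frame in mel for s in frame]


pipeline = TwoStageRecovery(toy_to_mel, toy_denoise, toy_vocode)
```

Because the only interface between the stages is a Mel-spectrogram, either stage can in principle be replaced without changing the other.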

Mel Denoiser

In the first stage, the distorted waveform is converted into a sequence of Mel-spectrogram frames. The Mel Denoiser, a Transformer-based architecture adapted from work in voice conversion and synthesis, processes these frames to estimate the Mel-spectrogram of the clean, dry signal.
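The waveform-to-Mel step can be illustrated with a pure-Python triangular Mel filterbank. This is a sketch under stated assumptions: it uses the HTK Mel formula and unnormalized triangles, whereas libraries such as librosa default to the Slaney variant, and the sample rate, FFT size, and number of Mel bands below are illustrative values, not the paper's settings.

```python
import math
from typing import List, Optional


def hz_to_mel(f: float) -> float:
    # HTK-style Mel scale.
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr: int, n_fft: int, n_mels: int,
                   fmin: float = 0.0, fmax: Optional[float] = None) -> List[List[float]]:
    """Return an n_mels x (n_fft // 2 + 1) matrix of triangular filter weights."""
    fmax = sr / 2.0 if fmax is None else fmax
    bin_freqs = [k * sr / n_fft for k in range(n_fft // 2 + 1)]
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    edges = [mel_to_hz(lo + (hi - lo) * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_freqs:
            if left <= f <= center:
                row.append((f - left) / (center - left))    # rising slope
            elif center < f <= right:
                row.append((right - f) / (right - center))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_frame(power: List[float], bank: List[List[float]],
                  eps: float = 1e-5) -> List[float]:
    """Project one frame's power spectrum onto the Mel bands, then log-compress."""
    return [math.log(max(sum(w * p for w, p in zip(row, power)), eps)) for row in bank]


# Illustrative settings only (assumed, not taken from the paper).
bank = mel_filterbank(sr=22050, n_fft=1024, n_mels=80)
```

In practice each frame's power spectrum would come from an STFT; the Mel Denoiser then operates on the resulting sequence of log-Mel frames.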

Neural Vocoder

The second stage uses the HiFi-GAN neural vocoder to reconstruct the waveform. HiFi-GAN is trained adversarially against two kinds of discriminators: a multi-period discriminator, which views the waveform as 2-D grids of samples spaced by several fixed periods in order to judge periodic structure, and a multi-scale discriminator, which examines the audio at several downsampled resolutions in order to judge longer-range structure. Conditioned on the Mel Denoiser's output, the vocoder produces a waveform intended to match the original clean guitar sound.
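The reshaping behind HiFi-GAN's multi-period discriminator can be sketched in plain Python. The helper names are hypothetical; the periods (2, 3, 5, 7, 11) and the right-side reflect padding follow the original HiFi-GAN design, while the 2-D convolutional layers that would then process each grid are omitted.

```python
from typing import List

# Periods used by the multi-period discriminator in the original HiFi-GAN.
HIFIGAN_PERIODS = (2, 3, 5, 7, 11)


def reflect_pad_right(x: List[float], n_pad: int) -> List[float]:
    """Append n_pad samples mirrored about the last sample (edge not repeated)."""
    if n_pad >= len(x):
        raise ValueError("reflect padding needs n_pad < len(x)")
    return x + [x[-2 - i] for i in range(n_pad)]


def fold_by_period(x: List[float], period: int) -> List[List[float]]:
    """Reshape a 1-D waveform of length T into a (ceil(T / period) x period) grid."""
    if len(x) % period:
        x = reflect_pad_right(x, period - len(x) % period)
    return [x[i:i + period] for i in range(0, len(x), period)]


def columns(grid: List[List[float]]) -> List[List[float]]:
    """Column j holds samples j, j + p, j + 2p, ...: the sequence a (k x 1) 2-D conv slides along."""
    return [list(col) for col in zip(*grid)]


# One grid per period; each would be fed to its own 2-D convolutional sub-discriminator.
wave = [0.1 * i for i in range(50)]
grids = {p: fold_by_period(wave, p) for p in HIFIGAN_PERIODS}
```

Using several coprime periods means each sub-discriminator sees a different set of equally spaced samples, so periodic artifacts at many different pitches can be penalized.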

Experimental Setup

The experiments used two datasets: one rendered with VST effects from Positive Grid's BIAS FX2 ToneCloud presets, and another with synthetic distortion applied via the Pedalboard library. The objective metrics were Fréchet Audio Distance (FAD), Error-to-Signal Ratio (ESR), Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), and multi-resolution STFT (MR-STFT) distance; lower is better for FAD, ESR, and MR-STFT, while higher is better for SI-SDR. In addition, Mean Opinion Scores (MOS) from expert listeners assessed subjective audio quality and how completely the distortion was removed.
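ESR and SI-SDR are simple enough to write out directly. The sketch below uses their standard definitions (ESR as error energy over target energy, without the pre-emphasis filter some works apply; SI-SDR on zero-mean signals). `tanh_distort` is a hypothetical soft-clipping stand-in for a distortion effect, not the Pedalboard or BIAS FX2 processing used in the paper. FAD and MR-STFT are left out because they need a pretrained embedding model and an STFT, respectively.

```python
import math
from typing import List


def tanh_distort(x: List[float], drive_db: float) -> List[float]:
    # Hypothetical stand-in for a distortion effect: gain, then tanh soft clipping.
    g = 10 ** (drive_db / 20)
    return [math.tanh(g * s) for s in x]


def esr(target: List[float], estimate: List[float]) -> float:
    """Error-to-signal ratio: error energy over target energy (lower is better)."""
    err = sum((t - e) ** 2 for t, e in zip(target, estimate))
    return err / sum(t * t for t in target)


def si_sdr(target: List[float], estimate: List[float]) -> float:
    """Scale-invariant SDR in dB (higher is better).

    Undefined when the estimate is an exact scaled copy of the target (zero residual).
    """
    mt = sum(target) / len(target)
    me = sum(estimate) / len(estimate)
    t = [v - mt for v in target]
    e = [v - me for v in estimate]
    alpha = sum(a * b for a, b in zip(e, t)) / sum(v * v for v in t)
    proj = [alpha * v for v in t]             # part of the estimate explained by the target
    noise = [a - b for a, b in zip(e, proj)]  # everything else
    return 10 * math.log10(sum(v * v for v in proj) / sum(v * v for v in noise))


# Usage: compare a clean 100 Hz tone with a heavily clipped version of it.
sr = 16000
clean = [0.5 * math.sin(2 * math.pi * 100 * n / sr) for n in range(sr // 10)]  # 0.1 s
heavy = tanh_distort(clean, drive_db=20.0)
print(f"ESR={esr(clean, heavy):.3f}  SI-SDR={si_sdr(clean, heavy):.1f} dB")
```

SI-SDR first rescales the target to best fit the estimate, so a recovery that is correct up to overall gain is not penalized, whereas ESR counts any level mismatch as error.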

Results

The proposed model outperformed existing methods, including Demucs V3 and DCUnet, in both subjective and objective evaluations. Notably, it achieved a lower FAD, meaning its outputs are distributed more like clean guitar recordings, and a higher SI-SDR, meaning each output matches its clean reference more closely. The subjective evaluations corroborated these findings: the proposed model received higher MOS ratings for both audio quality and dryness level, reflecting its effectiveness in removing distortion while preserving the instrument's natural character.
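For reference, FAD is the Fréchet distance between two Gaussians fitted to embeddings of clean reference audio (mean μ_r, covariance Σ_r) and of model outputs (μ_g, Σ_g). In its original formulation the embeddings come from a VGGish audio classifier; which embedding model the paper uses is not stated in this summary.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Because FAD compares distributions rather than paired signals, a low FAD says the outputs sound like clean guitar overall, while a high SI-SDR says each output matches its own clean reference; the reported gains on both are therefore complementary evidence.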

The paper also identifies directions for future work, such as extending the approach to more challenging real-world material like YouTube recordings and testing its usefulness for downstream MIR tasks.

Conclusion

In conclusion, the paper presents an effective method for recovering clean guitar signals from distorted recordings. Its two-stage approach (Mel-spectrogram processing followed by neural-vocoder reconstruction) clearly improves on prior methods. The experimental validation and the model's results point to a promising direction for both research and practical audio-processing and MIR systems.

