
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation (1806.03185v1)

Published 8 Jun 2018 in cs.SD, eess.AS, and stat.ML

Abstract: Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.

Citations (563)

Summary

  • The paper introduces Wave-U-Net, a novel model that directly processes audio waveforms to overcome limitations of spectrogram-based approaches.
  • It leverages a multi-scale U-Net architecture with enhanced upsampling and difference output layers to effectively capture both magnitude and phase information.
  • Empirical results show median SDR scores competitive with a state-of-the-art spectrogram-based U-Net for singing voice and multi-instrument separation on datasets such as MUSDB and CCMixter.

Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

This research paper introduces the Wave-U-Net, a neural network model for audio source separation that operates directly in the time domain. This approach addresses several limitations inherent in frequency-domain models, such as the reliance on magnitude spectrograms and disregard for phase information. The paper provides a robust evaluation of the proposed architecture, demonstrating its potential in singing voice and multi-instrument separation tasks.

Overview of Current Challenges

The prevalent models for audio source separation mainly rely on spectrogram representations, which separate audio into magnitude and phase components. Typically, the magnitudes serve as input to parametric models, while the phase information is ignored or assumed to be consistent with the original mixture. However, this assumption is flawed, especially in the context of overlapping partials, potentially degrading separation quality. Fixed parameters in spectral transformations further constrain these models, necessitating an approach that can flexibly learn both magnitude and phase information.
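
To make the criticized setup concrete, here is a minimal sketch of the conventional spectrogram-masking pipeline: a model predicts a magnitude mask, and the mixture's phase is copied unchanged for reconstruction. The function and the trivial mask model are illustrative placeholders, not from the paper.

```python
# Conventional magnitude-masking pipeline: the model never sees phase,
# and the mixture phase is reused for reconstruction (the flawed assumption).
import numpy as np
from scipy.signal import stft, istft

def separate_with_mixture_phase(mixture, predict_mask, fs=22050, nperseg=1024):
    """Apply a magnitude mask, then reconstruct using the *mixture* phase."""
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)   # complex spectrogram
    magnitude, phase = np.abs(Z), np.angle(Z)
    mask = predict_mask(magnitude)                    # model sees magnitudes only
    est_magnitude = mask * magnitude
    # Reconstruction assumes the source phase equals the mixture phase,
    # which breaks down when partials of different sources overlap.
    _, estimate = istft(est_magnitude * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return estimate

# Example with a trivial "model" that keeps everything:
mixture = np.random.randn(22050).astype(np.float32)
vocals = separate_with_mixture_phase(mixture, predict_mask=lambda m: np.ones_like(m))
```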

The Wave-U-Net Architecture

Wave-U-Net is a one-dimensional adaptation of the U-Net architecture, specialized for time-domain audio processing. It facilitates the modeling of phase information by operating directly on audio waveforms rather than spectra. The architecture incorporates multi-scale analysis, utilizing downsampling (DS) and upsampling (US) blocks to process temporal features at varying resolutions. Key architectural enhancements include the following (a simplified code sketch follows the list):

  • Difference Output Layer: Enforces the additivity of the source signals directly in the model: only K-1 sources are predicted, and the last is obtained by subtracting their sum from the mixture.
  • Context-Aware Prediction: Avoids implicit zero-padding in convolutions; the network instead receives a larger input context than the segment it predicts, preventing the artifacts that otherwise appear at segment borders.
  • Improved Upsampling: Feature maps are upsampled with learned linear interpolation rather than transposed convolutions, whose zero-interleaving introduces high-frequency (aliasing) artifacts.
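
The following PyTorch sketch combines these ideas in simplified form. It is a reading of the summary above, not the authors' exact configuration: convolutions use 'same' padding for brevity (the paper's context-aware variant uses unpadded convolutions on larger inputs), downsampling is plain decimation, upsampling uses linear interpolation, and all layer sizes and names are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveUNet(nn.Module):
    def __init__(self, num_sources=2, depth=4, base_ch=24):
        super().__init__()
        self.down, self.up = nn.ModuleList(), nn.ModuleList()
        ch = 1
        for i in range(depth):                        # DS blocks
            self.down.append(nn.Conv1d(ch, base_ch * (i + 1), 15, padding=7))
            ch = base_ch * (i + 1)
        self.bottleneck = nn.Conv1d(ch, ch, 15, padding=7)
        for i in reversed(range(depth)):              # US blocks
            out_ch = base_ch * max(i, 1)
            self.up.append(nn.Conv1d(ch + base_ch * (i + 1), out_ch, 5, padding=2))
            ch = out_ch
        # Difference output layer: predict K-1 sources; the last is implied.
        self.out = nn.Conv1d(ch, num_sources - 1, 1)

    def forward(self, mix):                           # mix: (batch, 1, time)
        skips, x = [], mix
        for conv in self.down:
            x = F.leaky_relu(conv(x))
            skips.append(x)                           # skip connection at this scale
            x = x[:, :, ::2]                          # decimate by 2 (downsample)
        x = F.leaky_relu(self.bottleneck(x))
        for conv in self.up:
            # Linear interpolation instead of a transposed convolution,
            # avoiding the zero-interleaving that causes aliasing artifacts.
            x = F.interpolate(x, size=skips[-1].shape[-1],
                              mode='linear', align_corners=True)
            x = F.leaky_relu(conv(torch.cat([x, skips.pop()], dim=1)))
        partial = torch.tanh(self.out(x))             # K-1 estimated sources
        last = mix - partial.sum(dim=1, keepdim=True) # additivity enforced
        return torch.cat([partial, last], dim=1)

sources = WaveUNet()(torch.randn(1, 1, 16384))        # -> (1, 2, 16384)
```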

Experiments and Results

The paper presents an empirical evaluation of the Wave-U-Net on both singing voice and multi-instrument separation tasks using datasets like MUSDB and CCMixter. Key findings include:

  • Performance Metrics: Given the same training data, the Wave-U-Net achieved median SDR scores comparable to a state-of-the-art spectrogram-based U-Net.
  • Impact of Architectural Enhancements: The stereo modeling and additional predictive context significantly improved performance across various tasks.
  • Issues with Evaluation Metrics: The authors highlight shortcomings of the SDR metric, which produces extreme outliers on silent or near-silent segments, and propose reporting rank-based statistics such as the median and median absolute deviation instead (illustrated in the sketch below).
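
As a concrete illustration of that last point, the sketch below compares the mean against rank-based summaries of per-segment SDR. It uses a simplified SDR definition rather than the full BSS Eval decomposition, and all data and names are illustrative.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB (not full BSS Eval)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10 * np.log10(num / den + eps)

def summarize(refs, ests):
    """Mean vs. rank-based statistics over per-segment SDR values."""
    scores = np.array([sdr(r, e) for r, e in zip(refs, ests)])
    median = np.median(scores)
    mad = np.median(np.abs(scores - median))      # median absolute deviation
    return {"mean": scores.mean(), "median": median, "MAD": mad}

# A single near-silent reference segment drags the mean far down,
# while the median stays representative of typical performance:
rng = np.random.default_rng(0)
refs = [rng.standard_normal(1000) for _ in range(9)] + [np.zeros(1000)]
ests = [r + 0.1 * rng.standard_normal(1000) for r in refs]
print(summarize(refs, ests))   # mean is dominated by the silent-segment outlier
```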

Implications and Future Directions

The paper indicates that the Wave-U-Net successfully overcomes many obstacles presented by spectrogram-based approaches. By leveraging phase information and directly processing the raw audio, it offers improved separation quality. Additionally, the model's ability to generalize across tasks without extensive preprocessing underscores its robustness.

Future research could focus on exploring alternative loss functions beyond MSE, potentially employing adversarial training techniques to further enhance perceptual quality. Investigating larger datasets could better elucidate the strengths and weaknesses of this time-domain approach relative to traditional methods.

In conclusion, the Wave-U-Net represents a significant step forward in the field of audio source separation, with potential applications stretching across various auditory tasks in both research and industry contexts.