- The paper introduces Hybrid Transformer Demucs, employing a cross-domain Transformer Encoder to merge time and spectral data for improved music source separation.
- Performance evaluations on MUSDB show that, when trained with 800 extra songs, HT Demucs outperforms the Hybrid Demucs baseline by 0.45 dB of SDR.
- State-of-the-art results demonstrate that using sparse attention and per-source fine-tuning effectively leverages long-range context in audio processing.
Hybrid Transformers for Music Source Separation
The paper "Hybrid Transformers for Music Source Separation" investigates the applicability of Transformers to Music Source Separation (MSS). The authors address a central question: whether long-range contextual information, in addition to local acoustic features, can improve MSS performance. Transformers, known for their ability to capture and integrate information over long sequences, have proven successful in fields such as vision and natural language processing. This work introduces Hybrid Transformer Demucs (HT Demucs), leveraging the Transformer's attention mechanism to advance MSS outcomes.
Key Contributions and Findings
- Hybrid Transformer Demucs (HT Demucs): This novel architecture is a hybrid temporal/spectral bi-U-Net based on the existing Hybrid Demucs model. The authors integrate Transformer layers into the architecture, specifically targeting the innermost layers, which are replaced by a cross-domain Transformer Encoder. This setup utilizes self-attention within individual domains and cross-attention across different domains, efficiently merging time and spectral data representations.
- Performance Evaluation on MUSDB: A substantial portion of the paper is dedicated to evaluating HT Demucs against existing models. Notably, when trained solely on the MUSDB dataset, HT Demucs underperforms the original Hybrid Demucs trained under the same conditions. However, when the training set is supplemented with 800 additional songs, HT Demucs surpasses Hybrid Demucs by 0.45 dB of Signal-to-Distortion Ratio (SDR).
- State-of-the-Art Results: By employing sparse attention to extend the receptive field and applying per-source fine-tuning, HT Demucs achieves state-of-the-art performance on the MUSDB dataset, reaching an SDR of 9.20 dB with extra training data.
- Practical and Theoretical Implications: The paper offers insights into the information required for effective source separation. The superior performance of HT Demucs when given sufficient training data suggests that leveraging long-range context can improve source separation. This finding supports broader use of hybrid models in MSS and could reshape practices in audio processing and analysis.
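The cross-domain Transformer Encoder described above can be illustrated with a heavily simplified sketch: each branch first applies self-attention within its own domain, then cross-attention where one branch's features act as queries against the other branch's keys and values. This is a minimal single-head numpy illustration of the general mechanism; the shapes, function names, and residual wiring are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def cross_domain_layer(time_feats, spec_feats):
    """One simplified cross-domain step (illustrative, single-head):
    self-attention within each domain, then cross-attention across them."""
    t = attention(time_feats, time_feats, time_feats)  # self-attention, time branch
    s = attention(spec_feats, spec_feats, spec_feats)  # self-attention, spectral branch
    t = t + attention(t, s, s)  # time branch queries the spectral branch
    s = s + attention(s, t, t)  # spectral branch queries the time branch
    return t, s

rng = np.random.default_rng(0)
time_feats = rng.standard_normal((100, 64))  # (time steps, channels) -- toy sizes
spec_feats = rng.standard_normal((80, 64))   # (spectrogram frames, channels)
t_out, s_out = cross_domain_layer(time_feats, spec_feats)
print(t_out.shape, s_out.shape)  # each branch keeps its own sequence length
```

Note that cross-attention lets each domain's sequence length differ: the output of each branch keeps its own length while incorporating information from the other representation.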
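The SDR figures quoted in these evaluations measure how closely an estimated stem matches the reference. In its plain form, SDR = 10 log10(||s_ref||^2 / ||s_ref - s_est||^2). Below is a minimal sketch of that definition; note this is the basic SDR, not the windowed museval variant typically used for official MUSDB numbers.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-9):
    """Plain Signal-to-Distortion Ratio in dB:
    10 * log10(||ref||^2 / ||ref - est||^2)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps  # eps avoids division by zero
    return 10 * np.log10(num / den + eps)

rng = np.random.default_rng(0)
ref = rng.standard_normal(44100)                  # one second of a stem at 44.1 kHz (toy data)
est = ref + 0.1 * rng.standard_normal(44100)      # estimate with small additive error
print(sdr(ref, est))  # roughly 20 dB, since error power is ~1% of signal power
```

Higher is better: a 0.45 dB gain, as reported for HT Demucs over its baseline, means the residual distortion power shrinks by about 10%.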
Experimental Details
The experimental setup is structured to explore varying architectural parameters, such as Transformer depth, model dimension, and segment duration, revealing their respective impacts on separation quality and memory usage. Experiments on data augmentation show that these techniques continue to benefit models substantially even when large datasets are available.
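One augmentation used throughout the Demucs line of work is stem remixing: stems from different songs in a batch are recombined into new, synthetic mixtures. A minimal sketch of that idea follows; the array layout and function name are illustrative assumptions, not the authors' training code.

```python
import numpy as np

def remix_stems(batch, rng):
    """Create new mixtures by shuffling which song each stem comes from.
    batch: array of shape (songs, stems, samples)."""
    songs, stems, _ = batch.shape
    out = np.empty_like(batch)
    for s in range(stems):
        perm = rng.permutation(songs)
        out[:, s] = batch[perm, s]  # stem s is drawn from a randomly chosen song
    return out

rng = np.random.default_rng(0)
# Toy batch: 4 songs x 4 stems (e.g. drums, bass, other, vocals) x 1000 samples
batch = rng.standard_normal((4, 4, 1000))
augmented = remix_stems(batch, rng)
mixtures = augmented.sum(axis=1)  # synthetic training mixtures, shape (4, 1000)
```

Because the model is trained to recover each stem from the mixture, recombining stems multiplies the effective number of distinct training mixtures without requiring new recordings.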
Future Directions
This research opens avenues for further exploration of Transformer-based models in audio processing. Future work could examine how Transformers accommodate different audio representations, such as sub-band splitting in spectrogram analysis, to further improve separation quality. Moreover, combining unsupervised or semi-supervised training with large-scale datasets could further enhance model generalization, pushing the boundaries of current MSS capabilities.
In summary, the integration of Transformer technology into music source separation showcases a promising advancement. This research effectively contributes to both the practical aspects of audio engineering and the theoretical understanding of the interplay between local and global information in complex auditory settings.