Why does music source separation benefit from cacophony? (2402.18407v1)

Published 28 Feb 2024 in eess.AS

Abstract: In music source separation, a standard training data augmentation procedure is to create new training samples by randomly combining instrument stems from different songs. These random mixes have mismatched characteristics compared to real music, e.g., the different stems do not have consistent beat or tonality, resulting in a cacophony. In this work, we investigate why random mixing is effective when training a state-of-the-art music source separation model in spite of the apparent distribution shift it creates. Additionally, we examine why performance levels off despite potentially limitless combinations, and examine the sensitivity of music source separation performance to differences in beat and tonality of the instrumental sources in a mixture.

Authors (4)
  1. Chang-Bin Jeon (7 papers)
  2. Gordon Wichern (51 papers)
  3. Jonathan Le Roux (82 papers)
  4. François G. Germain (12 papers)
Citations (2)

Summary

  • The paper demonstrates that random mixing, despite its cacophonous nature, significantly enhances MSS performance by broadening training data.
  • It employs the TFC-TDF-UNet v3 architecture to contrast models trained on original versus randomly mixed audio, revealing key performance differences.
  • The study finds that disruptions in beat and tonality contribute to superior separation quality, prompting fresh perspectives on data augmentation in MSS.

Exploring the Efficacy of Cacophony in Music Source Separation Training

Introduction to the Study

The landscape of Music Source Separation (MSS) has been significantly transformed by developments in deep learning, with novel models achieving remarkable performance enhancements. Behind these advancements, however, lies an often-overlooked element: the role of data augmentation techniques, specifically random mixing, in improving model training. This paper explores why random mixing, which introduces a degree of cacophony and thus a shift away from realistic music distributions, continues to be an effective strategy for training MSS models. The authors aim to dissect the influence of random mixing on model performance, investigate the implications of limitless data combinations, and assess the impact of beat and tonality consistency on MSS outcomes.

Data Augmentation in MSS

Random mixing, a technique that generates new training samples by arbitrarily combining audio stems from different songs, introduces a notable discord: the resulting mixtures typically lack a cohesive beat or tonality and sound cacophonous to the human ear. Despite this, the practice has gained traction within the MSS research community for its perplexing ability to enhance model performance. By evaluating the state-of-the-art TFC-TDF-UNet v3 architecture within the 4-stem MSS framework, the paper confirms the conventional method's efficacy and probes the underlying mechanisms of its success.
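The mechanics of the augmentation are straightforward to reproduce. The sketch below is a minimal illustration, assuming each song is stored as a dictionary of equal-length mono stem arrays in the common 4-stem layout (vocals, bass, drums, other); the helper name `random_mix` and the sampling details are illustrative rather than the paper's exact pipeline.

```python
import numpy as np

STEMS = ["vocals", "bass", "drums", "other"]

def random_mix(songs, segment_len, rng=None):
    """Create one training example by drawing each stem from a
    (possibly) different song and summing the segments.

    songs: list of dicts mapping stem name -> mono np.ndarray,
           each array assumed longer than segment_len.
    Returns (mixture, targets), where targets maps stem name -> segment.
    """
    rng = rng or np.random.default_rng()
    targets = {}
    for stem in STEMS:
        song = songs[rng.integers(len(songs))]                # independent song per stem
        start = rng.integers(len(song[stem]) - segment_len)   # independent offset per stem
        targets[stem] = song[stem][start:start + segment_len]
    mixture = sum(targets.values())                           # the "cacophonous" mixture
    return mixture, targets
```

Because the stems are drawn independently, the number of distinct mixtures grows combinatorially with the number of songs, which is what makes the augmentation effectively unlimited in principle.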

Experimental Insights

The researchers structure their experiments to isolate the impact of random mixing, comparing the training dynamics of models exposed to varying ratios of original versus randomly mixed data. Surprisingly, they observe that models trained exclusively on random mixes outperform those trained solely on original data by significant margins, and that adding a small proportion of original mixes to the training data makes little further difference. These findings challenge the intuitive expectation that closer adherence to realistic music distributions would yield better MSS performance.
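Those ratio experiments can be thought of as a single probability that decides, for each training example, whether the stems come from one song (an original mix) or from independently sampled songs. A hedged sketch of that switch, reusing the hypothetical `random_mix` helper and `STEMS` list from the previous snippet; the parameter name `p_random` is an assumption, not the paper's notation.

```python
def sample_training_example(songs, segment_len, p_random=1.0, rng=None):
    """With probability p_random return a cross-song random mix;
    otherwise cut temporally aligned stems from a single song."""
    rng = rng or np.random.default_rng()
    if rng.random() < p_random:
        return random_mix(songs, segment_len, rng)
    song = songs[rng.integers(len(songs))]
    start = rng.integers(len(song[STEMS[0]]) - segment_len)   # one offset shared by all stems
    targets = {s: song[s][start:start + segment_len] for s in STEMS}
    return sum(targets.values()), targets
```

Setting `p_random` to 0 reproduces training on original mixes only, while 1 corresponds to the fully random-mix regime the paper finds most effective.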

Impact of Beat and Tonality Consistency

An intriguing aspect of the paper is its examination of how deviations in beat and tonality affect MSS performance. Models trained on mixtures with inconsistent beats or tonalities demonstrate improved separation capabilities, suggesting that such disparities contribute positively to the learning process. This is further corroborated by experiments showing that intentional timing and pitch modifications during training bolster model effectiveness, underscoring the importance of introducing variability in these domains.
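One way to picture those timing and pitch modifications is as independent per-stem perturbations applied before mixing. The sketch below uses librosa's `pitch_shift` and `time_stretch`; the perturbation ranges are illustrative assumptions, not the settings reported in the paper.

```python
import librosa
import numpy as np

def perturb_stem(y, sr, rng=None):
    """Randomly shift the pitch and stretch the tempo of a single stem,
    deliberately breaking tonal and rhythmic agreement with the other stems."""
    rng = rng or np.random.default_rng()
    n_steps = rng.uniform(-2.0, 2.0)   # semitone shift, illustrative range
    rate = rng.uniform(0.9, 1.1)       # tempo factor, illustrative range
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    y = librosa.effects.time_stretch(y, rate=rate)
    # time stretching changes the length, so stems must be re-cropped to a
    # common segment length before they are summed into a mixture
    return y
```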

Conclusions and Potential Directions

The research establishes that the benefit of random mixing in MSS is twofold: it not only provides an expanded range of training data but also introduces beneficial inconsistencies in beat and tonality. This challenges the traditional view of data augmentation’s role and opens up avenues for exploring more nuanced applications of cacophony in AI-driven music processing tasks. Looking ahead, the authors propose extending their findings to larger datasets and investigating structured learning approaches that leverage both random and original mixes, signaling a continued evolution in the methodology of MSS research.

In summary, this work sheds light on the unexpectedly positive impact of cacophony in MSS training, offering a fresh perspective on the interplay between data augmentation and model performance. By breaking down the intricacies of this phenomenon, the paper paves the way for refining training strategies and enhancing the future development of MSS technologies.
