Why does music source separation benefit from cacophony? (2402.18407v1)
Abstract: In music source separation, a standard training data augmentation procedure is to create new training samples by randomly combining instrument stems from different songs. These random mixes have mismatched characteristics compared to real music, e.g., the different stems do not have a consistent beat or tonality, resulting in a cacophony. In this work, we investigate why random mixing is effective when training a state-of-the-art music source separation model, in spite of the apparent distribution shift it creates. Additionally, we examine why performance levels off despite potentially limitless combinations, and assess how sensitive separation performance is to differences in beat and tonality among the instrumental sources in a mixture.
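The random-mixing augmentation described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration only, not the authors' implementation: the `stem_datasets` structure, the segment length, the gain range, and the four-stem layout (vocals/bass/drums/other) are assumptions made for the example.

```python
import random
import torch

def random_mix(stem_datasets, segment_len=44100 * 6):
    """Minimal sketch of random-mixing augmentation for source separation.

    `stem_datasets` is assumed to map each source name (e.g. "vocals",
    "bass", "drums", "other") to a list of waveform tensors, one per song.
    Each training mixture/target pair is built by drawing every stem from
    an independently chosen song at a random offset, so the resulting
    mixture is generally a "cacophony" with mismatched beat and tonality.
    """
    targets = {}
    for source, songs in stem_datasets.items():
        song = random.choice(songs)                      # independent song per stem
        start = random.randint(0, song.shape[-1] - segment_len)
        segment = song[..., start:start + segment_len]
        gain = 10 ** (random.uniform(-3, 3) / 20)        # optional level perturbation in dB
        targets[source] = gain * segment
    mixture = torch.stack(list(targets.values())).sum(dim=0)
    return mixture, targets
```

Because each stem is sampled independently, the number of distinct mixtures grows combinatorially with the number of songs, which is the "potentially limitless combinations" setting the paper studies.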
Authors: Chang-Bin Jeon, Gordon Wichern, Jonathan Le Roux, François G. Germain