Mel-Band RoFormer for Music Source Separation (2310.01809v1)
Abstract: Recently, multi-band spectrogram-based approaches such as Band-Split RNN (BSRNN) have demonstrated promising results for music source separation. In our recent work, we introduced the BS-RoFormer model, which inherits the band-split scheme of BSRNN at the front-end and then uses a hierarchical Transformer with Rotary Position Embedding (RoPE) to model the inner-band and inter-band sequences for multi-band mask estimation. This model has achieved state-of-the-art performance, but its band-split scheme is defined empirically, without analytical support from the literature. In this paper, we propose Mel-RoFormer, which adopts a Mel-band scheme that maps the frequency bins into overlapping subbands according to the mel scale. In contrast, the band-split mapping in BSRNN and BS-RoFormer is non-overlapping and designed heuristically. Using the MUSDB18HQ dataset for experiments, we demonstrate that Mel-RoFormer outperforms BS-RoFormer on the separation tasks of vocals, drums, and other stems.
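The key difference from the band-split scheme is that the Mel-band mapping lets adjacent subbands share frequency bins. As a rough, hypothetical illustration of how such an overlapping mapping can be derived, the sketch below groups STFT bins by the nonzero support of a standard librosa mel filterbank; the sample rate, FFT size, and band count are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch of an overlapping mel-band mapping: each band is the
# set of STFT bins where a triangular mel filter is nonzero, so adjacent
# bands share bins. Parameter values below are illustrative only.
import numpy as np
import librosa

def mel_band_indices(sr=44100, n_fft=2048, n_bands=60):
    """Return, for each mel band, the STFT frequency-bin indices it covers."""
    # Triangular mel filterbank, shape (n_bands, n_fft // 2 + 1).
    fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_bands)
    # A bin belongs to a band wherever that band's filter weight is nonzero.
    return [np.flatnonzero(weights > 0) for weights in fb]

bands = mel_band_indices()
# Adjacent triangular filters overlap, so neighbouring bands share bins,
# unlike the non-overlapping band-split mapping in BSRNN / BS-RoFormer.
print(len(bands))
print(bool(np.intersect1d(bands[10], bands[11]).size))  # True: bands overlap
```

Such overlapping bin groups would then replace the heuristic, non-overlapping band boundaries at the band-split front-end, while the rest of the RoPE Transformer pipeline stays unchanged.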
- Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, “An overview of lead and accompaniment separation in music,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 26, no. 8, pp. 1307–1335, 2018.
- Y. Mitsufuji, G. Fabbro, S. Uhlich, F.-R. Stöter, A. Défossez, M. Kim, W. Choi, C.-Y. Yu, and K.-W. Cheuk, “Music demixing challenge 2021,” Frontiers in Signal Processing, 2022.
- A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, “The 2016 signal separation evaluation campaign,” in 13th International Conference on Latent Variable Analysis and Signal Separation, 2017, pp. 323–332.
- Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “The MUSDB18 corpus for music separation,” Dec. 2017, https://doi.org/10.5281/zenodo.1117372.
- P. Chandna, M. Miron, J. Janer, and E. Gómez, “Monoaural audio source separation using deep convolutional neural networks,” in Latent Variable Analysis and Signal Separation (LVA/ICA), 2017, pp. 258–266.
- Q. Kong, Y. Cao, H. Liu, K. Choi, and Y. Wang, “Decoupling magnitude and phase estimation with deep ResUNet for music source separation,” in ISMIR, 2021.
- A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-Net convolutional networks,” in ISMIR, 2017.
- Y. Luo and J. Yu, “Music Source Separation With Band-Split RNN,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 1893–1901, 2023.
- W.-T. Lu, J.-C. Wang, Q. Kong, and Y.-N. Hung, “Music source separation with Band-Split RoPE Transformer,” arXiv preprint arXiv:2309.02612, 2023.
- G. Fabbro, S. Uhlich, C. Lai, W. Choi, M. Martinez-Ramirez, W. Liao, I. Gadelha, G. Ramos, E. Hsu, H. Rodrigues et al., “The sound demixing challenge 2023–music demixing track,” arXiv preprint arXiv:2308.06979, 2023.
- S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for the measurement of the psychological magnitude pitch,” The Journal of the Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937.
- A. Défossez, “Hybrid spectrogram and waveform source separation,” arXiv preprint arXiv:2111.03600, 2021.
- M. Kim and J. H. Lee, “Sound demixing challenge 2023–music demixing track technical report,” arXiv preprint arXiv:2306.09382, 2023.
- B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference, vol. 8, 2015, pp. 18–25.
- M. Slaney, “Auditory toolbox,” Interval Research Corporation, Tech. Rep., vol. 10, no. 1998, p. 1194, 1998.
- E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio Speech Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
- F.-R. Stöter, A. Liutkus, and N. Ito, “The 2018 signal separation evaluation campaign,” in LVA/ICA, 2018, pp. 293–305.
- W.-T. Lu, J.-C. Wang, M. Won, K. Choi, and X. Song, “SpecTNT: A time-frequency transformer for music audio,” in ISMIR, 2021.
- W.-T. Lu, J.-C. Wang, and Y.-N. Hung, “Multitrack music transcription with a time-frequency perceiver,” in IEEE ICASSP, 2023.
- Y.-N. Hung, J.-C. Wang, X. Song, W.-T. Lu, and M. Won, “Modeling beats and downbeats with a time-frequency transformer,” in IEEE ICASSP, 2022, pp. 401–405.
- J.-C. Wang, Y.-N. Hung, and J. B. Smith, “To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions,” in IEEE ICASSP, 2022, pp. 416–420.