Mel-Band RoFormer for Music Source Separation (2310.01809v1)

Published 3 Oct 2023 in cs.SD and eess.AS

Abstract: Recently, multi-band spectrogram-based approaches such as Band-Split RNN (BSRNN) have demonstrated promising results for music source separation. In our recent work, we introduced the BS-RoFormer model, which inherits the band-split scheme of BSRNN at the front-end and then uses a hierarchical Transformer with Rotary Position Embedding (RoPE) to model the inner-band and inter-band sequences for multi-band mask estimation. This model has achieved state-of-the-art performance, but its band-split scheme is defined empirically, without analytical support from the literature. In this paper, we propose Mel-RoFormer, which adopts a Mel-band scheme that maps the frequency bins into overlapped subbands according to the mel scale. In contrast, the band-split mapping in BSRNN and BS-RoFormer is non-overlapping and designed based on heuristics. Using the MUSDB18HQ dataset for experiments, we demonstrate that Mel-RoFormer outperforms BS-RoFormer in the separation tasks of vocals, drums, and other stems.
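
To make the Mel-band scheme concrete, the sketch below derives an overlapped band-to-bin mapping from a triangular mel filterbank built with librosa. This is a minimal illustration, not the paper's implementation: the sample rate matches MUSDB18HQ, but the STFT size and band count are assumptions. A frequency bin is assigned to every band whose mel filter gives it nonzero weight, so adjacent subbands share bins, unlike the disjoint band-split tables used in BSRNN and BS-RoFormer.

```python
import numpy as np
import librosa

# Illustrative settings; n_fft and n_bands are assumptions,
# NOT the paper's exact configuration.
sr = 44100       # MUSDB18HQ sample rate
n_fft = 2048     # STFT size (assumed)
n_bands = 60     # number of mel bands (assumed)

# Triangular mel filterbank, shape (n_bands, n_fft // 2 + 1).
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_bands)

# Assign each bin to every band whose filter weights it nonzero.
# Neighbouring triangles share support, so the subbands overlap.
bands = [np.flatnonzero(mel_fb[b]) for b in range(n_bands)]

for b in (0, n_bands // 2, n_bands - 1):
    lo, hi = bands[b][0], bands[b][-1]
    print(f"band {b:2d}: bins {lo}..{hi} ({bands[b].size} bins)")
```

Because each triangle's support extends to the peaks of its neighbours, bins near a band edge contribute to two subbands; this overlap, with bandwidths that follow the mel scale, is what the abstract contrasts against the heuristic, non-overlapping splits of BSRNN and BS-RoFormer.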
