Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment (2404.06682v1)

Published 10 Apr 2024 in cs.SD and eess.AS

Abstract: A flexible music recommendation and retrieval system should calculate music similarity with respect to multiple partial elements of a musical piece and allow users to select the element they want to focus on. A previous study proposed calculating music similarity based on each instrumental sound using a separate network per instrument, but it is impractical to use each individual signal as a query in a search system, and substituting separated instrumental sounds lowers accuracy because of separation artifacts. In this paper, we propose a method that computes similarities focused on each instrumental sound with a single network that takes mixed audio as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by a Conditional Similarity Network trained with a triplet loss using masks. Experimental results show that (1) the proposed method obtains more accurate feature representations than individual networks fed with separated sounds, (2) each sub-embedding space captures the characteristics of its corresponding instrument, and (3) the similar pieces selected by the proposed method for each instrumental sound agree with human judgments, especially for drums and guitar.
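
The core mechanism described above, a Conditional Similarity Network whose shared embedding is split into per-instrument sub-spaces by fixed binary masks and trained with a masked triplet loss, can be illustrated with a short sketch. The following PyTorch snippet is a minimal illustration rather than the authors' implementation; the embedding size, the number and ordering of instruments, and all names are assumptions made for demonstration.

```python
# Minimal sketch of a masked triplet loss over per-instrument sub-embeddings,
# in the spirit of Conditional Similarity Networks (Veit et al., 2017).
# All dimensions and names are illustrative assumptions.
import torch
import torch.nn.functional as F

EMB_DIM = 256        # total embedding size (assumed)
N_INSTRUMENTS = 4    # e.g. drums, bass, piano, guitar (assumed ordering)
SUB_DIM = EMB_DIM // N_INSTRUMENTS

# Fixed binary masks: masks[i] selects the sub-space for instrument i.
masks = torch.zeros(N_INSTRUMENTS, EMB_DIM)
for i in range(N_INSTRUMENTS):
    masks[i, i * SUB_DIM:(i + 1) * SUB_DIM] = 1.0

def masked_triplet_loss(anchor, positive, negative, instrument_idx, margin=0.2):
    """Triplet loss restricted to one instrument's sub-embedding.

    anchor, positive, negative: (batch, EMB_DIM) embeddings of mixed audio
    produced by a single shared encoder; instrument_idx: (batch,) labels
    indicating which instrument each triplet is conditioned on.
    """
    m = masks[instrument_idx]                      # (batch, EMB_DIM)
    d_pos = ((anchor - positive) ** 2 * m).sum(dim=1)
    d_neg = ((anchor - negative) ** 2 * m).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```

Because all instruments share one encoder and one embedding, a single mixed-audio query suffices at retrieval time; choosing an instrument only changes which mask, and hence which sub-space, the distance is computed in.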

References (22)
  1. IFPI, “Global music report 2022,” 2022, https://www.ifpi.org/wp-content/uploads/2022/04/IFPI_Global_Music_Report_2022-State_of_the_Industry.pdf.
  2. Apple Inc., “Apple Music,” 2023, https://www.apple.com/jp/apple-music/.
  3. P. Hamel and D. Eck, “Learning features from music audio with deep belief networks,” in International Society for Music Information Retrieval Conference, 2010, pp. 339–344.
  4. A. Elbir and N. Aydin, “Music genre classification and music recommendation by using deep learning,” Electronics Letters, vol. 56, no. 12, pp. 627–629, 2020.
  5. J. Park, J. Lee, J. Park, J. Ha, and J. Nam, “Representation learning of music using artist labels,” in International Society for Music Information Retrieval Conference, 2018, pp. 717–724.
  6. J. Cleveland, D. Cheng, M. Zhou, T. Joachims, and D. Turnbull, “Content-based music similarity with triplet networks,” 2020. [Online]. Available: https://arxiv.org/abs/2008.04938
  7. R. Lu, K. Wu, Z. Duan, and C. Zhang, “Deep ranking: Triplet matchnet for music metric learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 121–125.
  8. Y. Hashizume, L. Li, and T. Toda, “Music similarity calculation of individual instrumental sounds using metric learning,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2022, pp. 33–38.
  9. Y. Hashizume, L. Li, and T. Toda, “Evaluation of music similarity learning focusing on each instrumental sound,” in Proc. of Autumn Meeting of ASJ (in Japanese) 3-1-5, 2022, pp. 1517–1518.
  10. Y. Bengio, “Deep learning of representations: Looking forward.” in International Conference on Statistical Language and Speech Processing, 2013, pp. 1–37.
  11. W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” 2018. [Online]. Available: https://arxiv.org/abs/1810.07217
  12. W.-N. Hsu, Y. Zhang, R. J. Weiss, Y.-A. Chung, Y. Wang, Y. Wu, and J. Glass, “Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 5901–5905.
  13. Y.-N. Hung, Y.-A. Chen, and Y.-H. Yang, “Learning disentangled representations for timbre and pitch in music audio,” 2018. [Online]. Available: https://arxiv.org/abs/1811.03271
  14. Y.-J. Luo, K. Agres, and D. Herremans, “Learning disentangled representations of timbre and pitch for musical instrument sounds using Gaussian mixture variational autoencoders,” in International Society for Music Information Retrieval Conference, 2019, pp. 746–753.
  15. K. Tanaka, R. Nishikimi, Y. Bando, K. Yoshii, and S. Morishima, “Pitch-timbre disentanglement of musical instrument sounds based on VAE-based metric learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 111–115.
  16. A. Veit, S. Belongie, and T. Karaletsos, “Conditional similarity networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1781–1789.
  17. J. Lee, N. J. Bryan, J. Salamon, Z. Jin, and J. Nam, “Disentangled multidimensional metric learning for music similarity,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 6–10.
  18. E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Similarity-Based Pattern Recognition, 2015, pp. 84–92.
  19. E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2019, pp. 45–49.
  20. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
  21. A. Jansson, E. J. Humphrey, N. Montecchio, R. M. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-Net convolutional networks,” in International Society for Music Information Retrieval Conference, 2017.
  22. L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.
