Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment (2404.06682v1)

Published 10 Apr 2024 in cs.SD and eess.AS

Abstract: A flexible music recommendation and retrieval system should calculate music similarity with respect to multiple partial elements of a musical piece and allow users to select the element they want to focus on. A previous study proposed calculating music similarity based on each instrumental sound using a separate network per instrument, but it is impractical to use each individual signal as a query in a search system, and substituting separated instrumental sounds lowers accuracy because of separation artifacts. In this paper, we propose a method that computes similarities focused on each instrumental sound with a single network that takes mixed audio as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by a Conditional Similarity Network trained with a triplet loss using masks. Experimental results show that (1) the proposed method obtains more accurate feature representations than individual networks fed with separated sounds, (2) each sub-embedding space captures the characteristics of its corresponding instrument, and (3) the similar pieces selected by the proposed method for each instrumental sound agree with human judgments, especially for drums and guitar.
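
The core mechanism described above, a Conditional Similarity Network whose shared embedding is split into per-instrument sub-spaces by fixed binary masks and trained with a masked triplet loss, can be illustrated with a short sketch. The following PyTorch snippet is a minimal illustration rather than the authors' implementation; the embedding size, the number and ordering of instruments, and all names are assumptions made for demonstration.

```python
# Minimal sketch of a masked triplet loss over per-instrument sub-embeddings,
# in the spirit of Conditional Similarity Networks (Veit et al., 2017).
# All dimensions and names are illustrative assumptions.
import torch
import torch.nn.functional as F

EMB_DIM = 256        # total embedding size (assumed)
N_INSTRUMENTS = 4    # e.g. drums, bass, piano, guitar (assumed ordering)
SUB_DIM = EMB_DIM // N_INSTRUMENTS

# Fixed binary masks: masks[i] selects the sub-space for instrument i.
masks = torch.zeros(N_INSTRUMENTS, EMB_DIM)
for i in range(N_INSTRUMENTS):
    masks[i, i * SUB_DIM:(i + 1) * SUB_DIM] = 1.0

def masked_triplet_loss(anchor, positive, negative, instrument_idx, margin=0.2):
    """Triplet loss restricted to one instrument's sub-embedding.

    anchor, positive, negative: (batch, EMB_DIM) embeddings of mixed audio
    produced by a single shared encoder; instrument_idx: (batch,) labels
    indicating which instrument each triplet is conditioned on.
    """
    m = masks[instrument_idx]                      # (batch, EMB_DIM)
    d_pos = ((anchor - positive) ** 2 * m).sum(dim=1)
    d_neg = ((anchor - negative) ** 2 * m).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```

Because all instruments share one encoder and one embedding, a single mixed-audio query suffices at retrieval time; choosing an instrument only changes which mask, and hence which sub-space, the distance is computed in.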

References (22)
  1. IFPI, “Global music report 2022,” 2022, https://www.ifpi.org/wp-content/uploads/2022/04/IFPI_Global_Music_Report_2022-State_of_the_Industry.pdf.
  2. Apple Inc., “Apple Music,” 2023, https://www.apple.com/jp/apple-music/.
  3. P. Hamel and D. Eck, “Learning features from music audio with deep belief networks,” in International Society for Music Information Retrieval Conference, 2010, pp. 339–344.
  4. A. Elbir and N. Aydin, “Music genre classification and music recommendation by using deep learning,” Electronics Letters, vol. 56, no. 12, pp. 627–629, 2020.
  5. J. Park, J. Lee, J. Park, J. Ha, and J. Nam, “Representation learning of music using artist labels,” in International Society for Music Information Retrieval Conference, 2018, pp. 717–724.
  6. J. Cleveland, D. Cheng, M. Zhou, T. Joachims, and D. Turnbull, “Content-based music similarity with triplet networks,” 2020. [Online]. Available: https://arxiv.org/abs/2008.04938
  7. R. Lu, K. Wu, Z. Duan, and C. Zhang, “Deep ranking: Triplet matchnet for music metric learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 121–125.
  8. Y. Hashizume, L. Li, and T. Toda, “Music similarity calculation of individual instrumental sounds using metric learning,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2022, pp. 33–38.
  9. Y. Hashizume, L. Li, and T. Toda, “Evaluation of music similarity learning focusing on each instrumental sound,” in Proc. of Autumn Meeting of ASJ (in Japanese) 3-1-5, 2022, pp. 1517–1518.
  10. Y. Bengio, “Deep learning of representations: Looking forward.” in International Conference on Statistical Language and Speech Processing, 2013, pp. 1–37.
  11. W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” 2018. [Online]. Available: https://arxiv.org/abs/1810.07217
  12. W.-N. Hsu, Y. Zhang, R. J. Weiss, Y.-A. Chung, Y. Wang, Y. Wu, and J. Glass, “Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 5901–5905.
  13. Y.-N. Hung, Y.-A. Chen, and Y.-H. Yang, “Learning disentangled representations for timbre and pitch in music audio,” 2018. [Online]. Available: https://arxiv.org/abs/1811.03271
  14. Y.-J. Luo, K. Agres, and D. Herremans, “Learning disentangled representations of timbre and pitch for musical instrument sounds using Gaussian mixture variational autoencoders,” in International Society for Music Information Retrieval Conference, 2019, pp. 746–753.
  15. K. Tanaka, R. Nishikimi, Y. Bando, K. Yoshii, and S. Morishima, “Pitch-timbre disentanglement of musical instrument sounds based on VAE-based metric learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 111–115.
  16. A. Veit, S. Belongie, and T. Karaletsos, “Conditional similarity networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1781–1789.
  17. J. Lee, N. J. Bryan, J. Salamon, Z. Jin, and J. Nam, “Disentangled multidimensional metric learning for music similarity,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 6–10.
  18. E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Similarity-Based Pattern Recognition, 2015, pp. 84–92.
  19. E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2019, pp. 45–49.
  20. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
  21. A. Jansson, E. J. Humphrey, N. Montecchio, R. M. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-Net convolutional networks,” in International Society for Music Information Retrieval Conference, 2017.
  22. L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.
