
Music Similarity Representation Learning Focusing on Individual Instruments with Source Separation and Human Preference (2503.18486v1)

Published 24 Mar 2025 in cs.SD and eess.AS

Abstract: This paper proposes music similarity representation learning (MSRL) based on individual instrument sounds (InMSRL) utilizing music source separation (MSS) and human preference without requiring clean instrument sounds during inference. We propose three methods that effectively improve performance. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade approach that sequentially performs MSS and music similarity feature extraction. E2E-FT allows the model to minimize the adverse effects of a separation error on the feature extraction. Second, we propose multi-task learning for the Direct approach that directly extracts disentangled music similarity features using a single music similarity feature extractor. Multi-task learning, which is based on the disentangled music similarity feature extraction and MSS based on reconstruction with disentangled music similarity features, further enhances instrument feature disentanglement. Third, we employ perception-aware fine-tuning (PAFT). PAFT utilizes human preference, allowing the model to perform InMSRL aligned with human perceptual similarity. We conduct experimental evaluations and demonstrate that 1) E2E-FT for Cascade significantly improves InMSRL performance, 2) the multi-task learning for Direct is also helpful to improve disentanglement performance in the feature extraction, 3) PAFT significantly enhances the perceptual InMSRL performance, and 4) Cascade with E2E-FT and PAFT outperforms Direct with the multi-task learning and PAFT.

Summary

Evaluation of Music Similarity Representation Learning Methods

The paper presents a novel approach to Music Similarity Representation Learning (MSRL), focusing on individual instrument sounds using music source separation (MSS) and human preference. A key strength of the proposed methodology is that it does not require clean instrument sounds during inference, making it practical for music recommendation and retrieval over large music databases.

Methodological Contributions

The paper outlines three methods to enhance instrument feature disentanglement and perceptual performance:

  1. End-to-End Fine-Tuning (E2E-FT): Applied to the Cascade approach, which sequentially performs MSS and music similarity feature extraction, this method mitigates the adverse effect of separation errors on feature extraction. Empirical results indicate substantial improvement in MSRL compared to training the two stages independently.
  2. Multi-Task Learning with Reconstruction (Reconst): Implemented in the Direct approach, this method directly extracts disentangled music similarity features, leveraging multi-task learning to refine feature extraction and MSS. The results demonstrate improved disentanglement performance, suggesting the advantage of joint task optimization.
  3. Perception-Aware Fine-Tuning (PAFT): This method, guided by human preference, enhances perceptual MSRL performance, aligning model outputs more closely with human perceptual similarity by utilizing a small dataset of human-labeled preferences.
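The first two training setups above can be sketched as weighted combinations of a similarity objective and a separation or reconstruction objective. Below is a minimal NumPy sketch; the toy arrays, tensor shapes, and loss weights (`lambda_sep`, `lambda_reconst`) are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

def cosine_distance(u, v):
    """1 - cosine similarity; small when two embeddings agree."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy tensors standing in for model outputs (shapes are illustrative).
separated = rng.normal(size=128)       # separator output for one stem
clean_stem = rng.normal(size=128)      # ground-truth stem (training only)
emb_mix = rng.normal(size=16)          # similarity embedding from the mixture
emb_ref = rng.normal(size=16)          # embedding of a same-instrument reference
reconstruction = rng.normal(size=128)  # audio rebuilt from disentangled features

# Cascade E2E-FT: the similarity loss is optimized jointly with the
# separation loss, so separation errors that hurt feature extraction
# are penalized during fine-tuning.
lambda_sep = 0.5
loss_cascade = cosine_distance(emb_mix, emb_ref) + lambda_sep * mse(separated, clean_stem)

# Direct multi-task: one extractor produces disentangled per-instrument
# features; an auxiliary reconstruction (MSS-style) loss regularizes them.
lambda_reconst = 0.5
loss_direct = cosine_distance(emb_mix, emb_ref) + lambda_reconst * mse(reconstruction, clean_stem)

print(round(loss_cascade, 4), round(loss_direct, 4))
```

In practice both terms would be differentiable model outputs optimized by gradient descent; the sketch only shows how the objectives combine.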

Quantitative Results

The paper reports strong numerical evidence favoring the new methods over conventional approaches. Cascade models refined with E2E-FT significantly outperformed the Spleeter-based Cascade model in music similarity representation accuracy. Similarly, multi-task learning with reconstruction in the Direct approach yielded better disentangled feature extraction than single-task-trained models.

Most notably, PAFT demonstrated a meaningful link between the learned representations and human perception, bridging the gap left by conventional MSRL methods, which struggle to represent similarity across differing musical pieces. The Cascade approach benefits particularly from end-to-end training, which diminishes errors stemming from MSS.
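PAFT can be viewed as preference-based metric learning: embeddings of clips a listener judged more similar to a query are pulled closer to it than clips judged less similar. A minimal sketch of such a triplet-style preference loss follows; the function name, margin value, and toy embeddings are illustrative assumptions, and the paper's exact objective may differ:

```python
import numpy as np

def triplet_preference_loss(anchor, preferred, other, margin=0.2):
    """Hinge loss that pulls the human-preferred clip's embedding closer
    to the anchor than the non-preferred one, by at least `margin`."""
    def dist(u, v):
        return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max(0.0, dist(anchor, preferred) - dist(anchor, other) + margin)

# Toy embeddings: `preferred` is nearly parallel to the anchor, `other` is not,
# so the preference constraint is already satisfied and the loss is zero.
anchor = np.array([1.0, 0.0, 0.0])
preferred = np.array([0.9, 0.1, 0.0])
other = np.array([0.0, 1.0, 0.0])

loss = triplet_preference_loss(anchor, preferred, other)
print(loss)  # → 0.0
```

Swapping `preferred` and `other` yields a positive loss, which is the gradient signal that would fine-tune the extractor toward human-judged similarity.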

Implications and Future Directions

The implications of these findings are twofold. Practically, they offer a framework for robust music recommendation systems that can prioritize individual instrument preferences without requiring clean audio input. Theoretically, they push toward models that not only extract features but do so in a way that reflects human perception.

This research opens the door to further exploration of AI-driven recommendation where personalization is key. Next steps might involve adapting these methods to broader datasets and raw audio formats, including vocals, to test scalability and adaptability across musical genres and formats. Integrating human perceptual metrics into machine learning models presents a compelling avenue for enhancing user interaction with AI-driven systems.
