Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures (2411.19806v2)

Published 29 Nov 2024 in cs.SD, cs.AI, and eess.AS

Abstract: In this paper, we tackle the task of musical stem retrieval. Given a musical mix, it consists in retrieving a stem that would fit with it, i.e., that would sound pleasant if played together. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target. In particular, we design our predictor to be conditioned on arbitrary instruments, enabling our model to perform zero-shot stem retrieval. In addition, we discover that pretraining the encoder using contrastive learning drastically improves the model's performance. We validate the retrieval performances of our model using the MUSDB18 and MoisesDB datasets. We show that it significantly outperforms previous baselines on both datasets, showcasing its ability to support more or less precise (and possibly unseen) conditioning. We also evaluate the learned embeddings on a beat tracking task, demonstrating that they retain temporal structure and local information.

Summary

  • The paper proposes a novel JEPA-based method for zero-shot musical stem retrieval using contrastive pretraining and FiLM-inspired conditioning.
  • Empirical validation on MUSDB18 and MoisesDB shows significantly improved recall rates and normalized ranks over baseline methods.
  • The approach’s embeddings retain temporal information, suggesting broader applications in music information retrieval, including beat tracking.

Analysis of "Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures"

The paper "Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures" presents a compelling advancement in the domain of music information retrieval, specifically addressing the task of musical stem retrieval. The authors propose an innovative method leveraging Joint-Embedding Predictive Architectures (JEPA) for identifying compatible musical stems that can be combined harmoniously with a given musical mix. This approach accommodates flexibility by adopting a zero-shot retrieval scheme, which allows the system to process music stems without the need for specific prior training on those particular instrument classes.

Methodological Advancements

The authors detail a methodology that combines JEPA with contrastive learning and proceeds in two phases. The first phase pretrains the encoder with a contrastive objective, shaping a latent space in which compatible musical samples lie close together. The second phase trains the joint-embedding predictive architecture, conditioning the predictor on textual descriptions of instrument classes via CLAP embeddings; this allows the model to predict the target stem representation even in zero-shot scenarios. In addition, FiLM-inspired conditioning is shown to outperform the conditioning schemes used in prior work.
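To make the conditioning mechanism concrete, the sketch below shows how a FiLM-style predictor could modulate the mix's latent representation with an instrument embedding (such as a CLAP text embedding). This is a minimal illustration under assumed dimensions and module names, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of FiLM-style conditioning of a JEPA
# predictor on an instrument embedding (e.g., a CLAP text embedding). All module
# names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class FiLMPredictor(nn.Module):
    def __init__(self, ctx_dim=512, cond_dim=512, hidden_dim=512):
        super().__init__()
        # Maps the conditioning embedding to per-channel scale (gamma) and shift (beta).
        self.film = nn.Linear(cond_dim, 2 * hidden_dim)
        self.proj_in = nn.Linear(ctx_dim, hidden_dim)
        self.proj_out = nn.Linear(hidden_dim, ctx_dim)
        self.act = nn.GELU()

    def forward(self, z_context, cond_emb):
        # z_context: (batch, time, ctx_dim) latent of the mix from the encoder
        # cond_emb:  (batch, cond_dim)      embedding of the target instrument
        h = self.proj_in(z_context)
        gamma, beta = self.film(cond_emb).chunk(2, dim=-1)
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)   # FiLM modulation per channel
        return self.proj_out(self.act(h))                # predicted target-stem latent

# Training step (sketch): regress the predicted latent onto the target-stem latent,
# e.g. pred = predictor(encoder(mix), clap_text_embedding) and
# loss = torch.nn.functional.mse_loss(pred, encoder(target_stem).detach()).
```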

Empirical Validation

The approach is validated on two datasets, MUSDB18 and MoisesDB, where the model surpasses existing baselines across the reported performance metrics. Results are particularly strong in zero-shot settings, where stems belonging to instrument classes absent from the training set are still retrieved effectively. The paper reports substantial gains in recall rates and normalized ranks over previous techniques. Contrastive pretraining accounts for a large share of the improvement, and FiLM conditioning refines it further, proving important for distinguishing complex musical structures.
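For reference, the retrieval metrics mentioned above can be computed as in the following sketch, which assumes one predicted latent per query mix and a fixed pool of candidate stem latents; the function and variable names are hypothetical.

```python
# Illustrative computation of recall@k and normalized rank for stem retrieval.
# This is a generic metric sketch, not the paper's evaluation code.
import torch

def retrieval_metrics(queries, candidates, targets, k=1):
    """queries: (N, D) predicted latents; candidates: (M, D) candidate stem latents;
    targets: (N,) long tensor with the index of the ground-truth stem per query."""
    q = torch.nn.functional.normalize(queries, dim=-1)
    c = torch.nn.functional.normalize(candidates, dim=-1)
    sims = q @ c.T                                                  # cosine similarities (N, M)
    true_sim = sims.gather(1, targets[:, None])                     # similarity to ground truth
    ranks = (sims > true_sim).sum(dim=1)                            # 0 = best possible rank
    recall_at_k = (ranks < k).float().mean().item()
    normalized_rank = (ranks.float() / candidates.shape[0]).mean().item()
    return recall_at_k, normalized_rank
```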

Contributions and Implications

Beyond stem retrieval, the paper shows that embeddings learned by the proposed model retain a significant degree of temporal information. This insight was validated through beat tracking tests, where results demonstrated competitive accuracy, suggesting potential applicability across various tasks within music information retrieval (MIR). Such capabilities position these embeddings as versatile tools for more granular musical analysis tasks beyond stem retrieval.
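A lightweight way to test whether frame-level embeddings retain beat information is to train a small probe on top of the frozen encoder outputs. The sketch below shows one such setup with a linear layer and binary beat targets; it is an assumption about the general style of probing, not the authors' exact evaluation protocol.

```python
# Sketch of a linear probe over frozen frame-level embeddings for beat detection.
# The embedding dimension (512) and training details are assumed placeholders.
import torch
import torch.nn as nn

probe = nn.Linear(512, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(frame_embeddings, beat_targets):
    # frame_embeddings: (batch, time, 512) frozen embeddings from the encoder
    # beat_targets:     (batch, time)      1 where a beat falls, else 0
    logits = probe(frame_embeddings).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, beat_targets.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```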

In conclusion, the presented work exemplifies a methodological leap in tackling the challenges inherent in zero-shot musical stem retrieval. Although the paper primarily focuses on musical applications, the outlined architecture and its multimodal conditioning techniques may carry over to other domains facing similar challenges. Future research could explore extending this methodology to other fields requiring sophisticated handling of complex, high-dimensional multimodal data. The promising results on beat tracking also invite inquiries into further MIR applications, indicating a pathway for broader use of the learned embeddings in temporal musical analysis.