Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation
The paper introduces Stem-JEPA, an innovative Joint-Embedding Predictive Architecture (JEPA) designed for estimating musical stem compatibility. The principal aim of the paper is to develop a self-supervised learning (SSL) model capable of determining the compatibility of individual musical stems within a broader contextual mix. This has practical applications in tasks such as stem retrieval, automatic arrangement, and stem generation.
Methodology
The model comprises two neural networks, an encoder and a predictor, trained jointly to produce embeddings that capture the compatibility of stems within a mix. The encoder is a Vision Transformer (ViT) that embeds the input, and the predictor, a 6-layer MLP, is conditioned on the instrument label of the target stem. The training pipeline converts audio chunks into log-mel spectrograms, from which embeddings are computed for both the context mix and the target stem; the predictor then maps the context embedding to a prediction of the target-stem embedding. The parameters of the target encoder are updated via an Exponential Moving Average (EMA) of the online encoder's parameters, which stabilizes training and helps prevent representational collapse.
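The EMA target-encoder update described above can be sketched in a few lines. This is a minimal NumPy illustration of the update rule, not the paper's implementation; the decay value and parameter representation are assumptions for the example.

```python
import numpy as np

def ema_update(target_params, online_params, decay=0.99):
    """EMA update for the target encoder: each target parameter moves a
    small step toward its online counterpart,
    theta_target <- decay * theta_target + (1 - decay) * theta_online."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]

# Toy example: a single "parameter" tensor per network.
target = [np.zeros(3)]
online = [np.ones(3)]
for _ in range(10):
    target = ema_update(target, online, decay=0.9)
# After n steps toward a fixed online value of 1.0, the target equals
# 1 - decay**n, so it converges geometrically without abrupt jumps.
```

In practice the decay is set close to 1 (e.g. 0.99 or higher), so the target encoder changes slowly relative to the online encoder, which is what gives JEPA-style training its stable regression targets.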
Evaluation
The paper evaluates Stem-JEPA using a retrieval task on the MUSDB18 dataset and through a subjective user study. The retrieval task tests the model's ability to identify the correct missing stem given a mixture of the other stems. Metrics such as Recall at K (R@K) and Normalized Rank measure the model's performance. For subjective evaluation, a user study assesses the perceptual compatibility of the retrieved stems.
Retrieval Task Results
- Recall@1 (R@1): 33.0%
- Normalized Rank (median): 0.5%
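Both metrics above can be computed from a query-candidate similarity matrix. The sketch below shows one plausible formulation, assuming (as is conventional, though not stated in this summary) that candidate i is the ground-truth stem for query i; the exact ranking protocol used in the paper may differ.

```python
import numpy as np

def retrieval_metrics(similarity, k=1):
    """Recall@k and median normalized rank for a retrieval task.

    similarity: (n_queries, n_candidates) score matrix; entry [i, i]
    is assumed to be the ground-truth stem for query i.
    """
    n_queries, n_candidates = similarity.shape
    # Candidates sorted by descending score; rank 0 = best match.
    order = np.argsort(-similarity, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0]
                      for i in range(n_queries)])
    recall_at_k = float(np.mean(ranks < k))
    # Normalize ranks to [0, 1] so 0% means "always retrieved first".
    normalized = ranks / (n_candidates - 1)
    return recall_at_k, float(np.median(normalized))

sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.7, 0.2, 0.5]])
r_at_1, med_rank = retrieval_metrics(sim, k=1)
# Queries 0 and 1 rank their own stem first, query 2 ranks it second,
# so r_at_1 == 2/3 and the median normalized rank is 0.0.
```

A median normalized rank of 0.5%, as reported, means the correct stem is typically ranked within the top half-percent of all candidates.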
These results indicate that Stem-JEPA significantly outperforms traditional methods like AutoMashupper in retrieving compatible stems. Performance varies across different instrument categories, with the "other" category showing the highest retrieval accuracy.
User Study
The subjective evaluation demonstrates that the stems retrieved by Stem-JEPA are rated significantly higher in compatibility compared to random stems, though slightly lower than the ground truth stems. This underscores the model's efficacy in identifying musically compatible stems that are not necessarily part of the original composition.
Temporal Alignment Analysis
An analysis of temporal alignment is conducted by evaluating cosine similarity between embeddings and predictions at various temporal shifts. The results indicate that Stem-JEPA's embeddings capture local temporal features effectively, as evidenced by periodic peaks in cosine similarity corresponding to beats and bars.
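A shift analysis of this kind can be sketched as follows: compute the mean cosine similarity between predicted and target per-frame embeddings at each temporal offset, and look for peaks at musically meaningful lags. This is an illustrative reconstruction, not the paper's code; the frame rate and embedding layout are assumptions.

```python
import numpy as np

def shift_similarity(pred, target, max_shift):
    """Mean cosine similarity between two (frames, dim) embedding
    sequences as a function of temporal offset, in frames.
    Returns {shift: mean cosine similarity}."""
    def cos(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    n = pred.shape[0]
    sims = {}
    for s in range(-max_shift, max_shift + 1):
        pairs = [(pred[t], target[t + s])
                 for t in range(n) if 0 <= t + s < n]
        sims[s] = float(np.mean([cos(a, b) for a, b in pairs]))
    return sims

# Synthetic embeddings with a period of 4 frames (a stand-in for a
# bar-level periodicity): similarity peaks at shifts 0 and +/-4.
emb = np.array([[np.sin(2 * np.pi * t / 4), np.cos(2 * np.pi * t / 4)]
                for t in range(16)])
sims = shift_similarity(emb, emb, max_shift=4)
```

On real data, peaks at shifts corresponding to beat and bar durations are exactly the periodic structure the paper reports.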
Musical Plausibility
The musical plausibility of the learned embeddings is assessed using key and chord annotations from the Isophonics dataset. A co-occurrence matrix of keys and chords within the same clusters reveals that embeddings close in the latent space are musically relevant, often reflecting dominant, subdominant, or tonic relationships.
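Such a co-occurrence analysis can be sketched as counting, for each embedding cluster, how often pairs of chord labels land in the same cluster. The function below is a hypothetical illustration (cluster assignments would come from e.g. k-means over the embeddings, which is assumed here, not taken from the paper).

```python
import numpy as np

def chord_cooccurrence(cluster_ids, chord_labels, vocab):
    """Co-occurrence matrix of chord labels within embedding clusters.

    cluster_ids: cluster assignment per embedding.
    chord_labels: chord annotation per embedding (same length).
    vocab: ordered list of chord names defining the matrix axes.
    """
    idx = {c: i for i, c in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    for k in set(cluster_ids):
        members = [idx[c] for c, g in zip(chord_labels, cluster_ids)
                   if g == k]
        # Count every ordered pair of distinct chords in the cluster.
        for a in members:
            for b in members:
                if a != b:
                    m[a, b] += 1
    return m

vocab = ["C", "G", "Am", "F"]
m = chord_cooccurrence(cluster_ids=[0, 0, 1, 1],
                       chord_labels=["C", "G", "Am", "F"],
                       vocab=vocab)
# C and G co-occur in cluster 0; Am and F co-occur in cluster 1.
```

High off-diagonal counts between tonic, dominant, and subdominant chords are what would indicate the harmonically meaningful structure the paper observes.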
Downstream Task Performance
Stem-JEPA is evaluated on downstream tasks such as key detection, genre classification, tagging, and instrument classification using the MARBLE benchmark. The results show competitive performance, particularly in tagging and instrument classification, even with a relatively limited training dataset:
- Key Detection Accuracy: 40.2%
- Genre Classification Accuracy: 68.6%
- Tagging ROC-AUC: 89.9%
- Instrument Classification Accuracy: 73.5%
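Downstream evaluation on benchmarks like MARBLE typically follows a probing protocol: the encoder is frozen and a lightweight classifier is fit on its embeddings. The sketch below uses a ridge-free least-squares linear probe as a minimal stand-in; the paper's exact probe architecture and training setup are not specified in this summary.

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, n_classes):
    """Fit a least-squares linear probe on frozen embeddings and
    report test accuracy."""
    # One-hot targets; solve min ||X @ w - Y||^2 for the probe weights.
    onehot = np.eye(n_classes)[train_y]
    w, *_ = np.linalg.lstsq(train_x, onehot, rcond=None)
    preds = np.argmax(test_x @ w, axis=1)
    return float(np.mean(preds == test_y))

# Toy, linearly separable "embeddings": two classes in 2-D.
x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
acc = linear_probe_accuracy(x, y, x, y, n_classes=2)
```

Because the encoder stays frozen, probe accuracy directly reflects how much task-relevant information the self-supervised embeddings already contain.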
Implications and Future Work
The implications of Stem-JEPA are manifold. Practically, it can facilitate more efficient music production workflows by automating tasks related to stem compatibility and arrangement. Theoretically, it extends the application of JEPA systems to the audio domain, showcasing the potential of JEPAs beyond traditional vision tasks. Future research could explore scaling the model to accommodate a wider variety of instrument classes and leveraging advances in source separation technology to expand training datasets.
Conclusion
Stem-JEPA presents a significant advancement in the field of Music Information Retrieval (MIR) by addressing the stem compatibility problem via a self-supervised learning approach. Its robust performance in both retrieval and downstream tasks, combined with its ability to capture meaningful musical features, positions it as a valuable tool for both research and practical applications in music technology. The paper opens new avenues for further exploration of JEPA models in various domains, highlighting their versatility and potential for broader adoption.