Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation
The paper introduces Stem-JEPA, an innovative Joint-Embedding Predictive Architecture (JEPA) designed for estimating musical stem compatibility. The principal aim of the paper is to develop a self-supervised learning (SSL) model capable of determining the compatibility of individual musical stems within a broader contextual mix. This has practical applications in tasks such as stem retrieval, automatic arrangement, and stem generation.
Methodology
The model comprises two neural networks, an encoder and a predictor, trained jointly to produce embeddings that capture the compatibility of stems within a mix. The encoder is a Vision Transformer (ViT) that embeds the input, and the predictor, a 6-layer MLP, is conditioned on the instrument label of the target stem. The training pipeline converts audio chunks into log-mel spectrograms, from which embeddings are computed for both the context mix and the target stem; the predictor then maps the context embedding to a prediction of the target-stem embedding. The parameters of the target encoder are updated via an Exponential Moving Average (EMA) of the online encoder's parameters, which stabilizes training and helps prevent representational collapse.
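The EMA target-encoder update described above can be sketched in a few lines. This is a minimal NumPy illustration of the update rule, not the paper's implementation; the decay value and parameter representation are assumptions for the example.

```python
import numpy as np

def ema_update(target_params, online_params, decay=0.99):
    """EMA update for the target encoder: each target parameter moves a
    small step toward its online counterpart,
    theta_target <- decay * theta_target + (1 - decay) * theta_online."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]

# Toy example: a single "parameter" tensor per network.
target = [np.zeros(3)]
online = [np.ones(3)]
for _ in range(10):
    target = ema_update(target, online, decay=0.9)
# After n steps toward a fixed online value of 1.0, the target equals
# 1 - decay**n, so it converges geometrically without abrupt jumps.
```

In practice the decay is set close to 1 (e.g. 0.99 or higher), so the target encoder changes slowly relative to the online encoder, which is what gives JEPA-style training its stable regression targets.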
Evaluation
The paper evaluates Stem-JEPA using a retrieval task on the MUSDB18 dataset and through a subjective user study. The retrieval task tests the model's ability to identify the correct missing stem given a mixture of the other stems. Metrics such as Recall at K (R@K) and Normalized Rank measure the model's performance. For subjective evaluation, a user study assesses the perceptual compatibility of the retrieved stems.
Retrieval Task Results
- Recall@1 (R@1): 33.0%
- Normalized Rank (median): 0.5%
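Both metrics above can be computed from a query-candidate similarity matrix. The sketch below shows one plausible formulation, assuming (as is conventional, though not stated in this summary) that candidate i is the ground-truth stem for query i; the exact ranking protocol used in the paper may differ.

```python
import numpy as np

def retrieval_metrics(similarity, k=1):
    """Recall@k and median normalized rank for a retrieval task.

    similarity: (n_queries, n_candidates) score matrix; entry [i, i]
    is assumed to be the ground-truth stem for query i.
    """
    n_queries, n_candidates = similarity.shape
    # Candidates sorted by descending score; rank 0 = best match.
    order = np.argsort(-similarity, axis=1)
    ranks = np.array([np.where(order[i] == i)[0][0]
                      for i in range(n_queries)])
    recall_at_k = float(np.mean(ranks < k))
    # Normalize ranks to [0, 1] so 0% means "always retrieved first".
    normalized = ranks / (n_candidates - 1)
    return recall_at_k, float(np.median(normalized))

sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.7, 0.2, 0.5]])
r_at_1, med_rank = retrieval_metrics(sim, k=1)
# Queries 0 and 1 rank their own stem first, query 2 ranks it second,
# so r_at_1 == 2/3 and the median normalized rank is 0.0.
```

A median normalized rank of 0.5%, as reported, means the correct stem is typically ranked within the top half-percent of all candidates.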
These results indicate that Stem-JEPA significantly outperforms traditional methods like AutoMashupper in retrieving compatible stems. Performance varies across different instrument categories, with the "other" category showing the highest retrieval accuracy.
User Study
The subjective evaluation demonstrates that the stems retrieved by Stem-JEPA are rated significantly higher in compatibility compared to random stems, though slightly lower than the ground truth stems. This underscores the model's efficacy in identifying musically compatible stems that are not necessarily part of the original composition.
Temporal Alignment Analysis
An analysis of temporal alignment is conducted by evaluating cosine similarity between embeddings and predictions at various temporal shifts. The results indicate that Stem-JEPA's embeddings capture local temporal features effectively, as evidenced by periodic peaks in cosine similarity corresponding to beats and bars.
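A shift analysis of this kind can be sketched as follows: compute the mean cosine similarity between predicted and target per-frame embeddings at each temporal offset, and look for peaks at musically meaningful lags. This is an illustrative reconstruction, not the paper's code; the frame rate and embedding layout are assumptions.

```python
import numpy as np

def shift_similarity(pred, target, max_shift):
    """Mean cosine similarity between two (frames, dim) embedding
    sequences as a function of temporal offset, in frames.
    Returns {shift: mean cosine similarity}."""
    def cos(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    n = pred.shape[0]
    sims = {}
    for s in range(-max_shift, max_shift + 1):
        pairs = [(pred[t], target[t + s])
                 for t in range(n) if 0 <= t + s < n]
        sims[s] = float(np.mean([cos(a, b) for a, b in pairs]))
    return sims

# Synthetic embeddings with a period of 4 frames (a stand-in for a
# bar-level periodicity): similarity peaks at shifts 0 and +/-4.
emb = np.array([[np.sin(2 * np.pi * t / 4), np.cos(2 * np.pi * t / 4)]
                for t in range(16)])
sims = shift_similarity(emb, emb, max_shift=4)
```

On real data, peaks at shifts corresponding to beat and bar durations are exactly the periodic structure the paper reports.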
Musical Plausibility
The musical plausibility of the learned embeddings is assessed using key and chord annotations from the Isophonics dataset. A co-occurrence matrix of keys and chords within the same clusters reveals that embeddings close in the latent space are musically relevant, often reflecting dominant, subdominant, or tonic relationships.
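Such a co-occurrence analysis can be sketched as counting, for each embedding cluster, how often pairs of chord labels land in the same cluster. The function below is a hypothetical illustration (cluster assignments would come from e.g. k-means over the embeddings, which is assumed here, not taken from the paper).

```python
import numpy as np

def chord_cooccurrence(cluster_ids, chord_labels, vocab):
    """Co-occurrence matrix of chord labels within embedding clusters.

    cluster_ids: cluster assignment per embedding.
    chord_labels: chord annotation per embedding (same length).
    vocab: ordered list of chord names defining the matrix axes.
    """
    idx = {c: i for i, c in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    for k in set(cluster_ids):
        members = [idx[c] for c, g in zip(chord_labels, cluster_ids)
                   if g == k]
        # Count every ordered pair of distinct chords in the cluster.
        for a in members:
            for b in members:
                if a != b:
                    m[a, b] += 1
    return m

vocab = ["C", "G", "Am", "F"]
m = chord_cooccurrence(cluster_ids=[0, 0, 1, 1],
                       chord_labels=["C", "G", "Am", "F"],
                       vocab=vocab)
# C and G co-occur in cluster 0; Am and F co-occur in cluster 1.
```

High off-diagonal counts between tonic, dominant, and subdominant chords are what would indicate the harmonically meaningful structure the paper observes.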
Downstream Task Performance
Stem-JEPA is evaluated on downstream tasks such as key detection, genre classification, tagging, and instrument classification using the MARBLE benchmark. The results show competitive performance, particularly in tagging and instrument classification, even with a relatively limited training dataset:
- Key Detection Accuracy: 40.2%
- Genre Classification Accuracy: 68.6%
- Tagging ROC-AUC: 89.9%
- Instrument Classification Accuracy: 73.5%
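Downstream evaluation on benchmarks like MARBLE typically follows a probing protocol: the encoder is frozen and a lightweight classifier is fit on its embeddings. The sketch below uses a ridge-free least-squares linear probe as a minimal stand-in; the paper's exact probe architecture and training setup are not specified in this summary.

```python
import numpy as np

def linear_probe_accuracy(train_x, train_y, test_x, test_y, n_classes):
    """Fit a least-squares linear probe on frozen embeddings and
    report test accuracy."""
    # One-hot targets; solve min ||X @ w - Y||^2 for the probe weights.
    onehot = np.eye(n_classes)[train_y]
    w, *_ = np.linalg.lstsq(train_x, onehot, rcond=None)
    preds = np.argmax(test_x @ w, axis=1)
    return float(np.mean(preds == test_y))

# Toy, linearly separable "embeddings": two classes in 2-D.
x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
acc = linear_probe_accuracy(x, y, x, y, n_classes=2)
```

Because the encoder stays frozen, probe accuracy directly reflects how much task-relevant information the self-supervised embeddings already contain.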
Implications and Future Work
The implications of Stem-JEPA are manifold. Practically, it can facilitate more efficient music production workflows by automating tasks related to stem compatibility and arrangement. Theoretically, it extends the application of JEPA systems to the audio domain, showcasing the potential of JEPAs beyond traditional vision tasks. Future research could explore scaling the model to accommodate a wider variety of instrument classes and leveraging advances in source separation technology to expand training datasets.
Conclusion
Stem-JEPA presents a significant advancement in the field of Music Information Retrieval (MIR) by addressing the stem compatibility problem via a self-supervised learning approach. Its robust performance in both retrieval and downstream tasks, combined with its ability to capture meaningful musical features, positions it as a valuable tool for both research and practical applications in music technology. The paper opens new avenues for further exploration of JEPA models in various domains, highlighting their versatility and potential for broader adoption.