Similar but Faster: Manipulation of Tempo in Music Audio Embeddings for Tempo Prediction and Search (2401.08902v1)
Abstract: Audio embeddings enable large-scale comparison of the similarity of audio files for applications such as search and recommendation. Because audio similarity is subjective, it can be desirable to design systems that answer not only whether audio is similar, but similar in what way (e.g., with respect to tempo, mood, or genre). Previous works have proposed disentangled embedding spaces in which subspaces representing specific, yet possibly correlated, attributes can be weighted to emphasize those attributes in downstream tasks. However, no research has been conducted into the independence of these subspaces, nor into their manipulation, in order to retrieve tracks that are similar but different in a specific way. Here, we explore the manipulation of tempo in embedding spaces as a case study towards this goal. We propose tempo translation functions that allow efficient manipulation of tempo within a pre-existing embedding space whilst maintaining other properties such as genre. As this translation is specific to tempo, it enables retrieval of tracks that are similar but have specifically different tempi. We show that such a function can serve as an efficient data augmentation strategy both for training downstream tempo predictors and for improved nearest-neighbor retrieval of properties largely independent of tempo.
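The core idea — shifting an embedding along a tempo-specific direction and then querying as usual — can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the embedding dimension, the residual linear form of the translation function, and its conditioning on the log tempo ratio are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8  # illustrative embedding size, not the paper's

def tempo_translate(z, ratio, W, b):
    """Hypothetical tempo translation: map embedding z to an embedding of
    the 'same' track with tempo scaled by `ratio`, ideally leaving other
    attributes (e.g., genre) untouched. Modeled here as a residual linear
    map conditioned on log(ratio) -- an assumption, not the paper's design."""
    cond = np.concatenate([z, [np.log(ratio)]])
    return z + W @ cond + b

# Toy parameters; in practice these would be learned from pairs of
# embeddings of original and time-stretched audio.
W = rng.normal(scale=0.1, size=(EMB_DIM, EMB_DIM + 1))
b = np.zeros(EMB_DIM)

# Data augmentation: translate one track's embedding to several tempi
# without re-running the audio encoder on time-stretched audio.
z = rng.normal(size=EMB_DIM)
augmented = [tempo_translate(z, r, W, b) for r in (0.8, 1.0, 1.25)]

# "Similar but faster" retrieval: query the catalog with the
# tempo-shifted embedding instead of the original one.
catalog = rng.normal(size=(100, EMB_DIM))
query = tempo_translate(z, 1.25, W, b)
nearest = int(np.argmin(np.linalg.norm(catalog - query, axis=1)))
```

The appeal of this scheme is that augmentation happens entirely in embedding space, so it avoids the cost of time-stretching audio and re-encoding it for every target tempo.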