Tempo estimation as fully self-supervised binary classification (2401.08891v1)
Abstract: This paper addresses the problem of global tempo estimation in musical audio. Given that annotating tempo is time-consuming and requires a certain level of musical expertise, few publicly available data sources exist to train machine learning models for this task. To alleviate this issue, we propose a fully self-supervised approach that does not rely on any human-labeled data. Our method builds on the fact that generic (music) audio embeddings already encode a variety of properties, including information about tempo, making them easily adaptable to downstream tasks. Whereas recent work in self-supervised tempo estimation aimed to learn a tempo-specific representation that was subsequently used to train a supervised classifier, we reformulate the task as the binary classification problem of predicting whether a target track has the same or a different tempo compared to a reference. While the former still requires labeled training data for the final classification model, our approach uses arbitrary unlabeled music data in combination with time-stretching for model training, together with a small set of synthetically created reference samples for predicting the final tempo. Evaluation of our approach against the state of the art reveals highly competitive performance when the constraint of finding the precise tempo octave is relaxed.
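To make the described scheme concrete, the sketch below illustrates the two ingredients the abstract names: self-supervised training pairs built from unlabeled audio via time-stretching (label 1 if reference and target share a tempo, 0 otherwise), and inference by comparing a target against synthetic click-track references with known BPM. The embedding stand-in `embed`, the classifier head, the stretch range, and the BPM grid are all illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of same-vs-different-tempo training and inference.
# `embed` is a placeholder for a frozen, generic music audio embedding
# (the paper builds on pre-trained embeddings); all names and ranges here
# are assumptions for illustration.
import numpy as np
import torch
import torch.nn as nn
import librosa

SR = 22050  # sample rate assumed throughout


def embed(audio: np.ndarray) -> torch.Tensor:
    """Stand-in for a pre-trained audio embedding (assumption)."""
    mel = librosa.feature.melspectrogram(y=audio, sr=SR)          # (128, T)
    return torch.tensor(mel.mean(axis=1), dtype=torch.float32)    # (128,)


class SameTempoClassifier(nn.Module):
    """Binary head: does the target share the reference's tempo?"""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 256),
                                 nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, ref_emb: torch.Tensor, tgt_emb: torch.Tensor):
        return self.net(torch.cat([ref_emb, tgt_emb], dim=-1))    # logit


def make_training_pair(track: np.ndarray, p_same: float = 0.5):
    """Self-supervised pair from one unlabeled track: label 1 if the
    target keeps the reference tempo, 0 if it is stretched to a new one."""
    if np.random.rand() < p_same:
        rate, label = 1.0, 1.0
    else:
        rate = np.random.uniform(0.7, 1.4)    # illustrative stretch range
        while abs(rate - 1.0) < 0.05:         # avoid near-identical tempi
            rate = np.random.uniform(0.7, 1.4)
        label = 0.0
    target = librosa.effects.time_stretch(track, rate=rate)
    return embed(track), embed(target), torch.tensor([label])


def click_track(bpm: float, dur: float = 10.0) -> np.ndarray:
    """Synthetic reference sample with a known tempo."""
    times = np.arange(0.0, dur, 60.0 / bpm)
    return librosa.clicks(times=times, sr=SR, length=int(dur * SR))


@torch.no_grad()
def estimate_tempo(model: SameTempoClassifier, target: np.ndarray,
                   bpm_grid=range(40, 241)) -> int:
    """Return the BPM of the reference most confidently judged 'same'."""
    tgt_emb = embed(target)
    scores = {bpm: model(embed(click_track(bpm)), tgt_emb).sigmoid().item()
              for bpm in bpm_grid}
    return max(scores, key=scores.get)
```

Note that under such a scheme, references at integer multiples or fractions of the true tempo (e.g. 80 vs. 160 BPM) can plausibly score similarly, which is consistent with the abstract's observation that performance is strongest when the tempo-octave constraint is relaxed.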