Toward Fully Self-Supervised Multi-Pitch Estimation (2402.15569v1)
Abstract: Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on narrower characterizations of the task, but suffer from limitations concerning the shortage of large-scale and diverse polyphonic music datasets with multi-pitch annotations. We present a suite of self-supervised learning objectives for multi-pitch estimation, which encourage the concentration of support around harmonics, invariance to timbral transformations, and equivariance to geometric transformations. These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly, without any fine-tuning. Despite training exclusively on a collection of synthetic single-note audio samples, our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.
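To make the equivariance idea concrete: on a log-frequency representation such as a CQT, pitch transposition corresponds approximately to a shift along the frequency axis, so an equivariance objective can penalize the mismatch between transposing-then-encoding and encoding-then-transposing. The sketch below is illustrative only and does not reproduce the paper's exact loss formulation; the `model`, array shapes, and zero-padded shift are assumptions for the example.

```python
# Illustrative sketch of a transposition-equivariance objective on a
# log-frequency spectrogram (freq_bins x frames). Not the authors' exact
# loss; the identity "model" below is only a toy stand-in.
import numpy as np

def shift_bins(spec, k):
    """Shift along the frequency axis by k bins, zero-padding the rest."""
    out = np.zeros_like(spec)
    if k > 0:
        out[k:] = spec[:-k]
    elif k < 0:
        out[:k] = spec[-k:]
    else:
        out = spec.copy()
    return out

def equivariance_loss(model, spec, k):
    """Mean squared mismatch between model(shift(x)) and shift(model(x))."""
    a = model(shift_bins(spec, k))       # transpose input, then encode
    b = shift_bins(model(spec), k)       # encode, then transpose output
    return float(np.mean((a - b) ** 2))

# A perfectly equivariant "model" (identity) incurs zero loss:
x = np.random.default_rng(0).random((36, 8))
print(equivariance_loss(lambda s: s, x, 5))  # -> 0.0
```

In training, `model` would be the convolutional autoencoder, and the same pattern extends to the invariance objective by requiring the output to be unchanged under timbral transformations instead of shifted.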