MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer (2312.06197v3)
Abstract: Recent research in self-supervised contrastive learning of music representations has demonstrated remarkable results across diverse downstream tasks. However, a prevailing trend in existing methods involves representing equally-sized music clips in either waveform or spectrogram formats, often overlooking the intrinsic part-whole hierarchies within music. In our quest to comprehend the bottom-up structure of music, we introduce MART, a hierarchical music representation learning approach that facilitates feature interactions among cropped music clips while considering their part-whole hierarchies. Specifically, we propose a hierarchical part-whole transformer to capture the structural relationships between music clips in a part-whole hierarchy. Furthermore, a hierarchical contrastive learning objective is crafted to align part-whole music representations at adjacent levels, progressively establishing a multi-hierarchy representation space. The effectiveness of our music representation learning from part-whole hierarchies has been empirically validated across multiple downstream tasks, including music classification and cover song identification.
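The idea of aligning part and whole representations at adjacent levels can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the loss form, so an InfoNCE-style objective is assumed here, and simple pairwise averaging (`merge_pairs`) stands in for the part-whole transformer. All function names, the temperature value, and the binary merge structure are hypothetical choices made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def merge_pairs(parts):
    """Average adjacent pairs of part vectors into 'whole' vectors.

    Assumption: a stand-in for the learned part-whole transformer.
    """
    return [
        [(a + b) / 2 for a, b in zip(parts[i], parts[i + 1])]
        for i in range(0, len(parts) - 1, 2)
    ]


def adjacent_level_loss(parts, wholes, temperature=0.1):
    """InfoNCE-style loss: each part should be closer to its own parent
    whole (index i // 2) than to the other wholes at the next level."""
    total = 0.0
    for i, part in enumerate(parts):
        logits = [cosine(part, w) / temperature for w in wholes]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i // 2]
    return total / len(parts)


def hierarchical_loss(clips, temperature=0.1):
    """Average the adjacent-level loss over every level that still has
    at least two wholes to contrast against."""
    if len(clips) < 4:
        raise ValueError("need at least 4 clip vectors")
    level, losses = clips, []
    while len(level) >= 4:
        wholes = merge_pairs(level)
        parts = level[: 2 * len(wholes)]  # drop an unpaired trailing part
        losses.append(adjacent_level_loss(parts, wholes, temperature))
        level = wholes
    return sum(losses) / len(losses)


# Toy usage: eight 2-D clip embeddings from one track.
clips = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9],
         [-1.0, 0.1], [-0.9, 0.2], [0.1, -1.0], [0.2, -0.9]]
loss = hierarchical_loss(clips)
```

In this toy setup, the loss is computed between levels 0 and 1 and between levels 1 and 2, then averaged, which mirrors the "adjacent levels" wording of the abstract without claiming to match the authors' actual architecture or negatives-sampling strategy.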
- Dong Yao
- Shengyu Zhang
- Zhou Zhao
- Jieming Zhu
- Liqun Deng
- Wenqiao Zhang
- Zhenhua Dong
- Xin Jiang
- Jiahao Xun