
Audio Embeddings as Teachers for Music Classification (2306.17424v1)

Published 30 Jun 2023 in cs.SD, cs.IR, and eess.AS

Abstract: Music classification has been one of the most popular tasks in the field of music information retrieval. With the development of deep learning models, the last decade has seen impressive improvements in a wide range of classification tasks. However, the increasing model complexity makes both training and inference computationally expensive. In this paper, we integrate the ideas of transfer learning and feature-based knowledge distillation and systematically investigate using pre-trained audio embeddings as teachers to guide the training of low-complexity student networks. By regularizing the feature space of the student networks with the pre-trained embeddings, the knowledge in the teacher embeddings can be transferred to the students. We use various pre-trained audio embeddings and test the effectiveness of the method on the tasks of musical instrument classification and music auto-tagging. Results show that our method yields significant improvements over the identical model trained without the teacher's knowledge. This technique can also be combined with classical knowledge distillation approaches to further improve the model's performance.
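The core idea lends itself to a short sketch. Below is a minimal, hypothetical PyTorch example of feature-space regularization with a teacher embedding: a small student network is trained with its usual classification loss plus an extra loss that pulls its intermediate features toward embeddings pre-computed by a frozen teacher model. The layer sizes, the linear projection, the MSE feature loss, and the loss weight are all illustrative assumptions; the paper's actual distance measure, architectures, and hyperparameters may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentNet(nn.Module):
    """Low-complexity student whose penultimate features are regularized
    toward a pre-trained (frozen) teacher audio embedding."""
    def __init__(self, in_dim=128, feat_dim=256, teacher_dim=512, n_classes=20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim), nn.ReLU(),
        )
        # Linear projection so student features can be compared with the
        # (typically higher-dimensional) teacher embedding.
        self.project = nn.Linear(feat_dim, teacher_dim)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        feat = self.backbone(x)
        return self.classifier(feat), self.project(feat)

def training_step(model, optimizer, x, y, teacher_emb, lambda_feat=1.0):
    """Classification loss plus feature-space regularization toward the
    pre-computed teacher embedding (MSE used here as a stand-in distance)."""
    logits, student_emb = model(x)
    cls_loss = F.binary_cross_entropy_with_logits(logits, y)  # multi-label tagging
    feat_loss = F.mse_loss(student_emb, teacher_emb)
    loss = cls_loss + lambda_feat * feat_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors; in practice the teacher embeddings would be
# extracted once, offline, by a frozen pre-trained audio model.
model = StudentNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 128)                 # e.g. pooled mel-spectrogram features
y = (torch.rand(8, 20) > 0.8).float()   # multi-label targets
teacher_emb = torch.randn(8, 512)       # pre-computed teacher embeddings
print(training_step(model, opt, x, y, teacher_emb))
```

Because the teacher is only needed to produce embeddings offline, the student keeps its low inference cost, and the same feature loss can be added on top of a classical logit-based distillation term.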

Authors (2)
  1. Yiwei Ding (13 papers)
  2. Alexander Lerch (43 papers)
Citations (3)
