Low-Complexity Audio Embedding Extractors (2303.01879v2)

Published 3 Mar 2023 in cs.SD and eess.AS

Abstract: Solving tasks such as speaker recognition, music classification, or semantic audio event tagging with deep learning models typically requires computationally demanding networks. General-purpose audio embeddings (GPAEs) are dense representations of audio signals that allow lightweight, shallow classifiers to tackle various audio tasks. The idea is that a single complex feature extractor computes dense GPAEs, while shallow MLPs produce task-specific predictions. If the extracted dense representations are general enough to let the simple downstream classifiers generalize across a variety of audio tasks, a single costly forward pass suffices to solve multiple tasks in parallel. In this work, we try to reduce the cost of GPAE extractors to make them suitable for resource-constrained devices. We use efficient MobileNets, trained on AudioSet using Knowledge Distillation from a Transformer ensemble, as efficient GPAE extractors. We explore how to obtain high-quality GPAEs from the model, study how model complexity relates to the quality of extracted GPAEs, and conclude that low-complexity models can generate competitive GPAEs, paving the way for analyzing audio streams on edge devices with respect to multiple audio classification and recognition tasks.
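
To make the deployment pattern concrete, here is a minimal PyTorch sketch of the setup the abstract describes: one frozen embedding extractor shared by several shallow task-specific heads. All module names, dimensions, and the dummy extractor below are illustrative assumptions, not the authors' code; in the paper the extractor is an AudioSet-pretrained MobileNet.

```python
import torch
import torch.nn as nn

class ShallowHead(nn.Module):
    """One-hidden-layer MLP classifier operating on precomputed embeddings."""
    def __init__(self, embed_dim: int, num_classes: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

# Stand-in for the distilled efficient CNN, with its classification head
# removed so it outputs one dense embedding vector per audio clip.
extractor = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, 960),  # 960-d embedding; the true dimensionality is an assumption
)
for p in extractor.parameters():
    p.requires_grad = False  # the shared extractor stays frozen

# Hypothetical downstream tasks; only the AudioSet class count (527) is real.
heads = {
    "speaker_id": ShallowHead(960, num_classes=100),
    "music_genre": ShallowHead(960, num_classes=10),
    "event_tagging": ShallowHead(960, num_classes=527),
}

waveform = torch.randn(8, 1, 32000)   # batch of 2 s clips at 16 kHz
with torch.no_grad():
    embedding = extractor(waveform)   # one costly forward pass ...
predictions = {task: head(embedding) for task, head in heads.items()}  # ... many cheap heads
```

The point of the pattern is that the expensive forward pass is amortized across every task head, which is exactly why lowering the complexity of the extractor itself matters for edge devices.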

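The abstract also states that the MobileNet extractors are trained on AudioSet via Knowledge Distillation from a Transformer ensemble. Below is a hedged sketch of what such a distillation objective could look like; the temperature, loss weighting, and BCE formulation (AudioSet tagging is multi-label) are assumptions for illustration, not the paper's verified training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits_list: list,
                      targets: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a hard-label term with a soft term against the averaged
    ensemble predictions. Multi-label tagging suggests sigmoid/BCE
    rather than softmax cross-entropy."""
    # Average the ensemble members' logits to form a single teacher signal.
    teacher_logits = torch.stack(teacher_logits_list).mean(dim=0)
    hard = F.binary_cross_entropy_with_logits(student_logits, targets)
    soft = F.binary_cross_entropy_with_logits(
        student_logits / temperature,
        torch.sigmoid(teacher_logits / temperature),
    )
    return alpha * hard + (1.0 - alpha) * soft
```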
