
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation (2211.04772v3)

Published 9 Nov 2022 in cs.SD, cs.LG, and eess.AS

Abstract: Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers. The proposed training schema and the efficient CNN design based on MobileNetV3 result in models outperforming previous solutions in terms of parameter and computational efficiency and prediction performance. We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of .483 mAP on AudioSet. Source Code available at: https://github.com/fschmid56/EfficientAT

Authors (3)
  1. Florian Schmid (16 papers)
  2. Khaled Koutini (20 papers)
  3. Gerhard Widmer (144 papers)
Citations (45)

Summary

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

The paper "Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation" presents an innovative approach to improving the efficiency and performance of Convolutional Neural Networks (CNNs) in the domain of Audio Tagging (AT). The research addresses the computational inefficiencies of current Transformer models, which, although state-of-the-art in AT, demand substantial resources due to their global self-attention mechanisms and large parameter sizes. This work leverages a Knowledge Distillation (KD) strategy to transfer the complex knowledge of Transformer models to more efficient CNN architectures based on MobileNetV3.

Summary of Methodology

The core of the proposed approach is to use KD to train CNN students on the outputs of high-performing Transformer ensembles, thereby capturing the advantages of Transformers without incurring their computational cost at inference time. The distillation is performed offline: the authors pre-compute the Transformer ensemble's predictions on the AudioSet dataset, a large-scale collection of audio clips with multiple class labels per clip, and train the CNN student models against these soft targets in combination with the ground-truth labels, as sketched below.
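
To make this concrete, the following PyTorch-style sketch shows what such an offline distillation objective could look like for multi-label tagging. The function name, the weighting factor `kd_lambda`, and its default value are illustrative assumptions for this summary, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_probs: torch.Tensor,
                      labels: torch.Tensor,
                      kd_lambda: float = 0.1) -> torch.Tensor:
    """Offline KD objective for multi-label audio tagging (illustrative).

    student_logits: raw outputs of the CNN student, shape (batch, n_classes)
    teacher_probs:  pre-computed transformer-ensemble predictions (floats in [0, 1])
    labels:         binary AudioSet ground-truth labels (float tensor, same shape)
    kd_lambda:      weight trading off hard labels vs. teacher predictions
    """
    # Hard-label term: standard multi-label BCE against the ground truth.
    label_loss = F.binary_cross_entropy_with_logits(student_logits, labels)
    # Distillation term: BCE against the teacher's soft predictions,
    # which is valid because BCE accepts soft targets in [0, 1].
    kd_loss = F.binary_cross_entropy_with_logits(student_logits, teacher_probs)
    return kd_lambda * label_loss + (1.0 - kd_lambda) * kd_loss
```

Because the teacher predictions are computed once and stored, the expensive Transformer ensemble never has to run during student training, which is what makes the offline schema cheap to iterate on.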

The paper specifically adopts MobileNetV3 as the student network architecture. MobileNetV3, known for its small parameter count and low computational demand, serves as the foundation for a family of models scaled across complexity levels in this work. The distilled CNNs reduce the parameter count roughly tenfold compared to state-of-the-art Transformers and substantially cut the number of multiply-accumulate operations, making them feasible for deployment on edge devices. A sketch of how such a student might be set up follows.
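
As a minimal sketch, here is how a MobileNetV3 backbone could be adapted to single-channel mel-spectrogram input and the 527 AudioSet classes, assuming torchvision's `mobilenet_v3_large` as the base. The paper's actual models are their own width-scaled variants, so treat the details below as illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

def make_student(num_classes: int = 527) -> nn.Module:
    """Build a MobileNetV3-style student for mel-spectrogram input."""
    net = mobilenet_v3_large()
    # Mel spectrograms have one channel, not three RGB channels,
    # so swap the stem convolution for a 1-channel version.
    old = net.features[0][0]
    net.features[0][0] = nn.Conv2d(1, old.out_channels,
                                   kernel_size=old.kernel_size,
                                   stride=old.stride,
                                   padding=old.padding,
                                   bias=False)
    # Replace the ImageNet head with an AudioSet classifier.
    net.classifier[-1] = nn.Linear(net.classifier[-1].in_features, num_classes)
    return net

model = make_student()
dummy = torch.randn(1, 1, 128, 1000)  # (batch, channel, mel bins, time frames)
print(model(dummy).shape)             # torch.Size([1, 527])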

Results and Impact

The empirical results underscore the efficacy of the proposed KD-based training procedure. The distilled CNN models are not only parameter- and compute-efficient but also set a new performance benchmark with a mean Average Precision (mAP) of 0.483 on AudioSet. This score improves on existing models and demonstrates that the proposed CNNs can match, and even surpass, computationally expensive Transformers.
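
For reference, mAP on AudioSet is the macro average of the per-class average precision over the 527 classes. A minimal sketch of the computation with scikit-learn, using random data in place of real model outputs, is shown below.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# preds: (clips, classes) sigmoid scores; targets: binary ground truth.
rng = np.random.default_rng(0)
targets = rng.integers(0, 2, size=(100, 527))
preds = rng.random(size=(100, 527))

# mAP = mean over classes of the per-class average precision.
mAP = average_precision_score(targets, preds, average="macro")
print(f"mAP: {mAP:.3f}")
```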

The contribution extends to a framework facilitating the distillation process, bridging the gap between efficient CNN designs and high-performing Transformer models. The practical implications of this innovation are substantial; it opens pathways for deploying AT models in resource-constrained environments without compromising performance. This advancement holds potential for applications in mobile devices and real-time audio processing where computational efficiency is paramount.

Theoretical Insights and Future Directions

From a theoretical perspective, the paper highlights the synergy between CNNs and Transformers via KD. The findings support teacher-student paradigms as a way to maximize the generalization ability of efficient model architectures, and they add to the broader discussion of how knowledge transfers between different families of neural networks.

Future work could involve applying these distilled models to domains of audio analysis beyond AT. Additionally, there is potential to investigate the quality of the embeddings produced by distilled CNNs relative to those derived from Transformer networks, possibly leading to further optimizations in audio feature extraction and representation learning.

The paper contributes significantly to the literature on efficient machine learning model design, offering a pragmatic solution to the challenge of balancing performance and computational resource constraints in large-scale audio tagging tasks.