
Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation (2211.04772v3)

Published 9 Nov 2022 in cs.SD, cs.LG, and eess.AS

Abstract: Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers. The proposed training schema and the efficient CNN design based on MobileNetV3 result in models outperforming previous solutions in terms of parameter and computational efficiency and prediction performance. We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of .483 mAP on AudioSet. Source Code available at: https://github.com/fschmid56/EfficientAT

Authors (3)
  1. Florian Schmid (16 papers)
  2. Khaled Koutini (20 papers)
  3. Gerhard Widmer (144 papers)
Citations (45)

Summary

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

The paper "Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation" presents an innovative approach to improving the efficiency and performance of Convolutional Neural Networks (CNNs) in the domain of Audio Tagging (AT). The research addresses the computational inefficiencies of current Transformer models, which, although state-of-the-art in AT, demand substantial resources due to their global self-attention mechanisms and large parameter sizes. This work leverages a Knowledge Distillation (KD) strategy to transfer the complex knowledge of Transformer models to more efficient CNN architectures based on MobileNetV3.

Summary of Methodology

The core of the proposed approach is to use KD to train CNN students on the outputs of high-performing Transformer ensembles, thereby capturing the advantages of Transformers without incurring their computational cost at inference time. The distillation is performed offline: the authors pre-compute the Transformer ensemble's predictions on the AudioSet dataset, a large-scale collection of audio clips with multiple class labels per clip, and train the CNN student models against these soft targets in combination with the ground-truth labels, as sketched below.
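
To make this concrete, the following PyTorch-style sketch shows what such an offline distillation objective could look like for multi-label tagging. The function name, the weighting factor `kd_lambda`, and its default value are illustrative assumptions for this summary, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_probs: torch.Tensor,
                      labels: torch.Tensor,
                      kd_lambda: float = 0.1) -> torch.Tensor:
    """Offline KD objective for multi-label audio tagging (illustrative).

    student_logits: raw outputs of the CNN student, shape (batch, n_classes)
    teacher_probs:  pre-computed transformer-ensemble predictions (floats in [0, 1])
    labels:         binary AudioSet ground-truth labels (float tensor, same shape)
    kd_lambda:      weight trading off hard labels vs. teacher predictions
    """
    # Hard-label term: standard multi-label BCE against the ground truth.
    label_loss = F.binary_cross_entropy_with_logits(student_logits, labels)
    # Distillation term: BCE against the teacher's soft predictions,
    # which is valid because BCE accepts soft targets in [0, 1].
    kd_loss = F.binary_cross_entropy_with_logits(student_logits, teacher_probs)
    return kd_lambda * label_loss + (1.0 - kd_lambda) * kd_loss
```

Because the teacher predictions are computed once and stored, the expensive Transformer ensemble never has to run during student training, which is what makes the offline schema cheap to iterate on.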

The paper specifically adopts MobileNetV3 as the student network architecture. MobileNetV3, known for its small parameter count and low computational demand, serves as the foundation for a family of models scaled across complexity levels in this work. The distilled CNNs reduce the parameter count roughly tenfold compared to state-of-the-art Transformers and substantially cut the number of multiply-accumulate operations, making them feasible for deployment on edge devices. A sketch of how such a student might be set up follows.
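
As a minimal sketch, here is how a MobileNetV3 backbone could be adapted to single-channel mel-spectrogram input and the 527 AudioSet classes, assuming torchvision's `mobilenet_v3_large` as the base. The paper's actual models are their own width-scaled variants, so treat the details below as illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

def make_student(num_classes: int = 527) -> nn.Module:
    """Build a MobileNetV3-style student for mel-spectrogram input."""
    net = mobilenet_v3_large()
    # Mel spectrograms have one channel, not three RGB channels,
    # so swap the stem convolution for a 1-channel version.
    old = net.features[0][0]
    net.features[0][0] = nn.Conv2d(1, old.out_channels,
                                   kernel_size=old.kernel_size,
                                   stride=old.stride,
                                   padding=old.padding,
                                   bias=False)
    # Replace the ImageNet head with an AudioSet classifier.
    net.classifier[-1] = nn.Linear(net.classifier[-1].in_features, num_classes)
    return net

model = make_student()
dummy = torch.randn(1, 1, 128, 1000)  # (batch, channel, mel bins, time frames)
print(model(dummy).shape)             # torch.Size([1, 527])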

Results and Impact

The empirical results underscore the efficacy of the proposed KD-based training procedure. The distilled CNN models are not only parameter- and compute-efficient but also set a new performance benchmark with a mean Average Precision (mAP) of 0.483 on AudioSet. This score improves on existing models and demonstrates that the proposed CNNs can match, and even surpass, computationally expensive Transformers.
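
For reference, mAP on AudioSet is the macro average of the per-class average precision over the 527 classes. A minimal sketch of the computation with scikit-learn, using random data in place of real model outputs, is shown below.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# preds: (clips, classes) sigmoid scores; targets: binary ground truth.
rng = np.random.default_rng(0)
targets = rng.integers(0, 2, size=(100, 527))
preds = rng.random(size=(100, 527))

# mAP = mean over classes of the per-class average precision.
mAP = average_precision_score(targets, preds, average="macro")
print(f"mAP: {mAP:.3f}")
```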

The contribution extends to a framework facilitating the distillation process, bridging the gap between efficient CNN designs and high-performing Transformer models. The practical implications of this innovation are substantial; it opens pathways for deploying AT models in resource-constrained environments without compromising performance. This advancement holds potential for applications in mobile devices and real-time audio processing where computational efficiency is paramount.

Theoretical Insights and Future Directions

From a theoretical perspective, the paper highlights the synergy between CNNs and Transformers via KD. The findings support teacher-student paradigms as a way to maximize the generalization ability of efficient model architectures, and they add to the broader discussion of how knowledge transfers between different families of neural networks.

Future work could involve applying these distilled models to domains of audio analysis beyond AT. Additionally, there is potential to investigate the quality of the embeddings produced by distilled CNNs relative to those derived from Transformer networks, possibly leading to further optimizations in audio feature extraction and representation learning.

The paper contributes significantly to the literature on efficient machine learning model design, offering a pragmatic solution to the challenge of balancing performance and computational resource constraints in large-scale audio tagging tasks.