Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation (2211.04772v3)
Abstract: Audio Spectrogram Transformer models rule the field of Audio Tagging, outperforming the previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex Transformers. The proposed training scheme and the efficient CNN design based on MobileNetV3 result in models that outperform previous solutions in terms of parameter efficiency, computational efficiency, and prediction performance. We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of 0.483 mAP on AudioSet. Source code is available at: https://github.com/fschmid56/EfficientAT
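The offline KD setup described in the abstract trains the efficient CNN student against a mixture of the ground-truth labels and the pre-computed predictions of a Transformer teacher. As a minimal sketch of this idea (not the authors' implementation), the training objective can be expressed as a weighted sum of two binary cross-entropy terms, one against the labels and one against the teacher's probabilities; the helper names and the weight `lam` here are illustrative assumptions:

```python
import math

def bce(pred, target):
    """Mean binary cross-entropy between predicted probabilities and targets
    (multi-label setting, as used for AudioSet tagging)."""
    eps = 1e-7  # clamp to avoid log(0)
    return -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(pred, target)
    ) / len(pred)

def distillation_loss(student_probs, labels, teacher_probs, lam=0.1):
    """Hypothetical KD objective: lam weights the label loss, (1 - lam)
    weights the distillation loss against the (offline) teacher predictions.
    The value of lam is an assumption, not taken from the paper."""
    return lam * bce(student_probs, labels) + (1 - lam) * bce(student_probs, teacher_probs)
```

With `lam=1.0` the objective reduces to ordinary supervised training; lowering `lam` shifts the student toward matching the teacher's soft predictions, which is what allows the large Transformer's knowledge to be transferred to the small CNN.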
Authors: Florian Schmid, Khaled Koutini, Gerhard Widmer