Fine-tune the pretrained ATST model for sound event detection (2309.08153v2)
Abstract: Sound event detection (SED) often suffers from a data deficiency problem. The recent baseline system of DCASE2023 challenge task 4 leverages large pretrained self-supervised learning (SelfSL) models to mitigate this restriction: the pretrained models help produce more discriminative features for SED. However, the pretrained models are used as frozen feature extractors in the challenge baseline system and in most challenge submissions, and fine-tuning them has rarely been studied. In this work, we study how to fine-tune pretrained models for SED. We first introduce ATST-Frame, our recently proposed SelfSL model, into the SED system. ATST-Frame was designed specifically for learning frame-level representations of audio signals and obtained state-of-the-art (SOTA) performance on a series of downstream tasks. We then propose a fine-tuning method for ATST-Frame that uses both (in-domain) unlabelled and labelled SED data. Our experiments show that the proposed method overcomes the overfitting problem that arises when fine-tuning a large pretrained network, and our SED system obtains new SOTA results of 0.587/0.812 PSDS1/PSDS2 scores on the DCASE challenge task 4 dataset.
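The abstract does not spell out the training recipe, but the ingredients it names (a pretrained frame-level encoder, labelled and unlabelled in-domain data, and controlling overfitting) map onto the semi-supervised mean-teacher setup standard in DCASE task 4 systems. Below is a minimal PyTorch sketch of that pattern, assuming a mean-teacher consistency loss on unlabelled clips and a smaller learning rate for the pretrained encoder than for the new classification head. `FrameEncoder` is a hypothetical stand-in for a pretrained model such as ATST-Frame, and all loss weights and hyperparameters are illustrative placeholders, not the paper's actual configuration.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a pretrained frame-level encoder such as ATST-Frame;
# any module mapping (batch, time, feat) -> (batch, time, embed) fits here.
class FrameEncoder(nn.Module):
    def __init__(self, n_mels=64, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, embed_dim)

    def forward(self, x):                # x: (batch, time, n_mels)
        return torch.relu(self.proj(x))  # (batch, time, embed_dim)

class SEDModel(nn.Module):
    def __init__(self, encoder, embed_dim=256, n_classes=10):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(embed_dim, n_classes)  # new frame-level classifier

    def forward(self, x):
        # Frame-wise event probabilities: (batch, time, n_classes)
        return torch.sigmoid(self.head(self.encoder(x)))

student = SEDModel(FrameEncoder())
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher is updated only by EMA, never by gradients

# A smaller learning rate for the pretrained encoder than for the new head is a
# common recipe for limiting overfitting when unfreezing a large model.
optim = torch.optim.Adam([
    {"params": student.encoder.parameters(), "lr": 1e-5},
    {"params": student.head.parameters(), "lr": 1e-3},
])

def train_step(x_lab, y_lab, x_unlab, ema_decay=0.999, w_cons=2.0):
    """One semi-supervised step: BCE on labelled clips plus a mean-teacher
    consistency loss on unlabelled clips."""
    sup = F.binary_cross_entropy(student(x_lab), y_lab)
    with torch.no_grad():
        target = teacher(x_unlab)                # teacher pseudo-targets
    cons = F.mse_loss(student(x_unlab), target)  # consistency loss
    loss = sup + w_cons * cons
    optim.zero_grad()
    loss.backward()
    optim.step()
    # Exponential moving average update of the teacher weights.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema_decay).add_(ps, alpha=1 - ema_decay)
    return loss.item()

# Toy usage: 8 labelled and 8 unlabelled 100-frame log-mel clips.
x_l = torch.randn(8, 100, 64)
y_l = torch.randint(0, 2, (8, 100, 10)).float()
x_u = torch.randn(8, 100, 64)
print(train_step(x_l, y_l, x_u))
```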