Fine-tune the pretrained ATST model for sound event detection (2309.08153v2)

Published 15 Sep 2023 in eess.AS and cs.SD

Abstract: Sound event detection (SED) often suffers from a data deficiency problem. The recent baseline system of DCASE2023 challenge task 4 leverages large pretrained self-supervised learning (SelfSL) models to mitigate this restriction, where the pretrained models help to produce more discriminative features for SED. However, the pretrained models are treated as frozen feature extractors in the challenge baseline system and in most of the challenge submissions, and fine-tuning of the pretrained models has rarely been studied. In this work, we study how to fine-tune pretrained models for SED. We first introduce ATST-Frame, our newly proposed SelfSL model, into the SED system. ATST-Frame was specifically designed for learning frame-level representations of audio signals and obtained state-of-the-art (SOTA) performance on a series of downstream tasks. We then propose a fine-tuning method for ATST-Frame that uses both (in-domain) unlabelled and labelled SED data. Our experiments show that the proposed method overcomes the overfitting problem that arises when fine-tuning the large pretrained network, and our SED system obtains new SOTA results of 0.587/0.812 PSDS1/PSDS2 scores on the DCASE challenge task 4 dataset.
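
The setup the abstract describes can be sketched in a few dozen lines: a pretrained frame-level encoder, a lightweight SED head on top, a supervised loss on labelled clips, and a consistency loss on in-domain unlabelled clips. The sketch below is a minimal illustration, not the authors' implementation: `FrameEncoder` is a tiny stand-in for the real ATST-Frame network (whose API is not shown in this page), the consistency objective is the mean-teacher scheme used by the DCASE task 4 baseline, and all layer sizes, learning rates, and loss weights are assumed placeholder values.

```python
# Minimal sketch (assumptions noted above) of semi-supervised fine-tuning
# of a pretrained frame-level audio encoder for SED, PyTorch.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Stand-in for the pretrained ATST-Frame encoder (hypothetical).
    Maps a mel spectrogram (B, T, n_mels) to frame embeddings (B, T, dim)."""
    def __init__(self, n_mels=64, dim=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        return self.encoder(self.proj(x))

class SEDModel(nn.Module):
    """Pretrained encoder plus a new frame-level classification head."""
    def __init__(self, encoder, dim=256, n_classes=10):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(dim, n_classes)  # per-frame event posteriors

    def forward(self, x):
        return torch.sigmoid(self.head(self.encoder(x)))

student = SEDModel(FrameEncoder())
teacher = copy.deepcopy(student)          # EMA copy, never trained directly
for p in teacher.parameters():
    p.requires_grad_(False)

# Using a much smaller learning rate on the pretrained encoder than on the
# freshly initialised head is a common guard against overfitting when
# fine-tuning a large pretrained network (values are assumptions).
opt = torch.optim.Adam([
    {"params": student.encoder.parameters(), "lr": 1e-5},
    {"params": student.head.parameters(),    "lr": 1e-3},
])

def train_step(lab_x, lab_y, unlab_x, ema_decay=0.999, w_cons=2.0):
    # Supervised BCE on labelled frames.
    sup = F.binary_cross_entropy(student(lab_x), lab_y)
    # Mean-teacher consistency on unlabelled, in-domain audio
    # (input augmentations omitted for brevity).
    with torch.no_grad():
        target = teacher(unlab_x)
    cons = F.mse_loss(student(unlab_x), target)
    loss = sup + w_cons * cons
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Exponential-moving-average update of the teacher weights.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema_decay).add_(ps, alpha=1 - ema_decay)
    return loss.item()

# Toy shapes: 4 clips, 100 frames, 64 mel bins, 10 event classes.
loss = train_step(torch.randn(4, 100, 64),
                  torch.rand(4, 100, 10).round(),
                  torch.randn(4, 100, 64))
```

The two optimizer parameter groups are the key fine-tuning knob here: with a single shared learning rate, the pretrained encoder tends to drift away from its pretrained representations and overfit the small labelled set, which is the failure mode the paper's method is designed to avoid.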
