AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks (2401.10544v1)

Published 19 Jan 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Recently, Transformers have been introduced into the field of acoustics recognition. Pre-trained on large-scale datasets with supervised or semi-supervised learning, they demonstrate robust generality: they fine-tune easily to downstream tasks and deliver more robust performance. However, the predominant fine-tuning method is still full fine-tuning, which updates all parameters during training. This not only incurs significant memory and time costs but also compromises the model's generality. Other fine-tuning methods either struggle to address these issues or fail to match its performance. We therefore conducted a comprehensive analysis of existing fine-tuning methods and propose an efficient fine-tuning approach based on Adapter tuning, namely AAT. The core idea is to freeze the audio Transformer model and insert extra learnable Adapters, efficiently acquiring downstream-task knowledge without compromising the model's original generality. Extensive experiments show that our method achieves performance comparable or even superior to full fine-tuning while optimizing only 7.118% of the parameters, and that it outperforms other fine-tuning methods.
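The core mechanism the abstract describes (freeze the pretrained backbone, train only small inserted modules plus the task head) can be illustrated with a minimal PyTorch sketch. This is not the paper's exact design: the bottleneck width, the zero-initialization, and the assumption that the backbone exposes timm-style `blocks` each containing an `mlp` submodule are all illustrative choices.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: down-project -> GELU -> up-project, added
    # residually, so at initialization the module is an identity map.
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start from pretrained behavior
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def adapt_and_freeze(model, dim, bottleneck=64):
    # Freeze every pretrained parameter of the audio Transformer.
    for p in model.parameters():
        p.requires_grad = False
    # Insert an Adapter after each block's MLP (assumes a timm-style
    # backbone exposing `model.blocks`, each with an `mlp` submodule).
    for blk in model.blocks:
        blk.mlp = nn.Sequential(blk.mlp, Adapter(dim, bottleneck))
    return model

Only parameters with requires_grad=True (the Adapters and any new classification head) are then passed to the optimizer, e.g. torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-3), which is what keeps the trainable fraction down to a few percent of the total.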

Authors (4)
  1. Yun Liang (42 papers)
  2. Hai Lin (200 papers)
  3. Shaojian Qiu (3 papers)
  4. Yihang Zhang (18 papers)
Citations (1)

