
Frequency-aware convolution for sound event detection (2403.13252v2)

Published 20 Mar 2024 in cs.SD and eess.AS

Abstract: In sound event detection (SED), convolutional neural networks (CNNs) are widely employed to extract time-frequency (TF) patterns from spectrograms. However, because CNNs are translation-equivariant, they are insensitive to shifts of TF patterns along the frequency dimension, which limits their ability to distinguish different sound events. To address this issue, frequency dynamic convolution (FDY) was proposed, which applies separate convolution kernels to different frequency components. However, FDY requires significantly more parameters and computation than a standard CNN. This paper proposes a more efficient alternative called frequency-aware convolution (FAC). FAC encodes frequency positional information in a vector that is explicitly added to the input spectrogram. To match the amplitude of the encoding vector to that of the input spectrogram, the vector is adaptively and channel-dependently scaled using self-attention. We evaluate FAC on DCASE 2023 Task 4. The results show that FAC achieves performance comparable to FDY while adding only 515 parameters, whereas FDY adds 8.02 million. Furthermore, an ablation study confirms that the adaptive, channel-dependent scaling of the encoding vector is critical to FAC's performance.
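
The abstract describes the mechanism only at a high level, so the following PyTorch sketch is an illustration rather than the authors' implementation: the learnable per-bin encoding, the squeeze-and-excitation-style gating standing in for the "adaptive, channel-dependent scaling", and the class name FrequencyAwareConv2d are all assumptions.

```python
import torch
import torch.nn as nn

class FrequencyAwareConv2d(nn.Module):
    """Minimal sketch of frequency-aware convolution (FAC).

    A learnable frequency-positional encoding is adaptively scaled per
    channel and added to the input spectrogram before a standard 2-D
    convolution. The gating used for the scaling (global average pooling
    followed by a sigmoid-gated linear layer, in the spirit of
    squeeze-and-excitation attention) is an assumption, not the paper's
    exact design.
    """

    def __init__(self, in_channels, out_channels, n_freq_bins, kernel_size=3):
        super().__init__()
        # One learnable encoding value per frequency bin.
        self.freq_encoding = nn.Parameter(torch.randn(n_freq_bins))
        # Predicts a per-channel scale in (0, 1) from the input itself,
        # so the encoding's amplitude tracks the input's amplitude.
        self.scale = nn.Sequential(
            nn.Linear(in_channels, in_channels),
            nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, channels, freq, time); freq must equal n_freq_bins.
        b, c, _, _ = x.shape
        pooled = x.mean(dim=(2, 3))                  # (b, c) channel statistics
        gamma = self.scale(pooled).view(b, c, 1, 1)  # per-channel scales
        enc = self.freq_encoding.view(1, 1, -1, 1)   # (1, 1, freq, 1)
        # Inject frequency position additively, then convolve as usual.
        return self.conv(x + gamma * enc)

# Example: drop-in replacement for the first convolution of a CRNN
# front end operating on 128-bin mel spectrograms.
fac = FrequencyAwareConv2d(in_channels=1, out_channels=16, n_freq_bins=128)
out = fac(torch.randn(4, 1, 128, 250))  # -> (4, 16, 128, 250)
```

Note that the gating module above costs on the order of C^2 parameters; given the reported overhead of only 515 parameters, the paper's actual scaling mechanism is presumably lighter. The key idea survives either way: frequency position is injected additively into the features, so a standard translation-equivariant convolution can still distinguish patterns by where they fall along the frequency axis.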

