Channel-Spatial-Based Few-Shot Bird Sound Event Detection (2306.10499v2)
Abstract: In this paper, we propose a model for bird sound event detection that targets the scarcity of labelled training samples typical of real-world long-tailed distributions, and we therefore study the task under the few-shot learning paradigm. By integrating channel and spatial attention mechanisms, the model learns improved feature representations from few-shot training data. Specifically, we build a Metric Channel-Spatial Network by incorporating a Channel-Spatial Squeeze-Excitation block into a prototypical network, combining both attention mechanisms. Evaluated on the DCASE 2022 Task 5 benchmark, the Metric Channel-Spatial Network achieves an F-measure of 66.84% and a PSDS of 58.98%. Our experiments demonstrate that combining channel and spatial attention effectively enhances bird sound classification and detection.
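The core architectural idea in the abstract is a concurrent channel-spatial squeeze-excitation (scSE) block: one branch squeezes the spatial dimensions and gates channels, the other squeezes channels with a 1x1 projection and gates spatial positions, and the two recalibrated maps are fused. The following is a minimal NumPy sketch of that idea only; it is not the authors' implementation, and the weight shapes (`w1`, `w2`, `w_spatial`) are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def scse_block(x, w1, w2, w_spatial):
    """Concurrent channel & spatial squeeze-excitation on a (C, H, W) feature map.

    w1: (C//r, C) and w2: (C, C//r) form the channel-excitation bottleneck;
    w_spatial: (C,) plays the role of a 1x1 conv across channels.
    """
    # --- channel SE: squeeze spatial dims, excite channels ---
    z = x.mean(axis=(1, 2))                      # (C,) global average pool
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))    # (C,) per-channel gates
    x_cse = x * s[:, None, None]
    # --- spatial SE: project channels to one map, excite positions ---
    q = sigmoid(np.tensordot(w_spatial, x, axes=([0], [0])))  # (H, W) gates
    x_sse = x * q[None, :, :]
    # fuse the two recalibrated feature maps
    return x_cse + x_sse
```

With all-zero weights every gate is sigmoid(0) = 0.5, so the fused output equals the input, which gives a quick sanity check that the gating and fusion are wired correctly.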