Proposal-based Few-shot Sound Event Detection for Speech and Environmental Sounds with Perceivers (arXiv:2107.13616v2)
Abstract: Many applications involve detecting and localizing specific sound events within long, untrimmed audio documents, including keyword spotting, medical observation, and bioacoustic monitoring for conservation. Deep learning techniques often set the state of the art for these tasks. However, for some types of events there is insufficient labeled data to train such models. In this paper, we propose a region proposal-based approach to few-shot sound event detection that utilizes the Perceiver architecture. Motivated by a lack of suitable benchmark datasets, we generate two new few-shot sound event localization datasets: "Vox-CASE," using clips of celebrity speech as the sound event, and "ESC-CASE," using environmental sound events. Our highest-performing proposed few-shot approaches achieve F1-scores of 0.483 and 0.418, respectively, on 5-shot, 5-way tasks on these two datasets. These represent relative improvements of 72.5% and 11.2% over strong proposal-free few-shot sound event detection baselines.
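For context, the reported relative improvements can be mapped back to the implied F1-scores of the proposal-free baselines. The following is a minimal sketch (not from the paper), assuming the conventional definition of relative improvement, (proposed - baseline) / baseline; the helper name is ours for illustration:

```python
# Sketch (not from the paper): recover the implied proposal-free baseline
# F1-scores from the reported relative improvements, assuming
# relative_improvement = (proposed - baseline) / baseline.

def implied_baseline(proposed_f1: float, relative_improvement: float) -> float:
    """Baseline F1 consistent with the stated relative improvement."""
    return proposed_f1 / (1.0 + relative_improvement)

for dataset, f1, rel in [("Vox-CASE", 0.483, 0.725), ("ESC-CASE", 0.418, 0.112)]:
    print(f"{dataset}: proposed F1 {f1:.3f} -> implied baseline F1 "
          f"{implied_baseline(f1, rel):.3f}")

# Expected output:
# Vox-CASE: proposed F1 0.483 -> implied baseline F1 0.280
# ESC-CASE: proposed F1 0.418 -> implied baseline F1 0.376
```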
Authors: Piper Wolters, Logan Sizemore, Chris Daw, Brian Hutchinson, Lauren Phillips