
Proposal-based Few-shot Sound Event Detection for Speech and Environmental Sounds with Perceivers (2107.13616v2)

Published 28 Jul 2021 in eess.AS, cs.NE, and cs.SD

Abstract: Many applications involve detecting and localizing specific sound events within long, untrimmed documents, including keyword spotting, medical observation, and bioacoustic monitoring for conservation. Deep learning techniques often set the state-of-the-art for these tasks. However, for some types of events, there is insufficient labeled data to train such models. In this paper, we propose a region proposal-based approach to few-shot sound event detection utilizing the Perceiver architecture. Motivated by a lack of suitable benchmark datasets, we generate two new few-shot sound event localization datasets: "Vox-CASE," using clips of celebrity speech as the sound event, and "ESC-CASE," using environmental sound events. Our highest performing proposed few-shot approaches achieve 0.483 and 0.418 F1-score, respectively, with 5-shot 5-way tasks on these two datasets. These represent relative improvements of 72.5% and 11.2% over strong proposal-free few-shot sound event detection baselines.
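
The abstract describes a region proposal-based few-shot approach built on the Perceiver architecture and evaluated on 5-shot 5-way episodes. The snippet below is a minimal PyTorch sketch of those two ingredients: a Perceiver-style encoder in which a small set of learned latents cross-attends to the spectrogram frames of a candidate region, and prototypical scoring of query regions against mean support embeddings. It illustrates the general technique only, not the authors' implementation; the class names, dimensions, and episode construction are hypothetical.

```python
import torch
import torch.nn as nn

class PerceiverEncoder(nn.Module):
    """Hypothetical Perceiver-style encoder: learned latents cross-attend to
    spectrogram frames, self-attend, and are mean-pooled into a fixed-size
    embedding for a (proposed) audio region."""
    def __init__(self, input_dim=64, latent_dim=128, num_latents=16, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.input_proj = nn.Linear(input_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, frames):                       # frames: (B, T, input_dim)
        x = self.input_proj(frames)                  # (B, T, latent_dim)
        lat = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        lat, _ = self.cross_attn(lat, x, x)          # latents attend to frames
        lat, _ = self.self_attn(lat, lat, lat)       # latent self-attention
        return lat.mean(dim=1)                       # (B, latent_dim) embedding

def prototype_logits(support_emb, support_labels, query_emb, n_way=5):
    """Prototypical-network-style scoring: class prototypes are the mean
    support embeddings; queries are scored by negative Euclidean distance."""
    protos = torch.stack([support_emb[support_labels == c].mean(0)
                          for c in range(n_way)])    # (n_way, D)
    return -torch.cdist(query_emb, protos)           # (num_queries, n_way)

# Toy 5-way 5-shot episode on random "log-mel" regions (shapes are illustrative).
enc = PerceiverEncoder()
support = torch.randn(25, 100, 64)                   # 5 classes x 5 shots, 100 frames
labels = torch.arange(5).repeat_interleave(5)
queries = torch.randn(10, 100, 64)                   # proposed regions to classify
logits = prototype_logits(enc(support), labels, enc(queries))
pred = logits.argmax(dim=1)                          # predicted class per proposal
```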

Authors (5)
  1. Piper Wolters (8 papers)
  2. Logan Sizemore (4 papers)
  3. Chris Daw (4 papers)
  4. Brian Hutchinson (22 papers)
  5. Lauren Phillips (4 papers)
Citations (10)
