
SwG-former: A Sliding-Window Graph Convolutional Network for Simultaneous Spatial-Temporal Information Extraction in Sound Event Localization and Detection

Published 21 Oct 2023 in eess.AS | (2310.14016v3)

Abstract: Sound event localization and detection (SELD) involves sound event detection (SED) and direction of arrival (DoA) estimation tasks. SED mainly relies on temporal dependencies to distinguish different sound classes, while DoA estimation depends on spatial correlations to estimate source directions. This paper addresses the need to simultaneously extract spatial-temporal information from audio signals to improve SELD performance. A novel block, the sliding-window graph-former (SwG-former), is designed to learn temporal context information of sound events based on their spatial correlations. The SwG-former block transforms audio signals into a graph representation and constructs graph vertices to capture higher abstraction levels for spatial correlations. It uses different-sized sliding windows to adapt to various sound event durations and aggregates temporal features with similar spatial information, while incorporating multi-head self-attention (MHSA) to model global information. Furthermore, as the cornerstone of message passing, a robust Conv2dAgg function is proposed and embedded into the block to aggregate the features of neighbor vertices. As a result, a SwG-former model, which stacks the SwG-former blocks, demonstrates superior performance compared to recent advanced SELD models. The SwG-former block is also integrated into the event-independent network version 2 (EINV2), yielding SwG-EINV2, which surpasses the state-of-the-art (SOTA) methods under the same acoustic environment.
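
The sliding-window idea in the abstract can be illustrated with a small sketch. This is not the paper's SwG-former block or its Conv2dAgg function, whose details are not given here. It assumes each time frame is one graph vertex with a feature vector, picks the k most similar frames inside a local window as neighbors, and swaps in a simple max-relative aggregation (the maximum of neighbor-minus-center differences). The function names `window_knn_aggregate` and `multi_window_features`, the window sizes, and the value of k are all hypothetical.

```python
import numpy as np


def window_knn_aggregate(x, window, k):
    """Aggregate each frame with its k nearest frames inside a sliding window.

    x: array of shape (T, D), one feature vector per time frame (assumed layout).
    Returns an array of shape (T, 2 * D): the frame's own features followed by
    the element-wise maximum of (neighbor - center) differences.
    """
    T, D = x.shape
    half = window // 2
    out = np.zeros((T, 2 * D), dtype=float)
    for t in range(T):
        # Candidate neighbors: frames inside the window, excluding frame t itself.
        lo, hi = max(0, t - half), min(T, t + half + 1)
        idx = np.arange(lo, hi)
        idx = idx[idx != t]
        out[t, :D] = x[t]
        if idx.size == 0:
            continue  # a single-frame input has no neighbors; leave zeros
        # Keep the k candidates closest to frame t in feature space.
        dist = np.linalg.norm(x[idx] - x[t], axis=1)
        nearest = idx[np.argsort(dist)[:k]]
        # Max-relative aggregation, used here as a stand-in for Conv2dAgg.
        out[t, D:] = (x[nearest] - x[t]).max(axis=0)
    return out


def multi_window_features(x, windows=(3, 7), k=2):
    """Concatenate aggregations from several window sizes (short and long events)."""
    return np.concatenate([window_knn_aggregate(x, w, k) for w in windows], axis=1)
```

Calling `multi_window_features(x, windows=(3, 5), k=2)` on a `(6, 2)` input returns a `(6, 8)` array: four values per window size for each frame. Smaller windows restrict neighbors to nearby frames, while larger ones let a frame connect to similar frames further away in time.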

