CST-former: Transformer with Channel-Spectro-Temporal Attention for Sound Event Localization and Detection (2312.12821v1)

Published 20 Dec 2023 in eess.AS and cs.SD

Abstract: Sound event localization and detection (SELD) is the joint task of classifying sound events and localizing their direction of arrival (DoA) from multichannel acoustic signals. Prior studies employ spectral and channel information only as the embedding for temporal attention, which limits the network's ability to extract meaningful features from the spectral or spatial domains. We therefore present a novel framework, the Channel-Spectro-Temporal Transformer (CST-former), that improves SELD performance by applying separate attention mechanisms to independently process channel, spectral, and temporal information. In addition, we propose an unfolded local embedding (ULE) technique for channel attention (CA) that generates informative embedding vectors containing local spectral and temporal information. Experiments on the 2022 and 2023 DCASE Challenge Task 3 datasets confirm the efficacy of separating attention across the three domains and the benefit of ULE in enhancing SELD performance.
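The core idea of applying attention independently along the channel, spectral, and temporal axes can be sketched as follows. This is an illustrative toy sketch under simplifying assumptions (single-head attention with identity query/key/value projections, no ULE, no multi-head layers or residual connections); all function names are ours, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Single-head scaled dot-product self-attention over axis 0 of a
    # (seq, dim) array; Q/K/V projections are omitted for brevity.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores) @ x

def cst_attention(x):
    """Apply attention independently along each domain of a SELD feature map.

    x: (C, T, F, D) array -- channels, time frames, frequency bins, embed dim.
    Illustrative sketch of divided attention, not the exact CST-former block.
    """
    C, T, F, D = x.shape
    # Channel attention: attend across C for every (t, f) position.
    xc = x.transpose(1, 2, 0, 3).reshape(T * F, C, D)
    xc = np.stack([attention(s) for s in xc])
    xc = xc.reshape(T, F, C, D).transpose(2, 0, 1, 3)
    # Spectral attention: attend across F for every (c, t) position.
    xf = xc.reshape(C * T, F, D)
    xf = np.stack([attention(s) for s in xf]).reshape(C, T, F, D)
    # Temporal attention: attend across T for every (c, f) position.
    xt = xf.transpose(0, 2, 1, 3).reshape(C * F, T, D)
    xt = np.stack([attention(s) for s in xt])
    return xt.reshape(C, F, T, D).transpose(0, 2, 1, 3)
```

Because each attention operates over only one axis at a time, the network can learn domain-specific relations without flattening spectral and channel information into a single temporal embedding.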
