
Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection (2312.09034v1)

Published 14 Dec 2023 in eess.AS, cs.SD, and eess.IV

Abstract: Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has recently been included. Few audio-visual (AV)-SELD works have been published, and most employ vision via face/object bounding boxes or human pose keypoints. In contrast, we explore the integration of audio and visual feature embeddings extracted with pre-trained deep networks. For the visual modality, we tested ResNet50 and Inflated 3D ConvNet (I3D). Our comparison of AV fusion methods includes the AV-Conformer and Cross-Modal Attentive Fusion (CMAF) model. Our best models outperform the DCASE 2023 Task 3 audio-only and AV baselines by a wide margin on the development set of the STARSS23 dataset, making them competitive amongst state-of-the-art results of the AV challenge, without model ensembling, heavy data augmentation, or prediction post-processing. Such techniques and further pre-training could be applied as next steps to improve performance.
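
To make the idea of fusing audio and visual embeddings concrete, the following is a minimal sketch of generic cross-modal attention in NumPy. It is not the paper's AV-Conformer or CMAF model: the function names, embedding sizes, random projection weights, and the choice to concatenate the attended visual features onto the audio features are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context, w_q, w_k, w_v):
    """Scaled dot-product attention where `query` frames attend to `context` frames."""
    q = query @ w_q      # (T_query, d)
    k = context @ w_k    # (T_context, d)
    v = context @ w_v    # (T_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights


def fuse_audio_visual(audio, visual, d=16, seed=0):
    """Hypothetical fusion: audio frames attend to visual frames, then concatenate.

    audio:  (T_audio, D_audio) embeddings, assumed to come from a pre-trained audio network.
    visual: (T_video, D_video) embeddings, assumed to come from a pre-trained visual network.
    The projection matrices are random here; in a real model they would be learned.
    """
    rng = np.random.default_rng(seed)
    w_q = rng.standard_normal((audio.shape[1], d)) / np.sqrt(audio.shape[1])
    w_k = rng.standard_normal((visual.shape[1], d)) / np.sqrt(visual.shape[1])
    w_v = rng.standard_normal((visual.shape[1], d)) / np.sqrt(visual.shape[1])
    attended, weights = cross_attention(audio, visual, w_q, w_k, w_v)
    fused = np.concatenate([audio, attended], axis=1)  # (T_audio, D_audio + d)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    audio = rng.standard_normal((50, 64))   # 50 audio frames, illustrative size
    visual = rng.standard_normal((10, 32))  # 10 video frames, illustrative size
    fused, weights = fuse_audio_visual(audio, visual)
    print(fused.shape, weights.shape)
```

In this sketch the fused sequence keeps the audio frame rate, so a downstream SELD head (not shown) could predict event activity and direction of arrival per audio frame. That is one possible design choice, not a description of the models compared in the paper.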
