Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection (2312.09034v1)
Abstract: Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has recently been included. Few audio-visual (AV)-SELD works have been published, and most employ vision via face/object bounding boxes or human pose keypoints. In contrast, we explore the integration of audio and visual feature embeddings extracted with pre-trained deep networks. For the visual modality, we tested ResNet50 and Inflated 3D ConvNet (I3D). Our comparison of AV fusion methods includes the AV-Conformer and the Cross-Modal Attentive Fusion (CMAF) model. Our best models outperform the DCASE 2023 Task 3 audio-only and AV baselines by a wide margin on the development set of the STARSS23 dataset, making them competitive with state-of-the-art results in the AV challenge, without model ensembling, heavy data augmentation, or prediction post-processing. Such techniques and further pre-training could be applied as next steps to improve performance.