w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training (2312.06907v2)
Abstract: Sound Event Localization and Detection (SELD) constitutes a complex task that depends on extensive multichannel audio recordings with annotated sound events and their respective locations. In this paper, we introduce a self-supervised approach for SELD adapted from the pre-training methodology of wav2vec 2.0, which learns representations directly from raw audio data, eliminating the need for supervision. By applying this approach to SELD, we can leverage a substantial amount of unlabeled 3D audio data to learn robust representations of sound events and their locations. Our method comprises two primary stages: pre-training and fine-tuning. In the pre-training phase, unlabeled 3D audio datasets are used to train our w2v-SELD model, capturing high-level features and contextual information inherent in the audio signals. Subsequently, in the fine-tuning stage, the pre-trained model is fine-tuned on a smaller labeled SELD dataset. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed self-supervised approach for SELD. The model surpasses the baseline systems provided with the datasets and achieves performance competitive with state-of-the-art supervised methods. The code and pre-trained parameters of our w2v-SELD model are available in this repository.
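To make the two-stage recipe concrete, the sketch below shows one way such a pipeline could be structured in PyTorch: a wav2vec 2.0-style convolutional feature encoder adapted to accept 4-channel first-order Ambisonics waveforms, a Transformer context network used for pre-training, and a SELD head attached at fine-tuning time that emits an ACCDOA-like per-frame output. All class names, layer sizes, and the number of sound classes are illustrative assumptions, and the quantization and contrastive pre-training objective of wav2vec 2.0 is omitted for brevity; this is a minimal sketch, not the authors' implementation.

```python
# Minimal sketch of a wav2vec 2.0-style backbone adapted to multichannel
# (first-order Ambisonics) audio for SELD. Names and hyperparameters are
# illustrative assumptions, not the released w2v-SELD code.
import torch
import torch.nn as nn


class MultichannelFeatureEncoder(nn.Module):
    """Strided 1-D convolutions over raw 4-channel audio (wav2vec 2.0 style)."""

    def __init__(self, in_channels: int = 4, dim: int = 512):
        super().__init__()
        blocks, channels = [], in_channels
        # (kernel, stride) pairs roughly following wav2vec 2.0's feature encoder.
        for k, s in [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]:
            blocks += [nn.Conv1d(channels, dim, k, stride=s), nn.GELU()]
            channels = dim
        self.conv = nn.Sequential(*blocks)

    def forward(self, wav):                    # wav: (batch, 4, samples)
        return self.conv(wav).transpose(1, 2)  # (batch, frames, dim)


class ContextNetwork(nn.Module):
    """Transformer context network over the latent frame sequence."""

    def __init__(self, dim: int = 512, num_layers: int = 6, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, z):                      # z: (batch, frames, dim)
        return self.encoder(z)


class SELDHead(nn.Module):
    """Fine-tuning head: per-frame, per-class (x, y, z) vectors, ACCDOA-like."""

    def __init__(self, dim: int = 512, n_classes: int = 14):
        super().__init__()
        self.proj = nn.Linear(dim, 3 * n_classes)

    def forward(self, c):                      # c: (batch, frames, dim)
        return self.proj(c)


if __name__ == "__main__":
    encoder, context, head = MultichannelFeatureEncoder(), ContextNetwork(), SELDHead()
    foa = torch.randn(2, 4, 48000)             # 1 s of 4-channel audio at 48 kHz
    frames = context(encoder(foa))             # representations to be pre-trained
    print(head(frames).shape)                  # (2, frames, 3 * n_classes)
```

In this setup, pre-training would optimize only the encoder and context network on unlabeled 3D audio, while fine-tuning would add the SELD head and train on the smaller labeled dataset.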
- SELD-TCN: Sound event localization & detection via temporal convolutional networks. In 2020 28th European Signal Processing Conference (EUSIPCO), pages 16–20. IEEE, 2021.
- Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1):34–48, 2018.
- L3DAS21 challenge: Machine learning for 3D audio signal processing. In 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2021.
- Sound source detection, localization and classification using consecutive ensemble of CRNN models. arXiv preprint arXiv:1908.00766, 2019.
- SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
- wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
- fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
- Sound events localization and detection using bio-inspired gammatone filters and temporal convolutional neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 1–11, 2023. doi:10.1109/TASLP.2023.3284525.
- The USTC-iFlytek system for sound event localization and detection of DCASE2020 challenge. IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), July 2020.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
- Ensemble of ACCDOA- and EINV2-based systems with D3Nets and impulse response simulation for sound event localization and detection. arXiv preprint arXiv:2106.10806, November 2021.
- ACCDOA: Activity-coupled Cartesian direction of arrival representation for sound event localization and detection. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 915–919. IEEE, 2021.
- Exploiting attention-based sequence-to-sequence architectures for sound event localization. In 2020 28th European Signal Processing Conference (EUSIPCO), pages 231–235. IEEE, 2021.
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization. In Proc. Interspeech 2021, pages 2117–2121, 2021. doi:10.21437/Interspeech.2021-124.
- Sound event localization and detection with pre-trained audio spectrogram transformer and multichannel separation network, 2022.
- Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.
- STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. arXiv preprint arXiv:2206.01948, 2022.
- Large-scale self-supervised speech representation learning for automatic speaker verification. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6147–6151. IEEE, 2022.
- Exploring wav2vec 2.0 fine-tuning for improved speech emotion recognition. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. doi:10.1109/ICASSP49357.2023.10095036.
- The challenge of realistic music generation: modelling raw audio at scale. Advances in Neural Information Processing Systems, 31, 2018.
- Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376, 2006.
- Ambisonics: A practical 3D audio theory for recording, studio production, sound reinforcement, and virtual reality. Springer Nature, 2019.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
- Asteroid: The PyTorch-based audio source separation toolkit for researchers. arXiv preprint arXiv:2005.04132, 2020.
- DL Huang and Ricardo Falcon Perez. SSELDNet: A fully end-to-end sample-level framework for sound event localization and detection. DCASE, 2021.
- L3DAS22 challenge: Learning 3D audio sources in a real office environment. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 9186–9190. IEEE, 2022.
- FSD50K: An open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852, 2021.
- A dataset and taxonomy for urban sound research. In 22nd ACM International Conference on Multimedia (ACM-MM’14), pages 1041–1044, Orlando, FL, USA, Nov. 2014.
- A multi-room reverberant dataset for sound event localization and detection. arXiv preprint arXiv:1905.08546, June 2019.
- A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. arXiv preprint arXiv:2006.01919, June 2020.
- Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016.