
w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training (2312.06907v2)

Published 12 Dec 2023 in eess.AS and cs.SD

Abstract: Sound Event Localization and Detection (SELD) constitutes a complex task that depends on extensive multichannel audio recordings with annotated sound events and their respective locations. In this paper, we introduce a self-supervised approach for SELD adapted from the pre-training methodology of wav2vec 2.0, which learns representations directly from raw audio data, eliminating the need for supervision. By applying this approach to SELD, we can leverage a substantial amount of unlabeled 3D audio data to learn robust representations of sound events and their locations. Our method comprises two primary stages: pre-training and fine-tuning. In the pre-training phase, unlabeled 3D audio datasets are utilized to train our w2v-SELD model, capturing intricate high-level features and contextual information inherent in audio signals. Subsequently, in the fine-tuning stage, the pre-trained model is fine-tuned on a smaller dataset with labeled SELD data. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed self-supervised approach for SELD. The model surpasses baseline systems provided with the datasets and achieves competitive performance comparable to state-of-the-art supervised methods. The code and pre-trained parameters of our w2v-SELD model are available in this repository.
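The two-stage recipe described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, the 4-channel input (standing in for first-order Ambisonics), the simplified masked-reconstruction loss (standing in for wav2vec 2.0's quantized contrastive objective), and the per-class activity/DOA output head are all illustrative assumptions.

```python
# Hypothetical sketch of a w2v-SELD-style pipeline: stage 1 pre-trains an
# encoder + context network on unlabeled multichannel audio; stage 2 adds a
# supervised SELD head. All sizes and the loss are illustrative assumptions.
import torch
import torch.nn as nn

class RawAudioEncoder(nn.Module):
    """Convolutional feature encoder over raw multichannel audio."""
    def __init__(self, in_channels=4, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, dim, kernel_size=10, stride=5),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4),
            nn.GELU(),
        )

    def forward(self, wav):  # wav: (batch, channels, samples)
        return self.conv(wav).transpose(1, 2)  # -> (batch, frames, dim)

class ContextNetwork(nn.Module):
    """Transformer context network, in the spirit of wav2vec 2.0."""
    def __init__(self, dim=64, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):
        return self.enc(x)

def pretrain_step(encoder, context, wav, mask_prob=0.5):
    """One self-supervised step: mask latent frames, predict them back.

    Simplified: the real wav2vec 2.0 objective is contrastive over quantized
    targets with distractors; here we just regress the true latents at
    masked positions.
    """
    z = encoder(wav)                            # latent frames
    mask = torch.rand(z.shape[:2]) < mask_prob  # which frames to mask
    c = context(z.masked_fill(mask.unsqueeze(-1), 0.0))
    return ((c - z.detach())[mask] ** 2).mean()

class SELDHead(nn.Module):
    """Fine-tuning head: per-frame class activity + Cartesian DOA per class."""
    def __init__(self, dim=64, num_classes=14):
        super().__init__()
        self.classify = nn.Linear(dim, num_classes)      # event activity logits
        self.localize = nn.Linear(dim, num_classes * 3)  # (x, y, z) per class

    def forward(self, c):
        return self.classify(c), self.localize(c).view(*c.shape[:2], -1, 3)

encoder, context, head = RawAudioEncoder(), ContextNetwork(), SELDHead()
wav = torch.randn(2, 4, 16000)               # 2 clips, 4-channel, 1 s @ 16 kHz
loss = pretrain_step(encoder, context, wav)  # stage 1: self-supervised
logits, doa = head(context(encoder(wav)))    # stage 2: supervised SELD outputs
print(logits.shape, doa.shape)               # per-frame class logits and DOAs
```

In a real setup, stage 1 would run over large unlabeled 3D audio corpora before the encoder and context weights are reused (and the head trained) on the much smaller labeled SELD dataset.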

