
Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks (1807.00129v3)

Published 30 Jun 2018 in cs.SD and eess.AS

Abstract: In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

Citations (447)

Summary

  • The paper introduces SELDnet, a CRNN architecture that jointly performs sound event detection and 3D direction-of-arrival estimation for overlapping sound sources.
  • It demonstrates robust performance with SED error rates as low as 0.04 and F-scores up to 97.7%, outperforming baseline methods.
  • The study benchmarks against methods like MUSIC and highlights future directions for enhancing spatial accuracy and detecting multiple instances per sound class.

Evaluating Sound Event Localization and Detection with CRNNs

The paper presents a convolutional recurrent neural network (CRNN) method for joint sound event localization and detection (SELD) of overlapping sound events in 3D space. The proposed network, referred to as SELDnet, processes sequences of spectrogram frames and maps them to two parallel outputs: sound event detection (SED) and direction-of-arrival (DOA) estimation, expressed in 3D Cartesian coordinates.

Methodology and Architecture

SELDnet integrates convolutional layers, which extract spatial features from the multichannel spectrogram input, with recurrent layers that model temporal dependencies across frames, culminating in two fully connected branches for the SED and DOA outputs.

The SED branch performs multi-label classification, estimating the presence of multiple sound events per frame. The DOA branch uses multi-output regression to estimate the 3D coordinates of active sound events. These outputs provide continuous predictions, enhancing the network’s ability to generalize to unseen spatial locations.
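The two-branch design described above can be sketched as follows. This is an illustrative PyTorch implementation, not the paper's exact configuration: layer sizes, pooling factors, and the class count are placeholders, and the input assumes 8 feature maps (e.g., phase and magnitude for a 4-channel array).

```python
import torch
import torch.nn as nn

class SELDnetSketch(nn.Module):
    """Illustrative CRNN: conv blocks -> bidirectional GRU -> two parallel heads.
    Layer sizes are placeholders, not the paper's exact configuration."""
    def __init__(self, n_channels=8, n_freq=64, n_classes=11, rnn_size=128):
        super().__init__()
        # Convolutional feature extractor: pool only along frequency so the
        # time resolution of the input frame sequence is preserved.
        self.conv = nn.Sequential(
            nn.Conv2d(n_channels, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        feat = 64 * (n_freq // 16)
        self.rnn = nn.GRU(feat, rnn_size, batch_first=True, bidirectional=True)
        # SED head: per-frame multi-label classification (sigmoid activations).
        self.sed = nn.Linear(2 * rnn_size, n_classes)
        # DOA head: per-frame regression of (x, y, z) per class; tanh keeps
        # outputs in [-1, 1], matching unit-sphere Cartesian coordinates.
        self.doa = nn.Linear(2 * rnn_size, 3 * n_classes)

    def forward(self, x):  # x: (batch, channels, time, freq)
        h = self.conv(x)                      # (batch, 64, time, freq // 16)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time, feat)
        h, _ = self.rnn(h)
        return torch.sigmoid(self.sed(h)), torch.tanh(self.doa(h))
```

Because pooling is applied only along frequency, both heads produce one prediction per input time-frame, which is what lets the network associate DOAs with event labels over time.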

Evaluation and Results

The network was evaluated on diverse datasets, including Ambisonic and circular array formats, with synthetic and real-life impulse responses. Across various scenarios featuring anechoic and reverberant conditions and different levels of sound event overlap, the authors report robust performance.

Numerical Results:

  • SELDnet consistently outperformed baseline methods in recall of DOAs, indicating its enhanced ability to manage multiple overlapping sound events.
  • In terms of SED metrics, SELDnet achieved error rates of 0.04 to 0.19 and F-scores of 85.6% to 97.7% on the ANSYN dataset, surpassing the baseline SED methods.
  • DOA error metrics indicated superior results in most tested scenarios, although some variance was noted with increasing event overlap.
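For concreteness, the SED error rate and F-score cited above can be computed roughly as below. This is a simplified frame-wise sketch; the paper's evaluation aggregates statistics over one-second segments, so the exact numbers differ.

```python
import numpy as np

def sed_metrics(ref, est):
    """Frame-wise SED error rate and F-score (simplified sketch; the paper
    uses segment-based variants of these metrics).
    ref, est: binary activity arrays of shape (frames, classes)."""
    tp = np.logical_and(ref == 1, est == 1).sum()
    fp = np.logical_and(ref == 0, est == 1).sum()
    fn = np.logical_and(ref == 1, est == 0).sum()
    # Per-frame false negatives/positives decompose into substitutions,
    # deletions, and insertions for the error rate.
    fn_t = np.logical_and(ref == 1, est == 0).sum(axis=1)
    fp_t = np.logical_and(ref == 0, est == 1).sum(axis=1)
    subs = np.minimum(fn_t, fp_t).sum()
    dels = np.maximum(0, fn_t - fp_t).sum()
    ins = np.maximum(0, fp_t - fn_t).sum()
    n_ref = ref.sum()
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    f = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return er, f
```

An error rate of 0 and F-score of 100% correspond to perfect frame-level detection; the reported 0.04 error rate on ANSYN is close to that ideal.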

Baseline Comparisons

The paper benchmarks SELDnet against multiple baselines, including MUSIC (a parametric method) and DOAnet, across different metrics. The comparisons revealed SELDnet's advantage in handling real-life reverberation and unseen spatial configurations. However, some classification-based methods achieved finer spatial precision in their DOA estimates, reflecting the inherent difficulty of regression-based DOA estimation.
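Since the DOA branch outputs Cartesian coordinates, a natural per-event localization measure is the angle between the estimated and reference direction vectors. A minimal sketch of such a helper (hypothetical, not the paper's exact evaluation code):

```python
import numpy as np

def doa_angular_error(xyz_ref, xyz_est):
    """Angle in degrees between reference and estimated DOA vectors given as
    3D Cartesian coordinates. Both vectors are projected onto the unit
    sphere first, so only direction matters. Hypothetical helper."""
    ref = xyz_ref / np.linalg.norm(xyz_ref, axis=-1, keepdims=True)
    est = xyz_est / np.linalg.norm(xyz_est, axis=-1, keepdims=True)
    # Clip the dot product to guard against floating-point drift outside [-1, 1].
    cos = np.clip((ref * est).sum(axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))
```

Regression can place an estimate anywhere on the sphere, including directions never seen in training, which underlies the robustness to unseen DOA values noted above, at the cost of some spatial precision relative to classification over a fixed grid.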

Implications and Future Directions

The research illuminates the potential of CRNN-based architectures in complex auditory scene analysis, offering a path forward in SELD tasks. The practical implications extend to numerous applications in robotics, surveillance, and augmented reality.

Future advancements may focus on enhancing the spatial accuracy of DOA estimation while maintaining the network’s robust generalization capabilities. Additionally, extending the network to localize multiple instances of the same sound class remains a promising direction.

Overall, SELDnet represents a significant evolution in integrating SED and localization in overlapping scenarios, showcasing the potential of CRNNs in 3D sound analysis.