
Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network (1710.10059v2)

Published 27 Oct 2017 in cs.SD, cs.LG, and eess.AS

Abstract: This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and generate SPS with high signal-to-noise ratio.

Citations (230)

Summary

  • The paper introduces DOAnet, a CRNN that leverages CNN for spatial features and RNN for temporal dynamics to enhance DOA estimation accuracy.
  • It outperforms conventional methods like MUSIC and ESPRIT by reducing estimation errors, especially in reverberant and low-SNR conditions.
  • The approach effectively estimates overlapping sound sources, paving the way for scalable, data-driven acoustic analysis.

Direction of Arrival Estimation Using a Convolutional Recurrent Neural Network

This paper introduces an approach to Direction of Arrival (DOA) estimation, a task that is crucial in applications such as speech enhancement, spatial audio coding, and multichannel sound source separation. The authors present a deep learning architecture named DOAnet, a convolutional recurrent neural network (CRNN) designed to estimate the DOAs of multiple sound sources without explicit feature engineering or prior knowledge of the number of active sources.

Methodological Overview

The authors highlight notable differences between traditional DOA estimation methods and the proposed DOAnet. Conventional subspace methods such as MUSIC and ESPRIT, although effective in many scenarios, require prior knowledge of the number of active sources and degrade under low-SNR or reverberant conditions. In contrast, DOAnet leverages the representational capacity of deep neural networks to learn directly from the raw spectrograms of the audio channels.
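
For context, the sketch below shows how a conventional narrowband MUSIC spatial pseudo-spectrum is typically computed from array snapshots. The function name, the linear-array geometry, and the azimuth-only scan grid are illustrative assumptions for a minimal example; they are not the array setup or implementation used in the paper.

```python
# Minimal sketch of a narrowband MUSIC spatial pseudo-spectrum (generic,
# not the paper's implementation). Assumes a linear array along the x-axis
# and a far-field source model; scans azimuth only for brevity.
import numpy as np

def music_pseudo_spectrum(X, num_sources, mic_positions, freq, c=343.0,
                          azimuths=np.deg2rad(np.arange(-90, 91))):
    """X: (num_mics, num_snapshots) complex STFT snapshots at one frequency bin."""
    R = X @ X.conj().T / X.shape[1]            # spatial covariance estimate
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    En = eigvecs[:, :-num_sources]             # noise subspace (smallest eigenvalues)

    spectrum = np.empty(len(azimuths))
    for i, az in enumerate(azimuths):
        delays = mic_positions * np.sin(az) / c          # per-mic propagation delays
        a = np.exp(-2j * np.pi * freq * delays)          # steering vector
        spectrum[i] = 1.0 / np.real(a.conj() @ En @ En.conj().T @ a)
    return azimuths, spectrum                  # peaks indicate source DOAs
```

Note that the noise-subspace step above is exactly where the source count enters: MUSIC needs it up front, whereas DOAnet is trained to produce DOA estimates without it.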

DOAnet combines convolutional neural network (CNN) and recurrent neural network (RNN) layers to exploit both spatial and temporal structure in the audio. The CNN layers capture local, shift-invariant spatial features, while the RNN layers model temporal dependencies, allowing the architecture to track the dynamic nature of sound scenes. The network produces a spatial pseudo-spectrum (SPS) similar to that of MUSIC, but with improved accuracy and reduced computational overhead. Moreover, the method uses both the magnitude and phase of the spectrograms of all channels as input, a richer representation than previous learning-based methods that rely solely on phase information.
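
As a rough illustration of this kind of CRNN, the following PyTorch sketch stacks convolutional blocks that pool only along frequency, feeds the frame-wise features to a bidirectional GRU, and classifies each frame over a discretized azimuth/elevation grid. The layer sizes, kernel shapes, grid resolution, and the omission of the intermediate SPS output are all simplifying assumptions, not the paper's exact configuration.

```python
# A minimal CRNN sketch in the spirit of DOAnet (illustrative hyperparameters).
import torch
import torch.nn as nn

class DOANetSketch(nn.Module):
    def __init__(self, num_channels=4, num_freq_bins=512, num_doa_classes=432):
        super().__init__()
        # Input: magnitude + phase spectrograms of all channels,
        # shape (batch, 2 * num_channels, time_frames, num_freq_bins).
        self.cnn = nn.Sequential(
            nn.Conv2d(2 * num_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 8)),                 # pool along frequency only
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 8)),
        )
        rnn_in = 64 * (num_freq_bins // 64)       # channels * remaining freq bins
        self.rnn = nn.GRU(rnn_in, 128, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Per-frame DOA activity over a discretized azimuth/elevation grid
        # (grid size here is an assumption, not the paper's resolution).
        self.head = nn.Linear(2 * 128, num_doa_classes)

    def forward(self, x):
        z = self.cnn(x)                           # (B, 64, T, F')
        b, c, t, f = z.shape
        z = z.permute(0, 2, 1, 3).reshape(b, t, c * f)
        z, _ = self.rnn(z)                        # temporal modeling
        return torch.sigmoid(self.head(z))        # (B, T, num_doa_classes)
```

Pooling only along the frequency axis preserves the frame rate, so the recurrent layers can emit one DOA activity vector per time frame.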

Results and Evaluation

The performance of DOAnet is evaluated on synthetic datasets under various conditions, including anechoic and reverberant environments with different levels of source overlap. The results indicate that DOAnet achieves a high signal-to-noise ratio for the estimated SPS, especially in cases with up to two overlapping sources, which underscores its robust performance. In conditions with known and unknown numbers of active sources, DOAnet consistently outperforms traditional methods such as MUSIC, demonstrating lower DOA estimation errors across both matched and unmatched reverberant settings.

The results also show a substantial reduction in estimation errors for DOAnet compared to MUSIC, particularly emphasizing its robustness in complex acoustic scenarios involving multiple concurrent sources and reverberation. It is noted that, while traditional methods struggle with the theoretical problem of estimating more sources than channels, DOAnet's architecture allows it to handle two concurrent sources effectively. However, in scenarios with three sources, performance deteriorates due to the limitations in training data derived from MUSIC, indicating potential areas for improvement with better training datasets.
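
For reference, DOA estimation error of this kind is commonly measured as the angle between the estimated and reference directions on the sphere. The snippet below computes that central angle from azimuth/elevation pairs; it is a generic metric sketch, not necessarily the exact scoring code used in the paper.

```python
# Great-circle (central) angle between an estimated and a reference DOA,
# both given as azimuth/elevation in degrees. Generic metric sketch.
import numpy as np

def angular_error_deg(az_est, el_est, az_ref, el_ref):
    az_est, el_est, az_ref, el_ref = map(np.deg2rad, (az_est, el_est, az_ref, el_ref))
    cos_angle = (np.sin(el_est) * np.sin(el_ref)
                 + np.cos(el_est) * np.cos(el_ref) * np.cos(az_est - az_ref))
    return np.rad2deg(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Example: estimate at (30°, 10°) vs. reference at (40°, 0°) -> ~14.1° error.
print(angular_error_deg(30, 10, 40, 0))
```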

Implications and Future Directions

The introduction of DOAnet represents a shift towards data-driven approaches for DOA estimation, moving away from model-based strategies. This paradigm benefits from the richness of data representations captured by neural networks, providing a more versatile and scalable solution for audio and acoustics applications. Practically, this can transform how acoustic environments are analyzed and sound source localization is performed, enabling more efficient soundfield visualizations and room acoustics analyses.

Looking forward, the insights from DOAnet could lead to developments in end-to-end audio analysis systems where sound localization is integrated with other acoustic tasks, such as source separation and ambient sound analysis. Further work is warranted to explore real-world applications and expand the capabilities of DOAnet, particularly in more diverse audio environments and with advanced network architectures that can further capitalize on large-scale data.

In summary, this paper effectively illustrates how convolutional and recurrent architectures can be exploited for sophisticated audio processing tasks, providing a compelling case for the expanded role of neural networks in spatial acoustics processing. The results and methodology herald significant potential for ongoing research and application in acoustic signal processing, with DOAnet paving the way for future innovations in multi-source audio environments.