Multi-Speaker DOA Estimation with Deep CNNs
- The paper demonstrates that deep CNNs can robustly localize multiple speakers by leveraging phase-map features and framing DOA estimation as a multi-label classification task.
- It details advanced architectures such as CRNNs, U-Net variants, and modal coherence CNNs that extract spatial patterns from STFT representations for accurate localization.
- Empirical results show significant performance gains over classical methods, achieving sub-5° accuracy and real-time estimation even in noisy, reverberant conditions.
Multi-speaker direction-of-arrival (DOA) estimation using deep convolutional networks is an advanced signal processing paradigm that leverages data-driven learning to robustly localize multiple concurrent acoustic sources. Typical target domains are speech-dominated environments where traditional subspace techniques (e.g., MUSIC, SRP-PHAT) are challenged by overlapping sources, adverse noise, or reverberation. Recent supervised learning approaches based on deep convolutional architectures have shown marked improvements in both accuracy and robustness by learning spatial mixture patterns end-to-end, adapting to diverse acoustic conditions, and enabling high-resolution localization in real-world and simulated environments.
1. Problem Formulation and Input Representations
The multi-speaker DOA estimation problem is posed as a multi-class, multi-label classification task over a discrete set of DOA bins (e.g., 0°–180° in 5° increments, for a total of 37 classes). The primary acoustic observation comprises M-channel microphone array recordings, transformed via the short-time Fourier transform (STFT) to capture rich time–frequency (TF) spatial structure.
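For concreteness, here is a minimal sketch of this discretization and the corresponding multi-hot target encoding used for training (the grid spacing and class count follow the example above; the helper name and snapping rule are illustrative):

```python
import numpy as np

# Hypothetical grid: 0°-180° in 5° steps -> 37 candidate directions.
DOA_GRID = np.arange(0, 181, 5)       # degrees
NUM_CLASSES = len(DOA_GRID)           # 37

def multi_hot_label(source_doas_deg):
    """Encode the active source directions as a multi-hot target vector."""
    label = np.zeros(NUM_CLASSES, dtype=np.float32)
    for doa in source_doas_deg:
        label[np.argmin(np.abs(DOA_GRID - doa))] = 1.0  # snap to nearest bin
    return label

print(multi_hot_label([33.0, 120.0]))  # 1s at the bins for 35° and 120°
```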
For time frame $n$, frequency bin $k$, and microphone $m$, the complex STFT coefficient is denoted

$$Y_m(n,k) = A_m(n,k)\, e^{j \phi_m(n,k)},$$

with magnitude $A_m(n,k)$ and phase $\phi_m(n,k)$. The most common feature representation is the "phase map", which stacks the phase components of all $M$ channels over the $K$ frequency bins into a $K \times M$ matrix per time frame:

$$\Phi(n) = \big[\phi_m(n,k)\big]_{k=1,\dots,K;\ m=1,\dots,M}.$$
This representation discards magnitude, focusing solely on inter-microphone phase differences, which are direct encoders of source directionality and are robust to spectral coloration, reverberation, and amplitude distortions (Chakrabarty et al., 2018).
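A minimal sketch of this feature extraction, assuming the multichannel STFT is already available as a complex array (shapes are illustrative):

```python
import numpy as np

def phase_map(stft_frames):
    """Phase-map feature: keep only the STFT phase per channel.

    stft_frames: complex array of shape (num_frames, num_freq_bins, num_mics)
    returns: real array of the same shape containing phi_m(n, k).
    """
    return np.angle(stft_frames)

# Example: a random 4-channel "recording" of 100 frames x 257 bins.
rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 257, 4)) + 1j * rng.standard_normal((100, 257, 4))
Phi = phase_map(Y)   # magnitude is discarded; only inter-channel phase remains
```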
Alternative input modalities include generalized cross-correlation with phase transform (GCC-PHAT), inter-channel phase differences (IPDs) and inter-channel level ratios (ILRs), instantaneous relative transfer functions (RTFs), or array-specific representations such as spherical harmonic modal coherences (Kowalk et al., 2022, Hammer et al., 2020, Fahim et al., 2020, Jazaeri et al., 2025).
2. Deep Convolutional Network Architectures
The core approach is to design a deep convolutional neural network (CNN) that ingests the high-dimensional spatial phase (or related) maps and outputs, per time frame, posterior probabilities for each discretized DOA class. Several variants have been proposed:
- Basic feedforward convolutional block: a cascade of convolutional layers with small filters (no pooling) and ReLU activations, followed by two dense (fully connected) layers with dropout, and sigmoid outputs, one per DOA class (Chakrabarty et al., 2018). The number of convolutional layers is dictated by the necessity to span all microphone pairs in the receptive field (a minimal sketch follows this list).
- CRNN and U-Net variants: Stacking 2D convolutions with recurrent layers (e.g., bi-directional GRUs or Elman RNNs) or fully convolutional encoder–decoders (U-Net topology) for TF-wise inference. These architectures enable both local TF-pattern extraction and temporal continuity modeling (Adavanne et al., 2017, Hammer et al., 2020, Jazaeri et al., 2025).
- Modal coherence CNN: Deep (e.g., 8-layer) convolutional network operating on spherical harmonic modal coherence feature tensors per TF bin, supporting 3D localization (azimuth and elevation heads), offering robust classification under adverse SNR and heavy reverberation (Fahim et al., 2020).
- Partitioned or cascaded architectures: The search region is split, and an ensemble of parallel CNN regressors each "learns" the pseudo-spectrum in a subregion (e.g., the DeepMUSIC framework, which estimates MUSIC-like spectra directly) (Elbir, 2019).
- Auxiliary modules and multi-head strategies: Variants supplement the core CNN with explicit speaker-counting heads, source splitting (distinct streams per source), permutation-invariant training, and fusion with non-acoustic cues (e.g., external microphone, visual anchor) (Jazaeri et al., 2025, Subramanian et al., 2021, Wang et al., 2022).
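As referenced in the first list item, here is a minimal PyTorch sketch of the basic feedforward variant. It assumes 1×2 filters sliding along the microphone axis of a per-frame phase map and the 37-class grid from Section 1; the layer widths and dropout rate are illustrative, not the published configuration:

```python
import torch
import torch.nn as nn

class PhaseMapCNN(nn.Module):
    """Per-frame multi-label DOA classifier on phase maps (sketch).

    Input:  (batch, 1, K, M) phase map for one STFT frame
            (K frequency bins x M microphones).
    Output: (batch, num_classes) per-class posteriors in [0, 1].
    """

    def __init__(self, num_freq_bins=257, num_mics=4, num_classes=37):
        super().__init__()
        # M - 1 conv layers with 1x2 filters along the microphone axis and
        # no pooling, so the receptive field grows to span all mic pairs.
        convs, in_ch = [], 1
        for _ in range(num_mics - 1):
            convs += [nn.Conv2d(in_ch, 64, kernel_size=(1, 2)), nn.ReLU()]
            in_ch = 64
        self.convs = nn.Sequential(*convs)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * num_freq_bins, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, num_classes), nn.Sigmoid(),  # multi-label heads
        )

    def forward(self, x):
        return self.classifier(self.convs(x))

model = PhaseMapCNN()
posteriors = model(torch.randn(8, 1, 257, 4))   # shape (8, 37)
```

The sigmoid heads are independent per class, which is what allows several DOA bins to be active simultaneously in a single frame.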
3. Multi-Speaker DOA Supervision and Training
Multi-speaker scenarios require the network to produce multiple positive DOA posteriors per frame. The target label is a multi-hot vector, with 1s placed at all directions with active sources. The primary training loss is the binary cross-entropy:

$$\mathcal{L} = -\sum_{i=1}^{I} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],$$

where $p_i$ is the output posterior for class $i$ in the current frame and $y_i \in \{0, 1\}$ its target (Chakrabarty et al., 2018).
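A minimal sketch of this objective in PyTorch, with illustrative class indices for two active sources:

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()  # binary cross-entropy over independent DOA classes

# Hypothetical frame batch: sigmoid posteriors and multi-hot targets
# with two active sources (classes 7 and 24) in every frame.
posteriors = torch.rand(8, 37, requires_grad=True)
targets = torch.zeros(8, 37)
targets[:, [7, 24]] = 1.0

loss = criterion(posteriors, targets)
loss.backward()   # gradients flow independently through each class output
```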
Supervised training leverages synthesized noise signals convolved with simulated and/or measured room impulse responses (RIRs). Critically, to mimic speech-like W-disjoint orthogonality, multi-speaker mixtures are constructed by randomizing TF bins across sources so that each bin contains phase patterns from only a single direction. This enables training with noise signals rather than costly annotated speech mixtures (Chakrabarty et al., 2018).
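An illustrative sketch of this bin-randomized mixing idea, assuming single-source phase maps have already been computed (the function name and shapes are hypothetical, not the published pipeline):

```python
import numpy as np

def mix_tf_bins(phase_maps, rng=None):
    """Compose a multi-source training phase map from single-source ones.

    phase_maps: list of S arrays, each (num_frames, num_freq_bins, num_mics),
                one per source direction.
    Each TF bin of the output is drawn from exactly one source, mimicking
    the W-disjoint orthogonality of overlapping speech.
    """
    if rng is None:
        rng = np.random.default_rng()
    stacked = np.stack(phase_maps)               # (S, N, K, M)
    S, N, K, _ = stacked.shape
    owner = rng.integers(0, S, size=(N, K))      # per-bin source index
    return np.take_along_axis(
        stacked, owner[None, :, :, None], axis=0)[0]  # (N, K, M)
```

The multi-hot target for such a mixture simply has 1s at all S source directions, since every direction is active somewhere in the frame.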
Augmentation across array positions, RIRs (a grid of reverberation times), source distances, and broad SNR conditions is used to realize robust, domain-adaptable models. Optimization is most often performed with Adam, large batch sizes, and a fine output resolution (e.g., 37 classes for 5° binning).
4. Postprocessing, Source Counting, and Fusion Strategies
Inference in multi-speaker CNNs typically proceeds by taking the per-frame posterior vector, identifying local maxima (peaks) exceeding a set threshold, or selecting the top-$K$ entries if the number $K$ of active sources is known. Sliding- or block-averaging of frame-level posteriors helps smooth temporally dynamic scenarios.
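A minimal sketch of this peak-picking inference; the threshold and block length are illustrative:

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_doas(frame_posteriors, doa_grid, threshold=0.5, block=25):
    """Block-average frame posteriors, then pick peaks above a threshold.

    frame_posteriors: (num_frames, num_classes) per-frame sigmoid outputs
    doa_grid: (num_classes,) candidate angles in degrees
    """
    p = frame_posteriors[-block:].mean(axis=0)     # temporal smoothing
    peaks, _ = find_peaks(p, height=threshold)     # interior local maxima
    return doa_grid[peaks], p[peaks]

# If the source count K is known, top-K selection replaces the threshold:
# doa_grid[np.argsort(p)[-K:]]
```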
Innovative variants incorporate auxiliary modules for source counting. For instance, explicit concurrent speaker detection (CSD) heads, or fusion of an estimated speaker count as an auxiliary feature (early/mid/late fusion), have been shown to improve multi-source detection F1 by up to 14% in binaural hearing aid applications, especially when fused late in the model (Jazaeri et al., 2025). However, joint dual-task (DOA + count) training often does not further improve DOA estimation, as multi-label heads already encode cardinality information.
Hybrid frameworks also exist:
- CNN outputs can be composed to mimic spatial pseudo-spectra (as in CRNN-based pseudo-spectrum regression), after which conventional peak-picking or clustering yields DOA estimates and source numbers (Adavanne et al., 2017).
- In ad-hoc arrays, node-wise CNN DOA estimates are fused by geometric triangulation and clustering to yield robust source localization even under node unreliability (Liu et al., 2022); a triangulation sketch follows this list.
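As noted in the last item above, here is a minimal sketch of bearing-line triangulation for fusing node-wise DOA estimates, assuming 2D geometry and DOAs already expressed in a shared global frame (the least-squares formulation is a generic choice, not the specific fusion rule of the cited work):

```python
import numpy as np

def triangulate(node_pos, doas_deg):
    """Least-squares intersection of 2D bearing lines from array nodes.

    node_pos: (N, 2) node positions; doas_deg: (N,) global-frame azimuths.
    Each DOA defines a line through its node; the source estimate is the
    point minimizing the summed squared distance to all lines.
    """
    theta = np.deg2rad(np.asarray(doas_deg))
    d = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # ray directions
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, di in zip(np.asarray(node_pos, float), d):
        P = np.eye(2) - np.outer(di, di)   # projector onto line's normal
        A += P
        b += P @ p
    return np.linalg.solve(A, b)

# Two hypothetical nodes observing a source near (2, 3):
print(triangulate([[0, 0], [4, 0]], [56.3, 123.7]))   # ~ [2. 3.]
```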
Training per-TF-bin classifiers on single-source modal coherence (or similar spatial signatures) enables one model to generalize at test time to multi-source and 3D settings, with clustering of the per-bin predictions yielding the final source estimates (Fahim et al., 2020).
5. Empirical Performance and Comparative Evaluation
Deep CNN-based multi-speaker DOA estimators consistently outperform classical signal processing and subspace methods:
| System | Setting | MAE (deg) | Accuracy (%) |
|---|---|---|---|
| CNN with phase map (Chakrabarty et al., 2018) | 2 speakers, 20 dB SNR, simulated | ≈3.5 | ≈93 |
| MUSIC (baseline) | 2 speakers, 20 dB SNR, simulated | ≈16 | ≈63 |
| FCN U-Net (per-TF) (Hammer et al., 2020) | 2 speakers, static, simulated | 0.3–1.7 | 94–99.5 |
| DeepMUSIC (Elbir, 2019) | multi-source, SNR > 0 dB | n/a | matches/near MUSIC |
| Modal coherence CNN (Fahim et al., 2020) | S2, 20 dB SNR, simulated | 8.4 | 79.3 |
| DOAnet (CRNN) (Adavanne et al., 2017) | O1A (anechoic, 1 source) | 1.14 | n/a |
| DOAnet (CRNN), matched source count | O2A (anechoic, 2 sources) | 27.5 | n/a |
Empirical results show strong generalization to simulated and real measured RIRs, robustness under high reverberation ($T_{60}$ up to 0.8 s), and preserved sub-5° accuracy even for measured data (Chakrabarty et al., 2018, Hammer et al., 2020). The capacity to resolve very closely spaced sources (down to a 1° grid resolution and sub-degree RMSE under favorable conditions) far exceeds that of classical methods, which typically degrade or fail for closely spaced sources or at low SNR (Papageorgiou et al., 2020).
CNN-based systems also offer real-time feasibility: forward propagation per array covariance sample takes a few ms, outperforming spectral grid search algorithms in computational efficiency (Elbir, 2019).
6. Advanced Extensions: Dynamic Scenarios, Source Splitting, and Application Domains
Recent architectures support localization in dynamic acoustic environments, where speaker count and positions may vary over time. By tracking moving peaks over frame-averaged or blockwise posteriors, these systems realize online tracking and instantaneous adaptation to sources entering or leaving the region of interest (Chakrabarty et al., 2018, Hammer et al., 2020). Source splitting with dedicated branches or masks enables source-specific posteriors, facilitating downstream tasks such as source-aware speech recognition, where explicit DOA features reduce word error rates by up to 50% in far-field multi-talker ASR (Subramanian et al., 2021).
Novel multi-modal and information-informed architectures further extend performance. Signal-informed masking using an external microphone physically attached to the target speaker can reduce median DOA error by up to 36%, even in the presence of four interfering speakers (Kowalk et al., 2022). Auxiliary modalities (visual cues, external voiceprint anchors) or multi-stage systems (triangulation, confidence-based node selection) increase robustness and adaptability in complex or distributed environments (Wang et al., 2022, Li et al., 2024).
7. Limitations, Practical Insights, and Future Directions
Key limitations and open challenges include:
- Permutation ambiguity: In the absence of source anchoring, label permutation remains an issue, though permutation-invariant objectives and spatial priors help alleviate this (Wang et al., 2022).
- Source counting: Implicit in multi-label architectures but may benefit from explicit fusion, especially in ROI-constrained or low-SNR regimes (Jazaeri et al., 2025).
- Scalability: For increasing speaker counts or large ad-hoc arrays, combinatorial explosion in triangulation or clustering may introduce computation or accuracy bottlenecks, suggesting need for hybrid or confidence-weighted fusion strategies (Liu et al., 2022).
- Physical and geometric limits: Small array apertures, low SNR, and reverberant energy impose phase ambiguity or spatial aliasing, potentially limiting DOA resolution. Deep models are robust to some degree but not immune.
Future directions include full 3D extension (azimuth + elevation), end-to-end training of all fusion (e.g., node selection + triangulation), integration with speech enhancement and separation (as in DBnet (Aroudi et al., 2020)), and domain adaptation to arbitrary microphone geometries or distributed sensor networks.
In summary, deep convolutional networks for multi-speaker DOA estimation exploit raw phase-based spatial structure, multi-label learning, and extensive simulated augmentation to achieve robust, real-time, and high-resolution localization in complex acoustic scenes. These systems have demonstrated marked improvements over classical approaches and provide a flexible foundation for downstream source separation, robust multi-speaker speech recognition, and spatial audio analysis (Chakrabarty et al., 2018).