Sound Source Localization
- Sound source localization is the process of determining an acoustic emitter’s position using microphone measurements and models of acoustic propagation.
- Research in the field integrates classical signal processing, compressive sensing, manifold learning, and neural network approaches to estimate time-differences-of-arrival and spatial position accurately.
- Practical applications in robotics, teleconferencing, and augmented reality benefit from robust, efficient techniques even under noisy and dynamic acoustical conditions.
Sound source localization (SSL) is the process of inferring the spatial position of an acoustic emitter using signal measurements collected by one or more microphones. Accurate SSL is a critical component for technologies in robotics, surveillance, teleconferencing, augmented reality, and many sensor-networked systems. The contemporary research landscape spans classical array signal processing, compressive sensing, probabilistic learning, and deep learning architectures, each with different trade-offs in data efficiency, robustness, scalability, and real-world deployability.
1. Signal Models and Mathematical Foundations
At the core of SSL is a physical and statistical model of acoustic propagation from a source to each microphone, typically written as
$$x_m(t) = h_m(t) * s(t - \tau_m) + n_m(t),$$
where $s(t)$ is the source signal, $h_m(t)$ the channel impulse response, $\tau_m$ the propagation delay to microphone $m$, and $n_m(t)$ is additive noise (Jiang et al., 2013). The recovery of the delays $\{\tau_m\}$ from the observations $\{x_m(t)\}$ is central, as these delays (or more generally, relative transfer functions) encode geometric information about the source's location. Discretization yields a vector model:
$$\mathbf{x}_m = \mathbf{A}\,\mathbf{b}_m + \mathbf{n}_m,$$
with the dictionary $\mathbf{A}$ constructed from time-shifted copies of a reference sensor's (full-sample) waveform. Estimating the sparse vector $\mathbf{b}_m$, whose dominant entry indexes the relative delay, allows inference of Time-Difference-of-Arrival (TDOA), which is the foundation for most geometry-based localization.
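A minimal numerical sketch of this discretized model is given below, assuming a synthetic white-noise source, a toy sample rate, and circular shifts in place of true propagation; the variable names follow the notation above but are otherwise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16_000                       # assumed sample rate (Hz), for illustration only
N = 1024                          # frame length in samples
true_delay = 37                   # propagation delay in samples (unknown in practice)

# Reference sensor: full-rate waveform s[n] (white noise stands in for the source).
s = rng.standard_normal(N)

# Second microphone: delayed, attenuated copy of the source plus additive noise.
x = 0.8 * np.roll(s, true_delay) + 0.05 * rng.standard_normal(N)

# Dictionary A whose k-th column is the reference waveform shifted by k samples,
# so that x ~= A b with b sparse (a single dominant entry at the true delay).
max_delay = 64
A = np.stack([np.roll(s, k) for k in range(max_delay)], axis=1)

# With full-rate data the dominant entry of b, and hence the TDOA, is exposed
# by correlating x against the dictionary columns (matched filtering).
scores = A.T @ x
tdoa_samples = int(np.argmax(np.abs(scores)))
print(tdoa_samples, tdoa_samples / fs)        # 37 samples, about 2.3 ms
```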
2. Classical and Compressive Sensing Approaches
Conventional methods (cross-correlation, GCC, MUSIC, SRP-PHAT) require full-bandwidth sampling for TDOA or subspace estimation. However, compressive sensing (CS) introduces an alternative: only one sensor samples at the Nyquist rate, while the others acquire compressive linear measurements $\mathbf{y}_m = \boldsymbol{\Phi}\mathbf{x}_m$ (e.g., using shifted m-sequences for hardware simplicity) (Jiang et al., 2013). Solving for the sparse vector $\mathbf{b}_m$ proceeds via nonlinear $\ell_1$-minimization:
$$\hat{\mathbf{b}}_m = \arg\min_{\mathbf{b}} \|\mathbf{b}\|_1 \quad \text{subject to} \quad \mathbf{y}_m = \boldsymbol{\Phi}\mathbf{A}\,\mathbf{b},$$
or its relaxed, noisy variant with constraint $\|\mathbf{y}_m - \boldsymbol{\Phi}\mathbf{A}\,\mathbf{b}\|_2 \le \epsilon$. The TDOA is then extracted from the support of $\hat{\mathbf{b}}_m$, i.e., $\hat{\tau}_m = T_s \arg\max_k |\hat{b}_{m,k}|$ with $T_s$ the sample period.
This approach achieves dramatic (>100:1) compression ratios, supports robust TDOA recovery accurate to within the sample period, and is validated in both clean and adverse (reverberant, noisy) environments. Detection confidence is assessed by repeating the reconstruction with different measurement sets and filtering out unreliable estimates. CS-based SSL is particularly attractive for sensor networks with power or bandwidth constraints.
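The sketch below extends the previous example to the compressive setting, assuming random ±1 measurement rows as a stand-in for the shifted m-sequences and a modest 16:1 compression ratio (the cited work reports far higher ratios); because $\mathbf{b}_m$ has a single dominant entry, recovery reduces to one matched-filtering step over the compressed dictionary rather than a full $\ell_1$ solver.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, N, max_delay, true_delay = 16_000, 1024, 64, 37

s = rng.standard_normal(N)                                  # reference sensor (Nyquist rate)
x = 0.8 * np.roll(s, true_delay) + 0.05 * rng.standard_normal(N)

# Dictionary of shifted reference waveforms, as in the signal-model sketch above.
A = np.stack([np.roll(s, k) for k in range(max_delay)], axis=1)

# Compressive sensor: M << N linear measurements y = Phi @ x.
# Random Bernoulli (+/-1) rows stand in for the shifted m-sequences used in hardware.
M = 64                                                      # 16:1 compression here
Phi = rng.choice([-1.0, 1.0], size=(M, N))
y = Phi @ x

# 1-sparse recovery: pick the compressed-dictionary column best correlated with y
# (equivalent to the first step of orthogonal matching pursuit).
D = Phi @ A
k_hat = int(np.argmax(np.abs(D.T @ y) / np.linalg.norm(D, axis=0)))
print("estimated TDOA:", k_hat, "samples,", k_hat / fs, "s")   # expected: 37 samples
```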
3. Learning and Manifold-Based Methods
Limitations of classical methods in adverse acoustic environments motivate statistical learning approaches. The manifold regularization framework (Laufer-Goldshtein et al., 2015) captures the geometric structure of high-dimensional feature spaces (e.g., relative transfer functions) by assuming that acoustic features from within a fixed spatial region lie on a smooth, low-dimensional manifold.
Given $n_L$ labeled and $n_U$ unlabeled data points, SSL is cast as a semi-supervised regression problem:
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n_L} \big(p_i - f(\mathbf{z}_i)\big)^2 + \lambda_1 \|f\|_{\mathcal{H}}^2 + \lambda_2\, \mathbf{f}^{\top} \mathbf{L}\, \mathbf{f},$$
where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), $p_i$ are known source positions, $\mathbf{z}_i$ are the acoustic feature vectors, $\mathbf{f} = [f(\mathbf{z}_1), \ldots, f(\mathbf{z}_{n_L + n_U})]^{\top}$, and $\mathbf{L}$ is a (graph) Laplacian computed from all feature samples. The resulting Representer Theorem solution expands over kernels centered at all labeled and unlabeled data points and balances empirical fit against manifold smoothness. This design yields increased robustness under high noise and reverberation compared with both GCC and prior manifold diffusion search, and supports adaptive, online model updates as new data are acquired.
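A minimal Laplacian-regularized kernel regression sketch in this spirit is shown below, assuming a Gaussian kernel, a k-nearest-neighbor graph Laplacian, and synthetic two-dimensional features standing in for RTF-derived vectors; the closed-form coefficients follow the standard Representer Theorem solution of this kind of objective rather than the cited paper's exact formulation.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def lap_rls_fit(Z, p, n_labeled, gamma=1.0, lam1=1e-2, lam2=1e-1, knn=5):
    """Semi-supervised (manifold-regularized) kernel regression.
    Z: all feature samples, labeled first; p: targets for the labeled samples."""
    n = Z.shape[0]
    K = rbf_kernel(Z, Z, gamma)

    # k-NN graph Laplacian L = D - W built from labeled *and* unlabeled samples.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:knn + 1]:
            W[i, j] = W[j, i] = np.exp(-gamma * d2[i, j])
    L = np.diag(W.sum(1)) - W

    # Representer Theorem: f(.) = sum_i alpha_i k(., z_i) over all n samples.
    J = np.zeros((n, n)); J[:n_labeled, :n_labeled] = np.eye(n_labeled)
    y = np.zeros(n); y[:n_labeled] = p
    alpha = np.linalg.solve(J @ K + lam1 * np.eye(n) + lam2 * (L @ K), y)
    return lambda Znew: rbf_kernel(Znew, Z, gamma) @ alpha

# Toy usage: predict a 1-D "position" from 2-D features with only 15 labels.
rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, size=(60, 2))
p_true = np.sin(2 * Z[:, 0]) + 0.5 * Z[:, 1]
f = lap_rls_fit(Z, p_true[:15], n_labeled=15)
print("mean abs. error on unlabeled points:", np.mean(np.abs(f(Z[15:]) - p_true[15:])))
```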
4. Probabilistic and Neural Network Approaches
SSL in highly adverse interiors (e.g., reverberation times $T_{60}$ up to 600 ms and low signal-to-noise ratios) can also be addressed by recasting the regression as a nonlinear classification problem (Sun et al., 2017). The Generalized Cross Correlation Classification Algorithm (GCA) divides the environment into spatial clusters, extracts a concatenated set of GCC features from all microphone pairs, and applies a probabilistic neural network (PNN) to assign a likelihood to each cluster $C_k$:
$$p(\mathbf{g} \mid C_k) = \frac{1}{N_k} \sum_{i=1}^{N_k} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left(-\frac{\|\mathbf{g} - \mathbf{g}_{k,i}\|^2}{2\sigma^2}\right),$$
where $\mathbf{g}$ is the $d$-dimensional concatenated GCC feature vector, $\mathbf{g}_{k,i}$ are the $N_k$ training features of cluster $k$, and $\sigma$ is the Parzen-window spread of the PNN.
Final localization is refined by a Weighted Location Decision Method (WLDM), which leverages the posterior over cluster probabilities and interpolates to sub-cluster resolution. GCA demonstrates mean azimuth and elevation errors of approximately 4.6° and 3.1° under strong reverberation and noise, with success rates significantly surpassing competing methods (TDE, TL-SSC, LS‑SVM) by integrating GCC robustness, machine-learned environment models, and postprocessing for smooth localization boundaries.
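A compact sketch of the PNN stage and a posterior-weighted refinement in the spirit of WLDM appears below, assuming synthetic GCC-like feature vectors drawn as Gaussian blobs per cluster and an illustrative top-k weighted-centroid rule; the names and the interpolation rule are stand-ins, not the paper's exact formulation.

```python
import numpy as np

def pnn_posteriors(g, train_feats, train_labels, n_clusters, sigma=0.5):
    """Parzen-window (PNN) likelihood of GCC feature vector g for each cluster,
    normalized to a posterior (uniform priors assumed)."""
    lik = np.zeros(n_clusters)
    for k in range(n_clusters):
        Xk = train_feats[train_labels == k]
        d2 = ((Xk - g) ** 2).sum(axis=1)
        lik[k] = np.mean(np.exp(-d2 / (2 * sigma ** 2)))
    return lik / lik.sum()

def weighted_location(post, cluster_centers, top=3):
    """WLDM-style refinement: posterior-weighted average of the most likely
    cluster centroids yields sub-cluster resolution."""
    idx = np.argsort(post)[-top:]
    w = post[idx] / post[idx].sum()
    return w @ cluster_centers[idx]

# Toy usage with synthetic "GCC" features: one Gaussian blob per spatial cluster.
rng = np.random.default_rng(0)
n_clusters, dim = 8, 12                                   # e.g. 12 concatenated GCC lags
feat_centers = rng.normal(size=(n_clusters, dim))
pos_centers = rng.uniform(0, 5, size=(n_clusters, 2))     # cluster centroids in metres
train_feats = np.vstack([c + 0.2 * rng.normal(size=(50, dim)) for c in feat_centers])
train_labels = np.repeat(np.arange(n_clusters), 50)

g_test = feat_centers[3] + 0.2 * rng.normal(size=dim)     # source near cluster 3
post = pnn_posteriors(g_test, train_feats, train_labels, n_clusters)
print("most likely cluster:", int(np.argmax(post)))
print("refined position estimate:", weighted_location(post, pos_centers))
```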
5. Acoustics-Aware Ray-Tracing and Particle Filter Techniques
In geometrically complex and non-line-of-sight (NLOS) indoor settings, purely TDOA-based techniques are insufficient. Algorithms integrating ray tracing (forward and inverse) with physical acoustic modeling and particle filters have proven effective:
- Reflection-aware SSL (An et al., 2017): Direct and reflected ray paths are generated by inverse ray tracing, exploiting environment reconstructions (from SLAM-produced occupancy maps) to simulate specular reflections. Monte Carlo localization is performed by initializing a cloud of candidate particles in 3D; each particle is reweighted according to its proximity to the traced acoustic paths and iteratively resampled until convergence (see the particle-filter sketch after this list). Including reflections improves average localization accuracy by ≈40% over direct-only models (typical error of 0.8 m in a 7 m × 7 m × 3 m space) and enables rapid, single-frame estimates, even for intermittent and mobile sources.
- Diffraction-aware SSL (An et al., 2018): Ray tracing is extended with the Uniform Theory of Diffraction (UTD) to handle diffraction at wedges. When a back-propagated ray approaches a precomputed edge in the 3D mesh, multiple diffraction rays are spawned, modeled as emanating from virtual sources on the edge. The source location is estimated as the convergence region of all direct, reflected, and diffracted rays, inferred by a particle filter. Substantial accuracy gains (up to 130%) over reflection-only approaches are observed in NLOS scenarios.
- Back-Propagation Signal Similarity (An et al., 2019): Back-propagation signals are reconstructed at candidate source locations along raytraced paths (inverting loss and attenuation), with the hypothesis that physical sources yield high signal similarity after back-propagation, while noise and spurious paths do not. A particle filter with a weight incorporating both geometric proximity and signal cross-correlation is used, resulting in average errors of ≈0.51 m and substantial accuracy improvements (65–220%) over prior ray-convergence based SSL.
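The sketch below illustrates the Monte Carlo step these ray-based methods share, assuming the traced acoustic paths are already available as 3D line segments; particles are weighted by a Gaussian function of their distances to all traced paths and resampled, which is a simplified stand-in for the papers' weighting rules (the signal-similarity term of An et al., 2019 could multiply the same weight).

```python
import numpy as np

def point_to_segment(p, a, b):
    """Shortest distance from 3-D point p to the segment from a to b."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def pf_step(particles, segments, sigma=0.3, jitter=0.05, rng=None):
    """One reweight/resample iteration: particles close to *all* traced acoustic
    paths (their convergence region) receive the highest weight."""
    rng = rng or np.random.default_rng()
    d2 = np.array([sum(point_to_segment(p, a, b) ** 2 for a, b in segments)
                   for p in particles])
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)   # multinomial resampling
    return particles[idx] + jitter * rng.standard_normal(particles.shape)

# Toy usage: a direct and a reflected path whose traced rays cross near (2, 3, 1).
rng = np.random.default_rng(0)
segments = [(np.array([0.0, 0.0, 1.0]), np.array([4.0, 6.0, 1.0])),   # direct ray
            (np.array([5.0, 0.0, 1.0]), np.array([0.0, 5.0, 1.0]))]   # reflected ray
particles = rng.uniform([0, 0, 0], [7, 7, 3], size=(500, 3))          # room-sized cloud
for _ in range(20):
    particles = pf_step(particles, segments, rng=rng)
print("estimated source position:", particles.mean(axis=0))          # near (2, 3, 1)
```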
6. Hardware Architectures and Practical Systems
SSL systems exhibit a range of sensor and array modalities:
- Cubical microphone arrays combined with GCC-PHAT TDOA estimation and a grid search over a spherical set of candidate directions can resolve both azimuth and elevation, supporting disaster-search robotics with directional errors of ≈1° over meter-scale distances (Khanal et al., 2021); a minimal grid-search sketch follows this list.
- Massive and sparse sensor arrays (e.g., an 8×8 URA for MASSLOC (Fischer et al., 16 Aug 2025); 16-microphone circular arrays on outdoor UGVs (Liu et al., 29 Jul 2025)) enable large-scale multi-source tracking. Sparse geometries, combined with spatial smoothing and sequence-based excitation (complementary Zadoff–Chu waveforms), allow simultaneous identification and DoA estimation of up to 14 sources, support Perspective-n-Point (PnP) calibration, and achieve sub-decimeter positioning errors in high-reverberation environments.
- Mobile microphone arrays with SLAM (Michaud et al., 2020): Using two mobile robots each with multi-microphone arrays and SLAM-based pose tracking, 3D source localization is performed via triangulation based on global DoA vectors, achieving errors below 30 cm when the baseline geometry is favorable (inter-robot angle >30°).
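A minimal far-field GCC-PHAT grid search is sketched below, assuming a four-microphone tetrahedral subset of a small cube, synthetic white-noise signals with integer-sample delays, and a coarse azimuth/elevation grid; it is a simplified stand-in for the cited systems, not their implementation.

```python
import numpy as np

C, FS = 343.0, 48_000        # speed of sound (m/s) and sample rate (Hz), assumed values

def gcc_phat(x1, x2):
    """Circular GCC-PHAT cross-correlation of two equal-length frames."""
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    R = X1 * np.conj(X2)
    return np.fft.irfft(R / (np.abs(R) + 1e-12), n=len(x1))

def srp_grid_search(frames, mic_pos, az_grid, el_grid):
    """Steered-response search: sum each pair's GCC-PHAT value at the TDOA
    predicted by a candidate far-field direction; return the best (az, el)."""
    n = len(frames[0])
    pairs = [(i, j) for i in range(len(mic_pos)) for j in range(i + 1, len(mic_pos))]
    cc = {pr: gcc_phat(frames[pr[0]], frames[pr[1]]) for pr in pairs}
    best_score, best_dir = -np.inf, None
    for az in az_grid:
        for el in el_grid:
            u = np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
            score = sum(cc[(i, j)][int(round((mic_pos[j] - mic_pos[i]) @ u / C * FS)) % n]
                        for i, j in pairs)
            if score > best_score:
                best_score, best_dir = score, (az, el)
    return best_dir

# Toy usage: 4 microphones on alternating vertices of a 10 cm cube, far-field
# white-noise source at azimuth 40 deg, elevation 20 deg (integer-sample delays).
rng = np.random.default_rng(0)
mic_pos = 0.05 * np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], float)
az0, el0 = np.deg2rad(40.0), np.deg2rad(20.0)
u0 = np.array([np.cos(el0) * np.cos(az0), np.cos(el0) * np.sin(az0), np.sin(el0)])
s = rng.standard_normal(4096)
frames = [np.roll(s, -int(round(m @ u0 / C * FS))) + 0.01 * rng.standard_normal(4096)
          for m in mic_pos]

az_grid = np.deg2rad(np.arange(0, 360, 2.0))
el_grid = np.deg2rad(np.arange(-80, 81, 2.0))
az_hat, el_hat = srp_grid_search(frames, mic_pos, az_grid, el_grid)
print(np.rad2deg(az_hat), np.rad2deg(el_hat))   # close to (40, 20), up to grid/quantization error
```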
| Scheme | Key Technique(s) | Core Accuracy/Benefit |
|---|---|---|
| Compressive Sensing | Sparse encoding, $\ell_1$ solvers | >100:1 data compression, TDOA error within one sample period |
| Manifold Learning | RKHS + Laplacians on RTFs | Adaptive refinement, robust to reverberation |
| PNN+GCC | Probabilistic neural network on GCC features | ≈4.6° azimuth, ≈3.1° elevation error ($T_{60}$ = 600 ms) |
| Ray/Particle Methods | Inverse tracing, MCL, UTD | $0.6$–$0.8$ m in a 7 m × 7 m × 3 m space, robust to NLOS |
| Massive Arrays | MUSIC, Zadoff–Chu IDs, PnP | Multi-source DoA identification, $55.7$ mm median positional error |
| Mobile/SLAM Arrays | Multi-robot triangulation | <0.3 m when inter-robot angle >30° |
7. Applications, Limitations, and Future Challenges
SSL has impactful applications in robotics (navigation, search and rescue), human–machine interaction (e.g., UGV operator tracking (Liu et al., 29 Jul 2025)), surveillance, hearing aids, and asset tracking. Data-efficient and low-power schemes (e.g., compressive sensing) are crucial for sensor networks or embedded applications with tight resource constraints.
Each method has specific limitations. Compressive sensing relies on accurate construction of the sparsity basis from a reference sensor and may require careful calibration. Manifold-based learning presumes a stable, low-dimensional acoustic manifold, an assumption valid primarily in fixed environments; significant configuration changes degrade performance unless retraining is feasible. Diffraction/ray-tracing methods are limited by the quality of the environment reconstruction and are often computationally demanding for real-time deployment. Multi-source tracking with large numbers of sources places demands on both array hardware and source discrimination methods.
Emerging areas involve SSL in non-line-of-sight and highly dynamic scenes, large-scale multi-source localization with sparse microphones, and advanced learning-based hybrid methods that can adapt to unknown, time-varying sensor geometry and conditions. Robustness to faulty or missing sensors, ad-hoc array topologies, and seamless integration with vision, SLAM, or semantic scene understanding represent prominent open challenges.