Sound Event Localization & Detection

Updated 5 December 2025
  • Sound Event Localization and Detection (SELD) is defined as a unified task that simultaneously identifies the temporal activity, class, and spatial origin of sound events in multichannel audio scenes.
  • Advanced methodologies like CRNNs, attention-driven models, and permutation-invariant training tackle the challenges of overlapping events and reverberation in complex acoustic environments.
  • Key innovations including data augmentation, sensor fusion, and 3D localization are driving progress towards robust, real-time, and embedded SELD applications.

Sound Event Localization and Detection (SELD) is the computational task of jointly identifying the temporal activity, class, and spatial origin (direction of arrival, or DOA) of sound events in multichannel audio scenes. Unlike standard sound event detection (SED), which recognizes classes and activities, or classic DOA estimation, which localizes sources, SELD unifies these in a single framework, outputting temporally resolved, class-labeled, and spatially indexed acoustic events. SELD tasks are central to spatial audio analysis, machine listening, environmental context awareness, mobile and wearable audio applications, and human-computer interaction.

1. Problem Formulation and Output Representations

SELD is typically defined for a fixed set of $E$ event classes. At each time frame $t$ (e.g., every 20 ms), binary activity labels $y_{e,t} \in \{0,1\}$ indicate class presence. If $y_{e,t} = 1$, a spatial location is attributed per event, most commonly parameterized as azimuth/elevation angles $(\theta_{e,t}, \phi_{e,t})$ or as a unit-norm Cartesian vector $(x_{e,t}, y_{e,t}, z_{e,t})$:

$$x_{e,t} = \cos \phi_{e,t} \cos \theta_{e,t}, \qquad y_{e,t} = \cos \phi_{e,t} \sin \theta_{e,t}, \qquad z_{e,t} = \sin \phi_{e,t}$$

Recent architectures unify detection and localization in joint outputs, such as the Activity-Coupled Cartesian DOA (ACCDOA) vector, which encodes detection and DOA as a single 3D regression target per class per frame: $\mathbf{v}_{e,t} = y_{e,t} \cdot \mathbf{u}_{e,t}$, where $\mathbf{u}_{e,t}$ is the unit-norm Cartesian DOA vector. Permutation-invariant trackwise formats and multi-task (multi-branch) outputs are employed to address overlapping, same-class events in polyphonic scenes, leveraging model heads or explicit tracks per simultaneous event (Cao et al., 2020, Hu et al., 2022).
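As a concrete illustration, here is a minimal NumPy sketch of the angle-to-Cartesian conversion and ACCDOA target construction above (function and argument names are illustrative, not drawn from any cited implementation):

```python
import numpy as np

def accdoa_targets(activity, azimuth, elevation):
    """Build ACCDOA regression targets from activity labels and DOA angles.

    activity:  (T, E) binary class-activity labels y_{e,t}
    azimuth:   (T, E) azimuth angles theta_{e,t} in radians
    elevation: (T, E) elevation angles phi_{e,t} in radians
    returns:   (T, E, 3) vectors v_{e,t} = y_{e,t} * u_{e,t}
    """
    # Unit-norm Cartesian DOA vector u_{e,t} from azimuth/elevation.
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    u = np.stack([x, y, z], axis=-1)           # (T, E, 3)
    # Scale by activity: inactive classes collapse to the zero vector.
    return activity[..., None] * u

# At inference, detection is recovered by thresholding the vector norm,
# e.g. active = np.linalg.norm(v_pred, axis=-1) > 0.5, and the DOA by
# normalizing v_pred back to unit length.
```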

Typical SELD models are trained with objective functions composed of a detection loss (binary cross-entropy or equivalent) and a localization loss (usually mean squared error on angular or Cartesian coordinates):

$$L_{\mathrm{total}} = L_{\mathrm{CE}}(y_{e,t}, \hat{y}_{e,t}) + \lambda\, L_{\mathrm{MSE}}\big([\theta_{e,t}, \phi_{e,t}], [\hat{\theta}_{e,t}, \hat{\phi}_{e,t}]\big)$$

where $\lambda$ balances the detection and localization terms (Adavanne et al., 2019).
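A minimal PyTorch rendering of this composite objective might look as follows (a sketch assuming frame-wise model outputs; masking the localization term to active frames is common practice, but the exact weighting varies across systems):

```python
import torch
import torch.nn.functional as F

def seld_loss(sed_logits, doa_pred, sed_target, doa_target, lam=1.0):
    """Composite SELD loss: detection BCE plus lambda-weighted localization MSE.

    sed_logits: (B, T, E) raw detection scores
    doa_pred:   (B, T, E, 3) predicted Cartesian DOA vectors
    sed_target: (B, T, E) binary activity labels (float)
    doa_target: (B, T, E, 3) reference unit DOA vectors
    """
    l_det = F.binary_cross_entropy_with_logits(sed_logits, sed_target)
    # Only penalize localization where the event is actually active.
    mask = sed_target.unsqueeze(-1)
    l_loc = F.mse_loss(doa_pred * mask, doa_target * mask)
    return l_det + lam * l_loc
```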

2. Dataset Construction and Evaluation Protocol

The development of robust SELD systems requires datasets that capture realistic spatial, temporal, and physical scenarios. Reference datasets are synthesized by convolving isolated sound events with spatially dense real-world impulse responses, covering multiple rooms and microphone types. For instance, the DCASE 2019 dataset (Adavanne et al., 2019) uses five rooms, with over 500 distinct spatial positions per room, and generates scenes from isolated event samples (e.g., 11 classes) overlapped according to a specified polyphony. Recording formats include first-order Ambisonics (FOA) and tetrahedral microphone arrays.

Stimulus signals are rendered with room-specific impulse responses at controlled SNRs, and ambient noise is added to approximate realistic reverberant backgrounds. Recent challenges extend this paradigm to moving sources, randomizable polyphonies, and more varied room configurations (Politis et al., 2020).
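The core synthesis step can be sketched as follows, assuming SciPy, mono event waveforms, and measured multichannel impulse responses (the helper name and SNR convention are illustrative):

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_event(event, rir, noise, snr_db):
    """Render a mono event into a multichannel scene at a target SNR.

    event: (L,) mono event waveform
    rir:   (C, K) room impulse response, one per microphone channel
    noise: (C, N) ambient noise recording, N >= L + K - 1
    """
    # Convolve the dry event with each channel's impulse response.
    wet = np.stack([fftconvolve(event, rir[c]) for c in range(rir.shape[0])])
    noise = noise[:, :wet.shape[1]]
    # Scale noise so the scene hits the requested signal-to-noise ratio.
    sig_pow = np.mean(wet ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return wet + gain * noise
```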

Evaluation employs cross-validation splits. Metrics are computed via optimal assignment matching (e.g., Hungarian algorithm) between predictions and references:

  • Sound event detection: error rate (ER = (S+D+I)/N, where S, D, and I count substitutions, deletions, and insertions, and N is the number of reference events) and F1-score
  • DOA estimation: average angular error (degrees) and frame recall (fraction of frames with correct detection count)
  • Location-dependent measures for joint evaluation: precision, recall, and F1-score with an angular tolerance threshold, and localization recall and error on class-matched events

No single official composite "SELD score" is universally used; typical practice averages metric ranks or combines detection ER/F-score with DOA error (Adavanne et al., 2019, Hu et al., 2022).
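For the localization metrics, the optimal-assignment step can be sketched with SciPy's Hungarian solver (a per-frame illustration assuming unit-norm DOA vectors):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def angular_error_deg(pred, ref):
    """Mean angular error after optimal (Hungarian) matching.

    pred: (P, 3) predicted unit DOA vectors for one frame
    ref:  (R, 3) reference unit DOA vectors for one frame
    """
    # Pairwise angles between all predicted and reference directions.
    cos = np.clip(pred @ ref.T, -1.0, 1.0)    # (P, R)
    angles = np.degrees(np.arccos(cos))
    # Optimal one-to-one assignment minimizing total angular error.
    rows, cols = linear_sum_assignment(angles)
    return angles[rows, cols].mean()
```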

3. Model Architectures and Techniques

The canonical architecture for SELD is the Convolutional Recurrent Neural Network (CRNN) (Adavanne et al., 2018, Adavanne et al., 2019). The pipeline comprises the following stages (a minimal sketch follows the list):

  • Feature extraction: channel-wise STFT, yielding both magnitude and phase, or log-mel spectrograms. Additional spatial features (e.g., generalized cross-correlation (GCC), intensity vectors) are often concatenated.
  • Convolutional front-end: 2D CNNs (e.g., 3 layers, 64 filters) extract local time-frequency/channel patterns and preserve temporal resolution through frequency-only pooling.
  • Temporal modeling: 2 bidirectional GRU (or LSTM) layers (typically 128 units); these capture event onsets/offsets and temporal context.
  • Output branches: separate SED head (sigmoid) for class activity and DOA head (linear) regressing angles or (x, y, z) coordinates.
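A minimal PyTorch sketch of this pipeline, using the hyperparameters listed above (the input channel count, e.g., 4 FOA channels plus 3 intensity-vector channels, is an assumption for illustration):

```python
import torch
import torch.nn as nn

class SELDCRNN(nn.Module):
    """Minimal CRNN: conv front-end, BiGRU, and SED/DOA output branches."""

    def __init__(self, in_ch=7, n_classes=11, n_freq=64):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(3):                      # 3 conv blocks, 64 filters each
            layers += [nn.Conv2d(ch, 64, 3, padding=1),
                       nn.BatchNorm2d(64), nn.ReLU(),
                       nn.MaxPool2d((1, 4))]    # pool over frequency only,
            ch = 64                             # preserving temporal resolution
        self.cnn = nn.Sequential(*layers)
        rnn_in = 64 * (n_freq // 4 ** 3)
        self.rnn = nn.GRU(rnn_in, 128, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.sed_head = nn.Linear(256, n_classes)      # sigmoid activity
        self.doa_head = nn.Linear(256, 3 * n_classes)  # linear (x, y, z)

    def forward(self, x):                       # x: (B, in_ch, T, n_freq)
        h = self.cnn(x)                         # (B, 64, T, n_freq / 64)
        h = h.permute(0, 2, 1, 3).flatten(2)    # (B, T, features)
        h, _ = self.rnn(h)
        sed = torch.sigmoid(self.sed_head(h))   # (B, T, E)
        doa = self.doa_head(h)                  # (B, T, 3E)
        return sed, doa
```

Frequency-only pooling keeps one RNN step per STFT frame, so the outputs retain the frame-level temporal resolution that the labels require.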

Squeeze-excitation (SE) blocks (Naranjo-Alcazar et al., 2020) and residual connections further improve channel selectivity and training stability. Transformer-based and attention-driven models (e.g., Channel-Spectro-Temporal attention) achieve state-of-the-art performance by enabling explicit modeling of spatial, spectral, and temporal dependencies (Shul et al., 2023).

In deployment-oriented scenarios, lightweight variants using shallow CRNNs and optimized feature extraction (e.g., log-power spectrograms + inter-channel phase differences) are used for embedded applications (Yeow et al., 18 Sep 2025).
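As an illustration, such lightweight features can be computed directly from the complex STFT (a NumPy sketch; the first-microphone reference convention for the phase differences is an assumption):

```python
import numpy as np

def lightweight_features(stft):
    """Log-power spectrogram plus inter-channel phase differences (IPDs).

    stft: (C, T, F) complex STFT of a C-channel recording
    returns: (2C - 1, T, F) real-valued feature stack
    """
    log_power = np.log(np.abs(stft) ** 2 + 1e-10)
    # Phase difference of each channel relative to the first microphone.
    ipd = np.angle(stft[1:] * np.conj(stft[:1]))
    return np.concatenate([log_power, ipd], axis=0)
```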

4. Advancements: Output Formats, Data Augmentation, and Robustness

Multiple innovations address the challenges of polyphonic scenes, reverberation, SNR variability, and data efficiency:

  • Trackwise and Permutation-Invariant Training (PIT): To resolve label ambiguity in overlapping events of the same class, EINV2 and related architectures produce multiple per-frame tracks and solve the optimal assignment during training (Cao et al., 2020, Hu et al., 2022); a minimal PIT sketch follows this list. mACCDOA and ADPIT extensions accommodate multiple, possibly co-located events.
  • Data augmentation: Aggressive chains of mixup, cutout, multichannel SpecAugment, synthetic reverberation, spatial (inc. Ambisonic) rotations, and random channel dropout create models robust to unseen conditions and microphone geometries (Hu et al., 2022, Shimada et al., 2020).
  • Self-supervised and cross-modal pretraining: Strategies such as w2v-SELD (wav2vec2.0 adapted to spatial audio), DOA-aware audio-visual contrastive learning, and large-scale SEC pretraining (PSELDNets) further reduce label requirements and improve domain transfer (Santos et al., 2023, Fujita et al., 30 Oct 2024, Hu et al., 10 Nov 2024).
  • Sensor fusion: For wearables or robotics, systems incorporate inertial/motion signals to account for self-motion, thereby conditioning audio features on sensor-derived velocity and orientation for accurate moving-listener SELD (Yasuda et al., 4 Mar 2024).
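The trackwise PIT idea referenced in the first bullet can be sketched as follows (assuming a small, fixed number of tracks so that enumerating permutations is cheap; real systems such as EINV2 differ in loss details):

```python
import itertools
import torch

def pit_loss(pred, target, loss_fn):
    """Permutation-invariant training loss over output tracks.

    pred, target: (B, tracks, T, D) trackwise predictions and references
    loss_fn: elementwise (unreduced) loss, e.g. lambda a, b: (a - b) ** 2
    """
    n_tracks = pred.shape[1]
    losses = []
    for perm in itertools.permutations(range(n_tracks)):
        # Loss of this track-to-reference assignment, per batch item.
        l = torch.stack([loss_fn(pred[:, i], target[:, j]).mean(dim=(-2, -1))
                         for i, j in enumerate(perm)]).sum(dim=0)
        losses.append(l)
    # Keep the best permutation for each batch item.
    return torch.stack(losses).min(dim=0).values.mean()
```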

5. Extensions: 3D SELD, Open-set, and Application-Specific Frameworks

Recent research extends SELD in several directions:

  • 3D SELD (with distance estimation): Models now predict not only DOA but also the distance to each event, by extending ACCDOA or multi-task outputs with a distance dimension. Both single-task (joint) and two-branch (separate) schemes have been demonstrated to provide sub-meter errors without degrading SED/DOA performance (Krause et al., 18 Mar 2024, Vo et al., 23 Jul 2025); a decoding sketch follows this list.
  • Text-queried or open-set SELD: Fixed class sets are replaced with text-conditioned detection/localization, using pretrained audio-text encoders for semantic guidance (Zhao et al., 23 Jun 2024). This supports user-specified queries but currently results in less precise localization compared to closed-set systems.
  • Hierarchical and two-step approaches: Hierarchical models first detect events then localize (or vice versa); two-step frameworks decouple training to prevent SED/DOA conflicts and then fuse representations for robust downstream SELD (Pertilä et al., 2021, Yu, 30 Jul 2025).
  • Resource-constrained/edge SELD: ASC-conditioned SELD enables context-aware thresholding and on-device computation for wearables and low-power systems, with real-time latency and low model complexity (Yeow et al., 18 Sep 2025).
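As an illustration of the distance extension in the first bullet, one possible frame-level decoding of an ACCDOA-plus-distance output (a sketch; the threshold value and the separate distance branch are assumptions, with single-branch variants instead scaling the vector by distance):

```python
import numpy as np

def decode_accdoa_distance(vec, dist, thresh=0.5):
    """Decode an ACCDOA-plus-distance output for one frame.

    vec:  (E, 3) predicted activity-coupled Cartesian DOA vectors
    dist: (E,) predicted source distances in meters (separate branch)
    """
    norm = np.linalg.norm(vec, axis=-1)
    active = norm > thresh                        # detection from vector length
    doa = vec / np.maximum(norm[:, None], 1e-9)   # unit DOA per class
    return active, doa, dist
```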

6. Error Analysis and Limitations

Comprehensive error analyses reveal the major bottlenecks for SELD:

  • Polyphony: Error rates and localization deviations rise sharply with increased event overlap, primarily due to missed events rather than increased mislocalization. Systems tend to perform best at the degree of polyphony most prevalent in their training data. Addressing “homogeneous overlap” (same-class, different-DOA events) requires trackwise or PIT-based models (Nguyen et al., 2021, Cao et al., 2020).
  • Reverberation and motion: Scenes with high RT60 or moving sources yield higher localization errors without significant loss in detection, indicating that spatial cues are more easily disrupted by room acoustics and source dynamics (Adavanne et al., 2019, Politis et al., 2020).
  • Class-DOA association: Many models rely on a hard-wired association of class and DOA outputs; errors arise in correctly pairing these, especially for simultaneous same-class events.
  • Generalization: Domain shift, device-specific array response, or environmental variation can degrade performance even for strong models unless countered by extensive data augmentation or pretraining (Hu et al., 2022).
  • Open-set/text-queried SELD: While text-guided localization offers natural interface flexibility, current implementations suffer from higher mean absolute errors and lack explicit detection gating (Zhao et al., 23 Jun 2024).

7. Impact, Benchmarks, and Future Directions

SELD has catalyzed a progression from modular pipelines to unified, end-to-end deep learning systems resilient to environmental complexity and polyphonic overlap. Benchmark datasets and rigorous evaluation protocols have driven reproducible advances. Foundation models pre-trained on large-scale synthetic spatial audio now achieve strong transfer, with fine-tuning strategies (e.g., AdapterBit) enabling efficiency even for monophonic recordings (Hu et al., 10 Nov 2024).

Future research directions include more accurate 3D localization including distance, joint audio-visual models, physically informed or uncertainty-aware loss formulations, open-set and prompt-based SELD, broader cross-device adaptation, and integration into smart audio, AR/VR, and context-aware robotics.

SELD remains a complex, unsolved problem characterized by the need for robust spatiotemporal reasoning in dynamic, noisy, and highly overlapping real-world sound scenes, bridging signal processing, machine learning, and perceptual modeling (Adavanne et al., 2018, Adavanne et al., 2019, Cao et al., 2020, Hu et al., 2022, Shul et al., 2023, Hu et al., 10 Nov 2024, Nguyen et al., 2021).
