DCASE2025 Task3 Stereo SELD Dataset
- DCASE2025 Task3 Stereo SELD Dataset is a large-scale benchmark resource designed to assess sound event localization and detection (SELD) using stereo audio adapted from multichannel recordings.
- The dataset employs dedicated audio and video conversion techniques, including mid-side processing and field-of-view cropping, to transform multichannel signals into stereo format.
- Baseline methodologies employ multi-task CRNNs and hybrid audiovisual models for azimuth estimation, distance regression, and onscreen/offscreen classification.
The DCASE2025 Task3 Stereo SELD Dataset is a large-scale benchmark resource developed for the training and evaluation of Sound Event Localization and Detection (SELD) systems using stereo audio. Recognizing the prevalence of two-channel audio in practical consumer and media scenarios, the dataset represents a significant departure from previous DCASE SELD challenges, which used multichannel first-order Ambisonics (FOA) audio. The DCASE2025 iteration introduces not only stereo SELD as the main challenge but also a dedicated dataset, updated baseline methodologies, and new metrics, reflecting advances and research themes consolidated from numerous preceding works (Wilkins et al., 2023, Krause et al., 18 Mar 2024, Shimada et al., 16 Jul 2025).
1. Dataset Composition and Signal Conversion
The DCASE2025 Task3 Stereo SELD Dataset is derived from the Sony-TAU Realistic Spatial Soundscapes (STARSS23), which originally comprises multichannel (FOA) audio and 360° video recordings with detailed spatiotemporal annotations for 13 sound event classes. To adapt these recordings to the stereo SELD domain, each 5-second clip is processed as follows:
- Audio Conversion:
The FOA signals are first rotated to a canonical orientation (corresponding to the front of the intended field of view). The stereo conversion is realized via a mid-side process, using the formula:

$$L = \frac{W + Y}{2}, \qquad R = \frac{W - Y}{2},$$

where $L$ and $R$ represent the left and right stereo channels, $W$ is the omnidirectional component, and $Y$ is the horizontal dipole component of the FOA signal (Shimada et al., 16 Jul 2025). A code sketch of this conversion appears at the end of this subsection.
- Video Conversion:
The associated 360° videos are cropped to a 100° horizontal field-of-view, yielding perspective video at a resolution of 640×360 pixels (16:9), more typical of everyday media.
- Label Processing:
Ground truth DOA annotations are rotated to center on the chosen view, with elevation information discarded, and azimuths outside the FOV are folded to the front hemisphere. A binary flag is assigned to each event as “onscreen” or “offscreen,” based on whether the spatial label falls within the visual frame.
The resulting dataset consists of 30,000 development clips (≈41.7 hours) and 10,000 evaluation clips (≈13.9 hours), with carefully synchronized audio, video, and label streams (Shimada et al., 16 Jul 2025).
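As a concrete illustration of the audio conversion step, the following is a minimal NumPy sketch of the mid-side rendering, assuming the 1/2 scaling in the formula above and a 24 kHz sample rate for the 5-second clips; `foa_to_stereo` is a hypothetical helper, not the official dataset tooling.

```python
import numpy as np

def foa_to_stereo(w: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Mid-side stereo rendering of a rotated FOA clip (sketch).

    w: omnidirectional FOA channel, shape (num_samples,)
    y: horizontal (left-right) dipole FOA channel, shape (num_samples,)
    Returns a (2, num_samples) stereo array [left, right].
    """
    left = 0.5 * (w + y)   # mid + side
    right = 0.5 * (w - y)  # mid - side
    return np.stack([left, right])

# Example on stand-in signals for one 5-second clip (24 kHz assumed).
sr = 24000
w = np.random.randn(5 * sr).astype(np.float32)
y = np.random.randn(5 * sr).astype(np.float32)
stereo = foa_to_stereo(w, y)   # shape (2, 120000)
```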
2. Acoustic and Spatial Properties in Stereo SELD
In contrast to four-channel FOA, stereo audio inherently restricts the available spatial cues. Core implications include:
- Azimuth Estimation:
SELD with stereo audio is restricted to DOA estimation in the azimuth (left–right) plane. Elevation and front–back localization are fundamentally ambiguous due to the two-channel format. Nearly 48% of front-quadrant sources may be misassigned to the back when using stereo, a phenomenon also observed in comparative studies (Wilkins et al., 2023).
- Distance Estimation:
The challenge incorporates source distance estimation. However, direct regression over the dataset's large dynamic range (0.04–7.64 m) requires careful normalization to avoid biasing loss functions toward distant sources. Normalization is achieved by standardizing the distances and then scaling to [–1, 1], yielding more stable training and improved mean absolute errors (Yeow et al., 1 Jul 2025); a sketch of this normalization follows the list.
- Onscreen/Offscreen Classification:
In the audiovisual track, the limited FOV of the video motivates an additional binary classification task, mirroring common audiovisual scene analysis where many sources may be spatially localized but visually absent (Shimada et al., 16 Jul 2025).
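Below is a minimal sketch of the distance-target normalization described above (standardize, then rescale into [−1, 1]); the exact statistics used by individual systems may differ.

```python
import numpy as np

def normalize_distances(d: np.ndarray):
    """Standardize distances, then scale into [-1, 1] (sketch)."""
    mu, sigma = d.mean(), d.std()
    z = (d - mu) / sigma          # zero mean, unit variance
    z = z / np.abs(z).max()       # rescale into [-1, 1]
    return z, mu, sigma           # keep stats to invert at inference

# Example over the dataset's stated distance range (0.04-7.64 m).
d = np.random.uniform(0.04, 7.64, size=1000)
z, mu, sigma = normalize_distances(d)
assert z.min() >= -1.0 and z.max() <= 1.0
```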
3. Baseline Methodologies and Model Architectures
The baseline system for DCASE2025 Task3 is designed around a multi-task CRNN pipeline:
- Audio-Only Track:
Stereo log-mel spectrograms (64 bands) are processed by three convolutional layers (64 filters each), bidirectional GRUs, and multi-head self-attention (8 heads, 128-dimensional). The outputs are used to jointly predict sound event activity, azimuth DOA (via x, y Cartesian regression), and distance in a multi-ACCDOA framework capable of supporting up to three overlapping events per class (Shimada et al., 16 Jul 2025); a hedged architectural sketch follows this list.
- Audiovisual Track:
The audio branch is augmented with a ResNet-50-based visual front-end; audio and visual embeddings are fused using transformer decoders with cross-attention. The unified output head jointly predicts event class, localization, distance, and onscreen/offscreen flags (Shimada et al., 16 Jul 2025, Berghi et al., 7 Jul 2025).
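To make the audio-only pipeline concrete, here is a hedged PyTorch sketch of a CRNN with the shape described above (three 64-filter conv layers, bidirectional GRUs, 8-head 128-dimensional self-attention, and a multi-ACCDOA-style head for 3 tracks × 13 classes × (x, y, distance)). The pooling factors, GRU depth, and tanh output activation are assumptions, not the official baseline implementation.

```python
import torch
import torch.nn as nn

class StereoSELDBaseline(nn.Module):
    """Sketch of a multi-task stereo SELD CRNN (assumed details noted above)."""

    def __init__(self, n_mels: int = 64, n_classes: int = 13, n_tracks: int = 3):
        super().__init__()
        blocks, in_ch = [], 2                   # 2 input channels: stereo log-mels
        for pool in (4, 4, 2):                  # frequency-only pooling (assumed)
            blocks += [
                nn.Conv2d(in_ch, 64, kernel_size=3, padding=1),
                nn.BatchNorm2d(64),
                nn.ReLU(),
                nn.MaxPool2d((1, pool)),
            ]
            in_ch = 64
        self.conv = nn.Sequential(*blocks)
        feat = 64 * (n_mels // (4 * 4 * 2))     # 64 filters x 2 remaining mel bins
        self.gru = nn.GRU(feat, 64, num_layers=2, batch_first=True,
                          bidirectional=True)   # 128-dim output
        self.attn = nn.MultiheadAttention(embed_dim=128, num_heads=8,
                                          batch_first=True)
        self.head = nn.Linear(128, n_tracks * n_classes * 3)  # (x, y, dist)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2, time, n_mels) stereo log-mel spectrogram
        h = self.conv(x)                        # (batch, 64, time, 2)
        h = h.permute(0, 2, 1, 3).flatten(2)    # (batch, time, 128)
        h, _ = self.gru(h)                      # (batch, time, 128)
        h, _ = self.attn(h, h, h)               # temporal self-attention
        return torch.tanh(self.head(h))         # (batch, time, 3*13*3)

# Example: one 100-frame clip.
out = StereoSELDBaseline()(torch.randn(1, 2, 100, 64))
print(out.shape)  # torch.Size([1, 100, 117])
```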
Significant recent research extends this baseline:
- Pseudo-FOA Feature Conversion: Systems convert stereo input to pseudo-FOA for compatibility with pre-trained FOA-oriented models by inverting the mid-side rendering:

$$W = L + R, \qquad Y = L - R,$$

with the unrecoverable front–back ($X$) and vertical ($Z$) dipoles typically zero-filled. This enables reuse of robust encoders and pre-training strategies (Gao et al., 16 Jun 2025, Gao et al., 13 Jul 2025); a code sketch follows this list.
- Sequence Modeling Innovations: Recent systems increasingly replace Conformer- or transformer-based decoder modules with bidirectional Mamba (BiMamba) blocks. BiMamba leverages selective state-space modeling with both forward and backward temporal dynamics, offering reduced computational complexity and improved long-range context modeling (Mu et al., 9 Aug 2024, Gao et al., 16 Jun 2025, Gao et al., 13 Jul 2025).
- Acoustic Feature Engineering:
Incorporation of perceptually inspired features—including mid-side intensity vectors, magnitude-squared coherence, and inter-channel level differences—provides richer spatial and distance cues in the absence of elevation information (Yeow et al., 1 Jul 2025, Berghi et al., 7 Jul 2025).
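A minimal sketch of the pseudo-FOA conversion referenced above, inverting the mid-side formula; zero-filling the X and Z channels and the ACN channel ordering are assumptions that vary between systems.

```python
import numpy as np

def stereo_to_pseudo_foa(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Recover pseudo-FOA channels from stereo (sketch).

    W (omni) and Y (left-right dipole) follow from inverting the mid-side
    rendering; the front-back (X) and vertical (Z) dipoles cannot be
    recovered from stereo and are zero-filled here.
    Returns a (4, num_samples) array in ACN order [W, Y, Z, X].
    """
    w = left + right            # mid
    y = left - right            # side
    zeros = np.zeros_like(w)
    return np.stack([w, y, zeros, zeros])
```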
4. Data Augmentation and Training Strategies
Robust model performance on stereo SELD requires diverse data augmentation and training practices:
- Channel Swapping:
Audio Channel Swapping (ACS) exchanges the left and right channels and mirrors labels about the frontal axis, effectively doubling the training examples and mitigating lateral bias (Wilkins et al., 2023, Yeow et al., 1 Jul 2025, Berghi et al., 7 Jul 2025). This technique is especially critical for stereo, where spatial diversity is limited; a code sketch follows this list.
- Spectrogram Domain Augmentation:
FilterAugment introduces realistic band-specific gain distortions, frequency shifting simulates mild pitch changes, and inter-channel-aware time-frequency masking (ITFM) applies SpecAugment-like masking that preserves stereo spatial cues (Yeow et al., 1 Jul 2025).
- Pre-Training and Transfer Learning:
Scene-dedicated strategies such as pre-training on synthetic/FOA data with subsequent fine-tuning on stereo recordings bridge domain gaps, improving event detection and localization robustness (Huang et al., 2023, Gao et al., 16 Jun 2025).
- Loss Function Balancing:
For distance estimation, mean squared error (MSE), mean absolute error (MAE), mean squared percentage error (MSPE), and mean absolute percentage error (MAPE) losses are evaluated. While MSE preserves SELD performance, MAE often yields lower distance error; careful hybridization or weighting is recommended (Krause et al., 18 Mar 2024).
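A minimal sketch of the ACS augmentation described in the first item of this list; the azimuth label format (degrees, front-folded to [−90°, 90°]) is an assumption.

```python
import numpy as np

def audio_channel_swap(stereo: np.ndarray, azimuth: np.ndarray):
    """Audio Channel Swapping (ACS): swap L/R and mirror labels (sketch).

    stereo:  (2, num_samples) waveform
    azimuth: per-event azimuth labels in degrees
    Swapping the channels mirrors the scene about the frontal axis,
    so each azimuth label flips sign.
    """
    swapped = stereo[::-1].copy()   # [right, left]
    mirrored = -azimuth             # left <-> right
    return swapped, mirrored
```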
5. Metrics and Evaluation Protocols
Evaluation protocols for the stereo SELD dataset are revised to account for the restricted field of spatial inference:
| Task/Metric | Description | Unit |
|---|---|---|
| Localization-Dependent F1 | Detection F-score, requiring azimuth error below 20° and relative distance error within a threshold | % |
| DOA Estimation Error (CD) | Class-dependent average azimuth error for correct-class detections | degrees |
| Relative Distance Error | Mean relative deviation between estimated and ground-truth source distances | % |
| Onscreen/Offscreen Accuracy | Correctness of classifying events as visually onscreen or offscreen (AV track only) | % |
To reflect the practical challenges, the global system ranking is based on the localization-dependent F1 score, with the AV track additionally considering onscreen/offscreen macro F1 (Shimada et al., 16 Jul 2025).
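For concreteness, the relative distance error in the table above is the mean relative deviation over matched detections; one standard formulation (the challenge's exact matching procedure is assumed here) is:

$$\mathrm{RDE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \hat{d}_i - d_i \rvert}{d_i},$$

where $\hat{d}_i$ and $d_i$ are the estimated and ground-truth distances of the $i$-th matched detection.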
6. Research Findings, Limitations, and Outcomes
Empirical results indicate that while stereo SELD systems achieve acceptable lateral (left–right) localization, there is a marked increase in front–back confusion and loss of elevation discrimination compared to FOA-based approaches (Wilkins et al., 2023). Distance estimation remains an open research problem, with best systems leveraging normalization and loss balancing to improve performance (Yeow et al., 1 Jul 2025).
Hybrid audio-visual systems demonstrate modest improvements in detection and localization, and substantial gains in onscreen/offscreen classification accuracy (approximately 77.8% for the baseline AV system) (Shimada et al., 16 Jul 2025). Ensembling and the integration of semantic embeddings from large pre-trained models (e.g., CLAP for audio, OWL-ViT for visuals) further advance the state of the art by leveraging both spatial and semantic context (Berghi et al., 7 Jul 2025).
7. Impact, Applications, and Future Directions
The DCASE2025 Task3 Stereo SELD Dataset establishes a reference benchmark for sound event localization and detection in ordinary two-channel audio environments, enabling new research into practical, consumer-grade media conditions. It motivates the development of:
- Efficient sequence models (BiMamba) suitable for real-time and embedded applications (Mu et al., 9 Aug 2024, Gao et al., 16 Jun 2025, Gao et al., 13 Jul 2025).
- Feature engineering and augmentation schemes preserving limited spatial cues while augmenting data diversity (Yeow et al., 1 Jul 2025).
- Hybrid audio-visual reasoning supporting new tasks such as onscreen/offscreen event mapping (Berghi et al., 7 Jul 2025, Shimada et al., 16 Jul 2025).
Significant challenges remain, notably in precise distance regression and resolving angular ambiguities that are intrinsic to stereo signals. The integration of visual context, semantic embeddings, and advanced sequence modeling forms a promising direction for future advancements.
In sum, the DCASE2025 Task3 Stereo SELD Dataset marks a substantial evolution for SELD research, providing a foundation for robust and accessible event localization and detection systems in everyday media content. It is characterized by a meticulous derivation from established multichannel datasets, thoughtful adaptation to the stereo domain, and a suite of progressive baseline and advanced modeling strategies targeting the unique challenges of the stereo SELD setting.