DCASE 2025 Challenge Task 4
- DCASE 2025 Challenge Task 4 is a dual-track initiative combining device-aware acoustic scene classification and spatial semantic segmentation, benchmarking real-world sound analysis under realistic device and scene variability.
- The challenge exploits device identity for tailored fine-tuning, enabling compact model adaptation and improved performance in resource-constrained, multi-device environments.
- It introduces robust baselines with detailed evaluation metrics, achieving measurable gains in macro-averaged accuracy and class-aware SDR improvement under diverse, realistic settings.
The DCASE 2025 Challenge Task 4 encompasses two independent technical subtracks, both situated at the intersection of computational audition and real-world machine learning deployment—"Low-Complexity Acoustic Scene Classification with Device Information" and "Spatial Semantic Segmentation of Sound Scenes" (S5). Each subtrack targets distinct, state-of-the-art challenges: the former addresses device-aware acoustic scene classification (ASC) for resource-constrained scenarios, while the latter advances the semantic detection and separation of spatially complex, multi-event soundscapes. Both are formulated to benchmark progress in realistic, diverse, device- and scene-variable conditions, using publicly available, rigorously structured datasets and strict evaluation protocols (Schmid et al., 3 May 2025, Yasuda et al., 12 Jun 2025, Park et al., 26 Jun 2025).
1. Task Definitions and Technical Scope
Low-Complexity Acoustic Scene Classification with Device Information
This subtask is a ten-class ASC problem in which input audio consists of 1 s, 44.1 kHz single-channel snippets labeled by one of ten everyday environments (e.g., "Metro station," "Urban park"). The pivotal change for 2025 is that the recording device identity is revealed at inference, enabling device-specific models or model selection. The central objectives are:
- Determining whether device-aware inference improves classification over strictly device-agnostic models.
- Exploring methods for adapting compact models to known devices with few device-specific samples.
- Assessing the impact of transfer learning from large external scene datasets on classification performance (Schmid et al., 3 May 2025).
Spatial Semantic Segmentation of Sound Scenes (S5)
This subtrack formalizes the joint detection and separation of multiple, overlapping sound events from four-channel first-order Ambisonics (FOA) input, incorporating spatial information. For each event in a 10 s mixture, the system must output both an isolated ("wet") source assigned to a reference microphone and the corresponding event class label. The input model is:
$$x_m(t) = \sum_{k=1}^{K} h_{k,m}(t) * s_k(t) + n_m(t),$$
where $K$ is the (unknown) number of events, $s_k(t)$ is the anechoic waveform for event $k$, $h_{k,m}(t)$ is the room impulse response from the event position to microphone $m$, $*$ denotes convolution, and $n_m(t)$ is noise. The primary evaluation metric is class-aware SDR improvement (CA-SDRi), which rewards both accurate separation and correct class assignment (Yasuda et al., 12 Jun 2025).
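A minimal NumPy sketch of this signal model, assuming mono anechoic sources, per-microphone RIRs of equal length, and additive Gaussian noise (array shapes and the noise level are illustrative, not part of the challenge specification):

```python
import numpy as np

def render_mixture(sources, rirs, noise_std=0.01):
    """Render x_m(t) = sum_k h_{k,m}(t) * s_k(t) + n_m(t).

    sources: list of K mono anechoic waveforms, each of shape (T,)
    rirs:    array of shape (K, M, L), RIR from event k to microphone m
    returns: multichannel mixture of shape (M, T + L - 1)
    """
    K, M, L = rirs.shape
    T = sources[0].shape[0]
    mix = np.zeros((M, T + L - 1))
    for k in range(K):
        for m in range(M):
            # convolve each anechoic event with its room impulse response
            mix[m] += np.convolve(sources[k], rirs[k, m])
    # additive sensor noise n_m(t)
    mix += noise_std * np.random.randn(*mix.shape)
    return mix
```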
2. Dataset Composition and Protocol
Acoustic Scene Classification Dataset
All audio is sourced from the TAU Urban Acoustic Scenes 2022 Mobile corpus. Devices represent distinct domains: high-quality binaural recorder (A), three consumer mobile devices (B, C, D), and ten simulated devices (S1–S10) using device-specific impulse responses. The development set (≈64 h) consists of a 25% subset of the 2024 data, with train-dev samples from devices A, B, C, and S1–S3; test-dev additionally includes unseen simulated devices S4–S6. The evaluation set introduces device D and simulated S7–S10, as well as clips from new cities. Scene labels are withheld at evaluation, with device IDs provided (unknown devices labeled as such) (Schmid et al., 3 May 2025).
S5 Dataset Construction
The DCASE 2025 S5 dataset comprises spatial mixtures constructed using:
- Anechoic one-shot recordings of 18 target classes (curated/integrated from new recordings, FSD50K, EARS).
- 540 first-order Ambisonics RIRs spanning 3 rooms, 5 positions, 20° azimuth steps, 3 elevations, multiple distances.
- 10 s mixtures combining 1–3 target events + 1–2 interference events, random SNRs (target 5–20 dB, interference 0–15 dB), all sharing the same RIR per mixture instance.
- JSON/CSV annotations specify class set, event times, RIR identity, and SNR per mixture (Yasuda et al., 12 Jun 2025, Park et al., 26 Jun 2025).
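For illustration, the per-event SNR specification above could be realized by scaling each event relative to the mixture background before summation; the sketch below is a simplified version with hypothetical annotation field names, not the official generation code:

```python
import numpy as np

def scale_to_snr(event, background, target_snr_db):
    """Scale `event` so its power relative to `background` equals target_snr_db."""
    p_event = np.mean(event ** 2)
    p_bg = np.mean(background ** 2)
    # gain satisfying 10*log10((gain**2 * p_event) / p_bg) == target_snr_db
    gain = np.sqrt(p_bg / p_event * 10 ** (target_snr_db / 10.0))
    return gain * event

# Hypothetical annotation record for one event in a 10 s mixture
annotation = {"class": "Speech", "onset_s": 2.4, "rir_id": 137, "snr_db": 12.0}
```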
3. Modeling Frameworks and Complexity Constraints
ASC: Device-Aware Model Architecture
The official baseline uses a receptive-field-regularized, factorized CNN inspired by DCASE 2023. Input audio is resampled to 32 kHz, transformed to a 256-band Mel spectrogram (96 ms window, 16 ms hop), and processed by channel-wise factorized convolutions. Two-step protocol:
- General model $\theta_G$ trained on pooled devices.
- Device-specific fine-tuning: initialize $\theta_d \leftarrow \theta_G$ and fine-tune on data from device $d$ to obtain $\theta_d$.
Inference uses the device-indexed $\theta_d$ for known devices and falls back to $\theta_G$ for unknown devices. Complexity constraints: at most 128 kB of parameter memory and 30 MMACs per 1 s inference; the baseline uses 61,148 parameters (122.3 kB) and 29.4 MMACs (Schmid et al., 3 May 2025). Freq-MixStyle augmentation transfers frequency-band-wise statistics across devices during training.
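The two-step protocol can be summarized in a short PyTorch sketch; the model class, data loaders, and device IDs below are placeholders rather than the official baseline code:

```python
import copy
import torch

def device_aware_finetune(general_model, device_loaders, epochs=5, lr=1e-4):
    """Fine-tune a copy of the general model theta_G on each known device d."""
    device_models = {}
    for dev_id, loader in device_loaders.items():
        model = copy.deepcopy(general_model)       # initialize theta_d <- theta_G
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:                    # snippets recorded on device d
                opt.zero_grad()
                loss = torch.nn.functional.cross_entropy(model(x), y)
                loss.backward()
                opt.step()
        device_models[dev_id] = model              # theta_d
    return device_models

def predict(x, dev_id, general_model, device_models):
    """Use the device-indexed model when the device is known, else theta_G."""
    model = device_models.get(dev_id, general_model)
    return model(x).argmax(dim=-1)
```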
S5: Semantic Segmentation and Source Extraction Baselines
Baseline systems employ two-stage pipelines:
- Stage 1: M2D (Masked Modeling Duo) audio tagger producing class-wise masks/confidence scores.
- Stage 2: Universal source separation via ResUNet (one-to-one extraction) or ResUNetK (multi-query extraction). The latter supports parallel extraction for multiple simultaneous events, substantially aiding heavily overlapped scenes.
Losses include multi-label cross-entropy for classification and a scale-invariant signal reconstruction loss for separation (Yasuda et al., 12 Jun 2025).
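A sketch of how the two objectives could be combined during training, assuming a scale-invariant SDR formulation for the reconstruction term and a simple weighted sum (the official baseline may weight or formulate its losses differently):

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimated and reference waveforms (B, T)."""
    ref_energy = torch.sum(ref ** 2, dim=-1, keepdim=True) + eps
    proj = torch.sum(est * ref, dim=-1, keepdim=True) * ref / ref_energy
    noise = est - proj
    ratio = torch.sum(proj ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()

def total_loss(class_logits, class_targets, est_wav, ref_wav, alpha=1.0):
    # multi-label cross-entropy for the tagging stage
    cls = F.binary_cross_entropy_with_logits(class_logits, class_targets)
    # scale-invariant reconstruction loss for the separation stage
    sep = si_sdr_loss(est_wav, ref_wav)
    return cls + alpha * sep
```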
4. Evaluation Metrics and Experimental Outcomes
ASC Evaluation: Macro-Averaged Metrics
Primary metric: class-wise macro-averaged accuracy on the evaluation set. Multi-class cross-entropy is also reported. Baseline results:
| Model | Macro Accuracy (%) |
|---|---|
| General model | 50.72 (±0.47) |
| Device-specific suite | 51.89 (±0.05) |
Device-level fine-tuning confers consistent accuracy improvements for known devices; no gain for unseen devices (e.g., S4–S6) (Schmid et al., 3 May 2025).
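For reference, class-wise macro-averaged accuracy is simply the mean of per-class accuracies (per-class recall), as in this minimal NumPy sketch:

```python
import numpy as np

def macro_accuracy(y_true, y_pred, num_classes=10):
    """Mean of per-class accuracies, skipping classes absent from y_true."""
    per_class = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            per_class.append(np.mean(y_pred[mask] == c))
    return float(np.mean(per_class))
```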
S5 Evaluation: Class-Aware SDR Improvement (CA-SDRi)
Baseline CA-SDRi and accuracy are:
| System | Eval CA-SDRi (dB) | Eval Acc (%) |
|---|---|---|
| ResUNetK | 6.60 | 51.48 |
| ResUNet | 5.72 | 51.48 |
CA-SDRi is calculated as
$$\mathrm{CA\text{-}SDRi} = \frac{1}{|\mathcal{C}_{\mathrm{est}} \cup \mathcal{C}_{\mathrm{ref}}|} \sum_{c \in \mathcal{C}_{\mathrm{est}} \cup \mathcal{C}_{\mathrm{ref}}} P_c,$$
where $P_c$ is the class-specific SDRi if class $c$ is correctly detected (i.e., $c \in \mathcal{C}_{\mathrm{est}} \cap \mathcal{C}_{\mathrm{ref}}$) and 0 otherwise; false positives and false negatives thus incur zero reward (Yasuda et al., 12 Jun 2025).
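A simplified sketch of this computation, assuming per-class SDR improvements have already been computed for the correctly detected classes:

```python
def ca_sdri(ref_classes, est_classes, sdri):
    """Class-aware SDRi: average over the union of reference and estimated classes.

    ref_classes, est_classes: sets of class labels
    sdri: dict mapping each correctly detected class -> SDR improvement in dB
    """
    union = ref_classes | est_classes
    hits = ref_classes & est_classes          # correctly detected classes
    # false positives and false negatives contribute 0 to the sum
    total = sum(sdri[c] for c in hits)
    return total / len(union) if union else 0.0
```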
5. Recent Methodological Advances
A notable submission (Park et al., 26 Jun 2025) extended S5 performance through:
- Feature enrichment: Incorporating spectral roll-off and chroma features alongside the standard mel-spectrogram M2D embedding, improving separation of tonally and spectrally confusable classes (see the feature-stacking sketch after this list).
- Agent-based postprocessing: A rule-driven agent cross-validates initial event labels by re-tagging separated sources, pruning false positives with minimal overhead.
- Training dataset refinement: Outlier/short-duration samples were removed, and targeted external data added for sparse classes, rebalancing class distributions.
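A minimal illustration of the feature-enrichment idea using librosa; the exact features, frame parameters, and fusion strategy of the submission are not reproduced here, so treat this as a sketch:

```python
import librosa
import numpy as np

def enriched_features(wav, sr=32000, n_fft=1024, hop_length=320):
    """Stack log-mel, spectral roll-off, and chroma features frame-wise."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=128)
    log_mel = librosa.power_to_db(mel)
    rolloff = librosa.feature.spectral_rolloff(y=wav, sr=sr, n_fft=n_fft,
                                               hop_length=hop_length)
    chroma = librosa.feature.chroma_stft(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length)
    # concatenate along the feature axis: (128 + 1 + 12, n_frames)
    return np.concatenate([log_mel, rolloff, chroma], axis=0)
```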
Ablation studies confirmed that:
- Dataset refinement yields approximately 10.9% CA-SDRi increase versus the baseline.
- Feature enrichment provides an additional 0.22 dB gain.
- Agent correction slightly increases CA-SDRi and reduces false positives.
- An ensemble of all variants achieves a 14.7% relative improvement over baseline CA-SDRi (12.721 dB vs. 11.088 dB) (Park et al., 26 Jun 2025).
6. Discussion of Challenges and Research Directions
Key ongoing challenges include:
- For ASC, compact device-specialized models are most effective when sufficient device-specific data is available; very small device domains may require meta-learning or aggressive regularization.
- Unseen device generalization remains an open problem, motivating further research into domain generalization (adversarial learning, MixStyle methods, device IR augmentation).
- For S5, accurate separation and classification degrade in heavily reverberant or sparse data scenarios, especially for spectrally similar classes or rare event types.
- Joint optimization of detection, separation, and spatial consistency is not fully solved, especially when the number of events per mixture is variable.
- Recommendations include exploring foundational models (e.g., pretraining with M2D, SoundBeam), integrating explicit spatial/DOA cues into separation pipelines, and investigating permutation-invariant approaches with class-aware queries (Yasuda et al., 12 Jun 2025, Park et al., 26 Jun 2025).
A plausible implication is that as label and metadata availability (e.g., device IDs, spatial cues) increases, resource-efficient, adaptive models will become increasingly viable for on-device and edge deployment.
7. Outlook and Future Challenge Directions
The DCASE 2025 Challenge Task 4 demonstrates that high macro accuracies and robust event separation are attainable under extreme resource and training data constraints. The 2025 design choices, including device-aware inference and relaxation of external-data prohibitions, are empirically justified by consistent gains across both baseline and enhanced systems (Schmid et al., 3 May 2025, Park et al., 26 Jun 2025). Promising future research avenues include:
- Adaptive model ensembling responsive to device metadata.
- On-device continual learning or meta-adaptation for previously unseen domains.
- Joint inference leveraging both scene and device/city metadata.
- More rigorous domain generalization strategies for model robustness.
The integration of device and spatial domain information represents a key inflection point in the progression toward resilient, deployment-ready computational auditory systems.