DeepASA: Unified Auditory Scene Analysis
- DeepASA is a unified framework for auditory scene analysis that integrates object-oriented processing and iterative chain-of-inference to deliver aligned, robust multi-modal outputs.
- The system simultaneously handles tasks such as source separation, dereverberation, sound event detection, audio classification, and direction-of-arrival estimation with state-of-the-art benchmark performance.
- Its architecture combines dynamic STFT modules, transformer-based feature aggregation, and dual attention mechanisms to ensure precise object-centric representations in complex auditory environments.
DeepASA is a unified framework for auditory scene analysis that simultaneously addresses multi-input multi-output (MIMO) source separation, dereverberation, sound event detection (SED), audio classification, and direction-of-arrival estimation (DoAE). Designed to operate in complex spatial auditory scenes with overlapping and moving sources, DeepASA integrates object-oriented processing (OOP) and an iterative chain-of-inference (CoI) mechanism to produce robust, aligned, and dynamically refined object-centric representations. This article details the architecture, mathematical foundations, procedural strategies, task associations, and benchmark performance that define DeepASA (Lee et al., 21 Sep 2025).
1. Architecture and Processing Pipeline
DeepASA accepts multichannel audio waveforms as input, modeled as
$$y_m(t) = \sum_{n=1}^{N} \left( s^{\mathrm{dir}}_{n,m}(t) + s^{\mathrm{rev}}_{n,m}(t) \right) + v_m(t),$$

where $s^{\mathrm{dir}}_{n,m}(t)$ is the direct signal for source $n$ at channel $m$, $s^{\mathrm{rev}}_{n,m}(t)$ its reverberant component, and $v_m(t)$ the background noise term. The front-end uses a dynamic STFT module with learnable, time-varying Gaussian window functions of the form

$$w_\tau(n) = \exp\!\left( -\frac{(n - \mu_\tau)^2}{2\sigma_\tau^2} \right),$$

with frame-dependent, learnable parameters $\mu_\tau$ and $\sigma_\tau$, enabling adaptive focus on informative regions according to the spectral and temporal structure of the scene.
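As a sketch of how such a dynamic front-end could be realized, the snippet below parameterizes a learnable Gaussian analysis window and applies it in a standard STFT. The module name `DynamicGaussianWindow` and the use of shared (rather than per-frame) parameters are simplifying assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


class DynamicGaussianWindow(nn.Module):
    """Illustrative learnable Gaussian window for a dynamic STFT front-end."""

    def __init__(self, win_length: int = 512):
        super().__init__()
        self.win_length = win_length
        # Learnable centre and width of the Gaussian, shared across frames here;
        # a time-varying variant would predict them per frame from the input.
        self.mu = nn.Parameter(torch.tensor(win_length / 2))
        self.log_sigma = nn.Parameter(torch.tensor(float(win_length) / 8).log())

    def forward(self) -> torch.Tensor:
        n = torch.arange(self.win_length, dtype=torch.float32)
        sigma = self.log_sigma.exp()
        return torch.exp(-0.5 * ((n - self.mu) / sigma) ** 2)


# Usage: apply the learned window inside a standard STFT.
window_fn = DynamicGaussianWindow(512)
x = torch.randn(4, 16000)                      # (channels, samples)
spec = torch.stft(x, n_fft=512, hop_length=128,
                  window=window_fn(), return_complex=True)
print(spec.shape)                              # (4, 257, num_frames)
```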
Subsequent spectrogram features are aggregated via a transformer-based block (a modified DeFT-Mamba) that models dependencies along the time axis (T-Hybrid Mamba) and the frequency axis (F-Hybrid Mamba). The aggregator encodes complex relationships across channels, time, and frequency, supporting the model’s ability to disentangle and classify sources.
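The DeFT-Mamba blocks themselves are not reproduced here; the sketch below only illustrates the general pattern of alternating sequence modeling along the time and frequency axes, with standard Transformer encoder layers standing in for the T-/F-Hybrid Mamba modules.

```python
import torch
import torch.nn as nn


class TimeFreqAggregator(nn.Module):
    """Alternates sequence modeling along the time and frequency axes.

    Stand-in for DeFT-Mamba: Transformer layers replace the hybrid Mamba blocks.
    """

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.time_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.freq_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, freq, time, dim)
        b, f, t, d = x.shape
        # Time-axis modeling: treat each frequency bin as a batch element.
        x = self.time_block(x.reshape(b * f, t, d)).reshape(b, f, t, d)
        # Frequency-axis modeling: treat each time frame as a batch element.
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        x = self.freq_block(x).reshape(b, t, f, d).permute(0, 2, 1, 3)
        return x


agg = TimeFreqAggregator()
feats = torch.randn(1, 129, 50, 64)    # (batch, freq, time, dim)
print(agg(feats).shape)                # torch.Size([1, 129, 50, 64])
```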
Following aggregation, a 2D convolutional object separator generates semantically consistent object-centric features (one per source, plus noise). These representations allow downstream task decoders to process all object cues (waveform, SED, DoA) in an aligned fashion.
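A minimal sketch of such an object separator head, mapping aggregated features to N+1 object-centric feature maps (N sources plus a noise object); the layer configuration and the `num_objects` value are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class ObjectSeparator(nn.Module):
    """Maps aggregated features to per-object feature maps (sources + noise)."""

    def __init__(self, in_dim: int = 64, obj_dim: int = 64, num_objects: int = 5):
        super().__init__()
        self.num_objects = num_objects
        self.obj_dim = obj_dim
        self.proj = nn.Sequential(
            nn.Conv2d(in_dim, obj_dim * num_objects, kernel_size=3, padding=1),
            nn.GELU(),
            # Grouped convolution keeps the per-object channels separate.
            nn.Conv2d(obj_dim * num_objects, obj_dim * num_objects,
                      kernel_size=3, padding=1, groups=num_objects),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim, freq, time) -> (batch, num_objects, obj_dim, freq, time)
        b, _, f, t = x.shape
        y = self.proj(x)
        return y.view(b, self.num_objects, self.obj_dim, f, t)


sep = ObjectSeparator()
agg_feats = torch.randn(2, 64, 129, 50)    # (batch, dim, freq, time)
print(sep(agg_feats).shape)                # torch.Size([2, 5, 64, 129, 50])
```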
2. Object-Oriented Processing (OOP)
Traditional track-wise approaches often misalign parameter estimates across separated sources and subsequent task modules. DeepASA’s OOP strategy defines a single, early object separation, producing object-centric feature arrays. Each sub-decoder receives the same ordered features, ensuring that multi-task outputs (e.g., classification, detection, localization) correspond to the same physical entity. This holistic encapsulation eliminates association ambiguities, facilitating robust multi-task inference in polyphonic, nonstationary environments.
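The routing idea can be sketched as follows: every task head consumes the same ordered object features, so index n in every output refers to the same physical object. The decoder internals below are placeholders, not the paper's architectures.

```python
import torch
import torch.nn as nn


class MultiTaskHeads(nn.Module):
    """Placeholder decoders that all share one ordered set of object features."""

    def __init__(self, obj_dim: int = 64, num_classes: int = 13):
        # num_classes is an arbitrary example value.
        super().__init__()
        self.sed_head = nn.Linear(obj_dim, num_classes)   # per-frame class activity
        self.doa_head = nn.Linear(obj_dim, 3)             # per-frame (x, y, z) direction
        self.cls_head = nn.Linear(obj_dim, num_classes)   # clip-level identity

    def forward(self, obj_feats: torch.Tensor) -> dict:
        # obj_feats: (batch, num_objects, time, obj_dim) per-object frame embeddings.
        pooled = obj_feats.mean(dim=2)                    # clip-level summary per object
        return {
            "sed": self.sed_head(obj_feats).sigmoid(),    # (B, N, T, C)
            "doa": torch.tanh(self.doa_head(obj_feats)),  # (B, N, T, 3)
            "cls": self.cls_head(pooled),                 # (B, N, C)
        }


heads = MultiTaskHeads()
obj_feats = torch.randn(2, 5, 50, 64)
out = heads(obj_feats)
# Index n in every output refers to the same physical object: no post-hoc matching.
print(out["sed"].shape, out["doa"].shape, out["cls"].shape)
```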
A plausible implication is that stable object ordering throughout the pipeline reduces the need for permutation-invariant network designs and complex post-hoc track association procedures, reducing model complexity and potential sources of error.
3. Chain-of-Inference and Temporal Coherence Matching
Because objects are separated early, imperfections in the initial separation can propagate, leaving downstream estimates misaligned or unreliable. The chain-of-inference (CoI) mechanism addresses this with an iterative refinement phase based on temporal coherence matching (TCM). TCM employs a dual attention scheme,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d}} \right) V,$$

in which SED outputs form the queries and DoA estimates serve as keys/values (with the roles swapped in a reciprocal branch). This bidirectional multi-cue attention enables weak or unreliable estimates in one modality to be reinforced using complementary evidence from the other.
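A hedged sketch of the dual attention in TCM, with one branch attending from SED embeddings to DoA embeddings and a reciprocal branch doing the reverse; the embedding dimensions and the final fusion layer are assumptions for illustration.

```python
import torch
import torch.nn as nn


class TemporalCoherenceMatching(nn.Module):
    """Bidirectional cross-attention between SED and DoA cue embeddings."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.sed_to_doa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.doa_to_sed = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, sed: torch.Tensor, doa: torch.Tensor) -> torch.Tensor:
        # sed, doa: (batch*objects, time, dim) cue embeddings
        a, _ = self.sed_to_doa(query=sed, key=doa, value=doa)   # SED refined by DoA
        b, _ = self.doa_to_sed(query=doa, key=sed, value=sed)   # DoA refined by SED
        return self.fuse(torch.cat([a, b], dim=-1))             # fused temporal cue


tcm = TemporalCoherenceMatching()
sed_emb = torch.randn(10, 50, 64)   # (batch*objects, frames, dim)
doa_emb = torch.randn(10, 50, 64)
print(tcm(sed_emb, doa_emb).shape)  # torch.Size([10, 50, 64])
```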
Fused cues from TCM are interpolated to match the object feature length and injected via Feature-wise Linear Modulation (FiLM),

$$\tilde{z} = \gamma \odot z + \beta,$$

where the modulation parameters $\gamma$ and $\beta$ encode the fused information. This iterative process refines the object separation, improving feature consistency and decoding performance.
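A minimal FiLM sketch, assuming frame-wise fused cues that are linearly interpolated to the object feature length before producing $\gamma$ and $\beta$; the `FiLM` module and its dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift object features with fused cues."""

    def __init__(self, cue_dim: int = 64, feat_dim: int = 64):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cue_dim, 2 * feat_dim)

    def forward(self, feats: torch.Tensor, cues: torch.Tensor) -> torch.Tensor:
        # feats: (batch*objects, time, feat_dim); cues: (batch*objects, time, cue_dim)
        # Interpolate cues along time if their length differs from the feature length.
        if cues.shape[1] != feats.shape[1]:
            cues = F.interpolate(cues.transpose(1, 2), size=feats.shape[1],
                                 mode="linear", align_corners=False).transpose(1, 2)
        gamma, beta = self.to_gamma_beta(cues).chunk(2, dim=-1)
        return gamma * feats + beta


film = FiLM()
obj_feats = torch.randn(10, 200, 64)       # finer time resolution than the fused cues
fused_cues = torch.randn(10, 50, 64)
print(film(obj_feats, fused_cues).shape)   # torch.Size([10, 200, 64])
```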
4. Multi-Task Association and Decoding
Object features output by the separator are routed to dedicated decoders for:
- Source separation and dereverberation (waveform recovery)
- Sound event detection (classification and temporal activation)
- Direction-of-arrival estimation (spatial localization)
- Audio classification (source identification)
Each decoder operates over a consistent ordering, preserving associations between waveform, event, and location for the same object throughout inference. This parameter alignment underpins the model’s robustness to spatial and temporal complexity in real-world scenes.
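As an illustration of one such decoder, the sketch below recovers per-object waveforms by predicting complex spectral masks from object features and applying an inverse STFT to the masked mixture. Mask-based reconstruction is an assumed simplification here, not necessarily the paper's separation decoder.

```python
import torch
import torch.nn as nn


class WaveformDecoder(nn.Module):
    """Illustrative separation/dereverberation head: per-object complex masks + iSTFT."""

    def __init__(self, obj_dim: int = 64, n_fft: int = 512, hop: int = 128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # Predict real and imaginary mask components for each object.
        self.mask = nn.Conv2d(obj_dim, 2, kernel_size=1)

    def forward(self, obj_feats: torch.Tensor, mix_spec: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, N, obj_dim, F, T); mix_spec: (B, F, T) complex reference channel
        b, n, d, f, t = obj_feats.shape
        m = self.mask(obj_feats.reshape(b * n, d, f, t))            # (B*N, 2, F, T)
        mask = torch.complex(m[:, 0], m[:, 1])                      # (B*N, F, T)
        est = mask * mix_spec.repeat_interleave(n, dim=0)           # masked spectra
        wav = torch.istft(est, n_fft=self.n_fft, hop_length=self.hop,
                          window=torch.hann_window(self.n_fft))
        return wav.reshape(b, n, -1)                                # (B, N, samples)


dec = WaveformDecoder()
mix = torch.randn(1, 16000)                                          # reference channel
mix_spec = torch.stft(mix, 512, hop_length=128,
                      window=torch.hann_window(512), return_complex=True)
obj_feats = torch.randn(1, 5, 64, mix_spec.shape[1], mix_spec.shape[2])
print(dec(obj_feats, mix_spec).shape)                                # (1, 5, 16000)
```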
In summary, DeepASA’s “object-oriented one-for-all” approach is defined by early object separation, persistent feature alignment, and multi-cue refinement across all downstream auditory inference tasks.
5. Experimental Evaluation and Benchmark Analysis
DeepASA was evaluated on spatial audio benchmarks including ASA2, MC-FUSS, and STARSS23. Results indicate the model achieves state-of-the-art performance in simultaneous source separation, event detection, and DoA estimation:
- ASA2: SI-SDR improvement (SI-SDRi) of 11.2–12.0 dB and SELD score of 0.206.
- MC-FUSS: SI-SDRi of 18.5 dB, demonstrating effective handling of background noise.
- STARSS23: SELD score of 0.253, with reduced localization error and increased recall relative to previous top-performing models.
These outcomes substantiate the effectiveness of the OOP strategy and chain-of-inference. The explicit modeling of noise as an object in MC-FUSS facilitates enhanced scene separation in complex auditory environments.
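The SI-SDRi figures above are improvements in scale-invariant signal-to-distortion ratio over the unprocessed mixture; the following is a minimal reference implementation of the standard SI-SDR definition, not code from the paper.

```python
import torch


def si_sdr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for 1-D signals, batched along dim 0."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to obtain the scaled reference.
    alpha = (estimate * target).sum(dim=-1, keepdim=True) / (
        target.pow(2).sum(dim=-1, keepdim=True) + eps)
    projection = alpha * target
    noise = estimate - projection
    ratio = projection.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)


# SI-SDRi = SI-SDR(estimate, target) - SI-SDR(mixture, target)
target = torch.randn(1, 16000)
mixture = target + 0.5 * torch.randn(1, 16000)
estimate = target + 0.1 * torch.randn(1, 16000)
print((si_sdr(estimate, target) - si_sdr(mixture, target)).item())  # improvement in dB
```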
6. Context, Relationships, and Significance
DeepASA’s architectural and procedural innovations address core challenges in auditory scene analysis, notably parameter association ambiguity and robust generalization to diverse, multi-object scenes. By aligning all cue estimations for each object and introducing iterative multi-cue fusion, DeepASA improves upon traditional modular approaches.
This design suggests applicability to a range of domains beyond ASA where multi-modal, multi-object inference and consistent parameter association are critical. The dynamic temporal kernels and feature fusion strategies may inspire advancements in time-frequency representation and cross-modal learning.
7. Limitations and Prospective Directions
Early-stage object separation, while resolving association ambiguity, risks downstream task failure if separation quality is suboptimal. DeepASA’s chain-of-inference mitigates but does not eliminate this risk. Future directions include optimizing separation reliability, extending the architecture to accommodate broader input modalities, and refining multi-cue fusion for further robustness. The framework’s modularity may facilitate adaptation to new auditory domains and integration with emerging transformer-based sequence aggregation schemes.
DeepASA represents a comprehensive object-oriented solution for the multi-faceted problem of auditory scene analysis, achieving strong results across standard benchmarks via unified processing and iterative refinement approaches (Lee et al., 21 Sep 2025).