Denoising Diffusion SED

Updated 16 March 2026

Denoising diffusion SED is a probabilistic generative approach that iteratively refines noisy latent event queries to accurately estimate sound event boundaries.
It employs Denoising Diffusion Probabilistic Models and a Transformer-based decoder that jointly optimize classification and boundary localization without manual post-processing.
Empirical results highlight superior accuracy and a 40% reduction in training epochs compared to conventional and Transformer-based SED methods.

Denoising diffusion for Sound Event Detection (SED) encompasses a class of probabilistic generative methods that directly model the event boundary estimation process as iterative denoising of latent proposals or noisy event queries. Unlike conventional frame-level or discriminative event-level SED, these approaches formulate detection as conditional generative modeling, typically leveraging the formalism of Denoising Diffusion Probabilistic Models (DDPMs) to sequentially refine noisy candidate event descriptors into accurate predictions of event onsets, offsets, and labels. The principal instantiation in the audio domain is DiffSED, which introduces a dedicated event query diffusion scheme integrated with a Transformer decoder for robust sound event boundary generation and classification (Bhosale et al., 2023).

1. Mathematical Formulation of Diffusion for SED

The denoising diffusion process in SED operates on a matrix of event queries $z_0\in\mathbb{R}^{N\times D}$, where each row is a learnable, high-dimensional descriptor of a potential event ($N$ is the maximum number of events per audio clip and $D$ the embedding dimension). The forward (noising) process is a fixed Markov chain governed by a noise schedule $\{\beta_t\}_{t=1}^T\subset(0,1)$, typically a cosine or linear schedule chosen for stable training and expressivity:

Per-step coefficients: $\alpha_t=1-\beta_t$, $\bar\alpha_t=\prod_{i=1}^t\alpha_i$;
Single-step kernel: $q(z_t|z_{t-1})=\mathcal{N}(z_t;\sqrt{\alpha_t}\,z_{t-1},\beta_t I)$;
Closed form: $z_t=\sqrt{\bar\alpha_t}\,z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, $\epsilon\sim \mathcal{N}(0,I)$.
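The forward process can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not code from DiffSED; the function names, the linear schedule endpoints (`1e-4`, `0.02`), and the toy query matrix are assumptions made for the example.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Betas for steps t = 1..T (linear schedule; the endpoints are assumed values)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i), aligned with betas (index t-1)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(z0, t, alpha_bars, rng=random, eps=None):
    """Sample z_t from the closed form; z0 is an N x D list of lists, t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    if eps is None:
        eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in z0]
    z_t = [[math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
            for z, e in zip(z_row, e_row)]
           for z_row, e_row in zip(z0, eps)]
    return z_t, eps

betas = linear_beta_schedule(T=1000)
alpha_bars = cumulative_alphas(betas)
z0 = [[0.5, -0.5], [1.0, 0.0]]  # toy example: N = 2 queries, D = 2
z_t, eps = q_sample(z0, t=500, alpha_bars=alpha_bars, rng=random.Random(0))
```

Because $\bar\alpha_t$ decreases monotonically in $t$, later timesteps keep less of $z_0$ and more of the noise term.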

The reverse (denoising) process utilizes a neural network $\epsilon_\theta(z_t,t,A)$ , parameterized to predict the added noise, or equivalently, the clean proposal $\hat{z}_0$ , conditioned on both the noisy queries $z_t$ and an audio conditioning $A$ . The learned parameterization for the mean of the reverse kernel is:

$\mu_\theta(z_t,t,A) = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(z_t,t,A)\right) \,.$
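As a sanity check on this parameterization, the sketch below verifies numerically that, when the predicted noise equals the true noise, $\mu_\theta$ coincides with the mean of the standard DDPM posterior $q(z_{t-1}|z_t,z_0)$. The function names and the numeric values are illustrative assumptions, and the posterior-mean formula is the standard DDPM one rather than something stated in this article.

```python
import math

def reverse_mean(z_t, eps_pred, beta_t, alpha_bar_t):
    """mu_theta for one reverse step, element-wise over an N x D list of lists."""
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    return [[(z - coef * e) / math.sqrt(alpha_t) for z, e in zip(z_row, e_row)]
            for z_row, e_row in zip(z_t, eps_pred)]

def posterior_mean(z_t, z0, beta_t, alpha_bar_t, alpha_bar_prev):
    """Mean of q(z_{t-1} | z_t, z_0) in the standard DDPM form."""
    alpha_t = 1.0 - beta_t
    c0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    ct = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return [[c0 * x0 + ct * xt for x0, xt in zip(r0, rt)]
            for r0, rt in zip(z0, z_t)]

# Illustrative values (assumed): one query with D = 2.
beta_t, alpha_bar_prev = 0.02, 0.5
alpha_bar_t = alpha_bar_prev * (1.0 - beta_t)
z0 = [[0.3, -0.7]]
eps = [[1.2, -0.4]]
z_t = [[math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
        for x, e in zip(r0, re)]
       for r0, re in zip(z0, eps)]

mu_from_eps = reverse_mean(z_t, eps, beta_t, alpha_bar_t)
mu_posterior = posterior_mean(z_t, z0, beta_t, alpha_bar_t, alpha_bar_prev)
```

The two results agree up to floating-point rounding, which is why predicting the noise and predicting the clean proposal $\hat{z}_0$ are interchangeable parameterizations.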

The training objective is the noise-matching loss:

$L_t = \mathbb{E}_{z_0, \epsilon, t}\left[\|\epsilon - \epsilon_\theta(z_t, t, A)\|^2_2\right],$

which provides a tractable surrogate for the variational lower bound typical in DDPMs.
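A single-sample Monte Carlo estimate of this objective can be sketched as follows. The `denoiser(z_t, t, audio)` callable is a hypothetical stand-in for $\epsilon_\theta$, the four-step schedule is a toy assumption, and the two example denoisers exist only to show the extremes of the loss.

```python
import math
import random

def squared_l2(eps, eps_pred):
    """Squared L2 distance, summed over all entries of two N x D lists."""
    return sum((e - p) ** 2
               for e_row, p_row in zip(eps, eps_pred)
               for e, p in zip(e_row, p_row))

def noise_matching_loss(z0, audio, alpha_bars, denoiser, rng):
    """One-sample estimate: draw t and eps, corrupt z0, score the noise prediction."""
    t = rng.randint(1, len(alpha_bars))  # uniform over 1..T (both ends inclusive)
    a_bar = alpha_bars[t - 1]
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in z0]
    z_t = [[math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
            for z, e in zip(z_row, e_row)]
           for z_row, e_row in zip(z0, eps)]
    return squared_l2(eps, denoiser(z_t, t, audio))

alpha_bars = [0.9, 0.7, 0.4, 0.1]  # toy schedule with T = 4 (assumed)
z0 = [[0.5, -0.5], [1.0, 0.0]]

def zero_denoiser(z_t, t, audio):
    """Hypothetical untrained model that always predicts zero noise."""
    return [[0.0 for _ in row] for row in z_t]

def oracle_denoiser(z_t, t, audio):
    """Hypothetical perfect model: inverts the closed form using the known z0."""
    a_bar = alpha_bars[t - 1]
    return [[(zt - math.sqrt(a_bar) * z) / math.sqrt(1.0 - a_bar)
             for zt, z in zip(zt_row, z_row)]
            for zt_row, z_row in zip(z_t, z0)]

loss_zero = noise_matching_loss(z0, None, alpha_bars, zero_denoiser, random.Random(0))
loss_oracle = noise_matching_loss(z0, None, alpha_bars, oracle_denoiser, random.Random(0))
```

The zero predictor incurs a loss equal to $\|\epsilon\|_2^2$, while the oracle drives it to zero up to rounding. Practical implementations often average over entries and over a mini-batch instead of summing.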

2. Architecture: Latent Event Queries and Transformer Denoiser

DiffSED introduces learnable event queries as the primary diffusion targets. Each $z_0$ is initialized from $\mathcal{N}(0,I)$ and rescaled linearly to a bounded range $[-\mathrm{scale},\mathrm{scale}]$ via $z_0\leftarrow (2b-1)\cdot \mathrm{scale}$, where $b\in[0,1]$ is the normalized value being rescaled. The queries are corrupted at a randomly sampled timestep $t$, and the denoising task is to reconstruct the original $z_0$ vectors from $z_t$.
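The rescaling step is a simple affine map. The sketch below assumes that $b$ is a value already normalized to $[0,1]$ and uses `scale=2.0` purely as an illustrative default; both function names are hypothetical.

```python
def to_signal_range(b, scale=2.0):
    """Map values in [0, 1] to [-scale, scale] via (2b - 1) * scale."""
    for row in b:
        for v in row:
            if not 0.0 <= v <= 1.0:
                raise ValueError("expected values normalized to [0, 1]")
    return [[(2.0 * v - 1.0) * scale for v in row] for row in b]

def from_signal_range(z, scale=2.0):
    """Inverse map, clamped to [0, 1] because denoised values can overshoot."""
    return [[min(1.0, max(0.0, (v / scale + 1.0) / 2.0)) for v in row]
            for row in z]

b = [[0.0, 0.5, 1.0]]
z0 = to_signal_range(b, scale=2.0)
b_back = from_signal_range(z0, scale=2.0)
```

The endpoints 0 and 1 map to $-\mathrm{scale}$ and $+\mathrm{scale}$, and the midpoint 0.5 maps to 0, so the rescaled queries are centered like the Gaussian noise they are mixed with.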

The denoising network is a set-prediction Transformer decoder akin to DETR that takes the noisy queries, query time embeddings, and global audio conditioning features as input. The audio encoder processes a mel-spectrogram $A$: a CNN backbone (e.g., ResNet-50) extracts latent features, and a subsequent temporal Transformer captures long-range temporal dependencies, producing the conditioning features $C_a$. The decoder applies multi-head self-attention, cross-attention to the audio context $C_a$, and per-query output heads for both class probabilities (softmax over $C$ classes) and boundary localization (sigmoid regression of onset/offset).

3. Training, Matching, and Losses

Training proceeds by randomly corrupting the latent queries and passing the noisy queries $z_t$, the timestep $t$, and the audio condition $C_a$ through the decoder to predict events. The Hungarian algorithm then matches the $N$ predictions to the $M\le N$ ground-truth events based on a cost that typically combines class and boundary differences.
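The matching step can be illustrated on a toy clip. The sketch below does not implement the Hungarian algorithm itself: it searches all assignments by brute force, which yields the same minimum-cost matching but is only practical for very small $N$. The DETR-style cost (negative class probability plus L1 boundary distance), the weights, and the dictionary layout are assumptions for the example.

```python
from itertools import permutations

def pair_cost(pred, gt, w_cls=1.0, w_l1=1.0):
    """Assumed DETR-style cost: -p(gt class) plus L1 distance on (onset, offset)."""
    cls_cost = -pred["probs"][gt["label"]]
    l1_cost = abs(pred["onset"] - gt["onset"]) + abs(pred["offset"] - gt["offset"])
    return w_cls * cls_cost + w_l1 * l1_cost

def match(preds, gts):
    """Minimum-cost one-to-one assignment of ground truths to predictions.

    Brute-force stand-in for the Hungarian algorithm; returns (pred, gt) index pairs.
    """
    n, m = len(preds), len(gts)
    if m > n:
        raise ValueError("more ground-truth events than queries")
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(n), m):
        cost = sum(pair_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(i, j) for j, i in enumerate(best_perm)], best_cost

# Toy example: N = 3 predictions, M = 2 ground-truth events, times normalized to [0, 1].
preds = [
    {"probs": [0.1, 0.9], "onset": 0.10, "offset": 0.30},
    {"probs": [0.8, 0.2], "onset": 0.50, "offset": 0.90},
    {"probs": [0.5, 0.5], "onset": 0.00, "offset": 1.00},
]
gts = [
    {"label": 0, "onset": 0.55, "offset": 0.85},
    {"label": 1, "onset": 0.12, "offset": 0.28},
]
pairs, total_cost = match(preds, gts)
```

Here prediction 1 is assigned to ground-truth event 0 and prediction 0 to event 1, while prediction 2 is left unmatched. For realistic $N$, a polynomial-time solver would replace the permutation search.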

The compound detection loss per matched pair $(i\leftrightarrow j)$ contains:

Classification: $L_\mathrm{cls} = -\log p(y_j|i)$, the negative log-probability that prediction $i$ assigns to the ground-truth class $y_j$;
Boundary regression: $L_\mathrm{box} = \|[\Psi_j,\xi_j] - [\hat\Psi_i,\hat\xi_i]\|_1 + \lambda \,\ell_{GIoU}([\Psi_j,\xi_j], [\hat\Psi_i,\hat\xi_i])$ ;

where $[\Psi_j, \xi_j]$ and $y_j$ are the ground-truth boundary and class for event $j$, and $\lambda$ weights the GIoU term. The diffusion loss $L_\mathrm{diff}$ is typically not applied as a separate term: it is absorbed by supervising the decoder outputs to reconstruct $z_0$ directly.
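For one matched pair, the two loss terms can be computed as in the sketch below. It assumes the classification term is the negative log-probability of the ground-truth class, treats boundaries as 1D intervals (onset, offset), and uses the usual generalized IoU definition restricted to one dimension; the value `lam=1.0` and all function names are illustrative.

```python
import math

def giou_1d(a, b):
    """Generalized IoU of two 1D intervals given as (onset, offset)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    hull = max(a[1], b[1]) - min(a[0], b[0])  # smallest interval enclosing both
    if union <= 0.0 or hull <= 0.0:
        return 0.0  # degenerate zero-length intervals (assumed convention)
    return inter / union - (hull - union) / hull

def detection_loss(probs, pred_interval, gt_label, gt_interval, lam=1.0):
    """Return (L_cls, L_box) for a single matched prediction/ground-truth pair."""
    l_cls = -math.log(probs[gt_label])
    l_l1 = (abs(gt_interval[0] - pred_interval[0])
            + abs(gt_interval[1] - pred_interval[1]))
    l_giou = 1.0 - giou_1d(gt_interval, pred_interval)
    return l_cls, l_l1 + lam * l_giou

l_cls, l_box = detection_loss(
    probs=[0.25, 0.75], pred_interval=(0.10, 0.30),
    gt_label=1, gt_interval=(0.12, 0.28),
)
```

A perfect boundary prediction gives $L_\mathrm{box}=0$, and disjoint intervals are still penalized in proportion to the gap between them, which is the reason for using GIoU rather than plain IoU.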

4. Inference: Iterative Refinement without Post-Processing

At inference, the model initializes $z_T\sim \mathcal{N}(0,I)$, then applies the denoising network at a sequence of timesteps $t=T,T-s,\ldots,0$ (with $s$ as the sampling stride). Each denoising step refines the queries toward high-probability event hypotheses. DDIM-style update rules allow larger strides and thus fewer sampling steps. At completion, onset/offset and class predictions are returned directly; no non-maximum suppression or ad hoc post-processing is required (Bhosale et al., 2023).
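The strided sampling loop can be sketched with a deterministic DDIM update. This is a generic illustration under stated assumptions, not the DiffSED inference code: `denoiser(z_t, t, audio)` is a hypothetical noise predictor, the schedule is a toy one, and the oracle denoiser exists only to show that the loop recovers a known $z_0$.

```python
import math
import random

def ddim_sample(denoiser, audio, alpha_bars, n_queries, dim, stride, rng):
    """Deterministic DDIM sampling over timesteps T, T-s, ..., 0."""
    z = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_queries)]
    t = len(alpha_bars)
    while t > 0:
        t_prev = max(t - stride, 0)
        a_t = alpha_bars[t - 1]
        a_prev = alpha_bars[t_prev - 1] if t_prev > 0 else 1.0  # alpha_bar_0 = 1
        eps = denoiser(z, t, audio)
        # Estimate the clean queries, then re-noise them to level t_prev.
        z = [[math.sqrt(a_prev) * (zi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
              + math.sqrt(1.0 - a_prev) * ei
              for zi, ei in zip(z_row, e_row)]
             for z_row, e_row in zip(z, eps)]
        t = t_prev
    return z

def make_oracle_denoiser(z0_true, alpha_bars):
    """Hypothetical perfect noise predictor, used only to exercise the loop."""
    def denoiser(z_t, t, audio):
        a_t = alpha_bars[t - 1]
        return [[(zt - math.sqrt(a_t) * z0) / math.sqrt(1.0 - a_t)
                 for zt, z0 in zip(zt_row, z0_row)]
                for zt_row, z0_row in zip(z_t, z0_true)]
    return denoiser

alpha_bars = [0.9 ** (k + 1) for k in range(10)]  # toy schedule, T = 10
z0_true = [[0.5, -0.5], [1.0, 0.0]]
oracle = make_oracle_denoiser(z0_true, alpha_bars)
z_final = ddim_sample(oracle, None, alpha_bars, n_queries=2, dim=2,
                      stride=3, rng=random.Random(0))
```

With `stride=3` the loop visits $t=10,7,4,1$, so four network evaluations replace ten. In a trained model the refined queries would then be decoded into onset, offset, and class predictions.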

5. Empirical Results and Comparative Performance

DiffSED demonstrates consistently superior accuracy and convergence rates relative to discriminative baselines and Transformer-based SEDT methods on both synthetic and real-world SED benchmarks:

Results on the URBAN-SED test set (F1 scores):
Model	Event-F1	Segment-F1	Audio-F1
CRNN-CWin	36.75	65.74	74.19
Ctrans-CWin	34.36	64.73	74.05
SEDT (DETR-style)	37.27	65.21	74.37
DiffSED	43.89	69.24	77.87

On EPIC-Sounds, DiffSED achieves higher Top-1/Top-5 accuracy and mAP compared to ASF and SSAST (e.g., +3.1 pp Top-1, +0.023 mAP).

Training convergence is also faster: DiffSED converges at $\sim$115 epochs versus $\sim$200 for SEDT, a reduction of roughly $40\%$ in the number of training epochs ($(200-115)/200 = 42.5\%$ with these approximate figures). Ablation analyses confirm the benefit of query-space denoising, the effectiveness of single-step inference for audio tagging, and the robustness of the method across seeds and noise hyperparameters.

6. Methodological Advantages of Denoising Diffusion for SED

Diffusion-based SED confers several theoretical and practical strengths:

Boundary uncertainty is naturally addressed: The stochastic forward process injects variability into initial event proposals, while the denoising pathway systematically nudges proposals toward accurate boundaries, mitigating overconfident regression and local minima.
Global event interaction: The iterative process allows the decoder to refine the set of event hypotheses in a holistic, joint fashion.
Training stability: Ground-truth queries function as fixed denoising targets, facilitating robust assignment and stable convergence in the set-prediction framework.
Generative flexibility: The latent query formulation and the ability to trade off the number of queries versus refinement steps at test time permit dynamic control over latency and accuracy within a single trained model.

The denoising diffusion SED approach belongs to a broader trend of integrating diffusion generative modeling with structured detection or estimation tasks. For example, in seismic processing, diffusion methods are exploited for robust denoising by training directly on field noise distributions rather than idealized clean signals, enabling enhanced downstream event detection sensitivity, particularly in low-SNR regimes (Zhu et al., 3 Sep 2025). This suggests that tight coupling of diffusion-based denoising and detection/picking could yield further improvements in localization accuracy and recall in challenging acoustic or geophysical datasets.

Related lines of work in signal-dependent noise (e.g., multiplicative speckle), image denoising, and general bias-variance control in diffusion sampling have also motivated algorithmic and architectural innovations, such as homomorphic log-domain modeling (Guha et al., 2023) and kernel-smoothed score regularization (Gabriel et al., 28 May 2025), which address both domain-specific noise characteristics and generic overfitting phenomena.

Denoising diffusion SED frameworks thus represent a rigorous, generative alternative to classical SED pipelines, fusing probabilistic modeling with high-capacity neural architectures for state-of-the-art detection accuracy and efficiency in both synthetic and real audio environments (Bhosale et al., 2023).