
Denoising Diffusion SED

Updated 16 March 2026
  • Denoising diffusion SED is a probabilistic generative approach that iteratively refines noisy latent event queries to accurately estimate sound event boundaries.
  • It employs Denoising Diffusion Probabilistic Models and a Transformer-based decoder that jointly optimize classification and boundary localization without manual post-processing.
  • Reported results show higher accuracy than conventional and Transformer-based SED methods, with convergence in roughly 40% fewer training epochs.

Denoising diffusion for Sound Event Detection (SED) encompasses a class of probabilistic generative methods that directly model the event boundary estimation process as iterative denoising of latent proposals or noisy event queries. Unlike conventional frame-level or discriminative event-level SED, these approaches formulate detection as conditional generative modeling, typically leveraging the formalism of Denoising Diffusion Probabilistic Models (DDPMs) to sequentially refine noisy candidate event descriptors into accurate predictions of event onsets, offsets, and labels. The principal instantiation in the audio domain is DiffSED, which introduces a dedicated event query diffusion scheme integrated with a Transformer decoder for robust sound event boundary generation and classification (Bhosale et al., 2023).

1. Mathematical Formulation of Diffusion for SED

The denoising diffusion process in SED operates on a matrix of event queries $z_0\in\mathbb{R}^{N\times D}$, where each row represents a learnable, high-dimensional descriptor for a potential event (with $N$ the maximum number of events per audio clip and $D$ the embedding dimension). The forward (noising) process applies a fixed stochastic progression governed by a noise schedule $\{\beta_t\}_{t=1}^T\subset(0,1)$, typically chosen (e.g., cosine or linear schedule) for stable training and expressivity:

  • $\alpha_t=1-\beta_t$, $\bar\alpha_t=\prod_{i=1}^t\alpha_i$;
  • At each step, $q(z_t|z_{t-1})=\mathcal{N}(z_t;\sqrt{\alpha_t}\,z_{t-1},\beta_t I)$;
  • Closed form: $z_t=\sqrt{\bar\alpha_t}\,z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, $\epsilon\sim \mathcal{N}(0,I)$.
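The forward process above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not DiffSED's implementation: the linear schedule endpoints (`1e-4`, `0.02`), `T=1000`, and the toy query matrix are assumptions chosen for the example, and the helper names are hypothetical.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas in (0, 1); the endpoints are illustrative defaults."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{i<=t} (1 - beta_i) for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(z0, t, alpha_bars, rng):
    """Draw z_t from the closed form; t is 1-indexed. Returns (z_t, eps)."""
    a_bar = alpha_bars[t - 1]
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in z0]
    zt = [[math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
           for z, e in zip(row, eps_row)]
          for row, eps_row in zip(z0, eps)]
    return zt, eps

rng = random.Random(0)
betas = linear_beta_schedule(T=1000)
alpha_bars = cumulative_alphas(betas)
z0 = [[0.5, -0.5], [1.0, 0.0], [-1.0, 1.0]]  # N=3 queries, D=2 (toy sizes)
zT, _ = q_sample(z0, t=1000, alpha_bars=alpha_bars, rng=rng)
```

Because $\bar\alpha_t$ shrinks monotonically toward zero, the sample at $t=T$ is dominated by the noise term, which is why inference can start from pure Gaussian noise.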

The reverse (denoising) process uses a neural network $\epsilon_\theta(z_t,t,A)$, parameterized to predict the added noise or, equivalently, the clean proposal $\hat{z}_0$, conditioned on both the noisy queries $z_t$ and the audio conditioning $A$. The learned parameterization for the mean of the reverse kernel is:

$$\mu_\theta(z_t,t,A) = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(z_t,t,A)\right).$$
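As a quick numerical check of this formula, the sketch below evaluates the reverse mean for a single flat query vector. The function name and the toy values are hypothetical. At $t=1$, where $\bar\alpha_1=\alpha_1$, plugging in the true noise should return $z_0$ exactly, which makes a convenient sanity test.

```python
import math

def reverse_mean(zt, eps_hat, alpha_t, alpha_bar_t):
    """Mean of the reverse kernel for one flat query vector."""
    beta_t = 1.0 - alpha_t
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    return [(z - coef * e) / math.sqrt(alpha_t) for z, e in zip(zt, eps_hat)]

# Sanity check at t = 1 (alpha_bar_1 == alpha_1): using the true noise,
# the reverse mean recovers z_0.
alpha_1 = 1.0 - 0.02           # illustrative beta_1 = 0.02
z0 = [0.3, -1.2, 0.7]
eps = [0.5, -0.1, 1.4]         # stands in for a draw from N(0, I)
z1 = [math.sqrt(alpha_1) * z + math.sqrt(1.0 - alpha_1) * e
      for z, e in zip(z0, eps)]
mu = reverse_mean(z1, eps, alpha_t=alpha_1, alpha_bar_t=alpha_1)
```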

The training objective is the noise-matching loss:

$$L_t = \mathbb{E}_{z_0, \epsilon, t}\left[\|\epsilon - \epsilon_\theta(z_t, t, A)\|^2_2\right],$$

which provides a tractable surrogate for the variational lower bound typical in DDPMs.
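A one-sample estimate of this loss can be sketched as follows. The `eps_model` callable is a hypothetical stand-in for $\epsilon_\theta$ (it ignores $t$ and $A$ for brevity), and the two toy predictors are included only to show the extremes: an oracle that knows $z_0$ drives the loss to zero, while a predictor that always outputs zero pays the full squared norm of the noise.

```python
import math
import random

def noise_matching_loss(z0, alpha_bar_t, eps_model, rng):
    """One-sample estimate of ||eps - eps_theta(z_t)||^2 for a flat vector z0."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(alpha_bar_t) * z + math.sqrt(1.0 - alpha_bar_t) * e
          for z, e in zip(z0, eps)]
    eps_hat = eps_model(zt)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat))

z0 = [0.2, -0.4, 0.9, 0.0]
a_bar = 0.5  # illustrative value of alpha_bar_t

def oracle(zt):
    """Inverts the closed-form forward step using the known z0."""
    return [(z - math.sqrt(a_bar) * x) / math.sqrt(1.0 - a_bar)
            for z, x in zip(zt, z0)]

def zero_predictor(zt):
    return [0.0] * len(zt)

loss_oracle = noise_matching_loss(z0, a_bar, oracle, random.Random(0))
loss_zero = noise_matching_loss(z0, a_bar, zero_predictor, random.Random(0))
```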

2. Architecture: Latent Event Queries and Transformer Denoiser

DiffSED introduces learnable event queries as the primary diffusion targets. Each $z_0$ is initialized from $\mathcal{N}(0,I)$ and rescaled linearly to a bounded range $[-\mathrm{scale},\mathrm{scale}]$ via $z_0\leftarrow (2b-1)\cdot \mathrm{scale}$, a map that sends $b\in[0,1]$ onto that range. The queries are then corrupted at a randomly sampled timestep $t$, and the denoising task is to reconstruct the original $z_0$ vectors from $z_t$.
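The rescaling step can be illustrated with a small pair of helpers. The text does not define $b$, so this sketch assumes it is a value already normalized to $[0,1]$; the default `scale=2.0` and both function names are hypothetical.

```python
def to_signal_range(b, scale=2.0):
    """Map b in [0, 1] to [-scale, scale] via (2b - 1) * scale."""
    return (2.0 * b - 1.0) * scale

def from_signal_range(z, scale=2.0):
    """Inverse map; clamps z to [-scale, scale] before inverting."""
    z = max(-scale, min(scale, z))
    return (z / scale + 1.0) / 2.0

low, mid, high = (to_signal_range(b) for b in (0.0, 0.5, 1.0))
```

Clamping in the inverse map is one common way to keep a noisy value inside the valid range before reading it back; whether DiffSED does this is not stated here.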

The denoising network is realized as a set-prediction Transformer decoder akin to DETR, ingesting the noisy queries, query time embeddings, and global audio conditioning features. The audio encoder takes a mel-spectrogram $A$, extracts latent features with a CNN (e.g., ResNet-50), and passes them through a temporal Transformer that captures long-range dependencies, producing the audio context $C_a$. The decoder applies multi-head self-attention, cross-attention to $C_a$, and per-query output heads for both class probabilities (softmax over $C$ classes) and boundary localization (sigmoid regression of onset/offset).
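The per-query output heads are the simplest part of this architecture to illustrate. The sketch below covers only the final activations, applied to raw scores that a decoder would produce; the function names and example logits are hypothetical, and it assumes onset/offset are expressed as fractions of the clip duration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_query(class_logits, boundary_logits):
    """Turn one query's raw head outputs into (label, probs, (onset, offset))."""
    probs = softmax(class_logits)
    onset, offset = (sigmoid(v) for v in boundary_logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs, (onset, offset)

label, probs, (onset, offset) = decode_query([0.1, 2.0, -1.0], [-1.0, 1.0])
```

Note that nothing in these activations forces the onset to precede the offset; that ordering has to be learned from the boundary loss.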

3. Training, Matching, and Losses

Training proceeds by randomly corrupting the latent queries, then passing the noisy queries and timestep ($z_t$, $t$) together with the audio condition $C_a$ through the decoder to generate event predictions. The Hungarian algorithm matches the $N$ predictions to the $M$ ground-truth events based on a cost that typically incorporates both class and boundary differences.
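The matching step can be sketched with a brute-force search, which returns the same optimum as the Hungarian algorithm on small inputs but scales factorially; in practice a solver such as SciPy's `linear_sum_assignment` would be the usual choice. The cost below (negative class probability plus L1 boundary distance, equal weights) is an assumed, DETR-like form, not necessarily the one used by DiffSED, and the dictionary layout is hypothetical.

```python
from itertools import permutations

def pair_cost(pred, gt, w_cls=1.0, w_l1=1.0):
    """Cost of assigning one prediction to one ground-truth event (weights illustrative)."""
    cls_cost = -pred["probs"][gt["label"]]
    l1_cost = abs(pred["onset"] - gt["onset"]) + abs(pred["offset"] - gt["offset"])
    return w_cls * cls_cost + w_l1 * l1_cost

def match(preds, gts):
    """Exhaustive minimum-cost one-to-one matching; requires len(preds) >= len(gts)."""
    best_cost, best = float("inf"), ()
    for pred_ids in permutations(range(len(preds)), len(gts)):
        cost = sum(pair_cost(preds[i], gts[j]) for j, i in enumerate(pred_ids))
        if cost < best_cost:
            best_cost, best = cost, pred_ids
    return [(i, j) for j, i in enumerate(best)]  # (prediction index, gt index)

preds = [
    {"probs": [0.1, 0.9], "onset": 0.50, "offset": 0.80},
    {"probs": [0.8, 0.2], "onset": 0.10, "offset": 0.30},
    {"probs": [0.5, 0.5], "onset": 0.90, "offset": 0.95},
]
gts = [
    {"label": 0, "onset": 0.10, "offset": 0.35},
    {"label": 1, "onset": 0.55, "offset": 0.80},
]
pairs = match(preds, gts)
```

Here prediction 1 is matched to ground-truth event 0 and prediction 0 to event 1; the third prediction stays unmatched and would be treated as background.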

The compound detection loss per matched pair $(i\leftrightarrow j)$ contains:

  • Classification: $L_\mathrm{cls} = -y_j \log p(y_j|i)$;
  • Boundary regression: $L_\mathrm{box} = \|[\Psi_j,\xi_j] - [\hat\Psi_i,\hat\xi_i]\|_1 + \lambda \,\ell_{GIoU}([\Psi_j,\xi_j], [\hat\Psi_i,\hat\xi_i])$;

where $[\Psi_j, \xi_j]$ and $y_j$ are the ground-truth boundary and class for event $j$, and $[\hat\Psi_i,\hat\xi_i]$ is the boundary predicted by query $i$. The diffusion loss $L_\mathrm{diff}$ is typically absorbed by supervising decoder outputs to reconstruct $z_0$ directly.
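For one-dimensional (onset, offset) intervals, the boundary term can be sketched as below. It assumes the common definition $\ell_{GIoU} = 1 - \mathrm{GIoU}$, where GIoU subtracts from the IoU the fraction of the smallest enclosing interval not covered by the union; $\lambda=1$ and the function names are illustrative choices.

```python
def giou_1d(a, b):
    """Generalized IoU of two intervals given as (onset, offset)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    hull = max(a[1], b[1]) - min(a[0], b[0])  # smallest enclosing interval
    iou = inter / union if union > 0 else 0.0
    if hull <= 0:
        return iou
    return iou - (hull - union) / hull

def boundary_loss(pred, gt, lam=1.0):
    """L1 distance on (onset, offset) plus lam * (1 - GIoU)."""
    l1 = abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])
    return l1 + lam * (1.0 - giou_1d(pred, gt))

perfect = boundary_loss((0.2, 0.5), (0.2, 0.5))
disjoint_giou = giou_1d((0.0, 1.0), (2.0, 3.0))
```

Unlike plain IoU, the GIoU term still varies when the intervals do not overlap (it is $-1/3$ in the disjoint example), so it can provide a training signal for far-off proposals.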

4. Inference: Iterative Refinement without Post-Processing

At inference, the model initializes $z_T\sim \mathcal{N}(0,I)$, then applies the denoising network at a sequence of timesteps $t=T,T-s,\ldots,0$ (with $s$ as the sampling stride). The denoising steps refine the queries toward high-probability event hypotheses. In the DDIM setting, efficient update rules further reduce sampling steps. At completion, onset/offset and class predictions are returned; no non-maximum suppression or ad hoc post-processing is required (Bhosale et al., 2023).
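The strided refinement loop can be sketched with the deterministic DDIM update, assuming a denoiser that predicts the clean queries $\hat{z}_0$. The `denoiser(z, t)` interface, the 50-step linear schedule, and the oracle used in the example are hypothetical; a real model would also receive the audio context.

```python
import math
import random

def ddim_sample(denoiser, alpha_bars, stride, shape, rng):
    """Deterministic DDIM-style refinement from z_T ~ N(0, I) down to t = 0.

    denoiser(z_t, t) returns a prediction of the clean queries z_0_hat;
    alpha_bars[t - 1] holds alpha_bar_t for t = 1..T.
    """
    T = len(alpha_bars)
    n, d = shape
    z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    t = T
    while t > 0:
        t_next = max(t - stride, 0)
        a_t = alpha_bars[t - 1]
        a_next = alpha_bars[t_next - 1] if t_next > 0 else 1.0
        z0_hat = denoiser(z, t)
        # Recover the implied noise, then re-noise z0_hat to level t_next.
        z = [[math.sqrt(a_next) * x0
              + math.sqrt(1.0 - a_next) * (zi - math.sqrt(a_t) * x0) / math.sqrt(1.0 - a_t)
              for zi, x0 in zip(z_row, x0_row)]
             for z_row, x0_row in zip(z, z0_hat)]
        t = t_next
    return z

# Toy schedule (T = 50) and an oracle denoiser that always returns the target.
alpha_bars, running = [], 1.0
for i in range(50):
    running *= 1.0 - (1e-4 + i * (0.02 - 1e-4) / 49)
    alpha_bars.append(running)

target = [[0.5, -0.5], [1.0, 0.25]]
calls = []

def oracle(z, t):
    calls.append(t)
    return target

result = ddim_sample(oracle, alpha_bars, stride=10, shape=(2, 2), rng=random.Random(0))
```

With `stride=10` the denoiser is called only at $t=50,40,30,20,10$, which is the latency lever the stride provides; the final step lands exactly on the last $\hat{z}_0$ because $\bar\alpha_0$ is taken to be 1.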

5. Empirical Results and Comparative Performance

DiffSED demonstrates consistently superior accuracy and convergence rates relative to discriminative baselines and Transformer-based SEDT methods on both synthetic and real-world SED benchmarks:

Results on the URBAN-SED test set:

Model               Event-F1   Seg-F1   Audio-F1
CRNN-CWin           36.75      65.74    74.19
Ctrans-CWin         34.36      64.73    74.05
SEDT (DETR-style)   37.27      65.21    74.37
DiffSED             43.89      69.24    77.87
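To make the margins concrete, the short script below computes DiffSED's absolute gain over the strongest baseline for each metric, using only the URBAN-SED figures reported above. The variable names are illustrative.

```python
urban_sed = {
    "CRNN-CWin": (36.75, 65.74, 74.19),
    "Ctrans-CWin": (34.36, 64.73, 74.05),
    "SEDT": (37.27, 65.21, 74.37),
    "DiffSED": (43.89, 69.24, 77.87),
}
metrics = ("Event-F1", "Seg-F1", "Audio-F1")

gains = {}
for i, metric in enumerate(metrics):
    # Strongest baseline can differ per metric, so take the max per column.
    best_baseline = max(v[i] for k, v in urban_sed.items() if k != "DiffSED")
    gains[metric] = round(urban_sed["DiffSED"][i] - best_baseline, 2)
```

The gains come out to 6.62 points on Event-F1 and 3.50 points on both Seg-F1 and Audio-F1, so the largest improvement is on the event-level metric, which is the one most sensitive to boundary accuracy.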

On EPIC-Sounds, DiffSED achieves higher Top-1/Top-5 accuracy and mAP compared to ASF and SSAST (e.g., +3.1 pp Top-1, +0.023 mAP).

Training also converges faster: DiffSED reaches its best performance at $\sim$115 epochs versus $\sim$200 for SEDT, a reduction of roughly 40% in training epochs. Ablation analyses confirm the benefit of query-space denoising, the effectiveness of single-step inference for audio tagging, and the robustness of the method across seeds and noise hyperparameters.
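The epoch saving follows directly from the two approximate epoch counts quoted above:

```python
diffsed_epochs, sedt_epochs = 115, 200  # approximate figures from the text
reduction = (sedt_epochs - diffsed_epochs) / sedt_epochs
print(f"{reduction:.1%}")  # prints "42.5%"
```

Since both counts are approximate, this is consistent with the rounded figure of about 40%.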

6. Methodological Advantages of Denoising Diffusion for SED

Diffusion-based SED confers several theoretical and practical strengths:

  • Boundary uncertainty is naturally addressed: The stochastic forward process injects variability into initial event proposals, while the denoising pathway systematically nudges proposals toward accurate boundaries, mitigating overconfident regression and local minima.
  • Global event interaction: The iterative process allows the decoder to refine the set of event hypotheses in a holistic, joint fashion.
  • Training stability: Ground-truth queries function as fixed denoising targets, facilitating robust assignment and stable convergence in the set-prediction framework.
  • Generative flexibility: The latent query formulation and the ability to trade off the number of queries versus refinement steps at test time permit dynamic control over latency and accuracy within a single trained model.

The denoising diffusion SED approach belongs to a broader trend of integrating diffusion generative modeling with structured detection or estimation tasks. For example, in seismic processing, diffusion methods are exploited for robust denoising by training directly on field noise distributions rather than idealized clean signals, enabling enhanced downstream event detection sensitivity, particularly in low-SNR regimes (Zhu et al., 3 Sep 2025). This suggests that tight coupling of diffusion-based denoising and detection/picking could yield further improvements in localization accuracy and recall in challenging acoustic or geophysical datasets.

Related lines of work in signal-dependent noise (e.g., multiplicative speckle), image denoising, and general bias-variance control in diffusion sampling have also motivated algorithmic and architectural innovations, such as homomorphic log-domain modeling (Guha et al., 2023) and kernel-smoothed score regularization (Gabriel et al., 28 May 2025), which address both domain-specific noise characteristics and generic overfitting phenomena.


Denoising diffusion SED frameworks thus represent a rigorous, generative alternative to classical SED pipelines, fusing probabilistic modeling with high-capacity neural architectures for state-of-the-art detection accuracy and efficiency in both synthetic and real audio environments (Bhosale et al., 2023).
