Generalization of lower-parameterized diffusion-based speech enhancement models

Determine whether diffusion-based speech enhancement models that employ lower-parameterized neural networks—such as compact 2D convolutional UNet variants used as score models—can maintain the strong generalization to unseen noisy speech conditions that has been observed for larger-parameter score-based generative models (e.g., SGMSE/SGMSE+).

Background

The paper introduces the Diffusion Buffer, an online generative diffusion-based speech enhancement approach that reduces computational demands by aligning physical time with diffusion time-steps and performing only one neural network call per incoming frame. To run on consumer-grade GPUs, the authors reduce the parameter count of the underlying 2D convolutional UNet.
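The core buffering idea can be illustrated with a minimal sketch. All names, sizes, and the toy denoiser below are assumptions for illustration only; the paper's actual score model is a trained 2D convolutional UNet, which the placeholder function does not represent. Each incoming frame enters the buffer at the noisiest diffusion step, a single denoising call advances every buffered frame by one step, and the oldest frame leaves fully denoised after a fixed latency of `depth` frames.

```python
import numpy as np

FRAME_LEN = 256  # samples per frame (assumed value, for illustration)

class DiffusionBuffer:
    """Hypothetical sketch: physical frame time is aligned with diffusion
    time-steps, so one denoising call is made per incoming frame."""

    def __init__(self, denoise_step, depth=8):
        self.denoise_step = denoise_step  # stand-in for the score model call
        self.depth = depth                # buffer length = diffusion steps = latency in frames
        self.frames = []                  # frames[i] has undergone i+1 denoising steps

    def push(self, noisy_frame):
        # Newest frame enters at the noisiest diffusion step.
        self.frames.insert(0, noisy_frame)
        # One call processes the whole buffer; every frame advances one step.
        self.frames = self.denoise_step(self.frames)
        if len(self.frames) == self.depth:
            return self.frames.pop()      # oldest frame exits fully denoised
        return None                       # buffer still filling: no output yet

def toy_denoise_step(frames):
    # Placeholder "network": shrink each frame toward zero to mimic one
    # reverse-diffusion step. NOT the paper's trained score model.
    return [0.5 * f for f in frames]
```

In this sketch a frame pushed at time `t` is returned at time `t + depth - 1`, having passed through exactly `depth` denoising steps, which mirrors the fixed algorithmic latency of the buffer.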

Prior work has shown that large-parameter diffusion models (e.g., SGMSE/SGMSE+) can generalize well to unseen noisy speech conditions. However, whether this favorable generalization persists when the score network is substantially smaller remains an open question. Resolving it is crucial for designing online-capable generative systems that retain robustness to mismatched noise while meeting real-time constraints on consumer hardware.

References

However, it remains unclear if a lower-parameterized NN still results in the good generalization to unseen data reported in [richter_sgmse].

Diffusion Buffer for Online Generative Speech Enhancement (2510.18744 - Lay et al., 21 Oct 2025) in Introduction (Section 1), Footnote after the 'Video and code' link