
Generative Speech Enhancement Models

Updated 21 January 2026
  • Generative Speech Enhancement (GSE) models are deep generative frameworks that reconstruct clean speech from degraded signals using learned latent representations.
  • They employ diverse architectures including embedding-centric pipelines, GANs, diffusion models, flow matching, and language model-driven tokenization to enhance audio quality.
  • GSE systems achieve higher perceptual quality than discriminative baselines, and can reach real-time efficiency, by integrating adversarial, perceptual, and confidence-based loss functions in modular, scalable designs.

Generative Speech Enhancement (GSE) models define a class of speech enhancement systems that leverage deep generative modeling to transform degraded speech signals into perceptually clean, high-fidelity speech. In contrast to traditional mask-based or discriminative regression approaches, GSE systems operate by learning powerful data-driven priors over clean speech, frequently acting in latent, time-frequency, or tokenized domains, and often synthesizing output by re-generating entire waveforms or feature representations. This class spans architectures based on pre-trained generative audio encoders, GANs, flow-matching, diffusion, token-driven LLMs, and hybrid formulations.

1. Architectural Paradigms in Generative Speech Enhancement

GSE systems instantiate diverse architectural blueprints, unified by their generative reconstruction principle. Notable design paradigms include:

a) Embedding-centric Two-Stage Pipelines: A pre-trained generative audio encoder extracts dense latent representations from a noisy input. A light denoising network refines these embeddings, which are then decoded into waveforms by a pre-trained neural vocoder. This modular approach, exemplified by the Dasheng audioencoder + transformer-based denoiser + Vocos vocoder pipeline, decouples invariant feature extraction, denoising, and synthesis into distinct, efficiently swappable components (Sun et al., 13 Jun 2025).
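The two-stage flow above can be sketched as a minimal pipeline. All three components below are hypothetical stand-ins (a reshaping "encoder", a moving-average "denoiser", a flattening "vocoder"), not the actual Dasheng, transformer, or Vocos models; the sketch only shows how the stages compose and remain independently swappable.

```python
import numpy as np

def frozen_encoder(waveform: np.ndarray, dim: int = 8) -> np.ndarray:
    """Stand-in for a pre-trained generative audio encoder:
    maps a waveform to a (frames, dim) latent embedding."""
    frames = len(waveform) // dim
    return waveform[: frames * dim].reshape(frames, dim)

def light_denoiser(z_noisy: np.ndarray) -> np.ndarray:
    """Stand-in for the light denoising network that refines embeddings.
    Here: a simple moving average along the frame axis."""
    kernel = np.ones(3) / 3.0
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, z_noisy)

def vocoder(z: np.ndarray) -> np.ndarray:
    """Stand-in for a neural vocoder: maps embeddings back to a waveform."""
    return z.reshape(-1)

rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.3 * rng.standard_normal(64)
enhanced = vocoder(light_denoiser(frozen_encoder(noisy)))
print(enhanced.shape)  # (64,)
```

Because each stage only exchanges arrays of a fixed shape, any component can be replaced (e.g., a different vocoder) without retraining the others, which is the modularity argument made above.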

b) Generative Adversarial Networks (GANs) and Hybrid GAN Architectures: GAN-based GSEs, such as EffiFusion-GAN and SEFGAN, employ invertible flows, efficient convolutional stacks, or depthwise-separable architectures as generators, with multi-scale adversarial discriminators supervising perceptual fidelity. Hybrid frameworks may combine maximum likelihood estimation with adversarial losses to optimize both density modeling and sample quality (Wen et al., 20 Aug 2025, Strauss et al., 2023).

c) Score-based Diffusion and Schrödinger Bridge Models: These models learn to reverse a stochastic process that corrupts clean speech to noise, iteratively "denoising" towards high-quality speech. Pioneering works include SGMSE+ and SB-UFOGen, with design choices in noise schedules, ODE/SDE discretizations, and GAN-augmented few-step refinement for real-time inference (Richter et al., 2022, Han et al., 2 Jun 2025).
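The iterative reverse process can be illustrated on a toy problem where the score is known in closed form. The analytic score of a Gaussian around a single known clean frame replaces a trained score network here, and the geometric noise schedule and Euler step rule are illustrative assumptions, not the SGMSE+ or SB-UFOGen recipes.

```python
import numpy as np

rng = np.random.default_rng(0)
x_clean = np.sin(np.linspace(0, 2 * np.pi, 32))  # toy "clean speech" frame
sigmas = np.geomspace(1.0, 0.01, num=30)         # decreasing noise schedule

def score(x_t: np.ndarray, sigma: float) -> np.ndarray:
    # Analytic score of N(x_clean, sigma^2 I); a trained network would
    # approximate this quantity from data instead.
    return (x_clean - x_t) / sigma**2

x = x_clean + sigmas[0] * rng.standard_normal(32)  # start near pure noise
for i in range(len(sigmas) - 1):
    step = sigmas[i] ** 2 - sigmas[i + 1] ** 2     # annealed step size
    x = x + 0.5 * step * score(x, sigmas[i])       # probability-flow Euler step

residual = float(np.max(np.abs(x - x_clean)))
print(residual)  # small: the iterates converge towards x_clean
```

Each reverse step shrinks the distance to the clean signal; design choices named above (noise schedule, ODE vs. SDE discretization, number of steps) control exactly this convergence/latency trade-off.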

d) Flow Matching and Mean Flows: FlowSE and MeanSE replace stochastic iterative denoising with deterministic transformations from noise to speech in a continuous or averaged (mean flow) velocity field. MeanSE achieves state-of-the-art one-step enhancement and notably superior generalization under distribution shift (Wang et al., 26 May 2025, Wang et al., 25 Sep 2025).
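The one-step property of mean flows can be seen on an idealized straight path between noise and signal: the average velocity over the whole interval is constant, so a single Euler step recovers the target. The closed-form velocity below stands in for a trained MeanSE-style network; this is a sketch of the principle, not the published model.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = np.cos(np.linspace(0, 4 * np.pi, 16))  # clean target frame
x0 = rng.standard_normal(16)                # noise sample

def mean_velocity(x_start: np.ndarray, t0: float, t1: float) -> np.ndarray:
    # Ground-truth average velocity on the straight path x_t = (1-t)x0 + t*x1.
    # A mean-flow network is trained to approximate this interval average.
    return x1 - x0

# Single-NFE inference: one Euler step over the full interval [0, 1].
x_hat = x0 + (1.0 - 0.0) * mean_velocity(x0, 0.0, 1.0)
print(float(np.max(np.abs(x_hat - x1))))  # ~0 on this idealized path
```

A standard flow-matching model predicts the instantaneous velocity and needs many small steps; predicting the interval-averaged velocity is what collapses inference to one network call.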

e) Discrete-Token and Language-Model–Driven Approaches: Hierarchical tokenization (residual vector quantization) and conditional autoregressive language modeling (e.g., GenSE, LLaSE-G1, OmniGSE) offer semantically informed, scalable GSE solutions, reconstructing both high-level linguistic content and fine acoustic detail via multi-stage token prediction and neural codecs (Yao et al., 5 Feb 2025, Kang et al., 1 Mar 2025, Mu et al., 25 Jul 2025).
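The next-token objective used by these token-driven models is ordinary cross-entropy over discrete token ids. The tiny vocabulary and logits below are toy values; real systems predict thousands of codec tokens per level.

```python
import numpy as np

def next_token_ce(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy of (steps, vocab) logits against integer token ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

logits = np.array([[2.0, 0.1, -1.0],   # step 1: model favors token 0
                   [0.0, 3.0, 0.5]])   # step 2: model favors token 1
targets = np.array([0, 1])             # ground-truth token ids
ce = next_token_ce(logits, targets)
print(round(ce, 3))  # 0.153
```

Hierarchical models apply this same loss per codebook level, factorizing the joint likelihood over semantic tokens first and acoustic residual tokens after.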

f) Specialized Edge and Online Architectures: For low-latency and edge processing, WSR-MGAN leverages efficient U-Net backbones integrated with attention and metric-driven objectives. Diffusion Buffer architectures align diffusion time with streaming signal arrival to enable truly online generative enhancement with tunable latency (Pal et al., 2024, Lay et al., 21 Oct 2025).
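The buffer-alignment idea can be sketched as follows: frames occupy positions in a fixed-length buffer, a frame's position determines its diffusion time, and a frame is emitted once it has received every reverse step. The identity `denoise_step` and the buffer length are placeholders, not the published Diffusion Buffer network or settings.

```python
from collections import deque

BUFFER_LEN = 4  # number of reverse steps == algorithmic latency in frames

def denoise_step(frame, step):
    # Placeholder for one reverse diffusion step at diffusion time `step`.
    return frame

def stream(frames):
    buffer = deque(maxlen=BUFFER_LEN)
    outputs = []
    for f in frames:
        buffer.append(f)  # newest frame enters at the noisiest position
        # Advance every buffered frame by one reverse step; the oldest
        # frame is then fully denoised and can be emitted.
        buffer = deque((denoise_step(x, i) for i, x in enumerate(buffer)),
                       maxlen=BUFFER_LEN)
        if len(buffer) == BUFFER_LEN:
            outputs.append(buffer[0])
    return outputs

out = stream(list(range(10)))
print(len(out))  # 7: the first BUFFER_LEN - 1 frames are still in flight
```

Shrinking the buffer lowers latency but leaves fewer reverse steps per frame, which is the tunable latency/quality trade-off described above.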

2. Mathematical Formulations and Training Objectives

Generative speech enhancement recasts SE as a conditional generation problem, typically involving the following mathematical elements:

  • Latent-space denoising: Embedding-level denoising is optimized via MSE or latent-consistency losses, e.g., $L_\mathrm{embed} = \|D_\theta(Z_\mathrm{noisy}) - Z_\mathrm{clean}\|_2^2$, where $Z_\mathrm{noisy}$ is extracted from the noisy waveform by a frozen generative encoder and $Z_\mathrm{clean}$ is the clean reference (Sun et al., 13 Jun 2025).
  • Adversarial and Perceptual Losses: GAN-based objectives combine $\ell_1$ or complex spectral losses, perceptual feature-matching, and adversarial divergence minimization (e.g., $L_\mathrm{GAN}$, $L_\mathrm{rec}$, $L_\mathrm{FM}$). Metric discriminators predict human-derived quality metrics (e.g., PESQ), directly integrating otherwise non-differentiable objectives into training (Cao et al., 2022, Wen et al., 20 Aug 2025).
  • Diffusion/Flow Matching: Stochastic score-matching minimizes the denoising-score matching (DSM) objective, while flow-matching approaches learn continuous ODE fields to deterministically bridge noisy and clean speech. Mean-flow approaches further train networks to match the integral of velocity fields, yielding single-step inference (Wang et al., 26 May 2025, Wang et al., 25 Sep 2025).
  • Hierarchical Language Modeling: Discrete-token GSE models use next-token cross-entropy for semantic and acoustic token prediction. Hierarchically structured LMs (e.g., GenSE, OmniGSE) factorize the conditional likelihoods over different codebook levels and domains (Yao et al., 5 Feb 2025, Mu et al., 25 Jul 2025).
  • Hallucination- and Fidelity-Aware Objectives: Advanced frameworks like PASE introduce explicit metrics to minimize linguistic and acoustic hallucinations (e.g., Word Error Rate, Speaker Similarity), leveraging distillation from robust speech SSL models and dual-stream vocoders (Rong et al., 17 Nov 2025).
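The latent-space objective from the first bullet can be evaluated numerically. The "denoiser" here is a trivial offset-removing map, an illustrative assumption standing in for a trained $D_\theta$:

```python
import numpy as np

# L_embed = || D_theta(Z_noisy) - Z_clean ||_2^2, with toy embeddings.
z_clean = np.array([[0.5, -0.2],
                    [0.1,  0.4]])
z_noisy = z_clean + 0.1          # synthetic corruption: constant offset

def denoiser(z: np.ndarray) -> np.ndarray:
    # Stand-in for D_theta: here it happens to remove the known offset.
    return z - 0.1

l_embed = float(np.sum((denoiser(z_noisy) - z_clean) ** 2))
print(l_embed)  # ~0: the loss vanishes when denoising is exact
```

In training, the encoder producing $Z_\mathrm{noisy}$ and $Z_\mathrm{clean}$ stays frozen, so gradients from this loss update only the denoiser.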

3. Experimental Results and Comparative Metrics

GSE models are evaluated using both intrusive and non-intrusive metrics, as well as human listening tests:

  • Objective quality (PESQ, ESTOI, SI-SDR, SSNR): Models based on pre-trained generative encoders (Dasheng) coupled with light denoisers and vocoders outperform discriminative and baseline generative encoders in both distortion and intelligibility metrics, e.g., PESQ 2.32 vs. 1.27/1.31 (Valentini) for generative vs. Whisper/WavLM encoders (Sun et al., 13 Jun 2025). Diffusion and SB-UFOGen achieve state-of-the-art denoising and dereverberation with 1–4 reverse steps, offering remarkable real-time factors (Han et al., 2 Jun 2025).
  • Perceptual metrics (DNSMOS, NISQA, Speaker Fidelity): Generative encoder systems and advanced LM-driven models achieve higher DNSMOS and speaker similarity scores than traditional or discriminative approaches. For example, Dasheng GSE achieves ECAPA-TDNN cosine sim 0.783 (Dasheng) vs. 0.408/0.406 (WavLM/Whisper) (Sun et al., 13 Jun 2025), and PASE achieves SpkSim=0.80, WER=7.49% vs. 0.42–0.63, 21–46% for other leading GSEs (Rong et al., 17 Nov 2025).
  • Subjective listening (MOS, SMOS/NMOS): Human opinion scores indicate clear perceptual advantages for GSE models employing latent denoising, adversarially trained flows, or high-capacity LLMs, outperforming state-of-the-art waveform SE models by 0.76 MOS points in subjective evaluations (Sun et al., 13 Jun 2025).
  • Real-time efficiency: The parameter efficiency and runtime depend on architectural choices. Flow-matching, mean-flow, and GAN-based models drastically reduce network calls compared to classical diffusion, achieving real-time inference (RTF ≪ 1) on standard GPUs (Wang et al., 26 May 2025, Wang et al., 25 Sep 2025, Wen et al., 20 Aug 2025).
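SI-SDR, one of the intrusive metrics above, has a simple closed form: project the estimate onto the reference, then take the energy ratio of the projection to the residual in dB. A reference implementation on synthetic signals (the test signal and noise level are illustrative):

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    s_target = alpha * reference       # component explained by the reference
    e_noise = estimate - s_target      # everything else
    return 10.0 * np.log10(np.sum(s_target**2) / np.sum(e_noise**2))

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 20 * np.pi, 8000))
noisy = clean + 0.1 * rng.standard_normal(8000)
print(round(si_sdr(noisy, clean), 1))  # roughly 17 dB at this noise level
```

The scale invariance (dividing out $\alpha$) is why SI-SDR is preferred over plain SNR when enhancement systems may rescale their output.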

4. Hallucination, Robustness, and Dataset Curation

GSE models can introduce hallucination errors, including phoneme omission and speaker inconsistency, especially under severe noise. Discrete-token GSEs (e.g., Genhancer) are particularly vulnerable to such errors because of sampling stochasticity and model uncertainty (Yamauchi et al., 18 Jan 2026).

  • Non-intrusive detection and filtration: Confidence-based filtering (utterance-averaged log-probabilities over the primary quantizer) strongly correlates with intrusive metrics (e.g., SRCC up to 0.88 with ESTOI/PESQ) and effectively identifies hallucinations missed by standard non-intrusive quality metrics (e.g., UTMOS/DNSMOS) (Yamauchi et al., 18 Jan 2026).
  • Dataset curation: Incorporating confidence-based filtering in pipeline curation improves downstream TTS performance, producing substantial improvements in both subjective (UTMOS) and recognition (WER) metrics. Curation practices leveraging internal GSE model confidence show strong empirical gains (Yamauchi et al., 18 Jan 2026).
  • Mitigating hallucination in design: PASE addresses hallucination by distilling robust phonological priors from large pretrained models (WavLM) and employing dual-stream vocoding, directly improving linguistic/phonetic and speaker consistency metrics (Rong et al., 17 Nov 2025).
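The confidence-based filtering described in the first bullet reduces to averaging per-token log-probabilities from the primary quantizer over each utterance and rejecting utterances below a threshold. The log-probabilities and the threshold value below are illustrative, not taken from the cited paper:

```python
import numpy as np

def utterance_confidence(token_logprobs: np.ndarray) -> float:
    """Utterance-averaged log-probability over the primary quantizer."""
    return float(np.mean(token_logprobs))

def filter_utterances(batch: dict, threshold: float = -2.0) -> list:
    """Keep utterance ids whose averaged log-prob exceeds the threshold."""
    return [uid for uid, lp in batch.items()
            if utterance_confidence(lp) > threshold]

batch = {
    "utt_good": np.array([-0.3, -0.5, -0.4]),          # confident decoding
    "utt_hallucinated": np.array([-3.1, -4.2, -2.8]),  # low confidence
}
print(filter_utterances(batch))  # ['utt_good']
```

Because the score is computed from the GSE model's own token probabilities, no clean reference is needed, which is what makes it usable for large-scale dataset curation.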

5. Advantages, Limitations, and Future Directions

  • Explicit timbre/prosody preservation: By denoising in generative encoders’ latent spaces or through dual-stream architectures, GSE models preserve timbre, prosody, and spectral detail more reliably than discriminative feature-predictors. Decoupling encoding, denoising, and synthesis enables flexible architecture selection and parameter/quality trade-offs (Sun et al., 13 Jun 2025, Rong et al., 17 Nov 2025).
  • Efficiency/Modularity: Modularized GSE frameworks permit independent evolution and fine-tuning of encoders, denoisers, and vocoders. Recent models demonstrate parameter-efficient designs (denoisers with ≤5 M parameters), real-time capable pipelines, and pruning strategies for edge deployment (Sun et al., 13 Jun 2025, Wen et al., 20 Aug 2025, Wang et al., 25 Sep 2025).
  • Limitation (vocoder and distribution shift): GSE pipeline performance may degrade when noise types or acoustic conditions fall outside the training domain of the encoder or vocoder. Hallucinations or artifacts can result when the embedding distribution departs from the training support (Sun et al., 13 Jun 2025, Rong et al., 17 Nov 2025).
  • Applicability to Non-Typical Speech: Diffusion-based and other generative models have documented limitations on pathological or dysarthric speech; systems trained on healthy speech tend to remove paralinguistic cues important for clinical assessment (Reszka et al., 2024). Hybrid or validation-integrated approaches are advised for preserving such cues.
  • Future research: Key directions include (1) integrating adversarial or perceptual embedding losses; (2) multi-scale temporal modeling in denoiser networks; (3) exploring token-level or latent-space confidence criteria for hallucination detection; (4) adaptation and validation against non-typical and cross-domain data; (5) streaming and low-latency architectures using causal transformers and buffer-aligned diffusion frameworks (Lay et al., 21 Oct 2025, Wang et al., 25 Sep 2025).

6. Comparative Overview of Representative GSE Systems

| Model/System | Core Architecture | Intrusive Metric | Efficiency / RTF | Hallucination Control |
|---|---|---|---|---|
| Dasheng GSE (Sun et al., 13 Jun 2025) | Frozen generative encoder + ViT denoiser + Vocos vocoder | PESQ=2.32/1.27*; STOI=0.90/0.81* | Real-time | Latent denoising, parameter ablation |
| EffiFusion-GAN (Wen et al., 20 Aug 2025) | DSConv-GAN + dual-norm attention | PESQ=3.45 | Low footprint (1.08 M params) | Pruning, attention |
| FlowSE / MeanSE (Wang et al., 26 May 2025, Wang et al., 25 Sep 2025) | Diffusion Transformer, flow matching / mean flow | DNSMOS=3.614 / 3.332 | RTF=0.31 / 1 NFE | One-step inference, out-of-domain generalization |
| GenSE (Yao et al., 5 Feb 2025) | Hierarchical LM (semantic + acoustic tokens) | DNSMOS-OVL=3.43 | Slow (autoregressive) | Token prompting, hierarchy |
| PASE (Rong et al., 17 Nov 2025) | Distilled WavLM + dual-stream vocoder | UTMOS=3.09; SpkSim=0.80 | Moderate (382 M params) | WavLM prior, dual stream |
| Diffusion Buffer (Lay et al., 21 Oct 2025) | Block-causal UNet, streaming diffusion | PESQ=2.02 (g=32, d=9) | RTF=0.44 | DP loss, latency/quality trade-off |

*Comparisons: generative vs. discriminative audio encoders; see ref. (Sun et al., 13 Jun 2025).

7. Theoretical and Empirical Impact

Generative Speech Enhancement has catalyzed a shift from task-specific, mask-based, or pure discriminative frameworks toward modular, generalizable models capable of flexible enhancement and restoration across varied noise, distortion, and degradation types. GSE frameworks have demonstrated superior perceptual quality, improved generalization to unseen domains, more natural error structures, and the ability to balance computational cost with communication and application constraints. The field continues to expand into real-time, streaming, and universal enhancement, and to tackle the fundamental challenge of preserving both the intelligibility and unique characteristics of every speech signal (Sun et al., 13 Jun 2025, Rong et al., 17 Nov 2025, Mu et al., 25 Jul 2025).

