- The paper introduces a dual-branch architecture that fuses generative and predictive paradigms for universal speech enhancement, reducing generative hallucinations.
- The generative branch employs DeWavLM-Omni with an optimized Adapter and Vocoder, while the predictive branch refines noise and artifact suppression in the spectrogram domain.
- The integrated fusion approach, validated by top rankings in objective evaluations, sets a new benchmark for multilingual and distortion-robust speech restoration.
GAP-URGENet: A Generative-Predictive Fusion Architecture for Universal Speech Enhancement
Introduction
GAP-URGENet is a universal speech enhancement (USE) system developed for Track 1 of the ICASSP 2026 URGENT Challenge. It integrates generative and predictive paradigms into a single large-scale architecture designed to be robust to diverse distortions, sampling rates, and languages. The design targets a known trade-off: generative models offer superior perceptual quality but are prone to hallucinations and reduced signal faithfulness, while predictive models tend to maximize objective metrics at the expense of naturalness. By fusing the two, the framework improves both speech fidelity and robustness, as reflected in its competitive overall performance and first-place result in the objective evaluation of the URGENT Challenge.
Figure 1: Overview of the GAP-URGENet framework, illustrating the dual-branch generative-predictive fusion architecture and post-processing pipeline.
Model Architecture
Generative Branch
The generative branch is centered on DeWavLM-Omni, an extension of the PASE (Phonological prior Aided Speech Enhancement) paradigm that scales self-supervised speech representations to full-stack restoration. The input may carry a wide range of distortions. For packet loss concealment, a packet-loss detection method identifies the affected frames, and learnable mask embeddings replace the corresponding CNN outputs so that WavLM's masked-prediction capability can fill in the missing content. The model produces two representations: early-layer Transformer outputs capture fine-grained acoustics but retain residual signal artifacts, while final-layer outputs yield high-level, purified phonetic features. The Adapter maps these refined representations to an acoustic space, conditioned on the original input via element-wise addition, so that the neural Vocoder can reconstruct a high-fidelity waveform.
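The mask-embedding step for packet loss concealment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the zero-energy detector, the function names `detect_lost_frames` and `mask_lost_frames`, and the one-to-one alignment between waveform frames and CNN feature frames are all hypothetical. In the actual model the mask embedding is a learnable parameter; here it is a fixed vector.

```python
from typing import List, Sequence


def detect_lost_frames(waveform: Sequence[float], frame_len: int,
                       eps: float = 1e-8) -> List[bool]:
    """Flag frames whose energy is (near) zero as lost packets.

    Hypothetical detector: the source does not specify the detection method.
    """
    flags = []
    for start in range(0, len(waveform) - frame_len + 1, frame_len):
        energy = sum(x * x for x in waveform[start:start + frame_len])
        flags.append(energy <= eps)
    return flags


def mask_lost_frames(cnn_frames: Sequence[Sequence[float]],
                     lost: Sequence[bool],
                     mask_embedding: Sequence[float]) -> List[List[float]]:
    """Replace feature frames flagged as lost with a shared mask embedding."""
    if len(cnn_frames) != len(lost):
        raise ValueError("one loss flag is required per feature frame")
    masked = []
    for frame, is_lost in zip(cnn_frames, lost):
        if len(frame) != len(mask_embedding):
            raise ValueError("mask embedding must match the feature dimension")
        masked.append(list(mask_embedding) if is_lost else list(frame))
    return masked
```

The masked sequence would then be passed to the Transformer layers, which predict content for the masked positions in the same way as during masked-prediction pre-training.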
Both the Adapter and the Vocoder adopt the improved Vocos design, discarding the iSTFT head for efficiency. The Adapter is optimized with MSE, adversarial, and feature-matching losses. The Vocoder is trained independently on clean speech with a combination of a multi-scale Mel-spectrogram loss and adversarial losses from multiple discriminators.
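As a rough illustration of how the three Adapter loss terms might be combined, the sketch below forms a weighted sum. The weights `w_mse`, `w_adv`, and `w_fm`, the least-squares form of the adversarial term, and the L1 form of the feature-matching term are assumptions chosen for illustration; the source does not state them.

```python
from typing import Sequence


def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and target features."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def generator_adversarial_loss(fake_scores: Sequence[float]) -> float:
    """Least-squares GAN generator loss (an assumed form)."""
    return sum((1.0 - s) ** 2 for s in fake_scores) / len(fake_scores)


def feature_matching_loss(real_feats: Sequence[Sequence[float]],
                          fake_feats: Sequence[Sequence[float]]) -> float:
    """Mean L1 distance between discriminator features, averaged over layers."""
    per_layer = [
        sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
        for real, fake in zip(real_feats, fake_feats)
    ]
    return sum(per_layer) / len(per_layer)


def adapter_loss(pred, target, fake_scores, real_feats, fake_feats,
                 w_mse: float = 1.0, w_adv: float = 1.0,
                 w_fm: float = 1.0) -> float:
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_mse * mse_loss(pred, target)
            + w_adv * generator_adversarial_loss(fake_scores)
            + w_fm * feature_matching_loss(real_feats, fake_feats))
```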
Predictive Branch
The predictive branch, implemented with the TF-GridNet architecture, performs direct spectrogram-domain enhancement, targeting noise, reverberation, and clipping artifacts through discriminative modeling. Its STFT-domain loss enforces preservation of signal detail and provides cues complementary to those of the generative branch. This branch is primarily responsible for suppressing additive and convolutive distortions, the scenarios in which predictive models excel.
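To make the idea of an STFT-domain loss concrete, the sketch below compares two signals through their short-time spectra. It is one common formulation (L1 distance on real, imaginary, and magnitude components), chosen as an assumption; the source does not specify the exact loss used. The frame length, hop size, and naive DFT are for readability only.

```python
import cmath
import math
from typing import List, Sequence


def stft(signal: Sequence[float], frame_len: int = 8,
         hop: int = 4) -> List[List[complex]]:
    """Naive STFT with a periodic Hann window (one-sided spectrum)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def stft_loss(estimate: Sequence[float], reference: Sequence[float],
              frame_len: int = 8, hop: int = 4) -> float:
    """Mean L1 distance over real, imaginary, and magnitude components."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    est = stft(estimate, frame_len, hop)
    ref = stft(reference, frame_len, hop)
    total, count = 0.0, 0
    for est_frame, ref_frame in zip(est, ref):
        for e, r in zip(est_frame, ref_frame):
            total += (abs(e.real - r.real) + abs(e.imag - r.imag)
                      + abs(abs(e) - abs(r)))
            count += 1
    return total / count if count else 0.0
```

Because the real and imaginary parts enter the loss directly, phase errors are penalized along with magnitude errors, which is one way such a loss can encourage faithfulness to the reference signal.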
Post-Processing (PostNet) and Fusion
The outputs of the two branches are concatenated and processed by the PostNet, which is based on the CWS-TF-GridNet architecture from TS-URGENet. This module performs both feature fusion and bandwidth extension, producing a high-resolution 48 kHz waveform prior to downsampling. The loss functions are extended with PESQ- and UTMOS-aware terms to better align training with perceptual quality metrics and human listening judgments.
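The data flow through the fusion stage can be sketched as below. Everything here is a stand-in: `postnet_placeholder` is a hypothetical fixed weighted sum in place of the learned CWS-TF-GridNet module (and does no bandwidth extension), and `downsample_by_averaging` is a crude block average, where a real resampler would apply a proper low-pass filter.

```python
from typing import List, Sequence, Tuple


def concat_branches(generative: Sequence[float],
                    predictive: Sequence[float]) -> List[Tuple[float, float]]:
    """Stack the two branch outputs as a two-channel signal."""
    if len(generative) != len(predictive):
        raise ValueError("branch outputs must have the same length")
    return list(zip(generative, predictive))


def postnet_placeholder(stacked: Sequence[Tuple[float, float]],
                        w_gen: float = 0.5,
                        w_pred: float = 0.5) -> List[float]:
    """Hypothetical stand-in for the learned PostNet: a fixed weighted sum."""
    return [w_gen * g + w_pred * p for g, p in stacked]


def downsample_by_averaging(signal: Sequence[float],
                            factor: int) -> List[float]:
    """Crude decimation: average each full block of `factor` samples."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]


# Example: fuse two short branch outputs, then reduce the rate by 3
# (for instance 48 kHz to 16 kHz).
stacked = concat_branches([1.0, 3.0, 5.0], [3.0, 5.0, 7.0])
fused = postnet_placeholder(stacked)
low_rate = downsample_by_averaging(fused, 3)
```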
Experimental Protocol
The system is trained on a large, carefully curated corpus drawn from standard and URGENT-specific datasets. Data cleaning includes DNSMOS-based quality filtering and neural noise suppression for low-quality sources. The model configuration employs WavLM-Large for DeWavLM-Omni and high-capacity variants (over 567M parameters, 472.84 GMACs/s) for all modules, underscoring the system's suitability for large-scale, real-world deployment. A set of 10,000 simulated room impulse responses (RIRs) with elevated RT60 expands reverberation coverage, and music and noise data are further pre-processed to remove vocal contamination.
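A DNSMOS-based filtering pass might look like the following sketch. The `Utterance` record, the threshold of 3.0, and the two-way split (keep as-is versus route to further cleaning such as noise suppression) are assumptions for illustration; the source does not give the threshold or the exact routing logic, and the scores are assumed to be precomputed.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Utterance:
    path: str
    dnsmos: float  # precomputed DNSMOS score (assumed available)


def partition_by_dnsmos(
    utterances: Sequence[Utterance],
    threshold: float = 3.0,  # hypothetical cut-off
) -> Tuple[List[Utterance], List[Utterance]]:
    """Split utterances into those kept as clean targets and those
    flagged for further cleaning."""
    keep = [u for u in utterances if u.dnsmos >= threshold]
    needs_cleaning = [u for u in utterances if u.dnsmos < threshold]
    return keep, needs_cleaning
```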
Results
GAP-URGENet achieves robust gains over both the discriminative (BSRNN, TF-GridNet) and fully generative baselines:
- Superior objective and subjective quality: The system delivers state-of-the-art performance on canonical metrics (DNSMOS, NISQA, UTMOS, SCOREQ, PESQ).
- Speaker and linguistic fidelity: The generative branch shows enhanced performance in SpkSim and LPS metrics, evidencing improved speaker and phonetic consistency.
- Balanced fusion: The full system outperforms each constituent branch, validating the complementarity and efficacy of generative-predictive integration.
- First-place Challenge ranking: GAP-URGENet secures the top position in the blind-test objective ranking for the ICASSP 2026 URGENT Challenge, attesting to its practical efficacy.
The architecture succeeds in minimizing generative hallucinations compared to prior fully generative models, while maintaining higher perceptual quality than purely predictive approaches.
Implications and Future Directions
GAP-URGENet establishes a robust technical precedent for universal, task-agnostic speech enhancement architectures. Its effective fusion is likely to motivate adoption of similar dual-path strategies in other high-variance, low-resource, or multilingual enhancement settings. The intricate use of self-supervised representations for both denoising and packet loss concealment opens avenues for extending such mechanisms to more general audio restoration, including synthesis, separation, and dereverberation. The high-capacity model and curated training regime also underline the crucial role of data preparation and appropriate model scaling for USE.
Future research may explore end-to-end joint optimization of the vocoder and upstream modules, transfer to a broader range of language families and domains, and adaptations for real-time or streaming inference. The integration of large spoken language models with robust discriminative modules represents a generalized and extensible framework with potential for broader multi-modal and cross-lingual speech interface applications.
Conclusion
GAP-URGENet demonstrates the tangible advantages of fusing generative and predictive paradigms for universal speech enhancement under diverse and challenging conditions. Its architectural innovations, robust performance across metrics, and practical success in the URGENT Challenge underscore the efficacy of generative-predictive fusion, setting a strong benchmark for subsequent research in universal robust speech restoration (2604.01832).