Universal Score-based Speech Enhancement with High Content Preservation

Published 18 Jun 2024 in eess.AS and cs.SD | (2406.12194v1)

Abstract: We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model that decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we introduce an adversarial loss to promote learning high quality speech features. Third, we propose a low-rank adaptation scheme with a phoneme fidelity loss to improve content preservation in the enhanced speech. In the experiments, we train a universal enhancement model on a large scale dataset of speech degraded by noise, reverberation, and various distortions. The results on multiple public benchmark datasets demonstrate that UNIVERSE++ compares favorably to both discriminative and generative baselines for a wide range of qualitative and intelligibility metrics.

Summary

The paper introduces UNIVERSE++, which combines score-based diffusion with adversarial training to improve speech quality and content preservation.
It applies architectural improvements such as normalization, anti-aliasing filters, and optimized Fourier embeddings to make training more robust and the output more natural.
It uses low-rank adaptation with a phoneme fidelity loss to reduce hallucinations and keep the enhanced speech faithful to the original linguistic content.

Universal Score-based Speech Enhancement with High Content Preservation: An Overview

The paper "Universal Score-based Speech Enhancement with High Content Preservation" presents UNIVERSE++, a speech enhancement method that combines score-based diffusion with adversarial training. The work builds on the existing UNIVERSE model and introduces changes that improve training stability, overall performance, and content preservation in the enhanced speech.

Conducted by researchers at LY Corporation, this study emphasizes three primary contributions:

Architectural upgrades to enhance training robustness and outcomes.
Integration of adversarial loss to improve speech feature extraction quality.
Implementation of a low-rank adaptation scheme accompanied by a phoneme fidelity loss to ensure content preservation during the enhancement process.

Background and Context

Universal Speech Enhancement (USE) is the restoration of clean speech from signals degraded in various ways, such as by noise, reverberation, clipping, and other distortions. Traditional speech enhancement methods, typically based on deep neural networks (DNNs) operating in either the time domain or the time-frequency domain, suffer from problems such as residual noise and artifacts. Generative models, including GANs and score-based diffusion models, offer a promising alternative because they focus on generating high-quality speech without residual noise.
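
As an illustration of the degradations listed above, the following is a minimal sketch of how a degraded training example could be simulated by adding noise at a target signal-to-noise ratio and then clipping. The function names, the sine-wave stand-in for speech, and the chosen SNR and clipping threshold are assumptions made for this example; the overview does not describe the data pipeline used in the paper.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    # The gain g solves 10 * log10(P_speech / (g^2 * P_noise)) = snr_db.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def hard_clip(signal, threshold):
    """Saturate samples outside [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in signal]


# Hypothetical inputs: a 220 Hz tone at 16 kHz stands in for clean speech.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy = mix_at_snr(speech, noise, snr_db=5.0)
degraded = hard_clip(noisy, threshold=0.8)
```

A universal enhancement model would be trained on pairs such as (degraded, speech), with the degradation type varied from example to example.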

UNIVERSE, a score-based diffusion method for USE, has shown strong potential. However, the authors' preliminary experiments revealed issues such as training difficulty and speech hallucinations. UNIVERSE++ addresses these limitations with targeted modifications.

Methodology

Network Architecture Improvements

UNIVERSE++ incorporates several architectural enhancements:

Normalization: The model uses the re-parameterization proposed by Karras et al., which scales the network inputs and training targets to unit variance.
Anti-aliasing Filters: Filters are added to the down- and up-sampling stages to mitigate aliasing artifacts. This practice, borrowed from image generation, keeps the processing of high-frequency content in the upper stages of the network.
Miscellaneous Modifications: The model employs weight normalization and optimized Fourier embeddings of the noise variance.
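
The first two items can be sketched in a few lines. The scaling factors below are the ones published by Karras et al. (2022) for diffusion models, and the down-sampler is a generic windowed-sinc low-pass filter followed by decimation. Whether UNIVERSE++ uses exactly these coefficients or this filter design is not stated in this overview, so sigma_data, the tap count, and the cutoff are assumptions chosen for illustration.

```python
import math


def edm_coefficients(sigma, sigma_data=1.0):
    """Scaling factors from Karras et al. (2022) for noise level sigma.

    The network F sees c_in * (x + n) and is trained so that
    c_skip * (x + n) + c_out * F(...) approximates the clean signal x.
    """
    total = sigma ** 2 + sigma_data ** 2
    c_in = 1.0 / math.sqrt(total)                  # gives a unit-variance network input
    c_skip = sigma_data ** 2 / total               # weight of the skip connection
    c_out = sigma * sigma_data / math.sqrt(total)  # gives a unit-variance training target
    return c_in, c_skip, c_out


def lowpass_taps(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter; cutoff in cycles per sample (< 0.5)."""
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = 2 * cutoff * (n - mid)
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(2 * cutoff * sinc * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC


def antialiased_downsample(signal, factor, num_taps=63):
    """Low-pass filter, then keep every `factor`-th sample.

    Only the 'valid' region is returned, so the output is shorter than
    len(signal) / factor by roughly num_taps / factor samples.
    """
    # Cutoff at 80% of the new Nyquist frequency (an illustrative choice).
    taps = lowpass_taps(num_taps, cutoff=0.8 * 0.5 / factor)
    filtered = [
        sum(t * signal[i + k] for k, t in enumerate(taps))
        for i in range(len(signal) - num_taps + 1)
    ]
    return filtered[::factor]
```

Without the filter, a tone above the new Nyquist frequency folds back into the audible band at full amplitude when every second sample is dropped; with the filter, it is strongly attenuated before decimation.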

Adversarial Training with HiFi-GAN

A new element in UNIVERSE++ is the HiFi-GAN adversarial loss, which replaces the original Mixture Density Network (MDN) loss and moves the feature-learning objective from a sample-wise criterion to an adversarial one. The adversarial objective relies on multi-period and multi-resolution discriminators and promotes high-quality feature extraction, which leads to more natural-sounding enhanced speech.
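
For readers unfamiliar with HiFi-GAN, the sketch below shows the least-squares adversarial objectives and the feature-matching term described in the HiFi-GAN paper, plus the reshaping step that a multi-period discriminator applies to its input. It operates on plain lists of discriminator scores rather than a real network, and it is not taken from the UNIVERSE++ code; how the paper weights or combines these terms is not covered in this overview.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN loss: push scores of real audio to 1 and generated audio to 0."""
    return mean([(r - 1.0) ** 2 for r in real_scores]) + mean([f ** 2 for f in fake_scores])


def generator_loss(fake_scores):
    """Least-squares GAN loss: the generator wants its outputs scored as real (1)."""
    return mean([(f - 1.0) ** 2 for f in fake_scores])


def feature_matching_loss(real_features, fake_features):
    """Mean absolute difference between discriminator activations, summed over layers."""
    return sum(
        mean([abs(r - f) for r, f in zip(real, fake)])
        for real, fake in zip(real_features, fake_features)
    )


def fold_by_period(signal, period):
    """View a 1-D signal as rows of length `period`, as a multi-period discriminator does.

    Simplification: zero padding is used here, whereas HiFi-GAN pads by reflection.
    """
    padded = list(signal) + [0.0] * (-len(signal) % period)
    return [padded[i:i + period] for i in range(0, len(padded), period)]
```

Each discriminator (one per period or per spectrogram resolution) contributes its own score and feature lists, and the per-discriminator losses are added together.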

Low-rank Adaptation and Phoneme Fidelity Loss

To reduce hallucinations and preserve linguistic content, UNIVERSE++ adds a fine-tuning stage based on low-rank adaptation, which adapts the weights with little additional memory. The fine-tuning objective uses a phoneme predictor and a connectionist temporal classification (CTC) loss to align the phonemes of the enhanced speech with those of the clean speech.
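
The two ingredients named above can be illustrated with a small, dependency-free sketch: a linear layer with a low-rank update in the style of LoRA, and the CTC negative log-likelihood computed with the standard forward algorithm. The class and function names, the rank, the scaling factor, and the initialization are assumptions for this example, and the paper's phoneme predictor is not reproduced here. Practical implementations compute CTC in log space for numerical stability; plain probabilities are used below for readability.

```python
import math
import random


def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable rank-r update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen, shape d_out x d_in
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]  # zero init: output unchanged at start
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def num_trainable(self):
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


def ctc_loss(probs, labels, blank=0):
    """Negative log-likelihood of `labels` under CTC via the forward algorithm.

    probs[t][k] is the predicted probability of symbol k at frame t.
    """
    ext = [blank]
    for label in labels:
        ext += [label, blank]  # blank, l1, blank, l2, ..., blank
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s, symbol in enumerate(ext):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and symbol != blank and symbol != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][symbol]
        alpha = new
    likelihood = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return -math.log(likelihood)
```

In a fine-tuning loop of this kind, only the A and B matrices would receive gradients, and the CTC loss would compare frame-level phoneme probabilities predicted from the enhanced speech against the phoneme sequence of the clean reference.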

Experimental Evaluation

The model was evaluated on several public benchmark datasets:

Voicebank+DEMAND (VB): Demonstrating significant improvements in quality and naturalness, evidenced by metrics such as PESQ and DNSMOS.
VB Bandwidth Extension (VB-BWE) and Packet Loss Concealment (PLC): Highlighting the model's versatility in handling different speech distortions.
Signal Improvement Challenge (SIG): Emphasizing its capacity to enhance real-world distorted speech.

Implications and Future Directions

The results indicate that UNIVERSE++ produces natural-sounding speech while preserving content across diverse degradation types. It compares favorably to both discriminative and generative baselines, particularly in naturalness and content integrity, which makes it a strong candidate for universal speech enhancement.

In future research, applying the phoneme loss during the initial training phase may yield further improvements. The adaptability and generation quality of UNIVERSE++ make it relevant to practical applications that require high content preservation and naturalness.
