- The paper demonstrates that integrating semantic knowledge distillation (KD) into the MaskSR framework, yielding MaskSR2, reduces the word error rate by up to 37.9% relative, markedly improving speech intelligibility.
- During training, the speech encoder is taught to predict semantic features derived from a pre-trained HuBERT model; the encoder output then conditions the masked acoustic token prediction.
- Evaluations on VCTK and LibriSpeech test sets show improved quality and intelligibility, with MaskSR2 outperforming baseline models on both subjective and objective metrics.
Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration With Improved Intelligibility
The paper "Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration With Improved Intelligibility" addresses the challenge of restoring full-band speech (44.1 kHz) with high quality and intelligibility from corrupted signals. The proposed model, MaskSR2, builds upon previous work on the MaskSR generative model. While MaskSR achieves high perceptual quality, MaskSR2 aims to significantly enhance speech intelligibility measured by word error rate (WER).
Overview of the Proposed Approach
The authors' main contribution is integrating semantic knowledge distillation (KD) into the speech encoder component of MaskSR during training. The speech encoder for MaskSR2 is trained to predict semantic representations of the target speech using a pre-trained self-supervised model, HuBERT, which encodes phonetic patterns. This is aimed at reducing the WER without relying on large transcribed datasets.
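To make this concrete, below is a minimal PyTorch sketch of the distillation setup, not the authors' implementation: it assumes the Hugging Face `transformers` checkpoint `facebook/hubert-base-ls960` as the frozen teacher, and it uses an L1-plus-cosine loss for continuous-feature targets and cross-entropy for discrete-token targets, which are plausible choices rather than the paper's exact losses.

```python
import torch
import torch.nn.functional as F
from transformers import HubertModel

# Frozen teacher: HuBERT consumes 16 kHz waveforms (illustrative checkpoint).
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
for p in hubert.parameters():
    p.requires_grad_(False)

def hubert_targets(clean_16k: torch.Tensor, layer: int = 9) -> torch.Tensor:
    """Semantic targets: hidden states of one HuBERT layer on the clean speech."""
    with torch.no_grad():
        out = hubert(clean_16k, output_hidden_states=True)
    # hidden_states[0] is the CNN embedding output, so index 9 is transformer layer 9.
    return out.hidden_states[layer]  # (batch, frames, 768)

def continuous_kd_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """One plausible continuous-feature KD loss: L1 plus cosine distance."""
    l1 = F.l1_loss(pred, target)
    cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    return l1 + cos

def discrete_kd_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against k-means cluster ids of HuBERT features."""
    # logits: (batch, frames, n_clusters); token_ids: (batch, frames)
    return F.cross_entropy(logits.transpose(1, 2), token_ids)
```

The key property is that the teacher is frozen and requires no transcriptions: the phonetic structure captured by HuBERT's self-supervised pre-training is distilled into the speech encoder for free.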
Key Components
- Speech Encoder: During training, the encoder learns to predict semantic targets provided by a frozen, pre-trained HuBERT model; the targets are extracted from the target speech downsampled to 16 kHz. Three target variants are studied: L9-K500 (discrete tokens from k-means clustering of HuBERT layer-9 features), L9-feature (continuous layer-9 features), and Avg-feature (features averaged across HuBERT layers). The encoder output then conditions the generative model for masked token prediction.
- Knowledge Distillation: A distillation loss trains the speech encoder to predict the HuBERT-derived semantic targets. This transfers SSL representations that encapsulate phonetic information, improving the model's ability to produce intelligible restored speech.
- Masked Acoustic Modeling: As in MaskSR, the generative model predicts masked acoustic tokens following the MaskGIT paradigm, iteratively refining the target codegram, which the audio tokenizer (DAC) then decodes into a waveform; a toy sketch of this decoding loop follows this list.
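For intuition, here is a simplified sketch of MaskGIT-style iterative decoding over a single codebook. All names (`model`, `cond`, `mask_id`) are illustrative assumptions: `model` stands in for the acoustic transformer conditioned on the speech-encoder output, and the real system decodes DAC's multiple codebooks rather than one.

```python
import math
import torch

@torch.no_grad()
def iterative_decode(model, cond, num_frames: int, mask_id: int,
                     steps: int = 8) -> torch.Tensor:
    """MaskGIT-style decoding: fill all masked slots, then re-mask the least
    confident predictions on a cosine schedule until no masks remain."""
    tokens = torch.full((1, num_frames), mask_id, dtype=torch.long)
    for t in range(steps):
        logits = model(tokens, cond)              # (1, num_frames, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, cand = probs.max(dim=-1)            # best token and its confidence
        # Fill only the currently masked positions; keep committed tokens.
        still_masked = tokens == mask_id
        tokens = torch.where(still_masked, cand, tokens)
        conf = torch.where(still_masked, conf,
                           torch.full_like(conf, float("inf")))
        # Cosine schedule: fraction of positions to re-mask for the next step.
        frac = math.cos(math.pi / 2 * (t + 1) / steps)
        n_remask = int(frac * num_frames)
        if n_remask > 0:
            idx = conf.topk(n_remask, largest=False).indices
            tokens.scatter_(1, idx, mask_id)
    return tokens
```

The schedule commits the most confident predictions first and leaves the ambiguous positions for later steps, which is what lets a non-autoregressive model synthesize a full codegram in a handful of passes.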
Results
The evaluation shows that MaskSR2 substantially improves both intelligibility and overall quality. Specifically, MaskSR2 achieves a relative WER reduction of up to 37.9% compared to MaskSR and even surpasses several strong regression models on standard test sets.
Experimental Findings
- Full-band Restoration: On a VCTK test set with multi-faceted distortions (noise, reverb, low bandwidth, and clipping), MaskSR2-S and MaskSR2-L demonstrated significant WER reduction and quality improvement over MaskSR. Avg-feature KD consistently yielded the best outcomes.
- Subjective Listening: Expert listeners rated MaskSR2-L nearly on par with the target speech in terms of quality, outperforming strong baselines like DFNet3 and VoiceFixer.
- Wide-band Denoising: On the LibriSpeech test set, which contains longer and more complex sentences, MaskSR2 models showed competitive WER results compared to specialized denoising models, despite MaskSR2's broader restoration capability. Quality metrics like DNSMOS and SESQA also favored MaskSR2.
Implications and Future Directions
The integration of semantic KD using SSL models like HuBERT in MaskSR2 highlights a promising direction for speech restoration, emphasizing the necessity of semantic understanding for intelligibility. Moving forward, exploring more powerful SSL models and multitask speech encoders could further enhance performance. Additionally, the paper suggests further investigating span masking strategies for potentially better context modeling.
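As a rough illustration of the span-masking idea (a hypothetical sketch, not from the paper), the difference from independent per-token masking is that spans of adjacent tokens are hidden together, forcing the model to rely on longer-range context:

```python
import torch

def random_mask(n_frames: int, ratio: float) -> torch.Tensor:
    """Baseline: mask each position independently."""
    return torch.rand(n_frames) < ratio

def span_mask(n_frames: int, ratio: float, span: int = 5) -> torch.Tensor:
    """Mask contiguous spans so the model must infill longer stretches."""
    mask = torch.zeros(n_frames, dtype=torch.bool)
    n_spans = max(1, int(n_frames * ratio / span))
    starts = torch.randint(0, max(1, n_frames - span + 1), (n_spans,))
    for s in starts:
        mask[s : s + span] = True
    return mask
```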
In summary, MaskSR2 sets a new standard for speech restoration, combining high perceptual quality with improved intelligibility. This integrated approach shows significant promise, particularly in applications where speech clarity and content preservation are crucial.