
Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility (2409.09357v1)

Published 14 Sep 2024 in cs.SD, cs.AI, eess.AS, and eess.SP

Abstract: Speech restoration aims at restoring full-band speech with high quality and intelligibility, considering a diverse set of distortions. MaskSR is a recently proposed generative model for this task. As other models of its kind, MaskSR attains high quality but, as we show, intelligibility can be substantially improved. We do so by boosting the speech encoder component of MaskSR with predictions of semantic representations of the target speech, using a pre-trained self-supervised teacher model. Then, a masked language model is conditioned on the learned semantic features to predict acoustic tokens that encode low level spectral details of the target speech. We show that, with the same MaskSR model capacity and inference time, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. MaskSR2 also achieves competitive word error rate among other models, while providing superior quality. An ablation study shows the effectiveness of various semantic representations.

Summary

  • The paper demonstrates that integrating semantic knowledge distillation (KD) into MaskSR's speech encoder reduces word error rate by up to 37.9% relative, markedly improving speech intelligibility.
  • It uses a pre-trained HuBERT teacher to supervise the speech encoder, which learns to predict semantic features that then condition masked acoustic token prediction.
  • Extensive tests on VCTK and LibriSpeech show improved quality and clarity, outperforming baseline models in both perceptual and objective metrics.

Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration With Improved Intelligibility

The paper "Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration With Improved Intelligibility" addresses the challenge of restoring full-band speech (44.1 kHz) with high quality and intelligibility from corrupted signals. The proposed model, MaskSR2, builds upon previous work on the MaskSR generative model. While MaskSR achieves high perceptual quality, MaskSR2 aims to significantly enhance speech intelligibility measured by word error rate (WER).

Overview of the Proposed Approach

The authors' main contribution is integrating semantic knowledge distillation (KD) into the speech encoder component of MaskSR during training. The speech encoder for MaskSR2 is trained to predict semantic representations of the target speech using a pre-trained self-supervised model, HuBERT, which encodes phonetic patterns. This is aimed at reducing the WER without relying on large transcribed datasets.
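
Concretely, the semantic targets can be thought of as frame-level hidden states of a frozen HuBERT teacher run on 16 kHz versions of the clean target speech. Below is a minimal sketch using the Hugging Face transformers API; the checkpoint name, layer index, and tensor shapes are assumptions for illustration rather than details confirmed by the paper, and the discrete targets discussed later would additionally require k-means clustering of these features.

```python
# Sketch: extracting HuBERT hidden states to serve as semantic KD targets.
# Assumptions: the "facebook/hubert-base-ls960" checkpoint and layer index 9
# stand in for whatever teacher configuration the paper actually uses.
import torch
from transformers import AutoFeatureExtractor, HubertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
teacher = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

@torch.no_grad()
def hubert_layer_targets(waveform_16k, layer=9):
    """Return frame-level hidden states from one HuBERT layer (teacher is frozen).

    waveform_16k: 1-D numpy array of the clean target speech, resampled to 16 kHz.
    """
    inputs = feature_extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    outputs = teacher(inputs.input_values, output_hidden_states=True)
    # hidden_states[0] is the output before the first transformer layer;
    # index `layer` picks the output of that transformer layer.
    return outputs.hidden_states[layer]  # shape: (1, num_frames, hidden_dim)
```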

Key Components

  • Speech Encoder: During training, the encoder is supervised by a pre-trained HuBERT model, which provides semantic targets extracted from the target speech downsampled to 16 kHz. The encoder learns to predict these targets in one of three variants: L9-K500 (discrete tokens), L9-feature (continuous layer-9 features), and Avg-feature (features averaged across HuBERT layers). The encoder output then conditions the generative model for masked token prediction.
  • Knowledge Distillation: The KD process uses a loss function that trains the speech encoder to predict the HuBERT-derived semantic targets. This harnesses self-supervised learning (SSL) representations that encapsulate phonetic information, improving the model's capacity to produce more intelligible restored speech (a minimal sketch of such a distillation objective follows this list).
  • Masked Acoustic Modeling: As in MaskSR, the generative model predicts masked acoustic tokens (following the MaskGIT paradigm) and iteratively synthesizes the target codegram, which the audio codec (e.g., DAC) decodes into a waveform (see the decoding sketch after this list).
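
As a rough illustration of the distillation term for the continuous-feature variants (L9-feature and Avg-feature), the sketch below regresses projected encoder features toward the frozen teacher's features with an L1 loss; the projection layer, the loss choice, the weighting, and all tensor shapes are assumptions for illustration, not the paper's exact formulation. For the discrete L9-K500 variant, the analogous objective would be a frame-wise cross-entropy over the 500 k-means cluster IDs.

```python
# Sketch: a distillation head on the speech encoder that regresses HuBERT features.
# Assumed shapes: encoder_feats (B, T, d_enc), teacher_feats (B, T, d_teacher),
# already time-aligned to the same frame rate. The L1 loss is an assumption.
import torch
import torch.nn as nn

class SemanticKDHead(nn.Module):
    def __init__(self, d_enc, d_teacher):
        super().__init__()
        # Linear projection from the encoder's feature space to the teacher's.
        self.proj = nn.Linear(d_enc, d_teacher)

    def forward(self, encoder_feats, teacher_feats):
        pred = self.proj(encoder_feats)
        # Frame-wise regression toward the frozen teacher's semantic features.
        return nn.functional.l1_loss(pred, teacher_feats)

# Usage sketch: the KD term is added to the usual masked acoustic-token loss,
# with a hypothetical weight kd_weight:
# total_loss = acoustic_token_loss + kd_weight * kd_head(enc_out, hubert_targets)
```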
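
The generative stage's iterative decoding follows the general MaskGIT recipe of confidence-based unmasking; the sketch below shows that generic loop (the number of steps, the cosine-style schedule, and the `model(tokens, cond)` interface are illustrative assumptions, not the paper's configuration).

```python
# Sketch: generic MaskGIT-style iterative decoding over a sequence of acoustic tokens.
# `model(tokens, cond)` is a stand-in for the conditional masked language model and is
# assumed to return per-position logits of shape (seq_len, vocab_size).
import math
import torch

def maskgit_decode(model, cond, seq_len, vocab_size, mask_id, steps=8):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)  # start fully masked
    for step in range(steps):
        logits = model(tokens, cond)                     # (seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Already-decided positions are always kept (infinite confidence).
        conf = torch.where(tokens == mask_id, conf, torch.tensor(float("inf")))
        # Cosine-style schedule: commit progressively more tokens each step.
        keep_ratio = math.cos(math.pi / 2 * (steps - 1 - step) / steps)
        num_keep = max(1, int(keep_ratio * seq_len))
        keep_idx = conf.topk(num_keep).indices
        # Commit the most confident newly sampled tokens; the rest stay masked.
        tokens[keep_idx] = torch.where(tokens[keep_idx] == mask_id,
                                       sampled[keep_idx], tokens[keep_idx])
    return tokens
```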

Results

The evaluation shows MaskSR2's substantial improvements in both intelligibility and overall quality metrics. Specifically, MaskSR2 achieves a relative WER reduction of up to 37.9% compared to MaskSR and even surpasses several strong regression models on standard test sets.
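
For context, WER is obtained by aligning an ASR transcript of the restored speech with the reference text, and the relative reduction quoted above is (WER_baseline − WER_new) / WER_baseline, so a 37.9% relative reduction leaves the new WER at roughly 62% of the baseline's. A minimal sketch using the jiwer package, with purely illustrative placeholder transcripts:

```python
# Sketch: computing WER and the relative WER reduction between two systems.
# The transcripts below are placeholders, not outputs from the paper's models.
import jiwer

reference   = "the quick brown fox jumps over the lazy dog"
baseline_tx = "the quick brown fox jumps over a lazy talk"   # hypothetical baseline ASR output
improved_tx = "the quick brown fox jumps over the lazy dog"  # hypothetical improved ASR output

wer_baseline = jiwer.wer(reference, baseline_tx)
wer_improved = jiwer.wer(reference, improved_tx)

relative_reduction = (wer_baseline - wer_improved) / wer_baseline
print(f"baseline WER={wer_baseline:.3f}, improved WER={wer_improved:.3f}, "
      f"relative reduction={relative_reduction:.1%}")
```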

Experimental Findings

  • Full-band Restoration: On a VCTK test set with multi-faceted distortions (noise, reverb, low bandwidth, and clipping), MaskSR2-S and MaskSR2-L demonstrated significant WER reduction and quality improvement over MaskSR. Avg-feature KD consistently yielded the best outcomes.
  • Subjective Listening: Expert listeners rated MaskSR2-L nearly on par with the target speech in terms of quality, outperforming strong baselines like DFNet3 and VoiceFixer.
  • Wide-band Denoising: On the LibriSpeech test set, which contains longer and more complex sentences, MaskSR2 models showed competitive WER results compared to specialized denoising models, despite MaskSR2's broader restoration capability. Quality metrics like DNSMOS and SESQA also favored MaskSR2.

Implications and Future Directions

The integration of semantic KD using SSL models like HuBERT in MaskSR2 highlights a promising direction for speech restoration, emphasizing the necessity of semantic understanding for intelligibility. Moving forward, exploring more powerful SSL models and multitask speech encoders could further enhance performance. Additionally, the paper suggests further investigating span masking strategies for potentially better context modeling.

In summary, MaskSR2 sets an advanced standard for speech restoration, combining perceptual quality with improved intelligibility. This integrated approach shows significant promise, particularly in applications where speech clarity and content preservation are crucial.
