
Fast FullSubNet: Accelerate Full-band and Sub-band Fusion Model for Single-channel Speech Enhancement (2212.09019v2)

Published 18 Dec 2022 in eess.AS and eess.SP

Abstract: FullSubNet is our recently proposed real-time single-channel speech enhancement network that achieves outstanding performance on the Deep Noise Suppression (DNS) Challenge dataset. A number of variants of FullSubNet have been proposed, but they all focus on the structure design towards better performance and are rarely concerned with computational efficiency. For many speech enhancement applications, a key feature is that the system runs on a real-time, latency-sensitive, battery-powered platform, which strictly limits the algorithm latency and computational complexity. In this work, we propose a new architecture named Fast FullSubNet dedicated to accelerating the computation of FullSubNet. Specifically, Fast FullSubNet processes sub-band speech spectra in the mel-frequency domain by using cascaded linear-to-mel full-band, sub-band, and mel-to-linear full-band models such that frequencies involved in the sub-band computation are vastly reduced. After that, a down-sampling operation is proposed for the sub-band input sequence to further reduce the computational complexity along the time axis. Experimental results show that, compared to FullSubNet, Fast FullSubNet has only 13\% computational complexity and 16\% processing time, and achieves comparable or even better performance. Code and audio samples are available at https://github.com/Audio-WestlakeU/FullSubNet.

Authors (2)
  1. Xiang Hao (40 papers)
  2. Xiaofei Li (71 papers)
Citations (9)

Summary

Accelerating Speech Enhancement with Fast FullSubNet

The paper, "Fast FullSubNet: Accelerate Full-band and Sub-band Fusion Model for Single-channel Speech Enhancement," presents an optimization of the previously proposed FullSubNet architecture that improves computational efficiency without sacrificing performance. It addresses a key challenge in single-channel speech enhancement: the need for low-latency, real-time processing on constrained hardware.

Overview of FullSubNet

FullSubNet is notable for its integration of full-band and sub-band models that allow it to capture both global spectral information and local spectral patterns. The full-band model processes the entire frequency spectrum, extracting long-distance dependencies, while the sub-band model targets specific frequency bands to leverage local characteristics. This dual approach delivers high-quality speech enhancement, as evidenced by its performance in the Deep Noise Suppression (DNS) Challenge dataset.
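As a concrete illustration of the sub-band side, each frequency bin's input can be assembled from that bin plus a few neighboring bins, giving the sub-band network local spectral context. The sketch below assumes a 257-bin spectrogram and two neighbors per side; both numbers are illustrative, not the paper's exact configuration.

```python
import numpy as np

def subband_inputs(spec, n_neighbors=2):
    """Build per-frequency sub-band inputs: each frequency bin together with
    n_neighbors adjacent bins on each side (edge bins are reflect-padded).
    spec: (F, T) magnitude spectrogram -> (F, 2*n_neighbors + 1, T)."""
    F, T = spec.shape
    padded = np.pad(spec, ((n_neighbors, n_neighbors), (0, 0)), mode="reflect")
    return np.stack([padded[f : f + 2 * n_neighbors + 1] for f in range(F)])

spec = np.abs(np.random.randn(257, 100))  # (freq, time), dummy magnitudes
subs = subband_inputs(spec)               # (257, 5, 100): local context per bin
```

Each of the 257 sub-band units then sees only its own small frequency neighborhood over time, which is what makes the sub-band stage's cost scale with the number of frequency bands.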

Introduction to Fast FullSubNet

The primary innovation of Fast FullSubNet is its significant reduction in computational complexity, making it suitable for latency-sensitive platforms. To achieve this, the authors propose processing speech spectra in the mel-frequency domain. By transforming the linear-frequency spectra to a more compact mel-frequency representation, the number of frequency bands is substantially reduced. This transformation aligns well with human auditory perception, ensuring minimal information loss.
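To make the compression concrete, here is a minimal sketch of a triangular mel filterbank (using the HTK mel formula) that maps a 257-bin linear spectrum down to 64 mel bands. The band count, FFT size, and sample rate are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    """Hz -> mel (HTK formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """mel -> Hz, inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Build an (n_mels, n_fft // 2 + 1) triangular mel filterbank matrix."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        # Rising and falling slopes of the i-th triangular filter.
        up = (fft_freqs - left) / (center - left)
        down = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

# Compress a 257-bin linear magnitude spectrogram to 64 mel bands.
fb = mel_filterbank(n_mels=64, n_fft=512, sr=16000)
linear_spec = np.abs(np.random.randn(257, 100))  # (freq, time), dummy data
mel_spec = fb @ linear_spec                       # (64, 100)
```

Since the sub-band model runs once per band, shrinking 257 linear bins to 64 mel bands cuts that stage's workload roughly in proportion.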

Architectural Modifications

Fast FullSubNet introduces several key architectural changes:

  1. Mel-frequency Processing: Speech spectra are initially transformed from linear to mel-frequency domains using a cascaded model approach. This reduces the frequency-related computations significantly.
  2. Sub-band Model Optimization: A novel down-sampling operation is incorporated to reduce the number of time frames involved in sub-band processing. This optimization is crucial for maintaining performance while reducing the computational burden.
  3. Mel-to-Linear Transformation: After processing in the mel domain, an additional step maps the output back to the linear domain. This step is handled by a full-band mel-to-linear model, similar in function to the neural vocoders used in text-to-speech (TTS) systems.
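The data flow through the three cascaded stages can be sketched with stand-in networks. Everything below is a placeholder for shape-level illustration only: `fullband_model` and `subband_model` are toy functions, not the paper's recurrent architectures, and the dimensions and down-sampling factor are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative): linear bins, mel bands, frames, down-sampling.
F_LIN, F_MEL, T, DS = 257, 64, 100, 2

def fullband_model(x):
    """Stand-in for a full-band network: per-frame mapping across all bands."""
    W = rng.standard_normal((x.shape[0], x.shape[0])) * 0.01
    return np.tanh(W @ x)

def subband_model(x):
    """Stand-in for the sub-band network applied to one band over time."""
    return np.tanh(x)

# 1. Linear-to-mel full-band stage: compress, then model globally.
linear_in = np.abs(rng.standard_normal((F_LIN, T)))
mel_fb = np.abs(rng.standard_normal((F_MEL, F_LIN)))  # placeholder filterbank
mel_spec = fullband_model(mel_fb @ linear_in)          # (F_MEL, T)

# 2. Sub-band stage: down-sample the time axis by DS, process each mel band.
mel_ds = mel_spec[:, ::DS]                             # (F_MEL, T // DS)
sub_out = np.stack([subband_model(band) for band in mel_ds])

# 3. Restore the frame rate and map mel output back to linear frequencies
#    (the mel-to-linear full-band model, vocoder-like in spirit).
sub_up = np.repeat(sub_out, DS, axis=1)[:, :T]         # (F_MEL, T)
linear_out = fullband_model(mel_fb.T @ sub_up)         # (F_LIN, T)
```

The sketch shows where the savings come from: the per-band sub-band stage runs over F_MEL bands and T // DS frames instead of F_LIN bands and T frames.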

Experimental Evaluation

Experimental comparisons demonstrate that Fast FullSubNet maintains or surpasses the performance of the original FullSubNet across several metrics, including PESQ, STOI, and SI-SDR. Remarkably, computational complexity drops to 13% and the real-time factor (RTF) to 16% of the original FullSubNet's, yielding substantially faster processing. These results are achieved without losing enhancement quality, thanks to the mel-frequency and down-sampling operations.

The paper further discusses the impact of varying the down-sampling factor, revealing an optimal balance that minimizes processing demands without degrading performance. Notably, using a factor of two achieves substantial speed improvements while maintaining comparable enhancement metrics.
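As a back-of-the-envelope illustration of why these two tricks compound, the sub-band stage's cost scales roughly with the number of bands times the number of frames. The bin counts below are assumed values, and the estimate deliberately ignores the cost of the added full-band stages, so it is only indicative.

```python
# Rough cost ratio of the sub-band stage after mel compression and
# time down-sampling (illustrative numbers, not the paper's exact sizes).
f_linear = 257   # linear-frequency bins
f_mel = 64       # mel bands
downsample = 2   # down-sampling factor along the time axis

ratio = (f_mel / f_linear) * (1 / downsample)
print(f"sub-band cost ratio ~ {ratio:.2f}")  # prints "sub-band cost ratio ~ 0.12"
```

That the crude estimate lands near the reported 13% overall complexity reflects how dominant the sub-band stage is in FullSubNet's total cost.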

Implications and Future Directions

The advancements made with Fast FullSubNet have significant implications for real-time speech enhancement applications, especially in devices with limited processing capabilities. By demonstrating that substantial complexity reductions are achievable without adverse effects on performance, this architecture opens pathways for broader adoption of advanced speech processing algorithms in commercial devices.

The suggested framework may be adapted beyond the scope of FullSubNet to other state-of-the-art models, reinforcing its potential impact across the field of speech enhancement.

As the field advances, further exploration into adaptive down-sampling mechanisms and domain transformations could yield even greater efficiencies. Collaborative comparisons with emerging architectures and refinement based on practical deployment scenarios will reinforce Fast FullSubNet's relevance and applicability.

This research contributes a practical solution to a prevalent issue in speech processing, marking a step forward in efficient, high-performance real-time speech enhancement technology.