Modulating State Space Model with SlowFast Framework for Compute-Efficient Ultra Low-Latency Speech Enhancement

Published 4 Nov 2024 in eess.AS, cs.LG, and cs.SD | (2411.02019v2)

Abstract: Deep learning-based speech enhancement (SE) methods often face significant computational challenges when needing to meet low-latency requirements because of the increased number of frames to be processed. This paper introduces the SlowFast framework which aims to reduce computation costs specifically when low-latency enhancement is needed. The framework consists of a slow branch that analyzes the acoustic environment at a low frame rate, and a fast branch that performs SE in the time domain at the needed higher frame rate to match the required latency. Specifically, the fast branch employs a state space model where its state transition process is dynamically modulated by the slow branch. Experiments on a SE task with a 2 ms algorithmic latency requirement using the Voice Bank + Demand dataset show that our approach reduces computation cost by 70% compared to a baseline single-branch network with equivalent parameters, without compromising enhancement performance. Furthermore, by leveraging the SlowFast framework, we implemented a network that achieves an algorithmic latency of just 62.5 µs (one sample point at 16 kHz sample rate) with a computation cost of 100 M MACs/s, while scoring a PESQ-NB of 3.12 and SISNR of 16.62.



Summary

The paper presents a novel dual-branch SlowFast framework that uses SSM modulation to reduce computational costs and meet ultra low-latency requirements.
It employs a slow branch for environmental analysis and a fast branch for rapid speech enhancement; at a 2 ms algorithmic latency this cuts computational cost by 70% relative to a single-branch baseline with equivalent parameters.
Experiments on the Voice Bank + DEMAND dataset also show that a single-sample-latency (62.5 µs) network reaches a PESQ-NB of 3.12 and an SISNR of 16.62 at 100 M MACs/s.


Introduction

The paper introduces the SlowFast framework for speech enhancement (SE), designed for ultra low-latency applications. Conventional deep learning-based SE methods become computationally expensive under tight latency constraints, because a shorter latency means more frames to process per second. By pairing two branches that run at different frame rates, the SlowFast framework reduces the computational cost of low-latency SE without compromising enhancement performance.

Proposed Method

The SlowFast framework leverages a dual-branch structure, comprising a slow branch for environment analysis at lower frame rates and a fast branch for speech enhancement at higher frame rates.

Figure 1: Illustration of the proposed framework for compute-efficient low-latency speech enhancement. (A) Processing when $\delta=3$. (B) Framing and OLA process.

SlowFast Framework

The framework integrates two branches with different framing and Overlap-and-Add (OLA) processes. The slow branch processes larger segments with larger hop sizes, capturing comprehensive acoustic characteristics while operating at a reduced computational load. Conversely, the fast branch processes shorter segments with smaller hop sizes, directly addressing low-latency requirements.
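The difference in frame rate between the two branches can be illustrated with a minimal NumPy sketch of framing and overlap-add. The frame and hop sizes below are hypothetical values chosen only for illustration (this summary does not state the paper's actual sizes), and the helper names `frame_signal`, `overlap_add`, and `periodic_hann` are likewise assumptions rather than anything from the paper.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding at the edges)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    starts = hop * np.arange(n_frames)[:, None]
    return x[starts + np.arange(frame_len)[None, :]]

def overlap_add(frames, hop):
    """Sum frames back into a signal at their original offsets."""
    n_frames, frame_len = frames.shape
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for k in range(n_frames):
        out[k * hop : k * hop + frame_len] += frames[k]
    return out

def periodic_hann(n):
    """Hann window whose 50%-overlapped copies sum to one."""
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n) / n)

sample_rate = 16_000
x = np.random.default_rng(0).standard_normal(sample_rate)  # 1 s of test noise

# Hypothetical sizes chosen for illustration only (not taken from the paper).
fast_len, fast_hop = 32, 16      # 2 ms frames, 1 ms hop
slow_len, slow_hop = 512, 256    # 32 ms frames, 16 ms hop

fast_frames = frame_signal(x, fast_len, fast_hop)
slow_frames = frame_signal(x, slow_len, slow_hop)
print(fast_frames.shape[0], "fast frames vs", slow_frames.shape[0], "slow frames")

# Windowed OLA on the fast path reproduces the input away from the edges.
y = overlap_add(fast_frames * periodic_hann(fast_len), fast_hop)
```

With these assumed sizes the slow path sees roughly sixteen times fewer frames per second than the fast path, which is the kind of gap that lets heavier per-frame processing sit on the slow branch.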

The framework's efficiency comes from a State Space Model (SSM) modulation strategy. The fast branch uses an SSM whose state transitions are dynamically modulated by the acoustic characteristics modeled in the slow branch:

$h^F_i = A^S_j \times h^F_{i-1} + g^S_j \cdot \mathcal{F}_{\rm IN}(x^F_i)$

$\hat{s}_i = \mathcal{F}_{\rm OUT}(h^F_i)$

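Here, the superscripts $F$ and $S$ mark fast-branch and slow-branch quantities, with $i$ and $j$ indexing their respective frames. $h^F_i$ is the fast-branch state and $x^F_i$ its input frame. $A^S_j$ and $g^S_j$ are derived from the slow branch's output, which lets the state update adapt to changes in the acoustic environment, and $\mathcal{F}_{\rm IN}$ and $\mathcal{F}_{\rm OUT}$ are the input and output mappings, respectively.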
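The recurrence can be sketched in a few lines of NumPy. This is an illustrative sketch under several assumptions that the summary does not confirm: the transition $A^S_j$ is treated as diagonal (so the product becomes elementwise), $\mathcal{F}_{\rm IN}$ and $\mathcal{F}_{\rm OUT}$ are plain linear maps (`W_in`, `W_out`), and each slow frame $j$ is shared by `delta` consecutive fast frames via `j = i // delta`. The function name `modulated_ssm` is hypothetical.

```python
import numpy as np

def modulated_ssm(x_fast, A_slow, g_slow, W_in, W_out, delta):
    """Run a fast-branch SSM whose parameters come from a slow branch.

    x_fast : (T, d_in)     fast-branch input frames
    A_slow : (J, d_state)  assumed-diagonal state transition per slow frame
    g_slow : (J, d_state)  input gain per slow frame
    W_in   : (d_in, d_state)   stand-in for F_IN (assumed linear)
    W_out  : (d_state, d_out)  stand-in for F_OUT (assumed linear)
    delta  : number of fast frames that share one slow frame (assumption)
    """
    n_fast, n_slow = x_fast.shape[0], A_slow.shape[0]
    h = np.zeros(A_slow.shape[1])
    outputs = np.empty((n_fast, W_out.shape[1]))
    for i in range(n_fast):
        j = min(i // delta, n_slow - 1)  # slow frame that modulates fast frame i
        h = A_slow[j] * h + g_slow[j] * (x_fast[i] @ W_in)
        outputs[i] = h @ W_out
    return outputs

# Scalar toy case: a unit impulse decays by the slow-branch factor each step.
impulse = np.zeros((6, 1))
impulse[0, 0] = 1.0
decay = modulated_ssm(
    impulse,
    A_slow=np.full((2, 1), 0.5),
    g_slow=np.ones((2, 1)),
    W_in=np.ones((1, 1)),
    W_out=np.ones((1, 1)),
    delta=3,
)
```

Per fast frame, the loop only performs the cheap state update and output mapping; whatever network produces `A_slow` and `g_slow` runs once per slow frame.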

Experimental Results

The evaluation on the Voice Bank + DEMAND dataset showcased the framework's capability to reduce computational costs significantly compared to conventional methods while maintaining competitive enhancement quality.

Figure 2: Two other methods investigated in this work for integrating the Slow and Fast branches.

Performance Metrics

Experiments showed that the SlowFast framework can reach single-sample algorithmic latency (62.5 µs at a 16 kHz sample rate) at a computational cost of 100 M MACs/s, while scoring:

PESQ-NB: Achieved a score of 3.12.
SISNR: Reached 16.62.

Under the 2 ms latency scenario, the framework reduced computational cost by 70% compared to a baseline single-branch network with equivalent parameters, without compromising enhancement performance.
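The reported figures can be put in per-sample terms with some simple arithmetic. The sketch below uses only the numbers quoted above (16 kHz sample rate, 100 M MACs/s, 70% reduction); the variable and function names are illustrative, and the baseline cost is normalized to 1.0 because this summary does not give its absolute value.

```python
SAMPLE_RATE_HZ = 16_000
MACS_PER_SECOND = 100e6  # reported cost of the single-sample-latency network

# One sample period at 16 kHz, in microseconds.
sample_period_us = 1e6 / SAMPLE_RATE_HZ

# Compute budget available for each output sample at that cost.
macs_per_sample = MACS_PER_SECOND / SAMPLE_RATE_HZ

def cost_after_reduction(baseline_cost, reduction):
    """Remaining cost after cutting `reduction` (a fraction) from the baseline."""
    return baseline_cost * (1.0 - reduction)

# Baseline normalized to 1.0; a 70% reduction leaves about 0.3 of it.
relative_cost = cost_after_reduction(1.0, 0.70)

print(f"sample period: {sample_period_us} us")
print(f"budget: {macs_per_sample:.0f} MACs per sample")
print(f"relative cost at 2 ms latency: {relative_cost:.2f} x baseline")
```

In other words, the single-sample-latency network spends on the order of a few thousand MACs per output sample, and the 2 ms configuration runs at roughly three tenths of its single-branch baseline's cost.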

Conclusion

The SlowFast framework presents a compute-efficient alternative for ultra low-latency SE systems. By combining a dual-branch architecture with SSM modulation, the proposed method addresses the redundancy inherent in low-latency SE tasks. Future work could apply the framework to other audio signal processing tasks, tighten algorithmic latency further, and explore real-time deployment scenarios such as Active Noise Control (ANC).
