- The paper introduces VoiceFilter-Lite, a novel low-latency source separation model designed for real-time, on-device speech recognition in challenging acoustic environments.
- Key innovations include an asymmetric loss function to prevent over-suppression and an adaptive suppression strength mechanism, which together improve Word Error Rate (WER) in noisy and overlapping-speech conditions.
- This model's efficiency and low latency enable robust, real-time speech recognition deployment on resource-constrained mobile and embedded devices.
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
The paper "VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition" introduces VoiceFilter-Lite, a novel single-channel source separation model tailored for on-device speech recognition systems. The development of such a model addresses the acute need for improving automatic speech recognition (ASR) performance in scenarios with overlapping speech or variable noise conditions, specifically tailored for deployment on resource-constrained devices.
Core Contributions and Innovations
VoiceFilter-Lite builds on previous voice filtering systems, yet addresses several challenges unique to real-time, on-device processing. The key contributions of the model are as follows:
- Asymmetric Loss Function: The asymmetric loss penalizes over-suppression errors more heavily than under-suppression errors, counteracting the tendency of separation models to remove parts of the target speech and thereby mitigating ASR degradation, particularly under non-speech noise conditions (a sketch of this loss follows the list).
- Adaptive Suppression Strength: To perform well across varying acoustic environments, VoiceFilter-Lite adjusts its suppression strength in real time based on a prediction of the noise type, suppressing aggressively when the interference is overlapping speech and backing off otherwise, so that clean or non-speech-noise inputs are not distorted unnecessarily (see the second sketch after the list).
- Model Architecture and Quantization: VoiceFilter-Lite uses a lean architecture built predominantly from uni-directional LSTM layers, without bi-directional recurrence or temporal convolutions over future frames, to keep latency low in streaming use. The model weights are further quantized to an 8-bit integer format, substantially reducing memory and compute requirements and making the model viable for on-device deployment (see the quantization snippet after the list).
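To make the asymmetric loss concrete, the sketch below shows one way to implement it over (time, frequency) spectral features in NumPy. The penalty factor `alpha` and the plain magnitude features used here are illustrative assumptions; the core idea is that errors where the enhanced signal falls below the clean target (over-suppression) are scaled up before squaring.

```python
import numpy as np

def asymmetric_l2_loss(clean, enhanced, alpha=2.0):
    """Asymmetric L2 loss over (time, frequency) spectral features.

    Differences where the enhanced signal falls below the clean target
    (over-suppression) are scaled by `alpha` > 1 before squaring, so they
    are penalized more heavily than under-suppression errors. The value
    of `alpha` and the feature representation are assumptions.
    """
    diff = clean - enhanced                            # > 0 where over-suppressed
    penalized = np.where(diff > 0, alpha * diff, diff)
    return np.mean(penalized ** 2)

# Toy usage: a uniformly over-suppressed output incurs a larger loss
# than an equally sized over-estimate.
clean = np.random.rand(100, 128).astype(np.float32)
print(asymmetric_l2_loss(clean, clean * 0.8))   # over-suppressed
print(asymmetric_l2_loss(clean, clean * 1.2))   # over-estimated
```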
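The adaptive suppression strength can be sketched in the same spirit: the final features are a frame-wise blend of the enhanced and original inputs, with a blending weight smoothed over time. The per-frame probability input, along with the constants `beta`, `a`, and `b`, are hypothetical placeholders rather than the paper's tuned values.

```python
import numpy as np

def adaptive_merge(noisy_feats, enhanced_feats, speech_noise_prob,
                   beta=0.9, a=1.0, b=0.0):
    """Blend enhanced and original features with an adaptive weight.

    `speech_noise_prob[t]` is assumed to be a per-frame probability, from
    an auxiliary noise-type predictor, that the interference is overlapping
    speech; `beta`, `a`, and `b` are illustrative smoothing and scaling
    constants.
    """
    w = 0.0
    out = np.empty_like(noisy_feats)
    for t in range(noisy_feats.shape[0]):
        # Exponential moving average keeps the suppression strength smooth.
        w = beta * w + (1.0 - beta) * (a * speech_noise_prob[t] + b)
        # Suppress strongly only when overlapping speech is likely.
        out[t] = w * enhanced_feats[t] + (1.0 - w) * noisy_feats[t]
    return out
```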
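Finally, the kind of 8-bit post-training quantization described can be approximated with TensorFlow Lite's standard tooling. The snippet below is a generic example of converting a saved model with 8-bit weight (dynamic-range) quantization, not the authors' exact pipeline, and the model path is hypothetical.

```python
import tensorflow as tf

# Convert a trained model (hypothetical SavedModel path) to a TFLite
# flatbuffer with 8-bit weight quantization (dynamic-range quantization).
converter = tf.lite.TFLiteConverter.from_saved_model("voicefilter_lite_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("voicefilter_lite_int8.tflite", "wb") as f:
    f.write(tflite_model)
```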
Empirical Results
The paper demonstrates the effectiveness of VoiceFilter-Lite empirically, using Word Error Rate (WER) as the primary metric across several setups. Notably, the model maintains or improves ASR performance across noise conditions, with substantial WER reductions under speech (overlapping-talker) noise and no degradation in clean or non-speech noise environments.
The results from evaluations on both the LibriSpeech dataset and a realistic set of speech queries indicate that the two input feature formats, stacked filterbanks and FFT magnitudes, perform comparably; stacked filterbanks are preferred because they carry additional temporal context and are cheaper to compute.
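As an illustration of the stacked-filterbank input format, the sketch below computes log-mel filterbank features and concatenates each frame with a few preceding frames. The FFT size, hop length, mel dimension, and amount of context are assumptions for illustration, not the paper's exact frontend configuration.

```python
import numpy as np
import librosa

def stacked_filterbanks(wav, sr=16000, n_mels=128, left_context=3):
    """Log-mel filterbank features with a few preceding frames stacked on.

    The FFT size, hop length, mel dimension, and amount of context are
    illustrative choices.
    """
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=512,
                                         hop_length=160, n_mels=n_mels)
    logmel = np.log(mel + 1e-6).T                      # (frames, n_mels)
    padded = np.pad(logmel, ((left_context, 0), (0, 0)), mode="edge")
    # Concatenate each frame with its `left_context` predecessors.
    stacked = np.concatenate(
        [padded[i:i + logmel.shape[0]] for i in range(left_context + 1)],
        axis=1)                                        # (frames, n_mels * (left_context + 1))
    return stacked
```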
Implications and Future Directions
The implications of this research are far-reaching for the deployment of ASR systems on mobile and other resource-limited devices. By ensuring real-time, low-latency voice separation with minimal resource usage, VoiceFilter-Lite paves the way for more ubiquitous, robust speech recognition applications that can handle challenging acoustic scenarios.
Future research directions could focus on further reducing the model size while maintaining performance, enhancing the adaptive mechanisms to dynamically learn and adjust to evolving noise conditions without retraining, and applying these techniques to other domains requiring precise signal separation.
In summary, the contributions of this paper represent a significant advance in real-time speech processing under strict on-device resource constraints. By balancing performance, efficiency, and adaptability, VoiceFilter-Lite shows promise for broad adoption in modern ASR systems.