Papers
Topics
Authors
Recent
Search
2000 character limit reached

Tiny Noise-Robust Voice Activity Detector for Voice Assistants

Published 29 Jul 2025 in eess.AS and cs.AI | (2507.22157v1)

Abstract: Voice Activity Detection (VAD) in the presence of background noise remains a challenging problem in speech processing. Accurate VAD is essential in automatic speech recognition, voice-to-text, conversational agents, etc, where noise can severely degrade the performance. A modern application includes the voice assistant, specially mounted on Artificial Intelligence of Things (AIoT) devices such as cell phones, smart glasses, earbuds, etc, where the voice signal includes background noise. Therefore, VAD modules must remain light-weight due to their practical on-device limitation. The existing models often struggle with low signal-to-noise ratios across diverse acoustic environments. A simple VAD often detects human voice in a clean environment, but struggles to detect the human voice in noisy conditions. We propose a noise-robust VAD that comprises a light-weight VAD, with data pre-processing and post-processing added modules to handle the background noise. This approach significantly enhances the VAD accuracy in noisy environments and requires neither a larger model, nor fine-tuning. Experimental results demonstrate that our approach achieves a notable improvement compared to baselines, particularly in environments with high background noise interference. This modified VAD additionally improving clean speech detection.

Summary

  • The paper presents a novel VAD pipeline that integrates noise-removal pre-processing and majority voting post-processing to improve detection in challenging environments.
  • The methodology employs spectral subtraction, energy gating, and RMS normalization to clean the audio input without increasing model size or needing additional fine-tuning.
  • Experimental evaluations demonstrate an 84.7% detection accuracy on the AVA dataset, indicating significant performance gains over baseline models.

Tiny Noise-Robust Voice Activity Detector for Voice Assistants

Introduction

The paper "Tiny Noise-Robust Voice Activity Detector for Voice Assistants" (2507.22157) addresses the persistent challenge of accurately detecting voice activity in the presence of background noise. This task, known as Voice Activity Detection (VAD), is pivotal for enhancing the performance of speech processing systems such as automatic speech recognition, especially in devices with computational constraints like AIoT devices. The authors propose a novel approach incorporating both pre-processing and post-processing modules to augment a lightweight VAD model, thereby significantly enhancing its performance in noisy environments without enlarging the model size or requiring further fine-tuning.

Background and Approach

Historical methods for VAD have evolved from classical signal processing techniques that leverage simple features, to modern deep learning approaches that excel in noisy conditions. However, these latter models are typically too large for deployment on edge devices, necessitating an interest in lightweight VAD systems. The method proposed in the paper combines a pre-trained lightweight VAD model, SGVAD, with additional pre-processing and post-processing modules to enhance its robustness against background noise.

Proposed Pipeline

The authors introduce a robust VAD pipeline comprising multiple stages for processing audio signals. This includes segmentation of the input, noise-removal pre-processing techniques like spectral subtraction and energy gating, and a majority-voting-based post-processing method. Figure 1

Figure 1: Block diagram of the proposed VAD pipeline, including segmentation, noise-removal pre-processing, inference with SGVAD, sliding-window grouping, and post-processing using majority voting.

Pre-Processing

The pre-processing module involves three techniques aimed at augmenting signal clarity before analysis by the VAD:

  • Spectral Subtraction: This technique estimates and subtracts the power spectrum of background noise to yield a clearer signal representation, implemented to counteract stationary noises.
  • Energy Gating: By examining short-time energy across frames, this method filters out segments falling below a certain energy threshold, typically identifying non-speech regions.
  • RMS Normalization: This process scales audio signals to a fixed RMS level, accounting for dynamic range discrepancies and improving the effectiveness of subsequent energy-based decision mechanisms.

Post-Processing

The post-processing module employs a majority voting technique to enhance decision reliability. Grouping predictions over a sliding window allows the method to accommodate scenarios with intermittent speech, which is common in real-life applications involving voice assistants.

Experimental Evaluation

The paper presents a rigorous assessment of the proposed VAD system across multiple datasets, including AVA Speech, VBD, ESC50, and MS-SNSD. Through these evaluations, the proposed model exhibits superior performance, particularly in detecting noisy speech compared to existing SGVAD and other baseline models.

  • Detection Accuracy: The proposed method demonstrated significant gains in speech detection accuracy, particularly for noisy speech (84.7% in the AVA dataset compared to 45.4% for the baseline).
  • ROC Analysis: The ROC curves illustrate a lower false positive rate at high true positive rates for the proposed model, indicating its robustness in differentiating speech from noise effectively. Figure 2

    Figure 2: ROC curves for SGVAD (baseline), VAD 1 (with post-processing), and VAD 2 (with post-processing and pre-processing), demonstrating superior performance of VAD 2 in noise-saturated environments.

Conclusion

The paper effectively showcases a novel VAD model design tailored to overcome the challenges of detecting human voice within noisy settings, without compromising computational efficiency. By leveraging state-of-the-art pre- and post-processing strategies, this model achieves marked improvements in detecting not only clean but also noisy speech, paving the way for efficient deployment in resource-constrained, real-world environments. Such advancements enhance practical voice assistant technology and broaden the application scope for intelligent audio analysis systems.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 6 likes about this paper.