Improvement of Noise-Robust Single-Channel Voice Activity Detection with Spatial Pre-processing

Published 12 Apr 2021 in eess.AS, cs.SD, and eess.SP | (2104.05481v1)

Abstract: Voice activity detection (VAD) remains a challenge in noisy environments. With access to multiple microphones, prior studies have attempted to improve the noise robustness of VAD by creating multi-channel VAD (MVAD) methods. However, MVAD is relatively new compared to single-channel VAD (SVAD), which has been thoroughly developed in the past. It might therefore be advantageous to improve SVAD methods with pre-processing to obtain superior VAD, which is under-explored. This paper improves SVAD through two pre-processing methods, a beamformer and a spatial target speaker detector. The spatial detector sets signal frames to zero when no potential speaker is present within a target direction. The detector may be implemented as a filter, meaning the input signal for the SVAD is filtered according to the detector's output; or it may be implemented as a spatial VAD to be combined with the SVAD output. The evaluation is made on a noisy reverberant speech database, with clean speech from the Aurora 2 database and with white and babble noise. The results show that SVAD algorithms are significantly improved by the presented pre-processing methods, especially the spatial detector, across all signal-to-noise ratios. The SVAD algorithms with pre-processing significantly outperform a baseline MVAD in challenging noise conditions.


Summary

The paper demonstrates that spatial pre-processing significantly improves SVAD performance under adverse noise conditions.
It introduces beamforming and spatial target speaker detection techniques to filter out irrelevant directional signals.
Experimental results show that SVAD with these enhancements outperforms a baseline MVAD, especially at -5 dB and 0 dB SNR.


Introduction

The paper "Improvement of Noise-Robust Single-Channel Voice Activity Detection with Spatial Pre-processing" (2104.05481) addresses the persistent difficulty of voice activity detection (VAD) in noisy environments. Although multi-channel VAD (MVAD) methods use spatial cues such as ITD, ILD, and IPD to improve noise robustness, single-channel VAD (SVAD) methods remain prevalent because they are more thoroughly developed and less complex. The study aims to improve SVAD performance with two spatial pre-processing techniques, a beamformer and a spatial target speaker detector.

Methodology

The authors introduce two spatial pre-processing techniques: a beamformer and a spatial target speaker detector. The detector sets signal frames to zero when no potential speaker is present within a predefined target direction. It estimates the ITD from a dual-microphone input and checks it against a field of view (FOV) that defines the potential target speaker directions. The detector can be used in two ways: as a filter on the input to the SVAD algorithm (denoted F-SVAD), or as a spatial VAD whose output is combined with the SVAD decision through a logical AND (denoted A-SVAD).
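The two ways of using the detector can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the ITD is estimated per frame by plain time-domain cross-correlation, and the names `frame_itd`, `spatial_flags`, `f_svad_input`, `a_svad_decision`, and the parameters `frame_len`, `max_lag`, and `fov_lags` are hypothetical.

```python
import numpy as np

def frame_itd(left, right, max_lag):
    """Estimate the inter-channel delay of one frame in samples.

    Assumption: plain cross-correlation; a positive value means the
    right channel lags the left channel.
    """
    lags = np.arange(-max_lag, max_lag + 1)
    n = len(left)
    # For each candidate lag, correlate left[m] with right[m + lag].
    scores = [np.dot(left[max(0, -lag):n - max(0, lag)],
                     right[max(0, lag):n - max(0, -lag)])
              for lag in lags]
    return lags[int(np.argmax(scores))]

def spatial_flags(x_left, x_right, frame_len, max_lag, fov_lags):
    """Flag each frame whose estimated delay falls inside the field of view."""
    lo, hi = fov_lags  # hypothetical FOV, expressed as a delay range in samples
    n_frames = len(x_left) // frame_len
    flags = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        s = slice(i * frame_len, (i + 1) * frame_len)
        # Silent frames carry no direction information, so they are never flagged.
        has_energy = bool(np.any(x_left[s]) and np.any(x_right[s]))
        itd = frame_itd(x_left[s], x_right[s], max_lag)
        flags[i] = has_energy and lo <= itd <= hi
    return flags

def f_svad_input(x, flags, frame_len):
    """F-SVAD style: zero every unflagged frame before the SVAD sees the signal."""
    y = np.array(x[:len(flags) * frame_len], dtype=float)
    y[~np.repeat(flags, frame_len)] = 0.0
    return y

def a_svad_decision(svad_flags, flags):
    """A-SVAD style: a frame counts as speech only if both detectors agree."""
    return np.logical_and(svad_flags, flags)
```

In the filter form the SVAD never sees frames from outside the FOV, whereas in the AND form the SVAD runs on the unmodified signal and the spatial decision only vetoes its output.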

The beamformer complements the SVAD by spatially filtering the input to enhance signals from the target direction. A delay-and-sum (DS) beamformer is used, although the discussion suggests that alternative beamformers could perform better. Combining the beamformer with the filter and AND forms of the spatial detector yields the FB-SVAD and AB-SVAD configurations, respectively.
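For readers unfamiliar with delay-and-sum beamforming, a minimal sketch follows. It assumes integer, non-negative sample delays that are already known for the target direction; the function name `delay_and_sum` and its arguments are illustrative and not taken from the paper.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each microphone channel to the target direction and average.

    channels: array-like of shape (n_mics, n_samples)
    delays:   per-microphone arrival delay of the target in samples
              (assumed to be non-negative integers)
    """
    channels = np.asarray(channels, dtype=float)
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, d in zip(channels, delays):
        # Advance the channel by its delay so the target component lines up;
        # the samples shifted in at the end are left as zeros.
        aligned = np.zeros(n_samples)
        aligned[:n_samples - d] = ch[d:]
        out += aligned
    return out / n_mics
```

After alignment the target adds coherently across microphones, while uncorrelated noise adds incoherently, so with two microphones the noise power in the output is roughly halved.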

Evaluation

The evaluation uses simulated acoustic environments with reverberation and interfering noise. Clean speech comes from the Aurora 2 database, reverberation is simulated using ISM, and the noise types are white and babble noise. The authors test three SVAD algorithms: rVAD by Tan et al., ITU-T G.729B, and the statistical model-based VAD by Sohn et al. Each SVAD algorithm is compared with its pre-processed counterparts and with an FS-NDPSD baseline MVAD algorithm. SNRs of -5 dB, 0 dB, 10 dB, and 20 dB are used to examine the effect of spatial pre-processing across noise conditions.
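As a rough illustration of how a test signal at a given SNR can be produced, the sketch below scales a noise signal relative to the speech power before adding it. It assumes SNR is defined over the whole signal as a ratio of mean powers; the paper may measure it differently, and `mix_at_snr` is a hypothetical helper.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mixture has the requested SNR in dB.

    Assumption: SNR = 10 * log10(mean speech power / mean noise power),
    computed over the full length of both signals.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to speech_power / 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Calling this once for each of -5, 0, 10, and 20 dB reproduces the set of SNR conditions listed above under this definition.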

Results

Spatial pre-processing significantly improved SVAD performance. At low SNRs (-5 dB and 0 dB), SVAD with pre-processing typically outperformed the FS-NDPSD baseline MVAD. The spatial detector alone often yielded substantial improvements over unprocessed SVAD, and combining beamforming with spatial detection provided the most effective noise mitigation, consistently achieving higher AUC values.
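The AUC comparison can be made concrete with a small sketch. It assumes AUC here means the area under the ROC curve computed from frame-level speech labels and detector scores, and it uses the pairwise-ranking formulation; `roc_auc` is an illustrative helper, not code from the paper.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve for frame-level VAD scores.

    Computed as the probability that a randomly chosen speech frame
    scores higher than a randomly chosen non-speech frame, with ties
    counted as one half.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]    # scores of frames labelled as speech
    neg = scores[~labels]   # scores of frames labelled as non-speech
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Three of the four speech/non-speech pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of speech and non-speech frames.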

Discussion

The study's findings affirm the value of spatial pre-processing methods in enhancing SVAD algorithms. Particularly in challenging noise environments where MVAD setups might traditionally be favored, these spatial enhancements facilitate robust, single-channel solutions. Future work could explore further optimization of pre-processing parameters and the employment of advanced beamformer algorithms to achieve even greater improvements.

Conclusion

The research highlights the potential of spatial pre-processing to make SVAD algorithms more robust in adverse acoustic conditions. By exploiting spatial cues, the proposed enhancements yield SVAD solutions that can rival or even surpass a baseline MVAD in certain environments, broadening the applicability of SVAD, with its lower complexity and resource requirements. The methods and conclusions point to further work on refining VAD systems, in both theory and practical audio processing.
